VIVID: Visually Impaired Vision-Integrated Device

YOLOv8
DeepSORT
MiDaS
PyTorch
IMU Sensors
PyQt5

A real-time assistive technology solution combining computer vision and sensor fusion - with all its challenges and learnings

Presented By: VIVEKTEJA SAPAVATH

Why This Project?

Initial Goal

I wanted to create something impactful that could help many people, so I chose to build a solution that helps visually impaired users navigate their environment safely.

Since almost everyone uses a smartphone today, I aimed to create an application that works solely on a smartphone with reasonable accuracy.

Wanted to combine:

  • Real-time object detection
  • Distance and speed estimation
  • Smart audio guidance
  • Mobile-friendly operation

Research Phase

Explored various computer vision approaches:

  • Compared YOLO vs other CNN models (speed/accuracy tradeoffs)
  • Studied depth estimation techniques (stereo vs monocular)
  • Investigated tracking algorithms
  • Researched sensor fusion possibilities

Key Realizations

Initial Assumptions

  • Thought depth estimation would be straightforward
  • Expected IMU data would provide accurate position tracking
  • Assumed phone sensor access would be simple

Reality Check

  • Monocular depth is only relative without calibration (not metrically accurate)
  • IMU drift makes position tracking unreliable
  • Android restrictions on sensor access (I tried to build a Kotlin-based app, but due to Android restrictions it didn't work out)

System Architecture

📱 Phone Sensors → (TCP/UDP) → 🖥️ IMU Processor → 🧩 Data Fusion
🎥 Camera Feed → 🔍 YOLOv8 → 🆔 DeepSORT → 🧩 Data Fusion
🎥 Camera Feed → 📏 MiDaS Depth → 🧩 Data Fusion
🧩 Data Fusion → 🗣️ Guidance

Parallel Processing

The system runs multiple components simultaneously:

  • Camera processing (30 FPS)
  • IMU data collection (100Hz)
  • Object detection and tracking
  • Depth estimation

Implementation Challenges

  • Thread synchronization between components (see the sketch after this list)
  • Latency in phone-to-PC data transfer
  • Variable processing times for different modules
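To make the thread-synchronization challenge above concrete, here is a minimal producer/consumer sketch of the pattern between a camera thread and the processing loop. The queue size and frame-dropping policy are illustrative assumptions, not the exact values used in VIVID.

import queue
import threading

import cv2

frame_queue = queue.Queue(maxsize=2)   # tiny buffer: drop frames instead of building up latency

def camera_thread(stop_event):
    # Producer: grab frames at camera rate and hand them to the processing loop
    cap = cv2.VideoCapture(0)
    while not stop_event.is_set():
        ok, frame = cap.read()
        if not ok:
            break
        try:
            frame_queue.put_nowait(frame)
        except queue.Full:
            # Queue is full: discard the oldest frame so the consumer always sees recent data
            try:
                frame_queue.get_nowait()
            except queue.Empty:
                pass
            frame_queue.put_nowait(frame)
    cap.release()

def processing_loop(stop_event):
    # Consumer: detection, depth and fusion run here, decoupled from capture
    while not stop_event.is_set():
        try:
            frame = frame_queue.get(timeout=0.5)
        except queue.Empty:
            continue
        # ... run YOLOv8, DeepSORT and MiDaS on `frame` ...

stop = threading.Event()
threading.Thread(target=camera_thread, args=(stop,), daemon=True).start()
processing_loop(stop)

Keeping the queue tiny and dropping stale frames trades completeness for latency, which matters more for live guidance.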

Tech Stack Evolution

Changed approaches multiple times:

  1. Started with Streamlit (not built for real-time use)
  2. Tried a direct Android app (blocked by Play Store and phone system restrictions)
  3. Settled on PyQt5 for the final interface
  4. Tried to scale the relative depths to real-world units based on user movement (but it was inaccurate)

Data Flow

Information moves through three parallel pipelines that merge at the fusion module (a main-loop sketch follows this list):

  1. Visual pipeline (YOLO → DeepSORT)
  2. Depth pipeline (MiDaS estimation)
  3. Sensor pipeline (IMU processing)
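A minimal sketch of how the three pipelines might meet in the main loop, using the classes sketched on the later slides (YOLOv8Detector, DeepSortTracker, DepthEstimator, DataFusion, NetworkServer); the wiring and method order are simplified assumptions, not the exact production loop.

import cv2

detector = YOLOv8Detector()
tracker = DeepSortTracker()
depth_estimator = DepthEstimator()
fusion = DataFusion()
imu_server = NetworkServer()     # sensor pipeline endpoint (pseudocode later in this deck)
imu_server.start_servers()

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break

    detections = detector.detect(frame)           # visual pipeline, stage 1
    tracks = tracker.update(detections, frame)    # visual pipeline, stage 2
    depth_map = depth_estimator.estimate(frame)   # depth pipeline
    imu_samples = imu_server.get_imu_data()       # sensor pipeline

    # The fusion step expects a class name on each track; map it from the detector's table
    for t in tracks:
        t['class_name'] = detector.classes.get(t['class_id'], 'object')

    risk_items = fusion.update(tracks, depth_map, imu_samples)
    # ... hand risk_items to the guidance module ...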

Object Detection with YOLOv8 (nano model)

import torch
from ultralytics import YOLO


class YOLOv8Detector:
    def __init__(self, model_path="yolov8n.pt"):
        # Load the YOLOv8 nano model and move it to GPU if available
        self.model = YOLO(model_path)
        self.model.to('cuda' if torch.cuda.is_available() else 'cpu')
        self.classes = self.model.names  # class-id -> class-name mapping
        
    def detect(self, frame):
        # Run inference at 640 px with a 0.5 confidence threshold
        results = self.model(frame, imgsz=640, conf=0.5)
        
        # Flatten the results into plain dictionaries
        detections = []
        for result in results:
            boxes = result.boxes.xyxy.cpu().numpy()
            scores = result.boxes.conf.cpu().numpy()
            class_ids = result.boxes.cls.cpu().numpy().astype(int)
            
            for box, score, cls_id in zip(boxes, scores, class_ids):
                detections.append({
                    'bbox': box,            # (x1, y1, x2, y2)
                    'score': score,
                    'class_id': cls_id,
                    'class_name': self.classes[cls_id]
                })
                
        return detections
    

Why YOLO?

After researching alternatives (R-CNN, SSD, etc.):

  • Speed: 30+ FPS vs 5-10 FPS for R-CNN
  • Accuracy: Good enough for this use case
  • Single Stage: Simpler implementation

Challenges Faced

  • Initial confusion with bounding box formats (xywh vs xyxy; conversion sketched after this list)
  • GPU memory issues with larger models
  • Class imbalance in COCO dataset
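The xywh vs xyxy confusion mentioned above comes down to two tiny conversions; a quick reference sketch:

import numpy as np

def xywh_to_xyxy(box):
    """(center_x, center_y, width, height) -> (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def xyxy_to_tlwh(box):
    """(x1, y1, x2, y2) -> (top_left_x, top_left_y, width, height), as DeepSORT expects."""
    x1, y1, x2, y2 = box
    return np.array([x1, y1, x2 - x1, y2 - y1])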

Key Learnings

  • Importance of confidence threshold tuning
  • How non-max suppression works internally (sketched after this list)
  • Model quantization for mobile deployment
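To make the non-max suppression learning concrete, a minimal IoU-based NMS sketch (YOLOv8 already performs NMS internally; this is only for intuition):

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) in xyxy format; returns indices of the kept boxes."""
    order = scores.argsort()[::-1]            # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter + 1e-9)
        # Keep only boxes that overlap the chosen one less than the threshold
        order = order[1:][iou < iou_threshold]
    return keep

Lowering the IoU threshold removes more overlapping boxes; raising the confidence threshold trades recall for fewer false alerts.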

Object Tracking with DeepSORT

                      
import numpy as np

# Assumes the reference deep_sort package layout for Detection/Tracker/metric.
# FeatureExtractor is the project's appearance-embedding network (512-d vector per crop).
from deep_sort.detection import Detection
from deep_sort.tracker import Tracker
from deep_sort.nn_matching import NearestNeighborDistanceMetric


class DeepSortTracker:
    def __init__(self):
        # Appearance feature extractor (CNN embedding per object crop)
        self.encoder = FeatureExtractor()
        
        # Tracking parameters
        self.max_age = 50  # frames a track survives without a matching detection
        self.n_init = 3    # consecutive hits needed before a track is confirmed
        self.metric = NearestNeighborDistanceMetric("cosine", 0.2)
        
        self.tracker = Tracker(
            self.metric,
            max_age=self.max_age,
            n_init=self.n_init)
    
    def update(self, detections, frame):
        # Extract an appearance feature for each detection crop
        features = []
        for det in detections:
            x1, y1, x2, y2 = map(int, det['bbox'])
            crop = frame[y1:y2, x1:x2]
            features.append(self.encoder(crop) if crop.size > 0 else np.zeros(512))
        
        # Wrap detections in DeepSORT Detection objects (they expect tlwh boxes)
        ds_detections = []
        for det, feature in zip(detections, features):
            x1, y1, x2, y2 = det['bbox']
            ds_detections.append(Detection([x1, y1, x2 - x1, y2 - y1], det['score'], feature))
        
        # Kalman predict step, then association / measurement update
        self.tracker.predict()
        self.tracker.update(ds_detections)
        
        # Return the confirmed, recently updated tracks
        return [
            {
                'track_id': track.track_id,
                'bbox': track.to_tlbr(),
                'class_id': getattr(track, 'class_id', -1)
            }
            for track in self.tracker.tracks
            if track.is_confirmed() and track.time_since_update <= 1
        ]
     

Why DeepSORT?

Needed persistent object IDs across frames:

  • Basic SORT had no appearance model
  • FairMOT was too complex for our needs
  • DeepSORT offered good balance between accuracy and computation

Implementation Issues

  • ID switches when objects crossed paths
  • Tuning max_age and n_init parameters

Key Learnings

  • Kalman filter fundamentals (see the 1-D sketch after this list)
  • Importance of motion and appearance cues
  • Tradeoff between track longevity and false positives
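A minimal 1-D constant-velocity Kalman filter sketch illustrating the predict/update cycle DeepSORT runs per track (DeepSORT's real filter tracks an 8-D bounding-box state); the noise values are illustrative assumptions:

import numpy as np

class Kalman1D:
    """State: [position, velocity]. DeepSORT applies the same idea per bounding box."""
    def __init__(self, dt=1.0):
        self.x = np.zeros(2)                      # state estimate
        self.P = np.eye(2)                        # state covariance
        self.F = np.array([[1, dt], [0, 1]])      # constant-velocity motion model
        self.H = np.array([[1.0, 0.0]])           # we only measure position
        self.Q = np.eye(2) * 0.01                 # process noise (assumed)
        self.R = np.array([[0.1]])                # measurement noise (assumed)

    def predict(self):
        # Propagate the state and its uncertainty one step forward
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x

    def update(self, z):
        # Blend the prediction with the new measurement z
        y = z - self.H @ self.x                   # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.x = self.x + (K @ y).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P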

Depth Estimation with MiDaS

                       
import cv2
import numpy as np
import torch


class DepthEstimator:
    def __init__(self, model_type="DPT_Large"):
        # Load the MiDaS model from torch hub and move it to the available device
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = torch.hub.load("intel-isl/MiDaS", model_type)
        self.model.to(self.device)
        self.model.eval()
        
        # Matching preprocessing transform for DPT models
        self.transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform
    
    def estimate(self, frame):
        # Convert BGR -> RGB and preprocess (the MiDaS transform already adds the batch dimension)
        img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        input_batch = self.transform(img).to(self.device)
        
        # Predict relative inverse depth and resize back to the frame resolution
        with torch.no_grad():
            prediction = self.model(input_batch)
            prediction = torch.nn.functional.interpolate(
                prediction.unsqueeze(1),
                size=img.shape[:2],
                mode="bicubic",
                align_corners=False).squeeze()
        
        # Convert to numpy and normalize to 0-255 for downstream thresholding
        depth_map = prediction.cpu().numpy()
        return cv2.normalize(depth_map, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U)

    def get_object_depth(self, depth_map, bbox):
        x1, y1, x2, y2 = map(int, bbox)
        obj_region = depth_map[y1:y2, x1:x2]
        
        if obj_region.size == 0:
            return "unknown"
            
        # Median is more robust to outliers than the mean over the box
        median_depth = np.median(obj_region)
        
        # Map relative depth (for MiDaS, larger values are nearer) to coarse categories
        if median_depth > 200: return "very close"
        elif median_depth > 150: return "close"
        elif median_depth > 100: return "medium"
        else: return "far"
    

Depth Estimation Challenges

Major hurdles faced:

  • Monocular depth is inherently relative
  • Attempted scaling with known object sizes (unreliable)
  • Different lighting conditions affected results
  • Performance issues with larger models
  • Problematic when calculating velocity, since relative depth is on a different scale than the image-plane dx, dy

Workarounds Implemented

  • Switched to smaller MiDaS model
  • Used relative depth categories instead of absolute values
  • Added temporal smoothing to reduce rapid fluctuations (see the sketch after this list)
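The temporal smoothing mentioned above can be as simple as an exponential moving average over successive depth maps; a minimal sketch, with an assumed smoothing factor:

import numpy as np

class DepthSmoother:
    """Exponentially weighted average of depth maps to damp frame-to-frame flicker."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha        # weight of the newest frame (assumed value)
        self.smoothed = None

    def update(self, depth_map):
        depth_map = depth_map.astype(np.float32)
        if self.smoothed is None:
            self.smoothed = depth_map
        else:
            # New estimate = alpha * current frame + (1 - alpha) * history
            self.smoothed = self.alpha * depth_map + (1 - self.alpha) * self.smoothed
        return self.smoothed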

Key Learnings

  • Limitations of monocular depth estimation
  • Importance of proper normalization
  • How transformer-based models work for vision tasks

Network Server

 
    class NetworkServer:
    initialize TCP and UDP ports
    initialize server state and IMU data queue
    initialize sockets and client list

    function start_servers():
        set running = True
        start TCP server in separate thread
        start UDP server in separate thread
        print server start info

    function _tcp_server():
        create and bind TCP socket
        listen for incoming connections
        while running:
            accept new client
            start thread to handle client
        on shutdown, close TCP socket

    function _handle_tcp_client(client_socket, address):
        while running:
            receive data from client
            if no data, break
            try:
                parse data as JSON
                process sensor data
                send "OK"
            except parse error:
                send "ERROR"
        close client socket

    function _udp_server():
        create and bind UDP socket
        while running:
            receive data from client
            try:
                parse data as JSON
                process sensor data
            except parse error:
                continue
        on shutdown, close UDP socket

    function _process_sensor_data(data):
        extract timestamp, accel, gyro from JSON
        create IMUData object
        add IMUData to queue

    function get_imu_data():
        collect all IMUData from queue and return as list

    function stop_servers():
        set running = False
        close TCP and UDP sockets if open

            

Unexpected Difficulties

  • TCP data chunks often split JSON mid-message, causing decoding errors
  • UDP packets occasionally dropped, leading to inconsistent IMU streams
  • Multiple threads created scalability issues with many TCP clients
  • Debugging network latency issues without proper instrumentation was challenging

Solutions Attempted

  • Used UDP for high-frequency data and TCP for control messages
  • Buffered incoming TCP data and applied manual packet framing (sketched after this list)
  • Implemented timestamp checks to detect and discard duplicate or stale packets
  • Switched from raw sockets to higher-level abstractions for easier debugging
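One common way to do the manual packet framing mentioned above is a 4-byte length prefix, so a JSON message is never split mid-stream; a minimal sketch (the exact framing used in VIVID may differ):

import json
import socket
import struct

def send_message(sock: socket.socket, payload: dict) -> None:
    # Prefix each JSON message with its 4-byte big-endian length
    data = json.dumps(payload).encode("utf-8")
    sock.sendall(struct.pack(">I", len(data)) + data)

def recv_message(sock: socket.socket) -> dict:
    # Read exactly one length-prefixed JSON message, regardless of TCP chunking
    header = _recv_exact(sock, 4)
    (length,) = struct.unpack(">I", header)
    return json.loads(_recv_exact(sock, length).decode("utf-8"))

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf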

Key Learnings

  • TCP is not ideal for real-time sensor streams without message framing
  • UDP is faster but unreliableβ€”must handle packet loss manually
  • Thread-per-client model can bottleneck; async I/O scales better
  • Robust error handling and logging are essential for network applications

IMU Sensor Integration

                      
class IMUProcessor:
    initialize motion state (position, velocity, orientation, etc.)
    initialize calibration manager and filters
    initialize bias and buffers

    function set_calibration_mode(enable):
        if enable:
            start collecting calibration data
        else:
            stop calibration

    function add_imu_data(imu_data):
        if in calibration mode:
            collect calibration data
            return

        if calibration complete:
            apply calibration to imu_data

        extract accel and gyro from imu_data
        if still collecting bias:
            update accel/gyro bias
            return

        subtract bias from accel and gyro
        store corrected accel and gyro in buffers

        if not enough data in buffer:
            return

        apply low-pass filter to accel
        update orientation using gyro
        rotate accel to world frame and remove gravity

        update motion state:
            acceleration → velocity → position
            apply damping to velocity

        update confidence based on variance in recent accel data

    function _apply_filter(buffer):
        apply low-pass filter to buffer and return latest value

    function _update_orientation(gyro):
        convert gyro to quaternion rotation
        apply to current orientation
        normalize orientation

    function _quaternion_multiply(q1, q2):
        return product of two quaternions

    function _rotate_vector(vector, quaternion):
        rotate vector using quaternion

    function _update_confidence():
        calculate variance in recent accel
        map variance to confidence score

            

Major Challenges

Unexpected difficulties:

  • Play Store banned direct sensor access apps
  • Thermal throttling warnings on Android when I tried to build my own Kotlin app
  • 120ms latency in phone-to-PC data transfer
  • Drift made position tracking useless after 10 seconds

Solutions Attempted

  • Tried UDP instead of TCP for lower latency
  • Added calibration routine
  • Implemented a complementary filter for orientation (sketched after this list)
  • Used PyQt5 instead of Streamlit for better real-time performance
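A minimal complementary filter sketch for roll and pitch, blending gyro integration (smooth but drifting) with the accelerometer's gravity direction (noisy but drift-free); the blend factor is an assumed value:

import math

class ComplementaryFilter:
    def __init__(self, alpha=0.98):
        self.alpha = alpha      # trust in the integrated gyro (assumed value)
        self.roll = 0.0
        self.pitch = 0.0

    def update(self, accel, gyro, dt):
        """accel in m/s^2, gyro in rad/s, dt in seconds."""
        ax, ay, az = accel
        gx, gy, _ = gyro

        # Tilt from the gravity direction (valid only when external acceleration is small)
        accel_roll = math.atan2(ay, az)
        accel_pitch = math.atan2(-ax, math.sqrt(ay * ay + az * az))

        # Blend the integrated gyro rate with the accelerometer estimate
        self.roll = self.alpha * (self.roll + gx * dt) + (1 - self.alpha) * accel_roll
        self.pitch = self.alpha * (self.pitch + gy * dt) + (1 - self.alpha) * accel_pitch
        return self.roll, self.pitch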

Key Learnings

  • Sensor data characteristics and noise patterns
  • Double integration pitfalls (demonstrated after this list)
  • Android security restrictions
  • Importance of sensor fusion
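To show why double integration is such a pitfall, a tiny simulation of how a small uncorrected accelerometer bias turns into metres of position error within seconds:

import numpy as np

dt = 0.01                      # 100 Hz IMU rate
t = np.arange(0, 10, dt)       # 10 seconds
bias = 0.05                    # tiny residual bias of 0.05 m/s^2 left after calibration

accel = np.zeros_like(t) + bias           # the phone is actually stationary
velocity = np.cumsum(accel) * dt          # first integration
position = np.cumsum(velocity) * dt       # second integration

print(f"Position error after 10 s: {position[-1]:.2f} m")   # ~2.5 m from bias alone

Roughly 0.5 · bias · t² of error accumulates, which is why the position estimate became useless after about 10 seconds.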

Data Fusion & Guidance

                     
import time

import numpy as np


class DataFusion:
    def __init__(self):
        self.tracks = {}
        self.max_age = 2.0  # seconds a track may go unseen before it is dropped
    
    def update(self, visual_tracks, depth_map, imu_data):
        current_time = time.time()
        risk_items = []
        
        # Update tracks with new data
        for track in visual_tracks:
            track_id = track['track_id']
            
            if track_id in self.tracks:
                # Update existing track
                self.tracks[track_id].update({
                    'bbox': track['bbox'],
                    'class': track['class_name'],
                    'depth': self._get_depth(track['bbox'], depth_map),
                    'last_seen': current_time
                })
            else:
                # New track
                self.tracks[track_id] = {
                    'bbox': track['bbox'],
                    'class': track['class_name'],
                    'depth': self._get_depth(track['bbox'], depth_map),
                    'first_seen': current_time,
                    'last_seen': current_time
                }
        
        # Remove stale tracks
        self.tracks = {
            k: v for k, v in self.tracks.items()
            if current_time - v['last_seen'] < self.max_age
        }
        
        # Assess risk for each track
        for track_id, track in self.tracks.items():
            risk = self._assess_risk(track, imu_data)
            risk_items.append({
                'track_id': track_id,
                'class': track['class'],
                'bbox': track['bbox'],
                'depth': track['depth'],
                'risk': risk
            })
        
        return risk_items

    def _get_depth(self, bbox, depth_map):
        # Same per-box median lookup as DepthEstimator.get_object_depth
        x1, y1, x2, y2 = map(int, bbox)
        region = depth_map[y1:y2, x1:x2]
        if region.size == 0:
            return "unknown"
        median_depth = np.median(region)
        if median_depth > 200: return "very close"
        elif median_depth > 150: return "close"
        elif median_depth > 100: return "medium"
        else: return "far"

    def _assess_risk(self, track, imu_data):
        # Simple risk assessment: nearer object x more critical class = higher risk.
        # imu_data is reserved for motion-aware risk (user speed/heading); unused in this version.
        depth_priority = {
            "very close": 3,
            "close": 2,
            "medium": 1,
            "far": 0
        }
        
        class_priority = {
            "person": 3, "car": 3,
            "bicycle": 2, "dog": 2,
            "chair": 1, "bench": 1
        }
        
        risk_score = (depth_priority.get(track['depth'], 0) *
                     class_priority.get(track['class'], 0))
        
        if risk_score >= 6: return "HIGH"
        elif risk_score >= 3: return "MEDIUM"
        else: return "LOW"
             

Fusion Challenges

Difficulties in combining data:

  • Different coordinate systems
  • Variable update rates
  • Noisy and sometimes conflicting data
  • Temporal alignment issues

Guidance System

Smart alert features (sketched after this list):

  • Priority-based queuing
  • Cooldown period between alerts
  • Relative position indication (left/center/right)
  • Simple natural language generation
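A minimal sketch of the priority-plus-cooldown logic behind these alert features; the cooldown duration, zone split, and phrasing are illustrative assumptions:

import time

class GuidanceEngine:
    PRIORITY = {"HIGH": 2, "MEDIUM": 1, "LOW": 0}

    def __init__(self, cooldown=3.0):
        self.cooldown = cooldown      # seconds between alerts for the same object (assumed)
        self.last_alert = {}          # track_id -> timestamp of the last alert

    def build_alerts(self, risk_items, frame_width):
        now = time.time()
        # Announce the highest-risk objects first
        for item in sorted(risk_items, key=lambda r: self.PRIORITY[r["risk"]], reverse=True):
            if now - self.last_alert.get(item["track_id"], 0) < self.cooldown:
                continue              # still in cooldown, avoid spamming the user
            x1, _, x2, _ = item["bbox"]
            center = (x1 + x2) / 2
            position = ("on your left" if center < frame_width / 3
                        else "on your right" if center > 2 * frame_width / 3
                        else "ahead")
            self.last_alert[item["track_id"]] = now
            yield f"{item['class']} {item['depth']}, {position}"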

Key Learnings

  • Importance of data synchronization
  • How to design multi-modal systems
  • User experience considerations
  • Tradeoffs in alert frequency vs annoyance

Project Evolution

Research Phase

Explored computer vision models, depth estimation techniques, and sensor options. Settled on YOLOv8 + MiDaS + IMU approach.

Initial Implementation

Built basic YOLO detection pipeline. Attempted to integrate depth estimation. First attempts at Android sensor access.

Major Roadblocks

Discovered Play Store restrictions on sensor access. Depth scaling not working as expected. IMU drift issues became apparent.

Pivot Points

Switched from Streamlit to PyQt5. Changed from absolute to relative depth. Added DeepSORT for better tracking.

Final Integration

Combined all components. Added calibration routines. Implemented basic guidance system.

Major Turning Points

Technical Pivots

  • Streamlit → PyQt5 for real-time
  • Absolute → Relative depth
  • Direct Android app → PC processing

Conceptual Shifts

  • Perfect system → "Good enough" prototype
  • Focus on core functionality first
  • Accept limitations with clear explanations

Challenges & Key Learnings

Technical Challenges

  • Network Latency: 120ms delay in phone-to-laptop streaming
  • Depth Scaling: MiDaS provides only relative depth
  • IMU Drift: Position estimates diverge quickly
  • Thread Synchronization: Shared resource contention
  • Real-time Performance: Balancing speed vs accuracy
  • Android Restrictions: Play Store blocked sensor access

Key Learnings

  • Practical implementation of YOLO and DeepSORT
  • Kalman filtering fundamentals
  • Thread synchronization techniques
  • Sensor data characteristics and processing
  • Real-world limitations of theoretical concepts
  • Importance of iterative development

Current System Status

  • ✓ Real-time detection (30 FPS)
  • ✓ Multi-object tracking
  • ⚠️ Absolute depth (relative categories only)
  • ⚠️ IMU position accuracy (limited by drift)
  • ✓ Basic guidance system

Future Improvements

Technical Limitations

  • Absolute Depth: Need stereo camera or LiDAR
  • Position Tracking: Combine with visual odometry
  • Latency: Edge deployment on phone
  • Power Consumption: Optimize model size

Planned Enhancements

  • Implement visual-inertial odometry
  • Add SLAM for better mapping
  • Quantize models for mobile deployment
  • Improve guidance with more natural language
  • Add user customization options

Lessons for Next Time

What Went Wrong

  • Underestimated sensor integration complexity
  • Assumed depth estimation would be straightforward
  • Didn't account for Android restrictions
  • Overly ambitious scope for timeline

What I'd Do Differently

  • Start with simpler prototype
  • Research platform limitations earlier
  • Build more validation steps
  • Plan for more iteration time

Conclusion & Takeaways

💡

Real-World Engineering is Messy

Theoretical concepts often don't translate directly to practice. Real systems have limitations and edge cases that aren't apparent until implementation.

🧠

Research Before Implementation

Spending more time upfront understanding platform limitations and model capabilities would have saved weeks of rework.

🔄

Iterative Development is Key

Building a minimal viable product first, then enhancing it, leads to better outcomes than trying to implement everything at once.

🎯

Scope Appropriately

As a first-year student, this project was extremely ambitious. While I learned a lot, a narrower focus might have yielded more polished results.

Final Thoughts

Despite the challenges and limitations, this project provided invaluable hands-on experience with real-world computer vision and sensor systems. The lessons learned will inform all my future projects.

🚀