VIVID: Visually Impaired Vision-Integrated Device

YOLOv8
DeepSORT
MiDaS
PyTorch
IMU Sensors
PyQt5

A real-time assistive technology solution combining computer vision and sensor fusion - with all its challenges and learnings

Presented By: VIVEKTEJA SAPAVATH

Why This Project?

Initial Goal

I wanted to create something impactful that could help many people, so I chose to build a solution that helps visually impaired users navigate their environment safely.

Since almost everyone uses a smartphone today, I aimed to create an application that works solely on a smartphone with reasonable accuracy.

Wanted to combine:

  • Real-time object detection
  • Distance and speed estimation
  • Smart audio guidance
  • Mobile-friendly operation

Research Phase

Explored various computer vision approaches:

  • Compared YOLO vs other CNN models (speed/accuracy tradeoffs)
  • Studied depth estimation techniques (stereo vs monocular)
  • Investigated tracking algorithms
  • Researched sensor fusion possibilities

Key Realizations

Initial Assumptions

  • Thought depth estimation would be straightforward
  • Expected IMU data would provide accurate position tracking
  • Assumed phone sensor access would be simple

Reality Check

  • Monocular depth is only relative without calibration (not metrically accurate)
  • IMU drift makes position tracking unreliable
  • Android restrictions on sensor access (I tried to build a Kotlin-based app, but due to Android restrictions it didn't work out)

System Architecture

📱 Phone Sensors → (TCP/UDP) → 🖥️ IMU Processor → 🧩 Data Fusion
🎥 Camera Feed → 🔍 YOLOv8 → 🆔 DeepSORT → 🧩 Data Fusion
🎥 Camera Feed → 📏 MiDaS Depth → 🧩 Data Fusion
🧩 Data Fusion → 🗣️ Guidance

Parallel Processing

The system runs multiple components simultaneously:

  • Camera processing (30 FPS)
  • IMU data collection (100Hz)
  • Object detection and tracking
  • Depth estimation

Implementation Challenges

  • Thread synchronization between components (see the sketch after this list)
  • Latency in phone-to-PC data transfer
  • Variable processing times for different modules
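To make the thread-synchronization challenge above concrete, here is a minimal producer/consumer sketch of the pattern between a camera thread and the processing loop. The queue size and frame-dropping policy are illustrative assumptions, not the exact values used in VIVID.

import queue
import threading

import cv2

frame_queue = queue.Queue(maxsize=2)   # tiny buffer: drop frames instead of building up latency

def camera_thread(stop_event):
    # Producer: grab frames at camera rate and hand them to the processing loop
    cap = cv2.VideoCapture(0)
    while not stop_event.is_set():
        ok, frame = cap.read()
        if not ok:
            break
        try:
            frame_queue.put_nowait(frame)
        except queue.Full:
            # Queue is full: discard the oldest frame so the consumer always sees recent data
            try:
                frame_queue.get_nowait()
            except queue.Empty:
                pass
            frame_queue.put_nowait(frame)
    cap.release()

def processing_loop(stop_event):
    # Consumer: detection, depth and fusion run here, decoupled from capture
    while not stop_event.is_set():
        try:
            frame = frame_queue.get(timeout=0.5)
        except queue.Empty:
            continue
        # ... run YOLOv8, DeepSORT and MiDaS on `frame` ...

stop = threading.Event()
threading.Thread(target=camera_thread, args=(stop,), daemon=True).start()
processing_loop(stop)

Keeping the queue tiny and dropping stale frames trades completeness for latency, which matters more for live guidance.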

Tech Stack Evolution

Changed approaches multiple times:

  1. Started with Streamlit (not built for real-time use)
  2. Tried a direct Android app (blocked by Play Store and phone system restrictions)
  3. Settled on PyQt5 for the final interface
  4. Tried to scale the relative depths to real-world units based on user movement (but it was inaccurate)

Data Flow

Information moves through three parallel pipelines that merge at the fusion module (a main-loop sketch follows this list):

  1. Visual pipeline (YOLO → DeepSORT)
  2. Depth pipeline (MiDaS estimation)
  3. Sensor pipeline (IMU processing)
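A minimal sketch of how the three pipelines might meet in the main loop, using the classes sketched on the later slides (YOLOv8Detector, DeepSortTracker, DepthEstimator, DataFusion, NetworkServer); the wiring and method order are simplified assumptions, not the exact production loop.

import cv2

detector = YOLOv8Detector()
tracker = DeepSortTracker()
depth_estimator = DepthEstimator()
fusion = DataFusion()
imu_server = NetworkServer()     # sensor pipeline endpoint (pseudocode later in this deck)
imu_server.start_servers()

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break

    detections = detector.detect(frame)           # visual pipeline, stage 1
    tracks = tracker.update(detections, frame)    # visual pipeline, stage 2
    depth_map = depth_estimator.estimate(frame)   # depth pipeline
    imu_samples = imu_server.get_imu_data()       # sensor pipeline

    # The fusion step expects a class name on each track; map it from the detector's table
    for t in tracks:
        t['class_name'] = detector.classes.get(t['class_id'], 'object')

    risk_items = fusion.update(tracks, depth_map, imu_samples)
    # ... hand risk_items to the guidance module ...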

Object Detection with YOLOv8 (nano model)

import torch
from ultralytics import YOLO


class YOLOv8Detector:
    def __init__(self, model_path="yolov8n.pt"):
        # Load the YOLOv8 nano model and move it to GPU if available
        self.model = YOLO(model_path)
        self.model.to('cuda' if torch.cuda.is_available() else 'cpu')
        self.classes = self.model.names  # class-id -> class-name mapping
        
    def detect(self, frame):
        # Run inference at 640 px with a 0.5 confidence threshold
        results = self.model(frame, imgsz=640, conf=0.5)
        
        # Flatten the results into plain dictionaries
        detections = []
        for result in results:
            boxes = result.boxes.xyxy.cpu().numpy()
            scores = result.boxes.conf.cpu().numpy()
            class_ids = result.boxes.cls.cpu().numpy().astype(int)
            
            for box, score, cls_id in zip(boxes, scores, class_ids):
                detections.append({
                    'bbox': box,            # (x1, y1, x2, y2)
                    'score': score,
                    'class_id': cls_id,
                    'class_name': self.classes[cls_id]
                })
                
        return detections
    

Why YOLO?

After researching alternatives (R-CNN, SSD, etc.):

  • Speed: 30+ FPS vs 5-10 FPS for R-CNN
  • Accuracy: Good enough for this use case
  • Single Stage: Simpler implementation

Challenges Faced

  • Initial confusion with bounding box formats (xywh vs xyxy; conversion sketched after this list)
  • GPU memory issues with larger models
  • Class imbalance in COCO dataset
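The xywh vs xyxy confusion mentioned above comes down to two tiny conversions; a quick reference sketch:

import numpy as np

def xywh_to_xyxy(box):
    """(center_x, center_y, width, height) -> (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def xyxy_to_tlwh(box):
    """(x1, y1, x2, y2) -> (top_left_x, top_left_y, width, height), as DeepSORT expects."""
    x1, y1, x2, y2 = box
    return np.array([x1, y1, x2 - x1, y2 - y1])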

Key Learnings

  • Importance of confidence threshold tuning
  • How non-max suppression works internally (sketched after this list)
  • Model quantization for mobile deployment
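To make the non-max suppression learning concrete, a minimal IoU-based NMS sketch (YOLOv8 already performs NMS internally; this is only for intuition):

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) in xyxy format; returns indices of the kept boxes."""
    order = scores.argsort()[::-1]            # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter + 1e-9)
        # Keep only boxes that overlap the chosen one less than the threshold
        order = order[1:][iou < iou_threshold]
    return keep

Lowering the IoU threshold removes more overlapping boxes; raising the confidence threshold trades recall for fewer false alerts.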

Object Tracking with DeepSORT

                      
import numpy as np

# Assumes the reference deep_sort package layout for Detection/Tracker/metric.
# FeatureExtractor is the project's appearance-embedding network (512-d vector per crop).
from deep_sort.detection import Detection
from deep_sort.tracker import Tracker
from deep_sort.nn_matching import NearestNeighborDistanceMetric


class DeepSortTracker:
    def __init__(self):
        # Appearance feature extractor (CNN embedding per object crop)
        self.encoder = FeatureExtractor()
        
        # Tracking parameters
        self.max_age = 50  # frames a track survives without a matching detection
        self.n_init = 3    # consecutive hits needed before a track is confirmed
        self.metric = NearestNeighborDistanceMetric("cosine", 0.2)
        
        self.tracker = Tracker(
            self.metric,
            max_age=self.max_age,
            n_init=self.n_init)
    
    def update(self, detections, frame):
        # Extract an appearance feature for each detection crop
        features = []
        for det in detections:
            x1, y1, x2, y2 = map(int, det['bbox'])
            crop = frame[y1:y2, x1:x2]
            features.append(self.encoder(crop) if crop.size > 0 else np.zeros(512))
        
        # Wrap detections in DeepSORT Detection objects (they expect tlwh boxes)
        ds_detections = []
        for det, feature in zip(detections, features):
            x1, y1, x2, y2 = det['bbox']
            ds_detections.append(Detection([x1, y1, x2 - x1, y2 - y1], det['score'], feature))
        
        # Kalman predict step, then association / measurement update
        self.tracker.predict()
        self.tracker.update(ds_detections)
        
        # Return the confirmed, recently updated tracks
        return [
            {
                'track_id': track.track_id,
                'bbox': track.to_tlbr(),
                'class_id': getattr(track, 'class_id', -1)
            }
            for track in self.tracker.tracks
            if track.is_confirmed() and track.time_since_update <= 1
        ]
     

Why DeepSORT?

Needed persistent object IDs across frames:

  • Basic SORT had no appearance model
  • FairMOT was too complex for our needs
  • DeepSORT offered good balance between accuracy and computation

Implementation Issues

  • ID switches when objects crossed paths
  • Tuning max_age and n_init parameters

Key Learnings

  • Kalman filter fundamentals (see the 1-D sketch after this list)
  • Importance of motion and appearance cues
  • Tradeoff between track longevity and false positives
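A minimal 1-D constant-velocity Kalman filter sketch illustrating the predict/update cycle DeepSORT runs per track (DeepSORT's real filter tracks an 8-D bounding-box state); the noise values are illustrative assumptions:

import numpy as np

class Kalman1D:
    """State: [position, velocity]. DeepSORT applies the same idea per bounding box."""
    def __init__(self, dt=1.0):
        self.x = np.zeros(2)                      # state estimate
        self.P = np.eye(2)                        # state covariance
        self.F = np.array([[1, dt], [0, 1]])      # constant-velocity motion model
        self.H = np.array([[1.0, 0.0]])           # we only measure position
        self.Q = np.eye(2) * 0.01                 # process noise (assumed)
        self.R = np.array([[0.1]])                # measurement noise (assumed)

    def predict(self):
        # Propagate the state and its uncertainty one step forward
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x

    def update(self, z):
        # Blend the prediction with the new measurement z
        y = z - self.H @ self.x                   # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.x = self.x + (K @ y).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P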

Depth Estimation with MiDaS

                       
import cv2
import numpy as np
import torch


class DepthEstimator:
    def __init__(self, model_type="DPT_Large"):
        # Load the MiDaS model from torch hub and move it to the available device
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = torch.hub.load("intel-isl/MiDaS", model_type)
        self.model.to(self.device)
        self.model.eval()
        
        # Matching preprocessing transform for DPT models
        self.transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform
    
    def estimate(self, frame):
        # Convert BGR -> RGB and preprocess (the MiDaS transform already adds the batch dimension)
        img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        input_batch = self.transform(img).to(self.device)
        
        # Predict relative inverse depth and resize back to the frame resolution
        with torch.no_grad():
            prediction = self.model(input_batch)
            prediction = torch.nn.functional.interpolate(
                prediction.unsqueeze(1),
                size=img.shape[:2],
                mode="bicubic",
                align_corners=False).squeeze()
        
        # Convert to numpy and normalize to 0-255 for downstream thresholding
        depth_map = prediction.cpu().numpy()
        return cv2.normalize(depth_map, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U)

    def get_object_depth(self, depth_map, bbox):
        x1, y1, x2, y2 = map(int, bbox)
        obj_region = depth_map[y1:y2, x1:x2]
        
        if obj_region.size == 0:
            return "unknown"
            
        # Median is more robust to outliers than the mean over the box
        median_depth = np.median(obj_region)
        
        # Map relative depth (for MiDaS, larger values are nearer) to coarse categories
        if median_depth > 200: return "very close"
        elif median_depth > 150: return "close"
        elif median_depth > 100: return "medium"
        else: return "far"
    

Depth Estimation Challenges

Major hurdles faced:

  • Monocular depth is inherently relative
  • Attempted scaling with known object sizes (unreliable)
  • Different lighting conditions affected results
  • Performance issues with larger models
  • Problematic when calculating velocity, since relative depth is on a different scale than the image-plane dx, dy

Workarounds Implemented

  • Switched to smaller MiDaS model
  • Used relative depth categories instead of absolute values
  • Added temporal smoothing to reduce rapid fluctuations (see the sketch after this list)
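The temporal smoothing mentioned above can be as simple as an exponential moving average over successive depth maps; a minimal sketch, with an assumed smoothing factor:

import numpy as np

class DepthSmoother:
    """Exponentially weighted average of depth maps to damp frame-to-frame flicker."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha        # weight of the newest frame (assumed value)
        self.smoothed = None

    def update(self, depth_map):
        depth_map = depth_map.astype(np.float32)
        if self.smoothed is None:
            self.smoothed = depth_map
        else:
            # New estimate = alpha * current frame + (1 - alpha) * history
            self.smoothed = self.alpha * depth_map + (1 - self.alpha) * self.smoothed
        return self.smoothed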

Key Learnings

  • Limitations of monocular depth estimation
  • Importance of proper normalization
  • How transformer-based models work for vision tasks

Network Server

 
    class NetworkServer:
    initialize TCP and UDP ports
    initialize server state and IMU data queue
    initialize sockets and client list

    function start_servers():
        set running = True
        start TCP server in separate thread
        start UDP server in separate thread
        print server start info

    function _tcp_server():
        create and bind TCP socket
        listen for incoming connections
        while running:
            accept new client
            start thread to handle client
        on shutdown, close TCP socket

    function _handle_tcp_client(client_socket, address):
        while running:
            receive data from client
            if no data, break
            try:
                parse data as JSON
                process sensor data
                send "OK"
            except parse error:
                send "ERROR"
        close client socket

    function _udp_server():
        create and bind UDP socket
        while running:
            receive data from client
            try:
                parse data as JSON
                process sensor data
            except parse error:
                continue
        on shutdown, close UDP socket

    function _process_sensor_data(data):
        extract timestamp, accel, gyro from JSON
        create IMUData object
        add IMUData to queue

    function get_imu_data():
        collect all IMUData from queue and return as list

    function stop_servers():
        set running = False
        close TCP and UDP sockets if open

            

Unexpected Difficulties

  • TCP data chunks often split JSON mid-message, causing decoding errors
  • UDP packets occasionally dropped, leading to inconsistent IMU streams
  • Multiple threads created scalability issues with many TCP clients
  • Debugging network latency issues without proper instrumentation was challenging

Solutions Attempted

  • Used UDP for high-frequency data and TCP for control messages
  • Buffered incoming TCP data and applied manual packet framing (sketched after this list)
  • Implemented timestamp checks to detect and discard duplicate or stale packets
  • Switched from raw sockets to higher-level abstractions for easier debugging
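One common way to do the manual packet framing mentioned above is a 4-byte length prefix, so a JSON message is never split mid-stream; a minimal sketch (the exact framing used in VIVID may differ):

import json
import socket
import struct

def send_message(sock: socket.socket, payload: dict) -> None:
    # Prefix each JSON message with its 4-byte big-endian length
    data = json.dumps(payload).encode("utf-8")
    sock.sendall(struct.pack(">I", len(data)) + data)

def recv_message(sock: socket.socket) -> dict:
    # Read exactly one length-prefixed JSON message, regardless of TCP chunking
    header = _recv_exact(sock, 4)
    (length,) = struct.unpack(">I", header)
    return json.loads(_recv_exact(sock, length).decode("utf-8"))

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf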

Key Learnings

  • TCP is not ideal for real-time sensor streams without message framing
  • UDP is faster but unreliableβ€”must handle packet loss manually
  • Thread-per-client model can bottleneck; async I/O scales better
  • Robust error handling and logging are essential for network applications

IMU Sensor Integration

                      
class IMUProcessor:
    initialize motion state (position, velocity, orientation, etc.)
    initialize calibration manager and filters
    initialize bias and buffers

    function set_calibration_mode(enable):
        if enable:
            start collecting calibration data
        else:
            stop calibration

    function add_imu_data(imu_data):
        if in calibration mode:
            collect calibration data
            return

        if calibration complete:
            apply calibration to imu_data

        extract accel and gyro from imu_data
        if still collecting bias:
            update accel/gyro bias
            return

        subtract bias from accel and gyro
        store corrected accel and gyro in buffers

        if not enough data in buffer:
            return

        apply low-pass filter to accel
        update orientation using gyro
        rotate accel to world frame and remove gravity

        update motion state:
            acceleration → velocity → position
            apply damping to velocity

        update confidence based on variance in recent accel data

    function _apply_filter(buffer):
        apply low-pass filter to buffer and return latest value

    function _update_orientation(gyro):
        convert gyro to quaternion rotation
        apply to current orientation
        normalize orientation

    function _quaternion_multiply(q1, q2):
        return product of two quaternions

    function _rotate_vector(vector, quaternion):
        rotate vector using quaternion

    function _update_confidence():
        calculate variance in recent accel
        map variance to confidence score

            

Major Challenges

Unexpected difficulties:

  • Play Store banned direct sensor access apps
  • Thermal throttling warnings on Android when I tried to build my own Kotlin app
  • 120ms latency in phone-to-PC data transfer
  • Drift made position tracking useless after 10 seconds

Solutions Attempted

  • Tried UDP instead of TCP for lower latency
  • Added calibration routine
  • Implemented a complementary filter for orientation (sketched after this list)
  • Used PyQt5 instead of Streamlit for better real-time performance
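A minimal complementary filter sketch for roll and pitch, blending gyro integration (smooth but drifting) with the accelerometer's gravity direction (noisy but drift-free); the blend factor is an assumed value:

import math

class ComplementaryFilter:
    def __init__(self, alpha=0.98):
        self.alpha = alpha      # trust in the integrated gyro (assumed value)
        self.roll = 0.0
        self.pitch = 0.0

    def update(self, accel, gyro, dt):
        """accel in m/s^2, gyro in rad/s, dt in seconds."""
        ax, ay, az = accel
        gx, gy, _ = gyro

        # Tilt from the gravity direction (valid only when external acceleration is small)
        accel_roll = math.atan2(ay, az)
        accel_pitch = math.atan2(-ax, math.sqrt(ay * ay + az * az))

        # Blend the integrated gyro rate with the accelerometer estimate
        self.roll = self.alpha * (self.roll + gx * dt) + (1 - self.alpha) * accel_roll
        self.pitch = self.alpha * (self.pitch + gy * dt) + (1 - self.alpha) * accel_pitch
        return self.roll, self.pitch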

Key Learnings

  • Sensor data characteristics and noise patterns
  • Double integration pitfalls (demonstrated after this list)
  • Android security restrictions
  • Importance of sensor fusion
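To show why double integration is such a pitfall, a tiny simulation of how a small uncorrected accelerometer bias turns into metres of position error within seconds:

import numpy as np

dt = 0.01                      # 100 Hz IMU rate
t = np.arange(0, 10, dt)       # 10 seconds
bias = 0.05                    # tiny residual bias of 0.05 m/s^2 left after calibration

accel = np.zeros_like(t) + bias           # the phone is actually stationary
velocity = np.cumsum(accel) * dt          # first integration
position = np.cumsum(velocity) * dt       # second integration

print(f"Position error after 10 s: {position[-1]:.2f} m")   # ~2.5 m from bias alone

Roughly 0.5 · bias · t² of error accumulates, which is why the position estimate became useless after about 10 seconds.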

Data Fusion & Guidance

                     
import time

import numpy as np


class DataFusion:
    def __init__(self):
        self.tracks = {}
        self.max_age = 2.0  # seconds a track may go unseen before it is dropped
    
    def update(self, visual_tracks, depth_map, imu_data):
        current_time = time.time()
        risk_items = []
        
        # Update tracks with new data
        for track in visual_tracks:
            track_id = track['track_id']
            
            if track_id in self.tracks:
                # Update existing track
                self.tracks[track_id].update({
                    'bbox': track['bbox'],
                    'class': track['class_name'],
                    'depth': self._get_depth(track['bbox'], depth_map),
                    'last_seen': current_time
                })
            else:
                # New track
                self.tracks[track_id] = {
                    'bbox': track['bbox'],
                    'class': track['class_name'],
                    'depth': self._get_depth(track['bbox'], depth_map),
                    'first_seen': current_time,
                    'last_seen': current_time
                }
        
        # Remove stale tracks
        self.tracks = {
            k: v for k, v in self.tracks.items()
            if current_time - v['last_seen'] < self.max_age
        }
        
        # Assess risk for each track
        for track_id, track in self.tracks.items():
            risk = self._assess_risk(track, imu_data)
            risk_items.append({
                'track_id': track_id,
                'class': track['class'],
                'bbox': track['bbox'],
                'depth': track['depth'],
                'risk': risk
            })
        
        return risk_items

    def _get_depth(self, bbox, depth_map):
        # Same per-box median lookup as DepthEstimator.get_object_depth
        x1, y1, x2, y2 = map(int, bbox)
        region = depth_map[y1:y2, x1:x2]
        if region.size == 0:
            return "unknown"
        median_depth = np.median(region)
        if median_depth > 200: return "very close"
        elif median_depth > 150: return "close"
        elif median_depth > 100: return "medium"
        else: return "far"

    def _assess_risk(self, track, imu_data):
        # Simple risk assessment: nearer object x more critical class = higher risk.
        # imu_data is reserved for motion-aware risk (user speed/heading); unused in this version.
        depth_priority = {
            "very close": 3,
            "close": 2,
            "medium": 1,
            "far": 0
        }
        
        class_priority = {
            "person": 3, "car": 3,
            "bicycle": 2, "dog": 2,
            "chair": 1, "bench": 1
        }
        
        risk_score = (depth_priority.get(track['depth'], 0) *
                     class_priority.get(track['class'], 0))
        
        if risk_score >= 6: return "HIGH"
        elif risk_score >= 3: return "MEDIUM"
        else: return "LOW"
             

Fusion Challenges

Difficulties in combining data:

  • Different coordinate systems
  • Variable update rates
  • Noisy and sometimes conflicting data
  • Temporal alignment issues

Guidance System

Smart alert features (sketched after this list):

  • Priority-based queuing
  • Cooldown period between alerts
  • Relative position indication (left/center/right)
  • Simple natural language generation
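A minimal sketch of the priority-plus-cooldown logic behind these alert features; the cooldown duration, zone split, and phrasing are illustrative assumptions:

import time

class GuidanceEngine:
    PRIORITY = {"HIGH": 2, "MEDIUM": 1, "LOW": 0}

    def __init__(self, cooldown=3.0):
        self.cooldown = cooldown      # seconds between alerts for the same object (assumed)
        self.last_alert = {}          # track_id -> timestamp of the last alert

    def build_alerts(self, risk_items, frame_width):
        now = time.time()
        # Announce the highest-risk objects first
        for item in sorted(risk_items, key=lambda r: self.PRIORITY[r["risk"]], reverse=True):
            if now - self.last_alert.get(item["track_id"], 0) < self.cooldown:
                continue              # still in cooldown, avoid spamming the user
            x1, _, x2, _ = item["bbox"]
            center = (x1 + x2) / 2
            position = ("on your left" if center < frame_width / 3
                        else "on your right" if center > 2 * frame_width / 3
                        else "ahead")
            self.last_alert[item["track_id"]] = now
            yield f"{item['class']} {item['depth']}, {position}"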

Key Learnings

  • Importance of data synchronization
  • How to design multi-modal systems
  • User experience considerations
  • Tradeoffs in alert frequency vs annoyance

Project Evolution

Research Phase

Explored computer vision models, depth estimation techniques, and sensor options. Settled on YOLOv8 + MiDaS + IMU approach.

Initial Implementation

Built basic YOLO detection pipeline. Attempted to integrate depth estimation. First attempts at Android sensor access.

Major Roadblocks

Discovered Play Store restrictions on sensor access. Depth scaling not working as expected. IMU drift issues became apparent.

Pivot Points

Switched from Streamlit to PyQt5. Changed from absolute to relative depth. Added DeepSORT for better tracking.

Final Integration

Combined all components. Added calibration routines. Implemented basic guidance system.

Major Turning Points

Technical Pivots

  • Streamlit → PyQt5 for real-time
  • Absolute → Relative depth
  • Direct Android app → PC processing

Conceptual Shifts

  • Perfect system → "Good enough" prototype
  • Focus on core functionality first
  • Accept limitations with clear explanations

Challenges & Key Learnings

Technical Challenges

  • Network Latency: 120ms delay in phone-to-laptop streaming
  • Depth Scaling: MiDaS provides only relative depth
  • IMU Drift: Position estimates diverge quickly
  • Thread Synchronization: Shared resource contention
  • Real-time Performance: Balancing speed vs accuracy
  • Android Restrictions: Play Store blocked sensor access

Key Learnings

  • Practical implementation of YOLO and DeepSORT
  • Kalman filtering fundamentals
  • Thread synchronization techniques
  • Sensor data characteristics and processing
  • Real-world limitations of theoretical concepts
  • Importance of iterative development

Current System Status

  • ✓ Real-time detection (30 FPS)
  • ✓ Multi-object tracking
  • ⚠️ Absolute depth (relative categories only)
  • ⚠️ IMU position accuracy (limited by drift)
  • ✓ Basic guidance system

Future Improvements

Technical Limitations

  • Absolute Depth: Need stereo camera or LiDAR
  • Position Tracking: Combine with visual odometry
  • Latency: Edge deployment on phone
  • Power Consumption: Optimize model size

Planned Enhancements

  • Implement visual-inertial odometry
  • Add SLAM for better mapping
  • Quantize models for mobile deployment
  • Improve guidance with more natural language
  • Add user customization options

Lessons for Next Time

What Went Wrong

  • Underestimated sensor integration complexity
  • Assumed depth estimation would be straightforward
  • Didn't account for Android restrictions
  • Overly ambitious scope for timeline

What I'd Do Differently

  • Start with simpler prototype
  • Research platform limitations earlier
  • Build more validation steps
  • Plan for more iteration time

Conclusion & Takeaways

💡

Real-World Engineering is Messy

Theoretical concepts often don't translate directly to practice. Real systems have limitations and edge cases that aren't apparent until implementation.

🧠

Research Before Implementation

Spending more time upfront understanding platform limitations and model capabilities would have saved weeks of rework.

🔄

Iterative Development is Key

Building a minimal viable product first, then enhancing it, leads to better outcomes than trying to implement everything at once.

🎯

Scope Appropriately

As a first-year student, this project was extremely ambitious. While I learned a lot, a narrower focus might have yielded more polished results.

Final Thoughts

Despite the challenges and limitations, this project provided invaluable hands-on experience with real-world computer vision and sensor systems. The lessons learned will inform all my future projects.

🚀