Overview
This project explores how to deploy a modern deep learning vision model under real-world edge constraints. I built a real-time system to detect driver drowsiness and distraction (e.g. texting, drinking, reaching behind) while optimizing the model to run efficiently on CPU-only hardware with limited memory.
Rather than raw accuracy, the core challenge was model efficiency, stability, and deployability.
Problem
Most deep learning models assume access to GPUs and abundant memory. In contrast, automotive and IoT deployments often operate with:
- No GPU acceleration
- Tight memory budgets
- Strict latency requirements
The baseline model achieved strong accuracy but was too large and too slow for edge deployment. The goal was to preserve accuracy while significantly reducing model size and improving CPU inference speed.
Model & Dataset
- Backbone: MobileNetV3-Large
- Classes: 10 driver behaviors (safe driving, texting, phone usage, drinking, etc.)
- Input: 224×224 RGB images normalized with ImageNet statistics
Initial experiments used MobileNetV3-Small, but it struggled with fine-grained distraction classes. MobileNetV3-Large provided stronger feature representation, making it a better candidate for post-training optimization.
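For reference, a minimal sketch of how the backbone and preprocessing can be set up with torchvision. The Resize/CenterCrop sizes and the classifier-head replacement are standard choices assumed here, not necessarily the exact training configuration.

```python
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 10  # safe driving, texting, phone usage, drinking, etc.

# MobileNetV3-Large backbone; swap the final classifier layer for the 10 behavior classes.
model = models.mobilenet_v3_large(weights=models.MobileNet_V3_Large_Weights.IMAGENET1K_V1)
model.classifier[3] = nn.Linear(model.classifier[3].in_features, NUM_CLASSES)

# 224x224 RGB input, normalized with ImageNet statistics.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```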
Optimization: Post-Training Static Quantization
To enable edge deployment, I applied Post-Training Static Quantization (INT8) in PyTorch. I chose static quantization because vision models are dominated by convolutions and their activations, which need calibrated ranges to be quantized; dynamic quantization covers mainly linear and recurrent layers and leaves convolutions in float32, so it alone is insufficient.
Pipeline
- Layer Fusion: Fused Conv–BatchNorm–Activation sequences into single modules to reduce memory access overhead.
- Instrumentation: Inserted QuantStub/DeQuantStub modules to mark where tensors enter and leave the quantized INT8 domain.
- Calibration: Calibrated with real validation data to estimate activation ranges.
- Conversion: Converted to a fully INT8 model using PyTorch’s FBGEMM backend (see the sketch after these steps).
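A condensed sketch of this eager-mode PTQ workflow, assuming torchvision's quantization-ready MobileNetV3 variant. The checkpoint path and `calibration_loader` are placeholders, not names from the actual project.

```python
import torch
from torch.ao.quantization import get_default_qconfig, prepare, convert
from torchvision.models import quantization as qmodels

# Quantization-ready MobileNetV3-Large: ships with QuantStub/DeQuantStub and a fuse_model() helper.
model = qmodels.mobilenet_v3_large(quantize=False, num_classes=10)
model.load_state_dict(torch.load("mobilenetv3_driver_fp32.pth"))  # hypothetical fine-tuned checkpoint
model.eval()

# 1) Layer fusion: fold Conv-BatchNorm-Activation sequences into single modules.
model.fuse_model()

# 2) Instrumentation: attach observers for the x86 FBGEMM INT8 backend.
model.qconfig = get_default_qconfig("fbgemm")
prepare(model, inplace=True)

# 3) Calibration: run representative validation batches so observers record activation ranges.
with torch.no_grad():
    for images, _ in calibration_loader:  # placeholder: a small DataLoader over validation images
        model(images)

# 4) Conversion: replace observed modules with true INT8 kernels.
convert(model, inplace=True)

# Script and save the INT8 model for CPU-only deployment.
torch.jit.save(torch.jit.script(model), "mobilenetv3_int8.pt")
```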
Results
| Metric | Float32 | INT8 (Quantized) |
|---|---|---|
| Model Size | 16.2 MB | 4.49 MB |
| Compression | — | 3.6× smaller |
| Accuracy Loss | — | < 0.5% |
| Hardware | GPU / Server | CPU-only edge |
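Numbers like these can be checked with a small measurement harness; below is a sketch (the helper names are mine) for on-disk size and single-image CPU latency.

```python
import os
import time
import torch

def model_size_mb(path):
    """On-disk size of a serialized model file in megabytes."""
    return os.path.getsize(path) / 1e6

def cpu_latency_ms(model, runs=100):
    """Average single-image CPU inference latency in milliseconds."""
    x = torch.randn(1, 3, 224, 224)
    with torch.inference_mode():
        for _ in range(10):  # warm-up iterations
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs * 1000
```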
Real-Time Inference
I built a real-time inference pipeline to validate deployment readiness. The system captures webcam frames at over 30 FPS, performs preprocessing on the CPU, and executes fully quantized INT8 inference. The result is a system that runs smoothly on standard x86 CPUs without any GPU acceleration.
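A minimal sketch of such a loop with OpenCV, assuming the scripted INT8 model saved in the conversion step; the class labels are placeholders.

```python
import cv2
import numpy as np
import torch

LABELS = [f"class_{i}" for i in range(10)]  # placeholder: replace with the 10 behavior names
model = torch.jit.load("mobilenetv3_int8.pt")  # scripted INT8 model from the conversion step
model.eval()

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

cap = cv2.VideoCapture(0)  # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break

    # CPU preprocessing: BGR -> RGB, resize to 224x224, normalize with ImageNet stats.
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    rgb = cv2.resize(rgb, (224, 224)).astype(np.float32) / 255.0
    rgb = (rgb - MEAN) / STD
    x = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)  # HWC -> NCHW

    # Fully quantized INT8 inference on the CPU.
    with torch.inference_mode():
        pred = model(x).argmax(dim=1).item()

    cv2.putText(frame, LABELS[pred], (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("driver-monitor", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```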
Key Learnings
- High-accuracy models can be made edge-deployable without retraining.
- Quantization is a systems problem, not just a modeling trick.
- Calibration quality is critical to preserving accuracy.
- Architectural choices strongly influence quantization robustness.
Future Work
- Quantization-Aware Training (QAT) to recover the remaining <0.5% accuracy gap.
- ONNX export for C++ deployment.
- Testing on ARM-based edge platforms (Raspberry Pi, Jetson Nano).