Vehicle Detection
Vehicle detection from a monocular RGB video input using two different approaches - Supervised Learning (Support Vector Machine) and Deep Learning.
The Deep Learning implementation is the more successful of the two since it is considerably faster.
Images are fed into the classifier via sliding windows and the output is processed via a heat map.
GitHub repo
Quick links to final results:
Project Files
The initial data exploration and extraction is carried out in the notebook data_exploration.ipynb
.
The labeled data came from a combination of the GTI vehicle image database and the KITTI vision benchmark suite. Training data for the neural network as well as additional negative examples for the SVM classification where extracted from the Udacity labeled dataset. data_exploration.ipynb
also contains data extraction for the Udacity set.
The final pipeline is implemented in python. Included in the project are:
vehicleDetect.py
Vehicle Detection using a previously trained classifier.vehicleDetect_classify.py
Handles training of the classifiersvehicleDetectUtil.py
Contains common functions needed to run the vehicle detection and classificationvehicleDetect_svmVar.py
Settings for the Support Vector Machine input features
Usage
For deep learning or support vector machine method, run one of the following:
python vehicleDetect.py cnn fileName
python vehicleDetect.py svm fileName
For training, you have to set variables at the top of vehicleDetect_classify.py
and run:
python vehicleDetect_classify.py fileName
Classifiers
Support Vector Machine
The Support Vector Machine used in this is scikit-learn’s LinearSVC. It runs on modest hardware at about 1.5 seconds/frame and has lots of room for additional performance gains. It was trained on subsets of the GTI and KITTI, as well as manually extracted negative examples from the Udacity set. The latter helped in reducing false positives in areas with a lot of information (trees and complex shadows).
Training
I am using all channels, as well as histogram binning of the YCrCb colorspace as features. Initial data visualization pointed to YCrCb as a color space with useful and different information in all channels. After many trials, I arrived at the settings in vehicleDetect_svmVar.py
. For the HOG features, I am using 9 orientation bins, 8 pixels per cell and 2 cells per block - I started out with these fairly standard settings and none of my tests showed noticeably improved performance with different values. Using a small amount of color (16x16) and histogram features (16 bins) helped false positive detection in particular.
Colorspace exploration (vehicle / non-vehicle)
The feature vectors are normalized via scikit-learn’s StandardScaler. The final feature vector size is 3000. It’s original size of 6108 was reduced via scikit-learn’s principal component analysis.
Deep Learning
I am using a fairly slim convolutional neural network. It runs at 8fps on modest hardware. Since all I need here is a binary decision on small images, I expected it to perform reasonably well. It is implemented in Keras in lines 224-252 of vehicleDetectUtil.py
.
Training
I extended the training data used for the SVM with bounding box data extracted from the Udacity set. Total training set size is 118,493 64x64 images, evenly split between car and non-car samples. Data augmentation further strengthens the training set - it includes mirroring, random translation and random brightness.
The final model was trained for 100 epochs.
Vehicle Labeling
Both SVM and CNN methods use the same approach for finding and labeling vehicles. Sliding windows are used to feed images to the classifier. A heat map is generated over several frames, then gets filtered and labelled.
Sliding Window Search
I perform a simple sliding window search to extract cropped images for classification. I use different sizes of windows at different locations in the image.
The SVM method uses 457 windows (overlap: 0.8), while the CNN detection works efficiently with only 76 (overlap: 0.7).
Sliding windows (CNN). Larger window sizes closer to the bottom and 0.7 overlap. 76 total.
Heat Map
For each frame, every bounding box that is detected as part of a vehicle by two or more bounding boxes adds “heat” to a map. The result is smoothed over eight frames, further thresholded and then fed into scipy’s label function to get a single bounding box for each distinct “island” in the heat map.
False Positive Suppression
I am doing two things to prevent false positives:
- As discussed above, the heat map is thresholded twice
- Additionally, I am ignoring bounding boxes with unreasonable aspect ratios that come out of the label function
Bounding Box Smoothing
I implemented a Car class that keeps track of positions over time. Bounding box coordinates are smoothed over six frames. Additionally, I reuse the previous location if the classifier does not detect the car for up to two frames.
Example frame with labeled image (top left), heat map (top right) and unfiltered sliding-windows detection (bottom right)
Video Results
Discussion
- Setting up a neural network and training it from scratch for this turned out to be a very straight-forward process that yielded great results fairly quickly.
- The SVM method should be able to run a lot faster. In order to speed it up, I started scaling the image instead of scaling the individual crops - also to set up extracting HOG features from only a single image. In hindsight, this turned out fairly messy and slowed down my iteration time of tweaking the sliding window boundaries due to the unintuitive scaled image sizes. It started to be too time-consuming and I left it where it is right now in order to focus on the deep learning solution.
- Keeping track of individual cars in screen space was impractical to do with scipy’s label method. It does not consistently apply the same label to similar islands in the heat map, so I had to put additional checks in my car class to reset it in case labels flip. A custom label function that gets to draw on more information about the existing car instances and can make reasonable assumptions about how traffic moves should perform well here.
- Neither method fits a tight bounding box around the cars at all times. In order to increase accuracy and speed, I would look into making it a two-step process: A quick initial detection of candidate areas, followed by a very localized sliding-window search. Previously detected cars could be re-used on each frame for additional speedup.
- A search on the entire image using a large neural net would be very interesting to look into more. I tried a pre-trained YOLO-tiny with some success, and it runs very fast (20fps+). My PC did run out of GPU RAM due to the size of the neural net, so accuracy suffered considerably. I should try this again on an AWS instance…