Traffic congestion has become a major issue around the world, posing significant challenges for policymakers and city managers seeking to improve the quality of life in urban areas. It is widespread in metropolitan areas because of the growing number of road users, and it is characterized by slow vehicle speeds, longer travel times, and long queues.
The Smart-Traffic system for real-time traffic prediction [1,2,3] has been proposed as a method for predicting congested road-vehicle traffic on a given roadway within a region. This computerized method takes real-time images from traffic cameras as input and utilizes automated processing and machine learning to predict the level of congestion. The accuracy and efficiency of this system in predicting congestion levels can be improved over time by studying the outputs while reviewing the corresponding inputs and tuning the hyperparameters (see Fig. 1).
The problem of traffic congestion prediction can be formulated as an image classification problem: given an image captured by a traffic camera, determine whether the level of traffic congestion in the image is High (sufficiently congested that vehicles can only travel very slowly), Low (vehicles can travel completely freely), or Medium (intermediate between High and Low congestion levels). As a starting point for this project, the team was provided with a base TensorFlow-based neural network model for this classification problem, which was derived from Adetiloye’s doctoral research [1,2,3] and was trained on a collection of traffic camera images from the Montreal region. The initial goals of this project were:
- To get API traffic image data for a new city in Europe or North America, or implement an end-to-end solution.
- To improve model prediction accuracy by eliminating noise, such as trees and dual lanes, in terms of camera object focus.
The pipeline for this project is shown in Fig. 2 and comprises data collection, data annotation, data preprocessing, modeling, and deployment. We first collected raw traffic camera images. Then, we annotated a subset of them with traffic congestion levels to create training and test datasets. These datasets were used to build and evaluate new classification models. We also developed blurry-image and traffic-light filters and created mask polygons, intended to be used in deployment to filter out images that are too blurry or have traffic lights and to mask the “noise” in the image, such as surrounding buildings and trees, before feeding them to a classification model.
1. Data collection
In order to achieve the first goal of this project, we identified multiple public APIs for traffic cameras, including one in London and a few locations in North America. We collected images from multiple cameras from each API at 5-30 min intervals over a period of 24 hours, yielding a total of 100,000+ raw traffic camera images.
Numerous online platforms host a large number of live traffic cameras. However, access to the APIs for these cameras is often limited. Some platforms require private API key requests through a designated form. If a request is made, the private API key is typically emailed to the requester within a few minutes to a week. We collected images using APIs from Ottawa, Canada [7,8], New York, US [9,10], Illinois, US [11,12], and London, UK [13,14]. A summary of the collected images can be found in Table 1.
Table 1: Collected Image Summary
Number of Images
Image Capture Interval
Public API Available
5 min – 1 hr
Nova Scotia, Canada
No (used web scraping)
New York, US
Highway + City
50 (randomly selected)
Highway + City
25,115 (front/back labeled)
The collected traffic camera images came from both highways and cities (Fig. 3). Some API providers (e.g., the City of Ottawa) restrict API calls to no more than 1 request per 60 seconds. To capture enough variation in highway images throughout the day and night, we collected images over 24 hours at 5-30 min intervals. However, when the collecting computer went into sleep mode (mostly during the night), the capture frequency dropped, so the interval occasionally stretched to as much as 1.5 hours. It is also worth noting that these images were collected around the festive (Christmas) period; hence, they may not reflect typical traffic conditions at other times of the year.
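The rate limit described above can be respected with a small polling helper. The following is a minimal sketch, not the project's actual collector: `fetch_image` is a hypothetical callable standing in for whatever API client is used, and the sketch assumes the 60-second limit applies across all cameras of one provider.

```python
import time
from typing import Callable, Iterable, List, Tuple


def collect_round(
    camera_ids: Iterable[str],
    fetch_image: Callable[[str], bytes],  # hypothetical API client function
    min_request_gap_s: float = 60.0,      # assumed provider-wide rate limit
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[str, bytes]]:
    """Fetch one image per camera, spacing requests by at least min_request_gap_s."""
    results: List[Tuple[str, bytes]] = []
    last_request = None
    for cam in camera_ids:
        if last_request is not None:
            # Wait out whatever is left of the minimum gap since the last call.
            wait = min_request_gap_s - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        results.append((cam, fetch_image(cam)))
    return results
```

Passing `sleep` and `clock` as parameters keeps the helper testable without real waiting; a scheduler would call `collect_round` once per capture interval.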
The project focused on predicting highway traffic conditions rather than city conditions. For this reason, we narrowed our analysis to the images from Ottawa. As expected, these images showed little variation during the night, and most of them were low-congestion images. High- and medium-congestion images were rare and usually occurred around the morning and evening rush hours.
To get an initial insight into the images in this dataset, we examined the predictions from the base model on these images. As shown in Table 2 and Fig. 4, each of the 5,191 images was classified as High (3.8%), Medium (2.3%), or Low (91.8%), or the model did not return any congestion prediction (2.1%). The “None” label was returned when the prediction confidence was lower than 50%. The “Blank” label was returned for no clear reason, but we suspect it was caused by image-retrieval latency from the API, truncated images, or the computer entering sleep mode.
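The 50% confidence rule for the “None” label can be illustrated as follows. This is a sketch under the assumption that the model returns one probability per class; the function name and the dictionary input format are hypothetical.

```python
from typing import Dict, Optional


def congestion_label(probs: Dict[str, float], min_confidence: float = 0.5) -> Optional[str]:
    """Return the most likely class, or None when its confidence is below the cutoff."""
    if not probs:
        return None
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    return label if confidence >= min_confidence else None
```

For example, probabilities of 0.20 (High), 0.35 (Medium), and 0.45 (Low) give no class above the cutoff, so the result is `None`.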
Table 2: Predicted congestion classes of the Ottawa images (using the base model)
2. Data annotation
To annotate the Ottawa dataset, we leveraged the Labelbox platform, which allowed the team to collaborate on the manual labeling of the 5,183 images we collected. Each image was labeled High, Medium, or Low congestion whenever possible. If the image was too blurry to determine the traffic congestion level, we labeled it “Blurry”. If no useful image was available from the camera (most likely because the camera was off for some reason), we labeled it “Camera-off”. To speed up the process, we uploaded the predictions of the base model to the Labelbox platform as pre-labels, so that a labeler can simply click “approve” if the pre-label looks correct. Utilizing Labelbox’s annotation interface (see Fig. 5), each image was initially labeled by one person, and then reviewed either by one other person or by a group of team members through discussion in a Zoom meeting.
Since distinguishing High and Medium congestion images (or Medium and Low congestion images) can be subjective, the group adopted a set of criteria to label the images in a consistent manner. A sample image for each label is shown in Figs. 6 and 7.
Once the manual labeling was completed, a set of scripts was written to automatically download the resulting annotations through Labelbox’s Python SDK, sort the images into folders according to the labels, and select a balanced test set for evaluation (leaving the rest as the training set). This produced a labeled Ottawa dataset consisting of 5,183 images, which we subsequently used to test the generalizability of the base model and to build new models for the classification problem. Table 3 shows the breakdown of the images in the dataset.
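The balanced test-set selection could look roughly like the sketch below. It assumes the annotations have already been downloaded into a mapping from image file name to label; the function name, the per-class test size, and the rule of never taking more than half of a class are illustrative assumptions, not the project's actual script.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Split = Dict[str, List[str]]


def balanced_split(labels: Dict[str, str], test_per_class: int, seed: int = 0) -> Tuple[Split, Split]:
    """Pick the same number of test images per class; the rest become training data."""
    by_label: Dict[str, List[str]] = defaultdict(list)
    for name, label in sorted(labels.items()):
        by_label[label].append(name)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train: Split = {}
    test: Split = {}
    for label, names in by_label.items():
        rng.shuffle(names)
        # Assumption: never move more than half of a rare class into the test set.
        k = min(test_per_class, len(names) // 2)
        test[label] = sorted(names[:k])
        train[label] = sorted(names[k:])
    return train, test
```

The returned dictionaries map each label to file names, which can then be copied into per-label folders.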
3. Data preprocessing
Addressing the issue of potential confusion and misclassification caused by blurry images in traffic congestion classification models, we devised two separate solutions: a YOLOv5-based binary classification model [5,6] and an anomaly detection technique for effectively identifying and filtering out blurry images. In addition, we developed a model based on YOLOv7 [15,16] for detecting traffic lights in the image, as they are also likely to confuse traffic congestion models. The idea is to apply these models as pre-filters, so that no image that is too blurry or has traffic lights gets passed to the traffic congestion model. Moreover, we hypothesized that masking the image so as to isolate lanes from the background will improve the performance of the traffic congestion model and thus address the second project goal. To test this hypothesis, we created mask polygons for the Ottawa dataset through a combination of manual annotation and automatic image clustering.
a. YOLO-based Blurry-image Filter
We utilized the YOLOv5-small model [5,6], re-trained on a custom dataset, to develop a blurry-image filter (Fig. 8). YOLOv5 uses a CSPDarknet-based convolutional neural network as its backbone. We chose the “small” version of this model because it offers faster processing with minimal loss of accuracy, owing to its smaller number of convolutional layers. To train this model, we created a custom dataset from a subset of the Ottawa images, manually labeling them as “blurry” or “clear” using the Roboflow platform [27]. The images were then resized to 416 x 416 for faster processing, with white padding added to preserve the original 704 x 480 aspect ratio. A total of 182 images were used for model building, split into training, validation, and test sets at 70%, 20%, and 10%, respectively. Model performance improved when the number of training epochs was increased from 61 to 100.
Although 22 of the 36 validation images initially appeared to be predicted incorrectly, further investigation showed that all but one of these were correct predictions on images whose labels had been assigned incorrectly.
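The resize-with-padding step can be made concrete with a little arithmetic. The sketch below only computes the geometry (it does not touch pixel data), and it assumes the padding is split evenly on both sides, which the text does not specify.

```python
from typing import Tuple


def letterbox_geometry(src_w: int, src_h: int, target: int = 416) -> Tuple[int, int, int, int]:
    """Return (new_w, new_h, pad_left, pad_top) for fitting an image into a square."""
    scale = target / max(src_w, src_h)         # shrink the longer side to `target`
    new_w = min(target, round(src_w * scale))
    new_h = min(target, round(src_h * scale))
    pad_left = (target - new_w) // 2           # assumed: padding centered horizontally
    pad_top = (target - new_h) // 2            # assumed: padding centered vertically
    return new_w, new_h, pad_left, pad_top
```

For a 704 x 480 frame this gives a 416 x 284 resized image with 66 rows of white padding above and below.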
b. Anomaly Detection-based Blurry-image Filter
Anomaly detection can be used to detect invalid camera images, such as blurry and “camera-off” images, by identifying them as anomalies.
In this method, an anomaly detection network called LUNAR [17,18] was trained on features extracted from valid traffic camera images. LUNAR is a graph neural network that learns, in a trainable way, from the nearest neighbors of each node in order to find anomalies. Pre-trained networks such as VGGNet, MobileNetV3, and EfficientNet were evaluated as feature extractors. MobileNetV3 produced the best F1-score and was therefore selected for the final implementation. The extracted features were then used to train the LUNAR anomaly detection network.
After the training, the following results were obtained when using the Ottawa test set: a precision of 0.97862, recall of 0.97859, and F1-score of 0.97851.
This is an unsupervised method, and only valid traffic camera images are required for training. LUNAR is fast, with a training time of about 2 min for 1000 epochs and an inference time of 0.22 sec for a single image (on a single NVIDIA Quadro T1000 GPU).
Fig. 9 shows the confusion matrix obtained with the trained network on the Ottawa test set. Fig. 10 shows its predictions on sample images.
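To illustrate the idea of flagging invalid images from their nearest-neighbor distances, here is a deliberately simplified stand-in. It is not LUNAR: instead of a trained graph neural network it uses the plain mean distance to the k nearest valid feature vectors, and the threshold rule (a margin over the largest leave-one-out score) is an assumption made for the example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def knn_score(x: Vector, reference: List[Vector], k: int = 3) -> float:
    """Mean Euclidean distance from x to its k nearest reference vectors."""
    nearest = sorted(math.dist(x, r) for r in reference)[:k]
    return sum(nearest) / len(nearest)


def fit_threshold(valid: List[Vector], k: int = 3, margin: float = 1.5) -> float:
    """Largest leave-one-out score among valid samples, scaled by a safety margin."""
    scores = []
    for i, x in enumerate(valid):
        others = valid[:i] + valid[i + 1:]  # exclude the sample itself
        scores.append(knn_score(x, others, k))
    return margin * max(scores)


def is_anomaly(x: Vector, valid: List[Vector], threshold: float, k: int = 3) -> bool:
    """Flag x when it lies unusually far from the valid feature vectors."""
    return knn_score(x, valid, k) > threshold
```

In practice the vectors would be image features from a pre-trained network rather than the toy 2-D points one might use to try this out.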
c. Mask Creation and Inference Results on Masked Images
This approach uses masked images containing only the isolated road lanes as input to the base classification model. To create the masks, we first grouped the images by camera location and view using the Image Clustering Library [4]. This library uses the pre-trained VGG16 network [26] to extract image features, which are then passed to a hierarchical clustering algorithm that groups images by similarity.
After clustering according to camera locations and then according to different views for each location, a Labelbox project was set up to manually create mask polygons for road lanes. Because we clustered the images, only about 70 masks were needed for a set of around 4,500 images. Once initial masks were available, the masks along with dilations were used to obtain traffic congestion inference from the base traffic congestion model, as illustrated in Fig. 11.
Fig. 12 presents the classification performance for different amounts of mask dilation. The results show that masks that closely enclose the lanes (i.e., little to no dilation) yielded worse results than those that enclose the lanes more loosely (i.e., more dilation, up to a kernel size of 25 pixels). This suggests that smaller masks tend to contain noisy pixels that dominate the inference, yielding worse results. As the mask size increases, the results improve up to a certain limit, beyond which the masks overlap with other lanes or noisy elements in the scene.
Since the base model was trained on complete scenes rather than masked ones, training a new model on masked images is expected to improve the results.
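Mask dilation with a square kernel can be sketched in plain Python as below. This is only an illustration on nested lists; an image library would normally be used, and the kernel shape and the fill value for masked-out pixels are assumptions.

```python
from typing import List

Grid = List[List[int]]


def dilate(mask: Grid, kernel_size: int) -> Grid:
    """Grow a binary mask with a square kernel of odd size (1 leaves it unchanged)."""
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    r = kernel_size // 2
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # Switch on every pixel within the kernel window around (x, y).
                for yy in range(max(0, y - r), min(h, y + r + 1)):
                    for xx in range(max(0, x - r), min(w, x + r + 1)):
                        out[yy][xx] = 1
    return out


def apply_mask(image: Grid, mask: Grid, fill: int = 0) -> Grid:
    """Keep pixels where the mask is set and replace the rest with `fill`."""
    return [[px if m else fill for px, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]
```

A larger `kernel_size` loosens the mask around the lanes, which is the quantity varied in the dilation experiment.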
d. YOLOv7 Object Detection-based Filter for Traffic Lights
Traffic commonly builds up at road intersections, where traffic lights are installed to regulate it. At these points, road users may experience temporary wait times. To prevent such images from being passed to the traffic congestion model, we developed a filter that detects the presence of traffic lights using YOLOv7 models [15,16].
The pre-trained YOLOv7 models were trained on the COCO object detection dataset [20], which includes traffic lights, so no additional training was required. We used the yolov7-tiny model [15,16] because it offers faster inference with very good accuracy. Fig. 13 shows examples of applying the filter to images from Illinois, US.
4. Modeling
We took two separate approaches for addressing the second project goal (to improve model prediction accuracy by eliminating noise):
- We trained EfficientNet [28] and other neural network models for better classification performance.
- We developed a method for estimating traffic density by detecting vehicles using YOLO models.
a. EfficientNet B3 Model Trained on Original Dataset
The EfficientNet B3 model was trained on an improved version of the original dataset (the one on which the base model was trained) using the Keras framework. The pre-trained model came from the TensorFlow module tensorflow.keras.applications, and we used methods from this module to train it. The three traffic congestion class labels used were “hc” (= High), “lc” (= Low) and “mc” (= Medium).
We applied transfer learning, training only the top layers while freezing all the other layers. For this step, a relatively large learning rate of 0.0007 was used. The model was fit with model.fit() using the ModelCheckpoint callback and then saved.
We used an image size of 300 x 300, which is the standard size for EfficientNet B3 models. The batch size used was 32 and the train-test split ratio was 75:25.
The original dataset had a class imbalance in addition to being a small dataset. As the medium congestion data samples were low in number, three types of augmentation were applied to this class to increase the samples in this set. To further improve the performance by increasing the total number of samples, just one type of augmentation was applied to other classes. Corrupted images that had 0 pixels were removed from the dataset. A total of 1,074 images after augmentation were used for training.
Fig. 14 shows the confusion matrix obtained on the test set after the final training on the augmented dataset. Fig. 15 presents the plots of the training and validation accuracy over 30 epochs. A training accuracy of 96.99%, validation accuracy of 97.23%, and test accuracy of 80.64% were achieved. A summary of the results is presented in Table 4.
b. EfficientNet B3 Model Trained on Unmasked Ottawa Dataset
The EfficientNet B3 model was trained on the unmasked Ottawa dataset using the Keras framework to predict the level of traffic congestion. The same image size (300 x 300) and batch size (32) were used as in the previous section, while the train-test split ratio was 80:20 here. The training steps were the following:
1. Use transfer learning: freeze all layers, rebuild the top, and train only the top layers. For this step, a relatively large learning rate of 0.0007 was used. The ModelCheckpoint callback was used during training, the model was fit using model.fit(), and the model was finally saved.
2. Unfreeze a number of layers and fit the model using a smaller learning rate.
The pre-trained model was from the module tensorflow.keras.applications; hence, we employed methods from this module for training the EfficientNet model.
The class imbalance was handled using augmentation because the High and Medium congestion data samples were fewer in number. Augmentation was applied to these classes to increase the number of samples in this set. Corrupted images that had 0 pixels were removed from the dataset.
There were a total of 1,706 images after augmentation, which were then used for training.
Fig. 16, Fig. 17, and Table 5 contain the accuracy plots, the confusion matrix, and the classification report obtained on the test set after the final training on the unmasked Ottawa dataset. There were 221 test images. The training accuracy did not generalize well to the test set (training accuracy: 89.00%, validation accuracy: 86.25%, test accuracy: 75%). Table 6 presents an analysis of misclassified data samples.
Table 5: Classification report on the 221 test samples from the unmasked Ottawa dataset
We also tested an ensemble model constructed by combining the model trained on the original dataset (after augmentation) and the model trained on the Ottawa dataset, using the average ensemble method. Ensembling did not improve the performance very much, though more Low images were classified correctly by the ensemble model. The individual models, however, performed much better on the test set taken from the dataset on which they were trained. The accuracy of the ensemble model on the original and Ottawa test sets was 68% and 69%, respectively.
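Average ensembling can be sketched as follows, assuming each model outputs one probability per class in the same class order. The function names are hypothetical, and the class order “hc”, “lc”, “mc” is taken from the labels used earlier.

```python
from typing import List, Sequence

CLASSES = ("hc", "lc", "mc")  # High, Low, Medium congestion


def average_ensemble(prob_sets: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-class probabilities predicted by several models."""
    n_models = len(prob_sets)
    n_classes = len(prob_sets[0])
    return [sum(p[i] for p in prob_sets) / n_models for i in range(n_classes)]


def ensemble_predict(prob_sets: Sequence[Sequence[float]],
                     classes: Sequence[str] = CLASSES) -> str:
    """Return the class with the highest averaged probability."""
    avg = average_ensemble(prob_sets)
    best = max(range(len(avg)), key=avg.__getitem__)
    return classes[best]
```

With one model predicting (0.6, 0.3, 0.1) and the other (0.2, 0.7, 0.1), the averaged probabilities favor “lc”.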
c. YOLOv7 Vehicle Detection
For vehicle detection, we used the YOLOv7 object detection model and trained it in two steps:
1. We first used transfer learning to train the model for vehicle detection on a Roboflow vehicle detection dataset [21], using the model pre-trained on the COCO dataset as a starting point. In this step, the model was trained to detect vehicles in general, without considering their direction, so that it first learned what it had to detect (Fig. 18).
2. After training the model for vehicle detection, we used another, smaller Roboflow vehicle direction detection dataset [22] to train it further to detect the orientation of the detected vehicles, based on the direction they were facing in the images (Fig. 19). This involved another round of transfer learning, using the vehicle detection model obtained in the first step as a starting point.
The final model therefore detected two types of vehicles:
1. Front-facing vehicles
2. Back-facing vehicles
After the second-stage training, we achieved the following results on the Roboflow dataset:
d. Slicing Aided Hyper Inference
We explored the use of Slicing Aided Hyper Inference (SAHI) [23,24], which is designed for detecting small objects and works in conjunction with object detection frameworks such as YOLO, MMDetection, and Faster R-CNN. Sliced inference means performing inference over smaller slices of the original image and merging the results, similar to a sliding-window technique. Fig. 20 provides a comparison of YOLOv7 and SAHI for vehicle detection.
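The slicing step can be illustrated without any detection library. The sketch below is not the SAHI implementation; it only shows how overlapping windows might be laid out and how a box detected inside a window maps back to full-image coordinates. The merging of duplicate detections across windows is omitted, and all names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def slice_windows(img_w: int, img_h: int, slice_w: int, slice_h: int,
                  overlap: float = 0.2) -> List[Box]:
    """Lay out windows that cover the image, overlapping by the given fraction."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step_x = max(1, round(slice_w * (1 - overlap)))
    step_y = max(1, round(slice_h * (1 - overlap)))
    windows: List[Box] = []
    y = 0
    while True:
        y0 = min(y, max(0, img_h - slice_h))      # clamp the last row to the edge
        x = 0
        while True:
            x0 = min(x, max(0, img_w - slice_w))  # clamp the last column to the edge
            windows.append((x0, y0, min(x0 + slice_w, img_w), min(y0 + slice_h, img_h)))
            if x + slice_w >= img_w:
                break
            x += step_x
        if y + slice_h >= img_h:
            break
        y += step_y
    return windows


def to_full_image(box: Box, window: Box) -> Box:
    """Shift a box detected inside a window back into full-image coordinates."""
    dx, dy = window[0], window[1]
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)
```

Each window would be passed to the detector separately, and the shifted boxes from all windows would then be merged.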
e. YOLOv8 Lane Segmentation
For lane segmentation, we used the YOLOv8 instance segmentation model, training it in two stages:
1. First, we trained the model using a lane instance segmentation dataset on Roboflow [25]. As shown in Fig. 21, this dataset gave the model an initial understanding of the segmentation task and of how lanes are structured.
2. In the next stage of training, we used the Ottawa lane segmentation dataset constructed above through automatic clustering and manual mask creation (see Fig. 22). In this dataset, the lanes were joined together, and the segmentation was done only to separate the two sides of the road.
The data annotation was done in a format slightly different from the YOLO format, so we had to convert the JSON annotation files before the final training. The final training was done for 22 epochs, and the results are shown in Fig. 23.
In the next step, we used this YOLOv8 lane segmentation model together with the YOLOv7 vehicle detection model to calculate the density of vehicles on the road and to visualize and analyze our annotated data.
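A conversion of this kind might look like the sketch below. The input structure (a list of points with pixel `x` and `y` keys) is a hypothetical stand-in for the exported JSON, whose exact schema is not given here; the output follows the YOLO segmentation label convention of a class index followed by normalized polygon coordinates.

```python
from typing import Dict, List


def polygon_to_yolo_line(class_id: int, points: List[Dict[str, float]],
                         img_w: int, img_h: int) -> str:
    """Convert one pixel-space polygon into a YOLO segmentation label line."""
    if len(points) < 3:
        raise ValueError("a polygon needs at least three points")
    fields = [str(class_id)]
    for p in points:
        # Normalize to [0, 1] and clamp points that fall slightly outside the image.
        x = min(max(p["x"] / img_w, 0.0), 1.0)
        y = min(max(p["y"] / img_h, 0.0), 1.0)
        fields += [f"{x:.6f}", f"{y:.6f}"]
    return " ".join(fields)
```

One such line per polygon would be written to a text file named after the image.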
f. Vehicle Count Model
We conducted an initial exploratory data analysis on the data by analyzing the number of cars in the categorized photo dataset. We also used lane segmentation and vehicle identification models to objectively classify images as having either medium or high traffic based on quantitative criteria.
To calculate the density, we used the following formula:
density = (number of vehicles detected / lane area in pixels) × 1000
We multiplied the density by 1000 because the area computed in numbers of pixels tends to be a very large number compared to the number of vehicles detected. We then applied this formula to the labeled data. Fig. 24 shows a violin plot summarizing the result.
For each image in the three classes, there are usually two sides of the road present (front and back). Taking them separately and together therefore leads to the total of nine plots shown in the figure. We see that the density for Low images has a narrower distribution because fewer cars are present. The density for High images ranges from 0 to almost 5. Based on this analysis, we can use thresholds to categorize each lane in each image. Here we classified lanes with a density over 1.4 as High and below 0.75 as Low traffic congestion, while classifying them as Medium otherwise.
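The density thresholds can be applied as in the following sketch. It assumes the density is the number of detected vehicles divided by the lane area in pixels, multiplied by 1000, as the surrounding text describes; computing the lane area from a polygon with the shoelace formula is an added assumption about how the area might be obtained.

```python
from typing import List, Tuple


def polygon_area(points: List[Tuple[float, float]]) -> float:
    """Area of a simple polygon in square pixels (shoelace formula)."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2


def lane_density(vehicle_count: int, lane_area_px: float, scale: float = 1000.0) -> float:
    """Vehicles per pixel of lane area, scaled up for readability."""
    if lane_area_px <= 0:
        raise ValueError("lane area must be positive")
    return scale * vehicle_count / lane_area_px


def congestion_from_density(density: float, low_below: float = 0.75,
                            high_above: float = 1.4) -> str:
    """Map a density value to High, Medium, or Low using the stated thresholds."""
    if density > high_above:
        return "High"
    if density < low_below:
        return "Low"
    return "Medium"
```

For instance, 3 vehicles in a lane of 2,000 pixels give a density of 1.5, which falls in the High range.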
g. YOLOv7-StrongSORT Tracker for Traffic Flow Rate Detection
The traffic flow rate can be defined as the number of vehicles passing through the tracker zone (a pre-defined rectangular area) every 5 sec. For vehicle detection, we used the pre-trained YOLOv7 model. Once a vehicle is detected, the results are passed to a StrongSORT tracker [19]. The tracker counts the detected vehicles that enter and exit the tracker zone by crossing the line drawn across the middle of the rectangular zone.
In order to improve the speed of tracking, all the displays used for marking the tracker zone can be turned off. Fig. 25 captures the StrongSORT tracker for traffic flow rate measurement.
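The line-crossing count can be sketched independently of the tracker. The example below assumes the tracker yields, for each track ID, the vehicle's centroid positions in frame order within one 5-second window, and that the counting line is horizontal; both the input format and the line orientation are assumptions.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def count_line_crossings(tracks: Dict[int, List[Point]],
                         zone: Tuple[float, float, float, float]) -> int:
    """Count tracks whose centroid crosses the horizontal midline of the zone."""
    x0, y0, x1, y1 = zone
    mid_y = (y0 + y1) / 2
    count = 0
    for points in tracks.values():
        for (xa, ya), (xb, yb) in zip(points, points[1:]):
            crossed = (ya < mid_y) != (yb < mid_y)         # switched sides of the line
            inside = x0 <= xa <= x1 and x0 <= xb <= x1     # stayed within the zone width
            if crossed and inside:
                count += 1
                break  # count each vehicle at most once per window
    return count
```

Calling this once per 5-second window of tracks gives the flow rate as defined above.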
5. Deployment
We integrated the Ottawa traffic camera API, the YOLOv5-based blurry-image filter, and the traffic-light detection filter with the base traffic congestion model.
a. Web Scraping
Initially, web scraping was used in this project for data collection because we could not find a readily available public API from the traffic camera hosting websites. However, API calls were preferred over web scraping due to efficiency, data quality, and data governance considerations.
b. New API Connection
To demonstrate the new capabilities developed in this project, we integrated the Ottawa highway and city cameras [7,8] into a Flask application. A sample of the Smart-Traffic application user interface is presented in Fig. 26.
c. Blurry-image Filter
In our implementation, the developed YOLOv5-based blurry-image filter was applied before the images are sent to the base model for traffic congestion prediction (Fig. 27). When this system was tested with live images from the Ottawa API, it performed well overall, except for a few edge cases such as predictions obtained during foggy weather conditions. The foggy images were never seen by the model and were misclassified as blurry, even though human eyes could still determine the traffic congestion level. Images captured in more diverse weather conditions need to be added to the training set to re-train the model to be more robust in production. It was also observed that applying the blurry-image filter doubled the overall processing time from 2 minutes to 4 minutes for the 29 cameras, amounting to about 8 sec per camera. The next step would be to evaluate other blurry-image filters, such as the anomaly detection-based filter, comparing it with the YOLO-based filter in terms of generalizability and processing speed.
d. Traffic-light Filter
We also applied the YOLOv7-based traffic-light filter to the image before sending it to the base traffic prediction model. If a traffic light is detected, the system would display, “🚦detected. You may experience temporary wait times.”
The system was configured to detect the presence of traffic lights and filter out the image only when the confidence score is 30% or higher. The system was tested on the COCO dataset [20], where it achieved 97% accuracy. The deployment code was also tested on the original New South Wales API, with an inference time of 2 to 3 seconds per image. The model was traced at the very start of the application to achieve a faster inference time (Fig. 28).
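The 30% confidence rule can be expressed as a small check over the detector's output. This sketch assumes detections arrive as dictionaries with `name` and `confidence` keys, which is a hypothetical format; “traffic light” is the class name used in the COCO label set.

```python
from typing import Dict, List, Union

Detection = Dict[str, Union[str, float]]


def has_traffic_light(detections: List[Detection], min_confidence: float = 0.30,
                      target_class: str = "traffic light") -> bool:
    """True when any traffic-light detection meets the confidence cutoff."""
    return any(
        d["name"] == target_class and float(d["confidence"]) >= min_confidence
        for d in detections
    )
```

Images for which this returns `True` would be held back from the congestion model and shown with the traffic-light notice instead.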
6. Conclusion and future work
The main goals of this project were (1) to obtain API traffic image data for a new city in North America or Europe, or to put in place an end-to-end solution, and (2) to remove noise from the images, such as trees and dual lanes, in order to increase the prediction accuracy of the traffic congestion model. We successfully obtained a large number of traffic images from several locations in multiple countries and used a subset to create manually labeled training and test datasets. We explored the idea of masking the traffic camera images to remove noise. We also developed ML-based filters to detect blurry and camera-off images (another type of noise in the classification problem). We improved the dataset on which the base classification model was trained through data augmentation. We used that dataset as well as the newly labeled dataset to train several new models, which showed performance improvements over the base model. An additional approach based on traffic density estimation via vehicle counting and lane segmentation was explored, with some encouraging results.
While these results are promising, future work can include:
- Addressing dual-lane situations via hardware enhancement, i.e., using unidirectional lane-focused cameras.
- Enabling cleaner image captures with an improved camera lens for night traffic and foggy weather conditions.
- Collaborating with potential partners in the traffic and logistics industry.
- Software integration with third-party apps.
- Building alert notification systems.
Acknowledgments
We would like to express our gratitude to our colleagues Jaelin Lee, Dhanya Sethumadhavan, Raza Ahmad, Jay Mody, Harini Suresh, Shine Minn Kha, Htwe Eaint Phyu, and other collaborators in this project for all their hard work. Their contributions were essential in shaping our understanding of the problem, collecting and analyzing data, developing and improving the ML models, and writing the final project report (which was the basis for this article).
References
1. T. O. Adetiloye (2018). “Smart Traffic”. https://github.com/taiwotman/Smart-Traffic
2. T. O. Adetiloye (2021). “Predicting Short-Term Traffic Flow Congestion On Urban Motorway Networks”. Patent No US11,195,412 B2. U.S. Patent and Trademark Office.
3. T. O. Adetiloye (2018). “Predicting Short-Term Traffic Congestion on Urban Motorway Networks”. PhD thesis, Concordia University.
4. S. Schmerler (2019). “Imagecluster”. GitHub repository: https://github.com/elcorto/imagecluster
5. G. Jocher (2020). “YOLOv5 by Ultralytics”. https://doi.org/10.5281/zenodo.3908559
6. G. Jocher (2020). “YOLOv5 SOTA Realtime Instance Segmentation”. GitHub repository: https://github.com/ultralytics/yolov5
7. Government of Canada. (n.d.). Ottawa traffic camera map. https://www.arcgis.com/home/item.html?id=7567f27085814487ae6df41170ea2ebf
8. Ottawa Traffic Open Data. (n.d.). Certificate request form. http://trafficopendata.ottawa.ca/ts/rsadmin/certificate.jsp
9. New York API Documentation | 511NY Developers – help. https://511ny.org/developers/help
10. New York API Documentation | 511NY API key request. https://511ny.org/developers/help/api/get-api-getcameras_key_format
11. Illinois Department of Transportation. (n.d.) gateway traffic cameras information. https://gis-idot.opendata.arcgis.com/datasets/IDOT::illinois-gateway-traffic-cameras/about
12. Illinois Department of Transportation. (n.d.) gateway traffic cameras Public API. https://gis-idot.opendata.arcgis.com/datasets/IDOT::illinois-gateway-traffic-cameras/api
13. Transport for London. (n.d.). London Traffic Cameras – Live TfL JamCam Feeds. https://www.tfljamcams.net/
14. Transport for London. (n.d.). London Traffic Cameras – Live TfL JamCam Public API. https://api.tfl.gov.uk/Place/Type/JamCam
15. C. Y. Wang, A. Bochkovskiy, and H. Y. M. Liao (2022). “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors”. arXiv preprint. https://doi.org/10.48550/arXiv.2207.02696
16. G. Jocher (2022). “YOLOv7”. GitHub repository: https://github.com/wongkinyiu/yolov7
17. A. Goodge, B. Hooi, S. K. Ng, and W. S. Ng (2022). “LUNAR: Unifying Local Outlier Detection Methods via Graph Neural Networks”. arXiv preprint. https://doi.org/10.48550/arXiv.2112.05355
18. A. Goodge (2022). “LUNAR”. GitHub repository: https://github.com/agoodge/LUNAR
19. M. Brostrom (2022). “Real-time multi-camera multi-object tracker using YOLOv7 and StrongSORT with OSNet”. GitHub repository: https://github.com/mikel-brostrom/Yolov7_StrongSORT_OSNet
20. Microsoft COCO: Common Objects in Context Object detection dataset. http://cocodataset.org
21. Roboflow vehicle detection dataset. https://universe.roboflow.com/roboflow-100/vehicles-q0x2v
22. Roboflow vehicle direction detection dataset. https://universe.roboflow.com/thien-tan/test-2-g3mkp
23. F. C. Akyon, S. O. Altinuc, and A. Temizel (2022). “Slicing Aided Hyper Inference and Fine-tuning for Small Object Detection”. arXiv preprint. https://doi.org/10.48550/arXiv.2202.06934
24. Hamzalopode (2021). “SAHI: Slicing Aided Hyper Inference”. GitHub repository: https://github.com/obss/sahi
25. Roboflow Road Lane Instance Segmentation Computer Vision Project. https://universe.roboflow.com/demarcationbased-road-lane-segmentation/road-lane-instance-segmentation
26. K. Simonyan and A. Zisserman (2014). “Very deep convolutional networks for large-scale image recognition”. arXiv preprint. https://doi.org/10.48550/arXiv.1409.1556
27. Roboflow. https://roboflow.com
28. M. Tan and Q. V. Le (2019). “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”. arXiv preprint. https://doi.org/10.48550/arXiv.1905.11946