The main objectives of this project are:
- To create a people-counting system that uses the CCTV cameras already installed on the retail floor to give store owners customer journey analysis.
- To detect people and classify them by age and gender, and to track and count those entering and exiting through the store entrance.
- To deploy the solution on a Jetson Nano, using a custom model trained on the client’s data.
The project consists of two main components:
- The machine learning pipeline.
- The hardware environment setup.
The machine learning pipeline
Data collection & data analysis
- Samples required to train the model were created using data collected from various cameras.
- The collected data was in video format, with frame sizes varying based on the camera resolution.
- Repetitive backgrounds and redundant objects in the footage could degrade model training.
- To overcome this issue and help the model generalize, additional images from the open-source mall dataset and CCTV-captured background images from another open-source dataset were included.
- The Age and Gender dataset contained over 4,000 images in total.
- The videos were resized and annotated.
- Both video annotations and image annotations were used in this project. Objects were annotated with bounding boxes for object detection.
- The final dataset was in image format. Because YOLO models had been chosen for object detection, the labels were exported in YOLOv5 format. Initially, only labeled images were exported. Later, background images were added to the dataset to reduce false positives.
- Age and Gender Model: Three classes were considered: Child (0), Female_adult (1), and Male_adult (2).
- Once annotation was done, the frequency of each class across the entire dataset was checked, and some redundant or duplicate images were removed.
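The class-frequency check and duplicate removal described above can be sketched as follows. This is a minimal illustration, not the project's actual script. It assumes YOLOv5-style label text (one `class x_center y_center width height` line per object, with an empty file for a background image), and it only detects byte-identical duplicates. The names `CLASS_NAMES`, `count_classes`, `class_shares`, and `find_exact_duplicates` are hypothetical.

```python
import hashlib
from collections import Counter

# Class ids as listed in the text above.
CLASS_NAMES = {0: "Child", 1: "Female_adult", 2: "Male_adult"}


def count_classes(label_texts):
    """Count objects per class across an iterable of label-file contents."""
    counts = Counter()
    for text in label_texts:
        for line in text.splitlines():
            parts = line.split()
            if len(parts) == 5:  # skip blank or malformed lines
                counts[int(parts[0])] += 1
    return counts


def class_shares(counts):
    """Return each class's share of all labeled objects, keyed by class name."""
    total = sum(counts.values())
    return {
        name: (counts.get(class_id, 0) / total if total else 0.0)
        for class_id, name in CLASS_NAMES.items()
    }


def find_exact_duplicates(images):
    """Given {name: raw_bytes}, return names whose bytes repeat an earlier image."""
    seen, duplicates = {}, []
    for name, data in images.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            duplicates.append(name)
        else:
            seen[digest] = name
    return duplicates
```

Exact hashing does not catch near-duplicates such as consecutive video frames. Those would need a similarity measure, which is outside the scope of this sketch.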
Model training and testing
In this project, the model was fully retrained because the data differs substantially from the COCO dataset on which the YOLOv5 model was trained. Even though COCO has a superclass called person, the appearance and features of people differ because of cultural differences in the countries where this project will be deployed.
- Different versions of YOLO were retrained on the custom data and compared.
- The best version was selected and deployed to the Jetson Nano. While the other models exceeded 1 ms of inference time per frame, the best model stayed below 0.75 ms per frame.
- Training was repeated several times with different hyperparameter values. The aim was to obtain a model small enough to run on edge devices at a decent inference speed while maintaining a good mAP.
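A per-frame inference time like the one compared above can be measured with a small timing harness. The sketch below is an assumption about how such a comparison might be set up, not the team's benchmark. `mean_latency_ms` is a hypothetical helper, and `infer` stands for any callable that runs one model on one frame.

```python
import time


def mean_latency_ms(infer, frames, warmup=2):
    """Average wall-clock time per frame, in milliseconds, for a callable `infer`."""
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one frame")
    # Warm-up calls are excluded so one-off start-up costs do not skew the mean.
    for frame in frames[:warmup]:
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / len(frames)
```

On a GPU device, `infer` would need to block until the result is ready. Otherwise the timer measures only how long it takes to queue the work.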
The app uses a tracking method called StrongSORT, which works in two stages.
- In the first stage, objects are detected by the YOLOv5-nano model. In the second stage, the detected objects are passed to the tracker model.
- Each detected object is assigned an ID that is maintained for a defined period.
- StrongSORT is highly configurable and can be adjusted to different deployment scenarios.
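To make the idea of "an ID maintained for a defined period" concrete, here is a deliberately simplified tracker. It is not StrongSORT: it matches boxes greedily by overlap (IoU) only, with none of StrongSORT's appearance or motion modelling. The class name `SimpleIouTracker` and the parameters `iou_threshold` and `max_age` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class SimpleIouTracker:
    """Toy second stage: keeps an ID alive for up to `max_age` missed frames."""

    def __init__(self, iou_threshold=0.3, max_age=30):
        self.iou_threshold = iou_threshold
        self.max_age = max_age
        self.tracks = {}  # track_id -> {"box": box, "missed": int}
        self.next_id = 1

    def update(self, detections):
        """Match this frame's boxes to existing tracks; return [(track_id, box), ...]."""
        results = []
        unmatched = set(self.tracks)
        for box in detections:
            # Greedy match: the best-overlapping track that is still unclaimed.
            best_id, best_iou = None, self.iou_threshold
            for tid in sorted(unmatched):
                score = iou(self.tracks[tid]["box"], box)
                if score > best_iou:
                    best_id, best_iou = tid, score
            if best_id is None:
                # No track overlaps enough, so this detection starts a new ID.
                best_id = self.next_id
                self.next_id += 1
            else:
                unmatched.discard(best_id)
            self.tracks[best_id] = {"box": box, "missed": 0}
            results.append((best_id, box))
        # Age the tracks that were not seen; drop them once they exceed max_age.
        for tid in unmatched:
            self.tracks[tid]["missed"] += 1
            if self.tracks[tid]["missed"] > self.max_age:
                del self.tracks[tid]
        return results
```

In this sketch, the first stage is whatever produces `detections` for each frame. A person who disappears for more than `max_age` frames gets a new ID when they reappear.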
The tracking and counting logic
Line counting and transformation
- Line counting is implemented as a separate function. Only one object is considered at a time in the function.
- The direction of movement is determined by whether the difference between the object's position values in the current frame and the previous frame is greater than or less than zero.
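The sign-of-difference rule above can be sketched as follows. Several details are assumptions made for illustration: the counting line is horizontal at `line_y`, the tracked value is the centroid's y coordinate (image y grows downward), and downward movement is labeled "in". The names `crossing_direction` and `LineCounter` are hypothetical.

```python
def crossing_direction(prev_y, curr_y, line_y):
    """Return "in", "out", or None for one object's centroid between two frames."""
    diff = curr_y - prev_y
    # The line is crossed only if it lies between the two positions.
    crossed = min(prev_y, curr_y) < line_y <= max(prev_y, curr_y)
    if not crossed:
        return None
    # Positive difference: moving down the image; negative: moving up.
    return "in" if diff > 0 else "out"


class LineCounter:
    """Keeps the previous position per track ID and tallies crossings."""

    def __init__(self, line_y):
        self.line_y = line_y
        self.last_y = {}  # track_id -> centroid y in the previous frame
        self.counts = {"in": 0, "out": 0}

    def update(self, track_id, centroid_y):
        """Process one object for one frame; return the crossing direction, if any."""
        prev_y = self.last_y.get(track_id)
        self.last_y[track_id] = centroid_y
        if prev_y is None:
            return None  # first sighting: nothing to compare against yet
        direction = crossing_direction(prev_y, centroid_y, self.line_y)
        if direction is not None:
            self.counts[direction] += 1
        return direction
```

As in the text, `update` handles one object at a time, so the caller loops over the tracker's output for each frame.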
A data collection pipeline is added to the tracker code to support continuous learning by retrieving data continuously from the live stream. The collected data includes images with bounding boxes, images without bounding boxes, and a text file containing the coordinates and the class. The annotations are in YOLO format, while the tracker code saves its data in MOT format.
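The two output formats mentioned above can be illustrated with a pair of formatting helpers. This is a sketch under assumptions: the YOLO line uses the normalized `class x_center y_center width height` layout, and the MOT line follows the MOTChallenge column order (`frame, id, left, top, width, height, conf, x, y, z`, with `-1` for the unused 3D fields). The exact layout used by the project's tracker code is not stated, and the function names are hypothetical.

```python
def yolo_line(class_id, box, img_w, img_h):
    """Format a pixel box (x1, y1, x2, y2) as a normalized YOLO label line."""
    x1, y1, x2, y2 = box
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


def mot_line(frame, track_id, box, conf=1.0):
    """Format a pixel box (x1, y1, x2, y2) as a MOTChallenge-style text row."""
    x1, y1, x2, y2 = box
    return (
        f"{frame},{track_id},{x1:.2f},{y1:.2f},"
        f"{x2 - x1:.2f},{y2 - y1:.2f},{conf:.2f},-1,-1,-1"
    )
```

The YOLO line describes an object in a single image with no identity. The MOT row ties the same box to a frame number and a track ID.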
Testing and deployment
The code was deployed on the device and tested with both a single camera and multiple cameras.
Challenges the team was able to overcome
- Data imbalance, for example the male class being the majority class, along with imbalance across the age and gender classes. Ignoring this issue would hurt model performance, because a high overall training mAP would be driven by the high mAP of the male class alone. The model would also start overfitting to the male class if the number of training epochs were increased.
- People making U-turns. This happens when a person enters and then exits in a single U-shaped movement, or wanders around the entrance without intending to enter.
- Optimizing model performance and speed to fit the edge device's memory and resources.
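The text does not say how the U-turn case was handled. One possible approach, shown here purely as an assumption, is to judge a finished track by where it started and ended relative to the counting line, rather than by each crossing. The function `classify_track`, the horizontal line at `line_y`, and the choice of "below the line" as inside are all hypothetical.

```python
def classify_track(ys, line_y):
    """Classify one finished track from its centroid y values, in frame order.

    Returns "in", "out", "u_turn", or None if the track never crossed the line.
    """
    if not ys:
        return None
    # A point counts as inside when it is on or below the line.
    sides = [y >= line_y for y in ys]
    started_inside, ended_inside = sides[0], sides[-1]
    if started_inside == ended_inside:
        # Same side at start and end: a U-turn only if the other side was visited.
        visited_other_side = any(side != started_inside for side in sides)
        return "u_turn" if visited_other_side else None
    return "in" if ended_inside else "out"
```

With this rule, a person who steps across the line and walks back out is reported as a U-turn instead of as one entry plus one exit.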