Exploratory Data Analysis (EDA) is a methodology for analyzing datasets to summarize their main characteristics, using techniques such as cluster analysis and box plots. It provides a foundation for further pertinent data collection and for suggesting hypotheses to test, rather than confirming existing ones. It is an important step when applying a data-centric AI approach in Machine Learning (ML) projects, as improving data quality can be as effective as doubling the training set with new samples.
The advancement of Deep Learning (DL) techniques has brought new life to the field of computer vision by revolutionizing domains such as medicine, security, and remote sensing. Most state-of-the-art models use DL as their backbone to extract features from input images (or videos). However, although Convolutional Neural Networks (CNNs) are inspired by the human visual cortex, EDA on images is not as intuitive as EDA on tabular data.
This article shares some recommendations for extracting as much information as possible when analyzing images, and discusses how some of them were applied during the AI Innovation Challenge.
Insurance can be defined as a policy in which a person or entity (the insurer) protects another person or entity against losses from a specific occurrence or eventuality, for example, insuring a car against an accident. The insurance industry comprises the people and organizations that develop, sell, administer, and regulate insurance policies.
After an insured property (e.g., a car) has been damaged in an accident, an insurance claim can be made. The insurer takes several important steps, including assessing the damage to determine whether repair or replacement is necessary, with the goal of restoring the property to its state at the time it was insured. The insurer also takes measures to mitigate fraud, which has been a persistent problem over time.
The project focused on building an Artificial Intelligence (AI) solution for validating vehicle images, identifying damage, and classifying its severity to determine a restoration price.
I. Project Pipeline
The pipeline has 2 stages (Figure 2):
Stage 1: Fraud model
This stage consists of three models. The first model (YOLOv8) detects license plates. Each detection is then cropped and passed to a second model (OCR), which reads the text on the license plate.
A third model (Vision Transformer + cosine similarity) is used to measure how similar two car images are.
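To make the comparison step more concrete, the following is a minimal sketch of cosine similarity between two embedding vectors. It assumes the embeddings have already been extracted by an image model. The three short vectors are hypothetical values chosen for illustration, not outputs of the project's model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional embeddings; real image embeddings are much longer.
claim_photo = [0.9, 0.1, 0.3, 0.5]
policy_photo = [0.8, 0.2, 0.4, 0.5]
other_car = [-0.2, 0.9, -0.5, 0.1]

same_car_score = cosine_similarity(claim_photo, policy_photo)
different_car_score = cosine_similarity(claim_photo, other_car)
print(round(same_car_score, 3), round(different_car_score, 3))
```

A score close to 1 suggests the two images show a similar car. The threshold that separates "same" from "different" has to be chosen on validation data.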
Stage 2: Car damage
This stage consists of two models: the first detects which exterior parts of the car show defects, and the second evaluates the severity of the damage. Based on that, a total price estimate for the insurance claim is calculated.
II. Steps to do EDA on image data
EDA is a mandatory part of the AI development process to achieve optimal inference results. It is part of a continuous cycle: analyze data, formulate hypotheses, build models, make inferences, validate, compare, and return to data analysis until the results are satisfactory.
It consists of the following steps:
- Data analysis: studying the characteristics of the datasets (e.g., image acquisition process, labeling quality, size and area of bounding boxes, image quality, number of samples).
- Data cleaning: identifying incomplete, incorrect, or inaccurate data, and then replacing, modifying, or deleting it.
- Data splitting: when splitting the dataset (train, validation), it is important to keep the class distribution balanced across the splits, e.g., with stratified k-fold splitting.
- Data augmentation: most Deep Learning algorithms need large quantities of data to achieve good results. Data augmentation can improve generalization and reduce overfitting by creating different variations of the same image, e.g., flipping, rotation, padding, cropping, Gaussian noise injection, random erasing.
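As an illustration of the data splitting step, here is a minimal sketch of a single stratified train/validation split using only the standard library. It is a simplification: it assumes one class label per image, whereas object detection images often contain several classes, and it produces one hold-out split rather than k folds. The function name and the class counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_indices, val_indices) with class proportions preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        # Keep at least one validation sample for every class with 2+ samples.
        if len(indices) > 1:
            n_val = max(1, n_val)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)


# Hypothetical, deliberately imbalanced label list.
labels = ["dent"] * 80 + ["scratch"] * 15 + ["broken_glass"] * 5
train_idx, val_idx = stratified_split(labels, val_fraction=0.2)
val_counts = Counter(labels[i] for i in val_idx)
print(val_counts)
```

Each class contributes the same fraction of its samples to the validation set, so rare classes are not lost by chance.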
Some recommended EDA steps for image data are:
STEP 1: Assessing data quality
For image data, the simplest EDA method is to visualize a sample of images from each class (Figure 3). This is very useful for getting familiar with the data and then adapting it to the algorithm to achieve better performance.
- Visualize multiple images at the same time. Focus on size, orientation, brightness, background variations, etc.
- Make sure there are no corrupted files (e.g., images that cannot be opened).
- Take note of the different extensions: jpg, jpeg, jpe, png, tif, tiff, bmp, ppm, pbm, pgm, sr, ras, webp.
- Verify that all images share the same color model: RGB, grayscale, etc.
- From the scraped images, only those that were of sufficiently high quality and showed damage within economic repair were selected. Duplicates were also removed, and all images were converted to the chosen format: JPG.
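Some of the checks above can be automated. The sketch below counts file extensions, flags zero-byte files, and groups byte-identical duplicates by hashing file contents. It is only a first pass under stated assumptions: exact hashing does not catch resized or re-encoded copies, and fully detecting corrupted images requires decoding each file with an image library. The function name `inventory` and the demo files are hypothetical.

```python
import hashlib
import tempfile
from collections import Counter, defaultdict
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".jpe", ".png", ".tif", ".tiff", ".bmp",
                    ".ppm", ".pbm", ".pgm", ".sr", ".ras", ".webp"}


def inventory(folder):
    """Count image extensions, list empty files, group exact duplicates."""
    extension_counts = Counter()
    by_digest = defaultdict(list)
    empty_files = []
    for path in sorted(Path(folder).rglob("*")):
        suffix = path.suffix.lower()
        if not path.is_file() or suffix not in IMAGE_EXTENSIONS:
            continue
        extension_counts[suffix] += 1
        data = path.read_bytes()
        if not data:
            empty_files.append(path)  # a zero-byte file cannot be a valid image
            continue
        by_digest[hashlib.sha256(data).hexdigest()].append(path)
    duplicates = [paths for paths in by_digest.values() if len(paths) > 1]
    return extension_counts, duplicates, empty_files


# Demo on placeholder files (the bytes are not real image data).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.jpg").write_bytes(b"fake-image-bytes-1")
    (root / "b.JPG").write_bytes(b"fake-image-bytes-1")  # exact copy of a.jpg
    (root / "c.png").write_bytes(b"fake-image-bytes-2")
    (root / "d.png").write_bytes(b"")                    # empty file
    (root / "notes.txt").write_text("not an image")
    counts, duplicates, empty = inventory(root)
    duplicate_names = [[p.name for p in group] for group in duplicates]
    empty_names = [p.name for p in empty]

print(dict(counts), duplicate_names, empty_names)
```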
STEP 2: Visualize image size and aspect ratio
In the real world, datasets rarely contain images of the same size and shape. Furthermore, it is common to combine multiple datasets to acquire more samples for a given AI task.
- Make a histogram to visualize the distribution of image size and aspect ratio (Figure 4).
- If most images have the same dimensions, then it is up to you to decide how much to alter them. You can start with the average size and aspect ratio, or with the minimum image size accepted by your chosen algorithm.
- If the distribution is bimodal (has two peaks), then you can alter the images by adding some padding.
- If the distribution is widely spread (some images very wide, others very narrow), it is better to use more advanced techniques that avoid altering the aspect ratio.
- Pick a consistent image size: large enough to keep features distinguishable, but not so large that you run out of memory.
- A histogram to visualize the distribution of size and aspect ratio (all images) was prepared (Figure 5). From the visualization, it was decided to resize the images to 640×480. No further techniques were applied.
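The recommendations in this step can be sketched as follows: bin the aspect ratios into a simple histogram, then compute how an image would be scaled and padded to a fixed target size without distorting it. This assumes the (width, height) pairs have already been read with an image library. The padding approach is one possible option and is not what the project did, since it resized directly to 640×480. Function names and sample sizes are hypothetical.

```python
from collections import Counter


def aspect_ratio_histogram(sizes, bin_width=0.25):
    """Count images per aspect-ratio bin; keys are the lower bin edges."""
    histogram = Counter()
    for width, height in sizes:
        ratio = width / height
        histogram[int(ratio / bin_width) * bin_width] += 1
    return dict(sorted(histogram.items()))


def letterbox(width, height, target_w=640, target_h=480):
    """Scale to fit inside the target, then pad; the aspect ratio is preserved.

    Returns (new_width, new_height, total_pad_x, total_pad_y).
    """
    scale = min(target_w / width, target_h / height)
    new_w, new_h = round(width * scale), round(height * scale)
    return new_w, new_h, target_w - new_w, target_h - new_h


# Hypothetical (width, height) pairs in pixels.
sizes = [(1920, 1080), (1280, 960), (640, 480), (800, 1200)]
ratio_bins = aspect_ratio_histogram(sizes)
print(ratio_bins)
print(letterbox(1920, 1080))  # wide image: padding is added top and bottom
print(letterbox(800, 1200))   # tall image: padding is added left and right
```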
STEP 3: Verify that all images have been annotated
- Everything that has not been annotated will be treated as background. Therefore, leaving images unannotated only sends conflicting signals to the model during training.
- A script was developed to validate that every collected image has a matching annotation file. The results displayed the number of images within each group, as well as the files that had to be deleted or fixed (Figure 6).
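A check of this kind could look like the sketch below. It is not the project's script: it assumes one annotation file per image with the same file stem and a `.txt` suffix (as in YOLO-style datasets), and the function name, folder layout, and demo files are hypothetical.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def check_annotations(image_dir, label_dir, label_suffix=".txt"):
    """Group files by whether an image and its annotation file both exist."""
    images = {p.stem for p in Path(image_dir).iterdir()
              if p.suffix.lower() in IMAGE_SUFFIXES}
    labels = {p.stem: p for p in Path(label_dir).iterdir()
              if p.suffix.lower() == label_suffix}
    label_stems = set(labels)
    return {
        "annotated": sorted(images & label_stems),
        "missing_annotation": sorted(images - label_stems),
        "orphan_annotation": sorted(label_stems - images),
        # An empty file means "no objects"; flag it so it can be reviewed.
        "empty_annotation": sorted(stem for stem in images & label_stems
                                   if labels[stem].stat().st_size == 0),
    }


# Demo on placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    image_dir, label_dir = Path(tmp, "images"), Path(tmp, "labels")
    image_dir.mkdir()
    label_dir.mkdir()
    for name in ("car1.jpg", "car2.jpg", "car3.png"):
        (image_dir / name).write_bytes(b"placeholder")
    (label_dir / "car1.txt").write_text("0 0.5 0.5 0.2 0.1\n")
    (label_dir / "car3.txt").write_text("")                      # empty annotation
    (label_dir / "car9.txt").write_text("1 0.4 0.4 0.3 0.3\n")   # no matching image
    report = check_annotations(image_dir, label_dir)

print(report)
```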
STEP 4: Class imbalance
This is a very common problem. When training a model, the classes are ideally uniformly distributed. Otherwise, the class with the most data points tends to bias the model.
- Performance metrics
To avoid mistaking a biased model for one that performs well, you can use metrics such as the F1 score (Dice coefficient), the Jaccard index (Intersection over Union, IoU), or mean Average Precision (mAP).
- Data augmentation
This is the most widely used regularization technique. However, when certain transformations (e.g., cropping, rotation) are applied to the images, the annotated bounding boxes are very likely to be affected as well. For this reason, the same transformations must also be applied to the corresponding annotations (Figure 7).
- Merging classes
Ideally, depending on the problem to solve, this should be done by domain experts. The high-resolution image in Figure 8 consists mostly of buildings, so if you want to detect buildings, trees, cars, buses, and trucks, you will have a huge class imbalance. To mitigate the issue, you can take zoomed-in tiles and merge similar classes (e.g., cars, trucks, buses) into one category ‘cars’. This reduces the number of classes (from five to three) and increases the number of ‘cars’ labels.
- The agreed evaluation metric was mean average precision (mAP), in particular mAP50, as it is a standard metric for object detection. The best performance was achieved by the YOLOv8 model (Figure 9).
- The distribution of classes was highly imbalanced (Figure 10). The overrepresented classes belonged mostly to parts located at the front of the car. Therefore, an additional image search targeting the underrepresented classes was carried out, and the data were augmented with horizontal and vertical flipping.
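Three of the ideas in this step are small enough to sketch directly: the IoU between two boxes, updating a bounding box after a horizontal flip, and merging classes before counting labels. The sketch assumes boxes given as (x_min, y_min, x_max, y_max) in pixels. The function names, the class counts, and the mapping are hypothetical and only mirror the example in the text.

```python
from collections import Counter


def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def hflip_box(box, image_width):
    """Bounding box after the image is flipped left to right."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


def merge_classes(labels, mapping):
    """Count labels after renaming classes according to the mapping."""
    return Counter(mapping.get(label, label) for label in labels)


print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 4))
print(hflip_box((100, 50, 300, 200), image_width=640))

# Hypothetical label list for an aerial image dominated by buildings.
labels = ["buildings"] * 50 + ["trees"] * 20 + ["cars"] * 6 + ["buses"] * 2 + ["trucks"] * 2
merged = merge_classes(labels, {"buses": "cars", "trucks": "cars"})
print(merged)
```

Flipping the same box twice returns the original coordinates, which is a convenient sanity check for any augmentation code that touches annotations.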
STEP 5: Verify the size and shape of annotated bounding boxes
Many object detection models are anchor-based (Figure 11). In other words, there is a stage in which predefined anchor boxes have to be matched with the ground truth bounding boxes (the annotations). Consequently, if the anchor boxes have not been tuned to the data, the neural network may fail to learn that certain objects exist.
Also, it is common to run multiple models in sequence for complex problems: the output of a model A is cropped and used as the input of a model B. This again highlights the importance of knowing the distribution of bounding box sizes and shapes, as these are used to tune the anchors of the next algorithm.
- Prepare a histogram to visualize the size, shape, and aspect ratio of the annotated bounding boxes. This is useful for getting a rough estimate of the smallest and largest bounding boxes you want to detect (specify a threshold considering the range of expected objects of interest). Another option is to learn the anchor box configuration.
- The project pipeline consisted of two sequential stages: a) the fraud model (YOLOv8 and OCR) and b) the car damage model (YOLOv8 and MobileNetV2) (Figure 2). For each one, it was essential to know the average size and aspect ratio of the bounding boxes in order to achieve good performance. Therefore, annotated images were visualized and a histogram was prepared (Figure 12).
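Before plotting a histogram, it can help to summarize the bounding boxes numerically. The sketch below assumes YOLO-style annotations, where each box is (x_center, y_center, width, height) normalized to the range 0 to 1, and a single known image size. The function name and the three example boxes are hypothetical.

```python
import statistics


def box_stats(boxes, image_w, image_h):
    """Summarize box sizes in pixels.

    boxes: iterable of (x_center, y_center, width, height), normalized to [0, 1].
    """
    widths = [w * image_w for _, _, w, _ in boxes]
    heights = [h * image_h for _, _, _, h in boxes]
    areas = [w * h for w, h in zip(widths, heights)]
    ratios = [w / h for w, h in zip(widths, heights)]
    return {
        "min_area": min(areas),
        "max_area": max(areas),
        "median_width": statistics.median(widths),
        "median_height": statistics.median(heights),
        "median_aspect_ratio": statistics.median(ratios),
    }


# Hypothetical annotations for 640x480 images.
boxes = [
    (0.5, 0.5, 0.25, 0.125),   # 160 x 60 px, wide box
    (0.3, 0.7, 0.5, 0.25),     # 320 x 120 px, wide box
    (0.8, 0.2, 0.125, 0.25),   # 80 x 120 px, tall box
]
stats = box_stats(boxes, image_w=640, image_h=480)
print(stats)
```

The minimum and maximum areas give a first estimate of the range of object sizes the detector has to cover, and the median aspect ratio hints at the typical box shape.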
EDA is a crucial step in any ML project. However, when dealing with images, the methodology differs somewhat from the one used for tabular data. In this project, the AI pipeline built to prevent fraudulent insurance claims consists of two stages:
a) Fraud model.
b) Car damage model.
This article discussed some EDA techniques for image data and how they were applied in the AI Innovation Challenge. The main recommendations are:
- Assess data quality: plot multiple images at the same time, and identify incomplete or incorrect data so that it can be replaced or deleted.
- Visualize image size and aspect ratio: prepare a histogram and accordingly decide on a consistent image size to be fed into the chosen algorithm.
- Verify that all images have been annotated: unannotated images are considered as background, which will only send conflicting signals to the training model.
- Address class imbalance: to avoid bias in the model, use appropriate performance metrics, augment the data, or merge classes.
- Check the size and shape of annotated bounding boxes: many object detection models are anchor-based. By visualizing the size, shape, and aspect ratio of the bounding boxes, you will be able to tune the anchors and achieve better model performance.
 J. W. Tukey, Exploratory data analysis, Reading, MA: Addison-Wesley, 1979.
 “Data-Centric AI | What is Data-Centric AI & Why Does It Matter?” https://landing.ai/data-centric-ai/ (accessed Apr. 24, 2023).
 L. Jiao et al., “A Survey of Deep Learning-based Object Detection.”
 Aung R, “Do convolutional neural networks mimic the human visual system?,” MSAIL. https://msail.github.io/post/cnn_human_visual/ (accessed Apr. 24, 2023).
 Omdena, “Vehicle Recognition and Inspection System Using Computer Vision,” Omdena, 2023. https://omdena.com/projects/buiding-a-vehicle-recognition-and-inspection-system/ (accessed Apr. 21, 2023).
 “These are the Most Generous Auto Insurers During Quarantine (2020),” Insurify. https://insurify.com/insights/most-generous-auto-insurers-quarantine/ (accessed Apr. 24, 2023).
 Fourati F, Souidene W, and Attia R, “An original framework for wheat head detection using deep, semi-supervised and ensemble learning within global wheat head detection (gwhd) dataset” 2021. Accessed: Apr. 24, 2023. [Online]. Available: https://arxiv.org/pdf/2009.11977.pdf.
 Cieślik Jakub, “How to Do Data Exploration for Image Segmentation and Object Detection (Things I Had to Learn the Hard Way),” Neptune.AI, Apr. . https://neptune.ai/blog/data-exploration-for-image-segmentation-and-object-detection (accessed Apr. 24, 2023).
 K. Oksuz, B. Can Cam, S. Kalkan, and E. Akbas, “Imbalance Problems in Object Detection: A Review,” 2020, [Online]. Available: https://arxiv.org/pdf/1909.00169.pdf.
 L. Reitsam, “Image Segmentation — Choosing the Correct Metric | by Laurenz Reitsam | Towards Data Science,” Medium, Aug. 12, 2020. https://towardsdatascience.com/image-segmentation-choosing-the-correct-metric-aa21fd5751af (accessed Apr. 24, 2023).
 “Sørensen–Dice coefficient,” Wikipedia. https://en.wikipedia.org/wiki/Sørensen–Dice_coefficient (accessed Apr. 24, 2023).
 “Jaccard index,” Wikipedia. https://en.wikipedia.org/wiki/Jaccard_index (accessed Apr. 24, 2023).
 D. Shah, “Mean Average Precision (mAP) Explained: Everything You Need to Know,” v7labs, Mar. 07, 2022. https://www.v7labs.com/blog/mean-average-precision (accessed Apr. 24, 2023).
 A. Kathuria, “Data Augmentation for object detection: How to Rotate Bounding Boxes,” PaperspaceBlog, 2018. https://blog.paperspace.com/data-augmentation-for-object-detection-rotation-and-shearing/ (accessed Apr. 24, 2023).
 T. Agrawal, “Imbalanced Data in Object Detection Computer Vision Projects,” Neptune.AI, Apr. 20, 2023. https://neptune.ai/blog/imbalanced-data-in-object-detection-computer-vision (accessed Apr. 24, 2023).
 K. E. Koech, “Object Detection Metrics With Worked Example | by Kiprono Elijah Koech | Towards Data Science,” Medium, Aug. 26, 2020. https://towardsdatascience.com/on-object-detection-metrics-with-worked-example-216f173ed31e (accessed Apr. 24, 2023).
 “computer vision – anchor box or bounding boxes in Yolo or Faster RCNN – Stack Overflow,” Stackoverflow. https://stackoverflow.com/questions/50450998/anchor-box-or-bounding-boxes-in-yolo-or-faster-rcnn#53833095 (accessed Apr. 24, 2023).
 A. Christiansen, “Anchor Boxes — The key to quality object detection | Towards Data Science,” Medium, Oct. 15, 2018. https://towardsdatascience.com/anchor-boxes-the-key-to-quality-object-detection-ddf9d612d4f9 (accessed Apr. 24, 2023).
 D. Pacassi Torrico, “Face detection – An overview and comparison of different solutions · Blog,” LIIP, Aug. 15, 2018. https://www.liip.ch/en/blog/face-detection-an-overview-and-comparison-of-different-solutions-part1 (accessed Apr. 24, 2023).
 T. Yang, X. Zhang, Z. Li, W. Zhang, and J. Sun, “MetaAnchor: Learning to Detect Objects with Customized Anchors.”