Advanced EDA of UK’s Road Safety Data using Python
February 19, 2022
How do the numbers of car accidents and casualties vary through the day, hour by hour? This is a guide to performing advanced EDA (Exploratory Data Analysis) on the UK’s road crash data by hour of the day using Python. This case study was part of Omdena’s AI for Road Safety challenge in collaboration with iRAP.
In November 2020, I began volunteering with Omdena’s iRAP Challenge as a Machine Learning Engineer. This was my first Omdena challenge, and I was looking for an enriching experience where I could learn a few new skills, apply my own expertise, and make new friends. I work in the finance sector, so my experience is mostly with payments data, while iRAP’s use case deals with geospatial data. This got me excited about the challenge, as I knew I would learn something new.
In this task, I collaborated with over forty machine learning enthusiasts. We began by looking for publicly available datasets and came across the United Kingdom’s dataset on road accidents, casualties, and vehicles involved. In this article, I’d like to walk you through the exploratory data analysis process that helped me understand both the nature of the data and its quality.
About the Datasets
The datasets can be downloaded from here. From this page, the required files are:
- Road Safety Data - Accidents 2019: This dataset contains an indexed record of the accidents that occurred in 2019 across the UK. The Accident Index is the primary key used in the other two datasets to link the casualties and vehicles to a specific accident.
- STATS19 Variable lookup data guide: The accidents dataset contains only categorically encoded data. For example, accident severity is 1, 2, or 3. The category labels are in this lookup file. Each Excel sheet corresponds to a particular column in the accidents dataset.
Exploratory Data Analysis
Data Preparation
Now, let’s begin the exploratory data analysis. The dataset contains 117,536 records and 32 features. Four of these features are geospatial: the latitude and longitude of each accident and the corresponding easting and northing. Each accident is indexed in the column ‘Accident Index’. Thanks to this project, I was introduced to geographical Cartesian coordinates, where a point on a map is referenced as an easting/northing pair, similar to a graph’s x and y coordinates. Given the easting and northing of a location, we can calculate its latitude and longitude, and vice versa.
Apart from each accident’s index number, location, date, and time, the other features are categorical. While the categorically encoded data is useful for machine learning, the category labels are essential for understanding the data. Hardcoding the category mappings for each of the 30 features would be tedious, so instead I matched column names with the corresponding Excel sheet names. Below is the code I used:
def create_dict(df_var):
    '''Generate a dictionary in the format {code: label}.'''
    # Lower-case the headers, since Local_Authority_(Highway) has Sentence Case
    df_var.columns = [each_col.lower() for each_col in df_var.columns]
    temp_dict = {}
    for each1, each2 in zip(df_var['code'], df_var['label']):
        temp_dict[each1] = each2
    return temp_dict

# Run the loop below in its own notebook cell: %%time must be the first line of a cell.
%%time
# df_accidents: Dataframe for all accidents in the UK in 2019
# var_lookup: Dictionary of dataframes of the Variable data look up excel
# df_accidents_decoded: Dataframe for all accidents where the categories are
#   replaced by corresponding labels from var_lookup (it must already exist,
#   e.g. as a copy of df_accidents)
for column in df_accidents.columns:
    for sheet in list(var_lookup.keys()):
        if column.lower().replace('_', ' ') == sheet.lower():
            df_accidents_decoded[column] = df_accidents[column].map(
                create_dict(var_lookup[sheet]), na_action='ignore')
            break
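To see the matching logic end to end, here is a minimal, self-contained sketch on toy data. The column names, sheet names, and the helper `build_code_map` are assumptions made for illustration, and the tiny lookup tables stand in for the real workbook (which could plausibly be loaded as a dictionary of dataframes with `pd.read_excel(path, sheet_name=None)`).

```python
import pandas as pd

# Toy stand-in for the accidents table (column names are assumptions).
df_accidents = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3"],
    "Accident_Severity": [3, 1, 2],
    "Road_Type": [6, 3, -1],  # -1 is used here as an unmapped / missing code
})

# Toy stand-in for the lookup workbook: {sheet name: dataframe of codes and labels}.
var_lookup = {
    "Accident Severity": pd.DataFrame(
        {"code": [1, 2, 3], "label": ["Fatal", "Serious", "Slight"]}
    ),
    # Headers in a different case, to mimic inconsistent sheets.
    "Road Type": pd.DataFrame(
        {"Code": [3, 6], "Label": ["Dual carriageway", "Single carriageway"]}
    ),
}


def build_code_map(df_sheet):
    """Hypothetical helper: return {code: label} without mutating the sheet."""
    cols = {c.lower(): c for c in df_sheet.columns}
    return dict(zip(df_sheet[cols["code"]], df_sheet[cols["label"]]))


# Lower-cased sheet name -> original sheet name, so each column needs one lookup.
sheets = {name.lower(): name for name in var_lookup}

df_accidents_decoded = df_accidents.copy()
for column in df_accidents.columns:
    sheet = sheets.get(column.lower().replace("_", " "))
    if sheet is not None:
        df_accidents_decoded[column] = df_accidents[column].map(
            build_code_map(var_lookup[sheet]), na_action="ignore"
        )

print(df_accidents_decoded)
```

Codes without a matching label (such as the -1 above) become NaN after `map`, which makes unmapped values easy to count with `isna()`.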
Once the mapping is done, the data is ready to be explored comfortably.
Accident Severity and Number of Casualties
The first two features of interest are the number of casualties and accident severity. Accident severity is categorized as ‘Slight’, ‘Serious’, or ‘Fatal’. The graph of accidents by severity in 2019 shows that only 1.5% of accidents were ‘Fatal’, while the majority (78%) were ‘Slight’. Consequently, ‘Slight’ accidents also account for the largest share of casualties, 77% of the 2019 total, followed by ‘Serious’ with 21% and ‘Fatal’ with only 1.89%. Interestingly, though, the average number of casualties per accident is higher for ‘Fatal’ accidents than for ‘Slight’ or ‘Serious’ ones.
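The shares and averages quoted above come down to a single group-by. The following is a minimal sketch on made-up rows, assuming columns named `Accident_Severity` and `Number_of_Casualties`; the numbers are illustrative and are not the 2019 figures.

```python
import pandas as pd

# Toy data; the values are invented for illustration only.
df = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3", "A4", "A5"],
    "Accident_Severity": ["Slight", "Slight", "Slight", "Serious", "Fatal"],
    "Number_of_Casualties": [1, 1, 2, 2, 4],
})

# Count accidents and sum casualties per severity level.
summary = df.groupby("Accident_Severity").agg(
    accidents=("Accident_Index", "count"),
    casualties=("Number_of_Casualties", "sum"),
)

# Shares of the yearly totals, plus the average casualties per accident.
summary["pct_of_accidents"] = 100 * summary["accidents"] / summary["accidents"].sum()
summary["pct_of_casualties"] = 100 * summary["casualties"] / summary["casualties"].sum()
summary["casualties_per_accident"] = summary["casualties"] / summary["accidents"]

print(summary.round(2))
```

Keeping totals and averages side by side makes it easier to spot a category that contributes few accidents but many casualties per accident.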
Temporal Features
The next point of interest in the dataset is the date and time of the accidents. As seen in figure 3(b), most accidents occurred during the day, especially during office hours. Consequently, the number of casualties is also high during those hours, and the 24-hour distribution of total casualties follows the 24-hour distribution of total accidents in figure 3(a). This also validates the observations in figures 1 and 2. However, figure 3(c) suggests that, on average, there were more casualties per accident during the night, between 11 PM and 6 AM. During the day, there were fewer casualties per accident on average, even though the total number of casualties was higher.
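The three quantities compared here (accidents per hour, casualties per hour, and casualties per accident) can be derived together. This is a minimal sketch on toy rows, assuming `Time` holds zero-padded ‘HH:MM’ strings and that casualties are stored in `Number_of_Casualties`.

```python
import pandas as pd

# Toy data; one row has no recorded time and is dropped below.
df = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3", "A4", "A5"],
    "Time": ["08:15", "08:40", "17:05", "23:30", None],
    "Number_of_Casualties": [1, 1, 2, 3, 1],
})

timed = df.dropna(subset=["Time"]).copy()
# Take the hour from the first two characters of 'HH:MM'.
timed["hour"] = timed["Time"].str.slice(0, 2).astype(int)

hourly = timed.groupby("hour").agg(
    accidents=("Accident_Index", "count"),
    casualties=("Number_of_Casualties", "sum"),
)
hourly["casualties_per_accident"] = hourly["casualties"] / hourly["accidents"]

print(hourly)
```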
Generating the 24-hour graphs is an interesting data visualization exercise, since it takes more than a single function call, unlike the bar charts. Here is a step-by-step walkthrough to generate this clock bar graph:
1. Group and get the count of accidents per time instant.
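# Keep only rows with a recorded time (notna() replaces the unary minus on a
# boolean mask, which pandas does not support), then count accidents per time.
grouped_df = (
    df_accidents_decoded[df_accidents_decoded['Time'].notna()]
    .groupby('Time')['Accident_Index']
    .count()
    .reset_index()
)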
2. Get the corresponding radian for the time instant.
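def get_radian(x):
    # x is an [hour, minute] pair of strings, e.g. ['08', '30']
    h, m = map(int, x)
    return 2 * np.pi * (h + m / 60) / 24

# Define get_radian before applying it; 'Time' is the time column from step 1
time_series = grouped_df['Time'].str.split(':')
time_series = time_series.apply(get_radian)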
3. Create a plot with polar projection. Setting polar=True also works instead of projection='polar'.
ax = plt.subplot(111, projection='polar')
ax.bar(time_series, grouped_df['Accident_Index'], width=0.1, alpha=0.3, color='red')
4. Set the direction of the plot to clockwise.
ax.set_theta_direction(-1)
5. Set the start of the clock numbering from 90 degrees of the circle, which is pi/2 in terms of radians.
ax.set_theta_offset(np.pi / 2)
6. Set the tick labels and show the plot!
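ticks = ['12 AM', '1 AM', '2 AM', '3 AM', '4 AM', '5 AM', '6 AM', '7 AM',
         '8 AM', '9 AM', '10 AM', '11 AM', '12 PM', '1 PM', '2 PM', '3 PM',
         '4 PM', '5 PM', '6 PM', '7 PM', '8 PM', '9 PM', '10 PM', '11 PM']
# Place one tick per hour first, so that the 24 labels line up with 24 ticks
ax.set_xticks(np.linspace(0, 2 * np.pi, 24, endpoint=False))
ax.set_xticklabels(ticks)
plt.show()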
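The six steps above can be combined into one reusable helper. The following is a minimal sketch rather than the exact code behind the figures: the function names `time_to_radian`, `hour_label`, and `plot_clock` are hypothetical, and the toy counts are invented.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def time_to_radian(hhmm):
    """Map an 'HH:MM' string onto a 24-hour circle (00:00 -> 0, 12:00 -> pi)."""
    h, m = map(int, hhmm.split(":"))
    return 2 * np.pi * (h + m / 60) / 24


def hour_label(h):
    """Format an hour 0..23 as a 12-hour clock label, e.g. 0 -> '12 AM'."""
    return f"{(h % 12) or 12} {'AM' if h < 12 else 'PM'}"


def plot_clock(times, counts, color="red"):
    """Draw counts as bars on a 24-hour clock face and return the axes."""
    theta = [time_to_radian(t) for t in times]
    fig = plt.figure(figsize=(6, 6))
    ax = fig.add_subplot(111, projection="polar")
    ax.bar(theta, counts, width=0.1, alpha=0.3, color=color)
    ax.set_theta_direction(-1)      # hours run clockwise
    ax.set_theta_offset(np.pi / 2)  # midnight sits at the top
    # One tick per hour, so the 24 labels match the 24 tick positions.
    ax.set_xticks(np.linspace(0, 2 * np.pi, 24, endpoint=False))
    ax.set_xticklabels([hour_label(h) for h in range(24)])
    return ax


# Toy counts per time instant, invented for illustration.
toy = pd.DataFrame({
    "Time": ["00:00", "08:30", "12:00", "17:45"],
    "Accident_Index": [2, 9, 6, 11],
})
ax = plot_clock(toy["Time"], toy["Accident_Index"])
```

Call `plt.show()` afterwards to display the figure. Returning the axes makes it possible to overlay a second series on the same clock face by calling `ax.bar` again with a different colour.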
After exploring the number of accidents by hour of the day, I looked for patterns by day of the week. On average, there were fewer accidents on weekends than on weekdays, with the highest number of accidents on Friday and the lowest on Sunday, as seen in figure 4 below.
Also, the number of accidents was consistently higher during the warmer months, while the winter months had fewer accidents. Yet the most recorded accidents occurred in November and the fewest in February. This could be because December to February is the coldest period in the UK, which may lead to fewer journeys and consequently fewer accidents. These conclusions are derived from figure 5 above.
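Both the day-of-week and the monthly counts can be derived from the date column alone. Here is a minimal sketch, assuming the dates are stored as ‘dd/mm/yyyy’ strings in a column named `Date`; the four rows are toy data.

```python
import pandas as pd

# Toy data; dates are assumed to be 'dd/mm/yyyy' strings.
df = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3", "A4"],
    "Date": ["04/01/2019", "11/01/2019", "03/02/2019", "15/11/2019"],
})

# Parse with an explicit format so that day and month are never swapped.
dates = pd.to_datetime(df["Date"], format="%d/%m/%Y")

by_weekday = dates.dt.day_name().value_counts()
by_month = dates.dt.month_name().value_counts()

print(by_weekday)
print(by_month)
```

Passing an explicit `format` matters for day-first dates: a value such as 04/01/2019 would otherwise risk being read as 1 April instead of 4 January.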
Road Features
Following this, I looked at road conditions and the number of accidents. The available features were road type, speed limit, junction details, road surface conditions, and the road class (Motorway, A, B, or C) under the UK’s road numbering scheme. It is evident in figure 6(a) that the number of accidents was highest on roads with a 30 MPH speed limit. This may simply be because most roads in the UK have a 30 MPH limit, but we did not have the data to verify this. Also, as figure 6(b) shows, single carriageway roads had more accidents than other road types; these roads are relatively dangerous in terms of road safety because nothing separates traffic moving in opposite directions.
Junction detail is also an important parameter in understanding road accidents. In figure 7(a) below, it appears that most accidents did not occur at a junction. However, once the junction categories are aggregated and missing data is set aside, accidents turn out to be more likely at some sort of junction: approximately 49k accidents happened nowhere near a junction, while the remaining 68k happened at or near one.
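The aggregation described above amounts to collapsing every junction category into one group after setting missing values aside. Here is a minimal sketch on toy labels; the label strings are assumptions modelled on the lookup guide and may not match the workbook exactly.

```python
import pandas as pd

# Assumed label strings; check them against the lookup workbook before use.
NOT_AT_JUNCTION = "Not at junction or within 20 metres"
MISSING = "Data missing or out of range"

# Toy junction details for six accidents.
junction = pd.Series([
    NOT_AT_JUNCTION,
    "T or staggered junction",
    "Crossroads",
    NOT_AT_JUNCTION,
    "Roundabout",
    MISSING,
])

# Set missing records aside, then collapse all junction types into one group.
known = junction[junction != MISSING]
at_junction = known != NOT_AT_JUNCTION
summary = at_junction.map(
    {True: "At or near a junction", False: "Not at a junction"}
).value_counts()

print(summary)
```

In this toy sample, "not at junction" is the single most common category (2 of 6), yet junctions as a whole account for more accidents (3), which is the same effect seen in the real data.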
The road classification in the UK is also an interesting feature. ‘A’ class roads had more accidents than any other class, as shown in figure 7(b). Additionally, approximately 40k data points do not have this information, which is favorable for us. During data modeling, we included only ‘A’ roads and Motorways in our data from 2005 to 2019, since that gave us enough data to train a machine learning model on.
Weather and Road Surface Conditions
The weather conditions are also important to assess road accidents.
From the bar chart above, in figure 8(a), it seems that most accidents occurred when the weather was fine. A considerable number of accidents happened during rain, and very few happened during snow or high winds. The caveat is that we cannot conclude accidents are unlikely in these weather conditions, because we don’t know how many journeys were made in total under each condition. Weather conditions also translate into road surface conditions: most accidents occurred on dry road surfaces, and the second-highest number happened on wet road surfaces, as in figure 8(b).
Since these two features complement each other, I thought it would be interesting to see their joint frequencies and used the below code to generate it:
(
    df_accidents_decoded
    .groupby(['Weather', 'Road Surface'])[['Accident_Index']]
    .count()
    .reset_index()
    .sort_values(by='Weather')
)
Here is a snippet of the frequency of accidents while it was raining, with or without high winds, against the different road surface conditions. In this table, a ‘Dry’ road surface is recorded alongside rainy weather, which suggests that the recorded data may not be fully reliable.
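An alternative view of the same joint frequencies is a cross-tabulation, which puts weather in rows and road surface in columns so that implausible combinations stand out. This is a minimal sketch using `pd.crosstab` on toy rows; the column names follow the snippet above and the label strings are assumptions.

```python
import pandas as pd

# Toy rows; the label strings are assumed for illustration.
df = pd.DataFrame({
    "Weather": [
        "Fine no high winds",
        "Raining no high winds",
        "Raining no high winds",
        "Raining + high winds",
        "Fine no high winds",
    ],
    "Road Surface": ["Dry", "Wet or damp", "Dry", "Wet or damp", "Dry"],
})

# Rows: weather, columns: road surface, cells: number of accidents.
joint = pd.crosstab(df["Weather"], df["Road Surface"])

# Rainy weather recorded together with a dry road is a data quality flag.
suspicious = joint.loc[joint.index.str.startswith("Raining"), "Dry"]

print(joint)
print("Rainy-but-dry records:", int(suspicious.sum()))
```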
Urban and Rural Areas
The last feature I looked at in this dataset records whether the accident occurred in an urban or a rural area, which makes it possible to compare urban data against rural data. The figure below shows that most accidents occurred in urban areas, approximately twice as many as in rural areas.
A whole range of comparisons can be drawn from this “Urban vs Rural” study. As in the figures below, I studied the patterns in the number of accidents and the number of casualties by time of day, where the blue bars represent urban areas and the red bars represent rural areas. The patterns in rural areas closely mirror the urban ones, but the bars are roughly half the height, which is consistent with our conclusion from figure 10.
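One way to check the "half the height" observation numerically is to compute the rural-to-urban ratio for each hour. The following is a minimal sketch on toy rows, assuming an `hour` column has already been derived and that the area is stored in `Urban_or_Rural_Area`.

```python
import pandas as pd

# Toy rows; a real run would use the decoded accidents dataframe.
df = pd.DataFrame({
    "hour": [8, 8, 8, 17, 17, 17],
    "Urban_or_Rural_Area": ["Urban", "Urban", "Rural", "Urban", "Urban", "Rural"],
})

# One row per hour, one column per area type, cells hold accident counts.
counts = (
    df.groupby(["hour", "Urban_or_Rural_Area"])
    .size()
    .unstack(fill_value=0)
)
counts["rural_to_urban"] = counts["Rural"] / counts["Urban"]

print(counts)
```

A ratio that stays close to 0.5 across the day would support the visual impression; hours where it departs from that value are candidates for a closer look.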
Conclusion
This exploratory analysis of the United Kingdom’s road accident data was conducted to engineer features for modeling an automated road scoring system using machine learning techniques. All these features, especially the temporal, weather condition, and road surface features, were important indicators for road score assessment.
Refer to the full code here.
—
This article is written by Nabanita Roy.