A Beginner’s Guide to Exploratory Data Analysis with Python
February 27, 2024
Introduction
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is one of the main components of the Data Science Life Cycle. It is a technique for understanding the various aspects of the data. To perform EDA with Python, you will need Python libraries such as Pandas, NumPy, Matplotlib, and Seaborn. Pandas is used for data exploration, while Matplotlib and Seaborn are used for plotting the dataset to get more insights from the data.
Even if you are new to the field and don't have much practice with these libraries, you can still learn EDA with Python from this article, because each part of the code is described clearly.
DataFrame Built-In Functions
.head() shows the first 5 records by default.
.tail() shows the last 5 records by default.
.shape gives the shape of the dataset as (number of rows, number of columns). It is an attribute, so it is used without parentheses.
.describe() returns summary statistics for the numeric columns of the dataset.
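As a minimal sketch of these four built-ins, the snippet below uses a tiny made-up table. The column names and values are assumptions for illustration only and are not taken from the article's dataset.

```python
import pandas as pd

# Tiny made-up table; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "season": [2008, 2008, 2009, 2009, 2010, 2010, 2011],
    "winner": ["Team A", "Team B", "Team A", "Team C", "Team B", "Team A", "Team C"],
    "win_by_runs": [10, 0, 35, 0, 4, 0, 140],
})

first_rows = df.head()      # first 5 rows by default
last_rows = df.tail()       # last 5 rows by default
n_rows, n_cols = df.shape   # attribute, not a method: no parentheses
summary = df.describe()     # count, mean, std, min, quartiles, max (numeric columns)

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
print(summary)
```

Passing a number, for example `df.head(3)`, changes how many rows are returned.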
Import Libraries
The first step of exploring any dataset is to import the required libraries.
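A typical import cell for the libraries named in the introduction might look like the following. The aliases are the conventional ones and are an assumption here, since the original code is not shown in the text.

```python
import numpy as np                 # numerical arrays
import pandas as pd                # loading and exploring tabular data
import matplotlib.pyplot as plt    # basic plotting
import seaborn as sns              # statistical plots built on Matplotlib

print(pd.__version__)
```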
Upload the Dataset
Pandas is the library used to load and manipulate the data. So, we will use the pandas alias 'pd' to read the dataset into the Jupyter notebook.
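Loading a CSV file with pandas can be sketched as follows. To keep the example self-contained, it reads from an in-memory string instead of a file; the rows, the column names, and the file name mentioned in the comment are all hypothetical.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; rows and column names are made up.
csv_text = """id,season,city,winner
1,2008,Bangalore,Team A
2,2008,Mumbai,Team B
3,2009,Cape Town,Team A
"""

# With a real file you would pass its path instead, e.g.
# pd.read_csv("matches.csv")  ("matches.csv" is a hypothetical file name).
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```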
.shape
The shape of the data shows the number of records (636) and the number of columns (18) in the dataset.
.describe()
.describe() summarizes the numeric columns of the data: count, mean, standard deviation, minimum, quartiles, and maximum.
Null value
To see if there are any null values in the data, you can count them per column, for example with .isnull().sum().
The output shows that 'umpire3' is entirely null, so we can remove such columns.
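A null check along these lines could look like the sketch below. The small table is made up; only the column name 'umpire3' comes from the article, and the other columns are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'umpire3' is deliberately empty to mirror the article's finding.
df = pd.DataFrame({
    "city": ["Mumbai", None, "Delhi", "Pune"],
    "umpire1": ["U1", "U2", None, "U4"],
    "umpire3": [np.nan, np.nan, np.nan, np.nan],
})

# Number of missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Columns in which every single value is missing.
all_null_cols = null_counts[null_counts == len(df)].index.tolist()
print(all_null_cols)
```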
Removing a column
Sometimes we find that some columns in the dataset are not useful or don't contain any information. In such a scenario, we have to drop the extra columns.
In my analysis, I found that the column 'dl_applied' does not give any useful information and the column 'umpire3' is entirely null, therefore these two columns are dropped.
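Dropping the two columns can be sketched with `DataFrame.drop`. The rows below are made up, and the 'winner' column is an assumed name used only to show that other columns are kept.

```python
import pandas as pd

# Made-up rows; only 'dl_applied' and 'umpire3' are names from the article.
df = pd.DataFrame({
    "winner": ["Team A", "Team B", "Team A"],
    "dl_applied": [0, 0, 1],
    "umpire3": [None, None, None],
})

# drop() returns a new DataFrame, so assign the result back.
df = df.drop(columns=["dl_applied", "umpire3"])
print(df.columns.tolist())
```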
Correlation
Correlation is a statistical method to see how strongly variables are related to each other. The correlation chart below shows some positive and some negative values, but in our case it is not very informative, so we have to explore the relationships between variables separately.
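One way to compute a correlation matrix is sketched below. The data is made up and the column names are assumptions. Only numeric columns are selected first, because correlation is not defined for text columns.

```python
import pandas as pd

# Made-up rows; column names are illustrative assumptions.
df = pd.DataFrame({
    "season": [2008, 2009, 2010, 2011, 2012],
    "win_by_runs": [10, 0, 35, 0, 4],
    "win_by_wickets": [0, 6, 0, 8, 0],
    "winner": ["Team A", "Team B", "Team A", "Team C", "Team B"],
})

# Pairwise Pearson correlation between the numeric columns.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```

The resulting matrix can be passed to a heatmap function such as seaborn's `heatmap` to draw a chart.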
Pairplot
A pair plot draws a plot for every pair of numeric variables in the dataset, so you can see which columns carry more information and explore those variables further.
If hue is set, the points are colored according to the values of that variable.
Player of match
To see how many players won the "player of match" title, and how many times each.
Top 10 players from “Player of match” column
To see the 10 best performers recorded in the dataset.
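Counting awards per player and keeping the top 10 can be sketched with `value_counts`. The player names are placeholders, and 'player_of_match' is an assumed column name.

```python
import pandas as pd

# Placeholder data: "Player i" receives the award i times (i = 1..12).
names = [f"Player {i}" for i in range(1, 13) for _ in range(i)]
df = pd.DataFrame({"player_of_match": names})

# How many different players won the award at least once.
n_players = df["player_of_match"].nunique()

# Awards per player, sorted from most to fewest, then the first 10 entries.
awards = df["player_of_match"].value_counts()
top10 = awards.head(10)

print(n_players)
print(top10)
```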
Bar-plot of top 3 players
Plots make the data easier to understand visually. The given plot shows that "CH Gayle" has won more player-of-the-match titles than the other two players.
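A bar plot of the top 3 players could be drawn as in the sketch below. The names and counts are placeholders, not the values from the article's plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder awards; names and counts are made up.
awards = pd.Series(
    ["Player A"] * 5 + ["Player B"] * 3 + ["Player C"] * 2 + ["Player D"],
    name="player_of_match",
)
top3 = awards.value_counts().head(3)

fig, ax = plt.subplots()
ax.bar(top3.index, top3.values)           # one bar per player
ax.set_xlabel("Player")
ax.set_ylabel("Player-of-match awards")
ax.set_title("Top 3 players")
```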
Type of Match Results
Most matches have a normal result, which means one team wins and the other loses. A tie means both teams have the same score, and sometimes there is no result, which means the match was cancelled for some reason.
To see which result types are recorded in the data, count the distinct values of the result column.
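Counting result types can be sketched as below. The column name 'result' and the exact labels are assumptions based on the description above, and the counts are made up.

```python
import pandas as pd

# Made-up results; the column name and labels are assumptions.
df = pd.DataFrame({"result": ["normal"] * 6 + ["tie"] + ["no result"]})

# Number of matches per result type.
result_counts = df["result"].value_counts()
print(result_counts)
```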
Toss Winners
To see the number of tosses won by each team.
Won-by-runs
If you want to see the number of matches won by the team that batted first, count the wins that were decided by runs for each team. This shows which teams performed best and worst by the number of such wins.
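A sketch of this count is shown below, assuming the margin columns are named 'win_by_runs' and 'win_by_wickets' and that a positive 'win_by_runs' marks a win by the team batting first. The rows are made up.

```python
import pandas as pd

# Made-up rows; column names are illustrative assumptions.
df = pd.DataFrame({
    "winner": ["Team A", "Team B", "Team A", "Team C", "Team A", "Team B"],
    "win_by_runs": [10, 0, 35, 0, 4, 22],
    "win_by_wickets": [0, 6, 0, 8, 0, 0],
})

# Keep only matches decided by runs, then count wins per team.
runs_wins = df[df["win_by_runs"] > 0]
wins_batting_first = runs_wins["winner"].value_counts()
print(wins_batting_first)
```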
Plot of Win-by-runs
This plot shows the distribution of winning margins in runs. Most win-by-runs matches were won by 1 to 10 runs, and the largest winning margins are around 140 runs.
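The distribution of margins can be sketched as a histogram with 10-run bins. The margins below are made up and do not reproduce the article's plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up winning margins in runs.
margins = pd.Series([1, 4, 7, 9, 10, 15, 22, 35, 48, 140], name="win_by_runs")

fig, ax = plt.subplots()
# Bin edges at 0, 10, 20, ..., 150; counts holds the matches per bin.
counts, edges, _ = ax.hist(margins, bins=range(0, 151, 10))
ax.set_xlabel("Winning margin (runs)")
ax.set_ylabel("Number of matches")
```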
Top 5 win-by-run teams
To find the teams that won the most matches when batting first.
Win-by-Run Percentage Distribution through Pie Chart
The pie chart below shows each team's share of win-by-runs matches.
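Percentage shares and a pie chart might be produced as in this sketch. The team names and counts are placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder winners of matches decided by runs.
winners = pd.Series(["Team A"] * 5 + ["Team B"] * 3 + ["Team C"] * 2, name="winner")

# normalize=True gives fractions; multiply by 100 for percentages.
shares = winners.value_counts(normalize=True) * 100
print(shares)

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(shares.values, labels=shares.index, autopct="%1.1f%%")
ax.set_title("Share of win-by-runs matches")
```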
Win-by-Wickets
To see which teams won the most matches when batting second.
Histogram on Win-by-Wickets
Below you can see the number of matches won by teams batting second.
Here you can see the exact number of win-by-wickets matches for each team.
Top 3 teams Win-by-Wickets
These are the top 3 teams with the most win-by-wickets matches, that is, teams that won when batting second.
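The win-by-wickets counts and the top 3 teams can be sketched in the same spirit. The rows are made up, and 'win_by_wickets' is an assumed column name in which a positive value marks a win by the team batting second.

```python
import pandas as pd

# Made-up rows; column names are illustrative assumptions.
df = pd.DataFrame({
    "winner": ["Team A", "Team B", "Team C", "Team B",
               "Team D", "Team B", "Team C", "Team A"],
    "win_by_wickets": [0, 6, 7, 8, 3, 5, 4, 0],
})

# Matches decided by wickets, counted per winning team.
wicket_wins = df.loc[df["win_by_wickets"] > 0, "winner"].value_counts()

# The three teams with the most such wins.
top3 = wicket_wins.nlargest(3)
print(top3)
```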
Win-by-Wicket Percentage Distribution through Pie Chart
The pie chart below shows each team's share of win-by-wickets matches.
Year
To see the number of matches played every year.
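Matches per year can be counted as in the sketch below, using the 'season' column mentioned in the article with made-up values.

```python
import pandas as pd

# Made-up seasons, one entry per match.
df = pd.DataFrame({"season": [2008, 2008, 2009, 2010, 2010, 2010]})

# Count matches per season and order the result chronologically.
per_season = df["season"].value_counts().sort_index()
print(per_season)
```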
Note: In the dataset the year column is named 'season'. If you want to change the name, you can edit the header directly in the Excel/CSV file.
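As an alternative to editing the file, the column can also be renamed in pandas. This is a minimal sketch with made-up rows; the new name 'year' is just an example.

```python
import pandas as pd

df = pd.DataFrame({"season": [2008, 2009], "winner": ["Team A", "Team B"]})

# rename() returns a new DataFrame; the file on disk is not changed.
df = df.rename(columns={"season": "year"})
print(df.columns.tolist())
```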
Matches won in a City
To see the exact number of matches won in a city.
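One reading of this step is sketched below: the number of matches per city, and the number of wins per team within each city. The column names 'city' and 'winner' are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up rows; column names are illustrative assumptions.
df = pd.DataFrame({
    "city":   ["Mumbai", "Mumbai", "Delhi", "Mumbai", "Delhi"],
    "winner": ["Team A", "Team B", "Team A", "Team A", "Team C"],
})

# Matches with a recorded winner in each city.
matches_per_city = df["city"].value_counts()

# Wins per team within each city.
wins_by_city = df.groupby(["city", "winner"]).size()

print(matches_per_city)
print(wins_by_city)
```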
Toss-Winning V/S Match-Winning
To see if there is any relation between the toss-winning team and the match-winning team, compare the two columns for each match.
The output suggests that there is no clear relation between winning the toss and winning the match.
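A simple way to compare the two columns is to compute the share of matches in which the toss winner also won the match. The sketch below uses made-up rows and assumed column names, so its number says nothing about the real dataset.

```python
import pandas as pd

# Made-up rows; column names are illustrative assumptions.
df = pd.DataFrame({
    "toss_winner": ["Team A", "Team B", "Team A", "Team C", "Team B", "Team C"],
    "winner":      ["Team A", "Team A", "Team B", "Team C", "Team C", "Team B"],
})

# True where the same team won both the toss and the match.
same_team = df["toss_winner"] == df["winner"]

# The mean of a boolean column is the fraction of True values.
share = same_team.mean()
print(f"Toss winner also won the match in {share:.0%} of matches")
```

A share far above or below what chance would give could hint at a relation; a share near chance level is consistent with no relation.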
The main idea of EDA is to get the maximum useful information from the dataset. In this article, we have tried to look at every aspect of the data by using libraries, different charts and plots, built-in functions, and methods. We have derived really interesting information about win-by-runs, win-by-wickets, the top 3 players of the match, the number of matches played in a year, the number of matches won in a city, and the relationship between toss-winning and match-winning.
Exploratory Data Analysis live Sessions and Project Code through Omdena Pakistan Chapter
The EDA project was completed in a week and was based on six sessions. Each day had a new agenda, decided collaboratively by the entire team of the Omdena Pakistan Chapter. The 1st session was on "Introduction to NumPy & Matplotlib", the 2nd on "Introduction to Pandas & Python Dictionaries", the 3rd on "DataFrames & Aggregating Data", the 4th on "Slicing, Indexing, creating & Visualizing data", the 5th on "Joining the data", and the last on "Filter join & merging".
Below are the videos of the live sessions of "Exploratory Data Analysis":
- Introduction to NumPy & Matplotlib
- Introduction to Pandas & Python Dictionaries
- DataFrames & Aggregating Data
- Slicing, Indexing, creating & Visualizing data
- Joining the data
- Filter join & merging
Solved Project
How to contact Omdena Pakistan Chapter?
The Omdena Pakistan Chapter is highly active on social media platforms and can assist you with any issue. You can follow us on the social accounts mentioned below to stay updated about ongoing projects and workshops.
- Facebook: Omdena Pakistan Chapter
- Linkedin: Omdena Pakistan Chapter
- Github: Qasim Hassan
- Medium: Iqra Anwar