A Beginner’s Guide to Exploratory Data Analysis with Python
February 27, 2024
Introduction
What is Exploratory Data Analysis?
Exploratory Data Analysis is one of the main components of the Data Science Life Cycle; it is a technique for understanding the various aspects of the data. To perform Exploratory Data Analysis (EDA) with Python, you will need Python libraries such as Pandas, NumPy, Matplotlib, and Seaborn. Pandas is used for data exploration, while Matplotlib and Seaborn are used for plotting the dataset to get more insights from the data.
Even if you are new to the field and don’t have much practice with these libraries, you can still learn Exploratory Data Analysis (EDA) with Python from this article, because each part of the code is described clearly.
DataFrame Built-In Functions
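.head() shows the first 5 records by default.
.tail() shows the last 5 records by default.
.shape gives the shape of the dataset as (number of rows, number of columns). It is an attribute, so it is used without parentheses.
.describe() returns summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns.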
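As a minimal sketch of the four built-ins listed above, the snippet below runs them on a small made-up DataFrame. The column names and values are hypothetical and only stand in for a real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "team": ["A", "B", "C", "D", "E", "F", "G"],
    "runs": [150, 162, 139, 171, 155, 148, 180],
})

print(df.head())      # first 5 rows by default; head(3) would show 3
print(df.tail())      # last 5 rows by default
print(df.shape)       # (rows, columns); an attribute, so no parentheses
print(df.describe())  # count, mean, std, min, quartiles, max of numeric columns
```

Calling df.shape() with parentheses raises a TypeError, because shape is a plain tuple and not a method.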
Import Libraries
The first step of exploring any dataset is to import the required libraries.
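A typical import cell for the libraries named in the introduction might look like the sketch below. The aliases (np, pd, plt, sns) are the common community conventions; the exact cell used in the original notebook is not shown here, so treat this as an assumption.

```python
import numpy as np               # numerical arrays
import pandas as pd              # loading and exploring tabular data
import matplotlib.pyplot as plt  # basic plotting
import seaborn as sns            # statistical plots built on Matplotlib
```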
Upload the Dataset
Pandas is the library used to load and manipulate the data, so we will use its alias ‘pd’ to read the dataset into the Jupyter notebook.
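Loading a CSV with pandas can be sketched as follows. The file name matches.csv is an assumption, and so are the sample rows: to keep the example runnable without the real file, a few made-up rows are read from an in-memory string.

```python
import io
import pandas as pd

# Hypothetical sample rows in the style of an IPL matches file.
# With the real file you would call pd.read_csv("matches.csv") instead
# (the file name is an assumption).
csv_text = """id,season,city,winner
1,2017,City X,Team A
2,2017,City Y,Team B
3,2017,City Z,Team C
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```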
data:image/s3,"s3://crabby-images/177ee/177eed723f0082d46f05821fdfc60ada3c9030ee" alt="Dataset of IPL"
Dataset of IPL
.shape
The shape of the data shows the number of records (636) and the number of columns (18) in the dataset.
data:image/s3,"s3://crabby-images/14ee0/14ee094e3e9dcf7dfb2caa9897fc531e0773beab" alt="Shape of Dataset"
Shape of Dataset
.describe()
This is how .describe() summarizes the numeric columns of the data.
data:image/s3,"s3://crabby-images/b47cd/b47cd232e4a76e5d054974436d200e3bd9c5c912" alt="Details of Column"
Details of Column
Null value
To check whether there are any null values in the data, you can use the line of code shown in the picture below.
The output below also shows that ‘umpire3’ is entirely null, so we can remove such columns.
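One common way to count nulls per column is isnull().sum(). The sketch below uses made-up rows; only the column name ‘umpire3’ comes from the article, the other names and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; 'umpire3' is entirely null, as in the article.
df = pd.DataFrame({
    "city": ["Pune", None, "Delhi"],
    "umpire1": ["X", "Y", "Z"],
    "umpire3": [np.nan, np.nan, np.nan],
})

null_counts = df.isnull().sum()  # number of missing values per column
print(null_counts)

# Columns where every single row is missing
all_null = null_counts[null_counts == len(df)].index.tolist()
print(all_null)
```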
data:image/s3,"s3://crabby-images/ad80c/ad80c9393f8e35f3520464fc5da4dd2e9a71e46c" alt="Null Values"
Null Values
Removing a column
Sometimes we find columns in the dataset that are not useful or don’t contain any information. In such a scenario, we have to drop the extra columns.
According to my analysis, the column ‘dl_applied’ does not give any useful information and the column ‘umpire3’ is entirely null; therefore, these two columns are dropped.
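Dropping the two columns named above can be sketched with DataFrame.drop. The sample values are made up; only the column names ‘dl_applied’, ‘umpire3’, and ‘season’ appear in the article.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the two columns the article drops.
df = pd.DataFrame({
    "season": [2016, 2017],
    "dl_applied": [0, 0],
    "umpire3": [np.nan, np.nan],
})

# drop returns a new DataFrame, so assign the result back
df = df.drop(columns=["dl_applied", "umpire3"])
print(df.columns.tolist())
```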
data:image/s3,"s3://crabby-images/de570/de57024ba7741d31749284be942f285f6aaef556" alt="Dropping Columns"
Dropping Columns
Correlation
Correlation is a statistical method to see how strongly variables are related to each other. The correlation chart below shows some positive and some negative values, but in our case it is not very informative, so we have to examine the relationships between variables separately.
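A correlation matrix can be computed with DataFrame.corr on the numeric columns. This is only a sketch on made-up numbers: the column names win_by_runs, win_by_wickets, and winner are assumptions about the dataset, and the exact chart used in the original notebook is not known.

```python
import pandas as pd

# Hypothetical rows; column names other than 'season' are assumptions.
df = pd.DataFrame({
    "season": [2015, 2015, 2016, 2016, 2017],
    "win_by_runs": [10, 0, 25, 0, 4],
    "win_by_wickets": [0, 6, 0, 8, 0],
    "winner": ["A", "B", "A", "C", "B"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corr = df.select_dtypes("number").corr()
print(corr.round(2))
```

Values close to +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.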
data:image/s3,"s3://crabby-images/3619a/3619a8b9c818ba23f60f2754cda1bb0db96c68bc" alt="Correlation"
Correlation
Pairplot
A pair plot draws a plot for every pair of numeric variables in the dataset, with each variable’s own distribution on the diagonal, so you can see which columns carry more information and explore those variables further.
data:image/s3,"s3://crabby-images/115c1/115c16c24d44a548f59cfa8b24093b852388455d" alt="Pair-plot"
Pair-plot
If hue is set to a column, the points are colored by that column’s values, so each variable’s information is shown separately for every category.
data:image/s3,"s3://crabby-images/83cd6/83cd6b356dd7fdbd1b7053e51f3c3de687aa436a" alt="Season’s pair plot"
Season’s pair plot
Player of match
To see how many players won the title “player-of-match” and how many times each of them won it.
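Counting titles per player can be sketched with value_counts, and the number of distinct winners with nunique. The column name player_of_match and the player labels are assumptions.

```python
import pandas as pd

# Hypothetical award winners; the column name is an assumption.
df = pd.DataFrame({"player_of_match": ["P1", "P2", "P1", "P3", "P1", "P2"]})

counts = df["player_of_match"].value_counts()  # titles per player, largest first
n_players = df["player_of_match"].nunique()    # how many different players won

print(counts)
print(n_players)
```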
data:image/s3,"s3://crabby-images/07c82/07c826c1de711ad3decdc6322e080fa9c2edf32a" alt="Player of the match"
Player of the match
Top 10 players from “Player of match” column
To see the 10 best performers recorded in the dataset.
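Because value_counts already sorts from most to fewest, taking the first 10 entries gives the top 10. In this sketch the data is generated so that player P11 has the most titles; the column name player_of_match is an assumption.

```python
import pandas as pd

# Hypothetical data: player "P<i>" wins the title i + 1 times.
names = [f"P{i}" for i in range(12) for _ in range(i + 1)]
df = pd.DataFrame({"player_of_match": names})

top10 = df["player_of_match"].value_counts().head(10)
print(top10)
```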
data:image/s3,"s3://crabby-images/8fd0b/8fd0b315a1fb4cc4de64aa82d68254883d7ba4b3" alt="Player-of-match"
Player-of-match
Bar-plot of top 3 players
Plots make the data easier to understand visually. The plot below shows that “CH Gayle” has won more player-of-the-match titles than the other two players.
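A bar plot of the three most frequent winners might be drawn as below. This is a sketch with made-up players, not the original plotting code; the column name player_of_match is an assumption.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical award winners; the column name is an assumption.
df = pd.DataFrame(
    {"player_of_match": ["P1"] * 5 + ["P2"] * 4 + ["P3"] * 3 + ["P4"]}
)

top3 = df["player_of_match"].value_counts().head(3)

fig, ax = plt.subplots()
ax.bar(top3.index, top3.values)  # one bar per player
ax.set_xlabel("Player")
ax.set_ylabel("Player-of-the-match titles")
ax.set_title("Top 3 players")
```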
data:image/s3,"s3://crabby-images/d641c/d641c401241e9a3e52753a9a2637da6cbd812ee8" alt="Bar-plot of 3 Best Players"
Bar-plot of 3 Best Players
Type of Match Results
Matches are mostly normal, which means one team wins and the other loses. A tie means both teams have the same score, and sometimes there is no result, which means the match was cancelled for some reason.
To see what types of match results have been recorded in the data, we will use the following line of code.
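Counting the result types can be sketched as follows. The column name result and the labels "normal", "tie", and "no result" are assumptions about how the dataset encodes the three outcomes described above.

```python
import pandas as pd

# Hypothetical rows; column name and labels are assumptions.
df = pd.DataFrame(
    {"result": ["normal", "normal", "tie", "normal", "no result"]}
)

result_counts = df["result"].value_counts()
print(result_counts)
```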
data:image/s3,"s3://crabby-images/1d875/1d8756c91478d68592c8947ba7a53e21d00ea01a" alt="Type of Match Results"
Type of Match Results
Toss Winners
To see the number of tosses won by each team.
data:image/s3,"s3://crabby-images/7bfe7/7bfe733892690a1e77cbcb8f81514e7c8511d012" alt="Toss Winners"
Toss Winners
Won-by-runs
To see the number of matches won by the team that batted first, you can use the following line of code. It shows the best-performing and worst-performing teams by the number of matches each team has won.
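One way to select wins by the team batting first is to keep the rows where the run margin is greater than zero and then count winners. The column names winner, win_by_runs, and win_by_wickets are assumptions, and the rows are made up.

```python
import pandas as pd

# Hypothetical rows; a margin of 0 runs means the match was won by wickets.
df = pd.DataFrame({
    "winner": ["A", "B", "A", "C", "A", "B"],
    "win_by_runs": [12, 0, 40, 7, 0, 3],
    "win_by_wickets": [0, 5, 0, 0, 8, 0],
})

won_by_runs = df[df["win_by_runs"] > 0]              # team batting first won
wins_per_team = won_by_runs["winner"].value_counts()

print(len(won_by_runs))
print(wins_per_team)
```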
data:image/s3,"s3://crabby-images/14157/14157a839592ce1f28e5726b3d738bdb37903c7a" alt="Won-by-Run"
Won-by-Run
Plot of Win-by-runs
This plot shows the distribution of winning margins in runs. From the plot you can see that most win-by-run matches were won by 1 to 10 runs, and that the largest win-by-run margin is 140 runs.
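A histogram of run margins could be drawn as in this sketch. The margins are made up, the column name win_by_runs is an assumption, and the bin width of 10 runs is a choice made here to match the “1 to 10 runs” reading of the plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical margins; 0 means the match was not won by runs.
df = pd.DataFrame({"win_by_runs": [0, 4, 9, 0, 15, 22, 7, 0, 64, 3]})

margins = df.loc[df["win_by_runs"] > 0, "win_by_runs"]

fig, ax = plt.subplots()
# Bins of width 10: 0-10, 10-20, ..., 60-70
counts, edges, _ = ax.hist(margins, bins=range(0, 71, 10))
ax.set_xlabel("Winning margin (runs)")
ax.set_ylabel("Number of matches")
```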
data:image/s3,"s3://crabby-images/9ced8/9ced893552796367abbda4ebb945009eb632b829" alt="Win-by-Runs"
Win-by-Runs
Top 5 win-by-run teams
To see the teams that won the most matches when batting first.
data:image/s3,"s3://crabby-images/0824e/0824e67c7bf8582ab6b41586103254aa598cc7d2" alt="Teams win-by-run"
Teams win-by-run
Win-by-Run Percentage Distribution through Pie Chart
The pie chart below shows the win percentage of every team.
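A pie chart of each team’s share of win-by-run matches can be sketched with Matplotlib’s pie, where autopct prints the percentage on each wedge. The rows are made up and the column names winner and win_by_runs are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical rows; column names are assumptions.
df = pd.DataFrame({
    "winner": ["A", "A", "B", "C", "B"],
    "win_by_runs": [20, 5, 11, 3, 0],
})

wins = df.loc[df["win_by_runs"] > 0, "winner"].value_counts()
shares = (wins / wins.sum() * 100).round(1)  # percentage per team
print(shares)

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(wins, labels=wins.index, autopct="%1.1f%%")
ax.set_title("Share of win-by-run matches")
```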
data:image/s3,"s3://crabby-images/48cc8/48cc886147a0271bdaa5748638d07a97629757b3" alt="Pie Chart of Win-by-Run"
Pie Chart of Win-by-Run
Win-by-Wickets
To see which teams have won the most matches when batting second.
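Wins by the team batting second can be selected the same way as wins by runs, this time keeping rows with a wicket margin above zero. The sketch uses DataFrame.query for the filter; the column names and rows are assumptions.

```python
import pandas as pd

# Hypothetical rows; a margin of 0 wickets means the match was won by runs.
df = pd.DataFrame({
    "winner": ["A", "B", "B", "C", "B", "A"],
    "win_by_runs": [15, 0, 0, 0, 0, 9],
    "win_by_wickets": [0, 7, 5, 3, 8, 0],
})

won_by_wickets = df.query("win_by_wickets > 0")  # team batting second won
wins_chasing = won_by_wickets["winner"].value_counts()

print(wins_chasing)
```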
data:image/s3,"s3://crabby-images/e1969/e19693a8904cf3034caf7f5cdd25c773cff9f40c" alt="win-by-wicket"
win-by-wicket
Histogram on Win-by-Wickets
Below you can see the number of matches won by the team that batted second.
Here you can see the exact number of matches for win-by-wickets.
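One reading of this step is a count of matches for each wicket margin, shown both as exact numbers and as bars. The sketch below follows that reading on made-up data; the column name win_by_wickets is an assumption.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical margins; 0 means the match was not won by wickets.
df = pd.DataFrame({"win_by_wickets": [0, 6, 7, 0, 6, 9, 4, 6, 0, 7]})

chasing = df[df["win_by_wickets"] > 0]

# Exact number of matches for each wicket margin, ordered by margin
by_margin = chasing["win_by_wickets"].value_counts().sort_index()
print(by_margin)

fig, ax = plt.subplots()
ax.bar(by_margin.index, by_margin.values)
ax.set_xlabel("Winning margin (wickets)")
ax.set_ylabel("Number of matches")
```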
Top 3 teams Win-by-Wickets
These are the top 3 teams with the most win-by-wicket matches, that is, matches won by the team batting second.
Win-by-Wicket Percentage Distribution through Pie Chart
The pie chart below shows the win percentage of every team.
Year
To see the number of matches played every year.
Note: In the dataset the year column is named ‘season’. If you want to change the name, this can be done directly in the Excel/CSV file.
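Matches per year can be counted from the ‘season’ column and sorted by year. As an alternative to editing the file, pandas can also rename the column in code; the new name year is just an example. The rows are made up.

```python
import pandas as pd

# Hypothetical rows; 'season' holds the year of each match.
df = pd.DataFrame({"season": [2015, 2016, 2016, 2017, 2017, 2017]})

# Count matches per season, then order by year instead of by count
matches_per_year = df["season"].value_counts().sort_index()
print(matches_per_year)

# Optional: rename the column in pandas rather than in the file
renamed = df.rename(columns={"season": "year"})
print(renamed.columns.tolist())
```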
Matches won in a City
To see the exact number of matches won in each city.
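A sketch of counting wins per city: grouping by city and counting the non-null winners leaves out matches that had no winner. The column names city and winner are assumptions, and the rows are made up.

```python
import pandas as pd

# Hypothetical rows; None marks a match with no winner.
df = pd.DataFrame({
    "city": ["Mumbai", "Mumbai", "Delhi", "Delhi", "Pune"],
    "winner": ["A", "B", None, "A", "C"],
})

# count() ignores missing values, so only matches with a winner are counted
wins_by_city = (
    df.groupby("city")["winner"].count().sort_values(ascending=False)
)
print(wins_by_city)
```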
Toss-Winning V/S Match-Winning
To see whether there is any relation between the toss-winning team and the match-winning team, compare the two columns.
The output suggests that there is no relation between toss-winning and match-winning.
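One simple way to compare toss winners with match winners is to check, row by row, whether the two columns agree and then take the share of matches where they do. The column names toss_winner and winner are assumptions, and the rows are made up, so the number printed here says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows; column names are assumptions.
df = pd.DataFrame({
    "toss_winner": ["A", "B", "A", "C", "B", "A"],
    "winner":      ["A", "A", "B", "C", "B", "C"],
})

same_team = df["toss_winner"] == df["winner"]  # True when the toss winner also won
share = same_team.mean()                       # fraction of such matches

print(same_team.sum(), "of", len(df), "matches")
print(round(share * 100, 1), "%")
```

A share close to 50% would be consistent with the toss having little influence on the result.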
The main idea of EDA is to get the maximum useful information from the dataset. In this article, we have tried to look at every aspect of the data by using libraries, different charts and plots, built-in functions, and methods. We have derived some really interesting information about win-by-run, win-by-wicket, the top 3 players of the match, the number of matches won in a year, the number of matches won in a city, and the relationship between toss-winning and match-winning.
Exploratory Data Analysis live Sessions and Project Code through Omdena Pakistan Chapter
The EDA project was completed in a week and was based on six sessions. Every day there was a new agenda, decided collaboratively by the entire team of the Omdena Pakistan Chapter. The 1st session was on “Introduction to NumPy & Matplotlib”, the 2nd session was on “Introduction to Pandas & Python Dictionaries”, the 3rd session was on “DataFrames & Aggregating Data”, the 4th session was on “Slicing, Indexing, creating & Visualizing data”, the 5th session was on “Joining the data”, and the last was on “Filter join & merging”.
Below are the videos of the live sessions of “Exploratory Data Analysis”:
- Introduction to NumPy & Matplotlib
- Introduction to Pandas & Python Dictionaries
- DataFrames & Aggregating Data
- Slicing, Indexing, creating & Visualizing data
- Joining the data
- Filter join & merging
Solved Project
How to contact Omdena Pakistan Chapter?
The Omdena Pakistan Chapter is highly active on social media platforms to assist you with any issue. You can follow us on the social accounts mentioned below to stay updated about ongoing projects and workshops.
- Facebook: Omdena Pakistan Chapter
- Linkedin: Omdena Pakistan Chapter
- Github: Qasim Hassan
- Medium: Iqra Anwar