Top 10 GitHub Data Science Projects with Source Code in Python
June 15, 2022
Authors: Amber Arif, Iqra Anwar
Peer reviewed by: Anna Koraleva
Introduction
Omdena’s articles and sessions usually cover programming, automation, and real-life solutions. This article takes a slightly different angle: it collects GitHub data science projects at beginner and advanced levels. It is therefore useful for experienced Data Scientists, for newcomers who have just stepped into the field, and for anyone who wants to pursue a career in Data Science.
What is GitHub?
What could be more helpful to you as a developer than keeping track of the different versions of your code and configuration files? You might have already heard of version control systems in the context of managing configuration files or maintaining the source code of programs and scripts. Here, GitHub comes in handy as it allows you to easily roll back when mistakes happen and also helps you collaborate with other stakeholders. That’s why it’s the tool that Data Scientists use for their Data Science Projects.
GitHub portfolios give companies an idea of what projects you’ve worked on and what kind of code you can write, which helps you in professional interviews. You will also learn something new each time you contribute to open-source GitHub projects.
Top 10 GitHub Data Science projects with source code
GitHub is a great place to work on a Data Science project. Below is the list of Data Science projects you can work on! You can also enroll in courses from Omdena School for additional information and practical applications.
5 Data science projects on GitHub for beginners
This section will present a collection of data science project ideas for beginners and newbies in Python and Data Science. These Python data science projects will help you build a strong foundation in Data Science.
1. HARVESTIFY
A country’s economy often depends heavily on agriculture, and agricultural output is directly related to the quantity and quality of crops. This project is a machine-learning-based website that tells farmers:
- The best crops to grow,
- The best fertilizers to use,
- And which diseases have affected their crops.
The following are the functions that this website performs.
- Crop Recommendation – When the user provides the soil details, the application predicts which crop they should grow.
- Fertilizer Recommendation – When the user inputs the soil data and the type of crop they are growing, the application recommends improvements by determining what the soil lacks or has an excess of.
- Plant Disease Prediction – When the user uploads an image of a diseased plant leaf, the application predicts the type of disease and displays the result along with a little background about the disease and suggestions to cure it.
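The fertilizer recommendation can be pictured as a comparison between the measured soil nutrients and the levels a crop is assumed to need. The sketch below is only an illustration of that idea, not the project's actual code: the `IDEAL_NPK` table, its values, the `tolerance` parameter, and the function name are all hypothetical.

```python
# Hypothetical target levels of nitrogen (N), phosphorus (P), and potassium (K)
# per crop. Placeholder numbers for illustration, not agronomic guidance.
IDEAL_NPK = {
    "rice": {"N": 80, "P": 40, "K": 40},
    "maize": {"N": 80, "P": 40, "K": 20},
}


def fertilizer_advice(crop, soil, tolerance=5):
    """Report which nutrients the soil lacks or has in excess for a crop."""
    advice = {}
    for nutrient, target in IDEAL_NPK[crop].items():
        diff = soil[nutrient] - target
        if diff < -tolerance:
            advice[nutrient] = "low"    # soil lacks this nutrient
        elif diff > tolerance:
            advice[nutrient] = "high"   # soil has an excess of this nutrient
        else:
            advice[nutrient] = "ok"
    return advice


result = fertilizer_advice("rice", {"N": 50, "P": 42, "K": 60})
print(result)  # {'N': 'low', 'P': 'ok', 'K': 'high'}
```

A real system would likely learn or look up these targets from agronomic data rather than hard-code them.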
You can learn more about the project using this link.
2. Google Play Store apps and reviews
There are thousands of mobile apps available that are easy to use and can be profitable, so an increasing number of applications are being produced. This notebook compares over ten thousand apps on Google Play across different categories to analyze the Android app market. The analysis looks for patterns in the data that can inform growth and retention strategies.
The data set for this beginner Data science project contains two files:
1. A CSV file containing the details of the applications on Google Play, with 13 features describing each app.
2. A CSV file containing 100 reviews for each app, most helpful first.
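A first pass over the apps file might count apps and average their ratings per category. The sketch below uses only the standard library and a tiny made-up sample in place of the real CSV. The column names `App`, `Category`, and `Rating` are assumptions about the file's layout, and `rating_by_category` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample standing in for the real apps CSV (assumed columns).
SAMPLE = """App,Category,Rating
App A,GAME,4.5
App B,GAME,3.5
App C,TOOLS,4.0
App D,TOOLS,
"""


def rating_by_category(csv_file):
    """Return {category: (app_count, mean_rating)}; missing ratings are skipped in the mean."""
    counts = defaultdict(int)
    ratings = defaultdict(list)
    for row in csv.DictReader(csv_file):
        counts[row["Category"]] += 1
        if row["Rating"]:  # an empty string means the rating is missing
            ratings[row["Category"]].append(float(row["Rating"]))
    return {
        cat: (counts[cat], sum(ratings[cat]) / len(ratings[cat]) if ratings[cat] else None)
        for cat in counts
    }


summary = rating_by_category(io.StringIO(SAMPLE))
print(summary)
```

With the real file, the same function could be called on an open file handle, for example `with open(path, newline="") as f: rating_by_category(f)`.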
Source code: The Android App Market on Google Play | aiwithqasim/datascience-projects (github.com)
3. Diabetes Prediction Application
Diabetes is a serious health issue that is becoming more common, in part because of inactive lifestyles. It can be managed with proper medical treatment if it is detected in time; otherwise, it can lead to serious complications. Machine learning can support the early detection of diabetes.
This project uses a predictive model to predict whether the person is diabetic or not based on various factors such as:
- Insulin Level
- Age
- Pregnancies
- BMI (Body Mass Index)
Some of the objectives of this project are:
- Data gathering
- Descriptive Analysis
- Data preprocessing and visualization
- Data Modeling
- Model evaluation and deployment
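For the model evaluation objective, accuracy alone can be misleading when positive cases are rare, so precision and recall are commonly reported alongside it. The following is a minimal sketch with hand-written labels; the labels and the `evaluate` helper are hypothetical and do not come from the project's dataset or code.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = diabetic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical ground truth and model predictions for eight patients.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

metrics = evaluate(y_true, y_pred)
print(metrics)
```

In a screening setting, recall is often watched closely because a missed diabetic patient (a false negative) is usually the costlier error.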
You can learn more about the project using this link.
4. Exploring Bitcoin. Cryptocurrencies
Hundreds of similar initiatives based on blockchain technology have been developed since the debut of Bitcoin in 2008. These are referred to as cryptocurrencies (also coins or cryptos in Internet slang).
Some are now exceedingly valuable, while others have the potential to become incredibly valuable in the future. Indeed, as of December 6th, 2017, Bitcoin had a market capitalization of more than $200 billion.
WARNING: The cryptocurrency market is very volatile, and any money you invest might vanish overnight. Cryptocurrencies featured here might have various problems (overvaluation, technical, etc.). Please don’t misinterpret this as investment advice.
Source code: Exploring-the-Bitcoin-Cryptocurrency-Mark | IqraBaluch/datascience-projects (github.com)
5. Naïve Bees Species Prediction
Can a machine spot the difference between a honeybee and a bumble bee? These bees have distinct behaviors and looks, but devices may struggle to distinguish them due to the wide range of backdrops, postures, and picture resolutions.
The ability to recognize bee species from photographs is a job that will help researchers to gather field data more swiftly and efficiently in the future. Pollinating bees play an essential role in ecology and agriculture, and illnesses such as colony collapse disorder threaten their survival. We can better understand the frequency and expansion of these crucial insects by identifying different kinds of bees in the wild.
The notebook walks you through importing and pre-processing photographs, then building a model that can automatically distinguish honeybees from bumblebees.
Source code: Naive Bees Species Prediction | IqraBaluch/datascience-projects (github.com)
5 Advanced Data science projects on GitHub
Working on Advanced Data Science projects is the ideal method to develop your portfolio from the ground up and launch your own Data Science initiatives. Here are some Data science projects you should start with:
1. Detection of Rotten Fruits (DRF) Using Image Processing in Python
When it comes to fruits and vegetables, consumers prefer fresh produce over decaying produce, so an effective way to detect rotten fruit is valuable. Using Artificial Intelligence (AI) and Computer Vision, a desktop program called “Detection of Rotten Fruits (DRF)” helps farmers and fruit vendors detect diseased fruit early.
2. Real-Time Hand Tracking Project | MediaPipe
MediaPipe is a relatively new technology: Google’s open-source framework for media processing. It is cross-platform and can run on Android, iOS, the web, and YouTube servers.
You can learn more about the uses of MediaPipe using this link.
Below is the link to the source code of this project.
3. Transaction Fraud Detection
This project predicts whether a transaction is a fraud or not using a machine learning model.
The steps involved in this project are listed below.
- Data Description – First, the data is collected and pre-processed. Then descriptive statistics such as the median, mode, standard deviation, and skewness are computed.
- Feature Engineering – A mind map guides the creation of new features, which improves the exploratory data analysis.
- Data Filtering – In this step, columns and rows that are not relevant to the business problem are removed.
- Exploratory Data Analysis – This step involves univariate, bivariate, and multivariate analysis to understand the dataset.
- Data Preparation – In this step, the data is prepared and transformed for machine learning modeling by encoding, oversampling, and rescaling.
- Feature Selection – This step reduces the dimensionality of the dataset to limit model overfitting.
- Machine Learning Modeling – This step trains the machine learning algorithms so they can accurately predict whether a transaction is fraudulent.
- Hyperparameter Fine Tuning – The hyperparameters are fine-tuned to improve the model’s performance and overall score.
- Conclusions – In this step, the model is tested on unseen data and its performance is analyzed.
- Model Deployment – This step involves creating the Flask API and saving the model and the functions to be implemented in the API.
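The Data Description and Data Preparation steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: the transaction amounts and labels are made up, the helper names are hypothetical, and the oversampling shown is plain random duplication of the minority class (the project may use a different technique).

```python
import random
import statistics


def describe(values):
    """Median, mode, standard deviation, and skewness of a numeric column."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    skew = sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)
    return {
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std": std,
        "skewness": skew,
    }


def min_max_rescale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until both classes are the same size.

    Assumes binary labels 0/1 and at least one row of each class.
    """
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    extra = len(by_class[1 - minority]) - len(by_class[minority])
    by_class[minority] += [rng.choice(by_class[minority]) for _ in range(extra)]
    new_rows = by_class[0] + by_class[1]
    new_labels = [0] * len(by_class[0]) + [1] * len(by_class[1])
    return new_rows, new_labels


# Made-up transaction amounts; label 1 marks the single fraudulent one.
amounts = [10.0, 12.0, 12.0, 15.0, 500.0]
labels = [0, 0, 0, 0, 1]

stats = describe(amounts)
scaled = min_max_rescale(amounts)
new_rows, new_labels = random_oversample(scaled, labels)
print(stats)
print(new_labels)
```

Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the data used to test the model.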
You can learn more about the project using this link.
4. Heart Stroke Prediction
This project builds an application that predicts the probability that a person will have a stroke or heart failure. The user enters the necessary personal and health information on the medical device. The model uses this information to predict the heart failure probability, and the application displays a detailed result about the patient’s status. It also suggests possible precautions and advises the user about visiting a health professional.
The potential users of the application are:
- Clinics/ hospitals
- Medical professionals
- Medical devices
Further, the application uses data ingestion to collect the data based on the user inputs. It also uses pipeline retraining to retrain the model to make it even more accurate, so it can correctly predict heart failure.
You can learn more about the project using this link.
5. Facemask Detection | Github Data Science Project
Face Mask Detection uses an artificial neural network to determine whether or not a person is wearing a mask. To detect people without masks, the software can be connected to any existing or new IP cameras.
Users of the app may also add faces and phone numbers to get notifications if the people around them are not wearing a mask. A notification can be issued to the administrator if the camera records an unidentified face.
Source code: Facemask Detection: Facemask detection using MobileNet (github.com)
Conclusion
These are some Data Science projects on GitHub that you may replicate to improve your Data Science abilities in the real world. The more time and effort you put into Data Science projects, the better at model building you will become.
FAQs
Q. How to use GitHub for Data Science projects?
Data scientists use GitHub for collaboration, making changes to projects “safely,” tracking changes over time, and rolling them back if needed. Traditionally, data scientists were not required to use GitHub because the process of bringing algorithms into production (where version control is essential) was generally delegated to engineering or other technical staff.
However, solutions like H2O.ai and Google Cloud AI Platform make it much easier for data scientists to write code that deploys models into production and to contribute to open source projects. As a result, knowing how to use version control is becoming increasingly vital for data scientists.
You can use GitHub for Data Science projects by following the steps below.
- Start with the master branch and create a new branch using the commands below.
git checkout master
git pull
git checkout -b branch-name
- Update, Add, Commit, and Push your changes to the remote repository using the commands below.
git status
git add <your-files>
git commit -m 'your message'
git push -u origin branch-name
- Create a Pull request and make changes to the Pull request using the commands below.
git status
git add <your-files>
git commit -m 'your message'
git push
Q. How to create a data science portfolio on GitHub?
Once you create your account on GitHub, start working on beginner projects. After completing some of these, move on to advanced projects, and then start contributing to open source projects. This is how you will learn and build your Data Science portfolio on GitHub.
Q. How to contribute to open-source projects?
As an aspiring data scientist, you would stand out in the open-source community by contributing to numerous projects. It allows you to enhance your abilities while also receiving inspiration and encouragement from like-minded people.
Once you’ve chosen a project you want to contribute to, do some screening. Check the following to make sure you’ll enjoy working on it:
1. Check the time of the last commit. This will tell you whether or not the maintainers are active, as well as how long it will take them to reply to your contribution.
2. Look at the number of people who have contributed.
3. Examine how frequently people make commits.
It’s a positive indicator if you observe a lot of recent activity because it suggests the community and maintainers are both engaged.
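The three checks above can be turned into a small summary once you have a project's commit history. The sketch below assumes the history is already available as a list of `(author, timestamp)` pairs, for example exported from `git log`. The sample records, the 90-day window, and the `activity_summary` helper are hypothetical.

```python
from datetime import datetime, timedelta


def activity_summary(commits, now, window_days=90):
    """Summarize repository activity from (author, timestamp) commit records."""
    latest = max(ts for _, ts in commits)
    cutoff = now - timedelta(days=window_days)
    recent = [ts for _, ts in commits if ts >= cutoff]
    return {
        # 1. How long ago was the last commit?
        "days_since_last_commit": (now - latest).days,
        # 2. How many distinct people have contributed?
        "contributors": len({author for author, _ in commits}),
        # 3. How often do people commit (average over the recent window)?
        "commits_per_week": len(recent) / (window_days / 7),
    }


# Hypothetical commit history for illustration.
now = datetime(2022, 6, 15)
commits = [
    ("alice", datetime(2022, 6, 10)),
    ("bob", datetime(2022, 6, 1)),
    ("alice", datetime(2022, 5, 20)),
    ("carol", datetime(2021, 12, 1)),
]

summary = activity_summary(commits, now)
print(summary)
```

A low number of days since the last commit, combined with several contributors and a steady weekly commit rate, matches the "lot of recent activity" signal described above.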
If you are planning to step into the field of Data Science and want to get Data Science Education, you can instantly get in touch with the Omdena team through our social media platforms. Our social media team is always active and shares Data science-related projects and updates of sessions. Below are our social media links. Get connected and stay updated.
- Facebook: Omdena
- LinkedIn: Omdena
- Website: Omdena School
You can also get enrolled in Omdena School. Omdena School’s objective is to provide quality education in the field of Data Science, Machine Learning, and Artificial Intelligence while addressing financial and geographic limitations.
Ready to test your skills?
If you’re interested in collaborating, apply to join an Omdena project at: https://www.omdena.com/projects