11 proven Do’s and Don’ts you need to know to build your data science portfolio (+ How to create a LinkedIn Page to showcase your portfolio)
Authors: Juber Rahman, Senior Data Scientist at FedEx, and Rosana de Oliveira Gomes
Standing out from the data science crowd in 2021 is more important than ever. While the demand for data scientists is still rising (though more slowly than in previous years), you need to stay on top of trends such as the surge in demand for data engineering skills and relevant work experience. This means you need to be smart about how you build your portfolio, and that’s exactly what this article will show you.
The importance of a data science portfolio
There is a famous saying: don’t tell me, show me. Even if you have a referral, being able to show potential employers what you can do, instead of just telling them, makes all the difference.
Your portfolio should demonstrate your potential as a data scientist as well as your most important and sought-after skills. A data scientist should be able to model a real-world phenomenon in mathematical terms (e.g. challenges with autonomous vehicles, protecting children online, modeling the behavior of desert locusts), tackle complex problems by following a standard methodology, and develop a solution from scratch. Your portfolio should bear the footprint of your capabilities.
What is changing in data science roles
Keep in mind that industry is not looking for amateur data scientists who can only play around with a few models when handed a cleaned, curated, and well-fitted dataset. Data science has advanced a lot, and familiarity with scikit-learn alone (a great API, though) may not suffice, mainly because so much of today’s data is unstructured (text, images). Deploying models to score real-time streaming data has gained an edge over batch scoring. Familiarity with cloud-based machine learning, offered as a service or as a platform, is almost mandatory.
After looking at your GitHub profile, a hiring manager may decide to call you for an interview or may lose interest altogether.
I recently got a job at FedEx, the biggest cargo airline, as a Sr. Data Scientist. My experience with real-world projects at Omdena helped me get the interview. During the interview, I had to present a project and answer questions that cannot be learned from studying books or reading blogs, but only by doing the work yourself, facing similar problems, and developing a solution tailored to the dataset at hand.
Here are some rules and guidelines that will help you stand out from the crowd.
Building your portfolio
#1 Be picky about the type of projects
Yes! Don’t just do what everyone else is doing like jumping into the Kaggle Titanic competition.
In your portfolio, focus on projects that have substantial value and a real-world connection, e.g. improving a business or making a difference in society. Look for both breadth and depth. A problem that has breadth is usually cross-disciplinary, requiring collaboration among domain experts, specialists on the topic, data scientists, data engineers, and software engineers. Describe the steps you took in developing the dataset, validating the annotations, and establishing the ground truth. Depth is indicative of the modeling complexity and time commitment required to solve the problem. A project that takes months is valued much more highly than one that takes a few hours.
#2 Advance your skills on messy data
It is better to pick a dataset with a diversity of data types, such as a mix of numerical and categorical features, along with plenty of missing values, outliers, and a distribution far from Gaussian, since most real-world datasets suffer from these issues. Still, everybody knows that working with tabular data is relatively easy. In industry, most data is unstructured, i.e. text or images. To stand out, your GitHub repo should have at least one advanced project requiring deep learning and one requiring Natural Language Processing (NLP) or Computer Vision (CV) skills. Some speculate that the next decade will be the decade of NLP, with plenty of hype around it. Another important skill to include in your portfolio is the ability to work with big data. An easy way to show this is to rewrite a project using PySpark instead of scikit-learn and build the models with the packages available in Spark.
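As a rough illustration of the kind of cleaning that messy data calls for, here is a minimal, dependency-free sketch of median imputation, outlier clipping, and one-hot encoding. The function names, the toy columns, and the choice of the IQR rule are assumptions made for this example; in a real project you would more likely reach for pandas, scikit-learn, or PySpark.

```python
import math
import statistics


def impute_median(values):
    """Replace missing entries (None or NaN) with the median of the observed values."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    median = statistics.median(observed)
    return [median if v is None or math.isnan(v) else v for v in values]


def clip_outliers_iqr(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def one_hot(categories):
    """One-hot encode a categorical column; missing values get their own 'missing' level."""
    filled = ["missing" if c is None else c for c in categories]
    levels = sorted(set(filled))
    rows = [[1 if c == level else 0 for level in levels] for c in filled]
    return levels, rows


# Hypothetical toy columns: one numerical with gaps and an outlier, one categorical.
ages = [23, 31, None, 27, 29, 410, 35, float("nan"), 26]
clean_ages = clip_outliers_iqr(impute_median(ages))
levels, encoded = one_hot(["air", None, "ground", "air"])
```

Imputing before clipping is a design choice here, not a rule: with many missing values, the imputed median shifts the quartiles, so it is worth checking both orders on your own data.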
#3 Demonstrate you can build models for deployment
Just like DevOps, MLOps is getting extremely popular. While most model development is backend work, deployment needs some frontend skills. For example, you may integrate your model with a Streamlit app, a Flask API, or a Bokeh dashboard. It is even better if you can dockerize your scripts and make them suitable for integration into microservices. Recruiters are very interested to know whether you have developed something that has been deployed and is in use by an organization or a commercial entity. If you have not, a workaround is to state that you have deployment experience and that the code is available in your portfolio as an open-source project. It also helps to learn how to deploy a model as a web app in a cloud environment, e.g. Microsoft Azure or Amazon Web Services (AWS), and to include that work in your portfolio. Show how to create an endpoint to consume the model.
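To make the idea of an endpoint that consumes a model concrete, here is a minimal sketch that uses only the Python standard library. The model is a hypothetical stand-in (fixed weights in `WEIGHTS` and `BIAS`), and the feature names, the `/predict` route, and the port are assumptions for illustration. In practice you would load a trained model and would likely use Flask, Streamlit, or a managed cloud endpoint instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: fixed weights for a linear score.
WEIGHTS = {"distance_km": 0.02, "weight_kg": 0.5}
BIAS = 1.0


def predict(features):
    """Score one record; a real app would call a loaded model here."""
    return BIAS + sum(WEIGHTS[name] * float(features[name]) for name in WEIGHTS)


def score_payload(raw_body):
    """Turn a raw JSON request body into an (HTTP status, response dict) pair."""
    try:
        features = json.loads(raw_body)
        return 200, {"prediction": predict(features)}
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing feature, or a non-numeric value.
        return 400, {"error": f"bad request: {exc}"}


class ScoringHandler(BaseHTTPRequestHandler):
    """Serve POST /predict by delegating to score_payload."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = score_payload(self.rfile.read(length))
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the server; this call blocks until the process is interrupted."""
    HTTPServer(("0.0.0.0", port), ScoringHandler).serve_forever()
```

Calling `serve()` starts the server, after which a client can POST a JSON body such as `{"distance_km": 100, "weight_kg": 2}` to `/predict`. Keeping the scoring logic in `score_payload`, separate from the HTTP handler, makes it easy to unit test and to move behind a different web framework later.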
#4 Follow software engineering standards
- Arrange your code in an object-oriented manner, always focus on readability, follow the PEP 8 standard, write enough comments, and include docstrings. Keep in mind that code is read much more often than it is written.
- Write computationally efficient code and follow modular design principles.
- Make your model scalable, and take time and space complexity into consideration when refactoring the code for final production. Your portfolio should indicate your ability to write production-level code.
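The guidelines above might look like this in a small, self-contained module. The `Standardizer` class and its method names are hypothetical, chosen to illustrate docstrings, type hints, PEP 8 naming, and a fit/transform split; they are not taken from any particular library.

```python
import math
from typing import List, Optional, Sequence


class Standardizer:
    """Scale a numeric feature to zero mean and unit variance.

    The statistics are learned once in ``fit`` and reused in ``transform``,
    so the same scaling can be applied to training and production data.
    """

    def __init__(self) -> None:
        self.mean_: Optional[float] = None
        self.std_: Optional[float] = None

    def fit(self, values: Sequence[float]) -> "Standardizer":
        """Learn the mean and population standard deviation in one O(n) pass each."""
        if not values:
            raise ValueError("cannot fit on an empty sequence")
        n = len(values)
        self.mean_ = sum(values) / n
        variance = sum((v - self.mean_) ** 2 for v in values) / n
        # Fall back to 1.0 for constant features to avoid dividing by zero.
        self.std_ = math.sqrt(variance) or 1.0
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        """Apply the learned scaling; raises if ``fit`` has not been called."""
        if self.mean_ is None or self.std_ is None:
            raise RuntimeError("call fit() before transform()")
        return [(v - self.mean_) / self.std_ for v in values]


scaled = Standardizer().fit([1.0, 2.0, 3.0]).transform([1.0, 2.0, 3.0])
```

Returning `self` from `fit` allows the chained call shown on the last line, and small single-purpose classes like this one are easy to test and to swap out when refactoring for production.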
#5 Create well organized presentations (not just for techies)
As a good data scientist, you need to become a good storyteller, since most of the time you will be talking with stakeholders who may not have a technical background.
As the saying goes, a picture is worth a thousand words. For each of your projects, create a markdown README file that describes the problem, includes the architecture diagram and some insightful visualizations from the exploratory data analysis, and explains the methodology and the impact of the solution in a story-driven manner. Make sure to provide links to related publications, blogs, and articles, and cite references to support the reader.
#6 Patience: Focus on quality and depth of skills
Remember that your portfolio is your way to show your skills and work style. You want to portray yourself as a professional who is ready for the sector of your interest. To do that, take the time to research the market you are interested in working in (for example, healthcare, finance, or energy) and look for the skills in demand in that field, along with the most common problems to be solved there. These are the projects and skills you will highlight in your portfolio. Make sure to raise the quality of your portfolio by adding at least one project that is more advanced in terms of techniques or problem statement.
The Don’ts
#1 Using documents only
This mainly goes for aspiring data scientists who come from academia or have a research background. You might think your publications, or a nicely written report describing your projects, are enough to speak for you. Unfortunately, they are not. Recruiters and hiring managers are more interested in seeing your code, although publications may add extra value. While a document-only portfolio often works for an academic position, for an industry role it is much safer to include your project code in your GitHub repo. Don’t forget to include an open-source dataset and instructions on how to reproduce the results.
#2 Using an unorganized coding script only
Never put a script on your GitHub without a README file and a proper description. Nobody has the time to decode your script to find out what you have done. As mentors, some of us have had the opportunity to review the GitHub profiles of aspiring data scientists, and we have come across this issue a lot: unorganized repos. Don’t make your portfolio or GitHub repo look like a dumpster.
#3 Highlighting common stuff from e-learning courses
While e-learning courses are a good way to get into data science or make a transition to it, here is a word of warning. Don’t include your Udacity, DataCamp, or Coursera projects in the highlighted repositories on your front page. These projects are usually a single page, require a few hours of work, and have been done by thousands if not millions of people. They may degrade your portfolio instead of upgrading it.
#4 Focusing on toy projects
Some people develop toy projects, or use toy datasets in a project, and put them in their portfolio. This may be a good idea in the learning phase, but in a portfolio it only demonstrates that you are still a child data scientist playing with toys!
#5 Copying other people’s work
Recruiters and hiring managers visit a lot of GitHub repositories and are able to catch plagiarism at first glance. Even if you manage to fool them during the initial screening, you will be knocked out in the interview phase.
Showcasing your portfolio on LinkedIn (Checklist)
- Start with your LinkedIn title. List the job you are looking for in the title, e.g. Data Scientist or Machine Learning Engineer. Don’t add words that indicate a lack of confidence, e.g. aspiring or enthusiast. It is better to list multiple positions, like Data Scientist | Machine Learning Engineer | Data Analyst. Recruiters will find you when they use these keywords to filter job candidates.
- In the Introduction section, summarize your interests, experience, expertise, skills, and the tools you are good at. You may use bullet points to organize and highlight the introduction. Many recruiters may only look at your introduction, so craft it carefully to highlight the strengths that may interest a recruiter.
- List your past jobs with relevant projects. A chronology alone is not very informative unless you add the projects, highlight your key achievements and contributions, and summarize the key outcomes. Where possible, use quantitative metrics to describe your achievements, such as achieving a sensitivity of 96% or improving accuracy by 2%. Also mention the tools you used, e.g. Activeloop, Azure Cloud, Streamlit, etc.
- Add graphics and links if possible. For example, you may link to the website or page of the project. Use one or two key visualizations from the project. Visuals are much more appealing than mere words.
- List all related volunteer experiences in the volunteer section, especially those that demonstrate your positive attitude to teamwork, collaboration, and diversity.
- List education, publications, awards, and certifications, with verifiable links if applicable. Highlight what you learned in each course.
- Get endorsements for your skills from peers, colleagues, and teammates. Star the skills that are relevant to your job of interest. It is even better to get recommendations from your project manager, reporting officer, or academic supervisor.
- Write some LinkedIn articles about your past projects and feature them in the Featured section of your LinkedIn page. This will add considerable weight to your profile.
One last piece of advice: by joining Omdena projects, you will be able to develop a portfolio that stands out. You will also get the chance to use advanced tools from startups as well as industry leaders.