AI Insights

How to Prepare for a Junior Data Scientist Job in the Real World

February 24, 2020



From nobodies to collaborating with global leaders. Working on a real problem meant overcoming data quality challenges, finding the best-fit machine learning model, and sharpening our teamwork and problem-solving skills.

In this article, Ashish Gupta and I describe our experiences in our first Omdena Challenge and how it prepared us for a Junior Data Scientist job.

It had always been one of our dreams to work at organizations that take initiatives for social good. Being part of the IT sector, we wanted to use technology for the benefit of the community.

However, the deployment and applicability of these technologies in solving real-world problems seemed out of reach.

Fortunately, Omdena provided us with the opportunity to leverage AI and ML technologies to join their #AIforEarth Challenge to detect wildfires in Brazil.

At Omdena, you belong to a team of people from across the globe working towards a common objective.

In such a scenario, keeping up with the senior and lead AI practitioners on the team is difficult, but ultimately rewarding!

Photo by Kevin Ku on Unsplash


Entering a Real-World Machine Learning Project

Joining the challenge of detecting disastrous forest fires through landscape images, we ran into problems with our dataset.

Always prepare for the worst!

Our dataset consists of images captured by cameras mounted on communication towers in forest areas. The data is messy: it includes images with camera glare, fog, clouds, and night-time scenes, as well as images showing smoke released from boilers.

For example, the image below suffers from camera glare, which makes it difficult for a model to interpret.

Images with camera glare


Such noisy images decrease a model's accuracy. For instance, images that contain clouds are difficult to interpret (even for us humans) because smoke and clouds look the same!

Original Image with clouds covering the forest area


Mask that our model recognized as smoke for the above image


Images without these problems displayed other challenges. In the following image, it is difficult to tell whether we are looking at smoke or fog.

Image with the confusion of smoke and fog


Here is what our model predicted for the mask.

Model prediction


With these data challenges, the accuracy of most of the models prepared by the team plateaued around 93–94%.

Something had to be done!

One way out of these stagnant waters was to implement a routine that improved the quality of the dataset fed into our model, so that the machine could interpret the images better.

Improving the image quality

We took the initiative and began working on a task to preprocess the images using denoising, sharpening, and upsampling techniques.

With everybody busy with their own work, we, as junior data scientists, had to take responsibility for the job.

We assumed the role of task managers and created a task for improving the quality of the images. This is when the team took notice! Our task involved using image processing methodologies such as image denoising, image sharpening, and image super-resolution to improve the quality of the dataset.

We stumbled upon several algorithms to perform this and started to experiment.

First, we came across a method known as Non-Local Means Denoising, implemented in OpenCV. Below are some results.

Original Image


After running it through the algorithm — Appears the same


As you can see, the original distorted and noisy image showed no improvement, because the method performs well mainly on images containing Gaussian noise, whereas our images contained other kinds of random noise.

We then shifted to another approach, taken from the popular Deep Image Prior code repository. Using the methodology from this paper, we were able to eliminate some noise from the images. However, this approach was not practical!
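The core idea behind Deep Image Prior is to fit an untrained CNN to a single noisy image from a fixed random input: the network reproduces natural image structure before it reproduces noise, so stopping the optimization early yields a denoised result. The following is a toy PyTorch sketch of that idea, not the architecture from the paper:

```python
import torch
import torch.nn as nn

def deep_image_prior_denoise(noisy: torch.Tensor, iterations: int = 3000,
                             lr: float = 0.01) -> torch.Tensor:
    """Fit a small CNN to a single noisy image; early stopping acts as the prior.

    noisy: tensor of shape (1, C, H, W) with values in [0, 1].
    """
    channels = noisy.shape[1]
    net = nn.Sequential(
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, channels, 3, padding=1), nn.Sigmoid(),
    )
    # A fixed random input: only the network weights are optimized.
    z = torch.randn(1, 32, noisy.shape[2], noisy.shape[3])
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(iterations):
        opt.zero_grad()
        loss = ((net(z) - noisy) ** 2).mean()
        loss.backward()
        opt.step()
    return net(z).detach()
```

The number of iterations is the crucial (and costly) knob: too few and the output is blurry, too many and the network starts fitting the noise itself.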

We ran the code on a sample image containing Gaussian noise for 3000 iterations.

Original Image containing Gaussian Noise


Image after denoising — 3000 iterations (Gaussian noise eliminated)


Because this showed promising results, we ran the code on our own dataset's images. Even after a whopping 8000 iterations, we had the following results.

Original Image


Image after denoising — 8000 iterations (still a bit blurry compared to the original image)


Shoot for the moon. Even if you miss, you’ll land among the stars.

In the end, the code was not incorporated into our team's final model, because it dealt only with Gaussian noise removal.

However, the experience of moving on from smaller contributions to being task managers of a challenge was exciting, inspiring and motivating.

Prepared for a Junior Data Scientist Job

Technical skills

As beginners in the field, we were not very experienced with libraries such as PyTorch, or with tools such as Slack. Our time with the team increased our skills in handling such software. We are now not only comfortable with the technical aspects but can also apply them in future challenges, or with other teams, towards developing and deploying new projects.

Communication and Collaboration

Prior to this challenge, we often lacked the confidence to put forward our viewpoints in front of other people. That diffidence faded when the senior members of the Omdena team welcomed our ideas with open arms and we could communicate freely in meetings. This also created room for efficient collaboration on a global scale, irrespective of the age or experience of the other team members.

Problem Solving Skills

Our team was in a fix, and we took responsibility for finding a way out. With the dataset appearing to be the main reason our models were plateauing, we set out to solve that problem and improve the team's performance. We worked on denoising the images in our dataset to visibly improve their quality, which in turn could help our team's model achieve higher accuracy.

From being nobodies in a large group of global collaborators, we earned praise from the team for our efforts in coding the denoising solutions. We also presented these findings at the meetings held towards the end of the challenge.

We simply cannot express the satisfaction you feel when highly experienced people in the AI community appreciate and praise your work. Landing among the stars is not a problem at all, you know!

This article is written by Yash Mahesh Bangera.
