Applying Machine Learning to Predict Illegal Dumpsites

Applying Machine Learning to Predict Illegal Dumpsites

By Ramansh Sharma, Rosana de Oliveira Gomes, Simone Vaccari, Emma Roscow, and Prejith Premkumar


Just like any other day, we start our morning with a coffee and a snack to go from our favorite bakery. Later on the same day, we check out our mail where we find letters, newspapers, magazines, and possibly a package that just arrived. Finally at night, after a rough week, we decide to go out to have drinks with friends. Sounds like a pretty uneventful day, right?

Except that we produced lots of trash in the form of plastic, glass, paper, ad more.

According to eurostat, it is estimated that an average person in Europe produces more than 1.3 kg of waste per day (in Canada and the USA, it can go up to more than 2 kg). This is equivalent to a person producing 800 kg of trash per year. Now imagine millions of… billions of people doing the same. Every day!

To give you an even clearer perspective: less than 40% of all the waste produced in Europe is recycled — and it is even less across the other continents. Even further, it is estimated that 20% of all generated waste ends up on illegal dumping (s) in Europe, and 50% in Africa.

TrashOut is an environmental project which aims to map and monitor all illegal dumping (s) around the world and to reduce waste generation by helping citizens to recycle more. This is done through a mobile and web application that helps users with locating and monitoring illegal dumping (s), finding the nearest recycling center or bin, joining local green organizations, reading sustainability-related news, and notifying users about updates on their reports.

In this article, we discuss our analysis of illegal dumping (s) across the world, both in local and global scales.


The problem


Photo by Ocean Cleanup Group on Unsplash


The problem statement for this project was to “build machine learning models on illegal dumping (s) to see if there are any patterns that can help to understand what causes illegal dumping (s), predict potential dumpsites, and eventually how to avoid them”. We decided to tackle this wordy problem statement by dividing it into three manageable sub-tasks to be worked on throughout the duration of the project:

  • Sub-task 1.1: Spatial patterns of existing TrashOut dumpsites
  • Sub-task 1.2: Predict potential dumpsites using Machine Learning
  • Sub-task 1.3: Understanding patterns of existing dumpsites to prevent future potential illegal dumping (s)



  • TrashOut: Reports on illegal dumping (s) provided by users through the TrashOut mobile App. For each report, a number of features are recorded, and the most relevant for this analysis were: location (latitude and longitude, city, country, and continent), date, picture, size, and type of waste.
  • Open Street Maps (OSM): Geospatial dataset and information on the cities road network, including the type of roads (e.g. motorway, primary, residential, etc)
  • Socioeconomic Data and Applications Center (SEDAC): Population density at 1km grid, from which we also calculated the population density gradient to account for population density in the neighboring cells
  • FourSquare: Information about nearby venues
  • World Bank Indicators, World Bank’s “What a Waste 2.0”, Eurostat, European Commission Directorate-General for Environment: Datasets for socio-economic indicators.
  • Non-dumpsites Control Dataset: we generated our own Control Dataset, which was required to train the model on where dumpsites do not occur. For every TrashOut dumpsite location, we selected a pseudo-random location 1 km away and assigned this as a potential non-dumpsite location.



The first challenge was to identify and extract meaningful information for the spatial analysis from the available datasets. Our assumption was that illegal dumping (s) are more likely to occur in highly populated places, in proximity to main roads and in proximity to venues of interest such as sports venues, museums, restaurants, etc. Based on this assumption, we used the available dataset to extract, for every TrashOut dumpsites as well as for every location of our Non-dumpsite/Control Non-dumpsites, the 17 features described in Table 1:


Table 1: Datasets and API’s used to acquire different features for dumpsites * For the control dataset, the source for Continent was pycountry-convert library.


Sub-task 1.1: Finding existing dumpsites

City-based Analysis of Illegal Dumpsites/ Dumping (s)

We performed an in-depth analysis focused on six shortlisted cities, with the goal to represent different social statuses and geographical locations so all continents were included, and based on the availability of a considerable number of TrashOut dumpsite reports. The cities analyzed were:

  • Bratislava, Slovakia (Europe)
  • Campbell River, British Columbia (Canada)
  • London, UK (Europe)
  • Mamuju, Indonesia (Asia)
  • Maputo, Mozambique (Africa)
  • Torreon, Mexico (Central America)

For the city-based analysis, we accessed the road network information from the OSM dataset by using the Python package OSMnx. This API allows easy interaction with the OSM data without needing to download it, which makes it very accessible in any location around the world. We structured the analysis in a Colab Notebook for consistency and analyzed the following features for each city: distance to three types of roads (motorway, main and residential), distance to the city center, population density, size, and type of waste.


Results for Bratislava

The proportion of TrashOut dumpsites vs. Control Non-dumpsites and their proximity to nearest roads within 1 km is shown in Figure 1, however, the statistical assessment was undertaken within 100 m using the two-proportion Z-test. The three graphs are generated for each road type (motorways, main roads, and residential roads) with the purpose to identify whether dumpsites are more likely to appear in proximity of a specific road type. In Bratislava, around one-fifth of dumpsites were found in proximity to the main road (within 100 m), and these were found more likely to be reported next to the main road (within 100 m) compared to locations of Control Non-dumpsites. However, most dumpsites are not reported on roadsides, and in fact, being further away from a road was found to be a slightly better predictor of where a dumpsite might occur.


Figure 1: Proximity to a nearest major road for dumpsites and control datasets


The location of TrashOut dumpsites across Bratislava, colored by reported size, is shown in Figure 2. The majority (around three-quarters) of dumpsites are estimated by TrashOut users to be too big to be taken away in a bag. Dumpsites of all sizes are found throughout the city, but the largest dumpsites tend to be further away from motorways.


Figure 2: Size of dumpsites in the city of Bratislava


Several types of waste were reported alongside other types of waste within the TrashOut dumpsites. The number of dumpsites containing each type of waste is shown in the bar chart in Figure 3.1, whereas in Figure 3.2 is shown the percentage of dumpsites containing several types of waste in a matrix. The majority of reported dumpsites in Bratislava contain what TrashOut users describe as domestic waste. Domestic waste often coincides with plastic waste, which itself is found in around half of the dumpsites. Around one-third of dumpsites are reported to contain construction waste.



Figure 3: Waste types in the TrashOut datasets for Bratislava



Visualizing the distribution of dumpsite reports throughout the city with the spatial analysis undertaken can be informative in preparation to clean up existing dumpsites, as well as for identifying potential new hotspots. The following observations were drawn from this city-level geospatial analysis.

Information about the type and size of dumpsites may be important for local authorities and decision-makers to consider how best to clean up dumpsites. Having a spatial visualization of the locations and characteristics of each dumpsite across each city area, not only helps to inform management efforts to clean up existing dumpsites, but also to try minimizing potential new dumpsites by introducing bins for specific types of waste, or holding events to increase recycling awareness.

Plastic waste is found alongside other types in many dumpsites, which is not surprising. Waste that can be separated occurs simultaneously in reports: domestic and plastic, glass or metal. This might suggest that infrastructure is lacking (i.e. waste collection facilities), or the population is not aware of waste sorting and recycling.

The amount of construction waste in reports for every part of the world suggests that legislation for construction and demolition waste needs to be improved and compliance needs to be checked/assessed in many places. This might suggest that residents find construction waste difficult or costly to dispose of legally, or that construction companies are neglecting their responsibility to clean up.

It is important to stress that we cannot say where dumpsites actually appear, only where they are reported to TrashOut. Dumpsites may be reported with higher frequency in some areas because there are more residents or passersby to report them, regardless of whether there are more dumpsites in those areas.

The use of these tools and analysis will always need to be supported by local knowledge, as well as with the involvement of local municipalities and authorities.


1.2: Predicting potential dumpsites 

Features to train the Machine Learning model

The second subtask focused on creating a Machine Learning model that could predict whether a location is at risk of becoming a dumpsite. Since we have already seen the variables that were considered to be of a strong influence on dumpsites in Table 1, these variables could be used to predict whether a new location could turn into a dumpsite.

When acquiring the venue categories, we set a radius parameter in the Foursquare library until which distance it is supposed to fetch venue categories information. Although we created datasets with radii 500m, 1km, 2km, and 3km, we came upon the conclusion that the 1km radius dataset was the most appropriate one with the best model performance. It was not too near to the location from which the data was being collected therefore not losing any vital information, and at the same time not too far so that irrelevant information needed to be fetched.

The features: Number of Venue Categories, Nearest Venue Categories, and Frequent Venue Categories were only acquired up until 1km from a given location. Moreover, the five nearest venue categories and five most frequent venue categories were acquired for each given location as separate variables. If the Foursquare API failed to acquire not all 5 (or even none in some cases) categories within a 1km radius, then a None string would be placed instead in the empty variables.

A similar approach was taken for the OSM library for the distance to roads features. The value was only collected for roads up till a 1km radius from a location, with the exception of few cases where the API returned a distance slightly beyond 1km.

For the population density feature, our team discussed different approach ideas, and eventually, we decided that, instead of having a singular value for the population density of the given location, the probability of a dumpsite occurring in (or in very close vicinity of) that location is also affected by the surrounding population. Therefore if the location is in the center of a 1×1 square km cell, then the population densities of the eight 1×1 square km cells around the center cell are also considered. This would be a rather good way to see if the dumpsite is in the middle of a highly-populated area, in the outskirts of the city, or in a nowhere land. Using these nine different population densities (with a more weightage on the center cell’s density), a population gradient is calculated for the location which is given to the model as a separate feature in addition to the population density.

These were the 17 features that would be used for the Machine Learning model part of this subtask. But how do we teach the model what a dumpsite is?


Control dataset

We wanted to train a Machine Learning model such that it learns to understand what constitutes a dumpsite in the 17 variables we gathered. We fetched and calculated the features for every one of the approximately 56,000 dumpsites in the TrashOut dataset that we had. However, it is not possible to train a model just by showing it what a dumpsite is. This is because when we show a location that is highly unlikely to become a dumpsite, we want our model to confidently tell us so. An analogous comparison would be to show 56,000 cats to a child and then expect he/she to recognize that a dog is not a cat.

The solution lies in creating a control dataset. In order to teach the Machine Learning model what a dumpsite is, we also need to teach it to understand what a dumpsite is not.

For the sake of simplicity, we can also call the control dataset non-dumpsites. So how do we go about finding non-dumpsites? Any location that is not a dumpsite is in essence a non-dumpsite. However, that will not help the model learn meaningful differences between the two classes: dumpsites and non-dumpsites. Instead, what we can do is find close geographical points to the dumpsites that we already have and use them as the control dataset. Once again, we experimented with multiple distances from the dumpsite to pick these points and found that a distance of 1 km works best. The advantages of choosing these points are:

  • The points are close enough to the dumpsites so that there are subtle changes in the features that the model will be able to learn and appropriately map to the two classes.
  • The points are not too close so that the model fails to realize key differences between the features of the two classes.
  • The points are not so far that there is no correlation between the two classes, therefore, preventing nullification of the purpose of the control dataset.
  • When we choose a point near a reported dumpsite, we assume that a location nearby a known dumpsite has active users of the TrashOut app in the area, so if there was no dumpsite report, we assumed there was no dumpsite in that location.


Figure 4: Illustration of determining the approach to creating a control dataset.


We also took measures to make sure that the non-dumpsite point generated for each dumpsite did not contain another dumpsite, was not in the vicinity of another dumpsite, and was not in a major water body.

The control dataset was made for every dumpsite so that the two classes were balanced for binary classification. Additionally, all the features that were used in the dumpsite dataset were used for the control dataset as well.



Our team investigated three different Machine Learning models throughout the project:

  1. Random forest classifier — This approach did not work because the model failed to understand the data in a thorough manner and yielded extremely low accuracy. ❌
  2. Neural Networks (Dense and Ensemble) — These series of models and its iterations did not work either because the model was tremendously overfitting. It would be unsuitable for real-world purposes. ❌
  3. Light Gradient Boosted Model (LGBM) — This model was our final model. It had good accuracy and the minimum generalization error among all three models. ✅



The final accuracy of our model was 80% on the test set. We employed the use of k-fold cross-validation to maximize the accuracy our model could achieve on the test set. We also observed how important the individual features were when it came to classifying a given location to be prone to a dumpsite or not. This analysis was done with the help of SHAPELY and PDP plots as shown in Figure 5 below.


Figure 5: Importance of every individual variable on the model performance.


Figure 6: Probability output change in the model induced by a change in distance to roads feature


The SHAPELY plot in Figure 5 shows the contribution of each feature towards the prediction. It depicts the importance that each feature has. A high value of importance signifies that the model considers it as a very important factor when determining if a given location is a dumpsite or not. It is indicated that the most important feature of the model for both classes is the distance to the road variable.

The Partial Dependence Plot in Figure 6 helps one understand the effect of a specific variable on the model output. As the value of a feature changes, its effect on the model will also change accordingly. We compute these plots for all the numerical values that we have, to understand its effect on the prediction of the model. The one shown above is the plot analyzing the effect of distance to roads variable. As it can be observed, as the distance from a given location to a major road increases (positive x-axis), the probability that that location is a dumpsite decreases. The soft spike from 1500 m to 2500 m is due to how we manually placed the value of 2500 m in an example when the API could not find a road up till that distance. Regardless, this situation can be manually handled in the deployment implementation.

One of the key achievements of the team was being able to generate a full city heat map of the city of Bratislava (most of our tests were based here) by running the model on more than 700 locations in the city. In the heat map, the actual dumpsites are plotted in blue markers while the roads are marked in black lines (major roads and highways are visibly thicker than minor roads). The spectrum goes from whitish-yellow to dark red with yellow regions resembling a low probability of becoming a dumpsite and the red regions resembling a high probability of becoming a dumpsite. The heat map provides many beneficial usages. For example, municipalities and local authorities can make smaller heat maps for regional neighborhoods to determine which areas are at a high risk of becoming a dumpsite.

Another important variational use of this heat map is to combine it with valuable insights about socio-economic factors, population density, distance to roads, etc. The reason being that even though the model has considerably good accuracy, the wisdom still lies in the intuition of local authorities and municipalities. These officials will be better equipped to analyze key neighborhoods and areas to find the places where there is a major road or highway nearby, has a high population density, and a certain set of venue categories in close proximity. Then, a heat map can be generated for that area and specific regions can be identified which require immediate attention to mitigate the possibility of becoming a dumpsite.


Figure 7: Heat map of the city of Bratislava using the ML model to predict dumpsites


Sub-task 1.3: Preventing future dumpsites

Global analysis of illegal dumpsites/ dumping (s)

In order to analyze illegal dumpsites/ dumping (s) on a global scale, we combined the data from TrashOut with two other datasets:

From this setup, it was possible to divide the countries analyzed into four clusters, using unsupervised learning:


Figure 8: Global analysis clustering summary. Source: Omdena


Small population developed countries (Blue cluster): countries with a small population and population growth, but high urban populations and access to electricity. These countries also have low urban population growth and GDP. The countries in this cluster present a high production of glass, metal, and paper/cardboard waste, and the highest production of yard/green waste.

High population developed countries (Orange cluster): countries with the highest GDP, and also high access to electricity, urban population, tourism. Low inflation and urban population growth. also produces high amounts of glass and paper/cardboard and is responsible for the highest production of special waste and total municipal solid waste: These countries are also associated with the lowest production of organic waste.

High population developing countries (Green cluster): countries with the highest population and inflation; moderate access to electricity and GDP, and growing urban and total population. They generate high amounts of organic, rubber/leather and wood waste, and low amounts of glass. This can be associated with the high population being less concentrated in cities and also to the level of industrialization of such countries, which may be the ones that most produce factory materials such as leather and rubber.

Small population developing countries/low income (Red cluster): countries with a small population and GDP; lowest electricity access, but highest GDP and population (including urban) growth. Most of the waste produced in these countries is from food and organic sources. These are the countries that also produce the lowest amounts of glass, metal, and rubber/leather waste. Such scenarios can be associated with the low populations and also possible use of waste in a sustainable way, as these countries also present low production of green/yard and wood waste.


Figure 9: Some features in the global analysis in country clusters.


Combining this analysis to the illegal dumpsites/ dumping (s) data from TrashOut, our team obtained the following insights:

  • Plastic waste is highly produced across all clusters, indicating that this kind of trash needs to have a global awareness strategy.
  • Different types of illegal trash are associated with rural and urban areas, making it important on a global scale. Countries with a higher rural population (low income) will produce more illegal organic waste, whereas developed countries present more reports on illegal dumpings/ dumping (s) of plastic, cardboard, yard/green waste, rubber/leather, and special waste.
  • Socio-economic factors such as infrastructure, sanitation, inflation, and tourism play a moderate role in the different production of illegal waste worldwide.
  • We identify that population is the most important factor in the production of special waste, municipal solid waste, and the total amounts of waste per year. However, most of the special waste produced in developed countries (majority) is reported.



In this project, we have analyzed data on a local and global scale to understand which factors contribute to illegal dumping, as well as predict and finding possible ways to avoid it.

It is important to stress that we cannot say where dumpsites actually appear, only where they are reported to TrashOut. Dumpsites may be reported with higher frequency in some areas because there are more residents or passersby to report them, regardless of the number of dumpsites in those areas.

Nevertheless, visualizing the distribution of dumpsite reports with the spatial analysis undertaken can be informative for identifying potential new hotspots. The following observations can be extracted from our analysis:


On a city level

  • The prediction of the ML model and the heatmap can be used as tools for targeted waste management interventions, but will always need to be supported by local knowledge as well as with the involvement of local municipalities and authorities.
  • The main road and motorway junctions are locations where illegal disposal of waste is prone to occur. We can witness this in Bratislava and Torreón.
  • A lot of reports occur in natural resources areas, e.g. watercourses or natural parks. We can see this in Bratislava, Mamuju, and Campbell River. This may be due to two factors: ease of disposal without being caught; people walking by those areas may be more environmentally aware and wanting to preserve more the places where they go to enjoy nature. Consequently, they create reports more often.
  • Waste that can be separated occurs simultaneously in reports: domestic and plastic, glass or metal. This might suggest that infrastructure is lacking (i.e. waste collection facilities), or the population is not aware of waste sorting. This is especially clear in Maputo city.
  • The amount of construction waste in reports for every part of the world suggests that legislation for construction and demolition waste needs to be improved and compliance needs to be checked/assessed in many places. This applies to companies and individuals. Construction waste seems to be a problem for many cities.
  • TrashOut reports seem to be created in pulses, not on a regular basis. For all the cities examined, there are certain months when the number of reports created exceeds the average by far. Moreover, reports seem to be generally created in specific parts of a city (eg. London).


On a global scale

  • Plastic waste production is high across the globe, making it an international problem;
  • Among all the socio-economic factors, population plays a strong role in the production of waste in the world (legal and illegal);
  • The level of development and socio-economic factors (infrastructure, sanitation, education, among others) play an important role in the kind of waste produced by countries.
  • In particular:
  1. Small developed countries present a high production of glass, metal, and paper/cardboard waste, and the highest production of yard/green waste
  2. High population developed countries: high amounts of glass and paper/cardboard and the highest production of special waste and total municipal solid waste. These countries are also associated with the lowest production of organic waste
  3. High population developing countries: high amounts of organic, rubber/leather and wood waste, and low amounts of glass;
  4. Small developing countries/low income: most of the waste produced in these countries is from food and organic sources.


Possible Factors to Avoid Illegal Dumping

  1. Organization of clean-up events in areas where many dumpsites are already existing can be arranged to clean up the targeted areas in a fun, interactive, and educational way. Collaboration with local authorities should be put in place to improve existing waste infrastructure and build new ones, if necessary.
  2. Those areas identified as high risk for becoming new potential dumpsites could be targeted with waste infrastructure development/enhancement programs. Additionally, learning events could be organized to raise awareness about dumpsites risks, and how to minimize, or avoid altogether, dumpsites by using properly the waste facility infrastructure existing.
  3. Examples of learning events consist of learning sessions on how to use waste infrastructure and recycling bins according to local/national authorities, the benefits of a sustainable way of living through the 4R cycle (Refuse, Reduce, Reuse, Recycle), and how to avoid single-use items (or a specific type of waste as can be highlighted by a city-level analysis).

Analyzing Mental Health and Youth Sentiment Through NLP and Social Media

Analyzing Mental Health and Youth Sentiment Through NLP and Social Media

By Mateus Broilo and Andrea Posada Cardenas


We are living in an era where life passes so quickly that mental illness has become a pivotal issue, and perhaps a bridge to some other diseases.

As Ferris Bueller once said:

“Life moves pretty fast. If you don’t stop and look around once in awhile, you could miss it.”

This fear of missing out has caused people of all ages to suffer from mental health issues like anxiety, depression, and even suicide ideation. Contemporary psychology tells us that this is expected — simply because we live on an emotional roller coaster every day.

The way our society functions in the modern day can present us with a range of external contributing factors that impact our mental health — often beyond our control. The message here is not that the odds are hopelessly stacked against us, but that our vulnerability to anxiety and depression is not our fault. — Students Against Depression

According to WHO, good mental health is “a state of well-being in which every individual realizes his or her own potential, can cope with the normal stresses of life, can work productively and fruitfully, and is able to make a contribution to her or his community. At the same time, we find it at WordNet Search as “the psychological state of someone who is functioning at a satisfactory level of emotional and behavioral adjustment”. Notice that it is far from being a perfect definition, but it gives us a hint related to which indicator to look for, e.g. “emotional and behavioral adjustment”.

It’s foreseen that this year (2020) around 1 in 4 people will experience mental health problems. Especially, low-income countries have an estimated treatment gap of 85%, contrary to high-income countries. The latter has a treatment gap of 35% to 50%.

Every single day, tons of information is thrown into the wormhole that is the internet. Millions of young people absorb this information and see the world through the glass of online events and others’ opinions. Social media is a playground for all this information and has a deep impact on the way our youth interacts. Whether by contributing to a movement on Twitter or Facebook (#BlackLifeMatters), staying up to date with the latest news and discussions on Reddit (#COVID19), or engaging in campaigns simply for the greater good, the digital world is where the magic happens and makes worldwide interactions possible. The digital eco-not so friendly-system plays a crucial role and represents an excellent opportunity for analysts to understand what today’s youth think about their future tomorrow.

Take a look at the article written by Fondation Botnar related to the young people’s aspiration.


The power of sentiment analysis

Sentiment analysis, a.k.a  opinion mining or emotional artificial intelligence (AI), uses text analysis, and NLP to identify affective level patterns presented in data. Therefore, a wise question could be: How do the polarities change?


Top Mental Health keywords from Reddit and Twitter



Violin plots

Considering a data set scraped from Reddit and Twitter from 2016–2020, these “dynamic” polarity distributions could be expressed using violin plots.



Sentiment Violin-Plot hued by Year

Sentiment Violin Plots by year. Here positive values refer to positive sentiments, whereas negative values indicate negative sentiments. The thicker part means the values in that section of the violin has a higher frequency, and the thinner part implies lower frequency.




On one hand, we see that as the years go by polarity tends to become more and more neutral. On the other hand, it’s difficult to understand which sentiment falls in what category, and what does the model categorizes as positive vs negative sentiments for each year. Also, text sentiment analysis is subjective and does not really spot complex emotions like sarcasm.


Violin plots according to label

So now, the next attempt was to see polarities according to labels — anxiety, depression, self-harm, suicide, exasperation, loneliness, and bullying.



Sentiment Violin-Plot hued by Year

Sentiment Violin Plots by label



Even if we try to see the polarities by the label, we might end up with surface-level results instead of crisp insights. Look at Self-harm, what’s the meaning of positive self-harm? But it’s still there in the green plot.

We see that most of the polarities are distributed close to the limits of the neutral region, which is ambiguous since it can be viewed as either a lack of positiveness or a lack of accurate sentiment categorization. The question is — how do we gain better insights?

Maybe we try plotting the mean (average) sentiment per year per label.



Mean Sentiment per Year hued by label



Notice that Depression was the only label that went through two consecutive decreasing mean sentiment values and passed from positive (2017–2019) to neutral in 2020. Moreover, Loneliness and Bullying classes are depicted only with one mark each, because they appear only in the data scraped from (Jan - Jun)/2020.


Depression-label word cloud

Before pressing on, let’s just take a look at the Depression-label word cloud. Here we can detect a lot of “emotions” besides the huge “depression” in green, e.g. “low”, “hopeless”, “financial”, “relationship”.


Keywords relating to mental health

Source: Omdena



These are just the most frequent words associated with posts labeled as Depression and not necessarily translates the feelings behind the scene. However, there is a huge “feel” there… Why? For sure, this is related to one of the most common words, which actually is the 6th more common word in the whole data set. In a more in-depth analysis aiming to find interconnections among topics, certainly “feel” would be used as one of the most prominent edges.



feel knowledge graph

“Feel” Knowledge Graph



This Knowledge graph shows all the nodes where “feel” is used as the edge connector. Very insightful but not very visible.

In fact, there’s a much better approach that performs text analysis across lexical categories. So now the question is: “What sort of other feelings related to mental health issues should we be looking for?”.




Empath analysis

The main objective of empath analysis consists of connecting the text within a wide range of sentiments besides just negative, neutral, and positive polarities. Now we’re able to go far beyond trying to detect subjective feelings. For example, look at the second and third lexicon- “sadness” and “suffering”. Empath uses similarity comparisons to map a vocabulary of the text words, (our data set is composed of Reddit and Twitter posts) across Empath’s 200 categories.


AI Mental Health

Empathy Value VS Lexicon




The Empath value is calculated by counting how many times it appears and is normalized according to the total text emotions spotted and the total number of words analyzed. Now we’re able to go much deeper and truly connect the sentiment presented in the text into some real emotion, rather than just counting the most frequent ones and assuming whether it is related to something good or bad.



Empathy value vs Year

Emotion trends hued by lexicon



We choose five lexicons that might be more deeply associated with mental health issues and show in the left plot: “nervousness”, “suffering”, “shame”, “sadness” and “hate”, we tacked these five emotions per year analyzed. And guess what? Sadness skyrocketed in 2020.



Sentiment analysis in the post-COVID world

The year 2020 turned our lives upside down. From now on we will most likely have to rethink the way we eat, travel, have fun, work, connect,… In short, we will have to rethink our entire lives.

There’s absolutely no question that the COVID-19 pandemic plays an essential role in mental health analysis. To take these impacts into account, since COVID-19 began to spread out worldwide in January, we selected all the data comprising the period of (January — June)/2020 to perform the analysis. Take a look at the Word Cloud related to the COVID-19 analysis from May and June.




COVID 19 Top keyword Analysis of Mental Health



Covid 19 Top Keyword Analysis of Mental Health




We can see words like help, anxiety, loneliness, health, depression, isolation. In this case, we can consider that it reflects the emotional state of people on social media. As said earlier that the sentiment analysis under polarity tracking isn’t that insightful, but we display the violin graphs below just for comparing.



Sentiment Violin-plot for COVID 19 Analysis by Months



Sentiment Violin-plot for COVID 19 Analysis by Label



Now we see a very different pattern from the previous one, and why is that? Well, now we’re filtering by the COVID-19 keywords and indeed the sentiment distribution now seems to make sense. Looking more closely at the distribution of the data, the following is observed.



Graph of number of relatable words vs count of words



In the word count from the sample of texts from 2020, only 2.59% of them contain words related to COVID-19. The words we used are “corona”, “virus”, “COVID”, “covid19”, “COVID-19” and “coronavirus”. Furthermore, the frequency of occurrence decreases as the number of related words found increases, the most common being at most three times in the same text.

Till now, we have presented the distribution of sentiments for specific words related to COVID-19. Nonetheless, questions about how these words relate to the general sentiment during the time period under analysis haven’t been answered yet.

The general sentiment has been deteriorating, i.e. becoming more negative, since the beginning of 2020. In particular, June is the month with the most negative sentiment, which coincides with the month with the most contagious cases of COVID-19 in the period considered, with a total number of 241 million cases. Considering the differences between the words related to COVID-19 and words that are completely unrelated, in the former, more negativity in sentiments is perceived in general.



Graph between sentiment vs months in 2020

The sentiment by the label is again observed — this time from January 2020 to June 2020 only.



Violin Plot by label 2020



Exasperation remains stable, with February being the month that attracts the most attention due to its negativity compared to the rest. Likewise, self-harm is quite stable. The months that call out the attention for their negativity in this category are March and June. Contrary to self-harm, in suicides, March doesn’t represent a negative month. However, the rest of the months between February and June not only present a detriment in the sentiment, which worsens over time, but they are also notably negative. June draws attention to having really positive and really negative sentiments (high polarities), which doesn’t happen in the other months. It has to be verified, but it could be that the number of suicides has been increasing in the last months. Regarding anxiety, a downward trend is also observed in the sentiment between February and May. Finally, one should be careful with loneliness, given the high negativity perception in May and June. Given that there are only data for June 2020 for Bullying, this label isn’t analyzed.

The next figure presents the time series corresponding to the sentiment between 2019/05 and 2020/06. A slight downward trend can be observed. This means that the general sentiment has become more negative. Additionally, there are days that present greater negativity, indicated by the troughs. Most of the troughs in the present year are found in the last months since April.



Sentiment Analysis from 2019-05

Incidents that moved the youth




There are other major incidents, besides COVID-19, that have influenced the youth to call for help and to speak up in 2020. The recent murder of George Floyd was the turning point and lighted up the #BlackLivesMatter movements. Have a look at the word cloud on the left — with the most frequent and insightful words

The youth gathered to protest against racism and call for equality and freedom worldwide. The Empath values related to Racism and Mental Health are displayed below.


AI Mental Health

Normalized empathy analysis



The COVID-19 pandemic has led the world towards a scenario of a global economic crisis. Massive unemployment, lack of food, lack of medicines. Perhaps the big Q is: “How will the pandemic affect the younger generations and the generations to come? ”. Unfortunately, there’s no answer to this question. Except that the economic crisis that we’re presently living in is definitely going to affect the younger generation because they’re the ones to study, go to college and find a job in the near future. The big picture tells us that unemployment is increasing on a daily basis and there are not enough resources for all of us. The Word Cloud in the opening of the article reflects some of the most frequent words related to the actual economic crisis.

Using Machine Learning to Predict Student Debt Repayment Capability

Using Machine Learning to Predict Student Debt Repayment Capability

By Laisha Wadhwa


Each day, there are news stories about the college tuition crisis. But what is the crisis we are seeking to solve? Is it the staggering amount of student debt? The rapidly rising cost of higher education? Is the interest being collected on student loans? The high default rate on student loans? Or all of the above?

At present student loan debt exceeds accumulated car loans and credit card debt by nearly $1.6 trillion. The student debt has hence become a $1.6 trillion crisis.

This case study highlights solutions to this problem using Artificial Intelligence (AI) and Machine Learning (ML).

  • Using ML to predict the repayment capability of the students.
  • Build a network that will allow the students to estimate the risk incurred while borrowing and manage the loan amount accordingly.
  • Understand if the student’s future earnings are maximized by attending the best school they can get into, despite the cost incurred.


The task initially started with a lot of literature review that gave useful insights into multiple drivers for the current state of students in the US and how advances in ML and AI create new possibilities across markets. It enables new avenues to disrupt one of the most costly and impactful markets: that of student debt.

The findings were functional while segmenting the students and narrowing down the features to use for defining these segments.


Data preparation and feature engineering

After the initial literature review, we dived into the data preparation which involved collecting the right sources of data, EDA, normalization, and feature transformation. Multiple sources of data were used; College Scorecard Data, 2019 US student loan debt by location and age, and Trends in Graduate Student financing data by NCES.

Let’s look a little closer:

  • Features with a large number of missing values were dropped.
  • The data values were normalized given there were many different features.


Exploring the data further

The pre-processed dataset was huge so the next step was to select the relevant features from the insights gathered from the literature review. After having a round of Exploratory Data analysis and model simulations the following features were selected for further creating the segments using K-means clustering. IN order to handle missing values MICE imputation technique was used for 3 features (INST_ENSIZE being one of them)


Feature importance using Lightgbm regressor



  • CD3: Three-year cohort default rate
  • Control3
  • Median earnings of students working and not enrolled 8 years after entry
  • The average age of entry
  • Control2
  • Median household income
  • Percent of all undergraduate students receiving a federal student loan
  • Highdeg4: Highest degree awarded (has 5 levels so level 2,4 and 5 were important)
  • Highdeg2
  • State44
  • State28
  • State30
  • State8
  • State22
  • The total share of enrollment of undergraduate degree-seeking students who are women
  • State11
  • Percent completed within 4 years at the original institution
  • INST_ENSIZE2: Institution Enrollment Size (Newly derived featured fro UGDS:[Enrollment of undergraduate certificate/degree-seeking students])
  • Highdeg5
  • State43
  • The median debt for students who have completed
  • State27


The above features are listed in the order of their importance which was obtained using Lightgbm regressor on the complete dataset.

And the following factors were considered while evaluating the financial aspect of attending a college:

  • Cost.
  • Amount of debt to be taken.
  • Future earnings.


% of students who complete their course or withdraw from it vs the borrowed loan amount.



Target Variable

The target variable is obviously repayment rate = fraction of students properly repaying their loans, while lies in the range of 0–1. COMPL_RPY_3YR_RT is taken as the target variable which is defined in the data set as “Three-year repayment rate for completers”.


Student Segmentation

A total of 12 segments were made out of the data. The segments are clearly demarcated from other segments. These segments show that there’s a diverse set of students taking loans each year and hence have varied repayment capabilities based on what college they went to, which course they took up, their demographic data, financial profile, and age profile.
K-means was used to generate the clusters and the kneed library was used to find the optimal “k” value for the segmentation process. The explained variance of the clusters was 95% (after PCA)


12 clusters obtained: represented in 2 dimensions using PCA



Next comes the modeling: Time for some insights

The student segmentation was followed by modeling the repayment capacity of the students to save them from lifelong debt.

The idea was to model the distinctive student segments using Gradient boosting regression. Linear regression came in handy to study the relationship between the median earnings and the average cost of the institution. The next thing was to work with different models like LightGBM, XGboost, and an ensemble of various other models to predict the repayment capability of a loan borrower. Finally, a stacked ensemble of XGBoost and Lightgbm performed the best when compared to all the other examples. The dataset was split into mutually exclusive training, validation, and test sets.RMSE has been used as an evaluation metric.


Relationship between the median earnings and the average cost of the institution.



The model predicts a number between 0–1. Let’s say the predicted value is 0.85, it implies that 85% it’s likely that that student will be able to repay the borrowed loan amount.


Quantitative Results

RMSE was used to evaluate the model. The RMSE value achieved on a sample of 30k students was 0.865 which proved that the model was able to predict the repayment capabilities pretty accurately given the data state and amount of pre-processing done to train a model.


Final thoughts
  • The whole analysis throws light on the current state of repayment of loans and provides an AI-based approach to evaluate the risk incurred while borrowing a certain amount of money.
  • As expected, borrowers with higher annual income and higher FICO scores are more likely to repay them completely.
  • Also, borrowers with lower interest rates and smaller installments are more likely to pay the loan fully.


Key Insights

  • Key takeaways from existing studies and reports:
    a. Typically, repayment of government student loans begins six months after college graduation and can last up to 25 years.
    b. Most borrowers pay their loans in ten years.
    c. Students fail to make a payment on their student loan in 270 days, they are considered to be in default.
    d. The average national two-year cohort default rate in the FSLP was 7 percent in 2008.


  • Understanding the data:
    a. The clustering of data points based on institution level and debt repayment rate data was crucial in understanding the data distribution and how some categories stood out from the rest.
    b. EDA in terms of analyzing repayment rates depicted patterns relating to how University and admission rates affected the median debt of students completing education.
    c. Most of the schools associated with high debt are visual arts and design institutions.



No of years loan borrowed along with the type of loan provider vs % borrowing amount for various sectors of loan providers



  • Sources of bias:
    a. A major bias introduced in this study by the collectors of the data is that the data is limited to students that are Title IV recipients. A Title IV award includes government loans and grants (4), and these are need-based awards. Thus, it is important to keep in mind that the conclusions we attempt to make using this data set are not general to all students, but rather only those that qualified for, or perhaps more importantly had access to financial aid.
    b. The second bias in this study is that it only pertains to schools that received financial aid. Over 80% of schools receive financial aid, so this bias is not as severe as the one mentioned earlier. However, this data set is limited to students that received financial aid at a subset of available schools.


  • Important observations from the study:
    a. The median and mean debt of those completing their education are over double than those that withdraw. However, more interestingly, we find that the populations for both groups appear to have a multimodal distribution.
    b. Mean debt for those that complete their schooling is similar for those that attend private for-profit and public schools, but significantly higher for private nonprofit schools and the same is true for the debt taken on for those that withdraw before completing their education.
    c. There is a linear relationship between the debt taken on by the students that complete their education for institutions that have an average cost < ~25,000, however, there is not a linear relationship for the higher-priced institutions.


  • Take on the hypothesis:
    Based on the analysis the hypothesis that a student’s future earnings are maximized by attending the best school they can get into, despite the cost was also proved to be true through numbers and graphs.



More about Omdena

Omdena is the collaborative platform to build innovative, ethical, and efficient AI and Data Science solutions to real-world problems. 

Finding Answers to the Student Debt Crisis Through Data Science

Finding Answers to the Student Debt Crisis Through Data Science

By Galina Naydenova

When thinking about student life, we tend to hear the positive sides of the argument — the build-up of knowledge, the thrill of research, the excitement of campus life, and the improved employment opportunities. There is, however, a huge cost in the shape of the student debt crisis, which remains for years after graduation, causes stress, delays important life milestones, and does not necessarily come with improved opportunities and quality of life.

As one of the task leaders, I am going to walk you through the steps we took in solving this challenge. It is less about the data and results, more about what you do with them.


Step 1. Setting the scope

Is there anything about student debt that we do not know about?

Omdena’s method of branching out into different task groups allows the freedom to cover many sides of a problem. The content of the task brings a team together — an idea is proposed and then voted into a task, with team members joining in, setting the agenda, and collectively taking decisions. As somebody with years of experience in the Higher Education (HE) sector in the UK, I felt that I am familiar with the subject matter and proposed the scope of our task to be the impact of Student Debt Crisis on different demographic groups (namely: female students, ethnic minority students, and first-generation students) and finding an AI solution for this student debt crisis. I chose this for the following reasons:

  • The gap in participation and attainment between ethnic minority groups and the rest is a well-documented problem, and recent events have made it even more relevant.
  • In the UK and elsewhere, there are questions about the necessity of taking up debt and the value of having a degree.
  • There is evidence that student debt disproportionately affects certain demographic groups, disadvantaging the very groups that higher education was supposed to benefit the most.
  • All of the above, if unaddressed, may lead to a drop in participation of disadvantaged groups in HE, reversing years of progress.

So we set our task goal to explore how student debt affects the different demographic groups, why, and what can be done about it through the multitude of data available.


Step 2. Setting the methodology

Keep it simple.

Being a community of ML and AI practitioners, the suggestions that came were regarding modeling, clustering sentiment analysis, web scraping, but there were also voices for the more traditional data analysis for this Student Debt Crisis Challenge. We ended up choosing the latter because:

  • Such an analysis would allow us to compare different groups, which was the aim of our task.
  • When applying machine learning approaches, it is easy to miss out on the detail. The volume of data hides the fact that there are different forces pulling in different directions.
  • Higher Education is a regulated sector and there is a large volume of data from multiple sources.
  • Our data is not coming from a single source, and not all of it is on an individual level, which makes modeling difficult.


Step 3. Getting to know the problem

Mapping the Student Debt Crisis journey.

With so many pieces of data, there is a danger of ending up with a collection of separate pieces of data, each one interesting in itself, but disjointed.

Having a framework beforehand would allow us to focus when researching sources, and also overcome one of the problems often found when applying machine learning –the ‘now what’ moment, when we see the data but cannot say what it means.

So, we created a map of the student loan journey, along with the questions whose answers we are aiming to find in data.

  • Loan decision: Why do students have to borrow?
  • Loan dimensions — How much are they borrowing and what are they getting for their loan?
  • Loan and studying — What happens at the different study milestones?
  • Loan and employment — What employment and earnings can be expected?
  • Loan effects — Effect on mental health and HE value perception

To answer all these questions, we sought data for different demographic groups, as per our task goal.


Step 4. Getting the data.

Aim for variety.

Luckily, as is the case with other regulated sectors, there was no lack of data. The mixture of statutory, financial, survey, trend, and official stats data, apart from allowing us to cross-validate findings, allowed us to build a rich, multifaceted picture. At one point we had nearly 50 pieces of data in our shortlist to investigate — a challenging task even in itself. Some sources, like the College Scorecard dataset, were massive, and we had to apply a fair amount of manipulation and standardization. Some of the sources came with long definition lists, which took a while to unpick. The agile nature of the project allowed us to sneak in some data on the impact of COVID-19 at the last minute, making it even more relevant.


Datasets used for analysis


Step 5. The recommendations

The last mile

I am going through these steps together. Often analysis relies on data speaking for itself and falls short of interpreting and giving recommendations. However, having a framework from the beginning, breaking down the big questions into small ones, and linking the questions to data, we can see areas that can be acted upon.

For example, there is no silver bullet for tackling drop-out, but small steps can be taken, to challenge institutions about the completion gap between the different demographic groups, to facilitate non-punishing transfers, to set work/study standards, and to use AI and predictive analytics in order to improve degree completion during this Student Debt Crisis.




What can be done?


1. Why do students have to borrow?

Some of the answers:

  • Low income and first-generation students are more likely to come from ethnic minorities
  • Independent students are more likely to be black than dependent students
  • Private for-profit institutions have a higher share of ethnic minorities, female and first-generation students
  • Undergraduates from ethnic minorities tend to have lower financial literacy

In a nutshell: Ethnic minority students borrow more because they cannot rely on their families, they opt for more expensive institutions, and, because of lower financial literacy, may end up borrowing at disadvantageous terms.

What can be done about it?

Increase general financial literacy, encourage planning in advance, establish peer advice network.


2. How much are they borrowing and what are they getting their loan for?

Some of the answers:

  • Minority ethnicity students more likely to be in debt, and have higher levels of debt.
  • Minority ethnicity students are more likely to attend two-year institutions (Hispanic students tend to favor associate degrees) and multiple institutions.
  • Private for-profit institutions have a higher share of ethnic minorities, female and first-generation students.
  • Female, black and Hispanic students miss out on STEM subjects.
  • African American households have the highest percentage of unpaid student loans.

In short: Having higher levels of debt does not necessarily translate into studying attractive subjects or landing at well-performing institutions.

What can be done about it?

Monitor and publicize institution data, advise on beneficial loan decisions, tackle the differences in STEM subject take-up.


3. What happens at the different study milestones?

Some of the answers:

  • Private for-profit institutions, which tend to attract female, ethnic minority, and first-generation students, have low completion rates.
  • Black students are relatively more likely to drop out or to transfer.
  • One-third of black students are working over 20 hours a week while studying.

Summing up: Ethnic minority students are more likely to fall at the very first hurdle — that of degree completion, and having to work during study further jeopardizes their chances of completion.

What can be done about it?

Challenge institutions over retention rates and/or gap, deploy predictive analytics solutions to reduce dropout, advise cap on work hours, ensure financially fair transfers


4. What employment and earnings can be expected?

Some of the answers:

  • Men are over-represented in STEM and business careers.
  • Men’s earning for degree and advanced degree holders rise faster than women.
  • For women with advanced degrees, the wage gap is wider.
  • The share of associate degree holders is larger in the industries highly exposed due to COVID-19.

In short: The gender wage gap persists and is even more pronounced for advanced degree holders, which risks making entire industries short of women leaders. COVID-19 disproportionately affects holders of associate degrees, tend to be favored by Hispanic students.

What can be done about it?

Support women in STEM and be clear on employment reality for advanced degree holders. Use the corona-virus crisis to retrain and up-skill vulnerable groups, including associate degree holders.


5. What is the effect on mental health and the value perception of Higher Education?

Some of the answers:

  • The majority of students from for-profit institutions and black students display high or very high levels of stress from education-related debt.
  • Women report higher levels of ill mental health than men in all ethnicities.
  • Degree holders from for-profit institutions are skeptical about their degree worth.

In summary: Students from for-profit institutions (mostly female, ethnic minority, and first-generation students) display higher degrees of stress and are clearly disillusioned by their higher education experience.

What can be done about it?

Make institution performance and loan conditions transparent, invest in debt counseling, emergency support, and relief, promote an unbiased view of the value of higher education and alternatives to a degree.



Even with a problem that has been addressed extensively, collective wisdom from an autonomous and diverse group can bring new insights. Contrary to the popular belief, lack of data was not the problem — it was scoping and breaking down the problem in the context of too much data, which is where collaboration and diverse thinking really helped. Yes, a self-organized international community presented challenges in agreeing times for meeting, brainstorming and feedback, but the colorful side of collaboration — the assorted and inconsistent graphs, as many styles as team members, was a welcome difference from the monotonous slide templates from our day jobs. In the end, there was a positive feeling that we have done our small bit to help to understand this big problem better and that it was a time well spent getting to know and learning from each other.



More about Omdena

Omdena is the collaborative platform to build innovative, ethical, and efficient AI and Data Science solutions to real-world problems. 

AI For Financial Inclusion: Credit Scoring for Banking the Unbankable

AI For Financial Inclusion: Credit Scoring for Banking the Unbankable

Steps towards building an ethical credit scoring AI system for individuals without a previous bank account.



The background

With traditional credit scoring system, it is essential to have a bank account and have regular transactions, but there are a few groups of people especially in developing nations that still do not have a bank account for a variety of reasons; they do not see the need for it, some are unable to produce the necessary documents, for some the cost of opening the accounts is high, some may not have the knowledge about opening accounts, lack of awareness, trust issues and some unemployed.

Some of these individuals may need loans for essentials; maybe to start a business or like farmers who need a loan to buy fertilizers or seeds. While many of them may be reliable creditors but because they do not get access to funding, they are being pushed to take out high-cost loans from non-traditional, often predatory lenders.

Low-income individuals have an aptitude for managing their personal finances. And we need a system for ethical credit scoring AI in order to help these borrowers and clutch them from falling into deeper debts.

Omdena partnered with Creedix to build an ethical AI-based credit scoring system so that people get access to fair and transparent credit.


The problem statement


The goal was to determine the creditworthiness of an un-banked customer with alternate and traditional credit scoring data and methods. The data was focused on Indonesia but the following approach is applicable in other countries.

It was a challenging project and I believe everyone should be eligible for a loan for essential business ventures but they should be able to pay it back while not having to pay exorbitant interest rates. Finding that balance was crucial for our project.


The data

Three datasets were given to us,

1) Transactions

Information on transactions made by different account numbers, the region, mode of transaction, etc.


2) Per capita income per area

All the data is privacy law compliant.


3) Job title of the account numbers

All data given to us was anonymous as privacy was imperative and not an afterthought.

Going through the data we understood we had to use unsupervised learning since the data was not labeled.

Some of us were comparing online available data sets to the data set we had at hand, and some of us started working on sequence analysis and clustering to find anomalous patterns of behavior. Early on, we measured results with silhouette score — a heuristic tool to figure out if the parameters we had would provide significant clusters. The best value is 1 with well separable clusters, and the worst is -1 with strongly overlapping ones. We got average values close to 0s, and these results were not satisfactory to us.


Feature engineering

With the given data we performed feature engineering. We calculated per ca-pita income score and segregated management roles from other roles. We also calculated the per capita income score so that we can place buckets into accounts in areas that are likely to be reliable customers. For example. management roles mean they would have a better income to pay back.




But even with all the feature engineering, we were unable to get a signal from the data given for clustering. How did we proceed?

We scraped data online from different sites like indeed and numbeo. Since we had these challenges we were not able to give one solution to the customer and had to improvise to provide a plan for future analysis, so we used dummy data.

We scraped data from sites like numbeo to get the cost of living per area, how much they spend on living. From indeed we got salary data to assign an average salary to the jobs.




With the data, scraped online and feature engineering from the given dataset, we tried to figure out if we can get a prediction from using clustering algorithms.


The solutions

  • Engineered Features & Clusters (from Datasets given)
  • Machine Learning Pipelines/Toolkit (for Datasets not provided)
  • Unsupervised Learning Pipeline
  • Supervised Learning Pipeline (TPOT/auto-sklearn)


1. Engineered features & clusters



As mentioned above, with the context that we have gathered from Creedix, we have engineered or aggregated many features based on the transaction time series dataset. Although these features describe each customer better, we can only guess the importance of each feature with regards to each customer’s credit score based on our research. So, we have consolidated features for each customer based on our research on credit scoring AI. As for the importance of each feature with regards to credit scoring AI in Indonesia, this will be up to the Creedix team to decide.

For example,

CreditScore = 7*Salary + 0.5*Zakut + 4000*Feature1 + …+ 5000*Feature6

Solutions given to Creedix were both Supervised Learning and Unsupervised Learning. Even after all the feature engineering and data found online we were still getting a low silhouette score signifying that there would be overlapping clusters.

So we decided that we will provide solutions for Supervised Learning using Auto ML and Unsupervised learning, both using dummy variables, the purpose -was to serve future analysis or future modeling for the Creedix Team.

The dataset we used for Supervised Learning —

With Supervised Learning, we did modeling with both TPOT and Auto SKLearn. This was done so that when we have more features available that are accessible to them but may not be for Omdena collaborators they can use the information to build their models. When they have target variables to use.


2. The model pipeline for Supervised Learning

Our idea is to create a script that can take any datasets and automatically search for the best algorithm by iterating through all classifiers/regressors, hyperparameters based on user-defined metrics.

Our initial approach was to code from scratch iterating individual algorithms from packages (e.g. sklearn, XGBoost and LightGBM) but then we came across Auto ML packages that already do what we wanted to build. Thus, we decided to use those readily available packages instead and not spend time reinventing the wheel.

We used two different auto ML packages TPOT and Auto-sklearn. TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines and finding the best one for your data.



Auto-sklearn frees an ML user from algorithm selection and hyperparameter tuning. It leverages recent advances:

  • Bayesian optimization
  • Meta-learning
  • Ensemble construction

Both TPOT and auto-sklearn are similar, but TPOT stands out between the two due to its reproducibility. TPOT is able to generate both the model and also its python script to reproduce the model.


3. Unsupervised Learning

In the beginning, we used agglomerative clustering (a form of hierarchical clustering) since the preprocessed dataset contains a mix of continuous and categorical variables. As we have generated many features from the dataset (some of them very similar ones, based on small variations in their definition), first we had to eliminate most of the correlated ones. Without this, the algorithm would struggle to find the optimal number of groupings. After this task, we remained with the following groups of features:

  • counts the number of transactions per month (cpma),
  • average increase/decrease in value of specific transactions (delta),
  • average monthly specific transaction amount (monthly amount),

and three single specific features:

  • Is Management — assumed managerial role,
  • Potential overspend — value estimating assumed monthly salary versus expenses form the dataset,
  • Spend compare — how customer’s spending (including cash withdrawals) differs from average spending within similar job titles.


In a range of potential clusters from 2 to 14, the average silhouette score was best with 8 clusters — 0.1027. The customer data was sliced into 2 large groups and 6 much smaller ones, which was what we were looking for (smaller groups could be considered anomalous):



This was not a satisfactory result, anyway. On practical grounds, describing clusters 3 to 8 proved challenging, which is correspondent with a relatively low clustering score.



It has to be remembered that the prime reason for clustering was to find reasonably small and describable anomalous groupings of customers.

We, therefore, decided to apply an algorithm that is efficient with handling outliers within a dataset — DBSCAN. Since the silhouette clustering score is well suited for convex clusters and DBSCAN is known to return complex non-convex clusters, we forgo calculating any clustering scores and focus on the analysis of the clusters returned by the algorithm.

Manipulating the parameters of DBSCAN, we found the clustering effects were stable — the clusters contained similar counts, and customers did not traverse between non-anomalous and anomalous clusters.

Also analyzing and trying to describe various clusters we find it easier to describe qualities of each cluster, for example:

  • one small group contrary to most groups had no purchase, no payment transactions, and no cash withdrawals, but very few relatively high transfers by mobile channel,
  • another small group also had no purchase and no payment transactions, however, made cash withdrawals,
  • yet another small group had the highest zakat payments (for religious causes) and high amount of mobile transactions per month,
  • The group considered as anomalous (cluster coded with -1) with over 300 customers differentiated itself with falling values across most types of transactions (transfers, payments, purchases, cash withdrawals) but sharply rising university fees.



Important to note is that for various sets of features within the data provided here, clustering score for both hierarchical as well as DBSCAN methods returned even better clustering efficiency scores. However, at this level of anonymity (i.e. without the ground truth information), one cannot decide the best split of customers. It might transpire there is a relatively different optimal set of features that best splits customers and provides better entropy scores of these groups calculated on the creditworthiness category.

Analyzing Domestic Violence in Lockdowns: From No Data to Building an ML Classifier for Tweets

Analyzing Domestic Violence in Lockdowns: From No Data to Building an ML Classifier for Tweets

By Omdena Collaborator Harshita Chopra

Data mining, topic modeling, document annotations, NLP, and stacking machine learning models: A complete journey.

Artificial Intelligence and its possibilities have always fascinated me. Making machines learn through data is nothing short of amazing. When I got to learn about Omdena and it’s a wonderful initiative for bringing AI to social good using the power of global collaboration, I couldn’t stop myself from participating in its empowering challenges.

I felt truly content to be given the role of a Machine Learning Engineer in my first challenge. Connecting with a team of 50 fabulous collaborators from various countries around the world, including domain experts, data scientists, and AI experts — felt like a golden opportunity to gain knowledge in the best possible way.

It provided a space to create value out of my ideas and learn from the enhancements. The harmonizing environment gave me the experience of leading task groups and interacting with some really innovative minds.

In this blog post, I’ll walk you through a major part of the project I led and contributed to for the past month.


The Problem

There has been a surge in Domestic Violence (DV) and online harassment cases during COVID-19 lockdowns in India. Homes are no more a safe place for victims trapped in abusive relationships with their family members.

Domestic violence involves a pattern of psychological, physical, sexual, financial, and emotional abuse. Acts of assault, threats, humiliation, and intimidation are also considered acts of violence.

Data substantiating Domestic Violence from government resources are only available in summary form. Incidents are largely reported via calls, and hence make data and subsequent mapping difficult.

The goal of the challenge was to collect and analyze data from different social media platforms or news sources so as to gain insights on the rise in DV incidents during the nation-wide lockdown.


The Solution

Diverse social media platforms come up as a huge and largely untapped resource for social data and evidence. It generates a vast amount of data on a daily basis on a variety of topics. Consequently, it represents a key source of information for anyone seeking to study various issues, even the socially stigmatized and society tabooed topics like Domestic Violence.

Victims experiencing abuse are in need of earlier access to specialized services such as health care, crisis support, legal guidance, and so on. Hence the social support groups for a good social cause play a leading role in creating awareness promotion and leveraging various dimensions of social support like emotional, instrumental, and informational support to the victims.

Red Dot Foundation plans to deal with this challenge. When the victims seek help, it is important to identify and analyze those critical posts and acknowledge the help needed with more immediate impact.

Tasks were divided to mine data from different sources: Twitter, Reddit, YouTube, News articles, Government reports, and Google trends. After the acquisition of huge amounts of data, the next step was filtering out relevant posts through topic modeling and keywords. This was followed by annotation of data and then building an NLP based machine learning classifier.

In this blog post, Tweets would be in the spotlight!



Scraping data with the right queries

Tweets need to be extracted in the pre-lockdown and during the lockdown period so as to judge the surge in domestic violence. Hence, we took a time-frame of January’20 to May’20.

Tweepy (the official tweets scraping API of twitter) extracts tweets only from the past seven days, making it a bothering limitation. Hence, we needed an alternative for mining old tweets with the location.

GetOldTweets3 is a fantastic Python library for this task. Twitter’s advanced search can do wonders for generating your customized query. In order to extract harassment-related posts, here are a few examples of queries we used:




Using ANDcombinations of relationships words with actions and nouns yield good results. The until and since attributes hold the limits of the time frame.

The setNear() feature accepts a location name (like Delhi, Maharashtra, India, etc) or latitude and longitude of that region. The central point of India is approximately around (22,80) degrees. The setWithin() feature accepts the radius around that point, and 1800 km generally covers India and nearby places.

After executing more such queries with different keywords, we had thousands of tweets in handsome relevant topics and some irrelevant.


Data needs to be classified  –  Would topic modeling work?

Since a considerable number of tweets in our huge datasets were not related to the kind of harassment we were looking for, we needed some filtering. Classifying tweets into broad topics was the goal. Topic modeling was the first thing that clicked.

Topic modeling is an unsupervised learning process to automatically identify topics present in a collection of documents and to derive hidden patterns exhibited by a text corpus. Thus, assisting better decision making.

Latent Dirichlet Allocation is the most popular technique to do so. It assumes that documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.



Topic 0 words are generally included in awareness posts or #BanTiktok posts due to its inappropriate content.
> Topic 1 words are headlines or real victim stories.


Topic modeling works best when the topics are considerably distinct or not too related to each other.

The generated topics didn’t satisfy our target of classifying as relevant or irrelevant. Hence we had to pick up another approach, since our dataset, in general, talks about kinds of harassment.

Rule-based classification turned up to be a more precise approach in this task. I created three sets of keywords to look for — relationships, violence verbs, and not-relevant words. The following algorithm as implemented in Python to filter out some documents.


Relationships: [List of keywords like wife, husband, housemaid etc.]
Verbs        : [List of keywords like harass, beat, abuse etc.]
Not-relevant : [List of keywords like webinar, politics, movie etc.]
Iterating through document:
  R1: (any word in Relationships) AND (any word in Verbs) -> 'Keep'
  R2: (any word in Not-relevant) -> 'Discard'


The dataset was filtered pretty much based on our customer needs. Now comes up with the task of modeling. But we need annotations for training a supervised model.


Deciding Labels and Annotating Tweets

Eminent domain experts helped in coming up with the categories to classify tweets based on their context. Proper labeling guidelines were set up and training sessions helped to label tweets properly, keeping in mind the edge cases. For document classification, a quick and awesome tool called Doccano was used. Several collaborators helped by taking up queues of data points and annotating them. Following were the labels used:

    (advocating against domestic abuse)
    (denying the existence of domestic abuse)
    (stating factual information or news)
    (describing an incident of domestic abuse)
    (other kinds of harassment)
    (harassment directed at individual or community)


Domestic Violence Twitter

Analytics derived from NER.



And data for modeling is ready!

After all the annotations and some fabulous work with collaborators, we’re ready with an incredible training dataset.

Tinkering with Natural Language Processing…

Once pre-processing of texts by lowering cases, removing URLs, punctuation, stopwords, followed by lemmatization was done — we were ready to play around with modeling techniques.

To convert words to vectors (machine learns through numbers), experimenting with TF-IDF Vectorizer gave good results but we had a very limited vocabulary, while the inference data would have a greater variety of words. Therefore, a decision of using pre-trained word embeddings was made.

Our model used FastText English word vectors(wiki-news-300d-1M.vec) and IndicNLP word vectors (indicnlp.v1.hi.vec) for Hindi and Hinglish languages present in the documents.

Since tweets related to DV stories were quite less in number, data augmentation was used on these — by creating new sentences using synonyms of the original words.

nlpgaugis a library for textual augmentation in machine learning experiments. The goal is to improve model performance by generating augmented textual data. It’s also able to generate adversarial examples to prevent adversarial attacks.



Bringing into play  – Machine Learning Model(s)

A number of models including BERT, SVM, XGBoostClassifier, and many more were evaluated. Since there were really minute differences between some similar classes, we needed to combine two sets of labels.

After combining similar labels:



Limitations faced — Data under various classes was not easily separable because 3 classes plainly talked about Domestic Violence (story, opinion, news/info) which made it tough for the classifier to spot marginal variation in semantics.

Also, data under DV_STORY had the least number of samples given the fact that it was the most relevant class.

Hence, to deal with an imbalanced dataset, Under Sampling using NeighbourhoodCleaningRule was used from the imbalanced-learn library. The resampled data was fed to Stacked Models.

Stacking is a way of combining predictions from multiple different types of ML models, that introduces the concept of a meta learner.



Source: GeeksforGeeks


Level 0 learners:
– Random Forest Classifier
– Support Vector Classifier
– MLP Classifier
– LGBM Classifier

Level 1 meta-learner:
SVC with hyperparameter tuning and custom class weights.




This pretty much sums up the modeling. This model was used to predict labels on 8000 rows long dataset containing tweets. The misclassifications were skimmed through and corrected in some crucial classes in order to deliver the best data.

I feel glad to be a part of this incredible community of change-makers. Making some great connections through this amazing journey is like an icing on the cake.

So excited to collaborate in the upcoming mind-blowing projects, making the world a better place using AI for good!




More About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.


Stay in touch via our newsletter.

Be notified (a few times a month) about top-notch articles, new real-world projects, and events with our community of changemakers.

Sign up here