Applying Machine Learning to Predict Illegal Dumpsites


By Ramansh Sharma, Rosana de Oliveira Gomes, Simone Vaccari, Emma Roscow, and Prejith Premkumar


Just like any other day, we start our morning with a coffee and a snack to go from our favorite bakery. Later on the same day, we check out our mail where we find letters, newspapers, magazines, and possibly a package that just arrived. Finally at night, after a rough week, we decide to go out to have drinks with friends. Sounds like a pretty uneventful day, right?

Except that we produced lots of trash in the form of plastic, glass, paper, and more.

According to Eurostat, the average person in Europe produces more than 1.3 kg of waste per day (in Canada and the USA, it can exceed 2 kg). At the upper end, that adds up to roughly 800 kg of trash per person per year. Now imagine millions, even billions, of people doing the same. Every day!

To give you an even clearer perspective: less than 40% of all the waste produced in Europe is recycled, and the share is even lower on the other continents. Furthermore, an estimated 20% of all generated waste ends up in illegal dumpsites in Europe, and 50% in Africa.

TrashOut is an environmental project that aims to map and monitor illegal dumpsites around the world and to reduce waste generation by helping citizens recycle more. It does this through a mobile and web application that helps users locate and monitor illegal dumpsites, find the nearest recycling center or bin, join local green organizations, read sustainability-related news, and receive updates on their reports.

In this article, we discuss our analysis of illegal dumpsites across the world, at both local and global scales.


The problem


Photo by Ocean Cleanup Group on Unsplash


The problem statement for this project was to “build machine learning models on illegal dumpsites to see if there are any patterns that can help to understand what causes illegal dumpsites, predict potential dumpsites, and eventually how to avoid them”. We tackled this wordy problem statement by dividing it into three manageable sub-tasks to be worked on over the course of the project:

  • Sub-task 1.1: Spatial patterns of existing TrashOut dumpsites
  • Sub-task 1.2: Predict potential dumpsites using Machine Learning
  • Sub-task 1.3: Understanding patterns of existing dumpsites to prevent future illegal dumpsites



  • TrashOut: Reports on illegal dumpsites provided by users through the TrashOut mobile app. A number of features are recorded for each report. The most relevant for this analysis were location (latitude and longitude, city, country, and continent), date, picture, size, and type of waste.
  • OpenStreetMap (OSM): Geospatial dataset with information on each city's road network, including the type of road (e.g. motorway, primary, residential, etc.)
  • Socioeconomic Data and Applications Center (SEDAC): Population density on a 1 km grid, from which we also calculated the population density gradient to account for population density in the neighboring cells
  • Foursquare: Information about nearby venues
  • World Bank Indicators, World Bank’s “What a Waste 2.0”, Eurostat, European Commission Directorate-General for Environment: Datasets for socio-economic indicators.
  • Non-dumpsites Control Dataset: we generated our own Control Dataset, which was required to train the model on where dumpsites do not occur. For every TrashOut dumpsite location, we selected a pseudo-random location 1 km away and assigned this as a potential non-dumpsite location.



The first challenge was to identify and extract meaningful information for the spatial analysis from the available datasets. Our assumption was that illegal dumpsites are more likely to occur in highly populated places, close to main roads, and close to venues of interest such as sports venues, museums, restaurants, etc. Based on this assumption, we used the available datasets to extract the 17 features described in Table 1 for every TrashOut dumpsite as well as for every location in our Control Non-dumpsites dataset:


Table 1: Datasets and APIs used to acquire the different features for dumpsites.
* For the control dataset, the source for Continent was the pycountry-convert library.


Sub-task 1.1: Finding existing dumpsites

City-based Analysis of Illegal Dumpsites

We performed an in-depth analysis of six shortlisted cities, chosen to represent different social statuses and geographical locations so that all continents were included, and based on the availability of a considerable number of TrashOut dumpsite reports. The cities analyzed were:

  • Bratislava, Slovakia (Europe)
  • Campbell River, Canada (North America)
  • London, UK (Europe)
  • Mamuju, Indonesia (Asia)
  • Maputo, Mozambique (Africa)
  • Torreón, Mexico (Central America)

For the city-based analysis, we accessed the road network information from the OSM dataset by using the Python package OSMnx. This API allows easy interaction with the OSM data without needing to download it, which makes it very accessible in any location around the world. We structured the analysis in a Colab Notebook for consistency and analyzed the following features for each city: distance to three types of roads (motorway, main and residential), distance to the city center, population density, size, and type of waste.


Results for Bratislava

Figure 1 shows the proportion of TrashOut dumpsites versus Control Non-dumpsites by their proximity to the nearest road within 1 km. The statistical assessment, however, was carried out within 100 m using the two-proportion Z-test. One graph is generated for each road type (motorways, main roads, and residential roads) to identify whether dumpsites are more likely to appear near a specific road type. In Bratislava, around one-fifth of dumpsites were found within 100 m of a main road, and dumpsites were more likely than Control Non-dumpsite locations to be reported that close to a main road. However, most dumpsites are not reported on roadsides, and in fact, being further away from a road was found to be a slightly better predictor of where a dumpsite might occur.
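To make the statistical comparison concrete, here is a minimal sketch of a two-proportion Z-test in plain Python. The counts are invented for illustration and are not the project's data, and the function name is our own.

```python
import math

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one proportion."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Pooled proportion under the null hypothesis of equal proportions.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts: locations within 100 m of a main road, out of all locations.
z, p = two_proportion_z_test(hits_a=200, n_a=1000, hits_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests that the share of dumpsites near a road differs from the share of control points near a road. The sketch assumes the pooled proportion is strictly between 0 and 1; otherwise the standard error is zero and the test is undefined.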


Figure 1: Proximity to a nearest major road for dumpsites and control datasets


The location of TrashOut dumpsites across Bratislava, colored by reported size, is shown in Figure 2. The majority (around three-quarters) of dumpsites are estimated by TrashOut users to be too big to be taken away in a bag. Dumpsites of all sizes are found throughout the city, but the largest dumpsites tend to be further away from motorways.


Figure 2: Size of dumpsites in the city of Bratislava


Many TrashOut dumpsites were reported to contain several types of waste at once. The bar chart in Figure 3.1 shows the number of dumpsites containing each type of waste, whereas the matrix in Figure 3.2 shows the percentage of dumpsites in which each pair of waste types occurs together. The majority of reported dumpsites in Bratislava contain what TrashOut users describe as domestic waste. Domestic waste often coincides with plastic waste, which itself is found in around half of the dumpsites. Around one-third of dumpsites are reported to contain construction waste.



Figure 3: Waste types in the TrashOut datasets for Bratislava



Visualizing the distribution of dumpsite reports throughout the city with the spatial analysis undertaken can be informative in preparation to clean up existing dumpsites, as well as for identifying potential new hotspots. The following observations were drawn from this city-level geospatial analysis.

Information about the type and size of dumpsites may help local authorities and decision-makers decide how best to clean them up. A spatial visualization of the locations and characteristics of each dumpsite across a city not only informs management efforts to clean up existing dumpsites, but can also help minimize new dumpsites, for example by introducing bins for specific types of waste or holding events to increase recycling awareness.

Plastic waste is found alongside other types in many dumpsites, which is not surprising. Waste that can be separated occurs simultaneously in reports: domestic and plastic, glass or metal. This might suggest that infrastructure is lacking (i.e. waste collection facilities), or the population is not aware of waste sorting and recycling.

The amount of construction waste in reports for every part of the world suggests that legislation for construction and demolition waste needs to be improved and compliance needs to be checked/assessed in many places. This might suggest that residents find construction waste difficult or costly to dispose of legally, or that construction companies are neglecting their responsibility to clean up.

It is important to stress that we cannot say where dumpsites actually appear, only where they are reported to TrashOut. Dumpsites may be reported with higher frequency in some areas because there are more residents or passersby to report them, regardless of whether there are more dumpsites in those areas.

The use of these tools and analysis will always need to be supported by local knowledge, as well as with the involvement of local municipalities and authorities.


Sub-task 1.2: Predicting potential dumpsites

Features to train the Machine Learning model

The second sub-task focused on creating a Machine Learning model that could predict whether a location is at risk of becoming a dumpsite. Table 1 already lists the variables that we considered to have a strong influence on dumpsites, so these variables could be used to predict whether a new location could turn into a dumpsite.

When acquiring the venue categories, we set a radius parameter in the Foursquare library that controls how far from a location venue information is fetched. Although we created datasets with radii of 500 m, 1 km, 2 km, and 3 km, we concluded that the 1 km radius dataset was the most appropriate, as it gave the best model performance. It was not so close to the location that vital information was lost, and not so far that irrelevant information was fetched.

The features Number of Venue Categories, Nearest Venue Categories, and Frequent Venue Categories were acquired only within 1 km of a given location. The five nearest venue categories and the five most frequent venue categories were stored as separate variables for each location. If the Foursquare API returned fewer than five categories (or none at all in some cases) within the 1 km radius, the string None was placed in the empty variables.
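The padding rule described above can be sketched in a few lines of Python. The function name, column names, and example categories are hypothetical; only the rule itself (five slots, the string None as filler) comes from the description.

```python
def pad_categories(categories, size=5, filler="None"):
    """Return exactly `size` category slots, padding the tail with `filler`."""
    trimmed = list(categories)[:size]
    return trimmed + [filler] * (size - len(trimmed))

# Example: the API returned only two venue categories for this location.
nearest = pad_categories(["Bakery", "Park"])

# One separate variable per slot, as described in the text.
feature_row = {f"nearest_venue_{i + 1}": name for i, name in enumerate(nearest)}
print(feature_row)
```

Keeping a fixed number of slots means every location yields a row with the same columns, which is what a tabular model expects.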

A similar approach was taken with the OSM library for the distance-to-roads features. The value was collected only for roads within a 1 km radius of a location, with the exception of a few cases where the API returned a distance slightly beyond 1 km.

For the population density feature, our team discussed several approaches. We eventually decided against relying on a single value for the population density of the given location, because the probability of a dumpsite occurring in (or very close to) that location is also affected by the surrounding population. Therefore, if the location falls in a 1 km × 1 km cell, the population densities of the eight 1 km × 1 km cells around that center cell are also considered. This is a good way to see whether the location is in the middle of a highly populated area, on the outskirts of a city, or in the middle of nowhere. Using these nine population densities (with more weight on the center cell's density), a population gradient is calculated for the location and given to the model as a separate feature in addition to the population density.
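The exact formula for the population gradient is not given here, so the following is only one possible reading: a weighted average over the 3×3 neighbourhood in which the center cell counts more than its neighbours. The function name and the center weight of 2 are assumptions made for illustration.

```python
def weighted_neighbourhood_density(grid, center_weight=2.0):
    """Weighted mean of a 3x3 grid of densities; grid[1][1] is the location's cell.

    The weighting scheme is an assumption, not the project's actual formula.
    """
    if len(grid) != 3 or any(len(row) != 3 for row in grid):
        raise ValueError("expected a 3x3 grid of densities")
    total = 0.0
    weight_sum = 0.0
    for r in range(3):
        for c in range(3):
            weight = center_weight if (r, c) == (1, 1) else 1.0
            total += weight * grid[r][c]
            weight_sum += weight
    return total / weight_sum

# A dense cell surrounded by empty cells scores far lower than a dense cell
# surrounded by equally dense cells.
isolated = weighted_neighbourhood_density([[0, 0, 0], [0, 1000, 0], [0, 0, 0]])
urban = weighted_neighbourhood_density([[1000] * 3] * 3)
print(isolated, urban)
```

Whatever the exact formula, the point is the same as in the text: two locations with the same center density can be told apart by what surrounds them.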

These were the 17 features that would be used for the Machine Learning model part of this subtask. But how do we teach the model what a dumpsite is?


Control dataset

We wanted to train a Machine Learning model to learn what constitutes a dumpsite in terms of the 17 variables we gathered. We fetched and calculated the features for every one of the approximately 56,000 dumpsites in the TrashOut dataset. However, it is not possible to train a model just by showing it what a dumpsite is, because when we show it a location that is highly unlikely to become a dumpsite, we want the model to tell us so confidently. An analogy would be showing 56,000 cats to a child and then expecting them to recognize that a dog is not a cat.

The solution lies in creating a control dataset. In order to teach the Machine Learning model what a dumpsite is, we also need to teach it to understand what a dumpsite is not.

For the sake of simplicity, we can also call the control dataset non-dumpsites. So how do we go about finding non-dumpsites? Any location that is not a dumpsite is in essence a non-dumpsite. However, that will not help the model learn meaningful differences between the two classes: dumpsites and non-dumpsites. Instead, what we can do is find close geographical points to the dumpsites that we already have and use them as the control dataset. Once again, we experimented with multiple distances from the dumpsite to pick these points and found that a distance of 1 km works best. The advantages of choosing these points are:

  • The points are close enough to the dumpsites so that there are subtle changes in the features that the model will be able to learn and appropriately map to the two classes.
  • The points are not so close that the model fails to pick up key differences between the features of the two classes.
  • The points are not so far away that there is no correlation between the two classes, which would defeat the purpose of the control dataset.
  • A location near a known dumpsite is likely to have active TrashOut users in the area, so if no dumpsite was reported there, we assumed there was no dumpsite in that location.


Figure 4: Illustration of determining the approach to creating a control dataset.


We also took measures to make sure that the non-dumpsite point generated for each dumpsite did not contain another dumpsite, was not in the vicinity of another dumpsite, and was not in a major water body.

A control point was generated for every dumpsite so that the two classes were balanced for binary classification. Additionally, all the features that were used in the dumpsite dataset were used for the control dataset as well.
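As a rough sketch of how such control points could be generated, the code below picks a random bearing, moves 1 km from the dumpsite on a spherical Earth, and rejects candidates that land too close to any known dumpsite. The function names, the 0.5 km exclusion gap, and the retry limit are assumptions, and the water-body check mentioned above is left out because it needs external map data.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def offset_point(lat, lon, bearing_deg, distance_km):
    """Point reached by travelling distance_km from (lat, lon) along a bearing."""
    delta = distance_km / EARTH_RADIUS_KM
    theta = math.radians(bearing_deg)
    phi1, lmb1 = math.radians(lat), math.radians(lon)
    phi2 = math.asin(math.sin(phi1) * math.cos(delta)
                     + math.cos(phi1) * math.sin(delta) * math.cos(theta))
    lmb2 = lmb1 + math.atan2(math.sin(theta) * math.sin(delta) * math.cos(phi1),
                             math.cos(delta) - math.sin(phi1) * math.sin(phi2))
    lon2 = (math.degrees(lmb2) + 540) % 360 - 180  # normalise to [-180, 180)
    return math.degrees(phi2), lon2

def control_point(dumpsite, all_dumpsites, distance_km=1.0, min_gap_km=0.5,
                  rng=None, max_tries=50):
    """Return a (lat, lon) control point, or None if no valid candidate is found."""
    rng = rng or random.Random()
    for _ in range(max_tries):
        candidate = offset_point(*dumpsite, rng.uniform(0, 360), distance_km)
        # Reject candidates in the vicinity of any known dumpsite.
        if all(haversine_km(*candidate, *other) >= min_gap_km for other in all_dumpsites):
            return candidate
    return None

site = (48.1486, 17.1077)  # illustrative coordinates in Bratislava
point = control_point(site, [site], rng=random.Random(0))
print(point)
```

Returning None instead of a poor candidate keeps the caller in control: a dumpsite surrounded by other reports can be skipped or retried with different settings.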



Our team investigated three different Machine Learning models throughout the project:

  1. Random forest classifier: This approach did not work because the model failed to capture the structure of the data and yielded extremely low accuracy. ❌
  2. Neural Networks (Dense and Ensemble): This series of models and their iterations did not work either because the models overfitted severely, which would make them unsuitable for real-world use. ❌
  3. Light Gradient Boosting Machine (LightGBM): This was our final model. It had good accuracy and the lowest generalization error of the three models. ✅



The final accuracy of our model was 80% on the test set. We used k-fold cross-validation to maximize the accuracy our model could achieve on the test set. We also examined how important the individual features were for classifying whether a given location is prone to becoming a dumpsite. This analysis was done with SHAP (Shapley value) and partial dependence (PDP) plots, shown in Figures 5 and 6.
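For readers unfamiliar with k-fold cross-validation, here is a minimal sketch of how the training data can be split into folds using only the standard library. The number of folds is not stated in this article, so k=5 is an assumption, and the function name is our own.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(10, k=5))
for train_idx, val_idx in splits:
    print(sorted(val_idx), "validated against", len(train_idx), "training rows")
```

In each round the model is fitted on the training indices and scored on the validation indices, and the scores are averaged. The held-out test set stays outside this loop so that the final accuracy is measured on data the model has never seen.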


Figure 5: Importance of every individual variable on the model performance.


Figure 6: Probability output change in the model induced by a change in distance to roads feature


The SHAP plot in Figure 5 shows the contribution of each feature to the prediction, that is, the importance of each feature. A high importance value means the model treats the feature as a very important factor when determining whether a given location is a dumpsite. The plot indicates that the most important feature for both classes is the distance to the road.

The Partial Dependence Plot in Figure 6 helps one understand the effect of a specific variable on the model output: as the value of the feature changes, its effect on the prediction changes accordingly. We computed these plots for all the numerical features. The one shown in Figure 6 analyzes the effect of the distance-to-roads variable. As the distance from a given location to a major road increases (positive x-axis), the probability that the location is a dumpsite decreases. The soft spike from 1500 m to 2500 m arises because we manually set the value to 2500 m whenever the API could not find a road within the search distance. This situation can be handled manually in the deployment implementation.
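One possible way to handle that situation at deployment time, sketched here as an assumption rather than the project's actual implementation, is to replace the 2500 m placeholder with an explicit "no road found" flag. The function and constant names are hypothetical.

```python
NO_ROAD_SENTINEL_M = 2500.0  # placeholder value described in the text above

def split_road_distance(distance_m):
    """Return (distance_or_None, road_found) instead of a magic number."""
    if distance_m is None or distance_m >= NO_ROAD_SENTINEL_M:
        return None, False
    return float(distance_m), True

print(split_road_distance(120))     # a real measured distance
print(split_road_distance(2500.0))  # the placeholder: no road was found
```

Separating the flag from the distance keeps the placeholder from being read as a real measurement. Gradient boosting libraries such as LightGBM can generally work with missing values directly, so the None could be passed on as a missing value.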

One of the key achievements of the team was generating a full heat map of the city of Bratislava (where most of our tests were based) by running the model on more than 700 locations in the city. In the heat map, the actual dumpsites are plotted as blue markers and the roads as black lines (major roads and highways are visibly thicker than minor roads). The color spectrum runs from whitish-yellow to dark red, with yellow regions representing a low probability of becoming a dumpsite and red regions a high probability. The heat map has many uses. For example, municipalities and local authorities can make smaller heat maps for individual neighborhoods to determine which areas are at high risk of becoming a dumpsite.

Another important use of this heat map is to combine it with insights about socio-economic factors, population density, distance to roads, etc. Even though the model has reasonably good accuracy, the wisdom still lies in the intuition of local authorities and municipalities. These officials are better equipped to single out key neighborhoods that have a major road or highway nearby, a high population density, and a certain set of venue categories in close proximity. A heat map can then be generated for such an area to identify the specific regions that require immediate attention to reduce the risk of their becoming dumpsites.
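The grid behind such a heat map can be sketched as follows. The bounding box is a rough, illustrative box and not the one used in the project, and the predictor is a hypothetical stand-in for the trained model, which would need the full feature set for each location.

```python
def grid_points(south, west, north, east, rows, cols):
    """Evenly spaced (lat, lon) points covering a bounding box (rows, cols >= 2)."""
    lat_step = (north - south) / (rows - 1)
    lon_step = (east - west) / (cols - 1)
    return [(south + r * lat_step, west + c * lon_step)
            for r in range(rows) for c in range(cols)]

def score_grid(points, predict_probability):
    """Attach a dumpsite probability to every grid point."""
    return [(lat, lon, predict_probability(lat, lon)) for lat, lon in points]

def dummy_predictor(lat, lon):
    # Hypothetical stand-in: a real predictor would compute the 17 features
    # for the location and ask the trained model for a probability.
    return 0.5

points = grid_points(48.0, 17.0, 48.2, 17.2, rows=27, cols=27)  # 729 locations
scored = score_grid(points, dummy_predictor)
print(len(scored), scored[0])
```

Each scored point can then be drawn on a map with a color scale from yellow (low probability) to red (high probability). A finer grid gives a smoother map at the cost of more feature lookups per city.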


Figure 7: Heat map of the city of Bratislava using the ML model to predict dumpsites


Sub-task 1.3: Preventing future dumpsites

Global analysis of illegal dumpsites

In order to analyze illegal dumpsites on a global scale, we combined the data from TrashOut with two other datasets:

From this setup, it was possible to divide the countries analyzed into four clusters, using unsupervised learning:


Figure 8: Global analysis clustering summary. Source: Omdena


Small population developed countries (Blue cluster): countries with a small population and population growth, but high urban populations and access to electricity. These countries also have low urban population growth and GDP. The countries in this cluster present a high production of glass, metal, and paper/cardboard waste, and the highest production of yard/green waste.

High population developed countries (Orange cluster): countries with the highest GDP, as well as high access to electricity, urban population, and tourism, and low inflation and urban population growth. This cluster also produces high amounts of glass and paper/cardboard waste and is responsible for the highest production of special waste and total municipal solid waste. These countries are also associated with the lowest production of organic waste.

High population developing countries (Green cluster): countries with the highest population and inflation; moderate access to electricity and GDP, and growing urban and total population. They generate high amounts of organic, rubber/leather, and wood waste, and low amounts of glass. This can be associated with the high population being less concentrated in cities and also with the level of industrialization of such countries, which may be the largest producers of factory materials such as leather and rubber.

Small population developing countries/low income (Red cluster): countries with a small population and GDP; lowest electricity access, but highest GDP and population (including urban) growth. Most of the waste produced in these countries is from food and organic sources. These are the countries that also produce the lowest amounts of glass, metal, and rubber/leather waste. Such scenarios can be associated with the low populations and also possible use of waste in a sustainable way, as these countries also present low production of green/yard and wood waste.


Figure 9: Some features in the global analysis in country clusters.
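The article does not name the unsupervised algorithm behind these clusters. As an illustration only, the sketch below uses k-means, a common choice for grouping rows of numeric indicators. The toy feature vectors are invented, and real indicators would normally be standardised first so that no single one dominates the distance.

```python
import math
import random

def k_means(rows, k, iterations=20, seed=0):
    """Plain k-means: returns (labels, centroids) for a list of numeric vectors."""
    rng = random.Random(seed)
    centroids = [list(row) for row in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(row, centroids[c]))
                  for row in rows]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy, made-up indicator vectors forming two obvious groups.
rows = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(rows, k=2)
print(labels)
```

The analysis described here produced four clusters, so the equivalent call would use k=4 on the country-level indicators. The sketch requires Python 3.8 or later for math.dist.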


Combining this analysis with the illegal dumpsite data from TrashOut, our team obtained the following insights:

  • Plastic waste is highly produced across all clusters, indicating that this kind of trash needs to have a global awareness strategy.
  • Different types of illegal trash are associated with rural and urban areas, which makes this distinction important on a global scale. Countries with a higher rural population (low income) produce more illegal organic waste, whereas developed countries present more reports of illegal dumping of plastic, cardboard, yard/green waste, rubber/leather, and special waste.
  • Socio-economic factors such as infrastructure, sanitation, inflation, and tourism play a moderate role in the different production of illegal waste worldwide.
  • We identify that population is the most important factor in the production of special waste, municipal solid waste, and the total amounts of waste per year. However, most of the special waste produced in developed countries (majority) is reported.



In this project, we analyzed data on a local and global scale to understand which factors contribute to illegal dumping, to predict where it may occur, and to find possible ways to avoid it.

It is important to stress that we cannot say where dumpsites actually appear, only where they are reported to TrashOut. Dumpsites may be reported with higher frequency in some areas because there are more residents or passersby to report them, regardless of the number of dumpsites in those areas.

Nevertheless, visualizing the distribution of dumpsite reports with the spatial analysis undertaken can be informative for identifying potential new hotspots. The following observations can be extracted from our analysis:


On a city level

  • The prediction of the ML model and the heatmap can be used as tools for targeted waste management interventions, but will always need to be supported by local knowledge as well as with the involvement of local municipalities and authorities.
  • The main road and motorway junctions are locations where illegal disposal of waste is prone to occur. We can witness this in Bratislava and Torreón.
  • A lot of reports occur in natural areas, e.g. watercourses or natural parks. We can see this in Bratislava, Mamuju, and Campbell River. This may be due to two factors: waste is easy to dispose of there without being caught, and people walking through those areas may be more environmentally aware and keener to preserve the places where they go to enjoy nature. Consequently, they create reports more often.
  • Waste that can be separated occurs simultaneously in reports: domestic and plastic, glass or metal. This might suggest that infrastructure is lacking (i.e. waste collection facilities), or the population is not aware of waste sorting. This is especially clear in Maputo city.
  • The amount of construction waste in reports for every part of the world suggests that legislation for construction and demolition waste needs to be improved and compliance needs to be checked/assessed in many places. This applies to companies and individuals. Construction waste seems to be a problem for many cities.
  • TrashOut reports seem to be created in pulses, not on a regular basis. For all the cities examined, there are certain months when the number of reports created exceeds the average by far. Moreover, reports seem to be generally created in specific parts of a city (e.g. London).


On a global scale

  • Plastic waste production is high across the globe, making it an international problem.
  • Among all the socio-economic factors, population plays a strong role in the production of waste in the world (legal and illegal).
  • The level of development and socio-economic factors (infrastructure, sanitation, education, among others) play an important role in the kind of waste produced by countries.
  • In particular:
  1. Small population developed countries: high production of glass, metal, and paper/cardboard waste, and the highest production of yard/green waste.
  2. High population developed countries: high amounts of glass and paper/cardboard and the highest production of special waste and total municipal solid waste. These countries are also associated with the lowest production of organic waste.
  3. High population developing countries: high amounts of organic, rubber/leather, and wood waste, and low amounts of glass.
  4. Small population developing countries/low income: most of the waste produced in these countries is from food and organic sources.


Possible Factors to Avoid Illegal Dumping

  1. Clean-up events can be organized in areas where many dumpsites already exist, to clean up the targeted areas in a fun, interactive, and educational way. Collaboration with local authorities should be put in place to improve existing waste infrastructure and build new infrastructure, if necessary.
  2. Areas identified as being at high risk of becoming new dumpsites could be targeted with waste infrastructure development/enhancement programs. Additionally, learning events could be organized to raise awareness about the risks of dumpsites, and about how to minimize or avoid them altogether by properly using the existing waste facilities.
  3. Examples of learning events consist of learning sessions on how to use waste infrastructure and recycling bins according to local/national authorities, the benefits of a sustainable way of living through the 4R cycle (Refuse, Reduce, Reuse, Recycle), and how to avoid single-use items (or a specific type of waste as can be highlighted by a city-level analysis).

How to Become a Data Engineer: The AI Plumber?


By Natu Lauchande
21st Century roadmap to becoming a Data Engineer

What is a data engineer?

In broad strokes, a data engineer is responsible for engineering the systems and tools that allow companies to collect raw data, arriving from a variety of sources and at varying volume and velocity, and turn it into a format consumable by the broader organization. The most common downstream consumers of data engineering products are the AI/Machine Learning and Analytics functions of a company.

The best way to start discussing this new and loosely defined role is the Data Science Hierarchy of Needs, brilliantly depicted by Monica Rogati in the pyramid below.







Source: The Medium post “The AI Hierarchy of Needs”



A data engineer is the lead player on the first three foundational rows of the pyramid: Collect, Move/Store, and Explore/Transform. A plethora of roles, such as Data Analysts, Data Scientists, and Machine Learning Engineers, take the lead in the higher phases of unlocking the value chain.

A Data Engineer is part of the function that provides the foundation for the highly critical job of the Data Scientists by hiding all the complexities involved in managing, storing, and processing the company's data assets. He or she is a master of data ingestion, enrichment, and operations.



Source: O'Reilly



With the deluge of data available within public and private companies, the ability to unlock this value is the critical factor in providing cheaper and better services to stakeholders and customers.


Skills of the trade

Data Engineers do come in different flavors and types. The core skills of the trade can be summarized below in order from essential to important:

  • Software Engineering: Data Engineering is, in essence, a discipline of Software Engineering, where the same rhythms and methodologies of work are applied to get the job done. The use of version control, unit testing, and agile techniques to ensure business alignment and quick delivery is paramount for success.
  • Relational Database/Data Warehouse Systems: Most data access in the data engineering space is democratized through ad-hoc querying in a relational database environment. This allows expert users with basic knowledge of SQL to retrieve the data they need to answer a business query or support a decision.
  • Scalable Data Systems/Big Data: It is central for the modern data engineer to understand data systems architectures. A good grasp of how distributed and parallel processing work is needed. Knowing the different types of indexing available in your environment, so that the data at your disposal can be processed properly and efficiently, is a great skill to have.
  • Operating Systems / Command Line: Familiarity with your local development environment (OS/*NiX/MIN) is essential, particularly the command line, where a lot of ad-hoc wrangling can happen.
  • Data Visualisation: A fundamental skill to effectively expose data products to a more general audience and quickly unlock data value through clear infographics, charts, and interactive analytics. Familiarity with a tool like Tableau, Superset, or Power BI is a must.
  • Data Science (Basics): An increasingly important user and stakeholder of a Data Engineering organization is the data science team. Understanding how data is used in the context of exploratory data analysis, machine learning, and predictive analytics ensures a virtuous cycle between critical data functions.
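The ad-hoc querying mentioned in the relational database item above can be illustrated with Python's built-in sqlite3 module. The table, columns, and values are invented for the example.

```python
import sqlite3

# An in-memory database stands in for a company data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (country, amount) VALUES (?, ?)",
    [("SK", 20.0), ("SK", 35.5), ("MZ", 12.0), ("MX", 50.0)],
)

# An ad-hoc business question: which countries bring in the most revenue?
revenue_by_country = conn.execute(
    "SELECT country, SUM(amount) AS revenue "
    "FROM orders GROUP BY country ORDER BY revenue DESC"
).fetchall()
conn.close()

print(revenue_by_country)
```

A user with basic SQL can answer such a question without knowing how the data was collected or stored, which is the kind of access a data engineering team aims to provide.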

Data Engineers don’t need to be experts in all of the areas above. Having core expertise in two of them and a good understanding of the others goes a long way in delivering value to a project.

A Data Engineer can come in different shapes and forms, so being very specific about your role is very important. As a nascent profession, it lacks standards and consistent job descriptions.

Successful data engineers typically transition from the following industry backgrounds:

Software Developer/Engineer, Data Scientist, Database Administrator, Business Intelligence Developer, and Data Analyst.


The path to mastery

To master data engineering I would start with the prerequisite of getting deep experience and expertise in two or more of the following areas.

  • Distributed Systems / Big Data
  • Database Systems / Data Warehousing
  • Software Development
  • Data Visualization

The most traditional path to mastery is a degree in a discipline with high computing exposure (CS, EE, Info Sys., Applied Maths/Phys, Actuarial Science/Q) or a quantitative degree, followed by a couple of years in Software Development or Data Science with practical exposure to backend services and production systems. That said, the data engineering field is full of rockstar engineers from non-traditional backgrounds (high school dropouts, literature majors, etc.).

A couple of top online courses and specialization available at the top websites ( Coursera, Udacity, Udemy, etc.) covering Big Data / Data Engineering tooling can give a good foundation to aspiring Data Engineers. The ones with the best reviews in your preferred learning platform will assist you in building a skill set for the role.

After these initial foundations, I would recommend the following books for fundamentals in architecture:

  • Designing Data-Intensive Applications, by Martin Kleppmann
  • The Data Engineering Cookbook, by Andreas Kretz
  • Foundations for Architecting Data Solutions, by Malaska et al.
  • Streaming Systems, by Akidau et al.
  • The Data Warehouse Toolkit, by Ralph Kimball

Nothing is more valuable at this stage than getting practical exposure in a real-world data engineer role. Keep practicing and growing the craft for the rest of your career.

Omdena, as an organization that promotes AI challenges with volunteers across the world, is an ideal place for anyone to sharpen their data engineering skills. In many Omdena challenges, data engineering is one of the most important skills needed: preparing data, setting up data pipelines, and operationalizing them.


Typical tools of the trade

With all the excitement in the field, a plethora of tools is popping up in the market, and because many of them overlap in what they do, knowing which one to use becomes a problem. In terms of complexity, a typical data engineering product or service does not differ much from any other software system.

A typical data engineering pipeline will require expertise in at least one tool per function/category:


1. Function: Pipeline Creation / Management

Apache Airflow

  • End to end workflow authoring and management tool.
  • Provides a computing environment where your processes can run.

Alternatives: Azkaban, Luigi, AWS SWF
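
To make the idea of workflow authoring concrete, here is a minimal sketch in plain Python of what an orchestrator does at its core: tasks declare which tasks they depend on, and a scheduler runs them in a valid order. It deliberately does not use Airflow's API. The task names (`extract`, `clean`, `aggregate`, `publish`) and the `run_pipeline` helper are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task after all of the tasks it depends on.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of upstream task names
    Returns the list of task names in the order they were executed.
    """
    # Include every task, even those without upstream dependencies.
    graph = {name: dependencies.get(name, set()) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        tasks[name]()
    return order


# Hypothetical four-step pipeline; each task just records that it ran.
log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "clean": lambda: log.append("clean"),
    "aggregate": lambda: log.append("aggregate"),
    "publish": lambda: log.append("publish"),
}
dependencies = {
    "clean": {"extract"},
    "aggregate": {"clean"},
    "publish": {"aggregate"},
}

executed = run_pipeline(tasks, dependencies)
print(executed)
```

A real workflow tool adds scheduling, retries, logging, and a UI on top of this ordering step, which is why picking one and learning it well pays off.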


2. Function: Data Processing

Apache Spark

  • A fundamental tool to process data in many formats at high scalability.
  • Allows facile enrichment and processing in SQL, Scala, and Python.

Alternatives: Apache Flink, Apache Beam, Faust


3. Function: Distributed Log/Queueing Systems

Apache Kafka — Scalable distributed queuing system that allows data to be processed and moved at a very high speed and large volumes.
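
The core idea behind a distributed log can be sketched in a few lines of plain Python: messages are appended to a log and kept there, and each group of consumers tracks its own read position (offset), so several downstream systems can read the same data independently. This is a toy, single-process illustration and not the Kafka client API; the `TopicLog` class and the event and group names are made up.

```python
class TopicLog:
    """A toy, single-process stand-in for one partition of a distributed log."""

    def __init__(self):
        self._records = []  # append-only list of messages
        self._offsets = {}  # consumer group name -> next offset to read

    def produce(self, message):
        """Append a message and return the offset it was stored at."""
        self._records.append(message)
        return len(self._records) - 1

    def consume(self, group, max_records=10):
        """Return unread messages for a consumer group and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


topic = TopicLog()
for event in ["signup", "login", "purchase"]:
    topic.produce(event)

# The "analytics" group reads in two batches; "billing" starts from the beginning.
first = topic.consume("analytics", max_records=2)
second = topic.consume("analytics")
billing = topic.consume("billing")
print(first, second, billing)
```

Because reading does not delete anything, adding a new consumer later never disturbs the existing ones.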


4. Function: Stream Processing

Alternatives: Apache Flink


5. Function: Data/File Format

Apache Parquet — Very efficient data format geared for analytics and aggregations at scale on cloud or on-premises.

Alternatives: Arrow, CSV, etc.
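
Parquet is a columnar format, and a small plain-Python sketch shows why that layout suits analytics: once the values of one field are stored together, an aggregation only has to touch that one column instead of every full record. The sketch does not read or write Parquet files; the `to_columns` helper and the sample records are invented for illustration.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


# Made-up sample records in a row-oriented layout (as in a CSV file).
rows = [
    {"sensor": "a", "reading": 10, "unit": "C"},
    {"sensor": "b", "reading": 14, "unit": "C"},
    {"sensor": "c", "reading": 12, "unit": "C"},
]

columns = to_columns(rows)

# An aggregation over one field now scans a single contiguous list.
total = sum(columns["reading"])
average = total / len(columns["reading"])
print(columns["reading"], total, average)
```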


6. Function: Data Warehousing / Querying


  • A cloud-based data warehouse system for structured and relational data storage and analytics.

Alternatives: AWS Redshift, Apache Hive, etc.


Keep in mind that tools come and go over the years. Focusing on the big picture and the functional areas will keep you up to date and ready to learn the next fancy tool.

Starting or joining an open-source project that uses any data engineering tool is a good move, both for your growth and for longer-term mentorship by captains of the industry.


The future

In order to fulfill the promise of unlocking the value of data, more investment in the Data Engineering space is expected. There’ll be increasingly intelligent tooling available to handle the current and future challenges around data governance, privacy, and security.

I can see AI and ML techniques being blended more and more directly into the Data Engineering toolchain, both for operations and for data quality assurance. A good example of such a tool is Deequ from AWS Labs, which applies machine learning to data profiling. Also at the center of modern Data Engineering are areas like synthetic data generation, which alleviates data privacy issues when the cost of acquiring data and staying compliant is too high. Tools and techniques to watch in the synthetic data space: Snorkel, and the use of generative adversarial neural networks to generate everyday tabular data.

With the rise of AutoML for prediction and data analytics, a central role will be given to the underpinning data infrastructure engineering of the datasets that drive the enterprise strategy. From here, we can only see an outlook of increasing relevance and opportunities to contribute positively to society.

I would like to acknowledge Laisha Wadhwa, James Wanderi, and Michael Burkhardt for their input and suggestions on the article.

Building the Future Artificial Intelligence Enabled NGO

Challenges, opportunities, and next steps to build the 21st century AI Powered NGO.

Artificial intelligence (AI) has the potential to help tackle some of the world’s most challenging problems. We are sure you have heard this before. It all sounds good in theory, but how do we walk the talk?

To find answers to this question, we organized a 1-hour webinar and panel discussion with some of the leading experts and organizations in the Humanitarian AI space.


 Source: Omdena LinkedIn


One thing all panelists agreed on is that, despite a lot of attention toward fancy and futuristic AI, there are hundreds of problems that can be addressed with current applications. AI has been around since the 1940s and can offer very pragmatic approaches to address real-world problems and generate value. Another big misconception is that organizations think they need to have a perfectly defined problem and dataset before they can leverage the potential of AI.

The reality is that there are no perfect problems and no perfect data, so we had better start creating value right now by shifting our mindset toward pragmatic and collaborative practices.


Case studies we discussed in the webinar

We wish this webinar and panel could have been longer to answer the many questions from the audience and discuss further how to apply AI most effectively to address urgent problems in the world.

In order to overcome challenges in the NGO space and harness the many opportunities, we all agreed on two essential ingredients: more collaboration, and diversity of ideas and backgrounds.

As John Zoltner from Save the Children concluded in the panel:

“Let’s get the data together!”




I Struggled with PTSD, Now I Help to Address It Through AI

Read the brave story of Anam from Pakistan, who struggled with Post-Traumatic Stress Disorder (PTSD) after her dad was in a critical health condition. She had to prepare for entrance exams while taking care of her siblings for several months.



It is truly amazing how many inspiring individuals have applied to our Collaborative AI Projects. We are very honored to share the story of Anam today, from whom we learned a lot just by speaking with and listening to her. Anam has been part of our AI challenge on building a machine learning model for PTSD assessment.


Anam’s Story

I am a Computer Science student and before that, I was actually a pre-medical student. I switched a lot of majors. The thing in Pakistan is, after studying biology, you can either become a doctor or a dentist. I wanted to do research but there weren’t many options. That is why I decided to switch to computer science and for that, we have to study mathematics before college. I took a gap year to study math so that I was eligible to apply to an engineering university.

It was very hard to convince my parents to let me study maths at first because they were convinced that being a doctor would be a better choice for me. They finally agreed, and I started studying maths. I had to complete two years of the syllabus, but I had only one year. Right after I started studying, my dad got appendicitis, and we went to get his appendix removed, but it ended up being more than that.

His intestines stopped working, and he was in the hospital for a few months after that. We were just hoping his intestines would start working so we could go home. Then the surgical wound from where they opened him up developed an infection. In order to get support, we had to move to Lahore, where the rest of my relatives live. When we moved to Lahore, they cleaned his wound, and it got infected again. He was on bed rest for about two months. His movements were minimal, which led to a pulmonary embolism (blood clots had lodged in his lung). One day, he was going to the bathroom, when all of a sudden he passed out and nobody knew what was happening.

Everybody was at the hospital and nobody could figure out what happened. The doctors thought maybe he had a heart attack. He was taken to the ICU. The doctors started giving him CPR. I think he was gone for a minute or two, but the doctors were successful at bringing him back. They put him on life support for a couple of days and that’s when we really lost all hope.

A couple of days later he finally woke up and we found out that he had had a pulmonary embolism.

I know a lot of people go through a lot of things and this is nothing compared to most of them. When my parents were in the hospital, I was looking after my siblings. I had to tell them what was happening.

At the same time, I also had to focus on my studies. Even though there were a lot of people who did support us, at these times you really find out who is actually there for you and who isn’t. And a lot of people backed out. My friends would be telling me, you should be with your dad instead of even worrying about your studies. To avoid talks like these, I would hide while I studied. So after four months of being in the hospital and staying at my relatives’, we could finally come home.

We were ecstatic.

I remember my mom telling me that she wanted to go out in the streets and shout for joy.


We came back home and it was all fine. I took my math exams after covering two years’ worth of syllabus in about four months, and under extreme stress at that.

I came back after my last exam, ready to prepare for my college entrance tests, and something odd happened. I fell sick out of the blue. I had nausea 24/7. I couldn’t eat or drink. I would vomit if I tried, and I started to lose weight.

My parents took me to multiple doctors thinking that my stomach was upset. Months went by but we couldn’t figure out what was wrong. Later, I was diagnosed with severe anxiety which stemmed from the incident with my dad.

I remember my mom telling me, “When your dad was in the ICU, I would sit outside it all night and every day there was a new body being taken out of the ward. So, every time I saw the doors open, I hoped it wasn’t your dad’s body.” I could understand her feelings because that was exactly how I felt every time my mom called me from the hospital. I felt my heart drop every time my phone rang.

The fear had gotten stronger and now I had severe anxiety accompanied by recurring panic attacks. The fear that I might lose my parents kept me up all night. Whenever one of them left the house I would call them repeatedly to check up on them. I never turned my phone on silent while in class, because I was always fearing a call with bad news.

I started medication, and my anxiety slowly started getting better. Throughout the recovery, my mother was always by my side. She distracted me when I had terrible thoughts. I felt safe only in her company.

My entrance test results finally came. I was accepted into one of the top CS universities in Pakistan.

My recovery still continues, but today I feel great because I have never been able to share my story publicly before. I have always been told to keep it quiet, as if talking about mental health problems is some sort of taboo. From my personal experience, I have realized that talking about it is what helps us get better. I hope I encourage people to speak up and share their stories.


Augmenting Public Safety Through AI and Machine Learning

In this demo day, we took a close look at the tremendous potential AI offers for making communities safer by helping to reduce, prevent, and respond to crimes. When it comes to public safety, it is often critical to act quickly. AI technologies can supplement the work of people, taking on monotonous and time-consuming tasks that would be impossible for humans to do effectively. Natural language processing can read and analyze public communications and news reports to detect potential problem areas and get ahead of violence. Of course, this work must be done responsibly and ethically.

Sharing her perspective on the impact that AI can have in keeping people safe was an expert in the field, ElsaMarie D’Silva, the Founder & CEO of the Red Dot Foundation. The Red Dot Foundation’s award-winning platform Safecity crowdsources personal experiences of sexual violence and abuse in public spaces. ElsaMarie is listed as one of BBC Hindi’s 100 Women, and her work has been recognized by numerous UN organizations and the SDG Action Festival.

To go a little deeper into the application of AI for public safety, we shared Omdena projects that took innovative approaches to make communities safer.


Case Study 1: Preventing sexual harassment through a safe-path finder algorithm

“UN Women states that 1 in 3 women face some kind of sexual assault at least once in their lifetime.”

With the first case study, the Omdena team drew upon Safecity’s crowdsourced data about sexual harassment in public spaces and leveraged open-source data to build heatmaps and calculate safe routes through major cities in India. Part of the solution is a sexual harassment category classifier with 93 percent accuracy and several models that predict places with a high risk of sexual harassment incidents to suggest safe routes.
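
One way to picture how a risk heatmap could feed a route suggestion is a shortest-path search in which every street segment costs its length plus a penalty proportional to its predicted risk. The sketch below is an illustration only and not the Omdena team's actual algorithm: it uses Dijkstra's algorithm as a stand-in, and the street graph, the risk scores, and the `risk_weight` parameter are all made up.

```python
import heapq


def safest_path(graph, start, goal, risk_weight=1.0):
    """Dijkstra's algorithm over edges costed as distance + risk_weight * risk.

    graph: dict mapping node -> list of (neighbor, distance, risk) tuples
    Returns (total_cost, path), or (float("inf"), []) if goal is unreachable.
    """
    queue = [(0.0, start, [start])]
    best = {start: 0.0}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbor, distance, risk in graph.get(node, []):
            new_cost = cost + distance + risk_weight * risk
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor, path + [neighbor]))
    return float("inf"), []


# Toy street graph: A to D either via B (short but risky) or via C (longer but safer).
graph = {
    "A": [("B", 1.0, 5.0), ("C", 2.0, 0.0)],
    "B": [("D", 1.0, 5.0)],
    "C": [("D", 2.0, 0.0)],
    "D": [],
}

shortest = safest_path(graph, "A", "D", risk_weight=0.0)  # ignores risk
safest = safest_path(graph, "A", "D", risk_weight=1.0)    # penalizes risk
print(shortest, safest)
```

With `risk_weight=0.0` the search returns the shorter route through B, while with `risk_weight=1.0` it prefers the longer but lower-risk route through C, which is the trade-off a safe-path finder has to tune.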




Case Study 2: Understanding gang violence patterns and actors through Twitter analysis

Our team worked in partnership with Voice 4 Impact, an award-winning NGO whose solution to violence in our communities addresses the questions people worldwide are asking: “How do we keep missing the signs?”

The Omdena team made use of natural language processing techniques, which are AI techniques that analyze text to understand what is being communicated. Machine learning algorithms were used to understand gang language, and AI models were built to detect violent messages on Twitter, without profiling. The aim is to predict and, ultimately, prevent gang violence.




Case Study 3: Analyzing Domestic Violence through Natural Language Processing (NLP)

Finally, we presented Omdena’s work to uncover domestic violence in India hidden due to COVID lockdowns. This work is part of a project with the award-winning Red Dot Foundation and Omdena’s collaborative platform to build solutions to better understand domestic violence and online harassment patterns during COVID-19. The project used natural language processing techniques with social media, government reports, and other text content to create a dataset with which Safecity could mobilize local efforts to protect and support domestic violence victims.







