Clear Data for Clear Skies: How We Used AI to Predict Air Quality in Poland

April 8, 2024


Smog in Krakow


The Problem

Air pollution is a particular problem in Poland. The annual EEA (European Environment Agency) reports on air quality show that Poland is among the countries with the worst air quality in Europe. Bad air quality affects people’s lives and constitutes a considerable health risk.

Mitigating air pollution could improve quality of life and lead to a healthier society while also reducing costs for the Polish healthcare system.

Broad policy decisions try to mitigate air pollution by targeting its symptoms with a limited knowledge of how it really works. An important step towards true mitigation would be the identification of the main factors and causes of air pollution specific to Poland.

This article explains how Omdena built a system that supports smarter decisions on controlling air pollution.

The Goal

While there is a great deal of data measuring air quality in Poland, it hasn’t been utilized in a unified system. Through the holistic analysis of this data in a unified form, we can identify the causes that contribute to the poor air quality in Poland and enable policymakers to take measures to limit their effects. Insights on where the issue is most severe also allow them to focus their efforts where the need is highest.

In addition, a predictive model would allow people to limit their exposure to polluted air through alerts and warnings issued by local authorities and organizations. It would also encourage people to do what they can to reduce their individual contribution to the problem.

Overall, a data-driven approach to air quality monitoring and policymaking can contribute significantly to creating a healthier and more sustainable environment in Poland. With this in mind, we decided on our goal:

Investigating the primary factors responsible for air pollution in Poland and developing an effective tool to predict air quality.

The Background

Smog

Why is Poland’s air quality so low?

Poland is still highly dependent on coal for its energy needs, especially in electricity generation and heating systems. Poland ranks second in Europe in coal mining, behind only Germany. Coal is also a large part of its economy, as it creates jobs and exports. This makes the people of Poland hesitant to move away from coal as an energy source despite measures taken by the government.

This problem becomes particularly apparent in winter. Millions of homes use coal to power their indoor heating systems, leading to a thick smog that sets in over many Polish cities. The government often has to warn the public to stay indoors and to wear face masks if they need to go outside.

The average age of cars in Poland is also higher compared to other countries in Europe. This means that people are still using older car models that have fewer features to make them more environmentally friendly.

A National Health Crisis

The air quality in Poland has wreaked havoc on the health of its residents. It is estimated that the effects of living in Warsaw all year can be as bad for one’s lungs as smoking 1000 cigarettes. Average annual concentrations of the carcinogenic benzo(a)pyrene reach 6 ng/m3, when the EU target value is set at 1 ng/m3.

The European Environment Agency estimates that every year around 48,000 people in Poland die prematurely due to the effects of air pollution. 

This affects the elderly the most. The air quality is estimated to shorten the life of an average Polish person by 9 months.

What has been done to solve this problem so far?

Poland’s government launched a €25 billion decade-long scheme aimed at tackling the country’s poor air quality. They plan to renovate 4 million homes and buildings by 2029 to equip them with better insulation and more efficient indoor heating.

Krakow, which is one of the cities most severely affected by poor air quality, banned the burning of coal in the city. This led to some improvement, but because the surrounding areas had no such ban, the effects were limited.

Warsaw also banned the burning of coal in the city. It also introduced measures to bar older car models from entering the city and created car-free zones in different parts of the city.

Challenges to Overcome

Filling in Gaps in Data

Although there are many weather stations in Poland collecting data on air quality, their distribution around the country is not uniform. In addition, the availability of data differs from station to station. As the level of air pollution in a certain area is affected by the levels in surrounding areas as well, the model had to be built to compensate for any gaps in the available data.

Unifying Datasets

The creation of this AI model required the integration of data from stations measuring air quality and stations tracking weather conditions. These stations used metrics that varied in their range and in the timeframe they covered. This, along with the uneven distribution of the stations, meant that unifying the various datasets was a challenging endeavor.

Our Approach

Step 1 – Translating from Polish to English and Organizing the Data

The data translation and cleaning process began with translating the Air Quality dataset from Polish to English and filtering it for the 2017-2021 period. After this, the separate files for each year and pollutant were merged into one. Even so, some data was missing because of the limited number of measurement stations and incomplete records.
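
A minimal sketch of this step, using only the Python standard library, might look like the following. The Polish column headers, the station code, and the function names are hypothetical placeholders chosen for illustration; they are not taken from the project's files or code.

```python
import csv
import io

# Hypothetical Polish-to-English header mapping; the real files may differ.
HEADER_MAP = {"Data": "date", "Kod stacji": "station_code", "Pomiar": "value"}


def load_translated(csv_text, pollutant, first_year=2017, last_year=2021):
    """Read one per-year, per-pollutant file, rename headers, keep the study period."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {HEADER_MAP.get(key, key): value for key, value in raw.items()}
        year = int(row["date"][:4])  # assumes ISO dates such as 2019-01-31
        if first_year <= year <= last_year:
            row["pollutant"] = pollutant
            rows.append(row)
    return rows


def merge_files(files):
    """Concatenate (pollutant, csv_text) pairs into one list of records."""
    merged = []
    for pollutant, csv_text in files:
        merged.extend(load_translated(csv_text, pollutant))
    return merged


# Tiny made-up inputs: one row falls outside the 2017-2021 window.
pm10_2016 = "Data,Kod stacji,Pomiar\n2016-12-31,MzWarTest,41.0\n"
pm10_2017 = "Data,Kod stacji,Pomiar\n2017-01-01,MzWarTest,55.5\n"
no2_2017 = "Data,Kod stacji,Pomiar\n2017-01-01,MzWarTest,30.2\n"
records = merge_files([("PM10", pm10_2016), ("PM10", pm10_2017), ("NO2", no2_2017)])
print(records)
```

Keeping the pollutant name as a column while merging is one way to hold all pollutants in a single long table; a wide layout with one column per pollutant would work as well.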

Additionally, stations with similar names were identified and merged into a single time series; approximately 30 such stations were found. Subsequently, the weather dataset was consolidated into a single file, with each text file including the station ID. The static annual dataset required minimal cleaning as it was already aggregated at the Powiat and Voivodeship level.

The datasets were meaningfully joined by first merging pollutant and weather datasets at the Powiat level, then combining the result with the static annual dataset. This process resulted in a dataset covering 198 of the 380 powiats in Poland.

Step 2 – Consolidating all the Data for Analysis

The datasets underwent several steps to prepare them for merging. Initially, coordinates were utilized to assign Powiat and Voivodeship information to air quality and weather stations. This process involved GeoJSON data and reverse geocoding techniques.
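
To illustrate the idea of assigning a station to a Powiat from its coordinates, here is a small sketch of a point-in-polygon lookup against GeoJSON-style boundaries. The ray-casting test, the function names, the `name` property, and the square toy boundaries are all assumptions made for this example; the project may well have used a geocoding library instead.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside a closed ring of [lon, lat] pairs?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        crosses = (y1 > lat) != (y2 > lat)
        if crosses and lon < (x2 - x1) * (lat - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside


def find_powiat(lon, lat, feature_collection):
    """Return the 'name' property of the first polygon containing the point."""
    for feature in feature_collection["features"]:
        geometry = feature["geometry"]
        if geometry["type"] != "Polygon":
            continue  # MultiPolygon shapes and holes are ignored in this sketch
        if point_in_ring(lon, lat, geometry["coordinates"][0]):
            return feature["properties"]["name"]
    return None


# Two made-up square "powiats" side by side (not real boundaries).
toy_boundaries = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "Powiat A"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [2, 0], [2, 2], [0, 2], [0, 0]]],
            },
        },
        {
            "type": "Feature",
            "properties": {"name": "Powiat B"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[2, 0], [4, 0], [4, 2], [2, 2], [2, 0]]],
            },
        },
    ],
}

print(find_powiat(1.0, 1.0, toy_boundaries))
print(find_powiat(3.0, 1.0, toy_boundaries))
```

Real administrative boundaries are often MultiPolygons with holes, so a production version would need to handle those cases.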

The static annual dataset, already aggregated, required further cleaning and distribution of data to the Powiat level. Ultimately, the datasets were merged into a single dataset, with columns representing different data types and rows containing unique Powiat entries, ready for subsequent merging and analysis.

To perform imputation of missing data, the nearest stations for a given primary station had to be found. Since it was decided to work on the Powiat level, this task translated to finding the nearest Powiat(s) to a given Powiat. This information was later used during data imputation for finding the closest powiat which contained air quality/weather data.
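
One way to rank neighboring Powiats by distance is to compare centroid coordinates with the haversine formula. The sketch below assumes that approach; the coordinates are rough, illustrative values and the function names are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_powiats(target, centroids, k=2):
    """Return the names of the k powiats whose centroids are closest to the target."""
    lat0, lon0 = centroids[target]
    others = [
        (haversine_km(lat0, lon0, lat, lon), name)
        for name, (lat, lon) in centroids.items()
        if name != target
    ]
    return [name for _, name in sorted(others)[:k]]


# Approximate (latitude, longitude) pairs, for illustration only.
centroids = {
    "Warszawa": (52.23, 21.01),
    "Pruszkow": (52.17, 20.81),
    "Otwock": (52.11, 21.26),
    "Krakow": (50.06, 19.94),
}

print(nearest_powiats("Warszawa", centroids, k=2))
```

Storing the ranked list for every Powiat once makes the later imputation step a simple lookup.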

Neighboring Powiats of Warsaw by distance

Step 3 – Analyzing the Data to Summarize Its Main Characteristics (Exploratory Data Analysis)

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

In AI projects, EDA helps surface patterns in the data that would otherwise remain hidden, and those patterns guide how the model is built.

The Air Quality datasets included daily measurements of pollutants from stations across Poland’s districts. However, most stations measured only one pollutant, and none covered all. Weather data featured various measures like Cloud Cover and Temperature, with few stations tracking multiple conditions. Merging these datasets posed a challenge due to differing station levels and Powiats having multiple monitoring stations. Additionally, the static dataset mainly offered annual data at the district or regional level.

Merging Datasets

To merge the datasets, we aggregated the data and then joined all three datasets on the Powiat each record belongs to. Following imputation, the air quality and weather datasets were aggregated at the Powiat level, and the mean pollutant values were calculated for each corresponding Powiat. To identify the closest neighboring Powiats, a separate dataset was created, listing all 380 Powiats with their nearest neighbors.

After imputation and aggregation, the weather data was merged with the Air Quality dataset, resulting in a dataset containing data for 198 powiats with all weather datasets filled. Finally, the static datasets were merged with the merged Air Quality and Weather dataset to create the final single dataset.
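
The imputation, aggregation, and merge described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: gaps are filled with the same-day mean of the nearest Powiat that has data, records are keyed by (powiat, date), and all names and values are made up.

```python
from statistics import mean


def impute_then_aggregate(readings, neighbours):
    """Average station readings per (powiat, date); fill gaps from the nearest powiat."""
    observed = {}
    for powiat, date, value in readings:
        if value is not None:
            observed.setdefault((powiat, date), []).append(value)

    result = {}
    for powiat, date, _ in readings:
        key = (powiat, date)
        if key in result:
            continue
        if key in observed:
            result[key] = mean(observed[key])
            continue
        # No measurement here: borrow from the closest neighbour that has one.
        for candidate in neighbours.get(powiat, []):
            if (candidate, date) in observed:
                result[key] = mean(observed[(candidate, date)])
                break
    return result


def join_with_weather(air_quality, weather, pollutant="pm10"):
    """Inner join of the two tables on the (powiat, date) key."""
    return {
        key: {pollutant: value, **weather[key]}
        for key, value in air_quality.items()
        if key in weather
    }


readings = [
    ("A", "2020-01-01", 40.0),
    ("A", "2020-01-01", 50.0),   # second station in the same powiat
    ("B", "2020-01-01", None),   # missing measurement
    ("C", "2020-01-01", 20.0),
]
neighbours = {"B": ["A", "C"]}   # nearest first
weather = {
    ("A", "2020-01-01"): {"temperature": -2.0},
    ("B", "2020-01-01"): {"temperature": -1.5},
}

air_quality = impute_then_aggregate(readings, neighbours)
merged = join_with_weather(air_quality, weather)
print(merged)
```

The inner join drops Powiats that lack weather data, which is one plausible reason a merged dataset ends up covering fewer Powiats than exist in total.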

(Left) Air Quality station locations before aggregation. (Right) Aggregated Mean Air Quality locations of 198 Powiats

Step 4 – Creating the System that Predicts

After preprocessing and exploring the data, the next step involved training and testing machine learning models to predict future CAQI levels and indexes. Supervised learning was chosen, with classification and regression techniques used to handle the categorical and continuous labels respectively.
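
To make the two kinds of labels concrete, the sketch below derives a categorical label from a continuous CAQI value using the commonly published CAQI bands (very low, low, medium, high, very high). How the project actually encoded its labels is not stated, so the function name and the handling of boundary values are assumptions.

```python
# Upper bounds of the CAQI bands; values above 100 count as "very high".
CAQI_BANDS = [(25, "very low"), (50, "low"), (75, "medium"), (100, "high")]


def caqi_class(index):
    """Map a continuous CAQI value (regression target) to a band (classification target)."""
    for upper, label in CAQI_BANDS:
        if index <= upper:  # boundary values go to the lower band (an assumption)
            return label
    return "very high"


daily_index = [12.0, 48.5, 60.0, 99.0, 130.0]
labels = [caqi_class(value) for value in daily_index]
print(labels)
```

A regression model would be trained on the numeric values and a classifier on the band labels, so the same daily data can feed both kinds of model.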

What we found out through Data Analysis

Key Takeaways

  • We discovered that though weather has a notable impact on air quality levels, the effect is indirect. Man-made factors had a higher direct impact on the air quality in Poland.
  • Sundays had particularly low pollution levels compared to other days of the week as there is reduced economic activity on that day. This can be attributed to Sunday being a day off for most people as Poland is predominantly Catholic. 
  • The impact of public and school holidays depended on whether the occasion was one that resulted in higher human activity. Occasions where people stayed in their homes resulted in better air quality.
  • The air quality was significantly worse during Winter compared to Summer. This points to the burning of coal for heating as the leading cause of poor air quality.
  • Though the COVID-19 pandemic resulted in higher overall air quality due to reduced human activity, the variable of indoor heating during winter remained constant. This meant that in the long term, the positive impact of lockdowns on air quality was limited.

Analyzing the ‘When?’

The data was analyzed by time at three different levels – Weekly, Monthly, and Annually. Through this analysis, we were able to garner insights on how the different pollutants impacted air quality at different times.
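
Grouping daily readings by weekday, month, and year is a simple way to produce these three views. The sketch below uses made-up values and a hypothetical helper `mean_by`; it is not the project's code.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def mean_by(readings, key):
    """Group (day, value) pairs with `key(day)` and average each group."""
    groups = defaultdict(list)
    for day, value in readings:
        groups[key(day)].append(value)
    return {group: mean(values) for group, values in sorted(groups.items())}


# Made-up daily concentrations for one pollutant.
readings = [
    (date(2020, 1, 4), 60.0),   # Saturday
    (date(2020, 1, 5), 40.0),   # Sunday
    (date(2020, 1, 6), 70.0),   # Monday
    (date(2020, 7, 5), 20.0),   # Sunday
    (date(2021, 1, 10), 50.0),  # Sunday
]

by_weekday = mean_by(readings, lambda d: d.weekday())  # 0 = Monday, 6 = Sunday
by_month = mean_by(readings, lambda d: d.month)
by_year = mean_by(readings, lambda d: d.year)

print(by_weekday)
print(by_month)
print(by_year)
```

Passing the grouping rule as a function keeps one helper for all three levels of aggregation.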

Weekly

Aggregated Weekly Mean Pollutant Concentrations

More than 85% of Polish people identify as Catholic; most attend weekly church service on Sunday. Most shopping malls, supermarkets and smaller shops are closed on Sundays, which could potentially result in less traffic. This could explain lower weekly pollutant concentrations on Sundays.

Conversely, Ozone concentrations are higher during the weekend. Ozone is a secondary pollutant; it is not directly emitted by traffic, industries, etc. but is formed on warm summer days by the influence of solar radiation on a cocktail of airborne pollutants. Less traffic during weekends reduces nitrogen oxide emissions, and because freshly emitted nitric oxide breaks down ozone, less of it is removed from the air on those days.

Monthly

Aggregated Monthly Mean Pollutant Concentrations

The above plots illustrate the aggregated monthly mean pollutant concentrations. Ozone levels are higher during spring and summer. Conversely, the other pollutants are higher during autumn and winter. Higher concentrations of Ozone during spring and summer are a result of photochemical ozone production favored by high temperatures and intensive solar radiation during those months. By contrast, high levels of NO2 and particulate matter emissions during autumn and winter are mostly attributed to intensive burning of low-quality coal in coal furnaces for heating (Izabela Pawlak).

Annual

The plots below depict the annual mean concentration of pollutants from 2017 to 2021. Particulate matter concentrations were highest during 2017 and gradually decreased from 2018 onwards. Interestingly, all four pollutants were lowest during 2020 before picking up again in 2021. This could indicate the effects of the nationwide lockdowns in Poland during the COVID-19 pandemic.

Annual Mean Pollutant Concentrations

Less economic activity, closure of high power-consuming plants, suspension of air and railway traffic, reduction of car traffic, decrease of power production all led to a reduction of emissions into the atmosphere resulting in a marked improvement in air quality (ncbi.nlm.nih.gov).

Analyzing Overall Air Quality through CAQI (Common Air Quality Index)

CAQI summarizes the overall Air Quality in Poland in a single, easy-to-read value based on all of the above pollutants. The analysis of CAQI trends shows that the CAQI has a downward trend in general, the lowest being during 2020. The plots below depict the weekly, monthly and yearly trends of CAQI values.
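
As a rough illustration of how such an index combines several pollutants, the sketch below computes a sub-index per pollutant by linear interpolation between breakpoints and reports the worst one. The breakpoint values are quoted from memory of the hourly background CAQI grid and the choice of pollutants is assumed, so both should be checked against the official definition before any real use.

```python
# Concentrations (ug/m3) that map to index values 0, 25, 50, 75 and 100.
# Assumed values for illustration; verify against the official CAQI grid.
GRID = {
    "NO2": [0, 50, 100, 200, 400],
    "PM10": [0, 25, 50, 90, 180],
    "O3": [0, 60, 120, 180, 240],
    "PM2.5": [0, 15, 30, 55, 110],
}
INDEX_STEPS = [0, 25, 50, 75, 100]


def sub_index(pollutant, concentration):
    """Interpolate linearly between the two breakpoints around the concentration."""
    points = GRID[pollutant]
    for i in range(1, len(points)):
        if concentration <= points[i]:
            low, high = points[i - 1], points[i]
            return INDEX_STEPS[i - 1] + 25 * (concentration - low) / (high - low)
    # Above the last breakpoint: extend the final segment (index exceeds 100).
    low, high = points[-2], points[-1]
    return 75 + 25 * (concentration - low) / (high - low)


def caqi(concentrations):
    """Overall index = the worst (highest) sub-index among the pollutants given."""
    return max(sub_index(name, value) for name, value in concentrations.items())


sample = {"NO2": 75, "PM10": 90, "O3": 30}
print(caqi(sample))
```

Taking the maximum rather than an average means a single badly polluted component is enough to push the overall index up.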

Aggregated Daily Mean CAQI values from 2017 to 2021

Weekly & Monthly Seasonality of CAQI levels, followed by Annual Trends

Generally, CAQI values are lower on weekends than on weekdays. CAQI levels are also lower during spring and summer. They then gradually increase during autumn before reaching their highest levels during winter. The annual CAQI trends are similar to those seen earlier for particulate matter, with CAQI levels lowest during 2020.

How does weather factor in?

Scatterplot of daily mean CAQI levels and various weather conditions

Here’s what we discovered about how different Weather Conditions affected air quality:

  • Cloud Cover: CAQI levels are evenly distributed from low to high levels of Cloud Cover, which suggests Cloud Cover alone may not have a huge impact on CAQI levels.
  • Wind Speed: Groups of high CAQI levels are located at the lower Wind Speed ranges. The higher the Wind Speed, the lower the CAQI.
  • Humidity: CAQI levels are noticeably higher at higher Humidity levels.
  • Precipitation: Higher CAQI levels are observed only at lower Precipitation levels.
  • Snow Depth: CAQI levels are comparatively higher at higher Snow Depth levels. But temperature could be a confounding variable here, as snow only accumulates when temperatures are low.
  • Temperature: There is a clear non-linear relation between Temperature and CAQI values, where CAQI levels are higher at lower Temperatures. This could be attributed to the use of coal burners for heating during winter in Poland. At higher Temperatures, CAQI levels are significantly lower, indicating better Air Quality in spring and summer.
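
Relationships like these can also be summarized numerically. Because some of them are non-linear, a rank-based measure such as Spearman's correlation is a reasonable choice; the sketch below implements it from scratch on made-up values and is not the analysis that produced the scatterplots.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up daily values: CAQI falls as wind speed rises.
wind_speed = [1.0, 2.0, 3.0, 5.0, 8.0]
caqi_level = [90.0, 70.0, 65.0, 40.0, 30.0]
print(spearman(wind_speed, caqi_level))
```

A coefficient near -1 or +1 indicates a consistent monotonic relationship, while a value near 0 matches the evenly spread pattern described for Cloud Cover.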

Impact of COVID-19 Lockdowns

Daily CAQI Time Series. COVID-19 Lockdown periods are indicated in orange. The green line represents the 90-Day rolling mean of CAQI

The first case of COVID-19 in Poland was registered on March 4th, 2020. Governmental measures significantly restricted social and economic activities. As seen in the previous section, 2020 saw the lowest CAQI levels in Poland. Notably, the first lockdown fell during spring, which historically has lower CAQI levels. The 90-day rolling mean of CAQI levels shows a drop well before the first COVID-19 lockdown in 2020. Subsequent lockdowns were during autumn and winter, which historically see an increase in CAQI levels as more people use coal stoves for heating.
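
A rolling mean of this kind can be computed with a fixed-size window over the daily series. The sketch below uses a trailing window, which is an assumption; the plot may have used a centered one.

```python
from collections import deque


def rolling_mean(values, window=90):
    """Trailing mean over the last `window` values; None until the window is full."""
    buffer = deque(maxlen=window)
    total = 0.0
    smoothed = []
    for value in values:
        if len(buffer) == window:
            total -= buffer[0]  # the oldest value is about to be pushed out
        buffer.append(value)
        total += value
        smoothed.append(total / window if len(buffer) == window else None)
    return smoothed


# Short demo with a 3-day window so the result is easy to check by hand.
print(rolling_mean([1, 2, 3, 4, 5], window=3))
```

With a trailing 90-day window, a value plotted on a given day reflects roughly the previous three months, so a seasonal decline shows up in the smoothed line gradually.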

Impact of Public Holidays and School Holidays

Boxplot distribution of CAQI levels for various Public Holidays and observances in Poland

The above boxplot illustrates how CAQI levels vary across different holidays in Poland. The Non-Holiday category covers days that are not a public holiday. Certain holidays such as Valentine’s Day, All Saints’ Day, Independence Day and National Day of the Victorious Great Poland Uprising see higher median CAQI levels than other days.

The school holidays also affect overall CAQI levels in Poland. In summer and winter, median CAQI levels are comparatively lower given the school holidays.

Man-Made Factors vs. Weather

The EDA revealed that man-made activities are the main source of pollution in Poland. Weather conditions only have an indirect effect on the overall air quality. Data related to human activities in Poland such as land use, crop production, emission of particles and pollutant gases, forest area and fires, population density, production of electricity, vehicle types, and air pollution reduction systems were only available at an annual level.

Recording data of these features at a daily level could be a valuable tool in identifying air quality levels in Poland. 

Testing and Evaluating the Predictive Model

To assess the trained model’s forecasting accuracy, the model was retrained on both the training and test data and then used to forecast CAQI levels for Warsaw, the capital of Poland, from Jan 2022 to Jan 2023. The forecasted and actual CAQI levels are shown below.

Forecasted vs Actual Warsaw CAQI levels from Jan 2022 to Jan 2023

The model successfully captured the trend and seasonality of the Air Quality levels. 

The model forecasts for the first 2 months are very close to the actual CAQI levels. As the forecasting horizon increases, the model tends to overpredict CAQI levels, but this is not unusual as long-term forecasting is more difficult than short-term forecasting.
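
One way to quantify how accuracy changes with the forecasting horizon is to compute the mean absolute error and the mean signed error (bias) for consecutive blocks of days. The sketch below does that on made-up numbers; the function name and the block size are illustrative choices, not the evaluation used in the project.

```python
from statistics import mean


def errors_by_horizon(forecast, actual, bucket_size=30):
    """Return a (mean absolute error, bias) pair for each block of `bucket_size` days."""
    report = []
    for start in range(0, len(actual), bucket_size):
        stop = start + bucket_size
        diffs = [f - a for f, a in zip(forecast[start:stop], actual[start:stop])]
        mae = mean([abs(d) for d in diffs])
        bias = mean(diffs)  # positive bias means the model overpredicts
        report.append((mae, bias))
    return report


# Made-up values: close at first, overpredicting later.
forecast = [50.0, 52.0, 60.0, 70.0]
actual = [49.0, 53.0, 50.0, 55.0]
report = errors_by_horizon(forecast, actual, bucket_size=2)
print(report)
```

A bias that grows more positive in later blocks would correspond to the overprediction at longer horizons described above.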

Possible Next Steps

Tuning the model further with non-public data

More data almost always improves modeling in AI. Our model used data that was publicly available for downloading. Access to data that is not publicly released could help create a better understanding of the problem and better predictions.

Collection of important data on a daily level

The project highlighted the leading man-made causes of air pollution in Poland. However, we found that a lot of this data is only available at an annual level. Accurately measuring the effects of man-made activities on air quality will require more comprehensive data collection.

Measuring the impact of Policy Decisions

As the model can predict air quality based on current variables, it can help measure how much a particular policy impacts air quality based on the difference between the predictions and the actual measurement after a policy is adopted. This means policymakers will have a way to know what works and what doesn’t.
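
A back-of-the-envelope version of this comparison is sketched below. It measures the gap between observed and predicted values after a policy starts and subtracts the gap seen beforehand, so that the model's usual error is not mistaken for a policy effect. That baseline correction, the function name, and the numbers are assumptions added for illustration.

```python
from statistics import mean


def mean_gap(predicted, observed):
    """Average of (observed - predicted) over a period."""
    return mean([o - p for o, p in zip(observed, predicted)])


def policy_effect(pred_before, obs_before, pred_after, obs_after):
    """Change in the observed-minus-predicted gap once the policy is in force."""
    return mean_gap(pred_after, obs_after) - mean_gap(pred_before, obs_before)


# Made-up CAQI values around a hypothetical policy start date.
effect = policy_effect(
    pred_before=[60.0, 64.0],
    obs_before=[62.0, 66.0],
    pred_after=[60.0, 62.0, 58.0],
    obs_after=[50.0, 55.0, 51.0],
)
print(effect)
```

A negative value suggests air quality was better than the model expected after the policy, although weather anomalies and other changes in the same period can produce the same signal.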

Use as a warning system

High levels of air pollution can cause adverse effects on the health of people exposed to it. Issuing alerts based on the foreknowledge of air quality can help people protect themselves better by staying indoors or using protective gear such as facemasks.

Time Frame

The entire project was completed in a one-month period, between February and March 2023. In this time we were able to achieve all of the following:

  • Collecting and aggregating data from hundreds of publicly available sources
  • Analyzing this data through comprehensive EDA and feature engineering
  • Training and testing the predictive model using the insights gathered in the analysis

Further Applications of This Technology

There are many more places in the world that suffer from this same issue due to different factors. Using our system to analyze the causes and create predictive models can help these cities make the lives of their citizens better and healthier.

The system is also versatile and can be adapted for other forms of pollution and ecological phenomena. Using this framework for such applications can help improve the health of millions of people.

  1. Manufacturing: By analyzing air quality, manufacturers can reduce emissions, comply with regulations, and enhance workplace safety.
  2. Transportation: Businesses can optimize routes, minimize emissions, and improve public health by considering air quality in logistics and transit planning.
  3. Energy Production: Energy companies can minimize emissions, comply with regulations, and identify suitable locations for renewable energy projects using air quality data.
  4. Agriculture: Farmers can minimize emissions and protect air quality by managing fertilizer use and livestock waste based on air quality analysis.
  5. Real Estate and Construction: Developers can ensure compliance, promote indoor air quality, and minimize environmental impact by considering air quality in building projects.
  6. Healthcare: Healthcare providers can improve patient outcomes and public health by using air quality data to understand the impacts of pollution on respiratory health and develop targeted interventions.
  7. Marketing: Based on the insights discovered, brands and companies can create activities and campaigns that raise awareness or help reduce their own contribution to air pollution.

By incorporating air quality considerations into their operations, businesses can improve efficiency, reduce environmental impact, and promote sustainability and public health.
