Predicting Rainfall Through Weather Balloons Data and Machine Learning

January 24, 2022

In this case study, you can read how we used weather balloon data to build a 12-hour rainfall prediction model that helps mitigate the effects of climate change in western Africa. We compared different approaches, such as Convolutional Neural Networks, semi-supervised learning, and tree-based Machine Learning (ML) models, to build a classification model that predicts rainfall. We deployed the winning model in a Docker container.

The project partner: The use case in this case study stems from Kanda Weather Group, which hosted an Omdena Challenge as part of Omdena’s AI Incubator for impact startups.

What are Weather Balloons?

“Climate change is having a growing impact on the African continent, hitting the most vulnerable hardest, and contributing to food insecurity, population displacement, and stress on water resources. In recent months we have seen devastating floods, and an invasion of desert locusts and now face the looming specter of drought because of a La Niña event. The human and economic toll has been aggravated by the COVID-19 pandemic,” said WMO Secretary-General Petteri Taalas.

Predicting the weather accurately is very difficult. Climate change, the complexity of the Earth’s climate system, and the imperfections of current climate models make it very challenging. For African farmers, one piece of information is especially critical: the time of rainfall. Knowing whether it will rain on a certain day helps them make decisions about their water usage.

Kanda Weather Group has built a software solution that sends convertible digital currency stored on a blockchain to residents of West Africa for their effort in the labor-intensive process of launching a weather balloon. These weather balloons carry instruments up in the air to send back information on atmospheric pressure, temperature, humidity, and wind speed.

Figure 1: Kanda Weather Balloons empower African communities to collect weather data (https://kandaweather.org/about-kandaweather/)

The primary reason weather balloons have remained such a key part of the regional, national, and global forecast process is their unique ability to measure local and regional vertical profiles of temperature, moisture, wind speed, wind direction, and other pertinent properties of the atmosphere.

Weather balloons allow atmospheric scientists to study the atmosphere to better understand and predict how the lower to middle levels of the atmosphere may influence localized and regional weather conditions over some period of time. These local influences may help farmers and local communities to be aware of their daily weather and accordingly, they can take preventive measures for potential extreme weather conditions.

Next, you can read how we developed an ML model to predict rainfall in the next 12 hours based on the data from a weather balloon. For this project, the scope was limited to one city, Douala in Cameroon.

Data Preparation

Collecting the data

Building a machine learning model to predict rainfall from weather balloon data requires historical data. We aimed to gather data from at least the last 10 years. We decided to use the Global Hourly – Integrated Surface Database (ISD) for the rainfall data. The main reason was that this dataset consists of surface observations rather than the output of a meteorological model, which could contain more inaccuracies. For the weather balloon data, we used the Integrated Global Radiosonde Archive (IGRA), a database of radiosonde and pilot balloon observations from more than 2,800 stations located in almost every country.

Figure 2: Weather balloon dataset

To ensure data quality and that the data reflected the current climate of Douala, Cameroon, we chose to limit the data to the years 2008-2021. This gave us enough data to work with without compromising on relevance and accuracy. The IGRA dataset records physical data (temperature, humidity, wind speed, and wind direction) from weather balloons at different pressure levels. The different layers of the atmosphere affect the weather in different ways, so it is important to provide an ML model with data from different altitudes to make better predictions. During this project, we used the weather balloon data as features to predict the rainfall data.
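To illustrate how measurements at several pressure levels can become model features, here is a minimal sketch. It assumes the soundings are stored in a long format with one row per launch and pressure level; the column names (launch_id, pressure_hpa, temperature_c, rel_humidity) and the values are hypothetical and not taken from the IGRA files.

```python
import pandas as pd

# Hypothetical long-format soundings: one row per (launch, pressure level).
soundings = pd.DataFrame({
    "launch_id": [1, 1, 2, 2],
    "pressure_hpa": [1000, 850, 1000, 850],
    "temperature_c": [27.0, 18.5, 26.2, 17.9],
    "rel_humidity": [88.0, 75.0, 91.0, 80.0],
})

# One row per launch, one column per (variable, pressure level) pair.
wide = soundings.pivot(
    index="launch_id",
    columns="pressure_hpa",
    values=["temperature_c", "rel_humidity"],
)

# Flatten the two-level column index into names such as "temperature_c_850".
wide.columns = [f"{variable}_{level}" for variable, level in wide.columns]
```

Each row of wide then describes one launch, which is the shape a tabular classifier expects.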

Cleaning the data

The cleaning phase was very important in this project due to the amount of missing data and the large number of outliers. One of the main challenges was imputing missing data without creating incoherent data. For physical data, an incoherent value would be, for example, a temperature below absolute zero (-273.15°C).
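A coherence check of this kind can be sketched as follows. The column names and the plausibility limits are assumptions chosen for illustration; only the lower temperature bound (absolute zero) is a physical constant.

```python
import pandas as pd

# Hypothetical plausibility limits; real limits depend on the instruments used.
PHYSICAL_BOUNDS = {
    "temperature_c": (-273.15, 60.0),
    "rel_humidity": (0.0, 100.0),
}

def mask_incoherent(df, bounds=PHYSICAL_BOUNDS):
    """Return a copy in which physically impossible values are replaced by NaN."""
    cleaned = df.copy()
    for column, (low, high) in bounds.items():
        if column in cleaned:
            valid = cleaned[column].between(low, high)
            cleaned[column] = cleaned[column].where(valid)
    return cleaned

sample = pd.DataFrame({
    "temperature_c": [25.0, -300.0],
    "rel_humidity": [80.0, 120.0],
})
checked = mask_incoherent(sample)
```

Running the same check after imputation is one way to confirm that the filled-in values stay within physical limits.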

The dataset that required the most cleaning was the weather balloons dataset. The main task that we had to do was handling the missing values. In the whole dataset there were many missing values, and dropping them was not a good idea since we would end up with a very small and limited dataset.

Adding missing values

We decided to fill the missing values using a KNN imputer because the algorithm is easy to understand and it gave us coherent results. To impute a missing value, the algorithm takes the average of that feature over the ‘k’ most similar data points.
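As a minimal sketch of this step, scikit-learn provides a KNNImputer class. The toy matrix and the choice of n_neighbors=2 below are assumptions for illustration, not the settings used in the project.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: rows are launches, columns are two features; np.nan marks a gap.
X = np.array([
    [25.0, 80.0],
    [26.0, 82.0],
    [np.nan, 81.0],
    [10.0, 30.0],
])

# n_neighbors=2 is an arbitrary choice for this toy example.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

Here the gap in the third row is filled with 25.5, the average of the first two rows, because they are the closest to it on the second feature.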

Dealing with outliers

Another challenge was dealing with outliers. Since this was a time-series dataset, we had to be extra careful when removing them. We experimented with eliminating outliers using an interquartile range (IQR) rule but decided, in the end, to leave the outliers in the dataset because they represent an important part of the data. Removing them would also have resulted in a dataset too small for modeling purposes.
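For reference, an interquartile range rule can be sketched as below. The multiplier k=1.5 is the common convention and an assumption here, as are the sample wind speeds. The function only flags values, so the decision to keep or drop them stays separate.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical wind speeds with one extreme reading.
wind_speed = [3.0, 4.0, 5.0, 4.5, 3.5, 40.0]
mask = iqr_outlier_mask(wind_speed)
```

Only the last reading is flagged. For a time series, flagged rows can be counted and inspected rather than deleted, which avoids leaving gaps in the sequence.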

Figure 3: High-level cleaning pipeline for the data

Results and Insights

Exploratory Data Analysis

Exploratory data analysis (EDA) is a crucial step before jumping into the modeling part. This stage can tell you whether the features you’ve chosen are good enough to model, whether all features are required, and whether there are any correlations that should be taken into account in subsequent steps.

One of the first things we checked was whether the rainfall data we had was consistent with the climate in Douala. For this, we compared it to climatology data, which gives the average number of rainy days per year and month.

Figure 4: Average Rainy Day Per Month ISD vs. Climatology

In the above chart, our rainfall data is represented in blue and orange, and the actual climate in Douala, Cameroon is in green. We can see a clear imbalance between the months: the winter season has fewer rainy days than the summer season. Our rainfall data doesn’t exactly follow the climatology data, but it is close enough.
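A comparison like this needs the average number of rainy days per month from the observations. The sketch below shows one way to compute it with pandas. The sample values are made up, and treating any precipitation above zero as a rainy day is an assumption.

```python
import pandas as pd

# Hypothetical daily rainfall totals in mm, indexed by date.
rain = pd.Series(
    [0.0, 5.2, 0.0, 12.1, 0.3, 0.0],
    index=pd.to_datetime([
        "2020-01-05", "2020-01-20", "2020-07-01",
        "2020-07-02", "2021-07-10", "2021-01-15",
    ]),
)

# Assumption: any measurable precipitation counts as a rainy day.
is_rainy = rain > 0

# Count rainy days per (year, month), then average the counts per month.
per_year_month = is_rainy.groupby([rain.index.year, rain.index.month]).sum()
rainy_days_per_month = per_year_month.groupby(level=1).mean()
```

The resulting series is indexed by month number and can be plotted next to the climatology values.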

During the analysis of the datasets, we also discovered that not all the years had the same amount of data. Our data spanned from 2008 to 2021, but we had some years with more than 300 weather balloon launches and some with fewer than 100. The chart below presents the distribution of data points per year. It is important to keep in mind the differences between the years.

Figure 5: Number of weather balloon launches per year

One last aspect of the data we explored was the distribution and skewness of the features. Some features had a roughly normal distribution with some skewness. For example, the wind U component, which describes the horizontal wind towards the east, was naturally skewed because the wind comes from the Atlantic Ocean, located to the west of Cameroon, and blows towards the east.
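Skewness can be measured directly with pandas. The following sketch uses synthetic data, and the cut-off of 0.5 for calling a feature skewed is a hypothetical choice, not a value from the project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: one symmetric feature and one right-skewed feature.
features = pd.DataFrame({
    "symmetric": rng.normal(size=1000),
    "right_skewed": rng.exponential(size=1000),
})

skewness = features.skew()

# Hypothetical cut-off: treat an absolute skewness above 0.5 as skewed.
skewed_features = skewness[skewness.abs() > 0.5].index.tolist()
```

A list built this way is the kind of input a log transformation step could be restricted to.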

Modeling (Machine Learning vs. Deep Neural Networks)

The objective of the project was to create binary classification models to predict whether or not it rained. We tried both Machine Learning models and Deep Neural Networks. Ultimately we found that the tree-based ML models performed the best and accurately represented the data.

Data Pre-processing

For the pre-processing of the datasets, we applied various scaling techniques such as MinMaxScaler and StandardScaler because the features had widely different value ranges. We also used a log transformation to reduce the skewness of some features.

We also found that the models had better results when we provided them with the month in which the data was collected. This is because the weather in Cameroon depends on the season: the winter is dry, and the summer is considered the rainy season.

To test all the combinations of preprocessing and feature engineering we created a preprocessing pipeline with a simple classifier model and we applied GridSearchCV to find the best performing pipeline.

First, we declare the preprocessing steps. The first step is to encode the month value. We also experiment with applying a log transformation on the skewed features.

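import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def log_transform(x):
    # log(1 + x), computed with log1p for better precision near zero
    return np.log1p(x)

# skewed_features is the list of skewed column names (defined elsewhere)
preprocessor = ColumnTransformer(
    transformers=[
        ("month_encoder", OneHotEncoder(handle_unknown="ignore"), ['month']),
        ("log_transform", FunctionTransformer(log_transform), skewed_features),
    ],
    remainder='passthrough'  # keep every other column unchanged
)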

Then we declare the pipeline and the parameters we want to test. For example, for the month encoder, we can either do nothing or use a OneHotEncoder. The same goes for the scaler, for which we decided to test different methods.

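from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, Normalizer, OneHotEncoder, StandardScaler

pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
# Each key names a pipeline step; 'passthrough' means the step is skipped
parameters = {
    'preprocessing__month_encoder': ['passthrough', OneHotEncoder(handle_unknown='ignore')],
    'preprocessing__log_transform': ['passthrough', FunctionTransformer(log_transform)],
    'scaler': ['passthrough', StandardScaler(), MinMaxScaler(), Normalizer()],
}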

Finally, we use GridSearchCV to loop through all the possible combinations of pipelines and find the best one. We used 5-fold cross-validation to keep the run time manageable.

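from sklearn.model_selection import GridSearchCV

# Track several metrics, but select and refit the best pipeline on accuracy
scoring = {'auc': 'roc_auc', 'accuracy': 'accuracy', 'f1': 'f1'}
grid = GridSearchCV(pipeline, parameters, cv=5, scoring=scoring, refit='accuracy', return_train_score=True).fit(X_train, y_train)
best_pipeline = grid.best_estimator_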

With this method, we could easily see which processing steps improved our models and which did not. Once we had a good preprocessing pipeline, we tried different models and applied hyperparameter tuning to the best ones.
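One way to see which steps help is to read the cv_results_ attribute of a fitted GridSearchCV. The sketch below is self-contained and uses synthetic data with a small decision tree as stand-ins, so the names pipe, param_grid, and search are illustrative rather than the project's objects.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the balloon features and the rain label.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])
param_grid = {"scaler": ["passthrough", StandardScaler(), MinMaxScaler()]}
search = GridSearchCV(
    pipe,
    param_grid,
    cv=5,
    scoring={"accuracy": "accuracy", "f1": "f1"},
    refit="accuracy",
).fit(X, y)

# One row per candidate pipeline, ordered by mean cross-validated accuracy.
columns = ["param_scaler", "mean_test_accuracy", "mean_test_f1", "rank_test_accuracy"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_accuracy")
```

Sorting by the rank column puts the best-scoring combination in the first row, which makes it easy to compare the effect of each option.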

Splitting the data

One thing we experimented with a lot was the splitting of the data. Initially, we used a train-test split based on the years. We kept the years 2008 to 2017 for training and 2018 to 2020 for testing. This gave us some good results, but we were also limited because not every year had the same distribution of rain. Using this method, we discovered that the training split had 65% rainy days and the testing split only 60%. This mismatch between the two splits could skew the evaluation and encourage overfitting to the training distribution.

Then, we decided to use a random train-test split with stratification. This ensures that both the train and test splits have the same distribution for the stratified variables. We chose to stratify on the rain and month variables. Just by splitting the data in this way, our models gained between 1% and 4% in accuracy.
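A stratified split on two variables can be sketched with scikit-learn by combining them into a single key. The data frame below is a hypothetical stand-in, and the 20% test size is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: one row per launch with its month and rain label.
df = pd.DataFrame({
    "month": [1, 1, 1, 1, 7, 7, 7, 7] * 5,
    "rain": [0, 0, 0, 1, 1, 1, 1, 0] * 5,
    "feature": range(40),
})

# Combine both variables into one key so every (rain, month) group
# is represented proportionally in the train and test splits.
strata = df["rain"].astype(str) + "_" + df["month"].astype(str)
train, test = train_test_split(df, test_size=0.2, stratify=strata, random_state=0)
```

Note that every combined group needs at least two rows, otherwise train_test_split raises an error.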

Tree-based models

For this classification problem, we decided to try many different approaches. Kanda Weather told us they had experimented with XGBoost classifiers and got good results. We decided to concentrate our efforts on tree-based algorithms such as decision trees and random forests, as well as advanced tree-based models like CatBoost, XGBoost, and LightGBM.
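Comparing several tree-based classifiers under the same cross-validation can be sketched as follows. To stay self-contained, the example uses only scikit-learn estimators on synthetic data, with GradientBoostingClassifier as a stand-in for the boosted models. XGBoost, CatBoost, and LightGBM are separate packages and are not used here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the balloon features and the rain label.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(mean_accuracy, key=mean_accuracy.get)
```

Keeping the folds and the metric identical across candidates makes the scores directly comparable.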

More complex approaches and deployment

Experimenting with Convolutional Neural Networks

We also experimented with deep learning models. One model that performed well was a convolutional neural network. This type of architecture is usually used for image recognition. Since this model expects a 2D image as input, we used the fact that each launch has two dimensions: the pressure level and the physical features corresponding to this pressure level.

In a way, the model treats the pressure as the height of an image and the different pixels represent the physical features (temperature, humidity, wind). The obvious advantage here is that the model can leverage spatial locality in the pressure dimension.

In an image, understanding the relationship between neighboring pixels helps the model understand the image as a whole; similarly, understanding the relationship between neighboring (in pressure terms rather than height/width) values of humidity, temperature, and so on can help the model understand weather information from a launch in its entirety.

Figure 6: Architecture of a CNN (source: researchgate.net)

In the image above, the input would be a rectangular matrix with each row representing a different pressure level and each column a different feature (temperature, humidity, and wind). In our case, the output would be binary: rain or no rain.
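To make the idea of locality in the pressure dimension concrete, here is a minimal NumPy sketch of a single convolution filter sliding over neighboring pressure levels. It is not the network described above: the launch values, the filter weights, and the function name conv_along_pressure are all assumptions for illustration.

```python
import numpy as np

# Hypothetical launch: 5 pressure levels (rows) by 3 features
# (columns: temperature, humidity, wind speed).
launch = np.array([
    [27.0, 88.0, 3.0],
    [22.0, 80.0, 5.0],
    [18.0, 75.0, 8.0],
    [10.0, 60.0, 12.0],
    [2.0, 40.0, 15.0],
])

def conv_along_pressure(x, kernel):
    """Slide a (k levels x all features) filter down the pressure axis."""
    k = kernel.shape[0]
    windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])
    # Each output is the weighted sum of one window of adjacent levels.
    return np.einsum("wlf,lf->w", windows, kernel)

# A filter that responds to the change between two adjacent levels.
kernel = np.array([[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]])
feature_map = conv_along_pressure(launch, kernel)
```

A convolutional layer learns many such filters instead of using hand-picked weights, and further layers turn the resulting feature maps into the rain or no-rain output.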

Experimenting with Semi-Supervised Learning

We also used semi-supervised learning. While our feature dataset (weather balloon launches) spans almost four decades, the precipitation target data was only available from 2008. Hence, supervised learning approaches were limited to just over a decade of weather information. Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. This approach reached 79% accuracy, but due to time constraints, we could not improve it further.
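The article does not state which semi-supervised algorithm was used. As one possible illustration, scikit-learn offers SelfTrainingClassifier, which repeatedly labels the unlabeled rows it is most confident about. The data, the base classifier, and the 0.9 confidence threshold below are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic stand-in for the balloon features and the rain label.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Pretend only the first 100 launches have a rain label; -1 marks unlabeled rows.
y_partial = y.copy()
y_partial[100:] = -1

# The base classifier and the confidence threshold are arbitrary choices.
model = SelfTrainingClassifier(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold=0.9,
)
model.fit(X, y_partial)
predictions = model.predict(X)
```

The quality of the pseudo-labels depends on how representative the unlabeled data is, which matches the concern about older launches in this setting.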

Model comparison

Table 1: Model Comparison

	CNN	Semi-supervised learning	XGBoost (Winner)
Performance (Accuracy)	77%	79%	82%
Limitations	It’s a complex model that is difficult to explain to a client.	We were unsure of the quality of the unlabeled data.	Accuracy depends on the number of pressure levels recorded.

After trying different models and machine learning algorithms we kept only the best-performing ones. Surprisingly, it was the advanced tree-based models that were performing the best.

REST API

We delivered a functional REST API to the client that could fetch the data of recent balloon launches, preprocess it, and make a prediction. To share our API with the client, we used Docker.
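The shape of such a prediction endpoint can be sketched with the Python standard library alone. This is a simplified illustration and not the delivered API: it accepts features in the request body instead of fetching launches, and the names predict_rain, handle_predict, PredictHandler, serve, the feature key rel_humidity_1000, and the toy rule standing in for the model are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict_rain(features):
    """Placeholder for the trained pipeline; a real service would call the model."""
    # Toy rule used only so the sketch runs without a model file.
    return features.get("rel_humidity_1000", 0.0) > 85.0

def handle_predict(body):
    """Turn a JSON request body into a (status code, JSON response) pair."""
    try:
        features = json.loads(body)
    except ValueError:
        return 400, json.dumps({"error": "invalid JSON"})
    if not isinstance(features, dict):
        return 400, json.dumps({"error": "expected a JSON object"})
    return 200, json.dumps({"rain_next_12h": bool(predict_rain(features))})

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload.encode("utf-8"))

def serve(port=8000):
    """Start the server; inside a container this would be the entry point."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Keeping the request handling in a plain function such as handle_predict makes it testable without starting a server.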

Figure 7: Deployment pipeline

Limitations

During this project, we discovered the domain of weather prediction and learned that it is not easy to make very accurate predictions, even for something as seemingly simple as whether it will rain today.

Our best-performing model reached 82% accuracy on the training set, but it is hard to predict how accurate this model will be once in production.
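The gap between accuracy on seen and unseen data can be illustrated with a small sketch. It uses synthetic data with deliberately noisy labels and a random forest as stand-ins, so the numbers it produces say nothing about the project's model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with 20% label noise to mimic a hard prediction task.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Accuracy on data the model has seen versus data it has not.
train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))
```

Accuracy measured on held-out data is usually the better guide to production behavior, although it still assumes that future launches resemble the historical ones.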

Conclusion

The main goal of the challenge was to develop a model that forecasts 12-hour rainfall based on weather balloon data. During the project, the team gathered resources in the form of articles, videos, and scientific papers. We downloaded relevant datasets, then analyzed, cleaned, and merged them. We also developed many different models, ranging from logistic regression to advanced tree-based models and deep learning models. We created an API so that the client can use our models and integrate them into their existing platform.

During the project, we only worked on one location but the preprocessing and modeling pipeline can be easily reproduced for other locations if there is data. We delivered analysis and reports that can help the client check the quality of rainfall data. We also provided best practices on how to prepare and split the data before modeling to maximize the model’s performance.

Weather balloons have a lot of potential for weather prediction in Africa. They have a very low cost compared to weather stations and can be deployed by anyone. They can be as accurate as more expensive systems for local weather and especially rain prediction.

—

This article was written by: Max Lutz, Adwait Kelkar, Deepali Bidwai, David R. Torres, Guneet S. Kohli, Govind Jeevan, Siddhanth Ramani, Eugene Gitonga Muiru, Mulugheta T. SOLOMON.
