Author: Rehab Emam
African governments are using significant portions of public budgets to finance infrastructure, but that infrastructure often responds to past or current needs rather than future needs driven by expected changes such as climate change, migration, and urbanization. In the Omdena Challenge “Building an ML Model to Predict Future Infrastructure Needs of Africa for Policy Makers”, 50 collaborators collected open-source data to predict the infrastructure needs of several African countries. The team ran many tasks to gather all available data that could help estimate and predict different kinds of infrastructure needs in African countries, and visualized the results in an easy-to-use Streamlit dashboard.
It’s important to look at a data science problem from different angles and to use all available tools, such as natural language processing, remote sensing, route planning, data analysis, and machine learning modeling, to deliver the best solution. In this article, we show the product the team delivered to visualize all their data and predictions: an interactive Streamlit dashboard that makes it easier for users to explore the available data and predictions and so make better policies and decisions.
The team began with NLP steps: analyzing written material, from tweets to mass media, through sentiment analysis, topic modeling, and other statistical analyses. For the Twitter data, the team collected tweets on 14 different topics using 90 variations of country names. They also obtained academic papers from Semantic Scholar and Google Scholar. The task focused on sentiment analysis of different aspects of infrastructure over time and by country. Beyond mere positive/negative sentiment, the team also tried to draw meaning from the context.
You can read the full steps and analysis here.
They have outcomes for Kenya and Nigeria that cover finance, education, and transportation.
To accomplish that, a series of keywords was defined for each topic, and tweets containing a defined keyword together with the word ‘Africa’ or the name of an African country were scraped. For example, for the topic ‘Finance’, the keywords ‘finance’, ‘investment’, ‘economy’, ‘economic’, ‘income’, ‘banking’, etc. were used.
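The filtering rule described above can be illustrated with a minimal sketch. It assumes the tweets are already available as plain strings; the function names and the short list of place names are hypothetical stand-ins (the team used 90 variations of country names), and whole-word matching is an assumption rather than the team's confirmed approach.

```python
import re

# Keywords quoted in the article for the 'Finance' topic.
FINANCE_KEYWORDS = ["finance", "investment", "economy", "economic", "income", "banking"]
# Illustrative subset only; the real list of place-name variations is not shown in the article.
PLACE_NAMES = ["africa", "kenya", "nigeria", "ghana", "burundi"]


def contains_any(text, terms):
    """Return True if any term occurs in text as a whole word (case-insensitive)."""
    pattern = r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def matches_topic(tweet, keywords=FINANCE_KEYWORDS, places=PLACE_NAMES):
    """A tweet qualifies if it has a topic keyword AND 'Africa' or a country name."""
    return contains_any(tweet, keywords) and contains_any(tweet, places)


tweets = [
    "New investment in Kenya's rail network announced",
    "Banking fees are too high",
    "Weather is lovely in Nigeria today",
]
selected = [tweet for tweet in tweets if matches_topic(tweet)]
print(selected)
```

Only the first tweet passes, because it is the only one that contains both a topic keyword and a place name.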
To understand the general (public) opinions towards a topic, sentiment analysis was conducted to classify the tweets into categories of different sentiments (positive, negative, and neutral).
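As a rough illustration of the three-way classification, the sketch below maps a polarity score to a sentiment label. The article does not say which sentiment model the team used, so the score range of -1 to 1, the 0.05 threshold, and the function name are all assumptions made for this example.

```python
def label_sentiment(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a sentiment label.

    The threshold is an assumed cut-off, not a value taken from the project.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores, as they might come from any polarity-scoring model.
example_scores = {"tweet_a": 0.62, "tweet_b": -0.31, "tweet_c": 0.0}
labels = {name: label_sentiment(score) for name, score in example_scores.items()}
print(labels)
```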
To make this more accessible and insightful, the dashboard provides a year slider that shows the top 40 most frequent words/phrases appearing in the tweets of a topic (split by sentiment) in a specific year.
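The counting behind such a view can be sketched as follows. This is a simplified illustration, not the team's code: it assumes each record is a hypothetical `(year, sentiment, text)` tuple, splits on whitespace, and counts single words only (no phrases, no stop-word removal).

```python
from collections import Counter, defaultdict


def top_terms(records, year, top_n=40):
    """Return the top_n most frequent words per sentiment for the given year.

    records: iterable of (year, sentiment, text) tuples (an assumed layout).
    """
    counts = defaultdict(Counter)
    for record_year, sentiment, text in records:
        if record_year == year:
            counts[sentiment].update(text.lower().split())
    return {sentiment: counter.most_common(top_n) for sentiment, counter in counts.items()}


records = [
    (2019, "positive", "new road new bridge"),
    (2019, "negative", "road closed"),
    (2020, "positive", "power grid"),
]
result = top_terms(records, year=2019, top_n=2)
print(result)
```

In a dashboard, the `year` argument would be driven by the slider value and the returned lists rendered per sentiment.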
With the network graphs, we can see what has been mentioned most frequently together with the different keywords of a selected topic. The central node of the network always represents a keyword (i.e. one that was used in tweet scraping on a topic), and the thickness of the edges (lines) represents how frequently the ‘connected’ words/phrases (peripheral nodes) appear together with that central node in the chosen year (the thicker the edge, the more frequent). Only the top 20 most frequent words/phrases are shown for each keyword.
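The edge weights of such a graph come down to co-occurrence counts. The sketch below is one simple way to compute them, under the assumption that a word co-occurs with the keyword when both appear in the same tweet; the function name is hypothetical, and stop words are deliberately left in to keep the example short.

```python
from collections import Counter


def cooccurrence_edges(tweets, keyword, top_n=20):
    """Count how many tweets mention each word together with the keyword.

    Returns (word, count) pairs; the count can be used as edge thickness.
    """
    counts = Counter()
    for text in tweets:
        words = set(text.lower().split())
        if keyword in words:
            # Each word is counted once per tweet, excluding the keyword itself.
            counts.update(words - {keyword})
    return counts.most_common(top_n)


tweets = [
    "investment in roads",
    "investment in ports",
    "roads need repair",
]
edges = cooccurrence_edges(tweets, keyword="investment")
print(edges)
```

Here the keyword "investment" is the central node, and each returned pair is a peripheral node with its edge weight.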
This is available on the Streamlit dashboard for a variety of topics like banking, economics, economy, facility, finance, income, investment, and market, and each category has its own set of targeted keywords.
African countries population
Further analysis and predictions were carried out on the population of African countries, divided into East Africa, North Africa, and West Africa, with maps showing population density at the administrative district level using open-source WorldPop data.
An exploratory analysis was carried out for 5 African countries: Burundi, Ghana, Nigeria, Benin, and Cameroon. It shows the mean electricity access in different districts. The dashboard visualizes the historical and current data alongside map visualizations of the models’ predictions.
Water stress index
Water Infrastructure Modeling: the team has been developing a model to predict a GRACE-based hydrological drought index in African countries with a high drought index, using climate models based on temperature and precipitation over certain time frames.
Spatial constraints and considerations
One more method, using QGIS, analyzes the distance score for important locations and points on the map, such as hospitals, markets, roads, powerlines, negative population, and positive population.
In this method, population density (the number of persons living in a commune) is combined with:
Communities (the location): The idea behind the method is to give an indication of development needs at a practical administrative level. This means the results cannot be examined at a very coarse resolution (like the national level), because then the distances would be zero (everything falls within the country); nor should the resolution be too fine (at the population level: 100m), because the data doesn’t allow for such analysis (for example, the road network doesn’t, at the moment, support routing). Communities both capture spatial differences and are detailed enough to be used in distance calculations.
Distance to Hospitals (Health): The distance to hospitals, or health-care facilities, is only one aspect of the SDGs, in particular SDG 3: “Ensure healthy lives and promote wellbeing for all at all ages”. Having access to health care is an important factor in reaching this goal. The method used is the Euclidean distance from community centers to hospitals.
Distance to Commercial Centres (Economy): To develop a sustainable economy, the distances from produce to market should be short and easy to travel. SDG 8, in particular, looks at decent work and economic growth, and it could be expected that communities closer to the commercial centers (provincial capitals) have a better chance of fulfilling this goal without additional support. Communities further away could benefit from incentives or development programs to encourage economic activities. The method again uses the Euclidean distance from community centers to commercial centers.
Distance to Power (Electricity): The goal of SDG 7 is to ensure access to affordable, reliable, sustainable, and modern energy. Distance to powerlines is only one of many possible infrastructural themes, and the technique can be used for other SDGs, such as water sources (in SDG 6). Method used: a cost function (raster) from powerlines, the maximum distance inside a community, and the QGIS operation Zonal Statistics.
Distance to Roads (Transportation): Sustainable transport is part of many Sustainable Development Goals (SDGs). In this example, the distance to a road is used as an indicator of how well a location (inside a community) is connected to the larger road network of that community, and thus as an indicator for development. Method used: a cost function (raster) from roads, the average distance inside a community, and the QGIS operation Zonal Statistics.
Population (the driver): The four SDG themes are set against the population distribution in Burundi, so that either the communities with the highest populations are selected (maximum impact) or those with the lowest populations (increased need for development). Method used: the sum of the population within a community boundary, with the QGIS operation Zonal Statistics.
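To make the zonal-statistics step concrete, the sketch below reproduces the idea in plain Python on a tiny grid. It is a stand-in for the QGIS Zonal Statistics operation, not a call to it: the rasters are represented as nested lists, the zone labels "A" and "B" are hypothetical communities, and the cell values are made up for illustration.

```python
from collections import defaultdict


def zonal_statistics(values, zones):
    """Aggregate raster cell values per zone.

    values: 2D list of cell values (e.g. population or distance).
    zones:  2D list of the same shape with a zone id per cell, or None outside any zone.
    """
    cells = defaultdict(list)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is not None:
                cells[zone].append(value)
    return {
        zone: {"sum": sum(v), "mean": sum(v) / len(v), "max": max(v)}
        for zone, v in cells.items()
    }


# Made-up 2x3 raster and its community zones.
values = [[10, 20, 5], [30, 40, 15]]
zones = [["A", "A", "B"], ["A", "B", "B"]]
stats = zonal_statistics(values, zones)
print(stats)
```

The three statistics correspond to the uses named above: the sum for population within a community, the mean for distance to roads, and the maximum for distance to powerlines.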
Distance calculations used
In this proof of concept, only 4 of the 17 SDGs are processed, but the method is not limited to these four, and it is possible to include additional SDGs in future analysis. Two different distance calculations are used:
- Euclidean distance: For the themes Hospitals and Commercial Centers, a simple Euclidean (as the crow flies) distance is calculated from each community to the nearest location. The Euclidean distance is a rough estimate; with a properly configured road network, it would be possible to calculate the distance much more accurately.
- Cost distance: For the distance to Roads and Powerlines, another approach is used. Here the distance is calculated from all locations to the nearest road or powerline, and the average distance within a community is used as an indicator of how connected the community is to the network. In both cases, the intent is to show the level of development, or the amount of infrastructure, in an area, and not to estimate which part of the population has (direct) network access.
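The two calculations can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: coordinates are planar (already projected, in the same units), the function names are hypothetical, and the network distance is approximated as the straight-line distance from each cell to the nearest network cell, which may differ from the cost function the team configured in QGIS.

```python
import math


def nearest_distance(origin, targets):
    """Euclidean distance from origin to the closest target point."""
    return min(math.dist(origin, target) for target in targets)


def mean_distance_to_network(cells, network_cells):
    """Average, over all cells of a community, of the distance to the nearest network cell."""
    return sum(nearest_distance(cell, network_cells) for cell in cells) / len(cells)


# Euclidean distance: community center to the nearest hospital (made-up coordinates).
community_center = (0.0, 0.0)
hospitals = [(3.0, 4.0), (10.0, 0.0)]
hospital_distance = nearest_distance(community_center, hospitals)

# Network distance: average distance from a community's cells to the nearest road cell.
community_cells = [(0.0, 0.0), (0.0, 2.0)]
road_cells = [(0.0, 1.0), (5.0, 5.0)]
road_distance = mean_distance_to_network(community_cells, road_cells)

print(hospital_distance, road_distance)
```

Replacing the mean with `max` over the same per-cell distances would give the maximum-distance variant used for powerlines.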
We have seen that the same problem can be approached through several data science fields. Working through all of them in one team provides comprehensive solutions for better strategies and decisions. Bringing it all together in one Streamlit dashboard makes it easier for the user to look at the problem from different perspectives.
To read more about delivering a data science product that shows your solutions and makes it more accessible and friendly to use, we recommend “Deploy a Model Using Docker as Endpoint and Pathology Mobile App” where they built an offline Android application to detect pathologies in Ultrasound images.