GeoSpatial Data Analytics Applied: A Friendly Guide to Folium and Rasterio
April 19, 2021
A friendly introduction to remote sensing and how-to guide on handling and visualizing geospatial data using Rasterio and Folium. Applying all code on Uganda, Africa, by using WorldPop data.
Prerequisites
❌ Limited knowledge in geographic information system;
✔️ Some Python programming and visualization skills
We didn’t have much prior experience working with remote sensing and geospatial data visualization. Still, we managed to make it happen, thanks to this collaborative project with Omdena on evaluating Africa’s infrastructure development needs!
The Project
Efficient and effective infrastructure is critical to economic transformation. Over time, Africa’s infrastructure investment needs have increased, with an estimated financing gap amounting up to $108 billion every year (Ref 1).
In response to the call by an Omdena challenge partnering with African Center for Economic Transformation (ACET), we joined a global community of data scientists, engineers, and enthusiasts. We collaborated on a project over a period of two months to evaluate and predict the infrastructure needs of various African countries using data science and the latest technology. The project aimed to support ACET’s ongoing work on infrastructure development.
As part of the project, we were tasked with retrieving population information of African nations. In particular, we needed some spatially disaggregated estimates at subnational, local levels. We were also interested in the comparison of annual population growth.
The Challenges in Population Estimation
Population analysis is regarded as extremely important in regional and urban planning. The main sources for such demographic data are the national census, typically conducted once every 10 years, along with national registers of births and deaths. However, survey-driven censuses are expensive to implement, are infrequently performed, and only provide population counts over broad areas. Moreover, in resource-poor settings, national registers are generally lacking or incomplete. These problems could be worse for some African countries. For example, the Democratic Republic of the Congo hasn’t had a census since 1984 (Ref 2).
Fortunately, recent advances in computing power, availability of regularly updated high-resolution satellite imagery, global positioning systems (GPS)-enabled field survey techniques, and statistical learning methods have given rise to opportunities for alternative approaches to producing reliable, spatially refined estimates of human populations.
Data collection and preparation
WorldPop’s open-source gridded population data [footnote 1] is a good example of the application of these methods and techniques. It combines data from neighborhood-scale microcensus surveys (population counts undertaken in small areas) with information from national-scale satellite imagery and digital mapping. In short, WorldPop leveraged machine learning modeling (random forest) to extrapolate high-resolution national population estimates from the relatively sparse microcensus data (including predicting populations in unsurveyed locations). In the Omdena project, we used Worldpop imagery data to help with our tasks in population estimation. Also, for the blog here, we are demonstrating (with modified code) how we visualized population distribution using this dataset.
¬¬Some features of WorldPop data (and why we used it in our project?)
- Datasets cover almost all countries on the African continent
- The gridded population data (i.e., raster images) are available at a spatial resolution as detailed as 3 arc seconds (approximately 100m at the equator)
- These high-resolution maps are available yearly from 2000–2020
- They are provided in the format (.tif) that can be easily analyzed and handled with commonly used programming and analytics software
Aggregating WorldPop Counts for Uganda
Let’s code along!
There are three main parts in the workflow to compute subregional estimates and visualize population distribution using the raster data. I use Jupyter notebook to perform this coding part.
1. Downloading and loading the gridded population images
There are a few ways to download the Worldpop data (.tif images) using some plugins outlined on their website. Here, we use wget
and its .download()
function to download the data directly from their FTP server.
To load and read the data, we use the rasterio
module.
You can download, read and plot this image above using the following code:
# Import required packages # -------------------- import wget import rasterio from rasterio import mask import matplotlib.pyplot as plt %matplotlib inline import numpy as np import geopandas as gpd import pandas as pd import folium # Download raster/tif file (download gridded population data from WorldPop) # -------------------- FILE_DIR = "data/worldpop" # d/l yr 2020 data url2020 = "ftp://ftp.worldpop.org.uk/GIS/Population/Global_2000_2020/2020/UGA/uga_ppp_2020.tif" wget.download(url2020, FILE_DIR) # also d/l yr 2019 data url2019 = "ftp://ftp.worldpop.org.uk/GIS/Population/Global_2000_2020/2019/UGA/uga_ppp_2019.tif" wget.download(url2019, FILE_DIR) # Read raster/tif file # -------------------- uga_tif = 'data/worldpop/uga_ppp_2020.tif' raster_uga = rasterio.open(uga_tif) pop_uga_data = raster_uga.read(1) # A crude way to count the population size represented in the image pop_uga_count = pop_uga_data[pop_uga_data > 0].sum() ## Plot raster/tif image # -------------------- def plot_raster(rast_data, title='', figsize=(10,10)): """ Plots population count in log scale(+1) """ plt.figure(figsize = figsize) im1 = plt.imshow(np.log1p(rast_data),) # vmin=0, vmax=2.1) plt.title("{}".format(title), fontdict = {'fontsize': 20}) plt.axis('off') plt.colorbar(im1, fraction=0.03) title = 'Population Distribution (2020) in Uganda (Log Scaled) n Est count: {}'.format(uga_pop_count) plot_raster(pop_uga_data, title)
To work out the estimated total number of people of the country, we can sum up the estimates represented by each value per grid cell (with this line specifically).
# A crude way to aggregate the population size represented in the image pop_uga_count = pop_uga_data[pop_uga_data > 0].sum()
2. Aggregating regional estimates at district levels
To extract the estimated count for a district, we need to know the information about the boundary of the district. A shapefile — a geospatial data format — stores this information (i.e. the digital boundaries of areas/features at a defined level). Here, we use the shapefile map for Uganda from the GADM website.
Let’s load and plot the shapefile (at administrative level 2).
# the shape files for Uganda were downloaded here: # https://gadm.org/ uga_gdf = gpd.GeoDataFrame.from_file("data/shpfiles/UGA/gadm36_UGA_2.shp") # plot the shapefile plt.rcParams['figure.figsize'] = 5,5 uga_gdf.plot(color="white", edgecolor="#2e3131") plt.title('Uganda: level 2 regions')
uga_gdf.head(3) #check the first few rows of the shapefile dataframe
This is what the geo-data frame (shapefile data) looks like, though we are only interested in getting the district names (‘NAME_1’) and ‘geometry’ from this dataset. (‘geometry’ contains the geospatial points of the polygon that defines the boundary of the district)
As a side note, spatial data may be formatted in different Coordinate Reference Systems (CRSs). A CRS defines how a two-dimensional, projected map relates to the real places on the earth. We can check the CRS as well as other metadata stored in the shapefile.
uga_gdf.crs #to see the CRS format of the file
We can apply the shapefile with the raster image to extract the estimated number of people of a district, for example, Apac, in Uganda (using rasterio
’s mask.mask
function).
# using mask.mask function from Rasterio to define the region of interest gtraster, bound = mask.mask(raster_uga, uga_gdf[uga_gdf.GID_1 == “Apac”].geometry, crop=True) gtraster[0][gtraster[0]>0].sum()
Now, putting everything together, we produce the local-scale population estimates of all the districts defined in the shapefile, for 2019 and 2020. On top of that, we compute the percentage change in population sizes over the year (follow the code below).
# the shapefiles for Uganda were downloaded here: # https://gadm.org/ # Load in the shapefile # -------------------- uga_gdf = gpd.GeoDataFrame.from_file("data/shpfiles/UGA/gadm36_UGA_2.shp") # Estimate the population size per defined district for each year (from the .tif image available for each year) # -------------------- for year in range(2019, 2021): # Read raster/tif file raster_uga = 'data/worldpop/UGA/uga_ppp_{}.tif'.format(year) pop_raster_uga = rasterio.open(raster_uga) pop_uga_data = pop_raster_uga.read(1) # loop through each defined district contained in the shapefile and use it as the mask to extract values _results = [] for i in uga_gdf['GID_1']: roi = uga_gdf[uga_gdf.GID_1 == i] # using the mask.mask module from Rasterio to specify the ROI gtraster, bound = mask.mask(pop_raster_uga, roi["geometry"], crop=True) # values greater than 0 represent the estimated population count for that pixel _results.append(gtraster[0][gtraster[0]>0].sum()) # save the estimated counts for each year in a new column uga_gdf[str(year)] = _results # also, compute the percentage change in estimated counts across years uga_gdf['growth_rate'] = uga_gdf[['2019', '2020']].pct_change(axis=1)['2020']*100
This is the final data frame.
Having extracted the estimated population sizes and then computed the percentage change (growth rate) by district levels, we can visualize them on a map for easier comparison using choropleth graphs.
3. Creating interactive choropleth maps with mask.mask
Folium
Before making the choropleth layer, we need a folium base map, and it is important to let the map know where it should be centered (i.e., specifying the centroid coordinates to the map). We can assign the centroid by taking the mean of the coordinates of all the polygons.
Once we have the centroid, we create the map object and choose a background tile. By default, ‘Open Street Map’ is chosen, and this is also the one we use.
At this point, the stage has been set. We are ready to create and overlay the choropleth plot to the map by passing the following parameters into the choropleth
function.
- .geo_data: geopandas data frame holding information of geographic geometry
- .data: data frame containing values to be used for plotting the choropleth maps
- .columns: the names of two columns — first column to be used as the key (‘NAME_1′ containing the names of the district in our case); second column are the values to be mapped (‘growth_rate’)
- .key_on: Variable in
geo_data
to bind the data to. It must start with ‘feature’ - .fill_color: Can pass a hex code, color name, or one of these color palettes: ‘BuGn’, ‘BuPu’, ‘GnBu’, ‘OrRd’, ‘PuBu’, ‘PuBuGn’, ‘PuRd’, ‘RdPu’, ‘YlGn’, ‘YlGnBu’, ‘YlOrBr’, and ‘YlOrRd’
- .fill_opacity: Opacity level, range 0–1
- .line_opacity: Geopath line color (boundary of the polygon)
- .legend_name: Title for data legend
# Estimate centroids of the country # -------------------- cent_x =uga_gdf['geometry'].centroid.x.mean() cent_y =uga_gdf['geometry'].centroid.y.mean() # Create a map object using Folium # -------------------- map_uga_popdist = folium.Map(location=[cent_y, cent_x], zoom_start=7, tiles='OpenStreetMap') # Create the choropleth map # -------------------- choro = folium.Choropleth(geo_data=uga_gdf, name='choropleth', data=uga_gdf, columns=['NAME_1', 'growth_rate'], key_on='feature.NAME_1', fill_color='YlOrRd', fill_opacity=0.6, line_opacity=0.8, legend_name= "Population size across Uganda's subregion" ).add_to(map_uga_popdist) # add labels to map choro.geojson.add_child(folium.features.GeoJsonTooltip(fields=['NAME_1', '2020', 'growth_rate'], aliases=['District', 'Est Population in 2020', 'Est growth_rate'], labels = False)) folium.LayerControl().add_to(uga_popdist_map) uga_popdist_map
Wait, something is still missing on the map. How do we know which district is which? To improve visualization and interaction, we can add some informative labels to the map. (In our example, we include ‘District’, ‘Estimated Population in 2020’, and ‘Estimated growth rate’ in the labels.) This is achieved by using the Tooltip
function, and it is possible to add one via the .add_child
method (check the third last line in the code block above).
Hooray!! We have made it almost to the end and created the choropleth map showing the population growth rates of Uganda districts in 2020!
You can learn more about folium and also check out the cool examples on their page https://python-visualization.github.io/folium/modules.html
Remarks — Some Considerations
By this point, we get the idea that WorldPop’s estimated counts can be reaggregated to any polygon shape. One common criticism of choropleth maps is that the shapes (polygon) aren’t usually uniform in size, so bigger areas can dominate a map, giving the impression that some particular colors, and the associated data points, are more prevalent. To tackle this issue, we may turn to an alternative way to create and assign an equally sized shape (e.g. hexagon) for each region. An approach like hexagonal binning may attenuate the artificial visible importance of different districts’ differencing sizes while maintaining their geospatial organization.
Next Challenge Up in Solving Real-Life Problems
Recently we have come across Driven Data Pump it Up Challenge: Tanzania, a crowdsourcing contest to predict the functionality of water pumps in Tanzania. We think that population information such as the size of the community and population density might be some important features to consider as they directly reflect the number of people affected by the infrastructure development.
With the coordinates of the water pumps and the WorldPop data, we can feature engineer populations around pumps by re-aggregating the WorldPop estimations into custom shape circles (approx. 1km, 5km, and 10km radii) around each water well. GeoPandas has a helper function called buffer
that expands a point to a circle-like polygon shape. We can create polygon shapes with GeoPandas’ helper function (illustrated in the image below) and aggregate the populations within each buffer shape.
The code for this plot
import pandas as pd import geopandas as gpd """ assume train_df, a pandas dataframe, exists with longitude and latitude columns """ # create geopandas dataframe gpd_df = gpd.GeoDataFrame(train_df, geometry = gpd.points_from_xy(train_df['longitude'], train_df['latitude'])) # approximately 10km radius gpd_df['geometry_1km'] = gp_df['geometry'].apply(lambda x: x.buffer(0.1))
This is how we conceptualize population information that can be used in this challenge. Let us know your ideas?!
Summary
From Zero to Hero
- We demonstrated in this blog some basic functions to get started with remote sensing and exploring geospatial data.
- As an example, we showed how population estimation at local/regional levels can be achieved through aggregating WorldPop raster data by administrative boundaries of a country.
- We also illustrated how to plot a basic choropleth map using Folium.
We were very thankful for everyone in our Omdena project who has supported and given us the insights, pointers, and guidelines when we needed them.
Footnotes
Gridded population datasets are estimates of population in small grid cells derived with a geostatistical model using census or small area population counts and a number of other spatial datasets.
This challenge is quite interesting, we had a lot of work and methodologies to apply, one of the other tasks we have applied through the challenge is explained thoroughly here.
References
- African Development Bank (AfDB). 2018. African economic outlook 2018: Macroeconomic Developments and Structural Change: Infrastructure and its Financing. Abidjan: African Development Bank.
- https://www.southampton.ac.uk/news/2020/10/worldpop-census-calculation-nigeria.page
This article is written by Johnny Lau, and Albert Um.
You might also like