A walkthrough of a Natural Language Processing (NLP) analysis pipeline for feature extraction that scrapes Twitter, Google, and 1200 PDF files through automated APIs. The overall approach allowed us to gather several billion dollars’ worth of not-for-profit grant data across six countries for further NLP analysis. Finally, the team built an interactive dashboard visualizing the distribution of the grants.
The use case: Where we applied our NLP analysis
Every year government, philanthropy, the private sector, and other grantmakers from across the globe allocate a significant portion of their budget for grants aiming to advance a variety of causes. Despite what seems like an abundance of funds flowing through the social sector, many not-for-profits suffer from a lack of resources. A significant reason is the lack of transparency of grant information.
It is estimated that as much as 80–90 billion Australian dollars in grants is disbursed each year.
Our Community is a social enterprise that provides information, tools, and advice to thousands of social sector organizations to support their crucial work of building stronger communities. Our Community’s Innovation Lab, together with Omdena, took on the challenging task of tackling unstructured information and building solutions that would facilitate positive social change for NGOs in need.
Traditional methods to find and monitor grants are time-consuming, expensive, and limited. The aim of this project is to help get money flowing between grantmakers (funders) and the not-for-profit sector, providing the necessary capital to enable positive social change.
BUT wait for it, where is the data?
As with any problem that leverages Artificial Intelligence to find a viable solution, a lack of data can hamper any real progress.
Data is growing at an astonishing rate every minute, and most of it is unstructured. Historically, unstructured data has been ignored because of the complexity of dealing with it, but since the majority of human information is embedded in this form, it can no longer be ignored.
That’s where NLP comes into the picture: a subfield of artificial intelligence that enables computers to understand and interpret human language.
In this article, we focus mainly on how we generated data from various unstructured sources until we had untangled data ready for NLP analysis.
How we got the unstructured data
A major chunk of the data was stored in PDF format. These files could offer valuable insights and therefore couldn’t be ignored.
In our case, we ended up with more than 1200 PDF files to be downloaded and scraped from various websites. This task would be cumbersome if done manually, so we decided to automate the entire process by designing a microservice that used RESTful APIs under the hood.
We used the Flask framework to develop the RESTful APIs, which were then deployed on AWS EC2 as a containerized service using Docker. A CSV file containing links to all the PDF files is uploaded to the service. The PDF files are automatically downloaded onto the EC2 instance and parsed using the GROBID service. The parsed data from all the PDFs is then collated into a single file, which is uploaded to AWS S3.
That’s the great thing about microservices: they are standalone programs that can be developed and readily deployed for different users. We wanted our code to be reusable, so building the microservice around RESTful APIs seemed like a no-brainer. Flask is a Python-based micro-framework that comes in handy for quickly developing small web-based applications.
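To make the flow concrete, here is a minimal sketch of such a service, not the code we deployed. The endpoint path `/process`, the form field `links`, the CSV column `url`, the bucket name, and the GROBID address are all assumptions made for illustration. Third-party packages (`requests`, `flask`, `boto3`) are imported lazily inside the functions that need them so the pure-Python helpers can be tried on their own.

```python
import csv
import io
import json

# Assumed address of a locally running GROBID container (hypothetical setup).
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def read_links(csv_text, column="url"):
    """Return the non-empty values of `column` from the uploaded CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column].strip() for row in reader if (row.get(column) or "").strip()]


def download_pdf(url):
    """Fetch one PDF and return its raw bytes."""
    import requests  # third-party, imported lazily

    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return response.content


def parse_with_grobid(pdf_bytes):
    """Send PDF bytes to GROBID and return the TEI XML it produces."""
    import requests  # third-party, imported lazily

    files = {"input": ("document.pdf", pdf_bytes, "application/pdf")}
    response = requests.post(GROBID_URL, files=files, timeout=300)
    response.raise_for_status()
    return response.text


def process_links(links, download=download_pdf, parse=parse_with_grobid):
    """Download and parse every link, recording failures instead of stopping."""
    results = []
    for url in links:
        try:
            results.append({"url": url, "tei": parse(download(url)), "error": None})
        except Exception as exc:  # one bad PDF should not abort the whole batch
            results.append({"url": url, "tei": None, "error": str(exc)})
    return results


def create_app(bucket="example-grants-bucket", key="parsed/grants.json"):
    """Build the Flask app; bucket and key are placeholder names."""
    import boto3  # third-party, imported lazily
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/process", methods=["POST"])
    def process():
        # The CSV of PDF links arrives as a multipart file upload.
        uploaded = request.files["links"]
        links = read_links(uploaded.read().decode("utf-8"))
        results = process_links(links)
        # Collate everything into a single JSON file and push it to S3.
        body = json.dumps(results).encode("utf-8")
        boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
        failed = sum(1 for item in results if item["error"])
        return jsonify({"processed": len(results), "failed": failed})

    return app
```

With GROBID and AWS credentials in place, calling `create_app().run()` would start the service, and a CSV could then be posted to `/process` as a file upload named `links`. Passing the download and parse steps as arguments keeps the batch logic easy to exercise without network access.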
By this stage, it had become apparent that Google was to be our go-to tool, so we certainly couldn’t stop now. We decided to automate the scraping process to collect all the data returned by Google searches for certain keywords, which we achieved through Apify.
After this, we certainly couldn’t ignore Twitter — after all, some of the major action is taking place on that platform. We decided to scrape relevant data from Twitter as well.
Once we had our unstructured data in a single place, the next challenge was feature engineering: preprocessing the data and extracting features for further NLP analysis. For this task, we brought in NLTK and spaCy, two powerful NLP libraries well suited to this kind of use case.
spaCy comes loaded with named entity recognition (NER) and part-of-speech (POS) tagging. NER helps in locating and classifying named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, and so on. Through spaCy’s NER we identified major features, including the grantmaker countries and the money being awarded. This became our final dataset for NLP analysis and visualization.
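As an illustration of this step, the sketch below pulls place and money entities out of a piece of text with spaCy. It is a simplified example rather than our exact feature extraction: the function names and the choice of the `en_core_web_sm` model are assumptions, and spaCy's `GPE` label covers cities and states as well as countries, so a real pipeline would need an extra filtering step.

```python
def extract_grant_features(nlp, text):
    """Collect place (GPE) and monetary (MONEY) entities found in `text`.

    `nlp` is any callable that returns an object with an `ents` sequence
    whose items expose `text` and `label_`, such as a loaded spaCy pipeline.
    """
    doc = nlp(text)
    features = {"countries": [], "amounts": []}
    for ent in doc.ents:
        if ent.label_ == "GPE":  # countries, cities, states
            features["countries"].append(ent.text)
        elif ent.label_ == "MONEY":  # monetary values, including the unit
            features["amounts"].append(ent.text)
    return features


def load_pipeline(model="en_core_web_sm"):
    """Load a spaCy pipeline; the model name is an assumed default."""
    import spacy  # third-party, imported lazily

    return spacy.load(model)
```

A typical call would be `extract_grant_features(load_pipeline(), parsed_text)`, where `parsed_text` is the text recovered from a PDF or a scraped page. The model has to be downloaded first with `python -m spacy download en_core_web_sm`.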
Now finally, the NLP analysis and visualization
After the initial cumbersome process of feature engineering, we reached the point of visualizing our structured data in a dashboard, which we built using Streamlit and Plotly. To visualize the data frame created after feature extraction, we displayed the dataset with interactive features that help the user understand the data better.
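A dashboard of this kind can be sketched in a few lines. The example below is a hypothetical layout, not our actual dashboard: the row fields `country` and `amount`, the widget labels, and the function names are assumptions. Streamlit and Plotly are imported lazily so the aggregation helper can be used on its own.

```python
from collections import defaultdict


def total_by_country(rows):
    """Sum grant amounts per country, largest total first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


def render_dashboard(rows):
    """Draw an interactive table and bar chart for the extracted grant rows."""
    import plotly.express as px  # third-party, imported lazily
    import streamlit as st  # third-party, imported lazily

    st.title("Grant distribution")

    # Let the user narrow the view to the countries they care about.
    countries = sorted({row["country"] for row in rows})
    selected = st.multiselect("Countries", countries, default=countries)
    visible = [row for row in rows if row["country"] in selected]

    st.dataframe(visible)  # sortable, scrollable view of the dataset

    totals = total_by_country(visible)
    fig = px.bar(
        x=list(totals.keys()),
        y=list(totals.values()),
        labels={"x": "Grantmaker country", "y": "Total awarded"},
    )
    st.plotly_chart(fig)
```

To try it, save the code as a script, add a call to `render_dashboard(...)` with your own list of row dictionaries at the bottom, and start it with `streamlit run` followed by the script name.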
In addition to processing unstructured data using NLP, the team wrote numerous site-specific scrapers to extract data from the web that was already structured in tables.
The overall approach allowed us to gather several billion dollars’ worth of grant data for NLP analysis across six countries.
To give you a little flavor of the final dashboard, we will leave you with a screenshot.
50 engineers, eight weeks, and one common goal
This project was made possible by more than 50 technology changemakers who built solutions over eight weeks to facilitate positive social change for NGOs in need. A special thanks to Our Community who gave us the opportunity to use our AI skills for good.
A huge shout out to our task managers and all the collaborators.