Machine Learning for Rooftop Detection and Solar Panel Installation


By Harshita Chopra

 

Solar energy is a promising and freely available resource for managing the forthcoming energy crisis without harming the environment. Unlike conventional fossil fuels, it won’t run out anytime soon.

Fact — There’s enough solar energy hitting the Earth every hour to meet all of humanity’s power needs for an entire year.

Let’s face it, what’s cooler than the sun powering your home? And that is quite literally true.

Fun fact — Solar panels also act as “roof shades” to keep buildings cool. They absorb the sun’s rays, directing them away from the roof, whereas a roof without panels would allow heat to penetrate into the building.

As people around the world look for ways to “go green” and protect the earth, solar panels provide an excellent option. But the utility industry needs smart systems that can help improve the integration of renewables in an effective way.

Solar AI, a Singapore-based startup incubated as part of ENGIE Factory, collaborated with Omdena on a mission to hyper-scale the deployment of distributed solar and the transition towards 100% renewables by modernizing the way rooftop solar is sold.

 

The problem statement

The rooftop solar assessment process can be time-consuming and expensive, taking anywhere from 1 hour to 2 full days to calculate the solar potential of each rooftop. In the solar industry, this has resulted in the cost of sales taking up 30–40% of total project costs, significantly worsening the unit economics of solar projects.

By automating these evaluations with Artificial Intelligence, Solar AI aims to drastically reduce the cost of this process and make this information easily available for both building owners as well as solar energy companies.

So we had a mission to accomplish within eight weeks:

Combine multiple machine learning models that can automatically identify rooftops and detect rooftop features such as obstacles, material, slope, and area from high-resolution satellite imagery.

 

The solution

Solar AI provided us with high-resolution satellite imagery in Singapore. With these huge and detailed images in hand, we had a list of tasks to perform.

A single image weighed in at 2 GB, which fascinated me enough to begin with pre-processing: creating thousands of smaller tiles out of it, using just a few lines of code bundled up in a function.
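Roughly, that tiling function can look like the sketch below, assuming a GeoTIFF input and the rasterio library; the tile size, output file names, and the make_tiles helper are illustrative, not the project's actual code.

import rasterio
from rasterio.windows import Window

TILE_SIZE = 512  # pixels per tile edge (assumed value)

def make_tiles(src_path, out_dir):
    # hypothetical helper: slice a large GeoTIFF into TILE_SIZE x TILE_SIZE tiles
    with rasterio.open(src_path) as src:
        for row in range(0, src.height, TILE_SIZE):
            for col in range(0, src.width, TILE_SIZE):
                window = Window(col, row,
                                min(TILE_SIZE, src.width - col),
                                min(TILE_SIZE, src.height - row))
                profile = src.profile.copy()
                profile.update(width=window.width, height=window.height,
                               transform=src.window_transform(window))
                with rasterio.open(f"{out_dir}/tile_{row}_{col}.tif", "w", **profile) as dst:
                    dst.write(src.read(window=window))

Writing each tile with its own windowed transform keeps the georeferencing intact, which matters later when pixel results are converted back to map coordinates.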

 

Snapshot of a few tiles created from the huge image / Source: omdena.com

 

 

The power of annotations

Even the most technically advanced algorithms cannot address or solve a problem without the right data. We know having access to data is quite valuable, but having access to data with a learnable structure is the biggest competitive advantage nowadays. That’s the power of data annotation.

 

A quirky image with hundreds of rooftops / Source: omdena.com

 

Our wonderful team of collaborators volunteered to annotate thousands of rooftops across 500+ tiles. We adopted a smarter method of annotating the buildings: mapping OpenStreetMap (OSM) data onto the raster layer (TIF-format tile) in the QGIS software.

The consistent determination of the annotators resulted in a well-labeled dataset for supervised machine learning algorithms.

The food for models was ready!

 

Scanning images of rooftops via machine learning 

The major task was to detect rooftops in a given image using machine learning & computer vision models.

Not just this, we also had to determine their type/structure such as Flat-roof, Hip-roof, Shed-roof, or any other. Hence, this became an instance segmentation problem.

We tried out a number of models such as Mask R-CNN, YOLACT (You Only Look At CoefficienTs), Detectron2, and more. After training on successive batches of annotations as they were delivered, we kept seeing improvements in results. Eventually, the best-performing model was selected to go ahead with the other tasks.
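For flavor, here is a minimal inference sketch with Detectron2, one of the frameworks we tried; the pretrained COCO config, score threshold, and tile path are illustrative stand-ins rather than the project's recorded setup.

import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # assumed confidence threshold

predictor = DefaultPredictor(cfg)
tile = cv2.imread("tile_0_0.tif")  # hypothetical tile path
outputs = predictor(tile)          # instances with boxes and masks
print(outputs["instances"].pred_boxes, outputs["instances"].pred_masks.shape)

In practice the model would be fine-tuned on the annotated rooftop tiles so that the predicted classes correspond to roof types such as flat, hip, and shed.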

 

Source: omdena.com

 

Zooming in on your rooftops 

Now that we had the bounding boxes and mask contours of the various rooftops stored neatly in a data frame, we were ready to start analyzing individual rooftops. After extracting and zooming into the mask of each detected roof, we needed the following attributes:

  • Obstacles on the roof
  • Area of the roof (excluding obstacles)
  • Material of the roof
  • Faces of hip/shed roofs
  • Orientation of individual slopes

 

Calculating “Area Available” for panels

For the calculation of a rooftop’s effective area, the area occupied by obstacles has to be subtracted from the whole. So that gives rise to the task of identifying obstacles.

Due to the lack of labeled data for obstacle detection, the team shifted towards an unsupervised approach based on edge detection and contour extraction. By setting a threshold on contour colors, obstacles were distinguished from plain roof area to a great extent.

The effective area was then calculated as the difference between the total area and the obstacle area in pixels, which was converted into square meters.
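A rough sketch of both steps, assuming OpenCV and a known ground sampling distance (GSD); the Canny thresholds, the minimum contour size, and the 0.15 m/pixel GSD are illustrative assumptions, and this version filters contours by size rather than the exact color threshold the team used.

import cv2
import numpy as np

GSD = 0.15  # assumed ground sampling distance, metres per pixel

def effective_area_m2(roof_bgr, min_obstacle_px=20):
    gray = cv2.cvtColor(roof_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)  # unsupervised edge map
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros(gray.shape, dtype=np.uint8)
    for c in contours:
        if cv2.contourArea(c) >= min_obstacle_px:  # treat sizeable contours as obstacles
            cv2.drawContours(mask, [c], -1, 255, thickness=-1)
    obstacle_px = int((mask > 0).sum())
    # each pixel covers GSD * GSD square metres
    return (gray.size - obstacle_px) * GSD ** 2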

 

Roof Materials / Source: omdena.com

 

Quality of the roof

Because solar panels are installed on your home’s rooftop, it is important to understand how different roof materials may influence this process.

Common materials range from concrete, metal, and roof tiles to Eternit and composite shingles.

This task also required a labeled dataset, so I decided to jump in and find a solution that let us skip manual annotation. Using OpenStreetMap, we created a small but fruitful dataset of roofs labeled with their materials. A deep learning-based image classification model was then built to identify the material of a roof and give probability scores for each class.
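A minimal sketch of such a classifier, assuming a Keras/TensorFlow stack and an ImageNet-pretrained backbone; the material labels, input size, and backbone choice are illustrative, not the project's recorded configuration.

import tensorflow as tf

MATERIALS = ["concrete", "metal", "roof_tiles", "eternit", "shingles"]  # assumed classes

backbone = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(len(MATERIALS), activation="softmax"),  # per-class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

The softmax head is what yields the per-class probability scores mentioned above.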

 

Which way do solar panels face?

 

Source: omdena.com

 

Orientation, or the direction your roof faces, may have a large impact on how productive roof-mounted solar panels will be. Your system will generate the most energy when it gets as many hours of light exposure per day as possible. In most places, the ideal power generation angle is 30–40 degrees.

 

Source: omdena.com

 

The task of identifying the many faces of a hip roof was a challenging one. After multiple attempts with different approaches, the task team managed to create an appreciable mathematical model that could identify the facets as well as the angles at which they are inclined, using a set of utility functions. The output was the orientation of the different roof facets.

 

Conclusion: Putting it all together

The outputs of all the tasks were captured systematically in a data frame. Keeping in mind that we computed various attributes based on pixel values, we converted them back to geographic coordinates at the end. This allowed us to project the data on satellite images of a particular CRS (Coordinate Reference System).
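The pixel-to-coordinate conversion can be done with the affine transform stored in each georeferenced tile. A hedged sketch with rasterio, using a hypothetical tile path:

import rasterio

with rasterio.open("tile_0_0.tif") as src:  # hypothetical tile
    x, y = src.xy(100, 250)                 # pixel (row, col) -> coordinates in the tile's CRS
    print(src.crs, x, y)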

After merging everything into an automated pipeline and many rounds of reviews, evaluation, fixing bugs, and testing — our software was ready to be delivered.

Solar AI is extremely happy with the final deliverables, and this is something that makes the experience even more worthwhile. As CEO Bolong Chew puts it:

“This work went beyond our wildest expectations and we’re extremely happy. We set the bar really high and the team delivered. It was an amazing experience.”

Augmenting Public Safety Through AI and Machine Learning


At this demo day, we took a close look at the tremendous potential AI offers for making communities safer by helping to reduce, prevent, and respond to crimes. When it comes to public safety, it is often critical to act quickly. AI technologies can supplement the work of people, taking on monotonous and time-consuming tasks that would be impossible for humans to do effectively. Natural language processing can read and analyze public communications and news reports to detect potential problem areas and get ahead of violence. Of course, this work must be done responsibly and ethically.

Sharing her perspective on the impact that AI can have in keeping people safe was an expert in the field, ElsaMarie D’Silva, the Founder & CEO of the Red Dot Foundation. The Red Dot Foundation’s award-winning platform Safecity crowdsources personal experiences of sexual violence and abuse in public spaces. ElsaMarie is listed as one of BBC Hindi’s 100 Women, and her work has been recognized by numerous UN organizations and the SDG Action Festival.

To go a little deeper into the application of AI for public safety, we shared Omdena projects that took innovative approaches to make communities safer.

 

Case Study 1: Preventing sexual harassment through a safe-path finder algorithm

UN Women states that 1 in 3 women face some kind of sexual assault at least once in their lifetime.

With the first case study, the Omdena team drew upon Safecity’s crowdsourced data about sexual harassment in public spaces and leveraged open-source data to build heatmaps and calculate safe routes through major cities in India. Part of the solution is a sexual harassment category classifier with 93 percent accuracy and several models that predict places with a high risk of sexual harassment incidents to suggest safe routes.

 

AI Sexual Harassment

 

 


 

Case Study 2: Understanding gang violence patterns and actors through Twitter analysis

Our team worked in partnership with Voice 4 Impact, an award-winning NGO whose solution to violence in our communities addresses the questions people worldwide are asking: “How do we keep missing the signs?”

The Omdena team made use of natural language processing techniques, AI techniques that analyze text to understand what is being communicated. Machine learning algorithms were used to understand gang language, and AI models were built to detect violent messages on Twitter, without profiling. The aim is to predict, and ultimately prevent, gang violence.

 

AI Gang Violence

 


 

Case Study 3: Analyzing Domestic Violence through Natural Language Processing (NLP)

Finally, we presented Omdena’s work to uncover domestic violence in India hidden due to COVID lockdowns. This work is part of a project with the award-winning Red Dot Foundation and Omdena’s collaborative platform to build solutions to better understand domestic violence and online harassment patterns during COVID-19. The project used natural language processing techniques with social media, government reports, and other text content to create a dataset with which Safecity could mobilize local efforts to protect and support domestic violence victims.

 

 

AI Domestic Violence

 

 


 

 

 

 


 

Matching Land Conflict Events to Government Policies via Machine Learning | World Resources Institute


By Laura Clark Murray, Nikhel Gupta, Joanne Burke, Rishika Rupam, Zaheeda Tshankie

 

Download the PDF version of this whitepaper here.

Project Overview

This project aimed to provide a proof-of-concept machine-learning-based methodology to identify land conflict events in a given geography and match those events to relevant government policies. The overall objective is to offer a platform where policymakers can be made aware of land conflicts as they unfold and identify existing policies that are relevant to the resolution of those conflicts.

Several Natural Language Processing (NLP) models were built to identify and categorize land conflict events in news articles and to match those land conflict events to relevant policies. A web-based tool that houses the models allows users to explore land conflict events spatially and through time, as well as explore all land conflict events by category across geography and time.

The geographic scope of the project was limited to India, which has the most environmental land conflicts of any country.

 

Background

Degraded land is “land that has lost some degree of its productivity due to human-caused process”, according to the World Resources Institute. Land degradation affects 3.2 billion people and costs the global economy about 10 percent of annual gross product. While dozens of countries have committed to restore 350 million hectares of degraded land, land disputes are a major barrier to effective implementation. Without streamlined access to land use rights, landowners are not able to implement sustainable land-use practices. In India, where 21 million hectares of land have been committed to restoration, land conflicts affect more than 3 million people each year.

AI and machine learning offer tremendous potential not only to identify land-use conflict events but also to match suitable policies for their resolution.

 

Data Collection

All data used in this project is in the public domain.

News Article Corpus: Contained 65,000 candidate news articles from Indian and international newspapers from the years 2008, 2017, and 2018. The articles were obtained from the Global Database of Events, Language, and Tone (GDELT) Project, “a platform that monitors the world’s news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages.” All the text was either originally in English or translated to English by GDELT.

  • Annotated Corpus: Approximately 1,600 news articles from the full News Article Corpus were manually labeled and double-checked as Negative (no conflict news) and Positive (conflict news).
  • Gold Standard Corpus: An additional 200 annotated positive conflict news articles, provided by WRI.
  • Policy Database: Collection of 19 public policy documents related to land conflicts, provided by WRI.

 

Approach

 

Text Preparation

 

In this phase, the articles of the News Article Corpus and policy documents of the Policy Database were prepared for the natural language processing models.

The articles and policy documents were processed using spaCy, an open-source library for natural language processing, to achieve the following (a short code sketch follows the list):

  • Tokenization: Segmenting text into words, punctuation marks, and other elements
  • Part-of-speech (POS) tagging: Assigning word types to tokens, such as “verb” or “noun”
  • Dependency parsing: Assigning syntactic dependency labels to describe the relations between individual tokens, such as “subject” or “object”
  • Lemmatization: Assigning the base forms of words, regardless of tense or plurality
  • Sentence Boundary Detection (SBD): Finding and segmenting individual sentences
  • Named Entity Recognition (NER): Labelling named “real-world” objects, like persons, companies, or locations
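A minimal sketch of these steps with spaCy; the model name and sample sentence are illustrative.

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline (assumed model)
doc = nlp("A dam burst near the village in 2018.")

for token in doc:
    # token text, POS tag, dependency label, and lemma
    print(token.text, token.pos_, token.dep_, token.lemma_)
for sent in doc.sents:  # sentence boundary detection
    print(sent.text)
for ent in doc.ents:    # named entity recognition
    print(ent.text, ent.label_)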

 

Coreference resolution was applied to the processed text data using Neuralcoref, which is based on an underlying neural net scoring model. With coreference resolution, all common expressions that refer to the same entity were located within the text. All pronominal words in the text, such as her, she, he, his, them, their, and us, were replaced with the nouns to which they referred.

 

For example, consider this sample text:

“Farmers were caught in a flood. They were tending to their field when a dam burst and swept them away.”

Neuralcoref recognizes “Farmers”, “they”, “their” and “them” as referring to the same entity. The processed sentence becomes:

“Farmers were caught in a flood. Farmers were tending to farmers’ field when a dam burst and swept farmers away.”

 

Coreference resolution of sample sentences
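A sketch of this step with spaCy and Neuralcoref, as described above; note that Neuralcoref requires a compatible spaCy 2.x pipeline.

import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

doc = nlp("Farmers were caught in a flood. They were tending to their field "
          "when a dam burst and swept them away.")
print(doc._.coref_resolved)  # pronouns replaced by their antecedents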

 

 

Document Classification

 

The objective of this phase was to build a model to categorize the articles in the News Article Corpus as either “Negative”, meaning they were not about conflict events, or “Positive”, meaning they were about conflict events.

After preparation of the articles in the News Article Corpus, as described in the previous section, the texts were then prepared for classification.

First, an Annotated Corpus was formed to train the classification model. A 1,600-article subset of the News Article Corpus was manually labeled as “Negative” or “Positive”.

To prepare the articles in both the News Article Corpus and Annotated Corpus for classification, the previously pre-processed text data of the articles was represented as vectors using the Bag of Words approach. With this approach, the text is represented as a collection, or “bag”, of the words it contains along with the frequency with which each word appears. The order of words is ignored.

For example, consider a text article consisting of these two sentences:

Sentence 1: “Zahra is sick with a fever.”

Sentence 2: “Arun is happy he is not sick with a fever.”

This text contains a total of ten words: “Zahra”, “is”, “sick”, “happy”, “with”, “a”, “fever”, “not”, “Arun”, “he”. Each sentence in the text is represented as a vector, where each index in the vector indicates the frequency that one particular word appears in that sentence, as illustrated below.

 

 

 

With this technique, each sentence is represented by a vector, as follows:

“Zahra is sick with a fever.”

[1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

“Arun is happy he is not sick with a fever.”

[0, 2, 1, 1, 1, 1, 1, 1, 1, 1]
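A quick illustration of this vectorization with scikit-learn's CountVectorizer; note that its columns follow its own alphabetically sorted vocabulary rather than the word order shown above, and the token pattern is overridden so one-letter words like “a” are kept.

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Zahra is sick with a fever.",
             "Arun is happy he is not sick with a fever."]
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep single-letter tokens
vectors = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())
print(vectors.toarray())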

With the Annotated Corpus vectorized with this technique, the data was used to train a logistic regression classifier model. The trained model was then used with the vectorized data of the News Article Corpus, to classify each article into Positive and Negative conflict categories.

The accuracy of the classification model was measured by looking at the percentage of the following:

  • True Positive: Articles correctly classified as relating to land conflicts
  • False Positive: Articles incorrectly classified as relating to land conflicts
  • True Negative: Articles correctly classified as not being related to land conflicts
  • False Negative: Articles incorrectly classified as not being related to land conflicts

 

The “precision” of the model indicates how many of the articles classified as being about land conflict were actually about land conflict. The “recall” indicates how many of the articles that were actually about land conflict were categorized correctly. An f1-score, the harmonic mean of precision and recall, was calculated from the two.

The trained logistic regression model successfully classified the news articles with precision, recall, and f1-score of 98% or greater. This indicates that the model produced a low number of false positives and false negatives.

 

Classification report using a test dataset and logistic regression model

 

 

Categorize by Land Conflict Events

The objective of this phase was to build a model to identify the set of conflict events referred to in the collection of positive conflict articles and then to classify each positive conflict article accordingly.

A word cloud of the articles in the Gold Standard Corpus gives a sense of the content covered in the articles.

A topic model was built to discover the set of conflict topics that occur in the Positive conflict articles. We chose a semi-supervised approach to topic modeling to maximize the accuracy of the classification process. We chose to use CorEx (Correlation Explanation), a semi-supervised topic model that allows domain knowledge, as specified by relevant keywords acting as “anchors”, to guide the topic analysis.

To align with the Land Conflicts Policies provided by WRI, seven relevant core land conflict topics were specified. For each topic, correlated keywords were specified as “anchors” for the topic, as sketched below.
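A hedged sketch of anchored topic modelling with the corextopic package; the documents and the anchor subset below are illustrative placeholders, not the project's corpus or full anchor list.

from corextopic import corextopic as ct
from sklearn.feature_extraction.text import CountVectorizer

docs = ["land resettlement dispute after the dam burst",   # placeholder documents
        "crops on the farm were destroyed by floods",
        "coal mining displaced villagers from their land"]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
words = list(vectorizer.get_feature_names_out())

anchors = [["land", "resettlement"], ["crops", "farm"], ["mining", "coal"]]
topic_model = ct.Corex(n_hidden=len(anchors))
topic_model.fit(X, words=words, anchors=anchors, anchor_strength=3)
print(topic_model.get_topics())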

 

 

 

The trained topic model provided three keywords for each of the seven topics:

  • Topic #1: land, resettlement, degradation
  • Topic #2: crops, farm, agriculture
  • Topic #3: mining, coal, sand
  • Topic #4: forest, trees, deforestation
  • Topic #5: animal, attacked, tiger
  • Topic #6: drought, climate change, rain
  • Topic #7: water, drinking, dams

The resulting topic model is 93% accurate. This scatter plot uses word representations to provide a visualization of the model’s classification of the Gold Standard Corpus and hand-labeled positive conflict articles.

 

Visualization of the topic classification of the Gold Standard Corpus and Positive Conflict Articles

 

 

Identify the Actors, Actions, Scale, Locations, and Dates

The objective of this phase was to build a model to identify the actors, actions, scale, locations, and dates in each positive conflict article.

Typically, names, places, and famous landmarks are identified through Named Entity Recognition (NER). Recognition of such standard entities is built into spaCy’s NER package, with which our model detected the locations and dates in the positive conflict articles. The specialized content of the news articles required further training with “custom entities”: those particular to this context of land conflicts.

All the positive conflict articles in the Annotated Corpus were manually labeled for “custom entities” (a sketch of the training-data format follows the list):

  • Actors: Such as “Government”, “Farmer”, “Police”, “Rains”, “Lion”
  • Actions: Such as “protest”, “attack”, “killed”
  • Numbers: Number of people affected by a conflict
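A sketch of what one such training example might look like in spaCy's offset-based format; the sentence, spans, and label names are hypothetical.

TRAIN_DATA = [
    ("Farmers protested against the government over land acquisition.",
     {"entities": [(0, 7, "ACTOR"),      # "Farmers"
                   (8, 17, "ACTION"),    # "protested"
                   (30, 40, "ACTOR")]}), # "government"
]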

This example shows how this labeling looks for some text in one article:

 

 

These labeled positive conflict articles were used to train our custom entity recognizer model. That model was then used to find and label the custom entities in the news articles in the News Article Corpus.

 

Match Conflicts to Relevant Policies

The objective of this phase was to build a model to match each processed positive conflict article to any relevant policies.

The Policy Database was composed of 19 policy documents relevant to land conflicts in India, including policies such as the “Land Acquisition Act of 2013”, the “Indian Forest Act of 1927”, and the “Protection of Plant Varieties and Farmers’ Rights Act of 2001”.

 

Excerpt of a 2001 policy document related to agriculture

 

 

A text similarity model was built to compare two text documents and determine how close they are in terms of context or meaning. The model made use of the “Cosine similarity” metric to measure the similarity of two documents irrespective of their size.

Cosine similarity measures the cosine of the angle between two vectors. Using the vectorized text of the articles and the policy documents generated in the previous phases as described above, the model produced a collection of matches between articles and policies.
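A small sketch of this matching step with TF-IDF vectors and scikit-learn; the article and policy snippets are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article = ["Farmers protested the land acquisition near the dam."]  # placeholder article
policies = ["The Land Acquisition Act of 2013 ...",                 # truncated policy texts
            "The Indian Forest Act of 1927 ..."]

matrix = TfidfVectorizer(stop_words="english").fit_transform(article + policies)
scores = cosine_similarity(matrix[:1], matrix[1:])  # article vs. each policy
print(scores)  # higher score = closer in content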

 

Visualization of Conflict Event and Policy Matching

The objective of this phase was to build a web-based tool for the visualization of the conflict event and policy matches.

An application was created using the Plotly Python Open Source Graphing Library. The web-based tool houses the models and allows users to explore land conflict events spatially and through time, as well as explore all land conflict events by category across geography and time.

The map displays land conflict events detected in the News Article Corpus for the selected years and regions of India.

Conflict events are displayed as color-coded dots on a map. The colors correspond to specific conflict categories, such as “Agriculture” and “Environmental”, and actors, such as “Government”, “Rebels”, and “Civilian”.

In this example, the tool displays geo-located land conflict events across five regions of India in 2017 and 2018.

 

 

 

By selecting a particular category from the right column, only those conflicts related to that category are displayed on the map. Here only the Agriculture-related subset of the events shown in the previous example is displayed.

 

 

News articles from the selected years and regions are displayed below the map. When a particular article is selected, the location of the event is shown on the map. The text of the article is displayed along with policies matched to the event by the underlying models, as seen in the example below of a 2018 agriculture-related conflict in the Andhra Pradesh region.

 

 

Here is a closer look at the article and matched policies in the example above.

 

 

 

Next Steps

This overview describes the results of a pilot project to use natural language processing techniques to identify land conflict events described in news articles and match them to relevant government policies. The project demonstrated that NLP techniques can be successfully deployed to meet this objective.

Potential improvements include refinement of the models and further development of the visualization tool. Opportunities to scale the project include building the library of news articles with those published from additional years and sources, adding to the database of policies, and expanding the geographic focus beyond India.

Opportunities to improve and scale the pilot project

 

Improvements
  • Refine models
  • Further development of visualization tool

 

Scale
  • Expand library of articles with content from additional years and sources
  • Expand the database of policies
  • Expand the geographic focus beyond India

 

 

About the Authors

  • Laura Clark Murray is the Chief Partnership & Strategy Officer at Omdena. Contact: laura@omdena.com
  • Nikhel Gupta is a physicist, a Postdoctoral Fellow at the University of Melbourne, and a machine learning engineer with Omdena.
  • Joanne Burke is a data scientist with MUFG and a machine learning engineer with Omdena.
  • Rishika Rupam is a Data and AI Researcher with Tilkal and a machine learning engineer with Omdena.
  • Zaheeda Tshankie is a Junior Data Scientist with Telkom and a machine learning engineer with Omdena.

 

Omdena Project Team

Kulsoom Abdullah, Joanne Burke, Antonia Calvi, Dennis Dondergoor, Tomasz Grzegorzek, Nikhel Gupta, Sai Tanya Kumbharageri, Michael Lerner, Irene Nanduttu, Kali Prasad, Jose Manuel Ramirez R., Rishika Rupam, Saurav Suresh, Shivam Swarnkar, Jyothsna sai Tagirisa, Elizabeth Tischenko, Carlos Arturo Pimentel Trujillo, Zaheeda Tshankie, Gabriela Urquieta

 

Partners

This project was done in collaboration with Kathleen Buckingham and John Brandt, our partners with the World Resources Institute (WRI).

 

 

About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through global bottom-up collaboration. Omdena is a partner of the United Nations AI for Good Global Summit 2020.

Building a Risk Classifier for a PTSD Assessment Chatbot


 

Using MLflow to structure a Machine Learning project and support the backend of a PTSD risk-classifier chatbot.

 

 

The Problem: Classification of Text for a PTSD Assessment Chatbot

The input

A text transcript similar to:

 

therapist and client conversation snapshot

 

The output

Low Risk -> 0 , High Risk -> 1

One of the requirements of this project was a productionized machine learning model for text classification of PTSD risk that could communicate with a frontend.

As part of the solution to this problem, we decided to explore the MLFlow framework.

 

MLflow

MLflow is an open-source platform to manage the Machine Learning lifecycle, including experimentation, reproducibility, and deployment. It currently offers three components:

MLflow Tracking: Allows you to track experiments and projects.

MLflow Models: Provides a model format and framework to persist, version, and serialize models in multiple platform formats.

MLflow Projects: Provides a convention-based approach to set up your ML project, so you benefit from the large body of work the developer community has put into the platform.

Main benefits identified from my initial research were the following:

  • Works with any ML library and language
  • Runs the same way anywhere
  • Designed for small and large organizations
  • Provides a best-practices approach for your ML project
  • Serving layers (REST + batch) come almost for free if you follow the conventions

 

 

The Solution

 

The focus of this article is to show the baseline ML model and how MLflow was used to aid in text classification, training-run experiment tracking, and productionization of the model.

 

Installing MLFlow

pip install mlflow

 

Model development tracking

The snippet below shows our cleaned and tidy data after data munging:

 

snapshot of a table containing transcript_id, text, and label as the column headings

 

The gist below describes our baseline (dummy) logistic regression pipeline:

 

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train, test = train_test_split(final_dataset, random_state=42,
                               test_size=0.33, shuffle=True)
X_train = train.text
X_test = test.text

LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2',
                              encoding='latin-1', ngram_range=(1, 2),
                              stop_words='english')),
    ('clf', LogisticRegression(solver='sag')),
])

The link to this code is given here.

One of the first useful things you can do with MLflow during model development is to log a training run. You might log, for instance, an accuracy metric, and the model generated will also be associated with that run.

 

import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score

with mlflow.start_run():
    LogReg_pipeline.fit(X_train, train["label"])
    # compute the testing accuracy
    prediction = LogReg_pipeline.predict(X_test)
    accuracy = accuracy_score(test["label"], prediction)

    mlflow.log_metric("model_accuracy", accuracy)
    mlflow.sklearn.log_model(LogReg_pipeline, "LogisticRegressionPipeline")

 

The link to the code above is given here.

 

At this point, the model above is saved and reproducible if needed at any point in time.

You can spin up the MLFlow tracker UI so you can look at the different experiments:

 

╰─$ mlflow ui -p 60000
[2019-09-01 16:02:19 +0200] [5491] [INFO] Starting gunicorn 19.7.1
[2019-09-01 16:02:19 +0200] [5491] [INFO] Listening at: http://127.0.0.1:60000 (5491)
[2019-09-01 16:02:19 +0200] [5491] [INFO] Using worker: sync
[2019-09-01 16:02:19 +0200] [5494] [INFO] Booting worker with pid: 5494

 

The backend of the tracker can be either the local file system or a cloud-distributed file system (S3, Google Drive, etc.). It can be used locally by one team member, or be distributed and reproducible.

The image below shows a couple of models training runs in conjunction with the metrics and model artifacts collected:

 

Experiment Tracker in MLFlow screenshot

Sample of experiment tracker in MLFlow for Text Classification

 

Once your models are stored, you can always go back to a previous version of the model and re-run it based on the id of the artifact. The logs and metrics can also be committed to GitHub to be stored in the context of a team, so everyone has access to the different experiments and the resulting metrics.

 

MLFlow experiment tracker

 

Now that our initial model is stored and versioned, we can assess the artifact and the project at any point in the future. The integration with sklearn is particularly good because the model is automatically pickled in a sklearn-compatible format and a Conda file is generated. You could also have logged a reference to a URI and checksum of the data used to generate the model, or the data itself if within reasonable size limits (preferably stored in the cloud).

 

Setting up a training job

Whenever you are done with your model development you will need to organize your project in a productionizable way.

The most basic component is the MLproject file. There are multiple options to package your project: Docker, Conda, or bespoke. We will use Conda for its simplicity in this context.

 

name: OmdenaPTSD

conda_env: conda.yaml

entry_points:
  main:
    command: "python train.py"

 

The entry point runs the command that should be used when running the project, in this case, a training file.

The conda file contains a name and the dependencies to be used in the project:

 

name: omdenaptsd-backend
channels:
  - defaults
  - anaconda
dependencies:
  - python=3.6
  - scikit-learn=0.19.1
  - pip:
      - mlflow>=1.1

 

At this point you just need to run the project with the mlflow run command.

 

Setting up the REST API classifier backend

To set up a REST classifier backend you don’t need any job setup; you can serve a model persisted from a Jupyter notebook.

To run a model you just need to run the models serve command with the URI of the saved artifact:

 

mlflow models serve -m runs:/104dea9ea3d14dd08c9f886f31dd07db/LogisticRegressionPipeline
2019/09/01 18:16:49 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2019/09/01 18:16:52 INFO mlflow.pyfunc.backend: === Running command 'source activate 
mlflow-483ff163345a1c89dcd10599b1396df919493fb2 1>&2 && gunicorn --timeout 60 -b 
127.0.0.1:5000 -w 1 mlflow.pyfunc.scoring_server.wsgi:app'
[2019-09-01 18:16:52 +0200] [7460] [INFO] Starting gunicorn 19.9.0
[2019-09-01 18:16:52 +0200] [7460] [INFO] Listening at: http://127.0.0.1:5000 (7460)
[2019-09-01 18:16:52 +0200] [7460] [INFO] Using worker: sync
[2019-09-01 18:16:52 +0200] [7466] [INFO] Booting worker with pid: 7466

 

And a scalable backend server (running gunicorn) is ready without any code beyond your model training and artifact logging in the MLflow packaging strategy. It frees machine learning teams that want to iterate fast from the cumbersome initial infrastructure work of setting up repetitive, uninteresting boilerplate prediction APIs.

You can immediately start launching predictions to your server by:

 

curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json' -d 
'{"columns":["text"],"data":[[" concatenated text of the transcript"]]}'
[0]

 

The smart thing here is that the MLflow scoring module uses the sklearn model input (pandas schema) as a spec for the REST API. Sklearn was the example used here, but MLflow has bindings for H2O, Spark, Keras, TensorFlow, ONNX, PyTorch, etc. It infers the input from the model packaging format and offloads the data to the scoring function. It’s a very neat software engineering approach to a problem faced every day by machine learning teams, freeing engineers and scientists to innovate instead of working on repetitive boilerplate code.

Going back to the Omdena challenge, this backend is available for the frontend team to connect to the risk classifier at the most convenient point of the chatbot app (most likely after a critical mass of open-ended questions).

 

 


Estimating Possible Undetected COVID-19 Infection Cases using Probability Analysis


 
 
Country-wide estimations for undetected Covid-19 cases and recommendations for enhancing testing facilities based on Probability Analysis

The Problem: Why is estimating undetected Covid-19 cases crucial?

An estimation of undetected Covid-19 cases is important for authorities to plan economic policies, make decisions about the different stages of lockdown, and work towards provisioning intensive care units.

As we cross the psychological mark of 1 million Covid-19 patients around the globe, more questions are popping up about the capability of our healthcare systems to contain the virus. One of the major worries is the systematic uncertainty in the number of people who have hosted the virus. The major contribution to this uncertainty is likely the small fraction of the population that has been tested for Covid-19.

The main test to confirm if someone has Covid-19 is to look for signs of the virus’s genetic material in a swab of their nose or throat. This is not yet available for most people, and healthcare workers are morally bound to reserve the testing apparatus for seriously ill patients in hospital.

 

The Solution

 

In this article, we show a simple Bayesian approach to estimating undetected Covid-19 cases. Bayes’ theorem can be written as:

P(A|B) = P(B|A) × P(A) / P(B)

where P(A) is the probability of event A, P(B) is the probability of event B, P(A|B) is the probability of observing event A if B is true, and P(B|A) is the probability of observing event B if A is true.

The quantity of interest for us is P(infected|notTested) i.e. the probability of infections that are not tested. This is equivalent to the percentage of the population infected by Covid-19 but not tested and we can write it as:

P(infected|notTested) = P(notTested|infected) × P(infected) / P(notTested)

Here the other probabilities are:

  • P(notTested|infected): Probability that an infected person is not tested, i.e. the fraction of the infected population that goes untested.
  • P(infected): Prior probability of infection, i.e. the known percentage of the population that is infected.
  • P(notTested): Probability, or percentage, of people not tested.

The following plot shows the total Covid-19 tests per million people and the total number of confirmed cases per million people for several countries. This suggests a clear relation between the Covid-19 tests and confirmed positive detections.

 

Test per million vs Positive per million graph

Figure 1: Tests per million versus positive Covid-19 cases per million as of 20 March 2020 (data source).

 

Assuming that all countries follow this relation between Covid-19 tests and confirmed cases, we can make a rough estimate of the number of undetected cases in each country.

 

Let’s take Australia as an example:

For example, the plot shows that the prior knowledge of infected cases is

P(infected) = 27.8/10⁶, and

P(notTested) = (10⁶ − 4473)/10⁶.

To estimate P(notTested|infected), I used the relation between Covid-19 tests and confirmed cases shown in Figure 1. This is done by fitting a power law of the form y = a × x^b, where a is the normalization and b is the slope of the power law. The following plot shows a fit to the data points from the above plot, where the best fit gives a = 0.060±0.008 and b = 0.966±0.014.

 

Test per million vs positive per million graph 2

Figure 2: The relation between Covid-19 tests and confirmed cases and a power-law best fit.
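Such a fit can be reproduced with scipy; the x and y arrays below are illustrative values, not the article's dataset.

import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b):
    return a * x ** b

x = np.array([10.0, 100.0, 1000.0, 10000.0])  # tests per million (illustrative)
y = np.array([0.55, 5.1, 47.0, 440.0])        # positives per million (illustrative)
(a, b), cov = curve_fit(power_law, x, y, p0=[0.06, 1.0])
a_err, b_err = np.sqrt(np.diag(cov))          # 1-sigma uncertainties on a and b
print(a, b)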

 

Using the best-fit parameters, P(notTested|infected) = ((10⁶ − 4473)/10⁶) / ((a × (10⁶ − 4473)^b)/10⁶).

Combining these three probabilities, I find P(infected|notTested) ≈ 0.00073, i.e. about 0.073% of the population of Australia. Multiplying this by the population of Australia indicates that there is a possibility of about 18,600 undetected Covid-19 cases in the country. The following plot shows possible undetected Covid-19 cases as a function of tests per million for different countries as of 20 March 2020.

 

Tests per million vs Undetected Covid-19 cases graph

Figure 3: Estimation of undetected Covid-19 cases (see assumptions in the text).
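The Australia arithmetic above can be replayed directly as a sanity check; the fitted a and b and the test count are taken from the text, and the population figure is approximate.

a, b = 0.060, 0.966
tests_per_million = 4473
p_infected = 27.8 / 1e6
p_not_tested = (1e6 - tests_per_million) / 1e6
p_nt_given_inf = p_not_tested / (a * (1e6 - tests_per_million) ** b / 1e6)
p_inf_given_nt = p_nt_given_inf * p_infected / p_not_tested  # Bayes' theorem
print(p_inf_given_nt * 25.5e6)  # ~19,000, close to the ~18,600 quoted above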

 

Note that several assumptions and considerations are made to estimate these undetected cases. For instance:

  • I assumed that all countries follow the same power-law relation to estimate P(notTested|infected). This is not a particularly good assumption, as there is huge scatter in this relation between different countries.
  • Our prior knowledge of the number of infections can itself be biased, as P(infected) depends on the number of tests performed as of 20 March 2020.
  • I haven’t considered the susceptibility of a country’s population to Covid-19, nor the attack rate, i.e. the biostatistical measure of the frequency of morbidity, which for Covid-19 is estimated at around 50–80% (Verity et al. 2020).
  • The impact of government policies in these countries from 14 days before 20 March to 14 days after is not considered.
  • I haven’t considered how susceptible people are targeted for testing in different countries in the following days.

Figure 4 below shows the total number of confirmed cases versus the tests per million as of 5 April 2020 for several countries (data source).

Sixteen days later, on 5 April, the confirmed positive cases in countries like Ukraine, India, and the Philippines were consistent with the predictions in Figure 3. These countries had performed ≤ 10 tests per million people as of 20 March.

Note that the consistency between estimations as of 20 March and 5 April does not necessarily mean that all undetected cases as of 20 March are confirmed now. Several of the confirmed cases as of 5 April are expected to be new cases due to the spread between 20 March and 5 April (even in the presence of lockdowns).

The estimated undetected cases for countries like Colombia and South Africa are about twice as large (Figure 3) as the total confirmed cases as of 5 April (about 1,500 for both). Both countries have performed about 100 tests per million people.

Countries like Taiwan, Australia, and Iceland, on the other hand, have shown an order of magnitude fewer confirmed cases than the numbers estimated in Figure 3.

This indicates that the countries that have not boosted their testing efficiency to more than 1,000 tests per million people have significantly larger uncertainties on the number of current confirmed cases.

 

Tests per million vs Total positive cases graph

Figure 4: The total number of confirmed cases versus the tests per million as of 5 April 2020.

 
 

Given the data in Figure 4 from 5 April 2020, I repeated the whole exercise to estimate the undetected Covid-19 cases for these countries. The following figure shows the best-fit power law and data points, similar to Figure 2, but for the data as of 5 April 2020.

 

Tests per million vs Positive per million graph

Figure 5: Best fit power law for data as of 5 April 2020.

 

The best-fit slope for the power-law relation in Figure 5 (b = 1.281±0.009) is consistent with the slope in Figure 2 at the 2-σ confidence level. This supports our assumption of estimating P(notTested|infected) from the best-fit power-law relation (the slope is not changing); the other caveats remain the same as before.

Finally, the following plot shows the estimated undetected Covid-19 cases for different countries as of 5 April 2020.

 

Tests per million vs undetected covid-19 cases graph

Figure 6: Estimated Undetected Covid-19 cases as of 5 April 2020 (see assumptions in the text).

 

The comparison between the undetected estimations as of 20 March (Figure 3) and the confirmed cases as of 5 April (Figure 4) shows that more tests per million people are required to capture the possible undetected cases. It is high time that authorities raise testing capacity in order to reduce the systematic uncertainty from undetected Covid-19 cases. This seems to be the only good way to reduce the death rate of Covid-19 patients, as indicated by the large amount of Covid-19 testing in Germany and South Korea.

To make this happen, all countries need at least one testing center within a radius of 20 km and should arrange more drive-through testing facilities as soon as possible.

 

 
