Building a Risk Classifier for a PTSD Assessment Chatbot


Using MLFlow to structure a machine learning project and support the backend of a PTSD risk classifier chatbot.



The Problem: Classification of Text for a PTSD Assessment Chatbot

The input

A text transcript similar to:


therapist and client conversation snapshot


The output

Low Risk -> 0 , High Risk -> 1

One of the requirements of this project was to have a productionized machine learning model for PTSD text classification that could communicate with a frontend.

As part of the solution to this problem, we decided to explore the MLFlow framework.



MLflow is an open-source platform to manage the machine learning lifecycle, including experimentation, reproducibility, and deployment. It currently offers three components:

MLFlow Tracking: Allows you to track experiments and projects.

MLFlow Models: Provides a model and framework to persist, version, and serialize models in multiple platform formats.

MLFlow Projects: Provides a convention-based approach to setting up your ML project, so that you benefit as much as possible from the work the developer community puts into the platform.

Main benefits identified from my initial research were the following:

  • Works with any ML library and language
  • Runs the same way anywhere
  • Designed for small and large organizations
  • Provides a best-practices approach for your ML project
  • Serving layers (REST + batch) come almost for free if you follow the conventions



The Solution


The focus of this article is to show the baseline ML models and how MLFlow was used for training experiment tracking and for productionizing the text classification model.


Installing MLFlow

pip install mlflow


Model development tracking

The snippet below shows our cleaned data after data munging:


snapshot of a table containing transcript_id, text, and label as the column headings


The gist below describes our baseline (dummy) logistic regression pipeline:


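from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

train, test = train_test_split(final_dataset,
                               random_state=42, test_size=0.33, shuffle=True)
X_train = train.text
X_test = test.text

LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1',
                              ngram_range=(1, 2), stop_words='english')),
    # The classifier step is truncated in the source; plain LogisticRegression is assumed.
    ('clf', LogisticRegression()),
])

The link to this code is given here.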

One of the first useful things you can do with MLFlow during model development is to log a model training run. You would log, for instance, an accuracy metric, and the generated model is also associated with this run.


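import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score

with mlflow.start_run():
    # train the pipeline on the training split
    LogReg_pipeline.fit(X_train, train["label"])
    # compute the testing accuracy
    prediction = LogReg_pipeline.predict(X_test)
    accuracy = accuracy_score(test["label"], prediction)
    mlflow.log_metric("model_accuracy", accuracy)
    mlflow.sklearn.log_model(LogReg_pipeline, "LogisticRegressionPipeline")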


The link to the code above is given here.


At this point, the model above is saved and reproducible if needed at any point in time.

You can spin up the MLFlow tracker UI so you can look at the different experiments:


$ mlflow ui -p 60000
[2019-09-01 16:02:19 +0200] [5491] [INFO] Starting gunicorn 19.7.1
[2019-09-01 16:02:19 +0200] [5491] [INFO] Listening at: (5491)
[2019-09-01 16:02:19 +0200] [5491] [INFO] Using worker: sync
[2019-09-01 16:02:19 +0200] [5494] [INFO] Booting worker with pid: 5494


The backend of the tracker can be either the local file system or cloud storage (S3, Google Cloud Storage, etc.). It can be used locally by one team member or shared across a team so that runs stay reproducible for everyone.

The image below shows a couple of model training runs together with the metrics and model artifacts collected:


Experiment Tracker in MLFlow screenshot

Sample of experiment tracker in MLFlow for Text Classification


Once your models are stored, you can always go back to a previous version of the model and re-run it based on the ID of the artifact. The logs and metrics can also be committed to GitHub to be stored in the context of a team, so everyone has access to the different experiments and the resulting metrics.


MLFlow experiment tracker


Now that our initial model is stored and versioned, we can access the artifact and the project at any point in the future. The integration with Sklearn is particularly good because the model is automatically pickled in a Sklearn-compatible format and a Conda file is generated. You could also log a reference to a URI and a checksum of the data used to generate the model, or the data itself if it is within reasonable size limits (preferably when the information is stored in the cloud).
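As a rough sketch of the data-reference idea, the helper below computes a SHA-256 checksum of a local data file and pairs it with its URI. The function name `data_fingerprint`, the parameter keys, and the file names in the comment are hypothetical and not part of the original project; only `mlflow.log_params` is a real MLflow call, and it is shown in a comment so the snippet runs without a tracking server.

```python
import hashlib


def data_fingerprint(path, uri, chunk_size=1 << 20):
    """Return the data URI together with a SHA-256 checksum of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large datasets do not have to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {"data_uri": uri, "data_sha256": digest.hexdigest()}


# Inside an active run, the result could be attached as parameters, e.g.:
#   with mlflow.start_run():
#       mlflow.log_params(data_fingerprint("transcripts.csv", "s3://<bucket>/transcripts.csv"))
```

Logging the checksum next to the model makes it possible to check later whether a given run was trained on the same data.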


Setting up a training job

Whenever you are done with your model development, you will need to organize your project in a productionizable way.

The most basic component is the MLproject file. There are multiple options to package your project: Docker, Conda, or bespoke. We will use Conda for its simplicity in this context.


name: OmdenaPTSD

conda_env: conda.yaml

entry_points:
  main:
    # the script name after "python" is cut off in the source
    command: "python"


The entry point runs the command that should be used when running the project, in this case, a training file.

The conda file contains a name and the dependencies to be used in the project:


name: omdenaptsd-backend
channels:
  - defaults
  - anaconda
dependencies:
  - python==3.6
  - scikit-learn=0.19.1
  - pip:
    - mlflow>=1.1


At this point you just need to run the project with the mlflow run command.


Setting up the REST API classifier backend

To set up a REST classifier backend you don’t need any job setup. You can use a persisted model from a Jupyter notebook.

To run a model you just need to run the models serve command with the URI of the saved artifact:


mlflow models serve -m runs://0/104dea9ea3d14dd08c9f886f31dd07db/LogisticRegressionPipeline
2019/09/01 18:16:49 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2019/09/01 18:16:52 INFO mlflow.pyfunc.backend: === Running command 'source activate 
mlflow-483ff163345a1c89dcd10599b1396df919493fb2 1>&2 && gunicorn --timeout 60 -b -w 1 mlflow.pyfunc.scoring_server.wsgi:app'
[2019-09-01 18:16:52 +0200] [7460] [INFO] Starting gunicorn 19.9.0
[2019-09-01 18:16:52 +0200] [7460] [INFO] Listening at: (7460)
[2019-09-01 18:16:52 +0200] [7460] [INFO] Using worker: sync
[2019-09-01 18:16:52 +0200] [7466] [INFO] Booting worker with pid: 7466


With that, a scalable backend server (running gunicorn) is ready without any code apart from training your model and logging the artifact following the MLFlow packaging strategy. It frees machine learning engineering teams that want to iterate fast from the cumbersome initial infrastructure work of setting up a repetitive, uninteresting boilerplate prediction API.

You can immediately start launching predictions to your server by:


curl -H 'Content-Type: application/json' -d 
'{"columns":["text"],"data":[[" concatenated text of the transcript"]]}'


The smart thing here is that the MLFlow scoring module uses the Sklearn model input (pandas schema) as a spec for the REST API. Sklearn was the example used here, but MLFlow also has bindings for H2O, Spark, Keras, TensorFlow, ONNX, PyTorch, etc. It basically infers the input from the model packaging format and offloads the data to the scoring function. It’s a very neat software engineering approach to a problem faced every day by machine learning teams, freeing engineers and scientists to innovate instead of working on repetitive boilerplate code.
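For a frontend written in Python, the same request could be built as in the sketch below. The helper names `build_payload` and `score` are hypothetical, and the host, port and path in the usage comment are placeholders that depend on how the server was started. The body mirrors the split-oriented JSON used in this article; newer MLflow releases may expect a different envelope, so treat the exact format as an assumption to check against your version.

```python
import json
from urllib import request


def build_payload(transcripts):
    """Build a split-oriented JSON body with a single `text` column."""
    return json.dumps({"columns": ["text"], "data": [[t] for t in transcripts]})


def score(url, transcripts):
    """POST the payload to a scoring endpoint and return the decoded JSON response."""
    req = request.Request(
        url,
        data=build_payload(transcripts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage; the address depends on how the server was started:
#   score("http://<host>:<port>/invocations", ["concatenated text of the transcript"])
```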

Going back to the Omdena challenge, this backend is available for the frontend team to connect the chatbot app to the risk classifier at the most convenient point (most likely after a critical mass of open-ended questions).



More About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

Using AI To Prevent Gang Violence via Analyzing Tweets

Applying machine learning to understand gang language and detect threatening tweets related to gang violence.


The problem

“Some believe that bolstering school security will deter violence, but this reactionary measure only addresses part of the problem. Instead, we must identify threats, mitigate risk, and protect children and staff before an act of violence occurs.” — Jeniffer Peters, Founder of Voice4Impact (Project Partner)

Chicago is considered the most gang-infested city in the United States, with a population of over 100,000 active members from nearly 60 factions. Gang warfare and retaliation are common in Chicago. In 2020 Chicago has seen a 43% rise in killings so far compared to 2019.



The solution

It was noticed that gangs often use Twitter to communicate with fellow gang members as well as to threaten members of other gangs. Gang language is a mixture of icons and gang terms.


Sample Gang Language


The project team split the work into two parts:

  • Implement a machine-learning algorithm to understand gang language and detect threatening tweets related to gang violence.
  • Find the correlation between threatening tweets and actual gang violence.


Part 1: Detecting violent gang language and influential members

The goal was to classify tweets as threatening or non-threatening so that the threatening ones can be routed to intervention specialists who will then decide what action to take.


Step 1: Labeling tweets collaboratively

First, a tool was created to label tweets faster and train the machine learning model. We were only provided the raw tweets. Searching the web, we found LightTag, which is a product designed for exactly this but it is a paid product once you exceed the comically low number of free labels.

We needed a simpler solution that does everything we need, and nothing else. So, we turned to a trusted old friend: Google Spreadsheets. A custom Google Spreadsheet was made (the template is publicly available here). It features a scoreboard, so labelers get credit for their contribution, and a mechanism to have at least two people label each tweet to ensure the quality of labels.




To ensure the quality of our labels, we decided we need at least two labels on every tweet, and if they are not the same, a third label would be required to break the tie. Row color-coding makes it easy to see which rows are finished. If the row has been labeled once, it will be colored green. If the row has been labeled twice and the two labels do not agree, it will be colored red. The scoreboard page also shows, for each page, a count of how many tweets are labeled once, labeled twice with conflicting labels, and finished.
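The labeling rule can be written down as a small function. This is only an illustration of the logic described above, not the spreadsheet formulas the team used; the function name `row_status` and the status strings are made up for the example.

```python
from collections import Counter


def row_status(labels):
    """Return (status, final_label) for the labels collected so far on one tweet."""
    if not labels:
        return "unlabeled", None
    if len(labels) == 1:
        return "needs second label", None      # shown green in the sheet
    if len(labels) == 2:
        if labels[0] == labels[1]:
            return "finished", labels[0]
        return "needs tie-break", None          # shown red in the sheet
    # Three or more labels: the majority decides
    label, votes = Counter(labels).most_common(1)[0]
    if votes > len(labels) / 2:
        return "finished", label
    return "needs tie-break", None
```

With binary labels, a third label always produces a majority, which is why one tie-breaker is enough.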


Step 2: Sentiment analysis (with probability value) of tweets being violent

The sentiment analysis team built a machine learning model to predict whether the tweets are threatening or non-threatening. But first, we needed to address the challenges of an imbalanced dataset, where over 90% of the tweet feed was non-threatening, and the small size of the labeled dataset. We tested multiple techniques, including loss functions specifically designed for imbalanced datasets, undersampling, transfer learning from existing word embedding algorithms, and ensemble models. We then combined the reservoir of violent signal words to come up with a probability value for each tweet (the probability that a tweet is prone to using violent words).
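Of the techniques listed, undersampling is the simplest to illustrate. The sketch below randomly drops majority-class rows until the classes are balanced. It assumes each row is a dict with a `label` key; the function name and the fixed seed are choices made for the example, and the team's actual sampling code is not shown in this article.

```python
import random


def undersample(rows, label_key="label", seed=42):
    """Randomly drop majority-class rows until every class has as many rows as the rarest one."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced
```

The trade-off is that most non-threatening tweets are thrown away, which hurts when the labeled dataset is already small. That is one reason to compare it against weighted loss functions.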


Step 3: Detect influential members in the twitter gang network

Next, we wanted to identify the influential members of the network. A network analysis produced a directed graph, and the Girvan-Newman algorithm was used to detect the communities in the network. Using the PageRank value of each node, the influential members were identified.


5 steps to build an effective network analysis of tweets

1. Using Python’s NetworkX, a graph was created from the mentions and authors of the tweets.

Network Analysis Gang Violence

Network analysis


A detailed article on the Network analysis.

The nodes represent mentions in a tweet or the author of a tweet. Edge A → B means B was mentioned in the tweet posted by A.

2. Thousands of tweets were used to create a directed graph, and the Girvan-Newman algorithm was used to detect the communities in the network. Also, using the PageRank value of each node, the influential members in the network could be identified. This value is not crucial to the network analysis but can be useful if one tries to track any gang member who is influential in the network.

3. The members in the communities are either authors or mentions. So, the tweets were then tagged with the community number based on the mention or author names.

4. The total number of signal keywords in all the communities was calculated and so was the total number of signal words for individual communities.

5. The final result was a dataset of tweets that had the community tag and the probability of using violent words, based on the usage of signal words within the community relative to all the communities. For example, in the picture below, members from Community 1 who are authors or mentions in the tweets are more likely to be inclined towards using violent keywords. So, the tweets which contain authors/mentions from this community are contextually more violent.
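Steps 4 and 5 can be sketched as a share computation: count signal-word occurrences per community and divide by the total across all communities. The field names `text` and `community`, the whitespace tokenization, and the function name are assumptions for this example; the exact formula the team used is not given here.

```python
def community_signal_share(tweets, signal_words):
    """Map each community to its share of all signal-word occurrences."""
    counts = {}
    for tweet in tweets:
        words = tweet["text"].lower().split()
        hits = sum(1 for word in words if word in signal_words)
        counts[tweet["community"]] = counts.get(tweet["community"], 0) + hits
    total = sum(counts.values())
    if total == 0:
        return {community: 0.0 for community in counts}
    return {community: hits / total for community, hits in counts.items()}
```

Each tweet then inherits the share of the community its author or mentions belong to.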



Also, the network analysis can give an insight into which members are more influential within the community. One can get a notion by looking at the PageRank values of the members of the community. The greater the PageRank, the more influential a member is.
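The project used NetworkX for this step. To show what the PageRank values mean without that dependency, here is a plain-Python power-iteration sketch over the mention graph. The `author` and `mentions` field names, the damping factor of 0.85, and the fixed iteration count are assumptions for the example.

```python
def mention_edges(tweets):
    """Yield (author, mentioned) pairs: edge A -> B means A's tweet mentions B."""
    for tweet in tweets:
        for mentioned in tweet["mentions"]:
            yield tweet["author"], mentioned


def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph given as (source, target) pairs."""
    out_links = {}
    nodes = set()
    for source, target in edges:
        out_links.setdefault(source, set()).add(target)
        nodes.update((source, target))
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if node not in out_links)
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in out_links.items():
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

Accounts that are mentioned by many others, or by a few highly ranked ones, end up with the largest values.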


Page Rank vs Gang Member


Part 2: Correlation between actual violence and tweets

Next, we wanted to understand whether there is any correlation between actual crimes and mentions of ‘Gun’ in threatening tweets.

Below is the correlation between the two metrics on the same day, 1-day, and 2-day shift.


Same day


1-day shift


2-day shift


Through this analysis, we can see that there is a correlation between the number of crimes and mentions of a gun in threatening tweets with a 2-day shift. This can be very useful for authorities to prevent gang violence.
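A shifted correlation of this kind can be computed as in the sketch below, which pairs the tweet count on day t with the crime count on day t + shift and takes the Pearson coefficient. It assumes two equally long daily series in chronological order; the function names are illustrative and the project's own analysis code is not reproduced here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lagged_correlation(tweet_counts, crime_counts, shift):
    """Correlate tweet counts on day t with crime counts on day t + shift."""
    if shift == 0:
        return pearson(tweet_counts, crime_counts)
    return pearson(tweet_counts[:-shift], crime_counts[shift:])
```

Comparing the coefficient for shifts of 0, 1 and 2 days shows at which delay the two series line up best. A high value indicates association, not that tweets cause the crimes.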



An Attempt to Identify Cybersex Crimes through Artificial Intelligence

Classifying the online chats between two persons as sexual abuse or non-sexual abuse using text mining and deep learning.


The problem

The vast growth in technology and social media has revolutionized our lives by transforming the way we connect and communicate. Unfortunately, the darker side of this development has exposed many children and teenagers of various ages to online sexual abuse.

To help combat the severity of the problem, I joined an Omdena project together with the Zero Abuse Project. Among 45 Omdena collaborators from across 6 continents, the goal was to build AI models to identify patterns in the behavior of institutions when they cover up incidents of sexual abuse.

The identification and analysis of sexual crimes helps protect public safety and has been made possible by leveraging AI. Natural Language Processing and various machine learning techniques have played a major role in the successful identification of online sexual abuse.


The solution

The main idea of this task was to classify online chats between two persons as sexual abuse or non-sexual abuse. We planned to implement this using text mining and deep learning techniques such as LSTM-RNN. In the following example, our idea aimed at classifying the chats as predatory or non-predatory.


Classifying online chats 

We have used the open-source PAN2012 dataset provided in the context of the Sexual Predator Identification (SPI) Task in 2012 initiated by PAN (Plagiarism analysis, authorship identification, and near-duplicate detection) lab. However, the realistic data provided by PAN has a high noise level with unbalanced training samples and varying length of conversations.

The challenging part of this dataset was in changing the chat text abbreviations and cyber slang texts such as “u” for “you”, “ur” for “your” and “l8r” for “later”. Such words are necessary for feature selection and for improving the performance of the model used for the classification.


Wait, are we stuck with preprocessing?

Initially, with the huge dataset and high noise levels, preprocessing did seem like a herculean task! Preprocessing is often said to take around 80% of the time in a project like this, and it is needed to achieve the best results. We managed to implement it by using text mining techniques. We started off with a basic analysis: checking for null characters, finding the sentence length of each text message, and finding the words with the highest frequencies. We also implemented the removal of stopwords, stemming, and lemmatization. The aim of both stemming and lemmatization is to reduce the corpus size and complexity so that embeddings are created from simpler words, which is useful for sentiment analysis. Stopwords are words that are omitted since they do not provide value for the machine’s understanding of texts.

Furthermore, we realized our dataset contained loads of emojis, URLs, hashtags, misspelled words, and slang. In order to reduce the noise levels further, we had to remove the emoticons from the chats using regular expressions and correct the misspelled words by creating a dictionary. The tricky part here involved converting the chat slang abbreviations, since it was necessary for feature selection. Unfortunately, it was difficult to find a library or database of words that does that, so we had to create a dictionary for that purpose.

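slang_dict = {"aren't": "are not", "can't": "cannot", "couldn't": "could not", "didn't": "did not",
              "doesn't": "does not", "don't": "do not", "hadn't": "had not"}  # ...... further entries elided in the source

def process_data(data):
    # Assumed body (missing in the source): swap each known token for its expansion
    data = " ".join(slang_dict.get(word, word) for word in data.lower().split())
    return data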
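The noise-removal steps described above could look roughly like the following sketch. The regular expressions and the `clean_chat` name are illustrative assumptions. The abbreviation table only contains the three examples quoted earlier in this article, and the emoticon pattern covers a handful of text emoticons rather than Unicode emojis.

```python
import re

# Abbreviations quoted in the article; a real dictionary would be far larger
CHAT_ABBREVIATIONS = {"u": "you", "ur": "your", "l8r": "later"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
HASHTAG_PATTERN = re.compile(r"#\w+")
# Simple text emoticons such as :) ;-( :D (Unicode emojis would need extra ranges)
EMOTICON_PATTERN = re.compile(r"[:;=][-']?[)(DPp]")


def clean_chat(text):
    """Strip URLs, hashtags and emoticons, then expand known abbreviations."""
    text = URL_PATTERN.sub(" ", text)
    text = HASHTAG_PATTERN.sub(" ", text)
    text = EMOTICON_PATTERN.sub(" ", text)
    # Lowercase last so that emoticons like :D are matched before case is lost
    words = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(CHAT_ABBREVIATIONS.get(word, word) for word in words)
```

For example, `clean_chat("c u l8r :) http://example.com #fun")` returns `"c you later"`.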


The Exploratory Data Analysis

We further analyzed the top 20 most frequent words in the chatlogs as unigrams and bigrams. A unigram is an n-gram consisting of a single word from a sequence, and a bigram consists of two consecutive words from a sequence.
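A frequency count of this kind needs only the standard library. The sketch below assumes the chat messages are already cleaned and splits them on whitespace; the function name `top_ngrams` is made up for the example.

```python
from collections import Counter


def top_ngrams(messages, n=1, k=20):
    """Return the k most frequent word n-grams across all messages."""
    counts = Counter()
    for message in messages:
        words = message.lower().split()
        # Slide a window of n words over the message
        counts.update(zip(*(words[i:] for i in range(n))))
    return [(" ".join(gram), count) for gram, count in counts.most_common(k)]
```

Calling it with `n=1` gives the unigram ranking and with `n=2` the bigram ranking.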


Top 20 Unigrams


From the analysis, we inferred that words such as “age”, “sex”, “hi”, etc. were very frequently used in the chatlogs.

Moving into the language model and classification

The XML dataset provided by PAN2012 is unlabelled and manual labeling is a pretty difficult task considering the number of samples present in the dataset. To solve this situation, sentiment analysis was carried out to identify the polarity and subjectivity of the chatlogs. Polarity is a float which lies in the range of [-1,1] where 1 means positive statement and -1 means a negative statement. Subjective sentences generally refer to personal opinion, emotion, or judgment whereas objective refers to factual information. Subjectivity is also a float which lies in the range of [0,1].

Considering the different number of sentences in conversations (from 1 to more than 500), the extra-long conversations were padded by zeros and then split into parts, each with an equal length of 100. This strategy is helpful to prevent underfitting in the LSTM-RNN model when processing long conversations. These tokenized words were converted into word embeddings using the GloVe pre-trained model, to be fed into the LSTM-RNN classifier.
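One way to read the padding-and-splitting step is sketched below: a sequence of token ids is cut into parts of 100, and the last part is filled with zeros. This is an interpretation of the description above, not the project's code, and the function name and defaults are assumptions.

```python
def split_and_pad(token_ids, length=100, pad_value=0):
    """Cut a token-id sequence into parts of `length`, zero-padding the last part."""
    parts = []
    for start in range(0, len(token_ids), length):
        part = list(token_ids[start:start + length])
        part.extend([pad_value] * (length - len(part)))
        parts.append(part)
    return parts
```

A conversation of 250 tokens becomes three parts of 100, the last of which ends in 50 zeros.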

GloVe stands for global vectors for word representation. It is an unsupervised learning algorithm developed by Stanford for generating word embeddings by aggregating a global word-word co-occurrence matrix from a corpus.

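import numpy as np

# Embedding matrix: row i holds the 100-dimensional GloVe vector of the word with index i
def fill_embedding_matrix(tokenizer):
    vocab_size = len(tokenizer.word_index)
    # Index 0 is reserved for padding, hence vocab_size + 1 rows
    embedding_matrix = np.zeros((vocab_size + 1, 100))
    for word, i in tokenizer.word_index.items():
        # embeddings_index maps word -> GloVe vector and is assumed to be loaded beforehand
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector  # words not in GloVe stay all-zero
    return embedding_matrix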

The architecture of the LSTM-RNN classifier 


Each word embedding is fed into the binary LSTM-RNN classifier. It consists of one embedding layer, two LSTM-RNN layers with 200 units and 50 timesteps, and a sigmoid layer for the binary classification, implemented in the TensorFlow framework. The results could have been improved if the chatlogs could be labeled more efficiently and if the remaining noise in the dataset could be reduced. However, this task of classifying sexual predators gave us a clearer perspective on efficient feature selection and on new approaches to solving the labeling problem in order to improve the accuracy of the LSTM-RNN classification.

