Building An Elevation Map For Forest Cover using Deep Learning

The Problem: How to build an Elevation Map?

Now, how do we get to the elevation map? And what do we use it for?

Those were the main questions we asked ourselves as we set out to apply deep learning to the understanding of topographical maps.

First, what is an Elevation Map anyway?

 

 

An elevation map shows the various elevations in a region. Elevation is shown using contour lines (level curves), bands of the same color (produced through imagery processing), or numerical values giving the exact elevation. Elevation maps are generally known as topographical maps.

 

The Solution: Generating Elevation Maps

 

Diagram of Elevation Map Process

 

What needs to be created?

  • A Digital Elevation Model (DEM).

 

Digital Elevation Model (example)

 

A Digital Elevation Model is a specialized database that represents the relief of a surface between points of known elevation. It is the digital representation of the land surface elevation.

 

  • Level Curves (contour lines).

 

 

Contour lines are the most common method of showing relief and elevation on a standard topographical map. A contour line represents an imaginary line on the ground at a constant elevation above or below sea level. Contour lines form closed loops (or run off the edge of the map); the inside of a loop is the top of a hill.

We worked with the DEM to create the contour lines using open-source GIS software, in this case QGIS with a plugin called “Contour”, which uses the DEM’s elevations to define the level curves and produce a contour-line model of the study area. The plugin lets you set the distance between level curves; here we used two meters.
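
For readers who want to reproduce this step outside QGIS, here is a minimal Python sketch of the same idea using rasterio and matplotlib; the path "dem.tif" is a placeholder, and the QGIS plugin itself works differently under the hood.

import numpy as np
import matplotlib.pyplot as plt
import rasterio

# Derive level curves from DEM elevations at a fixed interval
# (2 m here, matching the project).
with rasterio.open("dem.tif") as src:   # hypothetical DEM path
    dem = src.read(1).astype(float)

levels = np.arange(np.nanmin(dem), np.nanmax(dem), 2.0)  # one curve per 2 m
plt.contour(dem, levels=levels, colors="black", linewidths=0.5)
plt.gca().invert_yaxis()  # raster row 0 is the top of the map
plt.savefig("contours.png", dpi=200)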

 

DEM converted into Contours.

 

  • Triangulated Irregular Network (TIN).

 

 

 

Next, we need a Triangulated Irregular Network (TIN) with vector-based lines and three-dimensional coordinates (x, y, z). The TIN model represents a surface as a set of contiguous, non-overlapping triangles.

We applied computational geometry (Delaunay Triangulation) to create the TIN.

Delaunay triangulations are widely used in scientific computing in many diverse applications. While there are numerous algorithms for computing triangulations, it is the favorable geometric properties of the Delaunay triangulation that make it so useful.

For modeling terrain or other objects with a set of sample points, the Delaunay Triangulation gives a good set of triangles to use as polygons in the model. In particular, the Delaunay Triangulation avoids narrow triangles (as they have large circumcircles compared to their area).
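
A minimal SciPy sketch of this step, using synthetic sample points in place of the real survey data:

import numpy as np
from scipy.spatial import Delaunay

# Toy sample points: (x, y) locations with a synthetic elevation z.
rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(500, 2))
z = np.sin(xy[:, 0] / 10.0) * 5.0 + xy[:, 1] / 20.0

tin = Delaunay(xy)            # Delaunay triangulation of the sample points
triangles = tin.simplices     # (n_triangles, 3) indices into xy
corner_z = z[triangles]       # elevation at each triangle corner
print(triangles.shape, corner_z.shape)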

 

  • Digital Terrain Model (DTM).

 

 

 

A DTM is a vector dataset composed of regularly spaced points and natural features such as ridges and break lines. A DTM augments a DEM by including linear features of the bare earth terrain.

DTMs are typically created through stereophotogrammetry, but in this case we downloaded a point cloud of the terrain.

The points are LiDAR points: a point cloud is a collection of points that represents a 3D shape or feature, where each point has its own X, Y, and Z coordinates and, in some cases, additional attributes. We converted this point cloud into a DTM using open-source GIS software.
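
A minimal sketch of gridding such a point cloud into a DTM raster with SciPy; the synthetic points below stand in for real LiDAR ground returns:

import numpy as np
from scipy.interpolate import griddata

# Toy stand-in for LiDAR ground returns: x, y coordinates and elevation z.
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 2000)
y = rng.uniform(0, 100, 2000)
z = 200 + x / 10 + np.sin(y / 15) * 3

# Interpolate the scattered points onto a regular 256 x 256 grid.
xi = np.linspace(0, 100, 256)
grid_x, grid_y = np.meshgrid(xi, xi)
dtm = griddata((x, y), z, (grid_x, grid_y), method="linear")
print(dtm.shape)  # (256, 256) elevation raster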

 

The Results

After applying the previous techniques in the context of identifying trees, we got the following results.

 

  • Our Digital Elevation Model

 

Digital Elevation Model of the Study Area

 

  • The Contour Lines

 

Level Curves of the Study Area

 

  • Our Triangulated Irregular Network

 

TIN of the Study Area

 

  • The Digital Terrain Model

 

DTM of the Study Area

 

 

These results are part of an overall solution to identify trees close to power stations. They allow us to determine the elevations of the forest cover as well as the aspect of the land, whether for geological purposes (determination of land use), forestry (control of protected natural areas), or fire prevention in high-risk areas (anthropic causes).

The elevation map can greatly help public agencies perform spatial analysis, manage large amounts of spatial data, and produce cartographically appealing maps to aid decision making. It improves responders’ efficiency and performance by giving them rapid access to critical data during an incident.

 

 

More About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up global collaboration.

Enhancing Satellite Imagery Through Super Resolution

By James Tan

 

The power of deep learning paired with collaborative human intelligence to support crop-field mapping through super resolution.

 

The Problem

In order to accurately locate crop fields in satellite imagery, images of a certain quality are required. Although deep learning is famous for pulling off miracles, we human beings will have a very hard time labeling the data if we cannot clearly make out the details within an image.

The following is an example of what we would like to achieve.

 

Semantic segmentation example (source: https://www.frontiersin.org/articles/10.3389/feart.2017.00017/full)

 

 

If we can clearly identify areas within a satellite image that correspond to a particular crop, we can easily extend our work to evaluate the area of cultivation of each crop, which will go a long way in ensuring food security.

 

The Solution

To get our hands on the required data, we explored a myriad of sources of satellite imagery. We ultimately settled on Sentinel-2, largely because that mission offers the best image quality among open-source options.

 

 

Is the best good enough?

 

Original Image

 

 

Despite my previous assertion that I am not an expert in satellite imagery, I believe that, having seen the above image, we can all agree its quality is not quite up to scratch.

A real nightmare to label!

It is completely unfeasible to discern the demarcations between individual crop fields.

Of course, this isn’t completely unreasonable for open-source data. Satellite image vendors have to be especially careful when it comes to the distribution of such data due to privacy concerns.

How outrageous it would be if simply anyone could look up what our backyards look like on the Internet, right?

However, this inconvenience comes at a great detriment to our project. In order to clearly identify and label crops in an image that is relevant to us, we would require images of much higher quality than what we have.

Super-Resolution

Deep learning practitioners love to apply what they know to solve the problems they face. You probably know where I am going with this. If the quality of an image isn’t good enough, we try to enhance it, of course! This is the process we call super-resolution.

 

Deep Image Prior

This is one of the first things that we tried, and here are the results.

 

Results of applying Deep Image Prior to the original image.

 

Quite noticeably there has been some improvement: the model has done an amazing job of smoothing out the rough edges in the photo. The pixelation problem has been pretty much taken care of and everything blends in well.

However, in doing so the model has neglected finer details and that leads to an image that feels out of focus.

 

Decrappify

Naturally, we wouldn’t stop until we got something completely satisfactory, which led us to try this instead.

 

Results of applying Decrappify to the original image.

 

 

Now, it is quite obvious that this model has done something completely different from Deep Image Prior. Instead of attempting to ensure that the pixels blend in with each other, this model places great emphasis on refining each individual pixel. In doing so it neglects how each pixel relates to its surrounding pixels.

While it succeeds in injecting some life into the original image by making the colors more refined and attractive, the pixelation in the image remains an issue.

 

The Results

 

Results of running the original image through Deep Image Prior and then Decrappify.

 

When we first saw this, we couldn’t believe what we were seeing. We have come such a long way from our original image! And to think that the approach taken to achieve such results was such a silly one.

Since each of the previous two models was no good individually, but each was clearly good at getting different things done, what if we combined the two of them?

So we ran the original image through Deep Image Prior, and subsequently fed the results of that through the Decrappify model, and voila!

Relative to the original image, the colors of the current image look incredibly realistic. The lucid demarcations of the crop fields will certainly go a long way in helping us label our data.

 

Our Methodology

The way we pulled this off was embarrassingly simple. We used the Deep Image Prior implementation found in its official GitHub repository. As for Decrappify, given our objectives, we figured that training it on satellite images would definitely help. With the two models set up, it’s just a matter of feeding images through them one after the other.

 

A Quick Look at the Models

For those of you who have made it this far and are curious about what the models actually are, here’s a brief overview.

Deep Image Prior

This method hardly conforms to conventional deep learning-based super-resolution approaches.

Typically, we would create a dataset of low- and high-resolution image pairs, and then train a model to map a low-resolution image to its high-resolution counterpart. This particular model does none of the above and, as a result, does not have to be pre-trained prior to inference time. Instead, a randomly initialized deep neural network is trained on one particular image. That image could be one of your favorite sports stars, a picture of your pet, a painting that you like, or even random noise. Its task is then to optimize its parameters to map the input image to the image that we are trying to super-resolve. In other words, we are training our network to overfit to our low-resolution image.

Why does this make sense?

It turns out that the structure of deep networks imposes a ‘naturalness prior’ over the generated image. Quite simply, this means that when overfitting to (memorizing) an image, deep networks prefer to learn the natural, smoother concepts before moving on to the unnatural ones. That is to say, the convolutional neural network (CNN) will first ‘identify’ the colors that form shapes in various parts of the image and then proceed to materialize various textures. As the optimization process goes on, the CNN latches onto finer details.

When generating an image, neural networks prefer natural-looking images as opposed to pixelated ones. Thus, we start the optimization process and allow it to continue to the point where it has captured most of the relevant details but has not learned any of the pixelation and noise. For super-resolution, we train it to the point where the image it creates closely resembles the original image when both are downsampled. There exist multiple super-resolution images that could have produced each low-resolution image.

And as it turns out, the most plausible image is also the one that doesn’t appear highly pixelated; again, this is because the structure of deep networks imposes a ‘naturalness prior’ on generated images.
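
To make the procedure concrete, here is a toy PyTorch sketch of the Deep Image Prior loop for 4x super-resolution; the tiny network and the random tensor standing in for the low-resolution tile are illustrative assumptions, not the paper's exact setup:

import torch
import torch.nn as nn
import torch.nn.functional as F

lr_img = torch.rand(1, 3, 64, 64)    # stand-in for the real low-res tile
scale = 4

# Randomly initialised CNN; the paper uses a much deeper encoder-decoder.
net = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
)

# Fixed noise input at the target (high) resolution.
z = torch.randn(1, 32, 64 * scale, 64 * scale)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(200):              # stop early, before noise is memorised
    sr = net(z)                      # candidate high-resolution image
    down = F.interpolate(sr, size=lr_img.shape[-2:], mode="bilinear",
                         align_corners=False)
    loss = F.mse_loss(down, lr_img)  # must match the input when downsampled
    opt.zero_grad()
    loss.backward()
    opt.step()

sr_final = net(z).detach()           # the super-resolved result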

We highly recommend this talk by Dmitry Ulyanov (who was the main author of the Deep Image Prior paper) to understand the above concepts in depth.


The super-resolution process of Deep Image Prior

 

 
Decrappify

In contrast with the previous model, here the aim is to learn as much as possible about satellite images. As a result, when we give it a low-quality image as input, the model can bridge the gap between the low- and high-quality versions by using its knowledge of the world to fill in the blanks.

The model has a U-Net architecture with a pre-trained ResNet backbone. But the really interesting part is the loss function, which has been adapted from this paper. The objective is to produce an output image of higher quality such that, when fed through a pre-trained VGG16 model, it yields minimal ‘style’ and ‘content’ loss relative to the ground-truth image. The ‘style’ loss is relevant because we want the super-resolved image to have a texture that is realistic for a satellite image. The ‘content’ loss encourages the model to recreate intricate details in its higher-quality output.
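
Here is a PyTorch sketch of such a VGG16-based 'content' plus 'style' (Gram matrix) loss; the layer cut-off and the equal weighting of the two terms are assumptions for illustration, not Decrappify's exact configuration:

import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen VGG16 feature extractor (early convolutional layers).
vgg = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def gram(feat):
    # Channel-by-channel feature statistics, used to compare 'style'.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def perceptual_loss(output, target):
    fo, ft = vgg(output), vgg(target)
    content = F.mse_loss(fo, ft)            # recreate intricate details
    style = F.mse_loss(gram(fo), gram(ft))  # realistic satellite texture
    return content + style

# Example: loss between a random 'prediction' and 'ground truth'.
print(perceptual_loss(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)))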

 

 

 


Satellite Image Analysis to Identify Trees and Prevent Fires

The project goal was to build a machine learning model for tree identification on satellite images. The solution will help prevent power outages and fires sparked by falling trees and storms. This will save lives, reduce CO2 emissions, and improve infrastructure inspection. The project was hosted by the Swedish AI startup Spacept.

 

Four weeks ago, 35 AI experts and data scientists from 16 countries came together through the Omdena platform. The community participants formed self-organized task groups, and each task group picked up a part of the problem or an approach to solving the challenge.

 

Forming the Task Groups

Omdena’s platform is a self-organized learning environment, and after the first kick-off call, the collaborators started to organize into task groups. Below are screenshots of some of the discussions that took place in the first days of the project.

 


Task Group 1: Labeling

We labeled over 1,000 images. A large group of people makes labeling not only faster but also more accurate, thanks to our peer-to-peer review process.

Active collaborators: Leonardo Sanchez (Task Manager, Brazil), Arafat Bin Hossain (Bangladesh), Sim Keng Ying (Singapore), Alejandro Bautista Ramos (Mexico), Santosh Kumar Pydipalli (India), Gerardo Duran (Mexico), Annie Tran (USA), Steven Parra Giraldo (Colombia), Bishwa Karki (Nepal), Isaac Rodríguez Bribiesca (Mexico).

 

Labeled images

Task Group 2: Generating images through GANs

Given a training set, GANs can be used to generate new data with the same features as the training set.

Active Participants: Santiago Hincapie-Potes (Task Manager, Colombia), Amit Singh (Task Manager for DCGAN, India), Ramon Ontiveros (Mexico), Steven Parra Giraldo (Colombia), Isaac Rodríguez (Mexico), Rafael Villca (Bolivia), Bishwa Karki (Nepal).

 

Output from GAN

Task Group 3: Generating elevation model

The task group used a Digital Elevation Model and a Triangulated Irregular Network. Knowing the elevation of the land, as well as of the trees, helps us assess the potential risk trees pose to overhead cables.

Active Participants: Gabriel Garcia Ojeda (Mexico)

 

 

Task Group 4: Sharpening the images

A set of image-processing routines was built, different combinations of filters were tried, and a basic pipeline was implemented to automate testing the combinations, all in order to preprocess the labeled images and achieve more accurate results with the AI models.

Active Participants: Lukasz Kaczmarek (Task Manager, Poland), Cristian Vargas (Mexico), Rodolfo Ferro (Mexico), Ramon Ontiveros (Mexico).

 

Output after sharpening

Task Group 5: Detecting trees with a Mask R-CNN model

Mask R-CNN was built by the Facebook AI Research team. The model generates a set of bounding boxes that possibly contain trees; a second step colors each detection based on its confidence.

Active Participants: Kathelyn Zelaya (Task Manager, USA), Annie Tran (USA), Shubhajit Das (India), Shafie Mukhre (USA).

 

Mask R-CNN output

Task Group 6: Detecting trees through U-Net and Deep U-Net model

U-Net was initially used for biomedical image segmentation, but because of the good results it achieves, it is now applied to a variety of other tasks. It is one of the best network architectures for image segmentation. We applied the same architecture to identifying trees and got very encouraging results, even when training with fewer than 50 images.

Active Participants: Pawel Pisarski (Task Manager, Canada), Arafat Bin Hossain (Bangladesh), Rodolfo Ferro (Mexico), Juan Manuel Ciro Torre (Colombia), Leonardo Sanchez (Brazil).

The U-Net consists of a contracting path and an expansive path, which gives it its U shape. The contracting path is a typical convolutional network consisting of repeated applications of convolutions, each followed by a rectified linear unit (ReLU) and a max-pooling operation.
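
To make the contracting/expansive structure concrete, here is a minimal one-level Keras sketch; the real network repeats this pattern several levels deep, and the input size and channel counts are illustrative, not the exact model the team trained:

from tensorflow.keras import layers, Model

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

inputs = layers.Input((512, 512, 3))
c1 = conv_block(inputs, 64)              # contracting path: conv + ReLU
p1 = layers.MaxPooling2D()(c1)           # ... followed by max-pooling
c2 = conv_block(p1, 128)
u1 = layers.UpSampling2D()(c2)           # expansive path
u1 = layers.concatenate([u1, c1])        # the U-connection (skip)
c3 = conv_block(u1, 64)
outputs = layers.Conv2D(1, 1, activation="sigmoid")(c3)  # tree mask

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])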

 

U-Net architecture

 

One technique in particular caught our attention: the Deep U-Net. Like the U-Net, the Deep U-Net has both a contracting and an expanding side, and uses U-connections to pass high-resolution features from the contracting side to the upsampled outputs. Additionally, it uses Plus connections to extract information with a smaller loss error.

 

Deep-U-Net Architecture

 


Having discussed the architecture, we applied a basic Deep U-Net solution to the 144 unique labeled images, divided into 119 images and masks for the training set, 22 for the validation set, and 3 for the test set. Since the images and masks were 1,000 x 1,000 pixels, they were cropped into 512 x 512 tiles, yielding 476 image-mask pairs for training, 88 for validation, and 12 for testing. Running the Deep U-Net for 10 epochs with a batch size of 4, using the Adam optimizer and a binary cross-entropy loss on a GeForce GTX 1060 GPU, the results were quite encouraging, reaching 94% accuracy on validation.

 

Model Accuracy and Loss

 


Believing that accuracy could be improved a bit further, we expanded the basic solution with data augmentation. Through rotations, we generated 8 augmented images per original image, giving 3,808 images and masks for the training set and 704 images and masks for the validation set.
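
One plausible reading of the 8-fold augmentation is four 90-degree rotations, each with and without a flip (an assumption; the text only says "rotations"), applied identically to every image and its mask:

import numpy as np

def augment_pair(image, mask):
    pairs = []
    for k in range(4):
        img_r, msk_r = np.rot90(image, k), np.rot90(mask, k)
        pairs.append((img_r, msk_r))                      # rotation only
        pairs.append((np.fliplr(img_r), np.fliplr(msk_r)))  # rotation + flip
    return pairs  # 8 (image, mask) pairs per original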

We reproduced the previous model, keeping the base learning rate at 0.001 but adding a decay inversely proportional to the number of epochs, and increased the number of epochs to 100.

Doing this, we reached more than 95% accuracy, which was above our project partner’s expectations.

The Deep U-Net model learned to distinguish trees in new images very well, even separating shadows within forests from actual trees, reproducing what we humans did during the labeling process with even better consistency.

A few results can be seen below; they were generated on new images the Deep U-Net had never seen before.

 

Lithuania image (the model was trained on Australia with a different landscape)

Predictions over the test set

 


 

Using AI To Prevent Gang Violence via Analyzing Tweets

Applying machine learning to understand gang language and detect threatening tweets related to gang violence.

 

The problem

“Some believe that bolstering school security will deter violence, but this reactionary measure only addresses part of the problem. Instead, we must identify threats, mitigate risk, and protect children and staff before an act of violence occurs.” — Jeniffer Peters, Founder of Voice4Impact (Project Partner)

Chicago is considered the most gang-infested city in the United States, with over 100,000 active members from nearly 60 factions. Gang warfare and retaliation are common in Chicago, which in 2020 has so far seen a 43% rise in killings compared to 2019.

 

 

The solution

We noticed that gangs often use Twitter to communicate with fellow gang members as well as to threaten rival gangs. Gang language is a mixture of icons and gang-specific terms.

 

Sample Gang Language

 

The project team split the work into two parts:

  • Implement a machine learning algorithm to understand gang language and detect threatening tweets related to gang violence.
  • Find the correlation between threatening tweets and actual gang violence.

 

Part 1: Detecting violent gang language and influential members

The goal was to classify tweets as threatening or non-threatening so that the threatening ones can be routed to intervention specialists who will then decide what action to take.

 

Step 1: Labeling tweets collaboratively

First, a tool was created to label tweets quickly in order to train the machine learning model; we had been given only the raw tweets. Searching the web, we found LightTag, a product designed for exactly this, but it is paid once you exceed the comically low number of free labels.

We needed a simpler solution that does everything we need, and nothing else. So we turned to a trusted old friend: Google Spreadsheets. A custom Google Spreadsheet was made (the template is publicly available here). It features a scoreboard, so labelers get credit for their contributions, and a mechanism to have at least two people label each tweet to ensure label quality.

 

 

 

To ensure the quality of our labels, we decided we needed at least two labels on every tweet; if they are not the same, a third label is required to break the tie. Row color-coding makes it easy to see which rows are finished: a row labeled once is colored green, and a row labeled twice with disagreeing labels is colored red. The scoreboard page also shows a count of how many tweets are labeled once, labeled twice with conflicting labels, and finished on each page.

 

Step 2: Sentiment analysis (with probability value) of tweets being violent

The sentiment analysis team built a machine learning model to predict whether tweets are threatening or non-threatening. But first we needed to address two challenges: an imbalanced dataset in which over 90% of the tweet feed was non-threatening, and the small size of the labeled dataset. We tested multiple techniques, including loss functions specifically designed for imbalanced datasets, undersampling, transfer learning from existing word-embedding algorithms, and ensemble models. We then combined this with the reservoir of violent signal words to assign each tweet a probability value (the probability that the tweet leans toward violent language).
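
As a toy illustration of one remedy for the roughly 90/10 imbalance, here is a class-weighted baseline on invented tweets; it is a sketch of the idea, not the team's actual model:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented examples; 1 = threatening (the rare class).
tweets = ["good game last night", "see you at school tomorrow",
          "nice pic bro", "you better watch your back tonight"]
labels = np.array([0, 0, 0, 1])

X = TfidfVectorizer().fit_transform(tweets)
# class_weight="balanced" upweights the rare class in the loss.
clf = LogisticRegression(class_weight="balanced").fit(X, labels)
print(clf.predict_proba(X)[:, 1])  # probability each tweet is threatening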

 

Step 3: Detect influential members in the twitter gang network

Next, we wanted to identify the influential members of the network. A network analysis produced a directed graph; using the Girvan-Newman algorithm, the communities in the network could be detected, and using the PageRank value of each node, the influential members were identified.

 

5 steps to build an effective network analysis of tweets

1. Using Python’s NetworkX, a graph was created from the mentions and authors of the tweets.


Network analysis

 

A detailed article on the Network analysis.

The nodes represent the authors of tweets and the accounts mentioned in them. An edge A → B means B was mentioned in a tweet posted by A.

2. Thousands of tweets were used to create a directed graph, and using the Girvan-Newman algorithm, the communities in the network were detected. Also, using the PageRank value of each node, the influential members in the network could be identified. This value is not crucial to the network analysis but can be useful for tracking any gang member who is influential in the network.

3. The members in the communities are either authors or mentions. So, the tweets were then tagged with the community number based on the mention or author names.

4. The total number of signal keywords in all the communities was calculated and so was the total number of signal words for individual communities.

5. The final result was a dataset of tweets with a community tag and a probability of violent-word usage, based on signal-word usage within the community relative to all communities. For example, in the picture below, members of Community 1 who author tweets or are mentioned in them are more likely to use violent keywords. So tweets containing authors or mentions from this community are contextually more violent.

 

 

The network analysis can also give insight into which members are more influential within a community by looking at the PageRank values of its members: the greater the PageRank, the more influential the member.
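
A toy NetworkX sketch of steps 1 and 2; the node names and edges below are invented, Girvan-Newman runs on the undirected view of the graph, and PageRank on the directed one:

import networkx as nx
from networkx.algorithms.community import girvan_newman

# Edge A -> B means author A mentioned B in a tweet (toy data).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e"), ("e", "f")]
G = nx.DiGraph(edges)

communities = next(girvan_newman(G.to_undirected()))  # first partition
influence = nx.pagerank(G)          # higher value = more influential member
print(communities, max(influence, key=influence.get))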

 

PageRank vs. gang member

 

Part 2: Correlation between actual violence and tweets

Next, we wanted to understand whether there is any correlation between actual crimes and mentions of ‘gun’ in threatening tweets.

Below is the correlation between the two metrics on the same day, 1-day, and 2-day shift.

 

Same day

 

1-day shift

 

2-day shift

 

Through this analysis, we can see a correlation between the number of crimes and gun mentions in threatening tweets at a 2-day shift. This can be very useful for authorities seeking to prevent gang violence.
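
The lag analysis itself is one line per shift in pandas; the daily counts below are invented placeholders for the real series:

import pandas as pd

df = pd.DataFrame({
    "crimes": [5, 7, 6, 9, 4, 8, 10],      # crimes per day (toy values)
    "gun_tweets": [2, 4, 1, 5, 3, 6, 2],   # threatening gun tweets per day
})
for lag in (0, 1, 2):
    # shift(lag) aligns tweets from `lag` days earlier with today's crimes
    r = df["crimes"].corr(df["gun_tweets"].shift(lag))
    print(f"{lag}-day shift: r = {r:.2f}")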

 


 

An Attempt to Identify Cybersex Crimes through Artificial Intelligence

Classifying online chats between two people as sexual abuse or non-sexual abuse using text mining and deep learning.

 

The problem

The vast growth in technology and social media has revolutionized our lives by transforming the way we connect and communicate. Unfortunately, the darker side of this development is that many children and teenagers of various ages have become victims of online sexual abuse.

To help combat the severity of the problem, I joined an Omdena project together with the Zero Abuse Project. Among 45 Omdena collaborators from across 6 continents, the goal was to build AI models that identify patterns in the behavior of institutions when they cover up incidents of sexual abuse.

The identification and analysis of sexual crimes supports public safety and has been made possible by leveraging AI. Natural Language Processing and various machine learning techniques have played a major role in the successful identification of online sexual abuse.

 

The solution

The main idea of this task was to classify online chats between two people as sexual abuse or non-sexual abuse. We planned to implement this using text mining and deep learning techniques such as an LSTM-RNN. In the following example, the aim is to classify the chats as predatory or non-predatory.

 

Classifying online chats 

We used the open-source PAN2012 dataset provided for the Sexual Predator Identification (SPI) task organized in 2012 by the PAN (Plagiarism analysis, Authorship identification, and Near-duplicate detection) lab. However, this realistic data has a high noise level, unbalanced training samples, and conversations of varying length.

The challenging part of this dataset was converting the chat-text abbreviations and cyber slang, such as “u” for “you”, “ur” for “your”, and “l8r” for “later”. Such words are necessary for feature selection and for improving the performance of the classification model.

 

Wait, are we stuck with preprocessing?

Initially, with the huge dataset and high noise levels, preprocessing did seem like a herculean task! Indeed, 80% of the time goes into preprocessing in order to achieve the best results. We implemented it using text mining techniques, starting with a basic analysis: checking for null characters, finding the sentence length of each text message, and finding the words with the highest frequencies. We also removed stopwords and applied stemming and lemmatization. The aim of both stemming and lemmatization is to reduce the corpus size and complexity by mapping words to simpler forms, which is useful for creating embeddings and for sentiment analysis. Stopwords are omitted because they do not provide value for the machine’s understanding of texts.
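
A small sketch of these steps using NLTK (an assumption; any equivalent NLP library works):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords")   # one-time corpus downloads
nltk.download("wordnet")

stop = set(stopwords.words("english"))
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

# Drop stopwords, then reduce the remaining tokens to simpler forms.
tokens = [t for t in "the boys were running away".split() if t not in stop]
print([stemmer.stem(t) for t in tokens])        # ['boy', 'run', 'away']
print([lemmatizer.lemmatize(t) for t in tokens])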

Furthermore, we realized our dataset contained loads of emojis, URLs, hashtags, misspelled words, and slang. To reduce the noise further, we removed the emoticons from the chats using regular expressions and corrected the misspelled words with a dictionary. The tricky part involved converting the chat-slang abbreviations, which was necessary for feature selection. Unfortunately, it was difficult to find a library or word database that does this, so we created a dictionary for that purpose.

slang_dict = {"aren't": "are not", "can't": "cannot", "couldn't": "could not","didn't": "did not","doesn't": "does not",
"don't": "do not","hadn't": "had not"......}
def process_data(data):
   data['text_clean']=data['text_clean'].str.lower()
   data['text_clean']=data['text_clean'].astype(str)
   data.replace(slang_dict,regex=True,inplace=True)
   display(data.head(2))
   return data

 

The Exploratory Data Analysis

We further analyzed the top 20 most frequent words in the chatlogs as unigrams and bigrams. A unigram is an n-gram consisting of a single word from a sequence; a bigram consists of two consecutive words.

 

Top 20 Unigrams

 

From the analysis, we inferred that words such as “age”, “sex”, and “hi” were used very frequently in the chatlogs.
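
A sketch of the frequency analysis with scikit-learn's CountVectorizer, on invented chat lines:

from sklearn.feature_extraction.text import CountVectorizer

chats = ["hi how are you", "hi how old are you", "hi asl please"]  # toy data
for n in (1, 2):                      # n=1: unigrams, n=2: bigrams
    vec = CountVectorizer(ngram_range=(n, n))
    counts = vec.fit_transform(chats).sum(axis=0).A1
    top = sorted(zip(vec.get_feature_names_out(), counts),
                 key=lambda t: -t[1])[:20]
    print(top)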

Moving into the language model and classification

The XML dataset provided by PAN2012 is unlabelled, and manual labeling is a pretty difficult task considering the number of samples in the dataset. To work around this, sentiment analysis was carried out to identify the polarity and subjectivity of the chatlogs. Polarity is a float in the range [-1, 1], where 1 means a positive statement and -1 a negative statement. Subjectivity is a float in the range [0, 1]; subjective sentences generally express personal opinion, emotion, or judgment, whereas objective ones refer to factual information.
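
The polarity and subjectivity ranges described above match TextBlob's sentiment API, so here is a sketch using it (an assumption; the team may have used a different tool):

from textblob import TextBlob

sentiment = TextBlob("hi, how old are you?").sentiment
print(sentiment.polarity)      # in [-1, 1]
print(sentiment.subjectivity)  # in [0, 1]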

Considering the varying number of sentences per conversation (from 1 to more than 500), the extra-long conversations were padded with zeros and then split into parts, each with an equal length of 100. This strategy helps prevent underfitting in the LSTM-RNN model when processing long conversations. The tokenized words were then converted into word embeddings using the pre-trained GloVe model before being fed into the LSTM-RNN classifier.
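
A sketch of the padding and splitting step with Keras' pad_sequences; the token IDs below are toy values:

from tensorflow.keras.preprocessing.sequence import pad_sequences

conversations = [[4, 12, 7], list(range(1, 251))]   # one short, one long
chunks = []
for seq in conversations:
    # Split long conversations into parts of at most 100 tokens.
    chunks += [seq[i:i + 100] for i in range(0, len(seq), 100)]
padded = pad_sequences(chunks, maxlen=100, padding="post")  # zero-pad
print(padded.shape)  # (number of chunks, 100)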

GloVe stands for Global Vectors for word representation. It is an unsupervised learning algorithm developed at Stanford that generates word embeddings by aggregating a global word-word co-occurrence matrix from a corpus.

import numpy as np

# `embeddings_index` maps each word to its 100-d GloVe vector,
# parsed from the pre-trained glove.6B.100d.txt file:
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

def fill_embedding_matrix(tokenizer):
    vocab_size = len(tokenizer.word_index)
    embedding_matrix = np.zeros((vocab_size + 1, 100))  # row 0 = padding
    for word, i in tokenizer.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:  # unknown words stay all-zero
            embedding_matrix[i] = embedding_vector
    return embedding_matrix


The architecture of the LSTM-RNN classifier 

 

Each word embedding is fed into the binary LSTM-RNN classifier. It consists of one embedding layer, two LSTM-RNN layers with 200 units and 50 timesteps, and a sigmoid output layer, implemented in the TensorFlow framework for binary classification. The results could have been improved with more efficient labeling of the chatlogs and by reducing the persistent noise in the dataset. Even so, this task of classifying sexual predators gave us a clearer perspective on efficient feature selection and on new approaches to the labeling problem for improving the accuracy of the LSTM-RNN classification.
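
Below is a hedged Keras sketch of the described architecture; the layer sizes follow the text, while details such as the optimizer are assumptions. `embedding_matrix` is the GloVe matrix built by fill_embedding_matrix above:

import tensorflow as tf
from tensorflow.keras import layers, initializers

model = tf.keras.Sequential([
    # GloVe-initialised, frozen embedding layer.
    layers.Embedding(input_dim=embedding_matrix.shape[0], output_dim=100,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     trainable=False),
    layers.LSTM(200, return_sequences=True),   # first LSTM-RNN layer, 200 units
    layers.LSTM(200),                          # second LSTM-RNN layer
    layers.Dense(1, activation="sigmoid"),     # binary: predatory or not
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# Inputs are the padded token sequences from the previous step.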

 


 

Sources 

[1] https://arxiv.org/ftp/arxiv/papers/1712/1712.03903.pdf

[2] https://pan.webis.de/clef12/pan12-web/sexual-predator-identification.html

[3] https://towardsdatascience.com/fasttext-sentiment-analysis-for-tweets-a-straightforward-guide-9a8c070449a2
