**What is a heatmap?**

According to the Oxford dictionary, a heatmap (or heat map) is *“a representation of data in the form of a map or diagram in which data values are represented as colors*”. One of the most effective ways to express a heatmap that can be used on mathematical models is through the use of matrices where each cell represents a square portion of space in a given measuring distance system and the colors represent the intensity of the studied event that happened on each cell mapped.

In this case, the intensity represents the number of crimes women suffered in that area, at a specific time.

In figure 2 we can see an example of a heatmap predicted by a machine learning model plotted on a grid divided into 30 by 30 cells where each cell has a color representing the crime intensity for Delhi on August 13. The more reddish the cell, the riskier the place. The black dots represent the real crimes that occurred on the particular date reported by the **SafeCity **application.

**1. How to use past aggregation maps to have future insights on a specific region?**

It is time for machine learning. Heatmap prediction is one of the many fields studied by people that want to predict human behavior over time.

Heatmap prediction using machine learning is a three-dimensional problem that can also be called spatial-temporal problems as seen in figure 3. It involves a spatial component since a heatmap, in our case, is a two-dimensional matrix and involves a temporal component that varies on time and depends on the granularity that we decided to see it which is used by machine learning. This granularity that aggregates the events on each cell can be expressed in hours, days, months, and so on.

**2. Selecting the right model**

Being a passionate person for convolutional neural networks and machine learning, I decided to search for articles on the use of deep learning for crime forecasting that generates a heatmap.

The first paper found for the study brought the use of **Spatial-Temporal Residual Networks** (ST-ResNet) for crime forecasting. Bao et al. used this technique, as a regression problem to predict the number of crimes hourly on a grid divided map over a region in Los Angeles city where their great contribution was to apply a spatial and temporal regularization technique on the dataset.

This ANN model uses the concept of aggregating heatmaps according to the concept of trends, period, and closeness by using convolutional neural networks where each of these terms defines the temporal distance between the heatmap analysis, convolutional neural networks, and machine learning, and have its own internal architecture inside the model were it learns features related to those distances as we can see in figure 4. Residual convoluted units are used to learn the spatial dimension for each of those maps.

**2.1. A valuable lesson learned — Fail quickly and move forward**

Grabbing a dataset available by Kaggle containing data about crimes in Los Angeles, I tried for two weeks to replicate their study with all the technique recommended by them without results that resembles the ones they showed in their article with that level of perfection. I entered in contact with the authors asking for clarification but just got some vague answers.

Finally, I gave up on this approach when I talked with an author of another article that cites them and he said that even he failed to reproduce their results.

A valuable lesson to share here is that if you are not progressing with an idea, put a deadline to end it and move on to the next one. I searched for related articles that could cast some light on my problem and found alternative approaches.

**2.2. Finding the best-fit model**

Another deep learning model found during the study uses a more complex combination of convolutional neural networks, heatmap analysis, machine learning, and the LSTM model to predict the number and category of crimes called **SFTT **and proposed by Panagiotis et al.

The SFTT model, presented in figure 5, receives as input a sequence of heat maps aggregated in time by classes and outputs a future heat map with predicted hotspots that according to the article definition are places where at least a single crime happened and category probability.

Even though the model showed good results for the article’s authors I couldn’t manage to get good results. In my implementations or the model terribly overfits, or the number of parameters to train was astonishing reaching like 32 million and my machine learning algorithm was not capable of processing that in a reasonable time.

Unfortunately, I was still with the “perfect score” mentality from the St-ResNet article when working on this model and most of the time still getting frustrated with what I was achieving. Only later when I read Udo Schlegel’s master thesis, *Towards Crime Forecasting Using Deep Learning*, where he makes use of encoder-decoder and GAN’s to predict heatmaps showing results that looked more similar to the ones I found in the past, I changed my mind on crime prediction.

Crime predictability, as I would discover near the challenge’s end, is about predicting trends and regions and not lonely criminal minds. You can’t play Minority Report like in Tom Cruise movie in this field.

Even failing at using this model, I would reconsider retaking the study and implementation using it in the future since it can help immensely in the cases we would like to make a clear distinction between predicting ordinary cases and rape against women. In my opinion, the kind of intervention or warning needed in both situations should be treated differently since the impact on the last one can be a life-changer.

Finally, after a lot of study and research, I found an article that fitted the most with the solution I was looking for. Noah et al. proposed Convolutional neural networks, machine learning algorithms,** and the LSTM **model (figure 6) to predict burglary in Sweden by using heatmap analysis. They turned a regression problem into a classification one. Instead of predicting the number of crimes that could potentially occur in the future, they tried to classify each cell with a risky probability.

The model’s input is a sequence of binary maps in time where each cell contains one if at least an incident happened there (burglary in the article’s case) and zero otherwise and outputs a risk probability map for the next period. Since we had little data to work with, it was easier to turn our samples into binary maps than aggregated sum maps like the ones used for regression problems. Being loose in aggregation let the spatial dimension to be explored even further enabling to amplify the resolution of the heatmaps since accumulation was not a necessity anymore but just the simple existence of a single incident inside the debilitated cell square using convolutional neural networks and machine learning algorithms.

##### Build your portfolio with real-world projects from Omdena

**The Solution**

The model that fits best in our case was the Conv + LSTM.

The dataset provided was stored in a spreadsheet containing around 11.000 inputs where each row had a reported incident that happened to a woman on a certain place on the globe with information like incident date, description, latitude, longitude, place, category, and so on.

From this point the following pipeline was adopted in order to build a solution:

- Data Selection
- Data Pre Processing
- Modeling
- Evaluation
- Prediction

**1. Data selection**

Most of the open datasets used for crime studies contain lots of detailed samples reaching numbers that vary from 100.000 to 10.000.000 reports since they are generally provided by the local police department which contains a good computerized system capable of collecting, organizing, and storing this valuable information over a delimited region.

Since it’s not imaginable to build global scale heatmap analysis using machine learning algorithms for crime prediction with a useful resolution we selected the cities which concentrate most of the data, Delhi and Mumbai with approximately 3.000 samples for each of those places.

After the selection, a spatial and temporal boundary was made that could encompass the majority of useful data for training the algorithm.

For the spatial part, an algorithm that selects the best latitude and longitude boundaries based on a threshold number of events and grid division was used. The goal was to cut the number of places with an irrelevant number of occurrences. As in figure 7, the first dataset that encompassed Delhi had a lot of zeros near the latitude and longitude boundaries and after the application of the algorithm, we reduced the space where most of the relevant data were located with minimum loss.

On the temporal dimension, we had data from 2002 until 2019. Grouping them by month it was possible to verify that not all years had a relevant or useful amount of information so the strategy was to select just reports between the concentration range as it is possible to visualize in figure 8.

Once useful spatial and temporal data were selected in the next step we created the heatmaps.

**2. Data pre-processing**

In the data pre-processing step, we defined the spatial and temporal granularity of our aggregated data in order to produce the heatmap analysis using machine learning.

For this challenge, a 32 by 32 grid size was used on a daily basis. These values seem aggressive for a small dataset but as stated early, binary maps allowed more space between cells since they need just one incident on that place and daily granularity was used, even with lots of missing values between days, because with the data augmentation technique used below we still got some good results.

Heatmaps were made using the aggregation technique where first we converted the latitude and longitude from our data samples into a matrix coordinate and after summed all those samples that fell in the same coordinate on the same temporal granularity as demonstrated in figure 9 by using convolutional neural networks.

After creating the heatmaps we still had many missing daily maps that could potentially represent a problem when stacking them into a sequence for the input data for the ConvLSTM model. One solution could be to assume zero-valued heatmaps for these cases but since we already had too many sparse matrices doing this strategy would just contribute to model overfitting towards zero.

The strategy used to fill this gap was to upsample these missing heatmaps using a linear interpolation between the first and next heatmap on the time sequence using the total missing period as the division factor. With this data augmentation method, it was possible to rise the dataset from an amount of 586 binary maps to 1546 daily maps.

After synthetically creating the missing heatmaps a threshold value was arbitrarily chosen to convert them into binary maps (figure 10) since we stated that the model works with risky and non-risky hotspot prediction using heatmaps and convolutional neural networks.

**3. Modeling**

A deep learning model composed of Convolutional Network Networks and the LSTM, as in figure 6, was used as the main algorithm for heatmaps prediction. The model’s input is a sequence of temporal binary maps and the output of the next map of the sequence containing the risk probability of each cell.

Convolutional networks are great for learning the relation between nearby cells in the map while the LSTM networks learn the relation between map sequences in time. Combining them together and you have a powerful algorithm that learns the spatial and temporal properties of our dataset.

The training and target sets were made by grouping a sequence of 16 daily binary maps for input and using the next one as target creating 576 unique sequences. It is important and necessary to state that maps artificially created were skipped when selected as a target since they don’t translate into a real scenario.

For the train and test splitting the decision, the boundary was based on the relevant temporal data between the period of 2013 to 2017 where are samples before the second half of 2016 were selected to train the model and the second semester to validate it.

A total of 100 epochs was used to train the model with a learning rate of 0.0001 in batches of size 8. An amount of 20 % of the training data was used to validate the mode during the process. Adam optimizer was used to correct the learning direction.

Heatmaps are very large sparse matrices (matrix with lots of zeroes), so to equilibrated the loss that naturally towards the easy way meaning learning to predict only zeros we use a weighted cross-entropy function for backpropagation to put more emphasis on missing ones.

It is interesting to mention that on this model we are not encoding the heatmaps in a higher-dimensional space, but squashing them into lower dimension learning the most significant features using the Max Pooling 3D layers after the LSTM that learns the sequence features.

**4. Evaluation**

To evaluate the model we used a separate test set to validate its behavior on unseen data and avoid bias.

Despite the fact that we are dealing with a classification problem we cannot assume an arbitrary threshold like 0.5 and assume that all values above this line are considered risks and all below neutral cells. Our goal is to predict regions that have higher chances of an incident to occur and not exactly the spot that a crime will happen. Besides that, we trained on large sparse matrices so its natural that all values will lean towards zero.

A more practical way is to define a percentile among all cell values and decide that above this threshold we will classify it as risky. So for example, if we define that the risky threshold is above the 96th percentile, we first count the number of cells for each value predicted, group them and take the value that represents the 96th threshold. On numpy there is a function called np.percentile that is useful for this case. On Noah et al. they recommend taking the percentile for the entire test prediction instead of for each sample to average the value between different seasons.

After deciding the percentile value and converting the analog results into a binary map we measure the score against the true map. To not penalize the prediction so hard against missing results that can lead up to adjust the model in a way that we end up with a totally overfitted algorithm we give some score for the predicted cell in neighboring areas around hotspots since we didn’t miss so much.

Instead of the four classification labels, we have six:

The following table represents the labels for each classification:

With this table in mind, we count each of those labels for each predicted cell as seen in figure 12.

For the example above we have the following results depending on the selected percentile value:

The lower the percentile threshold, the higher the number of cells we classify as risky, and the higher is our model accuracy, but is it good?

We have to find the perfect balance between risk classification and model accuracy to get the best end-user experience.

A conservative view with a lower threshold would lead to a high number of correct classifications but at the same time creating the wrong perception for the users that the entire map is a dangerous zone. On the other side if we rise too much the percentile we would barely classify any place as a hotspot and incurring serious error: **FALSE NEGATIVE**.

There is not a single problem that can be worse for a person’s experience than trust on an application that is telling them that they are stepping into a safe zone and end being a victim of a crime. In fact, depending on the proportion it takes in the media it can be the end for the entire system we are building to help people.

So, between all those labels, what are the most important ones to look at? Noah at. al state in the article that **Correct**, **False Negative Neighbor **and **False Negative** should be taken into consideration when doing the model evaluation. The goal is to find the balance between **False Negative Vs. Correct** +** False Negative Neighbor** rate and **Accuracy**.

In figure 13 we can check the model evaluation for percentile that ranges from 85th to 99th threshold. The higher the value the more false negatives we have despite increasing the model’s accuracy. That happens because we evaluate more neutral cells (True Negatives) as correct. Those cells are also important to correctly identify since we want to guide the user on a safe space while avoid indicating too many places as risky while they are not.

**5. The prediction**

There are some guidelines suggested by Noah et al. to make the prediction in a way people have an easy understanding of the generated output. The first way is just to display the cells classified as risky with one color while muting the others while the second way is to define a group of threshold values to classify each cell.

So as a general guideline for model prediction is as following:

- Make a prediction for a single sample or a whole batch.
- Select a percentile to evaluate all cells from the output prediction that will be used as a threshold to classify all values for each cell that falls above as risky and below as neutral or another defining label.
- Convert the analog values into discrete ones using the threshold list from above.

In figure 14 there are two examples using the 96th and 99th threshold as risky values colored in red. The other values are made using lower threshold selections like the 91st, 75th, and 25th percentile.

The two black dots on the red represent the crime that occurred on that particular day and both were captured by the model. It is easy to check that the 99th percentile prediction gives a more accurate risky area but it has at the same time more chance to miss important cells since it aggressively tries to classify just areas where we are almost certain that something can happen. Tenting should be done using different percentiles in order to find the one that best adapts to the user’s needs.

An important problem to state is related to the border effect caused by the padding on the Convolutional Neural Networks and machine learning algorithms. Zero paddings are not the best way to deal with the borders, especially on small images like heatmaps, and they should be ignored when putting on production since they hardly translate into reality.

One way to improve the resolution of the output is to use a bilinear interpolation technique and increase the heatmap size providing a more fine-tuned resolution when presenting to the final user like the one shown in figure 15.

**Final considerations**

There are many solutions out there using deep learning models on crime forecasting using heatmaps and convolutional neural networks. This article tried to contribute to showing an approach that fitted best on the resources that we had for this challenge while telling what and why the other solutions didn’t work.

There is room for improvements such as:

- Testing different spatial dimensions (grid size)
- Testing different temporal dimensions (time granularity) like aggregating heatmaps in weeks, months, and so on.
- Testing different upsampling techniques for filling the gaps when no data is presented.
- Trying to implement another padding technique for removing border effect on the convolutional network.
- Trying to convert this problem from classification to regression and use the other proposed deep learning approaches.