# Applying CNNs To Images For Computer Vision And Text For NLP

Jun 11, 2021
Convolutional Neural Networks (CNNs, ConvNets) have become crucial in artificial intelligence. They are known for capturing spatial and contextual information in matrices (images, sentences) while reducing their dimensions without discarding important information. In this tutorial, we walk through applying CNNs in computer vision and NLP, with the essential code included!

Author: Henry Ndubuaku

## What is Convolution?

Convolution generally refers to the mathematical combination of two functions to produce a third function that expresses how one modifies the other.

A grayscale image is a matrix whose values are the intensities of its pixels.

A coloured image has additional channels (similarly shaped matrices), one per contributing colour, usually Red, Blue, and Green. A 7x7 colour image, for example, would therefore have shape 3x7x7.

Convolution is used to transform an image by applying a filter (a small square matrix) to it, producing a feature map that summarizes where the detected features occur in the input. The filter could be, for example, a rotation or edge-detection matrix whose weights were constructed mathematically.

This is achieved by taking the dot product of the filter weights and each filter-sized patch of the input (the image pixel matrix). The filter is applied systematically to every overlapping patch, left to right, top to bottom.

The distance the filter shifts at each step is called the stride; strides larger than 1 downsample the image. To keep the output the same size as the input, the image is padded with zeros.
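To make these mechanics concrete, here is a minimal NumPy sketch of the sliding dot product with stride and zero padding (the names `convolve2d` and `sobel_x` are our own; this is an illustration, not the Keras implementation):

```python
import numpy as np

def convolve2d(image, kernel, stride=1, pad=0):
    """Slide a square filter over the image, taking the dot product at each patch."""
    if pad:
        image = np.pad(image, pad)  # zero-padding preserves the output size
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)  # dot product of filter and patch
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
sobel_x = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])  # an edge-detection filter
print(convolve2d(image, sobel_x).shape)         # (3, 3): the feature map shrinks
print(convolve2d(image, sobel_x, pad=1).shape)  # (5, 5): padding keeps the input size
```

Note how, without padding, a 3x3 filter over a 5x5 image yields a 3x3 feature map, while one ring of zero padding restores the 5x5 size.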

Multiple filters are used to learn diverse features; applied across every image channel, these filters together form a large, complex feature map. Pooling is an operation used to reduce the dimensions of each matrix in this tensor.

The most common technique is max pooling, which keeps the maximum value in each window, sliding over the input systematically just like in convolution. Average pooling instead takes the mean of each window.
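A small NumPy sketch of both pooling variants (the function `pool2d` is our own, assuming a non-overlapping window, i.e. stride equal to the window size):

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling: stride equals the window size."""
    h, w = feature_map.shape
    # Group the matrix into size-by-size windows, then reduce each window
    patches = feature_map[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size)
    return patches.max(axis=(1, 3)) if mode == "max" else patches.mean(axis=(1, 3))

fmap = np.array([[1., 3., 2., 0.],
                 [4., 6., 1., 2.],
                 [0., 1., 5., 7.],
                 [2., 2., 8., 3.]])
print(pool2d(fmap))                # [[6. 2.] [2. 8.]] - max of each 2x2 window
print(pool2d(fmap, mode="avg"))    # [[3.5 1.25] [1.25 5.75]]
```

Either way, a 4x4 map is reduced to 2x2, halving each spatial dimension.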

## CNN

While these filters can be handcrafted, convolutional neural networks learn the filter weights during training in the context of a specific prediction problem. These learnable weight matrices, updated during back-propagation, are what give a model its predictive power and are called parameters.

### MNIST Classification Example

The MNIST images are ready-to-use images of handwritten digits. The goal is to build a model that distinguishes between the 10 digits.

Step 1: Import files and load dataset

```
from tensorflow import keras
from tensorflow.keras import datasets, layers, models
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np

(train_images, train_labels), (test_images, test_labels) = datasets.mnist.load_data()
```

Step 2: Normalize the images by dividing each pixel by 255 (the maximum pixel value) so every value lies between 0 and 1. This speeds up and stabilises training.

`train_images, test_images = train_images / 255.0, test_images / 255.0`

Step 3: Expand the dimensions of the images to add a channel axis, giving each image the shape (28, 28, 1) that the convolutional layer expects

```
train_images = np.expand_dims(train_images, axis=3)
test_images = np.expand_dims(test_images, axis=3)
```

Step 4: One-hot encode the labels, representing each label as a vector with length equal to the number of classes: the value at the correct index is 1 while the rest are 0.

```
test_labels = to_categorical(test_labels)
train_labels = to_categorical(train_labels)
```
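For intuition, here is a NumPy sketch of what one-hot encoding produces (the helper `one_hot` is our own, mirroring `to_categorical` for integer labels):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Each label becomes a vector with a 1 at its index and 0 elsewhere."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

print(one_hot(np.array([3, 0, 9]), 10))
# row 0 has its 1 at index 3, row 1 at index 0, row 2 at index 9
```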

Step 5: Check the structure of the images

```
print("x_train shape:", train_images.shape)
print(train_images.shape[0], "train samples")
print(test_images.shape[0], "test samples")
```

Output:

```
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
```

Step 6: Assign variables

```
num_classes = 10
input_shape = (28, 28, 1)
num_filters = 8
filter_size = 3
pool_size = 2
```

Step 7: Create the model

```
model = keras.Sequential([
    layers.Conv2D(num_filters, filter_size, input_shape=input_shape),
    layers.MaxPooling2D(pool_size=pool_size),
    layers.Flatten(),
    layers.Dense(num_classes, activation='softmax'),
])
```

Because this is a classification task, the output of the pooling layer is flattened into a single vector and passed through a fully connected layer with one unit per class and a softmax activation, which converts the scores into class probabilities.
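As a sketch of that last step, softmax turns the dense layer's raw scores (logits) into probabilities that sum to 1 (the logit values below are made up for illustration):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shifted = logits - logits.max()
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical dense-layer scores for 3 classes
probs = softmax(logits)
print(round(probs.sum(), 6))  # 1.0
print(probs.argmax())         # 0: the class with the highest score
```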

Step 8: Compile and train

```
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images, train_labels, batch_size=128, epochs=2, validation_split=0.1)
```

Step 9: Evaluate

```
score = model.evaluate(test_images, test_labels, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])
```

Note that `evaluate` returns a list containing the loss followed by each metric, so the loss and accuracy are indexed separately.

## CNN in Natural Language Processing

Now consider a text classification task, where the context of each word (its surrounding words) has been shown to help distinguish meaning (the purpose of n-grams). In computer vision tasks, CNN filters slide over patches of an image, whereas in NLP tasks they slide over the sentence matrix a few words at a time, producing outputs that combine words with their contexts like n-grams, only more compactly and efficiently.

After tokenizing and representing each sentence with integer values, we get a dataset like:

```
[[1, 2, 3, 5, 1],
 [4, 2, 5, 6, 6],
 [6, 2, 6, 7, 7],
 ...
 [1, 4, 6, 7, 7]]
```

Where each row is a sentence and each value a word (pad shorter sentences if needed). A simple Keras architecture would start with something like this:

```
model.add(Embedding(vocab_size, desired_embed_dim, input_length=256))
```

In the embedding layer, each word is represented with a fixed-length vector, transforming each sentence into a matrix like:

```
[[0.345, 0.33556, 0.235],
 [0.345, 0.33556, 0.235],
 [0.345, 0.33556, 0.235],
 [0.345, 0.33556, 0.235],
 [0.345, 0.33556, 0.235]]
```

Where the values are determined by word similarity. The convolution is then carried out on this matrix. Alternatively, the embeddings can be extracted from a pre-trained model and formed into a matrix: think word2vec, GloVe, ELMo, USE, BERT, etc.
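To see how a filter spanning a few words slides over such a sentence matrix, here is a minimal NumPy sketch of one 3-word (trigram-like) filter with stride 1 (the embeddings and filter weights are random placeholders, not learned values):

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = rng.normal(size=(7, 4))      # 7 words, each a 4-dimensional embedding
filter_3gram = rng.normal(size=(3, 4))  # one filter spanning 3 consecutive words

# Slide the filter down the sentence one word at a time (stride 1),
# taking the dot product with each 3-word window
features = np.array([
    np.sum(sentence[i:i + 3] * filter_3gram)
    for i in range(sentence.shape[0] - 3 + 1)
])
print(features.shape)  # (5,): one value per window, 7 - 3 + 1 windows
```

Each output value summarizes a window of three consecutive words, which is exactly the n-gram-like behaviour described above; a real layer would learn many such filters.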

When using CNNs for text tasks, it is important to be wary of preprocessing: techniques like stop-word removal and lemmatization alter the sentence, so the convolution could be carried out over the wrong word contexts.

It is often assumed that, unlike with images, CNNs are only suitable for text classification. But imagine convolving a word matrix where each character is represented as an embedding vector: each word could then be classified (think POS tagging and named-entity recognition). The biggest advantage of CNNs over recurrent networks is their ability to capture contextual information over very long sequences (such as articles) efficiently, along with their swift convergence. Only transformers outperform CNNs in this regard, thanks to their attention mechanism.

Using CNNs for text generation remains challenging, but CNNs and RNN variants can be creatively combined, leveraging each method's strengths to solve problems across domains (computer vision, NLP, robotics, and more).

### Conclusion

We learnt how to extract features from an image using convolution, reduce their dimensions by subsampling (pooling), flatten them to facilitate classification with a softmax-activated dense layer, and apply the same principles to a document matrix for text classification.
