Applying CNNs To Images For Computer Vision And Text For NLP
June 11, 2021
Convolutional Neural Networks (CNNs, ConvNets) have become crucial in artificial intelligence. These networks are renowned for capturing spatial and dependency information in matrices (images, sentences) while effectively reducing their dimensions without discarding important information. In this tutorial, we walk you through the methods of applying CNNs in computer vision and NLP, with the essential code included!
What is CONVOLUTION?
Convolution generally refers to the mathematical combination of two functions to produce a third function; the output encodes how one function modifies, or relates to, the other.
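To make this concrete, here is a minimal NumPy sketch of a discrete 1-D convolution combining two short sequences into a third (in the strict signal-processing sense convolution flips one input; deep learning libraries actually compute cross-correlation, but the sliding-window idea is the same):

import numpy as np

signal = np.array([1, 2, 3, 4])
kernel = np.array([1, 0, -1])
# Slide the (flipped) kernel over the signal, summing the products at each step
print(np.convolve(signal, kernel, mode="valid"))  # [2 2]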
Now a grayscale image is a matrix whose values are the intensities of each pixel in a display.
A colored image has more channels (similarly shaped matrices), one for each contributing color, usually Red, Green, and Blue. A 7x7 color image, for example, would have shape 3x7x7 (channels x height x width).
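A quick shape check in NumPy makes the layout concrete (channel ordering is just a convention; Keras defaults to channels-last, i.e. height x width x channels):

import numpy as np

gray = np.zeros((7, 7))       # one matrix of pixel intensities
rgb = np.zeros((3, 7, 7))     # three stacked channel matrices (channels-first)
print(gray.shape, rgb.shape)  # (7, 7) (3, 7, 7)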
Convolution is utilised to transform an image by applying a filter (a square matrix) to it, creating a feature map that summarizes the presence of detected features in the input. The filter could be, for example, a rotation or edge-detection matrix whose values/weights were mathematically constructed.
This is achieved by the dot product of the filter weights/values and the input (image pixel matrix). The filter is applied systematically to each overlapping part or filter-sized patch of the input data, left to right, top to bottom.
The distance the filter shifts at each step is called the stride; larger strides downsample the image more aggressively. To keep the output the same size as the input, the image can be padded with zeros around its border.
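Here is a minimal NumPy sketch of this sliding dot product, with the stride and zero padding described above (the vertical edge-detection kernel is just an illustrative hand-crafted filter):

import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    # Apply the filter to each overlapping patch, left to right, top to bottom
    if padding:
        image = np.pad(image, padding)  # zero-pad the border
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])
print(conv2d(image, edge_filter).shape)             # (3, 3)
print(conv2d(image, edge_filter, padding=1).shape)  # (5, 5): input size preserved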
Multiple filters are used to learn diverse features; applied across every image channel, these filters produce a large and complex stack of feature maps. Pooling is an operation used to reduce the dimensions of each matrix in this tensor.
The most common technique is Max Pooling, where the maximum value within each window is kept. The window is applied systematically, just like in convolution. Average pooling takes the mean of the window instead.
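And a matching pooling sketch over non-overlapping windows, assuming the same NumPy setup as above:

def pool2d(feature_map, size=2, mode="max"):
    # Keep one summary value (max or mean) per size x size window
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            window = feature_map[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[1., 3., 2., 0.],
                 [4., 6., 1., 2.],
                 [0., 1., 5., 7.],
                 [2., 3., 8., 6.]])
print(pool2d(fmap))                  # max pooling: [[6. 2.] [3. 8.]]
print(pool2d(fmap, mode="average"))  # average pooling: [[3.5 1.25] [1.5 6.5]]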
CNN
While these filters can be handcrafted, convolutional neural networks learn the filter values/weights during training in the context of a specific prediction problem. These learnable weight matrices, updated during back-propagation, are the parameters that give the model its predictive power.
MNIST Classification Example
The MNIST dataset contains ready-to-use 28x28 grayscale images of handwritten digits. The goal is to build a model that distinguishes between the 10 digits.
Step 1: Import files and load dataset
from tensorflow import keras
from tensorflow.keras import datasets, layers, models
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np

(train_images, train_labels), (test_images, test_labels) = datasets.mnist.load_data()
Step 2: Normalize the images by dividing each pixel by 255 (the maximum pixel value) so that all values lie between 0 and 1. Inputs on a common small scale speed up and stabilize training.
train_images, test_images = train_images / 255.0, test_images / 255.0
Step 3: Expand the dimensions of the images to add an explicit channel axis, since Conv2D expects inputs of shape (height, width, channels).
train_images = np.expand_dims(train_images, axis=3)
test_images = np.expand_dims(test_images, axis=3)
Step 4: One-hot encode the labels, so that each label is represented by a vector with length equal to the number of classes: the value at the correct class index is 1 while the rest are zero.
test_labels = to_categorical(test_labels)
train_labels = to_categorical(train_labels)
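To see the encoding concretely, a label such as 3 becomes a 10-element vector with a 1 at index 3:

print(to_categorical(3, num_classes=10))
# [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]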
Step 5: Check the structure of the images
print("x_train shape:", train_images.shape) print(train_images.shape[0], "train samples") print(test_images.shape[0], "test samples")
Output :
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Step 6: Assign variables
num_classes = 10
input_shape = (28, 28, 1)
num_filters = 8
filter_size = 3
pool_size = 2
Step 7: Create the model
model = keras.Sequential([
    layers.Conv2D(num_filters, filter_size, input_shape=input_shape),
    layers.MaxPooling2D(pool_size=pool_size),
    layers.Flatten(),
    layers.Dense(num_classes, activation='softmax'),
])
Because this is a classification task, the output of the pooling layer is flattened into a single vector and passed through a fully connected (Dense) layer with units equal to the number of classes and a softmax activation, which turns the outputs into class probabilities.
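With the settings above, the shapes work out as follows: the 3x3 convolution with no padding shrinks each 28x28 image to 26x26 (28 - 3 + 1 = 26) with 8 feature maps, 2x2 max pooling halves that to 13x13, and flattening yields 13 * 13 * 8 = 1352 features feeding the 10-unit softmax layer. You can confirm this with model.summary(), whose output shapes should look roughly like:

model.summary()
# conv2d         (None, 26, 26, 8)
# max_pooling2d  (None, 13, 13, 8)
# flatten        (None, 1352)
# dense          (None, 10)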
Step 8: Compile and train
model.compile('adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images, train_labels, batch_size=128, epochs=2, validation_split=0.1)
Step 9: Evaluate
score = model.evaluate(test_images, test_labels, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])
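As a quick illustrative sanity check, you can also compare predicted and true digits for a few test samples:

predictions = model.predict(test_images[:5])
print(np.argmax(predictions, axis=1))      # predicted digits
print(np.argmax(test_labels[:5], axis=1))  # true digits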
CNN in Natural Language Processing
Now consider a text classification task, where the context of each word (its surrounding words) helps distinguish meaning; capturing this context is the purpose of n-grams. In computer vision tasks, the filters slide over patches of an image, whereas in NLP tasks they slide over the sentence matrices a few words at a time, producing outputs that combine words with their contexts like n-grams, only more compact and efficient.
After tokenizing and representing each word with an integer index, you get a dataset like
[[1,2,3,5,1],
[4,2,5,6,6],
[6,2,6,7,7]
.
.
[1,4,6,7,7]]
Where each vector is a sentence and each value a word (pad shorter sentences if needed), a simple Keras architecture would look something like this:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size, desired_embed_dim, n_units = 10000, 128, 3  # vocabulary size, vector length, filter width

model4 = Sequential()
model4.add(Embedding(vocab_size, desired_embed_dim, input_length=256))
model4.add(Conv1D(100, n_units, activation='relu'))
model4.add(GlobalMaxPooling1D())
model4.add(Dense(1, activation='sigmoid'))
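Here the Conv1D kernel size plays the role of the n-gram width: a filter spanning three word vectors learns trigram-like patterns. Since the final layer is a single sigmoid unit, this sketch assumes a binary classification task, so you would compile it with a binary cross-entropy loss, e.g. model4.compile('adam', loss='binary_crossentropy', metrics=['accuracy']).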
In the embedding layer, each word would be represented with a fixed-length vector, transforming each sentence into a matrix like:
[0.345, 0.33556, 0.235]
[0.345, 0.33556, 0.235]
[0.345, 0.33556, 0.235]
[0.345, 0.33556, 0.235]
[0.345, 0.33556, 0.235]
Where the values (illustrative here) are determined by word similarity. The convolution is then carried out on this matrix. Alternatively, the embeddings can be extracted from a pre-trained model and assembled into a matrix: think word2vec, GloVe, ELMo, USE, BERT, etc.
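As a hedged sketch of that alternative: assuming you have assembled a hypothetical embedding_matrix of shape (vocab_size, desired_embed_dim) by looking up each vocabulary word in, say, a GloVe file, you would swap in a frozen Embedding layer:

embedding_layer = Embedding(vocab_size,
                            desired_embed_dim,
                            weights=[embedding_matrix],  # hypothetical pre-trained vectors
                            input_length=256,
                            trainable=False)             # freeze; set True to fine-tune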
While using CNNs for text tasks, be wary of preprocessing: techniques like stop-word removal and lemmatization alter the sentence, so the convolution could be carried out over the wrong word contexts.
It is often assumed that CNNs are only suited to classifying whole texts, unlike in images. But imagine convolving a word matrix in which each character is represented by an embedding vector: individual words could then be classified (think POS tagging and named entity recognition). A major advantage of CNNs over recurrent networks is their ability to capture contextual information over very long sequences (like articles) efficiently, along with their swift convergence; among popular architectures, only Transformers, with their attention mechanism, outperform them in this regard.
Using CNNs for text generation remains challenging. However, CNNs and RNN variants can be creatively combined, leveraging each method's strengths to solve problems across domains (computer vision, NLP, robotics, and more).
Conclusion
We learned how to extract features from an image using convolution, reduce the dimensions by subsampling (pooling), and flatten the result so that a softmax-activated dense layer can classify based on the extracted features. We also applied the same principles to a document matrix for text classification.