Enhancing Drone Imagery with Super-Resolution Using Deep Learning

Mar 5, 2022

This tutorial walks you through applying deep-learning-based super-resolution (SR) to images, a technique that can be adopted for remote sensing, precision farming, or even video surveillance. The article focuses on building an SR model using the DIV2K dataset.

Authors: Samuel Theophilus and Farida Kamal

Source: https://unsplash.com/photos/HeqXGxnsnX4 (Unsplash)


This article takes you through the steps needed to convert low-resolution images to high-resolution images. The approach can be used to enhance the quality of spatial images captured by remote sensors (satellites, aircraft, and drones). Omdena’s team wanted to improve the quality of drone images so that weed and crop detection models receive input images containing more detail about the landmass. This work was part of the project “Detecting Weeds & Crops Using Drone Imagery To Reduce Environmentally Devastating Herbicides Usage”.

Problem statement

The challenge was designed to use artificial intelligence to manage and reduce the use of herbicides on farmlands. The continuous use of herbicides on farms can cause soil and water contamination. The aim of this project was to develop ML models that are able to identify weed species, identify different crops, and be used in precision farming to boost crop yields for farmers.

Our role in the project was to use super-resolution to enhance the quality of spatial images captured with drones, transforming low-resolution images into high-resolution images so that downstream tasks such as crop/weed detection and segmentation produce better results.

Traditional methods such as bicubic, nearest-neighbor, and bilinear interpolation have been used in the past to enhance image quality, but these techniques are not effective at generating high-quality images: their outputs are usually blurry and contain noise, hence the need for better methods. In this project, we used supervised machine learning to build the super-resolution model, which takes low-resolution (LR) images as input and generates the corresponding high-resolution (HR) images as output.

Super-resolution pipeline

The pipeline begins with the creation of a data set of tiled images, sliced at 512 x 512 pixels from the large orthophotos (aerial photographs). These tiles serve as the HR images; they are then pre-processed (degraded and down-scaled) to generate the corresponding LR images, and the resulting pairs are split into train, test, and validation sets for model training. The trained models are then evaluated using SSIM, PSNR, and MSE scores.
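As a rough illustration of the tiling step, the sketch below computes the crop boxes for 512 x 512 tiles of a large image. The function name `tile_boxes` and the decision to drop partial tiles at the right and bottom edges are assumptions made for this example, not details taken from the project.

```python
def tile_boxes(width, height, tile=512):
    """Return (left, upper, right, lower) boxes of all full tiles in an image."""
    boxes = []
    for upper in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            boxes.append((left, upper, left + tile, upper + tile))
    return boxes


# Hypothetical orthophoto of 1300 x 1100 pixels.
boxes = tile_boxes(width=1300, height=1100)
print(len(boxes))            # number of full 512 x 512 tiles
print(boxes[0], boxes[-1])   # first and last tile
```

Each box uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so the boxes could be passed to it directly to cut the tiles.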

Super-resolution using SRGAN

In this tutorial, we give an introduction to super-resolution with the Super-Resolution Generative Adversarial Network (SRGAN), trained on the DIV2K data set.

The paper “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network” introduced a generative adversarial network for super-resolution. Instead of optimizing only a pixel-wise similarity measure (which maximizes the peak signal-to-noise ratio but tends to give overly smooth results), it trains the generator with a perceptual loss. The generator uses skip connections to generate high-resolution (HR) images from low-resolution (LR) images. The discriminator tries to tell real HR images apart from fake (computer-generated) HR images.
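For reference, the perceptual loss in the SRGAN paper is, as far as we recall its notation, a weighted sum of a content loss and an adversarial loss. Here G is the generator, D the discriminator, and the sum runs over the N training samples:

```latex
l^{SR} = \underbrace{l^{SR}_{X}}_{\text{content loss}}
       + 10^{-3}\,\underbrace{l^{SR}_{Gen}}_{\text{adversarial loss}},
\qquad
l^{SR}_{Gen} = \sum_{n=1}^{N} -\log D\!\left(G\!\left(I^{LR}\right)\right)
```

The content loss compares the generated and real HR images (pixel-wise or in a feature space), while the adversarial term becomes small when the discriminator rates the generated image as real.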

Step 1

You need to install the following libraries:

  • python==3.6
  • cudatoolkit==10.1
  • tqdm==4.32.1
  • Pillow==6.0.0
  • matplotlib==3.1.1
  • tensorflow==2.3.1
  • tensorflow-addons==0.11.2

Now import the libraries in your notebook cell:

import time
from PIL import Image
import numpy as np
import tensorflow as tf
import os
import matplotlib.pyplot as plt
from data import DIV2K
from model.srgan import generator, discriminator
from train import SrganTrainer, SrganGeneratorTrainer
%matplotlib inline

Step 2

Clone the GitHub repository “https://github.com/krasserm/super-resolution.git” and connect your Google Drive account to Colab so that you can import the custom libraries and functions needed to train the SR model.

Connect to Google Drive using the code below. You are required to authenticate your Google account in order to access it from Colab:

from google.colab import drive
drive.mount('/content/drive')  # usual Colab mount point; adjust if needed

Then clone the repo:

!git clone https://github.com/krasserm/super-resolution.git

Keep in mind the path to the cloned repository as this will be helpful when importing functions. Review this notebook for more information.

Step 2.1: Loading the Dataset

Now that the libraries are installed, the next step is to load the data set. DIV2K (DIVerse 2K resolution) is an openly available data set of RGB images that is also listed in the TensorFlow Datasets catalog. Its high-resolution images were used for the NTIRE 2017 and 2018 challenges on Single Image Super-Resolution (SISR). DIV2K has 1125 labels, also called picture types; these include Grass, Plant, Tree, Landscape, and the like. To explore the data set further, see Know Your Data (KYD), which gives a detailed breakdown of the types of images available in DIV2K. The DIV2K data set provides different sets of LR images based on scale.

For the SRGAN model, images are sampled from the DIV2K data set according to the following criteria:

  • Scale = 2, 3, 4, or 8.
  • Downgrade operator with one of the values “bicubic”, “unknown”, “mild”, or “difficult”.

The `data.py` file contains a class `DIV2K` which automatically downloads the images for Training and Validation from the DIV2K data set. Manual download of the data set is not required for this example.

In the code below, we use scale=2 and downgrade='bicubic' with the 'train' subset of the DIV2K data set. The class raises a ValueError if the scale, downgrade, or subset values are out of scope. The implementation is shown below.

class DIV2K:
    def __init__(self,
                 scale=2,
                 subset='train',
                 downgrade='bicubic',
                 images_dir='.div2k/images',   # assumed default; adjust to your setup
                 caches_dir='.div2k/caches'):  # assumed default; adjust to your setup

        self._ntire_2018 = True

        _scales = [2, 3, 4, 8]

        # Validate the requested scale.
        if scale in _scales:
            self.scale = scale
        else:
            raise ValueError(f'scale must be in {_scales}')

        # 'train' covers image ids 1-800, 'valid' covers 801-900.
        if subset == 'train':
            self.image_ids = range(1, 801)
        elif subset == 'valid':
            self.image_ids = range(801, 901)
        else:
            raise ValueError("subset must be 'train' or 'valid'")

        _downgrades_a = ['bicubic', 'unknown']
        _downgrades_b = ['mild', 'difficult']

        # Only certain scale/downgrade combinations exist in DIV2K.
        if scale == 8 and downgrade != 'bicubic':
            raise ValueError('scale 8 only allowed for bicubic downgrade')

        if downgrade in _downgrades_b and scale != 4:
            raise ValueError(f'{downgrade} downgrade requires scale 4')

        if downgrade == 'bicubic' and scale == 8:
            self.downgrade = 'x8'
        elif downgrade in _downgrades_b:
            self.downgrade = downgrade
        else:
            self.downgrade = downgrade
            self._ntire_2018 = False

        self.subset = subset
        self.images_dir = images_dir
        self.caches_dir = caches_dir

        os.makedirs(images_dir, exist_ok=True)
        os.makedirs(caches_dir, exist_ok=True)
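To see at a glance which scale and downgrade combinations the checks above accept, here is a small stand-alone sketch that mirrors them without downloading anything. The helper `is_valid_combination` is hypothetical and not part of the repository; it also rejects downgrade names outside the four listed values, which is slightly stricter than the class itself.

```python
SCALES = (2, 3, 4, 8)
DOWNGRADES = ("bicubic", "unknown", "mild", "difficult")


def is_valid_combination(scale, downgrade):
    """Mirror the scale/downgrade checks of DIV2K.__init__ (illustrative only)."""
    if scale not in SCALES or downgrade not in DOWNGRADES:
        return False
    if scale == 8 and downgrade != "bicubic":
        return False
    if downgrade in ("mild", "difficult") and scale != 4:
        return False
    return True


valid = [(s, d) for d in DOWNGRADES for s in SCALES if is_valid_combination(s, d)]
for scale, downgrade in valid:
    print(f"scale={scale}, downgrade={downgrade!r}")
```

Only "bicubic" is available at every scale; "mild" and "difficult" exist at scale 4 only.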

Step 2.2: Image Augmentation

Image transformations ensure that the training images have the dimensions the SRGAN model expects and add variety to the training data. The images are randomly cropped, flipped horizontally (left to right), and rotated by a random multiple of 90 degrees. SRGAN is an up-scaling model: it takes a small image as input and produces a larger one. The model is therefore trained on pairs of down-sampled LR patches and their HR counterparts. In our example, each HR patch is cropped to 96 x 96 pixels, and the matching LR patch is smaller by the scale factor. The implementation code is shown below.

def random_crop(lr_img, hr_img, hr_crop_size=96, scale=2):
    # Size of the LR window that corresponds to the HR crop.
    lr_crop_size = hr_crop_size // scale
    lr_img_shape = tf.shape(lr_img)[:2]

    # Random top-left corner of the LR window.
    lr_w = tf.random.uniform(shape=(), maxval=lr_img_shape[1] - lr_crop_size + 1, dtype=tf.int32)
    lr_h = tf.random.uniform(shape=(), maxval=lr_img_shape[0] - lr_crop_size + 1, dtype=tf.int32)

    # The same corner in HR coordinates.
    hr_w = lr_w * scale
    hr_h = lr_h * scale

    lr_img_cropped = lr_img[lr_h:lr_h + lr_crop_size, lr_w:lr_w + lr_crop_size]
    hr_img_cropped = hr_img[hr_h:hr_h + hr_crop_size, hr_w:hr_w + hr_crop_size]

    return lr_img_cropped, hr_img_cropped

def random_flip(lr_img, hr_img):
    # Flip both images together with probability 0.5.
    rn = tf.random.uniform(shape=(), maxval=1)
    return tf.cond(rn < 0.5,
                   lambda: (lr_img, hr_img),
                   lambda: (tf.image.flip_left_right(lr_img),
                            tf.image.flip_left_right(hr_img)))

def random_rotate(lr_img, hr_img):
    # Rotate both images by the same random multiple of 90 degrees.
    rn = tf.random.uniform(shape=(), maxval=4, dtype=tf.int32)
    return tf.image.rot90(lr_img, rn), tf.image.rot90(hr_img, rn)
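The key point of the paired crop is that the LR window and the HR window must cover the same scene content. The plain-Python sketch below reproduces only the index arithmetic; the function `paired_crop_boxes` and the example image size are assumptions for illustration.

```python
import random


def paired_crop_boxes(lr_height, lr_width, hr_crop_size=96, scale=2, rng=random):
    """Pick a random LR window and return it with the matching HR window.

    Boxes are (top, left, bottom, right) in pixel coordinates.
    """
    lr_crop_size = hr_crop_size // scale
    top = rng.randint(0, lr_height - lr_crop_size)   # randint bounds are inclusive
    left = rng.randint(0, lr_width - lr_crop_size)
    lr_box = (top, left, top + lr_crop_size, left + lr_crop_size)
    hr_box = tuple(v * scale for v in lr_box)        # same region, HR coordinates
    return lr_box, hr_box


# Hypothetical LR image of 200 x 300 pixels; seeded for repeatability.
lr_box, hr_box = paired_crop_boxes(lr_height=200, lr_width=300, rng=random.Random(0))
print(lr_box, hr_box)
```

Because every HR coordinate is the LR coordinate multiplied by the scale, the two patches stay aligned no matter where the window lands.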

Step 3

Define the SRGAN model:

Before training, let us define the model.

The SRGAN architecture is implemented in the `srgan.py` file.

SRGAN is a combination of two Convolutional Neural Networks (CNNs):

  • Generator Model (SRResnet)
  • Discriminator Model (SRGAN)

Generator Model (SRResnet):

The main task of the generator model is to produce super-resolution images.

Input Layer:

SRResnet is the generator model used for creating an HR image. The architecture diagram shows the design of the generator model. It starts with an Input Layer where the images are fed into the model.

Conv2D layer:

The first layer is a convolution layer with 64 filters (num_filters=64), kernel size = 9, and padding = ‘same’. The num_res_blocks=16 argument sets the number of residual blocks that follow.

In a typical CNN, the spatial dimensions of the image decrease (down-sampling) as pooling layers are added. For super-resolution, however, you need to increase the dimensions of the image (up-scaling). The model therefore uses zero-padding with a stride of 1 in the convolution layers, and the pooling layers are removed.
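The effect of padding and stride on the spatial size can be checked with a few lines of arithmetic. This is a minimal sketch of the usual output-size rules for ‘same’ and ‘valid’ padding; the helper name `conv_output_size` is made up for this example.

```python
import math


def conv_output_size(size, kernel, stride=1, padding="same"):
    """Spatial output size of a 2D convolution along one axis."""
    if padding == "same":
        return math.ceil(size / stride)
    if padding == "valid":
        return (size - kernel) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


print(conv_output_size(96, kernel=9, stride=1, padding="same"))   # 96
print(conv_output_size(96, kernel=9, stride=1, padding="valid"))  # 88
print(conv_output_size(96, kernel=3, stride=2, padding="same"))   # 48
```

With ‘same’ padding and a stride of 1 the size is preserved, which is why the generator can stack many convolutions without shrinking the image.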

Batch Normalization (BN):

The Batch Normalization layer standardizes the output of the preceding convolution layer. BN layers are inserted between layers to stabilize the learning process of the model. For the SRResnet model, a BN layer with a momentum constant of 0.8 is placed after the convolution layer.

PReLU – Parametric Rectified Linear Unit activation function:

The second layer is a PReLU, a variant of the rectified linear unit (ReLU). ReLU activations are popular because they make training on complex data sets fast. ReLU is piecewise linear: it passes positive inputs through unchanged and sets negative inputs to zero. PReLU instead multiplies negative inputs by a slope parameter that is learned during training, which gives the Parametric Rectified Linear Unit its name.
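The difference between the two activations fits in a few lines. In this sketch `alpha` is passed in as a plain argument, whereas in a real PReLU layer it is a weight learned during training; the value 0.25 is only an example.

```python
def relu(x):
    """Standard ReLU: negative inputs become zero."""
    return x if x > 0 else 0.0


def prelu(x, alpha):
    """PReLU: negative inputs are scaled by alpha instead of being zeroed."""
    return x if x > 0 else alpha * x


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), prelu(x, alpha=0.25))
```

A Leaky ReLU has the same form, but with a fixed slope chosen in advance instead of a learned one.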

The output of PReLU is fed into the next set of layers, referred to as residual blocks. The same block structure is repeated 16 times, so there are 16 residual blocks in the model architecture.

Each residual block consists of a Convolution layer (Conv), a Batch Normalization layer (BN), PReLU, Conv, BN, and an Elementwise Sum.

Elementwise Sum = output of the block’s second BN + input of the block (for the first block, the output of the PReLU from the previous step).

Additional Layers:

The output from the last residual block is fed into a Conv layer, a BN layer, and an Elementwise Sum.

Elementwise Sum = output of this BN layer + output of the initial PReLU (skip connection).

Up-sampling Layers:

There are 2 up-sampling blocks that increase the dimensions of the image. (Up to this point the feature maps still have the same height and width as the LR input, because every convolution uses ‘same’ padding with a stride of 1.)

Each up-sampling block consists of a Conv layer, a pixel shuffler (x2), and PReLU. The pixel shuffler rearranges channels into spatial positions, doubling the height and width. The two up-sampling blocks together therefore enlarge the image by a factor of 4 in each dimension; for example, a 128 x 128 input becomes 256 x 256 after the first block and 512 x 512 after the second.
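To make the pixel-shuffle step concrete, here is a minimal NumPy sketch for a single channels-last feature map. It is meant to match the behaviour of TensorFlow's `tf.nn.depth_to_space`, but the function `pixel_shuffle` below is our own illustration, and the 24 x 24 x 256 input is just an example shape.

```python
import numpy as np


def pixel_shuffle(x, scale=2):
    """Rearrange an (H, W, C * scale**2) array into (H * scale, W * scale, C)."""
    h, w, c_in = x.shape
    c_out = c_in // (scale * scale)
    x = x.reshape(h, w, scale, scale, c_out)
    x = x.transpose(0, 2, 1, 3, 4)             # -> (h, scale, w, scale, c_out)
    return x.reshape(h * scale, w * scale, c_out)


features = np.zeros((24, 24, 256))
print(pixel_shuffle(features).shape)           # (48, 48, 64)

# One pixel with four channels becomes a 2 x 2 block with one channel.
tiny = np.arange(4).reshape(1, 1, 4)
print(pixel_shuffle(tiny)[:, :, 0])
```

This is why the Conv layer in each up-sampling block produces four times as many channels: the extra channels are turned into the extra pixels.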

Super-resolution Layer:

This layer consists of a Conv layer that takes the output of the up-sampling blocks and produces the super-resolution image. The activation function used is ‘tanh’, and a final denormalization step maps its output range of [-1, 1] back to pixel values to produce the super-resolution output image.

The code below is a definition of the Generator Model. For more information, visit the link to the SRGAN implementation on GitHub.

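from tensorflow.keras.layers import Add, BatchNormalization, Conv2D, Input, Lambda, PReLU
from tensorflow.keras.models import Model

# res_block, upsample, normalize_01 and denormalize_m11 are helpers from the cloned repository.

def sr_resnet(num_filters=64, num_res_blocks=16):
    x_in = Input(shape=(None, None, 3))
    x = Lambda(normalize_01)(x_in)

    # Initial convolution + PReLU; x_1 is kept for the long skip connection.
    x = Conv2D(num_filters, kernel_size=9, padding='same')(x)
    x = x_1 = PReLU(shared_axes=[1, 2])(x)

    # 16 residual blocks.
    for _ in range(num_res_blocks):
        x = res_block(x, num_filters)

    x = Conv2D(num_filters, kernel_size=3, padding='same')(x)
    x = BatchNormalization()(x)
    x = Add()([x_1, x])

    # Two x2 up-sampling blocks.
    x = upsample(x, num_filters * 4)
    x = upsample(x, num_filters * 4)

    # Final convolution with tanh, then map back to pixel values.
    x = Conv2D(3, kernel_size=9, padding='same', activation='tanh')(x)
    x = Lambda(denormalize_m11)(x)

    return Model(x_in, x)

generator = sr_resnet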

Discriminator Model (SRGAN):

The discriminator is the second half of the Generative Adversarial Network (GAN). A GAN has two models (generator and discriminator) that compete against each other. The discriminator is an image classification model that identifies whether the images fed into it are real or fake. Real images are the original high-resolution images; fake images are the generated super-resolution images. The generator and discriminator are trained together until the discriminator can no longer differentiate between the original and the generated images. In this process, the generator learns to create super-resolution images that are similar to the original images.

Input Layer:

Two kinds of images are fed into the model.

  1. Original high-resolution image

These are images from the DIV2K data set.

  2. Generated super-resolution image

These are images generated by the SRResnet generator from the previous section.


The first layer in the discriminator is a convolution layer with 64 filters, kernel size = 3, strides = 1, and padding = ‘same’. Like the first layer of the SRResnet generator, it has 64 filters and uses ‘same’ padding. The difference lies in the data: the generator only ever receives LR images, whereas the discriminator is shown both real HR images and generated SR images during training.

Leaky ReLU:
Leaky ReLU is a variant of the ReLU activation function. Instead of setting negative inputs to zero, it multiplies them by a small constant slope, so the function is piecewise linear. The Leaky ReLU constant (alpha) is 0.2 for the discriminator model. The discriminator also avoids max-pooling throughout the network; strided convolutions reduce the resolution instead.

Convolutional Blocks:

The output of Leaky ReLU is fed into a set of layers that are called Convolutional blocks.

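Each convolutional block consists of a Conv, a BN, and a Leaky ReLU layer. The code calls `discriminator_block` eight times in total: the first call (without BN) is the Conv and Leaky ReLU pair described above, and the remaining seven calls include BN. Along the way the number of filters grows from 64 to 512, and every second block uses a stride of 2, which halves the height and width. The output of the last block therefore has 512 feature maps at a reduced spatial resolution.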
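The shapes flowing through the discriminator can be traced by hand. The sketch below assumes an HR input of 96 x 96 pixels (the patch size used in this tutorial) and reads the filter counts and strides off the eight `discriminator_block` calls in the model definition.

```python
import math

hr_size = 96  # assumed HR patch size
# (filters, stride) for the eight discriminator_block calls, in order
blocks = [(64, 1), (64, 2), (128, 1), (128, 2),
          (256, 1), (256, 2), (512, 1), (512, 2)]

size = hr_size
shapes = []
for filters, stride in blocks:
    size = math.ceil(size / stride)   # 'same' padding: only the stride shrinks the map
    shapes.append((size, size, filters))

for shape in shapes:
    print(shape)

flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print("flattened features:", flattened)
```

Under these assumptions the last block outputs a 6 x 6 map with 512 channels, which is flattened before the dense layers.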

Dense(1024), Leaky ReLU,Dense(1):

The output of the convolutional blocks is flattened and fed into a fully connected (Dense) layer with 1024 neurons, followed by a Leaky ReLU with constant 0.2 and a final Dense layer with a single neuron and sigmoid activation, which gives the probability used to classify the image as real or fake.
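As a rough sketch of how such a sigmoid output is typically turned into a training signal, the snippet below computes a binary cross-entropy loss for one real and one generated image. The label convention (1 for real, 0 for generated) and the function names are assumptions for illustration, not taken from the repository.

```python
import math


def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def discriminator_loss(real_logit, fake_logit):
    """Binary cross-entropy with target 1 for the real image and 0 for the generated one."""
    p_real = sigmoid(real_logit)
    p_fake = sigmoid(fake_logit)
    return -math.log(p_real) - math.log(1.0 - p_fake)


undecided = discriminator_loss(real_logit=0.0, fake_logit=0.0)
confident = discriminator_loss(real_logit=4.0, fake_logit=-4.0)
print(undecided, confident)
```

The loss is lower when the discriminator assigns a high probability to the real image and a low probability to the generated one.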

The code below is a definition of the Discriminator Model. For more information, visit the link to the SRGAN implementation on GitHub.

from tensorflow.keras.layers import BatchNormalization, Conv2D, Dense, Flatten, Input, Lambda, LeakyReLU
from tensorflow.keras.models import Model

# normalize_m11 and HR_SIZE (the HR patch size) come from the cloned repository.

def discriminator_block(x_in, num_filters, strides=1, batchnorm=True, momentum=0.8):
    # Conv -> (optional) BN -> Leaky ReLU
    x = Conv2D(num_filters, kernel_size=3, strides=strides, padding='same')(x_in)
    if batchnorm:
        x = BatchNormalization(momentum=momentum)(x)
    return LeakyReLU(alpha=0.2)(x)

def discriminator(num_filters=64):
    x_in = Input(shape=(HR_SIZE, HR_SIZE, 3))
    x = Lambda(normalize_m11)(x_in)

    # Filters double every two blocks; every second block halves the resolution.
    x = discriminator_block(x, num_filters, batchnorm=False)
    x = discriminator_block(x, num_filters, strides=2)

    x = discriminator_block(x, num_filters * 2)
    x = discriminator_block(x, num_filters * 2, strides=2)

    x = discriminator_block(x, num_filters * 4)
    x = discriminator_block(x, num_filters * 4, strides=2)

    x = discriminator_block(x, num_filters * 8)
    x = discriminator_block(x, num_filters * 8, strides=2)

    # Classification head.
    x = Flatten()(x)
    x = Dense(1024)(x)
    x = LeakyReLU(alpha=0.2)(x)
    x = Dense(1, activation='sigmoid')(x)

    return Model(x_in, x)

Step 4

Train Model:

The next step is to train the SRGAN model on the training data set and then use the test data set to generate predictions.

The ‘example-SRGAN.ipynb’ file contains all the function calls for the model definition detailed in Step 3.

The SRResnet model is trained using the SrganGeneratorTrainer class. The SRResnet network is trained for 1 million (1,000,000) steps, evaluated every 1,000 steps, and its weights are saved in the “pre_generator.h5” file.

In the GAN stage, the two models are combined and made to compete against each other (as explained in Step 3). The SRResnet model, initialized with the ‘pre_generator.h5’ weights, and the discriminator model are trained together to produce the super-resolution output images. The generator is rewarded when the discriminator classifies its generated (fake) image as real. The weights for SRGAN (SRResnet + discriminator) are stored in ‘gan_generator.h5’.

Train Pre Trainer:

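pre_trainer = SrganGeneratorTrainer(model=generator(), checkpoint_dir='.ckpt/pre_generator')
# The pre_trainer.train(...) call and the saving of the weights to 'pre_generator.h5'
# are not reproduced here; see 'example-SRGAN.ipynb' in the repository.

Train GAN:

gan_generator = generator()
# Load the pre-trained 'pre_generator.h5' weights into gan_generator at this point.

gan_trainer = SrganTrainer(generator=gan_generator, discriminator=discriminator())
gan_trainer.train(train_ds, steps=200000)
# Afterwards, save the generator weights to 'gan_generator.h5'.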


Step 5

Metrics Evaluation (Code and Score):


The Peak Signal-to-Noise Ratio (PSNR) is an image evaluation metric that calculates the ratio between the maximum possible power of an image signal and the power of the corrupting noise that affects the quality of its representation. PSNR is commonly used to measure the reconstruction quality of lossy compression. It is expressed in decibels (dB), and higher values indicate a reconstruction closer to the target image. The function below returns 100 as a fixed cap when the two images are identical (where PSNR would otherwise be infinite); values drawing close to zero indicate poor performance.
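Written out, with MAX the largest possible pixel value (255 for 8-bit images) and N the number of pixel values compared, the standard definitions are:

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(I^{HR}_i - I^{SR}_i\right)^2,
\qquad
\mathrm{PSNR} = 10\,\log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)
             = 20\,\log_{10}\!\left(\frac{\mathrm{MAX}}{\sqrt{\mathrm{MSE}}}\right)
```

As a worked example, an MSE of 100 on 8-bit images gives 20 · log10(255 / 10) ≈ 28.13 dB.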

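The code below is a function implementation of the PSNR metric using the NumPy and math Python libraries.

from math import log10, sqrt

def PSNR(gen_image, tar_image):
    # Drop the leading batch dimension and work in float to avoid uint8 overflow.
    gen_converted_img = np.squeeze(gen_image, axis=0).astype(np.float64)
    tar_converted_img = np.squeeze(tar_image, axis=0).astype(np.float64)

    mse = np.mean((tar_converted_img - gen_converted_img) ** 2)
    if mse == 0:
        # Identical images: PSNR would be infinite, so return a fixed cap.
        return 100

    max_pixel = 255.0
    psnr = 20 * log10(max_pixel / sqrt(mse))
    return psnr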

The trained SRGAN model had a PSNR score of 29.125967.

Evaluate visual results:

Using the ‘gan_generator.h5’ weights, call the ‘resolve_and_plot’ function from the ‘common.py’ file to view the output images. The images shown below are:

  • Low resolution (LR),
  • Pre-trained super-resolution (SR (PRE)), produced by the Pre-Trainer model, and
  • GAN super-resolution (SR (GAN)), produced by the GAN model.

RESULTS: Visual evaluation of PRE-Train vs GAN Model predictions

The image below shows a final comparison between the Low-resolution image, GAN generated image, and actual High-Resolution image:

RESULTS: Visual evaluation of SRGAN Model prediction


In this article, we provided a step-by-step guide to enhancing image quality using an SRGAN super-resolution model. We started by walking you through a practical application of super-resolution in precision farming and showing you the project pipeline. We then ran an introductory SR tutorial using the DIV2K data set. There are many deep learning methods that can be used to build high-performing super-resolution models. We hope this article has given you a solid start, and we would love to see you build better-performing models.


One of the major limitations you will face is the need for more computing power and RAM. Due to the size of the input and output images (128 x 128, 512 x 512) and the number of layers used in the super-resolution model, a high-RAM runtime is required to avoid out-of-memory crashes during model training (this was a limitation during the SkyMaps project).

Also, the original SRGAN networks in the paper were trained on an NVIDIA Tesla M40 GPU using a random sample of 350 thousand images from the ImageNet database. Training on lower computation specifications could therefore lead to a longer training time.

