Cross-Domain Deep Learning Model for Detecting Deep Fake Images
February 7, 2022
In this article, I describe how our team created a cross-domain deep learning network model to detect deep fake images.
The project partner: The use case stems from DuckDuckGoose AI, which hosted an Omdena Challenge as part of Omdena's AI Incubator for impact startups.
Improving Deep Fake Algorithm Challenge
Deepfake technology is a relatively novel technique for creating or manipulating images or videos. The rapid evolution of deep learning techniques has resulted in a wide range of possibilities for creating such material. The negative effect of this technology is that it is now easier than ever to disseminate disinformation, spread revenge porn, commit financial fraud, stage hoaxes, or disrupt government functioning. It is therefore of paramount importance to be able to detect deepfake manipulation in footage.
Introduction
The Deep Fake cross-domain detection model aims to leverage the spatial as well as the frequency features of an image to distinguish real images from fake ones. We will use 2 different models to detect deep fakes: the first is a frequency domain model (Model A) and the second is a spatial domain model (Model B). We will then combine them to create a cross-domain model and train it. Model performance is evaluated on test data from the same source as the training data. We also evaluate the models on an out-of-distribution dataset, i.e. test data drawn from different sources than those used for training the deep fake image detector. We will refer to these test sets as test set 1 and test set 2 respectively in this article.
The cross-domain detection model will be trained on 2 GPUs using the TensorFlow and Keras deep learning libraries. The training set consists of 70,000 images (25,820 real / 44,180 fake), of which 80% will be used for training and 20% for validation. Test sets 1 and 2 contain 70,000 images (27,431 real / 42,569 fake) and 35,000 images (18,552 real / 16,448 fake) respectively. We will create the model by first defining the TensorFlow input pipeline for Models A and B. We will train Models A and B separately and then together.
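Because the classes are not balanced, it helps to know what accuracy a trivial classifier would reach on each set. The sketch below is plain Python and uses only the image counts stated above; the function name `majority_baseline` is my own and not part of the project code.

```python
# Image counts (real, fake) as stated in the article.
datasets = {
    "train": (25_820, 44_180),
    "test_set_1": (27_431, 42_569),
    "test_set_2": (18_552, 16_448),
}

def majority_baseline(real, fake):
    """Accuracy obtained by always predicting the more frequent class."""
    return max(real, fake) / (real + fake)

for name, (real, fake) in datasets.items():
    total = real + fake
    print(f"{name}: {total} images, majority baseline {majority_baseline(real, fake):.2%}")
```

Test set 2 is the only set in which real images outnumber fake ones, so a detector that leans towards "fake" starts from a lower baseline there.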
Cross-domain model step by step (with code)
TensorFlow Input Pipeline
We define the TensorFlow input pipeline for the model by first parsing the input images and resizing them. After parsing, either the discrete cosine transform (DCT-II) or image augmentation is applied, depending on the model used. At the end, we create batches of input images and apply the standard prefetch and cache functions.
# Parse the Image
import tensorflow as tf

def parse_function(filename, label):
    IMG_SHAPE = 224
    image_string = tf.io.read_file(filename)
    image = tf.image.decode_png(image_string, channels=3)
    # This will convert to float values in [0, 1]
    image = tf.image.convert_image_dtype(image, tf.float32)
    resized_image = tf.image.resize(image, [IMG_SHAPE, IMG_SHAPE])
    return resized_image, label
Frequency Domain Model (Model A)
The frequency domain model is based on performing frequency analysis on the input images. We use the DCT to perform frequency analysis on the images and then use a simple neural network classifier to detect whether an image is real or fake. The DCT is a technique applied to image pixels in the spatial domain in order to transform them into the frequency domain.
# Applying Discrete Cosine Transformation - II on the images
def dct_preprocess(image, label):
    # (H, W, C) -> (C, W, H): tf.signal.dct works on the last axis
    img_t = tf.transpose(image, perm=[2, 1, 0])
    X1 = tf.signal.dct(img_t, type=2, norm="ortho")
    # (C, W, H) -> (C, H, W): transform along the other spatial axis
    X1_t = tf.transpose(X1, perm=[0, 2, 1])
    X2 = tf.signal.dct(X1_t, type=2, norm="ortho")
    # back to (H, W, C)
    array_X2 = tf.transpose(X2, perm=[1, 2, 0])
    # converting dct coefficients into log scale
    epsilon = 1e-12
    array_X2_log = tf.math.log(tf.math.abs(array_X2) + epsilon)
    return array_X2_log, label
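To make the two-pass structure of the preprocessing step easier to follow, here is a minimal pure-Python sketch of the same idea on a tiny single-channel image: an orthonormal DCT-II applied to every row and then to every column, followed by the log scaling. It is an illustration only (the helper names are mine) and is far too slow for real images, where `tf.signal.dct` does the work.

```python
import math

def dct2_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        total = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out

def dct2_2d(image):
    """2-D DCT-II: transform every row, then every column of the result."""
    rows = [dct2_1d(row) for row in image]
    cols = [dct2_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major

def log_scale(coeffs, epsilon=1e-12):
    """Log of the coefficient magnitudes; epsilon avoids log(0)."""
    return [[math.log(abs(c) + epsilon) for c in row] for row in coeffs]

# A perfectly flat 4x4 image has all of its energy in the DC coefficient.
flat = [[1.0] * 4 for _ in range(4)]
coeffs = dct2_2d(flat)
log_coeffs = log_scale(coeffs)
```

For the flat image, only `coeffs[0][0]` is non-zero (up to floating-point noise), which shows why the log scale is useful: it compresses the large gap between the DC term and the much smaller high-frequency coefficients.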
The complete input pipeline for training and validation images is compiled as per the code below.
# Input pipeline for training and validation dataset
train_ratio = 0.80
# take() and skip() expect an integer element count
train_size = int(ds_size * train_ratio)
train_dataset = all_train_dataset.take(train_size)
valid_dataset = all_train_dataset.skip(train_size)
batch_size = 40

train_dataset = train_dataset.map(parse_function, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
train_dataset = train_dataset.map(dct_preprocess, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
train_dataset = train_dataset.batch(batch_size)
# cache the prepared batches, then prefetch so input preparation overlaps with training
train_dataset = train_dataset.cache().prefetch(tf.data.AUTOTUNE)

valid_dataset = valid_dataset.map(parse_function, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
valid_dataset = valid_dataset.map(dct_preprocess, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
valid_dataset = valid_dataset.batch(batch_size)
valid_dataset = valid_dataset.cache().prefetch(tf.data.AUTOTUNE)
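As a quick sanity check on the split, the arithmetic can be worked out without TensorFlow. The sketch assumes `ds_size` equals the 70,000 training images mentioned in the introduction; the variable names `train_steps` and `valid_steps` are mine.

```python
import math

ds_size = 70_000      # assumed: the training set size given in the introduction
train_ratio = 0.80
batch_size = 40

train_size = int(ds_size * train_ratio)   # number of elements kept by take()
valid_size = ds_size - train_size         # number of elements left after skip()

# Batches per epoch; ceil accounts for a possibly smaller final batch.
train_steps = math.ceil(train_size / batch_size)
valid_steps = math.ceil(valid_size / batch_size)

print(train_size, valid_size, train_steps, valid_steps)
```

Under these assumptions one epoch covers 56,000 training images in 1,400 batches, and validation covers 14,000 images in 350 batches.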
After the input pre-processing is defined, we create a simple CNN-based model as defined below with convolutional layers and a final output layer with sigmoid activation. We compile the model by defining the binary cross-entropy loss and an Adam optimiser. We train the model for 100 epochs and define early stopping criteria. The frequency model is based on the Leveraging Frequency Analysis for Deep Fake Image Recognition paper.
# Simple CNN Model (using DCT-II pre-processing)
from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization,
                                     AveragePooling2D, Flatten, Dropout, Dense)
from tensorflow.keras.models import Model as KerasModel

IMG_SHAPE = 224
x = Input(shape=(IMG_SHAPE, IMG_SHAPE, 3))
x1 = Conv2D(3, 3, padding="same", activation="relu")(x)
x1 = BatchNormalization()(x1)
x2 = Conv2D(8, 3, padding="same", activation="relu")(x1)
x2 = BatchNormalization()(x2)
x2 = AveragePooling2D()(x2)  # 224 -> 112
x3 = Conv2D(16, 3, padding="same", activation="relu")(x2)
x3 = BatchNormalization()(x3)
x3 = AveragePooling2D()(x3)  # 112 -> 56
x4 = Conv2D(32, 3, padding="same", activation="relu")(x3)
x4 = BatchNormalization()(x4)
y = Flatten()(x4)
y = Dropout(0.5)(y)
y = Dense(1, activation='sigmoid')(y)
model = KerasModel(inputs=x, outputs=y)
model.compile(loss='binary_crossentropy', optimizer="adam", metrics=['accuracy'])
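To get a feel for how small this network is, the feature-map size and parameter count can be traced by hand. The sketch below is plain Python with helper names of my own; it assumes the Keras defaults (a 2x2 window for `AveragePooling2D`, and four parameters per channel in `BatchNormalization`, two of which are non-trainable moving statistics).

```python
def conv2d_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output channel."""
    return (kernel * kernel * in_channels + 1) * out_channels

def batchnorm_params(channels):
    """gamma, beta, moving mean and moving variance per channel."""
    return 4 * channels

size, channels, total = 224, 3, 0
# (filters, followed_by_pooling) for the four convolution blocks of Model A
for filters, pooled in [(3, False), (8, True), (16, True), (32, False)]:
    total += conv2d_params(3, channels, filters) + batchnorm_params(filters)
    channels = filters
    if pooled:
        size //= 2  # 2x2 average pooling halves height and width

flat_features = size * size * channels
total += flat_features + 1  # Dense(1): one weight per feature plus a bias

print(size, flat_features, total)
```

Almost all of the parameters sit in the final dense layer, because the 56x56x32 feature map is flattened directly into the classifier.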
The model is trained and at the end of 47 epochs, we have 96.7% training and 95.4% validation accuracy.
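Training ends before the 100-epoch budget because of early stopping. The callback is not shown for Model A, so the following is only a sketch of the usual patience rule (the cross-domain code later in the article uses `patience=10` on `val_loss`). The validation losses in the example are hypothetical values chosen for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without
    a new best validation loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses, for illustration only.
losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.63]
stop, best = early_stopping_epoch(losses, patience=3)
print(stop, best)
```

With restoring of the best weights enabled, the model that is finally evaluated corresponds to `best`, not to the epoch at which training stopped.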
Now we measure the model performance on test set 1 and test set 2 as below:
# Performance of Frequency Based Deep Fake Detector Model
# Accuracy on test set 1
700/700 [==============================] - 2821s 4s/step - loss: 0.7314 - accuracy: 0.8050
Test accuracy: 80.50%...
# Accuracy on test set 2
350/350 [==============================] - 1371s 4s/step - loss: 1.7464 - accuracy: 0.6209
Test accuracy: 62.09%...
Spatial Domain Model (Model B)
For the Spatial Domain model, we do not apply the DCT pre-processing. Instead we apply image augmentation to the parsed input images as below.
# Image Augmentation
import tensorflow_addons as tfa

def train_preprocess(image, label):
    IMG_SHAPE = 224
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, max_delta=32.0 / 255.0)
    image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
    # random gaussian filter
    if tf.random.uniform(shape=[], minval=0.0, maxval=1.0) < 0.5:
        image = tfa.image.gaussian_filter2d(image)
    # random invert image
    if tf.random.uniform([]) < 0.5:
        image = (1 - image)
    # random crop (central crop, resized back to the input size)
    if tf.random.uniform([]) < 0.5:
        image = tf.image.resize(tf.image.central_crop(image, central_fraction=0.5), [IMG_SHAPE, IMG_SHAPE])
    # random rotate
    if tf.random.uniform([]) < 0.5:
        image = tf.image.rot90(image)
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label
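The pattern used above is a chain of independent coin flips: each transformation is applied with probability 0.5 to the result of the previous step. The following toy sketch shows that pattern on a 2x2 nested list instead of a tensor; the helper names and the reduced set of transformations are my own simplification.

```python
import random

def invert(image):
    return [[1.0 - v for v in row] for row in image]

def flip_left_right(image):
    return [row[::-1] for row in image]

def rot90(image):
    # 90 degrees counter-clockwise: transpose, then reverse the row order
    return [list(row) for row in zip(*image)][::-1]

def augment(image, rng, p=0.5):
    """Apply each transform with probability p, always to the running result."""
    for transform in (flip_left_right, invert, rot90):
        if rng.random() < p:
            image = transform(image)
    return image

image = [[0.0, 0.25], [0.5, 1.0]]
always = augment(image, random.Random(0), p=1.0)  # every transform fires
never = augment(image, random.Random(0), p=0.0)   # no transform fires
print(always, never)
```

With three gated transforms there are 2^3 = 8 possible combinations per image; the pipeline above has four gated steps plus the always-on random flips, brightness and saturation changes.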
The complete input pipeline for training and validation images is compiled as per the code below.
# training and validation dataset
train_ratio = 0.80
# take() and skip() expect an integer element count
train_size = int(ds_size * train_ratio)
train_dataset = all_train_dataset.take(train_size)
valid_dataset = all_train_dataset.skip(train_size)
batch_size = 40

train_dataset = train_dataset.map(parse_function, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
train_dataset = train_dataset.map(train_preprocess, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
train_dataset = train_dataset.batch(batch_size)
# note: cache() placed after the augmentation stores the augmented batches of the first epoch
train_dataset = train_dataset.cache().prefetch(tf.data.AUTOTUNE)

valid_dataset = valid_dataset.map(parse_function, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
valid_dataset = valid_dataset.batch(batch_size)
valid_dataset = valid_dataset.cache().prefetch(tf.data.AUTOTUNE)
The deep learning model for the spatial domain is based on the MesoInception network from the research paper MesoNet: a Compact Facial Video Forgery Detection Network.
# Spatial Domain Model (Meso-inceptionnet)
from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization, MaxPooling2D,
                                     Concatenate, Flatten, Dropout, Dense, LeakyReLU)
from tensorflow.keras.models import Model as KerasModel

# define inception layer
def InceptionLayer(a, b, c, d):
    def func(x):
        # branch 1: 1x1 convolution
        x1 = Conv2D(a, (1, 1), padding='same', activation='relu')(x)
        # branch 2: 1x1 followed by 3x3 convolution
        x2 = Conv2D(b, (1, 1), padding='same', activation='relu')(x)
        x2 = Conv2D(b, (3, 3), padding='same', activation='relu')(x2)
        # branch 3: 1x1 followed by 3x3 convolution with dilation 2
        x3 = Conv2D(c, (1, 1), padding='same', activation='relu')(x)
        x3 = Conv2D(c, (3, 3), dilation_rate=2, strides=1, padding='same', activation='relu')(x3)
        # branch 4: 1x1 followed by 3x3 convolution with dilation 3
        x4 = Conv2D(d, (1, 1), padding='same', activation='relu')(x)
        x4 = Conv2D(d, (3, 3), dilation_rate=3, strides=1, padding='same', activation='relu')(x4)
        y = Concatenate(axis=-1)([x1, x2, x3, x4])
        return y
    return func

# meso - inception net model
IMG_SHAPE = 224
x = Input(shape=(IMG_SHAPE, IMG_SHAPE, 3))
x1 = InceptionLayer(1, 4, 4, 2)(x)
x1 = BatchNormalization()(x1)
x1 = MaxPooling2D(pool_size=(2, 2), padding='same')(x1)
x2 = InceptionLayer(2, 4, 4, 2)(x1)
x2 = BatchNormalization()(x2)
x2 = MaxPooling2D(pool_size=(2, 2), padding='same')(x2)
x3 = Conv2D(16, (5, 5), padding='same', activation='relu')(x2)
x3 = BatchNormalization()(x3)
x3 = MaxPooling2D(pool_size=(2, 2), padding='same')(x3)
x4 = Conv2D(16, (5, 5), padding='same', activation='relu')(x3)
x4 = BatchNormalization()(x4)
x4 = MaxPooling2D(pool_size=(4, 4), padding='same')(x4)
y = Flatten()(x4)
y = Dropout(0.5)(y)
y = Dense(16)(y)
y = LeakyReLU(alpha=0.1)(y)
y = Dropout(0.5)(y)
y = Dense(1, activation='sigmoid')(y)
model = KerasModel(inputs=x, outputs=y)
model.compile(loss='mean_squared_error', optimizer="adam", metrics=['accuracy'])
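The shapes flowing through this network can be traced by hand. The sketch below is plain Python with helper names of my own; it relies on two facts about the layers used above: an inception layer concatenates its four branches along the channel axis, and a max-pooling layer with `padding='same'` and the default stride (equal to the pool size) produces `ceil(size / pool)` outputs per spatial dimension.

```python
import math

def inception_channels(a, b, c, d):
    """Channels after concatenating the four branches."""
    return a + b + c + d

def pool_same(size, pool):
    """Spatial size after pooling with padding='same' and stride == pool."""
    return math.ceil(size / pool)

size = 224
trace = []
# (output channels, pool size) for the four blocks of the model above
blocks = [
    (inception_channels(1, 4, 4, 2), 2),
    (inception_channels(2, 4, 4, 2), 2),
    (16, 2),
    (16, 4),
]
for channels, pool in blocks:
    size = pool_same(size, pool)
    trace.append((size, size, channels))

flat_features = size * size * channels
dense16_params = flat_features * 16 + 16   # Dense(16): weights plus biases
print(trace, flat_features, dense16_params)
```

The flattened feature vector has only 784 entries, which is why the spatial model's dense head is far smaller than the one in Model A.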
Similar to the frequency domain model, we train the spatial domain model for 100 epochs and define early stopping criteria.
At the end of 58 epochs, we have 82% training accuracy and 83% validation accuracy.
Performance of the above model on test set 1 and test set 2 is as below:
# Performance of Spatial Domain Deep Fake Detector Model
# Accuracy on test set 1
700/700 [==============================] - 253s 357ms/step - loss: 0.1408 - accuracy: 0.8049
Test accuracy: 80.49%..
# Accuracy on test set 2
350/350 [==============================] - 80s 228ms/step - loss: 0.2274 - accuracy: 0.6349
Test accuracy: 63.49%...
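Note that the loss values reported for Model A and Model B are not directly comparable: Model A is compiled with binary cross-entropy, while Model B is compiled with mean squared error. The sketch below computes both losses on the same predictions to show how different the scales are. The labels and predictions are hypothetical values chosen for illustration.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels and sigmoid outputs; the last prediction is confidently wrong.
y_true = [1, 0, 1, 0]
y_pred = [0.9, 0.2, 0.6, 0.99]

bce = binary_cross_entropy(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
accuracy = sum((p >= 0.5) == bool(t) for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"BCE={bce:.4f}  MSE={mse:.4f}  accuracy={accuracy:.2f}")
```

For outputs in [0, 1] the mean squared error can never exceed 1, whereas cross-entropy grows without bound as a wrong prediction becomes more confident. Accuracy is therefore the fairer basis for comparing the two models.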
Cross Domain Model (Model A & B combined)
We can now combine Models A and B into a single cross-domain model, so that the model combines the spatial and frequency domain features and uses them to classify images as fake or real. The model is inspired by the research paper Deepfake Detection Method Based on Cross-Domain Fusion.
The input pipeline needs to be slightly modified, because the same image now feeds 2 different branches of the network. The input parsing of the image is the same, but the augmentation and DCT transformation steps are modified as below so that 2 images come out of the processing step.
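The core idea of the fusion is simple: the feature vectors of the two branches are concatenated and a classifier is trained on the joint vector. The sketch below illustrates this with plain Python lists. It is a deliberate simplification (a single sigmoid unit instead of the dense head used in the article), and the feature values and weights are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fuse_and_classify(spatial_features, frequency_features, weights, bias=0.0):
    """Concatenate both feature vectors and apply one sigmoid unit."""
    fused = list(spatial_features) + list(frequency_features)
    if len(fused) != len(weights):
        raise ValueError("need exactly one weight per fused feature")
    z = sum(w * f for w, f in zip(weights, fused)) + bias
    return fused, sigmoid(z)

# Hypothetical features and weights, for illustration only.
fused, prob = fuse_and_classify([0.2, 0.8], [1.0], weights=[0.5, -0.25, 1.0])
print(fused, round(prob, 4))
```

Because the classifier sees both sets of features at once, it can learn to weight spatial cues against frequency cues per image instead of averaging two separate predictions.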
@tf.function
def train_preprocess(image, label):
    IMG_SHAPE = 224
    # augment a copy (image1) and keep the original image for the DCT branch
    image1 = tf.image.random_flip_left_right(image)
    image1 = tf.image.random_flip_up_down(image1)
    image1 = tf.image.random_brightness(image1, max_delta=32.0 / 255.0)
    image1 = tf.image.random_saturation(image1, lower=0.5, upper=1.5)
    # random gaussian filter
    if tf.random.uniform(shape=[], minval=0.0, maxval=1.0) < 0.5:
        image1 = tfa.image.gaussian_filter2d(image1)
    # random invert image
    if tf.random.uniform([]) < 0.5:
        image1 = (1 - image1)
    # random crop (central crop, resized back to the input size)
    if tf.random.uniform([]) < 0.5:
        image1 = tf.image.resize(tf.image.central_crop(image1, central_fraction=0.5), [IMG_SHAPE, IMG_SHAPE])
    # random rotate
    if tf.random.uniform([]) < 0.5:
        image1 = tf.image.rot90(image1)
    image1 = tf.clip_by_value(image1, 0.0, 1.0)
    return image1, image, label

@tf.function
def dct_preprocess(image1, image, label):
    # 2-D DCT-II of the original (non-augmented) image
    img_t = tf.transpose(image, perm=[2, 1, 0])
    X1 = tf.signal.dct(img_t, type=2, norm="ortho")
    X1_t = tf.transpose(X1, perm=[0, 2, 1])
    X2 = tf.signal.dct(X1_t, type=2, norm="ortho")
    array_X2 = tf.transpose(X2, perm=[1, 2, 0])
    epsilon = 1e-12
    array_X2_log = tf.math.log(tf.math.abs(array_X2) + epsilon)
    # two model inputs: augmented image and log-scaled DCT coefficients
    return (image1, array_X2_log), label
The input training pipeline is as below. We parse the input first, then apply image augmentation to the first copy of the image. The DCT-II transformation is then applied to the second, unaugmented copy.
# training and validation dataset
train_ratio = 0.80
# take() and skip() expect an integer element count
train_size = int(ds_size * train_ratio)
train_dataset = all_train_dataset.take(train_size)
valid_dataset = all_train_dataset.skip(train_size)
batch_size = 40

train_dataset = train_dataset.map(parse_function, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
train_dataset = train_dataset.map(train_preprocess, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
train_dataset = train_dataset.map(dct_preprocess, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
train_dataset = train_dataset.batch(batch_size)
train_dataset = train_dataset.cache().prefetch(tf.data.AUTOTUNE)
The architecture of the cross-domain model concatenates the output of the spatial domain model (y1) and the frequency domain model (y2) into a single tensor y, as shown in the code block below. It is then followed by a dense layer and an output layer.
## add the model layers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# x and u are the input tensors of the two branches; y1 and y2 are the branch outputs
y = tf.keras.layers.Concatenate()([y1, y2])
y = Dropout(0.5)(y)
y = Dense(64)(y)
y = LeakyReLU(alpha=0.1)(y)
y = Dropout(0.5)(y)
y = Dense(1, activation='sigmoid')(y)
model = KerasModel(inputs=[x, u], outputs=y)

# early stopping to monitor the validation loss and avoid overfitting
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10, restore_best_weights=True)
# reducing learning rate on plateau
rlrop = ReduceLROnPlateau(monitor='val_loss', mode='min', patience=5, factor=0.5, min_lr=1e-6, verbose=1)
model.compile(loss='binary_crossentropy', optimizer="adam", metrics=['accuracy'])
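To see what the learning-rate callback configured above can do at most, the sketch below replays its rule in plain Python: every reduction multiplies the rate by `factor=0.5`, and the rate never drops below `min_lr=1e-6`. The starting value of 0.001 is an assumption based on the Keras default for the Adam optimiser, since the code passes `optimizer="adam"` without an explicit rate; the function name is mine.

```python
def reduce_on_plateau(lr, factor=0.5, min_lr=1e-6):
    """One learning-rate reduction, clipped at the configured floor."""
    return max(lr * factor, min_lr)

lr = 1e-3  # assumed: Keras default learning rate for Adam
schedule = [lr]
while schedule[-1] > 1e-6:
    schedule.append(reduce_on_plateau(schedule[-1]))

print(len(schedule) - 1, schedule[-1])
```

Under these assumptions the floor is reached after 10 reductions. Each reduction only happens after 5 epochs without improvement in validation loss.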
At the end of 46 epochs, we have 98% training and 97% validation accuracy. Accuracy on test set 1 increases by about 5 percentage points (from roughly 80.5% for the individual models to 85.46%), but accuracy on test set 2 remains essentially the same at 63.78%. This may be due to a lack of variation in the training images, or to test set 2 belonging to a completely different distribution than the training set. Potential future work may include exploring transformers, cross-attention, Haar transforms on images, etc.
# Performance of Cross Domain Deep Fake Detector Model
# Accuracy on test set 1
700/700 [==============================] - 2498s 4s/step - loss: 0.6864 - accuracy: 0.8546
Test accuracy: 85.46%...
# Accuracy on test set 2
350/350 [==============================] - 1231s 4s/step - loss: 1.6561 - accuracy: 0.6378
Test accuracy: 63.78%...
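The comparison between the three models can be summarised with a few lines of Python. The accuracies are the ones reported in the evaluation logs in this article; the dictionary layout and the function name are my own.

```python
# Test accuracies (%) as reported in the evaluation logs.
test_accuracy = {
    "frequency (A)": {"test set 1": 80.50, "test set 2": 62.09},
    "spatial (B)": {"test set 1": 80.49, "test set 2": 63.49},
    "cross-domain": {"test set 1": 85.46, "test set 2": 63.78},
}

def gain_over_best_single(test_set):
    """Percentage-point gain of the cross-domain model over the better single model."""
    best_single = max(
        test_accuracy[name][test_set] for name in ("frequency (A)", "spatial (B)")
    )
    return test_accuracy["cross-domain"][test_set] - best_single

for test_set in ("test set 1", "test set 2"):
    print(f"{test_set}: {gain_over_best_single(test_set):+.2f} percentage points")
```

The gain is close to 5 percentage points on the in-distribution test set 1 and well under 1 percentage point on the out-of-distribution test set 2.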
Conclusion
In this article, we combined the spatial and frequency domain features of an image to create a cross-domain network that distinguishes fake from real images. The cross-domain model improved accuracy on test set 1 by about 5 percentage points compared to the individual spatial and frequency domain models, while accuracy on the out-of-distribution test set 2 stayed roughly unchanged. A cross-domain network of this kind can also be built from other components, for example a vision transformer as the spatial network combined with a different frequency analysis function such as the Haar transform. Hope you found this article an interesting read. Thanks to Omdena for providing an opportunity to work on the deep fake image detection challenge.
—
This article is written by Sanjana Tule.