In September 2019, the Reynolds Journalism Institute (RJI) in Columbia, Missouri, opened a deepfake verification competition. The RJI Student Innovation Competition challenge was to create a program, tool or prototype for photo, video or audio verification.
RJI 3rd Place Winner:
On February 8, our team, FakeLab, won third place at the RJI Student Innovation Competition.
What are deepfakes?
Deepfakes are images, videos or voices that have been manipulated through the use of sophisticated machine-learning algorithms to make it almost impossible to differentiate between what is real and what isn’t.
Four popular techniques for video deepfake creation:
Face2Face: facial reenactment; transfers expressions from a source video to a target video using a model-based approach.
FaceSwap: facial identity manipulation; a graphics-based approach that, for each frame, uses landmarks to create a 3D model of the source face, then projects it onto the target by minimizing the distance between landmarks.
Deepfakes: facial identity manipulation; first uses face detection to crop the face, then trains two autoencoders with a shared encoder, one for the source and one for the target. To produce a deepfake, the target face is run through the shared encoder and the source decoder, and the output is blended into the image using Poisson image editing.
NeuralTextures: facial reenactment using GANs.
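The Deepfakes technique in the list above is commonly described as two autoencoders that share one encoder. The following is a minimal PyTorch sketch of that idea only. The class name `SharedEncoderAutoencoders`, the 64x64 input size and all layer sizes are illustrative assumptions, not the configuration of any particular deepfake tool, and the Poisson blending step is left out.

```python
import torch
from torch import nn


class SharedEncoderAutoencoders(nn.Module):
    """Two decoders (source, target) on top of one shared encoder (illustrative sizes)."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        # Shared encoder: 3x64x64 face crop -> latent vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),   # -> 32x32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),  # -> 64x16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        # One decoder per identity.
        self.decoder_source = self._make_decoder(latent_dim)
        self.decoder_target = self._make_decoder(latent_dim)

    @staticmethod
    def _make_decoder(latent_dim: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16),
            nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),  # -> 32x32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),   # -> 3x64x64
            nn.Sigmoid(),
        )

    def reconstruct(self, faces: torch.Tensor, identity: str) -> torch.Tensor:
        """Training path: each identity is rebuilt by its own decoder."""
        decoder = self.decoder_source if identity == "source" else self.decoder_target
        return decoder(self.encoder(faces))

    def swap(self, target_faces: torch.Tensor) -> torch.Tensor:
        """Inference path: encode target faces, decode them with the source decoder."""
        return self.decoder_source(self.encoder(target_faces))
```

Because the encoder is shared, it has to represent pose and expression for both identities, while each decoder learns to render one specific face. That is what makes the swap step possible.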
At the University of Missouri-Kansas City, we came up with a solution for detecting deepfakes called FakeLab.
FakeLab is a tool for journalists, media houses and tech companies that helps identify manipulated photos, videos and audio shared on their platforms.
Below are the detailed steps of our work on deepfake video detection.
1. Data preparation
For our experiments, our data preparation pipeline was split into three streams:
1. We downloaded 10 random deepfake videos and 10 random real videos, both sourced from YouTube. In our selection process, we made sure to choose as many different faces as possible, in different contexts, different backgrounds and different lighting conditions. We also tried to make our data as diverse as possible, to ensure the dataset was representative enough of real-world videos typically encountered on the internet. All the manipulated videos we use in the YouTube dataset fall under the deepfake category.
2. From the FaceForensics++ dataset, we downloaded 10% of the overall data used in the paper. We picked this ratio so that the subset would be comparable in size to the real-world dataset downloaded from YouTube. The FaceForensics++ data consists of original videos on top of which manipulations are applied, which yields parallel real and fake samples. In other words, for a real video A, the FaceForensics++ dataset preparation creates 4 forged versions of the same video, using the four forgery techniques discussed in the introduction above.
3. Finally, we created a third, combined dataset by merging the real-world YouTube and FaceForensics++ training datasets described above.
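The three data streams above can be sketched as a small manifest builder. This is only an illustration under assumptions: the function names, the dictionary fields and the video IDs are hypothetical, and the sketch does no downloading; it only shows how a 10% subset, the four forged versions per real video, and the combined dataset relate to each other.

```python
import random

# The four manipulation methods listed in the introduction.
FORGERY_METHODS = ["Face2Face", "FaceSwap", "Deepfakes", "NeuralTextures"]


def build_ffpp_subset(real_video_ids, fraction=0.10, seed=0):
    """Sample a fraction of the real videos and pair each with its four forgeries."""
    rng = random.Random(seed)
    ids = list(real_video_ids)
    n_keep = max(1, round(len(ids) * fraction))
    samples = []
    for vid in rng.sample(ids, n_keep):
        samples.append({"video": vid, "source": "ffpp", "method": None, "label": "real"})
        for method in FORGERY_METHODS:
            samples.append({"video": vid, "source": "ffpp", "method": method, "label": "fake"})
    return samples


def build_youtube_set(real_ids, fake_ids):
    """YouTube stream: unrelated real and fake videos, all fakes of the deepfake type."""
    real = [{"video": v, "source": "youtube", "method": None, "label": "real"} for v in real_ids]
    fake = [{"video": v, "source": "youtube", "method": "Deepfakes", "label": "fake"} for v in fake_ids]
    return real + fake


def build_combined(youtube_samples, ffpp_samples):
    """Third stream: the two datasets concatenated."""
    return list(youtube_samples) + list(ffpp_samples)
```

Note that each real FaceForensics++ video contributes one real and four fake samples, so that stream is imbalanced towards fakes, while the YouTube stream is balanced.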
2. Data pre-processing
We apply the same data pre-processing strategy for all datasets:
1. Extract frames from each video.
2. Use reliable face-tracking tools such as OpenCV and dlib to detect faces in each frame and crop the image around the face.
When detecting videos created with facial manipulation, it is possible either to train the detectors on entire video frames or to crop the area surrounding the face and apply the detector exclusively to this cropped area. As step 2 indicates, we took the cropping approach.
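The cropping step can be sketched as follows. This is a minimal illustration that assumes the face box has already been produced by a detector such as dlib or OpenCV; the function names, the `margin` value and the frame-sampling interval are assumptions made for the example, not settings taken from our pipeline.

```python
import numpy as np


def sample_frame_indices(total_frames, every_n=10):
    """Pick every n-th frame index so that near-duplicate frames are skipped."""
    return list(range(0, total_frames, every_n))


def crop_face(frame, box, margin=0.3):
    """Crop `frame` (H x W x C array) around `box` = (left, top, right, bottom).

    The box is enlarged on each side by `margin` times its width or height,
    then clamped so the crop never leaves the frame.
    """
    height, width = frame.shape[:2]
    left, top, right, bottom = box
    pad_x = int((right - left) * margin)
    pad_y = int((bottom - top) * margin)
    x0 = max(0, left - pad_x)
    y0 = max(0, top - pad_y)
    x1 = min(width, right + pad_x)
    y1 = min(height, bottom + pad_y)
    return frame[y0:y1, x0:x1]
```

Keeping a margin around the detected box retains the boundary between the face and its surroundings, which is where blending artifacts tend to appear.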
At this stage, we obtained the following data statistics:
3. Data augmentation
Since the datasets are relatively small (100 faces for the entire training dataset), the model is prone to overfitting by memorizing faces. To combat this, we employed a series of different yet simple data augmentation techniques:
Random horizontal flip
Random rotation and scaling
Random perspective (distortion of the image)
Random brightness, contrast and saturation jitter
Random region masking (replace portions of input images with empty blocks)
Note that data augmentation is applied only to training images. The test data is only normalized, and the model is applied to otherwise untouched test images.
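The split between augmented training images and normalized-only test images can be sketched in NumPy. This is an illustration, not our training code: it implements only three of the listed augmentations (flip, brightness jitter, region masking), and the jitter range, block size and ImageNet normalization statistics are assumptions. In practice, torchvision provides ready-made transforms for all of the techniques listed (`RandomHorizontalFlip`, `RandomAffine`, `RandomPerspective`, `ColorJitter`, `RandomErasing`).

```python
import numpy as np

# Commonly used ImageNet channel statistics (assumed here for normalization).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def augment_train(image, rng):
    """Randomly augment a float image in [0, 1] with shape H x W x 3."""
    out = image.copy()
    # Random horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Random brightness jitter: scale pixel values by a factor in [0.8, 1.2].
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 1.0)
    # Random region masking: replace one block (a quarter of each side) with zeros.
    h, w = out.shape[:2]
    block_h, block_w = h // 4, w // 4
    top = rng.integers(0, h - block_h + 1)
    left = rng.integers(0, w - block_w + 1)
    out[top:top + block_h, left:left + block_w, :] = 0.0
    return out


def normalize(image, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Channel-wise normalization, applied to both training and test images."""
    return (image - np.asarray(mean)) / np.asarray(std)


def preprocess(image, train=False, rng=None):
    """Training images are augmented then normalized; test images are only normalized."""
    if train:
        image = augment_train(image, rng if rng is not None else np.random.default_rng())
    return normalize(image)
```

Calling `preprocess(img, train=True, rng=np.random.default_rng(0))` gives a different augmented view on each call with a fresh draw, while `preprocess(img)` is deterministic.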
For training, we used the ResNet18 architecture pre-trained on ImageNet. Because ResNet18 is a smaller model, it allows faster iterations and reduces the chance of overfitting.
We unfroze the weights so they could be fine-tuned on the deepfake detection task. The reasoning behind unfreezing the convolutional layers is to move the weights away from detecting what humans would perceive as the typical set of facial features (eyes, ears, noses, etc.) and towards learning features that are more useful for algorithmic deepfake detection (artifacts, skin color changes, blur, etc.).
We do this in the hope of discouraging the model from merely memorizing which faces are real and which are fake, and encouraging it to look for more useful features instead. If we used a fully connected classifier on top of frozen ResNet18 facial features, such as eyes or ears, the model would be encouraged to memorize that certain faces are associated with the label ‘real,’ while others are associated with the label ‘fake.’ To overcome this, we also fine-tuned the ResNet18 layers so that they start looking for other artifacts useful in deepfake detection, such as blur or two sets of eyebrows appearing on a single face. Such features would enable our fully connected classifier to generalize better to never-before-seen faces.
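In PyTorch, the difference between a frozen feature extractor and a fine-tuned one comes down to the `requires_grad` flag on the convolutional weights. The sketch below uses a tiny stand-in network instead of the real ResNet18 so that it runs without downloading weights; the helper names `set_trainable` and `count_trainable` and the stand-in `backbone` are hypothetical.

```python
import torch
from torch import nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze (False) or unfreeze (True) every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable


def count_trainable(module: nn.Module) -> int:
    """Number of scalar weights that the optimizer would update."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Tiny stand-in for the pre-trained convolutional layers (assumption for the demo).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

set_trainable(backbone, False)  # feature-extractor mode: backbone weights stay fixed
frozen_count = count_trainable(backbone)

set_trainable(backbone, True)   # fine-tuning mode: backbone weights are updated too
unfrozen_count = count_trainable(backbone)
```

With the backbone frozen, only the classifier on top can adapt, so it has to work with generic facial features. With it unfrozen, gradients from the real/fake loss also reshape the convolutional filters.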
We also replaced the fully connected layer of ResNet18 with a classifier made of one or two dense layers. In all of our training runs, we either made the classifier linear (i.e. one fully connected layer) or added a single hidden layer with a ReLU activation. This Boolean choice functions as a hyperparameter that we can switch on and off during our experiments.
For regularization, we employed dropout on the convolutional layers’ outputs, as well as dropout and batch normalization on the hidden fully connected layer’s output (where applicable). We also applied weight decay using the L2 norm to all trainable model weights (including the pre-trained ResNet convolutional weights).
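A classifier head along these lines could look like the sketch below. The hidden width, dropout probability, learning rate, weight-decay value and the exact ordering of batch normalization, ReLU and dropout are assumptions chosen for illustration; only the overall structure (optional hidden layer, dropout, batch normalization, L2 weight decay) follows the description above.

```python
import torch
from torch import nn

RESNET18_FEATURES = 512  # width of ResNet18's pooled feature vector


def build_head(use_hidden: bool, hidden_dim: int = 256,
               p_drop: float = 0.5, num_classes: int = 2) -> nn.Sequential:
    """Classifier head; `use_hidden` is the Boolean hyperparameter."""
    layers = [nn.Dropout(p_drop)]  # dropout on the convolutional features
    if use_hidden:
        layers += [
            nn.Linear(RESNET18_FEATURES, hidden_dim),
            nn.BatchNorm1d(hidden_dim),  # batch norm on the hidden layer output
            nn.ReLU(),
            nn.Dropout(p_drop),          # dropout on the hidden layer output
            nn.Linear(hidden_dim, num_classes),
        ]
    else:
        layers.append(nn.Linear(RESNET18_FEATURES, num_classes))  # linear classifier
    return nn.Sequential(*layers)


head = build_head(use_hidden=True)

# `weight_decay` in SGD adds an L2 penalty on every parameter passed in.
# In a full model, the backbone's parameters would be passed as well.
optimizer = torch.optim.SGD(head.parameters(), lr=1e-3, weight_decay=1e-4)
```

With torchvision's ResNet18, a head like this would typically be assigned to the model's `fc` attribute, replacing the original ImageNet classification layer.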
To analyze the performance of our models, we use both accuracy and AUC as metrics. We visualize accuracy because it is a simple, intuitive metric. We also visualize AUC for its discriminatory power, which helps us recognize whether the models are learning a decision boundary between real videos and deepfakes or just guessing randomly.
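To make the difference between the two metrics concrete, here is a small pure-Python sketch. It assumes label 1 means fake and 0 means real, and that the model outputs a score where higher means more likely fake; the function names are illustrative. AUC is computed from its pairwise definition: the probability that a randomly chosen fake receives a higher score than a randomly chosen real sample.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


def roc_auc(labels, scores):
    """Probability that a random fake outscores a random real (ties count as 0.5)."""
    fakes = [s for y, s in zip(labels, scores) if y == 1]
    reals = [s for y, s in zip(labels, scores) if y == 0]
    if not fakes or not reals:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for f in fakes:
        for r in reals:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fakes) * len(reals))


labels = [0, 0, 1, 1]
print(accuracy(labels, [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(roc_auc(labels, [0.1, 0.4, 0.35, 0.8]))   # 0.75

# A model that gives every sample the same score has no ranking power:
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5]))    # 0.5
```

An AUC close to 0.5 indicates random guessing regardless of the threshold, which is why it complements accuracy, especially when the real and fake classes are imbalanced.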
To run inference on test videos, we created a web UI that keeps a record of every inference run on test files.
Here is a demo.
All the source code we used in model training is available on GitHub here.
Check out: Fakelab – A Deepfake Audio Detection Tool