How to Deploy a Real-Time Computer Vision Model in Production

Apr 4, 2022

In this article, we’ll cover how to deploy a real-time computer vision model in production, using real-world case studies: image classification and pathology detection.

Authors: Aruna Sri Turlapati and Nandhini Nallamuthu.

Slides from the webinar


What is a Real-Time Computer Vision Model? 

In general, any machine learning or deep learning model that makes predictions with negligible delay is referred to as a real-time model, and the same holds for computer vision (CV) models. Real-time models find applications in a multitude of domains. One prominent example is self-driving cars, where the model must instantly detect the position of other vehicles to adjust the acceleration or change direction. Real-time CV models are also used in critical applications such as remote surgery, smart farming, biometric access, industrial robots, and medical imaging on Edge devices.

One noteworthy challenge with models running in real time is latency: predictions must be produced at high speed, so model size is often traded off for a faster response time. Another significant factor is the target device, which could be a Raspberry Pi with a Coral USB Accelerator or a device from the NVIDIA family such as the DGX A100. Depending on the device type, trained models are converted and compiled into specific formats to be compatible with the Edge device. Moreover, Edge devices have restricted computational power and storage, which calls for highly compact models.

This article discusses the challenges and optimization techniques for real-time computer vision applications, followed by a case study on deploying a CV model on a Raspberry Pi with the Google Coral Edge TPU, specifically for image classifiers and object detectors.

Why Edge devices? Some of the prime reasons for employing this hardware include:

  • Ease of use and optimal performance on dedicated devices
  • Limited network connectivity (remote areas)
  • Low latency requirement 
  • Better security, privacy 

However, hardware on the Edge has limited capacity, so models need to be converted and optimized for a small footprint, low latency, low power consumption, low memory use, and offline inference. Edge inference therefore brings its own challenges and requirements for network design and optimization of the computer vision model.

A real-time CV application has to deliberately split and coordinate the workload to leverage the advantages of both cloud and Edge components. Data preprocessing and model training happen in the cloud, while inference happens on the model deployed at the Edge. Edge AI makes immediate predictions and supports real-time decisions, while Cloud AI takes care of the non-real-time parts of the pipeline, such as data processing, training, and analyzing new data from the Edge for deeper insights. The advent of 5G wireless technology has made reliable wireless connectivity possible for IoT devices and opened doors for Edge AI applications such as remote surgery.

Model Optimization Techniques for Edge Deployment

Model optimization techniques make a model suitable for the Edge by compressing it with minimal loss in accuracy. The two most widely used techniques, model pruning and quantization, are discussed here.

Model Pruning

Pruning is a process in which insignificant neurons and connections, those with only a minor impact on predictions, are removed from the model. In practice, pruning sets near-zero weights to exactly zero, which makes the weight tensors sparse. A pruned model stored in a dense format is the same size on disk as the original, but the many zeros mean it can be compressed far more effectively.

Model Pruning

Original image – Link
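The idea can be tried out on a single weight matrix. The snippet below is a minimal sketch of magnitude pruning in NumPy, not the method used by any particular library: the helper name magnitude_prune, the random stand-in weights, and the 50% sparsity target are all assumptions chosen for illustration. It shows that the raw byte count is unchanged while the zlib-compressed size shrinks.

```python
import zlib
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero."""
    k = int(round(sparsity * weights.size))  # number of weights to zero out
    if k == 0:
        return weights.copy()
    # Magnitude of the k-th smallest weight; everything at or below it is pruned.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights).astype(weights.dtype)

rng = np.random.default_rng(0)
dense = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in for a layer's weights
pruned = magnitude_prune(dense, sparsity=0.5)

achieved_sparsity = float(np.mean(pruned == 0))
dense_zipped = len(zlib.compress(dense.tobytes()))
pruned_zipped = len(zlib.compress(pruned.tobytes()))

print("raw bytes (dense, pruned):", dense.nbytes, pruned.nbytes)
print("zlib bytes (dense, pruned):", dense_zipped, pruned_zipped)
print("achieved sparsity:", achieved_sparsity)
```

In a real workflow the pruning is usually applied gradually during fine-tuning so that the remaining weights can compensate for the removed ones; the one-shot version here only illustrates the storage effect.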

Model Quantization 

Quantization is the process of reducing the precision of the learnable parameters (static) and activations (dynamic) from 32-bit floating point to either float16 or integer, which can be done with the TensorFlow Lite (TFlite) converter. Since we are targeting the Google Coral Edge TPU, which accepts only integer-quantized models, let’s dive deep into the integer quantization process. It is one of the most popular optimization techniques because integer operations can be executed faster.

By default, deep learning libraries store weights and variables in FP32 precision. Switching to INT8 precision gives a significant improvement: a 4x reduction in model size and a 2x to 4x speedup in model inference.

Model Quantization

Original image: Link 

Quantization can be done post-training or during the training of the model. Post-training quantization can be performed easily with minimal resources, while Quantization Aware Training gives optimal size and accuracy tradeoff. The type of quantization method to be used is decided based on the latency, size, and other requirements of the CV application and hardware limitations on the Edge device.

Post Training Quantization:

  • Static quantization to Float16 or INT8 can be performed on the model. Float16 causes an insignificant accuracy loss and needs no data, while INT8 needs a small set of unlabelled data, causes a small accuracy loss, and gives a better reduction in size.
  • Dynamic quantization combines aspects of the INT8 and Float16 methods. Here, the weights are quantized post training and the activations are quantized dynamically during inference.

Quantization Aware Training:

  • It provides better model accuracy while retaining the other benefits of quantization. It emulates inference-time quantization during training, creating a model that downstream tools then use to produce the actual quantized model.
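As a rough illustration of how the post-training options listed above map onto converter settings, the sketch below converts a toy Keras model in three ways: plain float32, float16, and dynamic range. The toy model and its layer sizes are assumptions made only so that the snippet is self-contained; the converter attributes are the ones exposed by tf.lite.TFLiteConverter.

```python
import tensorflow as tf

# Toy stand-in model (an assumption for illustration; any trained Keras model works).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

def convert(mode: str) -> bytes:
    """Convert `model` to a TFlite flatbuffer using one post-training mode."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    if mode == "float16":
        # Weights are stored as float16; no calibration data is needed.
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        converter.target_spec.supported_types = [tf.float16]
    elif mode == "dynamic":
        # Weights are stored as int8; activations are quantized at inference time.
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # mode == "float32": plain conversion without any optimization.
    return converter.convert()

sizes = {mode: len(convert(mode)) for mode in ("float32", "float16", "dynamic")}
print(sizes)
```

Full integer quantization, the variant required by the Edge TPU, additionally needs a representative dataset and is covered in the case study.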

The quantization techniques are applied to a TensorFlow CV model deployed on an Edge TPU using TensorFlow Lite, and are discussed in detail below.

TensorFlow models are serialized in the protocol buffer format, which contributes to their substantial size. Moreover, raw trained models have learnable parameters (weights and biases) and activation outputs expressed as 32-bit floating-point numbers, resulting in higher computation time. Such models cannot be installed directly on Edge devices because they are too resource-demanding. In addition, the Edge TPU compiler accepts only integer-quantized models, so heavy models cannot be compiled as they are. Therefore, we need to compress deep learning models before installation.


Original Image Source

To achieve lightweight models, TensorFlow Lite (TFlite) was introduced. TFlite models are essentially the original models converted to the FlatBuffers serialization format, and thanks to their condensed size they can be deployed on mobile and IoT devices as well as on a Raspberry Pi with a TPU. Note: serialization and deserialization are the processes of encoding and decoding a data structure, mainly for transmission (portability). Since FlatBuffers need no additional parsing step at inference time (i.e., the in-memory structure is the same as the one stored on disk), they are fast, and the entire model is closely packed. The trade-off is that not all TensorFlow ops can be converted to the TensorFlow Lite format, and some TensorFlow operations need to be fused into a composite op during the conversion.

The TFlite workflow has two major components: (i) the converter and (ii) the interpreter. The converter is responsible for shrinking the model, whereas the interpreter is invoked at inference time. TFlite models can be compressed further with optimization techniques such as quantization, weight pruning, and clustering. In this article, we’ll focus on quantization.

Post-training integer quantization: As the name suggests, the quantization steps are performed after the model is trained, i.e., the training process itself is unchanged. During quantization, the weights and activation outputs are transformed into 8-bit fixed-point numbers (via a transfer function rather than a simple round-off). In other words, the range of the floating-point value distribution is divided into uniform buckets, and each bucket is mapped to one value in the integer range. The crucial parameters are therefore the left (min) and right (max) limits of the floating-point range, which need to be determined to perform the bucketing.

Post-training integer quantization

Original Image Source

For the weights (static quantization), figuring out the min-max values is straightforward because the trained model already holds this information. The activation (dynamic quantization) range is more challenging, as it depends entirely on the input passed into the network. One way to compute the limits is at runtime, once the input is fed into the model and before the mathematical operations are performed. Although this gives a meaningful and accurate representation, the downside of this approach (hybrid quantization) is that the device must support floating-point operations, and the response time suffers.

The second method is to feed a small representative subset of the input data during the conversion process, which acts as a calibration step to locate appropriate limits. Another point to consider when quantizing the weights is the level at which the process is performed. Since every output channel of a convolution has a different weight distribution, it is recommended to keep a separate min-max representation per channel. Furthermore, the model is converted to accept only integer inputs and to output data in integer format. The result is a lighter model (4x smaller) with faster inference.
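To see why channel-level ranges help, the following NumPy sketch compares the round-trip error of one shared scale against one scale per output channel. It assumes symmetric int8 quantization of the weights, and the kernel with deliberately mismatched channel ranges is a made-up example, not data from the case study.

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, axis=None, bits: int = 8) -> np.ndarray:
    """Symmetric quantization round trip.

    axis=None uses one scale for the whole tensor; otherwise one scale
    is kept per index along `axis`.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    if axis is None:
        max_abs = np.max(np.abs(w))
    else:
        reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
        max_abs = np.max(np.abs(w), axis=reduce_axes, keepdims=True)
    scale = max_abs / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)  # integer codes
    return q * scale                               # back to floating point

rng = np.random.default_rng(0)
# Hypothetical conv kernel (height, width, in_channels, out_channels) whose
# output channels have very different weight ranges.
channel_spread = np.array([0.01, 0.1, 1.0, 10.0])
kernel = rng.normal(size=(3, 3, 8, 4)) * channel_spread

per_tensor_err = np.mean(np.abs(kernel - quantize_dequantize(kernel)))
per_channel_err = np.mean(np.abs(kernel - quantize_dequantize(kernel, axis=3)))
print("mean abs error, per tensor :", per_tensor_err)
print("mean abs error, per channel:", per_channel_err)
```

With a single scale, the widest channel dictates the bucket size, so the narrow channels collapse onto a handful of integer codes; per-channel scales avoid that.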

Scale = floating-point bucket size, i.e., (max - min) / (2^bits - 1)

Zero_point = integer value that the floating-point value 0 maps to

quantize_value = round(float_value / scale) + zero_point

float_value = (quantize_value - zero_point) * scale
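The mapping can be checked numerically. The sketch below is an illustrative NumPy version for unsigned 8-bit codes, with hypothetical helper names and an arbitrary example range of [-1.0, 3.0]. It divides by 2^8 - 1 = 255, the number of steps between the lowest and highest integer code, so that the maximum maps exactly onto code 255.

```python
import numpy as np

Q_MAX = 255  # highest uint8 code

def quantization_params(f_min: float, f_max: float):
    """Scale and zero point that map [f_min, f_max] onto [0, 255]."""
    f_min, f_max = min(f_min, 0.0), max(f_max, 0.0)  # the range must contain 0.0
    scale = (f_max - f_min) / Q_MAX
    zero_point = int(round(-f_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    q = np.round(np.asarray(x) / scale) + zero_point
    return np.clip(q, 0, Q_MAX).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

scale, zero_point = quantization_params(-1.0, 3.0)
x = np.array([-1.0, 0.0, 0.5, 3.0], dtype=np.float32)
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)

print("scale:", scale, "zero_point:", zero_point)
print("quantized:", q)
print("round trip:", x_hat)
```

The round-trip values differ from the inputs by at most half a bucket (scale / 2), which is the quantization error the rest of this section is concerned with.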

Quantization aware training (QAT): In the above approach, there is a small dip in accuracy because the trained parameters (float32) are mapped to the integer range. To overcome this shortcoming, quantization aware training introduces the inference-time quantization error during training, which results in a more robust model. Here, the model learns to cope with the mapping between integer and floating-point values. For instance, it takes the learnt floating-point weights, converts them to integers, and then converts them back to floating point before passing them to the subsequent layer.

Since this happens during training itself, the accuracy is not affected much when the model is converted. As the table below shows, QAT maintains the baseline accuracy better than post-training quantization.

Original Image Source

Original Image Source
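The float-to-integer-and-back round trip described for QAT is often called fake quantization. Here is a minimal NumPy sketch of the forward pass only; the function name, the fixed range of [-1, 1], and the sample weights are assumptions for illustration, and a real QAT implementation also has to define how gradients flow through the rounding step (commonly by treating it as the identity).

```python
import numpy as np

def fake_quantize(x: np.ndarray, f_min: float, f_max: float, bits: int = 8) -> np.ndarray:
    """Snap values to the integer grid and immediately map them back to float."""
    levels = 2 ** bits - 1
    scale = (f_max - f_min) / levels
    codes = np.clip(np.round((x - f_min) / scale), 0, levels)  # integer codes
    return codes * scale + f_min                               # float again

weights = np.array([-0.731, -0.2, 0.0004, 0.25, 0.9], dtype=np.float64)
w_fq = fake_quantize(weights, f_min=-1.0, f_max=1.0)
scale = 2.0 / 255  # bucket size for the range [-1, 1] at 8 bits

print("original      :", weights)
print("fake-quantized:", w_fq)
print("max abs error :", np.max(np.abs(weights - w_fq)))
```

Because the next layer only ever sees the snapped values during training, the loss already accounts for the rounding error that will appear after conversion.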

Case Study: Image Classification

To understand the entire process, we’ll take a simple convolutional architecture and do the conversion step by step. The input for our experiment is the Street View House Numbers (SVHN) dataset from Stanford. To download the dataset in ‘.h5’ format, refer to the Kaggle house dataset.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten

model = Sequential()
model.add(Conv2D(filters=32, kernel_size=3, activation='relu', input_shape=(32, 32, 1)))
model.add(Conv2D(filters=32, kernel_size=3, activation='relu'))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))

Before diving into the TFlite conversion, we need baseline metrics for the performance and model size comparison. So, we take the raw TensorFlow model and then compute the accuracy on the test set along with the model size. 

import os

# Accuracy on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print('test accuracy:', accuracy)

test accuracy: 0.8297777771949768

# Save the model with weights and check the file size
model.save('raw_tensorflow_model.h5')
print('size of the raw model file in bytes:', os.path.getsize('raw_tensorflow_model.h5'))

size of the raw model file in bytes: 38711280 ~ 36.9179 MB

Conversion 1: A simple conversion of the TensorFlow model into a TensorFlow Lite model (without quantization).

import os
import tensorflow as tf

# Start the TFlite conversion process
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
# Save the converted TFlite model
with open('/content/mydrive/MyDrive/SVHN/tflite_model.tflite', 'wb') as f:
    f.write(tflite_model)
print('size of the tflite model in bytes:', os.path.getsize('/content/mydrive/MyDrive/SVHN/tflite_model.tflite'))

size of the tflite model in bytes: 12891912 ~ 12.29468 MB

Comparing the sizes, the TFlite model is about 3x smaller. To compute the accuracy of the model, we can use tf.lite.Interpreter in Google Colab. If we are evaluating the model on the Edge TPU, however, it is better to use tflite_runtime.interpreter: it is a small packaged version of tf.lite that contains only the libraries required for interpretation, which saves disk space on Edge devices.

interpreter = tf.lite.Interpreter('/content/mydrive/MyDrive/SVHN/tflite_model.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

Since we need to evaluate the entire batch of the test set, the input tensors have to be resized to accommodate the entire length of the test set. 

interpreter.resize_tensor_input(input_details[0]['index'], (18000, 32, 32, 1))
interpreter.resize_tensor_input(output_details[0]['index'], (18000, 10))
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print("input details shape:", input_details[0]['shape'])
print("input type:", input_details[0]['dtype'])
print("output details shape:", output_details[0]['shape'])
print("output type:", output_details[0]['dtype'])

input details shape: [18000 32 32 1]
input type: <class 'numpy.float32'>
output details shape: [18000 10]
output type: <class 'numpy.float32'>

import numpy as np
from sklearn.metrics import accuracy_score

test_float_numpy = np.array(X_test, dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], test_float_numpy)
interpreter.invoke()
tflite_model_predictions = interpreter.get_tensor(output_details[0]['index'])
prediction_classes = np.argmax(tflite_model_predictions, axis=1)
y_test_classes = np.argmax(y_test, axis=1)
acc = accuracy_score(y_test_classes, prediction_classes)
print('accuracy of the converted model without integer quantization:', acc)

accuracy of the converted model without integer quantization: 0.8297777777777777

Since floating-point values are maintained throughout, there is no drop in accuracy. Because this model accepts only float32 values, the test data is explicitly cast to float32 before it is passed to the interpreter.

Conversion 2: Conversion of the TensorFlow model into a TensorFlow Lite model with quantization. The main parameters here are a representative dataset, which is used to decide the scaling factor for the input range, and the supported ops, which should be restricted to the integer set. In addition, for the Edge TPU compiler, both the input and output should use an integer datatype.

def representative_data_gen():
    # Iterable elided here: supply a small sample of the training images,
    # one float32 batch of shape (1, 32, 32, 1) at a time.
    for input_value in ...:
        yield [input_value]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Ensure that if any ops can't be quantized, the converter throws an error
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set the input and output tensors to uint8 (APIs added in r2.3)
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model_quant = converter.convert()
with open('/content/mydrive/MyDrive/SVHN/quant_tflite_model.tflite', 'wb') as f:
    f.write(tflite_model_quant)
print('size of the quant_tflite model in bytes:', os.path.getsize('/content/mydrive/MyDrive/SVHN/quant_tflite_model.tflite'))

size of the quant_tflite model in bytes: 3228128 ~ 3.0785 MB
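The reported file sizes can be sanity-checked with a little arithmetic on the architecture. The sketch below counts the parameters of the Sequential model defined earlier and estimates the storage needed at each precision. The helper names are hypothetical, and the int8 estimate assumes 8-bit weights with 32-bit biases, which is the usual layout for fully integer-quantized models.

```python
def conv2d_params(kernel, in_ch, out_ch):
    """(weights, biases) of a Conv2D layer with a square kernel."""
    return kernel * kernel * in_ch * out_ch, out_ch

def dense_params(in_units, out_units):
    """(weights, biases) of a Dense layer."""
    return in_units * out_units, out_units

# 32x32x1 input, two 3x3 convolutions without padding (32 -> 30 -> 28),
# then Flatten (28 * 28 * 32 values) and two Dense layers.
layers = [
    conv2d_params(3, 1, 32),
    conv2d_params(3, 32, 32),
    dense_params(28 * 28 * 32, 128),
    dense_params(128, 10),
]
n_weights = sum(w for w, _ in layers)
n_biases = sum(b for _, b in layers)

float32_bytes = 4 * (n_weights + n_biases)
int8_bytes = n_weights + 4 * n_biases  # assumption: int8 weights, int32 biases

print("parameters        :", n_weights + n_biases)
print("float32 estimate  :", float32_bytes, "bytes")
print("int8 estimate     :", int8_bytes, "bytes")
```

The estimates of roughly 12.89 MB and 3.22 MB sit within a few kilobytes of the reported 12,891,912 and 3,228,128 bytes, the remainder being file-format overhead. The raw .h5 file is about three times the float32 estimate, which would be consistent with the optimizer state being saved alongside the weights, although the article does not state which optimizer was used.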

Compared with both the original model and the non-quantized TFlite model, the quantized version is considerably more compact. Since the input has to be given in uint8 format, the floating-point values have to be converted to integers using the scaling factor obtained during quantization.

interpreter = tf.lite.Interpreter('/content/mydrive/MyDrive/SVHN/quant_tflite_model.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

[{'dtype': numpy.uint8, 'index': 0, 'name': 'serving_default_conv2d_input:0', 'quantization': (0.997555673122406, 0), 'quantization_parameters': {'quantized_dimension': 0, 'scales': array([0.9975557], dtype=float32), 'zero_points': array([0], dtype=int32)}, 'shape': array([ 1, 32, 32, 1], dtype=int32), 'shape_signature': array([-1, 32, 32, 1], dtype=int32), 'sparsity_parameters': {}}]

Applying scaling and the zero_points to the input, and again rescaling back to the floating point for predictions: 

quant_predict_ls = []
input_scale, input_zero_point = input_details[0]["quantization"]
output_scale, output_zero_point = output_details[0]["quantization"]
# Now we'll loop through the test set for the conversion and interpretation
for curr_image in X_test:
    # Quantize the float image to uint8 and add the batch dimension
    test_image = curr_image / input_scale + input_zero_point
    test_image = np.expand_dims(test_image, axis=0).astype(input_details[0]["dtype"])
    interpreter.set_tensor(input_details[0]["index"], test_image)
    interpreter.invoke()
    quantize_output = interpreter.get_tensor(output_details[0]["index"])[0]
    # Convert the output back to floating point using the output tensor's parameters
    float_value = (quantize_output.astype(np.float32) - output_zero_point) * output_scale
    quant_predict_ls.append(float_value)
print('sample quantized output before float conversion:', quantize_output)

sample quantized output before float conversion: [ 0 0 255 0 0 0 0 0 0 0]

prediction_classes = np.argmax(np.array(quant_predict_ls), axis=1)
y_test_classes = np.argmax(y_test, axis=1)
acc = accuracy_score(y_test_classes, prediction_classes)
print('accuracy of the converted model with integer quantization:', acc)

accuracy of the converted model with integer quantization: 0.8264444444444444

For the integer-quantized model, there is a small reduction in accuracy compared with the non-quantized model; in this case it is insignificant. Note that the raw model outputs are integers, as the sample output above shows, and are converted back to floating point before the predictions are compared.

Once we’ve validated the quantized model, the next step is Edge TPU compilation. If we are using Google Colab, we can install the compiler with the commands below.
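For classification, the predicted class is the same whether it is read from the integer output or from the dequantized one, because dequantization with a positive scale preserves the ordering of the values. The short sketch below illustrates this on the sample output shown earlier; the output scale of 1/256 and zero point of 0 are assumed values, and the real ones should be read from output_details[0]["quantization"].

```python
import numpy as np

def dequantize(q, scale, zero_point):
    """Map integer codes back to floating point."""
    return (q.astype(np.float32) - zero_point) * scale

# Sample uint8 output of the quantized model.
quantized_output = np.array([0, 0, 255, 0, 0, 0, 0, 0, 0, 0], dtype=np.uint8)
# Assumed output quantization parameters (hypothetical values).
output_scale, output_zero_point = 1.0 / 256, 0

probabilities = dequantize(quantized_output, output_scale, output_zero_point)
class_from_int = int(np.argmax(quantized_output))
class_from_float = int(np.argmax(probabilities))

print("dequantized output:", probabilities)
print("predicted class (int, float):", class_from_int, class_from_float)
```

Dequantization still matters when the actual probability is needed, for example to apply a confidence threshold.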

! curl | sudo apt-key add -
! echo "deb coral-edgetpu-stable main" | sudo tee /etc/apt/sources.list.d/coral-edgetpu.list
! sudo apt-get update
! sudo apt-get install edgetpu-compiler

After we have installed the compiler, the next step is to perform the compilation by issuing the below command. 

! edgetpu_compiler --min_runtime_version 16 -a -d -s

After executing the command, we get the output shown below:

Model successfully compiled but not all operations are supported by the Edge TPU. A percentage of the model will instead run on the CPU, which is slower. If possible, consider updating your model to use only operations supported by the Edge TPU. For details, visit
Number of operations that will run on Edge TPU: 8
Number of operations that will run on CPU: 1

Operator          Count   Status
FULLY_CONNECTED   2       Mapped to Edge TPU
QUANTIZE          2       Mapped to Edge TPU
SOFTMAX           1       Mapped to Edge TPU
CONV_2D           2       Mapped to Edge TPU
RESHAPE           1       Mapped to Edge TPU
LEAKY_RELU        1       Operation not supported

Since the Edge TPU compiler does not support Leaky_Relu, the operation is not mapped to the TPU; instead, it falls back to the CPU and thus increases the computation time. If we change the activation function to ‘relu’, it will be mapped to the TPU for faster processing. It is always worth keeping an eye on the TPU-supported operations for efficient computation.

The compilation step produces two outputs: a log file, ‘quant_tflite_model_edgetpu.log’, which is the compilation log, and an Edge-compiled file, ‘quant_tflite_model_edgetpu.tflite’, which can be installed on the Edge TPU for making inferences.
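When models have many layers, it can help to scan the compiler's operator table automatically for ops that fell back to the CPU. The helper below is a hypothetical sketch that parses a table shaped like the one shown above; the exact layout of the real log may differ between compiler versions, so the regular expression is an assumption to adapt.

```python
import re

def unsupported_ops(compiler_output: str) -> list:
    """Return operator names whose status is not 'Mapped to Edge TPU'."""
    ops = []
    for line in compiler_output.splitlines():
        # Expected row shape: OPERATOR_NAME <count> <status text>
        match = re.match(r"^\s*([A-Z0-9_]+)\s+(\d+)\s+(.+?)\s*$", line)
        if match and match.group(3) != "Mapped to Edge TPU":
            ops.append(match.group(1))
    return ops

sample_output = """\
Operator          Count   Status
FULLY_CONNECTED   2       Mapped to Edge TPU
QUANTIZE          2       Mapped to Edge TPU
SOFTMAX           1       Mapped to Edge TPU
CONV_2D           2       Mapped to Edge TPU
RESHAPE           1       Mapped to Edge TPU
LEAKY_RELU        1       Operation not supported
"""

cpu_ops = unsupported_ops(sample_output)
print("operations falling back to the CPU:", cpu_ops)
```

In practice the text would come from reading the ‘quant_tflite_model_edgetpu.log’ file rather than from an inline string.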

Other Real-world Computer Vision Models Deployment Case Studies

About 87% of machine learning models are never deployed in production. The deployment of a Real-time Computer Vision Model was discussed with experts at the Omdena event How to Deploy Real-Time Computer Vision Models in Production.

You can learn the best strategies and tools to effectively deploy a Real-Time Computer Vision Model. In addition, two Omdena Incubated Startups will present real-world case studies (Weedbot and Envisionit).

How to Deploy Real-Time Computer Vision Models in Production

Original Image: Omdena

Have a look at another real-world Computer Vision model deployment tutorial as a mobile app using Docker for an Ultrasound solution on detecting pathologies through computer vision on 2D pictures and video streams.

Original Image: Omdena Endpoint mobile

Original Image: Omdena Endpoint mobile

Here you can find another real-world Computer Vision project tutorial on the precise classification and location of crops and weeds for smart farming, where several methods such as instance and semantic segmentation are applied.

Fig.10, the predicted results of Yolact++ using Weedbot data.

Original Image: Omdena

