Optimization of an Edge-based Inference Pipeline for Weed Control
May 7, 2021
This article captures the optimizations explored for the AI-enabled edge-based weed controller on the Nvidia Jetson Xavier AGX. The experiments include TensorRT quantization and calibration, benchmarking of the inference pipeline based on the YolactEdge and Bonnetal models, and enhancements to the non-model parts of the pipeline.
Introduction
An Omdena team of 40 members across the globe collaborated on the Weedbot challenge: optimizing a real-time computer vision application on the Nvidia Jetson Xavier AGX that helps eliminate weeds in the field with a laser beam.
How did we achieve low latency and high precision targets in a span of 8 weeks?
Background
U.S. Geological Survey (USGS) scientists report that glyphosate, known commercially by many trade names, and its degradation product AMPA (aminomethylphosphonic acid) are transported off-site from agricultural and urban sources and occur widely in the environment.
It is very common in agricultural practice to use chemical herbicides to remove weeds and enhance crop production. However, these chemicals have harmful effects on the environment: as much as 55% of the toxic residue is observed in soil, water, and the atmosphere. This impacts life in water bodies and the surrounding plant and algal species that support agroecosystems, resulting in biodiversity loss.
Herbicides also have adverse effects on the people working the crop, and traces of the herbicide remain in the produce and thereby enter the human body, causing toxicity.
To counter these effects, sustainable practices have gained prominence, and technology-enabled weed control supports organic farmers to a great extent in replacing manual work. This would facilitate pesticide-free food production and reduce the final price of such food, encouraging people to buy organic food and follow a healthy lifestyle.
AI Weed Machinery
The Hardware and needed libraries
The AI-based laser weeding machinery is built around the NVIDIA Jetson AGX Xavier, which enables high-performance edge AI applications. The system uses a high-speed camera, and the Xavier module provides a 512-core GPU, an 8-core ARM CPU, and 32 GB of memory. It runs Linux and delivers 32 TeraOPS of compute performance in user-configurable 10/15/30 W power profiles.
At just 100 x 87 mm, Jetson AGX Xavier offers big workstation performance, making it ideal for autonomous machines like delivery and logistics robots, factory systems, and large industrial UAVs.
Designed specifically for autonomous machines, Jetson AGX Xavier has the performance to handle obstacle detection algorithms critical to next-generation robots. It gives GPU workstation-class performance with up to an unparalleled 32 TeraOPS (TOPS) of peak compute and 750 Gbps of high-speed I/O in a compact form factor. NVIDIA’s rich set of AI tools and workflows enables developers to train and deploy neural networks quickly.
JetPack 4.4.1 is installed on the Jetson hardware; it includes the libraries needed for the pipeline, such as TensorRT 7.1.3 with support for INT8-calibrated quantized models, cuDNN for high-performance deep learning primitives, and CUDA 10.2.
NVIDIA's TensorRT is a high-performance deep learning inference runtime library for image classification, segmentation, and object detection neural networks. TensorRT is built on CUDA, NVIDIA's parallel programming model, and enables optimized inference for models from all major deep learning frameworks. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications.
A high-precision, low-latency inference pipeline is built using real-time instance and semantic segmentation frameworks such as YolactEdge and Bonnetal, with a lightweight backbone such as MobileNet.
To learn more about YolactEdge and how it was used in this project, read here.
Image Annotations for training the model
The camera on the hardware captures a high frame-rate video stream from the top of the crop row. Images with a resolution of 3008x3008 px are used for training the model, and each image contains multiple plant annotations. The MS COCO annotation format is used for labeling the images with the CVAT labeling tool, and augmentations (via the Albumentations library) are applied to the image dataset to boost the performance of the network model.
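For illustration only, a minimal Albumentations pipeline for COCO-style box annotations might look like the sketch below; the actual transforms, probabilities, and label fields used in the project may differ.

```python
import numpy as np
import albumentations as A

# Augmentation pipeline that keeps COCO-format bounding boxes consistent
# with the transformed image.
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.3),
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.5),
    ],
    bbox_params=A.BboxParams(format="coco", label_fields=["class_labels"]),
)

# Dummy image and one COCO-format box [x, y, width, height] for illustration.
image = np.zeros((3008, 3008, 3), dtype=np.uint8)
bboxes = [[100, 150, 60, 80]]
class_labels = [1]

augmented = transform(image=image, bboxes=bboxes, class_labels=class_labels)
aug_image, aug_bboxes = augmented["image"], augmented["bboxes"]
```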
The images are further annotated to increase the number of plant classes. The dataset with additional annotation classes gave better mAP at inference than the initial annotations.
Optimization of pipeline
Let us dive into the optimizations of the pipeline and the benchmarking of the models explored. Broadly, the enhancements are investigated in these areas:
- TensorRT optimizations to the model
- Partial code conversion to C++
- Post-processing of images using CuPy
Optimizing Deep Learning Models
When optimizing deep learning models, there are different approaches one can take, such as pruning, quantization, and knowledge distillation, to reduce the latency and size of the model with minimal loss in accuracy. These techniques compress the models and make them faster.
Model pruning
The first approach, model pruning, reduces the size of the final neural network by reducing the number of parameters in order to cut memory, latency, battery, and hardware consumption without sacrificing accuracy, so lightweight models can be deployed on-device. The well-known paper "The Lottery Ticket Hypothesis" by Jonathan Frankle and Michael Carbin shows that inside neural networks there exists a sub-network (a "lottery ticket") which, when trained in isolation, performs on par with the whole network. This indicates that not all connections in the network are important and some can be omitted through iterative pruning. The resulting pruned model is lightweight and faster without loss of accuracy. There are different types of pruning, such as iterative pruning, weight pruning, and neuron pruning.
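As a minimal sketch of weight pruning (not something applied in this project), PyTorch's pruning utilities can zero out the smallest-magnitude weights of a layer; the model and pruning amount below are purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy model standing in for any convolutional backbone.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))

# Zero out the 30% smallest-magnitude weights in each conv layer (weight pruning).
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent
```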
Model Quantization
By default in all deep learning libraries, the variables and weights of neural networks are stored with FP32 (float32) precision. If INT8 precision is used, there is a 4x reduction in the model size and a 4x reduction in memory bandwidth requirements. INT8 models provide a 2x to 4x speedup compared to FP32 models. There are different ways to perform quantization, such as dynamic quantization, static quantization, and quantization-aware training (QAT). The post-training methods, i.e. dynamic and static quantization, are the simplest: a model is trained with FP32 precision and, at prediction time, its weights are quantized to FP16 or INT8. There is a performance loss associated with this type of quantization. In QAT, quantization is performed during training, and there is little to no loss in accuracy with this approach.
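As a minimal sketch of post-training quantization (shown only to illustrate the FP32 to INT8 idea; the project used TensorRT calibration instead, described below), PyTorch's dynamic quantization converts the weights of selected layer types to INT8:

```python
import torch
import torch.nn as nn

# A toy FP32 model standing in for any trained network.
fp32_model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
fp32_model.eval()

# Replace Linear weights with INT8 representations; activations are
# quantized on the fly at inference time.
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)
```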
Knowledge Distillation
The final approach for compressing neural networks is the knowledge distillation technique, in which knowledge is transferred from a large teacher model to a small student model. The idea was introduced in the paper "Distilling the Knowledge in a Neural Network" by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. The basic approach is to first train a teacher, a large model that generalizes well on the dataset. A knowledge dataset is then created where, instead of ground truths, the predictions of the teacher model are used as targets; these are also called soft targets. Finally, a small student model is trained on this knowledge dataset and achieves performance on par with the teacher model. This is another approach to obtain small models, with little loss of accuracy, that are faster and more memory-efficient for edge devices.
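A minimal sketch of a distillation loss in the spirit of Hinton et al. is shown below; the temperature and weighting values are illustrative, not taken from the project.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft-target term: KL divergence between the softened teacher and
    # student output distributions, scaled by T^2 as in the original paper.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```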
Nvidia TensorRT optimization
For this project, a quantization approach using the Nvidia TensorRT library was explored. NVIDIA TensorRT is a C++ library optimized for high-performance inference. After training a model, TensorRT supports the ONNX, Caffe, or UFF parser to convert the saved trained model into the TensorRT format, so the model must be saved as an ONNX or TensorFlow model to support TensorRT parsing. The parsed network is passed to the TensorRT build step, which creates an optimized inference engine based on various optimization parameters: batch size, workspace size, mixed precision, bounds on dynamic shapes, and more. The supported precisions, depending on the device, are FP32, FP16, and INT8. If INT8 mode is used, TensorRT requires a calibration dataset, which is used to appropriately adjust the quantization scaling. The optimized inference engine can then be saved in a serialized format.
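A hedged sketch of this build step using the TensorRT 7.x Python API is shown below. The file names and "my_calibrator" (an IInt8EntropyCalibrator2 implementation that feeds calibration images) are placeholders, not artifacts from the project.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit-batch network definition, required for ONNX models.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the exported ONNX model into a TensorRT network.
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30        # 1 GiB build workspace
config.set_flag(trt.BuilderFlag.INT8)      # request INT8 precision
config.int8_calibrator = my_calibrator     # calibration dataset wrapper

# Build and serialize the optimized inference engine.
engine = builder.build_engine(network, config)
with open("model_int8.engine", "wb") as f:
    f.write(engine.serialize())
```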
Experiments were conducted at image resolutions of 1920x1200 px and 560x560 px to observe the precision versus latency trade-off for the trained networks. With lower resolution, the inference time drops significantly due to fewer computations, while higher resolutions maintain higher precision.
It is observed in the benchmarking of the trained models that the INT8-calibrated model gave the best results, with the lowest latency, at 560x560 px resolution.
Benchmark Results of Bonnetal Model
Let us move on to the optimizations of the non-model parts of the pipeline.
Python code conversion to C++
There are several tasks that run in parallel with the model inference on a CPU thread. Converting these tasks to C++ routines gave a 3.25x speedup over the equivalent functionality in Python. C++ is a preferred choice for high-performance, low-latency applications such as ours, mainly because Python is an interpreted language while C++ is compiled down to machine code and benefits from compiler optimizations.
Image Preprocessing
CUDA-enabled C++/Python libraries are used for image preprocessing before feeding the inference network, for enhanced performance.
Part of the code is converted to CUDA C++ routines to make use of the GPU acceleration available on the hardware. However, GPU acceleration did not reduce the conversion time, mainly because of the small number of computations involved and the overhead of moving data to GPU memory.
A pybind11 wrapper is developed so the C++ routines can be accessed from Python scripts, since the rest of the pipeline is in Python, as is the case with most segmentation frameworks.
pybind11 is a lightweight, header-only library that exposes C++ types in Python and vice versa, mainly to create Python bindings. pybind11 can map core C++ features to Python, such as functions accepting and returning custom data structures, instance and static methods, and overloaded functions. This compact library makes use of C++11 features (tuples, lambda functions), leading to simpler ways of binding in Python.
The NumPy 'ndarray' type is used to pass parameters between the Python and C++ modules. The pybind11 wrapper is built as an extension library and installed as a Python package, which the Python script can import to use the routines seamlessly.
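A hypothetical Python-side usage sketch is shown below; "preprocess_ext" and "preprocess" are placeholder names for the installed pybind11 extension and its routine, not the actual module used in the project.

```python
import numpy as np
import preprocess_ext  # hypothetical pybind11 extension installed as a Python package

# Dummy camera frame standing in for a real capture.
frame = np.zeros((1200, 1920, 3), dtype=np.uint8)

# The ndarray is passed to the C++ routine (and the result returned)
# without any manual conversion code on the Python side.
processed = preprocess_ext.preprocess(frame)
```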
Cupy Enhancements
The pipeline also includes tasks dealing with post-processing of images, all of which are performed with the NumPy library. For a further reduction in pipeline latency, the NumPy routines were converted to CuPy to utilize the CUDA acceleration available on the Jetson Xavier AGX. However, as these calculations are small compared to mask generation, the CuPy conversion did not yield an improvement in latency.
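For illustration, the kind of NumPy-to-CuPy conversion attempted looks like the sketch below; the actual post-processing in the pipeline is more involved, and the mask shapes and threshold here are only examples.

```python
import numpy as np
import cupy as cp

# Example mask scores standing in for the model's post-processing input.
masks_np = np.random.rand(8, 560, 560).astype(np.float32)

masks_gpu = cp.asarray(masks_np)                  # host -> device copy
binary_gpu = (masks_gpu > 0.5).astype(cp.uint8)   # thresholding runs on the GPU
areas = cp.count_nonzero(binary_gpu, axis=(1, 2)) # per-mask pixel counts on the GPU
areas_np = cp.asnumpy(areas)                      # device -> host copy for later steps
```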
Conclusion
The desired inference pipeline latency of close to 12 ms was achieved with good precision by adopting the TensorRT optimizations and CUDA-enhanced libraries available for the Nvidia Jetson Xavier AGX.
References
- NVIDIA JetPack SDK
- NVIDIA TensorRT
- CUDA Toolkit Documentation v11.2.1
- PRBonn/bonnetal: Bonnet and then some! Deep Learning Framework for various Image Recognition Tasks. Photogrammetry and Robotics Lab, University of Bonn
- dudeperf3ct/bonnetal
- haotian-liu/yolact_edge: The first competitive instance segmentation approach that runs on small edge devices at real-time speeds.
- Knowledge Distillation: A Survey, by Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao
- Efficient Weights Quantization of Convolutional Neural Networks Using Kernel Density Estimation based Non-uniform Quantizer, by Sanghyun Seo and Juntae Kim
This article is written by Aruna Sri T.