Smart farming requires precise classification and localization of crops and weeds, and methods such as instance and semantic segmentation are applied to achieve it. The goal is not only high localization and classification accuracy but also the highest possible speed.
In this article, we discuss the implementation and results of YolactEdge, a fully convolutional model for real-time instance segmentation. Its predecessor Yolact was evaluated on MS COCO on a single Titan Xp, where it already ran at real-time speed.
Real-time object detection started with one-stage object detectors such as SSD (Single Shot MultiBox Detector) and YOLO (You Only Look Once). Bringing this speed to instance segmentation required an alternative to two-stage instance segmentation methods, which depend on feature localization to produce masks. Two-stage object detectors use a Region Proposal Network to generate regions of interest in the first stage and then send the region proposals down the pipeline for object classification and bounding-box regression. These methods are more accurate but too slow for real-time object detection.
Fig.1 shows a single-stage detector architecture. This method treats object detection as a simple regression problem: it takes an input image and learns the class probabilities and bounding-box coordinates. Because it detects an object and its pose in a single forward pass that relies on convolutions, it is faster than two-stage detectors.
Yolact then introduced real-time instance segmentation by generating a dictionary of non-local prototype masks over the entire image and predicting a set of linear combination coefficients per instance. The method then linearly combines the prototypes using the predicted coefficients for each instance and crops the result with the predicted bounding box.
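The mask assembly step described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name `assemble_masks`, the array shapes, and the random inputs are hypothetical, and the sigmoid follows the formulation in the Yolact paper rather than anything stated in this article.

```python
import numpy as np

def assemble_masks(prototypes, coefficients, boxes):
    """Combine prototype masks into per-instance masks.

    prototypes:   (H, W, K) array holding K prototype masks
    coefficients: (N, K) array, one coefficient vector per instance
    boxes:        (N, 4) integer array of [x1, y1, x2, y2] pixel coordinates
    """
    # Linear combination of prototypes: (H, W, K) @ (K, N) -> (H, W, N)
    logits = prototypes @ coefficients.T
    # Sigmoid maps each combined map into the range (0, 1)
    masks = 1.0 / (1.0 + np.exp(-logits))
    # Keep only the pixels inside each instance's predicted box
    cropped = np.zeros_like(masks)
    for n, (x1, y1, x2, y2) in enumerate(boxes):
        cropped[y1:y2, x1:x2, n] = masks[y1:y2, x1:x2, n]
    return cropped

# Hypothetical inputs: 4 prototypes on an 8 x 8 grid, 2 detected instances
rng = np.random.default_rng(0)
protos = rng.normal(size=(8, 8, 4))
coeffs = rng.normal(size=(2, 4))
boxes = np.array([[0, 0, 4, 4], [2, 2, 8, 8]])
masks = assemble_masks(protos, coeffs, boxes)
print(masks.shape)  # (8, 8, 2)
```

Because the prototypes are shared by all instances, the per-instance work reduces to one matrix multiplication and a crop, which is what keeps this step cheap.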
Yolact++ incorporates deformable convolutions into the backbone network, which provide more flexible feature sampling and strengthen its ability to handle instances with different scales, aspect ratios, and rotations. Because feature sampling aligns better with instances, the result is a better backbone detector and more precise mask prototypes. Yolact++ also adds an efficient and fast mask re-scoring network, which re-ranks the mask predictions according to their mask quality.
This network is further improved to YolactEdge, a novel real-time instance segmentation approach that runs accurately on edge devices at real-time speeds.
YolactEdge is a video segmentation model, designed to work on successive frames rather than frame by frame like typical segmentation models. It is based on Yolact and shares its architecture without altering it, yet delivers about a 5x speedup over Yolact while keeping about the same accuracy (mAP).
YolactEdge is well suited to instance segmentation on the edge, as it achieves real-time speeds (>30 FPS) on the Jetson AGX Xavier. The speedup comes from optimizations at two levels, system and algorithm.
- System level (TensorRT inference engine): TensorRT is NVIDIA’s deep learning inference optimizer. It converts the weights from FP32 to a combination of INT8 and FP16, which greatly reduces inference time while largely preserving accuracy, even though the network parameters are quantized to fewer bits.
- Algorithm level (exploiting temporal redundancy in video): This is where YolactEdge truly shines. It takes advantage of having a video rather than a single frame, and this is what we explore in detail in this part.
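The reduced-precision idea behind the TensorRT optimization can be illustrated with a small NumPy sketch. It is not TensorRT's procedure: TensorRT selects INT8 scales through its own calibration step, whereas the hypothetical `quantize_int8` below uses a simple symmetric per-tensor scale, and the random weights are made up. The point is only to show how much storage each format needs and how large the rounding error becomes.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: returns int8 values and the scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to floating point."""
    return q.astype(np.float32) * scale

# Hypothetical FP32 weights
rng = np.random.default_rng(0)
w_fp32 = rng.normal(scale=0.1, size=1000).astype(np.float32)

w_fp16 = w_fp32.astype(np.float16)   # half-precision cast
w_q, scale = quantize_int8(w_fp32)   # 8-bit integers plus a single float scale
w_int8 = dequantize(w_q, scale)

# Worst-case rounding error introduced by each format
err_fp16 = np.abs(w_fp32 - w_fp16.astype(np.float32)).max()
err_int8 = np.abs(w_fp32 - w_int8).max()

print(w_fp32.nbytes, w_fp16.nbytes, w_q.nbytes)  # 4000 2000 1000
print(err_fp16, err_int8)
```

INT8 halves the storage again compared with FP16 but rounds more coarsely, which is why a calibration step matters in practice.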
Fig.2 presents the original Yolact architecture, which consists of four components:
- A backbone
- A feature pyramid network
- A prediction head
- A Protonet
The first two perform feature extraction and also appear in other models, while the last two are specific to Yolact. We therefore group the backbone and the FPN as the feature extraction part, and the prediction head and the Protonet as the prediction part.
The prediction part is explained in detail in the Yolact article and uses the feature maps produced by the feature extraction part to generate the final masks.
Note also that feature extraction is more computationally expensive than prediction and accounts for about 60% of the computation cost.
The feature extraction consists of a ResNet and an FPN, each with five stages (C1-C5 and P3-P7). The ResNet down-samples the input frame, and the FPN down-samples and up-samples the output of the ResNet while taking C5, C4, and C3 as skip connections for P5, P4, and P3 respectively.
The FPN feature maps are what the later stages of the architecture use for prediction.
Note that the ResNet is the most resource-hungry part of this network, with C4 and C5 alone taking 41.84% of the whole network's resources.
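To get a feel for the size of the pyramid levels, the spatial resolution of each FPN map can be estimated from the input size. The strides below are an assumption, not something this article states: C3-C5 are taken to follow the usual ResNet strides of 8, 16, and 32, and P6 and P7 are taken to halve the resolution of the level before them.

```python
import math

INPUT_SIZE = 550  # network input resolution in pixels

# Assumed stride of each pyramid level relative to the input image
strides = {"P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}

# Side length of each (square) feature map, rounding up at every level
fpn_sizes = {level: math.ceil(INPUT_SIZE / s) for level, s in strides.items()}
print(fpn_sizes)  # {'P3': 69, 'P4': 35, 'P5': 18, 'P6': 9, 'P7': 5}
```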
Why temporal redundancy?
Because the input is a video, the next frame is very likely to be similar to the previous one with only some changes, and the higher-level feature maps (C4-C5) are more resilient to change than the lower-level ones (C1-C3). Extracting the same features from scratch for every frame is therefore computationally wasteful: many features are shared and can be ‘transformed’ from one frame to another. That is exactly what YolactEdge does, as we explore next.
Two types of frames
YolactEdge extends YOLACT to video by transforming a subset of the features from keyframes (left side of Fig.4) to non-keyframes (right side of Fig.4), to reduce expensive backbone computation.
We separate frames into two types:
- Keyframes: where the whole feature extraction is computed
- Non-keyframes: where some features are computed and the others are ‘transformed’
As we have seen, C4-C5 are the most computationally expensive stages, so we compute them only for keyframes. For non-keyframes they are not computed; instead, P5 and P4 are warped from the previous keyframe, which lets us share, or propagate, features across frames.
This raises the question: how can we predict what the features will look like in the following non-keyframes?
Consider an analogy. If I showed you a video of someone throwing a ball, paused it, and asked where you think the ball will be after I press resume, you would probably ask which direction the ball is moving and predict its next position based on that.
The same applies here: to warp features, we need to know how objects move between frames. In computer vision this is called ‘Optical Flow Estimation’, and there are neural networks designed specifically to take two successive frames as input and output a 2-D flow field that estimates the direction in which every object in the picture is moving. In YolactEdge, the network that computes the flow is inspired by FlowNet.
An example of a flow field given two frames is shown in Fig.5.
A smart implementation of FlowNet
FlowNet stacks two images together and runs them through several convolutions to extract features. These features are refined by recursively upsampling and concatenating them, and the final features are used to predict the flow field.
To reduce the overhead of recalculating features, YolactEdge reuses the C1-C3 features already computed by the ResNet: it concatenates them, applies a few convolution layers, refines them in the same way as FlowNet, and uses the result to predict the flow field.
P4 and P5 are then warped using the flow field, their values in the previous keyframe, and bilinear interpolation.
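The warping step can be sketched with NumPy as follows. This is a minimal illustration, not YolactEdge's implementation: the function name `warp_features`, the array layout, and the flow convention (each flow vector points from a location in the current frame to its source in the keyframe) are assumptions made for the example.

```python
import numpy as np

def warp_features(features, flow):
    """Warp a keyframe feature map to the current frame.

    features: (H, W, C) feature map taken from the previous keyframe
    flow:     (H, W, 2) flow field; flow[y, x] = (dx, dy) points from a
              location in the current frame to its source in the keyframe
    """
    h, w, _ = features.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Source coordinates in the keyframe, clamped to the map borders
    src_x = np.clip(xs + flow[..., 0], 0, w - 1)
    src_y = np.clip(ys + flow[..., 1], 0, h - 1)

    # The four integer neighbours around each source coordinate
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (src_x - x0)[..., None]  # fractional offsets used as blend weights
    wy = (src_y - y0)[..., None]

    # Bilinear interpolation: blend along x first, then along y
    top = features[y0, x0] * (1 - wx) + features[y0, x1] * wx
    bottom = features[y1, x0] * (1 - wx) + features[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

# Hypothetical 4 x 4 single-channel map; every source lies one pixel to the right
p4_key = np.arange(16, dtype=float).reshape(4, 4, 1)
shift_right = np.zeros((4, 4, 2))
shift_right[..., 0] = 1.0
w4 = warp_features(p4_key, shift_right)
print(w4[0, :, 0])  # [1. 2. 3. 3.]
```

With a zero flow field the function returns the keyframe features unchanged, and fractional flow values blend neighbouring features, which is the role of the bilinear interpolation.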
How do we distribute frames to key and non-key frames?
Ideally, keyframes would be frames that show a significant change compared with the previous keyframe, but YolactEdge did not go that far, as the authors explain:
‘It is not guaranteed that the randomly selected keyframes are free from motion blur. A smart way to select keyframes would be interesting future work.’
For now, YolactEdge takes a keyframe at a constant interval of k frames; e.g., if k is equal to 10, every 10th frame is taken as a keyframe and the rest as non-keyframes.
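A fixed-interval schedule of this kind needs only a modulo test. The sketch below assumes that frame 0 is the first keyframe; the function name `is_keyframe` is hypothetical.

```python
def is_keyframe(frame_index, k):
    """Fixed-interval schedule: every k-th frame, starting at frame 0, is a keyframe."""
    return frame_index % k == 0

k = 10
keyframes = [i for i in range(30) if is_keyframe(i, k)]
print(keyframes)  # [0, 10, 20]
```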
The procedure can be described with the following notation:
- Nfeat_all: computing the whole backbone and FPN, (C1-C5) and (P3-P7)
- Nfeat_partial_1: computing (C1-C3)
- Flow: calculating the optical flow between the keyframe and the current non-keyframe using backbone-extracted features
- Warp: transforming P5 and P4 from the previous keyframe to get (W5, W4)
- Nfeat_partial_2: computing (P3, P6, P7) using (W5, W4, C3)
- Ntask: the rest of the network (the prediction head and the Protonet)
- _k denotes belonging to a keyframe, _i to a non-keyframe
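Using this notation, the per-frame control flow can be sketched in Python. This is a reading of the description above rather than the authors' code: `run_video`, the `net` interface, and `StubNet` are hypothetical, the stub only records which stages ran, and passing only C3 to the flow stage is a simplification.

```python
def run_video(frames, k, net):
    """Process frames with a fixed keyframe interval k.

    `net` is a hypothetical object exposing the stages named in the text:
    feat_all, feat_partial_1, flow, warp, feat_partial_2, task.
    """
    outputs = []
    key_c3 = key_p4 = key_p5 = None
    for i, frame in enumerate(frames):
        if i % k == 0:
            # Keyframe: full backbone and FPN, then cache what later frames reuse
            c3, pyramid = net.feat_all(frame)
            key_c3, key_p4, key_p5 = c3, pyramid["P4"], pyramid["P5"]
        else:
            c3 = net.feat_partial_1(frame)            # C1-C3 only
            flow = net.flow(key_c3, c3)               # keyframe vs. current frame
            w4 = net.warp(key_p4, flow)               # W4 from the keyframe's P4
            w5 = net.warp(key_p5, flow)               # W5 from the keyframe's P5
            pyramid = net.feat_partial_2(w5, w4, c3)  # P3, P6, P7 from (W5, W4, C3)
        outputs.append(net.task(pyramid))             # prediction head and Protonet
    return outputs


class StubNet:
    """Stand-in that only records which stages ran; it is not a real network."""

    def __init__(self):
        self.calls = []

    def _log(self, name, value):
        self.calls.append(name)
        return value

    def feat_all(self, frame):
        pyramid = {"P3": "P3", "P4": "P4", "P5": "P5", "P6": "P6", "P7": "P7"}
        return self._log("feat_all", ("C3", pyramid))

    def feat_partial_1(self, frame):
        return self._log("feat_partial_1", "C3")

    def flow(self, key_c3, c3):
        return self._log("flow", "F")

    def warp(self, feature, flow):
        return self._log("warp", "W" + feature[1:])  # "P4" -> "W4"

    def feat_partial_2(self, w5, w4, c3):
        pyramid = {"P3": "P3", "P4": w4, "P5": w5, "P6": "P6", "P7": "P7"}
        return self._log("feat_partial_2", pyramid)

    def task(self, pyramid):
        return self._log("task", pyramid["P4"])


net = StubNet()
outputs = run_video(frames=range(6), k=3, net=net)
print(outputs)  # ['P4', 'W4', 'W4', 'P4', 'W4', 'W4']
```

The output shows that only frames 0 and 3 use a freshly computed P4, while the frames in between run on the warped W4.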
Comprehensive documentation and installation instructions for the YolactEdge network are provided in the GitHub repository: https://github.com/haotian-liu/yolact_edge
The input data for training Yolact++ and YolactEdge is MS COCO with JSON annotations.
The dataset used in this article was provided by the Weedbot team and contained 750 images of size 3008 x 3008. For each image, a JSON annotation label was created.
The dataset was provided in two data folders, each with its own annotation file. We first combined the images into one dataset and merged the annotation files accordingly.
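Merging two COCO-style annotation files mainly means keeping image and annotation IDs unique. The sketch below shows one way to do that; it is not the script used for this dataset. The function name `merge_coco` and the tiny example dictionaries are hypothetical, and it assumes both files share the same category list and use non-negative integer IDs.

```python
def merge_coco(first, second):
    """Merge two COCO-style dicts, shifting the second file's IDs to avoid clashes."""
    image_offset = max((img["id"] for img in first["images"]), default=-1) + 1
    ann_offset = max((ann["id"] for ann in first["annotations"]), default=-1) + 1

    merged = {
        "images": list(first["images"]),
        "annotations": list(first["annotations"]),
        "categories": list(first["categories"]),  # assumes identical categories
    }
    for img in second["images"]:
        merged["images"].append({**img, "id": img["id"] + image_offset})
    for ann in second["annotations"]:
        merged["annotations"].append({
            **ann,
            "id": ann["id"] + ann_offset,
            "image_id": ann["image_id"] + image_offset,  # follow the shifted image ID
        })
    return merged

# Hypothetical minimal annotation files
first = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [{"id": 1, "image_id": 1}],
    "categories": [{"id": 1, "name": "carrot"}],
}
second = {
    "images": [{"id": 1}],
    "annotations": [{"id": 1, "image_id": 1}],
    "categories": [{"id": 1, "name": "carrot"}],
}
merged = merge_coco(first, second)
print([img["id"] for img in merged["images"]])  # [1, 2, 4]
```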
To prepare the images, we rotated them using the built-in “rotate” function in the Windows context menu, uploaded them as new tasks, and then uploaded the COCO JSON to check whether the annotations were rotated vertically as well.
To improve the results, we also increased the number of annotation classes and re-annotated the images using the CVAT tool. This was applied to 10000 annotations across 750 images.
YolactEdge down-samples images from their higher resolution to the network resolution of 550 x 550, and this slows down training. To solve this, we brought the images and annotations to the network resolution beforehand, so training runs much faster and we have more images (10462 images) to train the model, which leads to better results. A Python script is used to crop the images to network resolution. The principle is new_coordinate = resize_ratio x old_coordinate, using the carrot annotations as reference points.
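The coordinate rule new_coordinate = resize_ratio x old_coordinate can be applied to a COCO annotation as sketched below. This is not the script used for the dataset: the function name `rescale_annotation`, the example annotation, and the ratio of 0.5 are hypothetical, and the sketch assumes polygon segmentations and a single ratio for both axes.

```python
def rescale_annotation(ann, ratio):
    """Scale a COCO annotation's bbox, polygons, and area by a resize ratio."""
    scaled = dict(ann)
    # bbox is [x, y, width, height]; every entry scales linearly
    scaled["bbox"] = [v * ratio for v in ann["bbox"]]
    # Each polygon is a flat list [x1, y1, x2, y2, ...]
    scaled["segmentation"] = [[v * ratio for v in poly] for poly in ann["segmentation"]]
    if "area" in ann:
        scaled["area"] = ann["area"] * ratio ** 2  # area scales quadratically
    return scaled

# Hypothetical annotation and ratio
ann = {"bbox": [10, 20, 30, 40], "segmentation": [[0, 0, 10, 0, 10, 10]], "area": 100.0}
scaled = rescale_annotation(ann, ratio=0.5)
print(scaled["bbox"])  # [5.0, 10.0, 15.0, 20.0]
```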
With the new annotations, we got four additional classes. To resolve the inconsistency caused by annotators' differing views and by border cases, two more classes (Class 2-3 and Class 3-4) were introduced. The distribution of the classes in the dataset is shown in Fig.7.
Results and Conclusion
We tried several real-time object recognition architectures that could potentially be used for weed-crop segmentation, including Bonnet, U-Net, and Yolact++. In general, all of these models struggle with the trade-off between speed and accuracy.
In this article, we discuss the instance segmentation frameworks on the Weedbot dataset.
In the first attempt, we trained our dataset with Yolact++. Installation, setup, and configuration are provided in the GitHub repository https://github.com/dbolya/yolact
Using YOLACT++ requires compiling the DCNv2 code. We modified the configuration for our dataset and set up training with a ResNet-50 backbone. The prediction is shown in Fig.8.
The accuracy was acceptable; however, the inference time was about 400 ms, which was far from our target (12 ms).
Therefore, we adjusted the data and trained the YolactEdge network on it. We used the model setup provided in the GitHub repository: https://github.com/haotian-liu/yolact_edge
Installation, setup, and documentation, including TensorRT installation, are provided in the GitHub repository: https://github.com/haotian-liu/yolact_edge/blob/master/INSTALL.md
NVIDIA TensorRT provides INT8 and FP16 optimization modes for inference, which significantly reduce latency with minimal to no loss in accuracy. The different precision modes of TensorRT provided a large speedup, as seen in Fig.9. We took this approach for training YolactEdge, an instance segmentation model optimized for edge devices (Fig.10).
To measure the performance of YolactEdge and compare its predictions with other studies, we used evaluation metrics that measure the accuracy of the object detector on the Weedbot dataset: F1-score, mean Intersection over Union (IoU), precision, and recall. To compute these metrics, a confusion matrix is calculated between the prediction and the ground truth, from which the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are obtained. With the standard definitions, the formulas are: Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 x Precision x Recall / (Precision + Recall), and IoU = TP / (TP + FP + FN).
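These metrics follow directly from the confusion-matrix counts, as the sketch below shows using their standard definitions. The function name `segmentation_metrics` and the example counts are hypothetical and are not results from the Weedbot dataset. Mean IoU would be obtained by averaging the per-class IoU values.

```python
def segmentation_metrics(tp, fp, fn):
    """Precision, recall, F1-score, and IoU from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}

# Hypothetical counts for a single class
m = segmentation_metrics(tp=80, fp=20, fn=10)
print({name: round(value, 3) for name, value in m.items()})
# {'precision': 0.8, 'recall': 0.889, 'f1': 0.842, 'iou': 0.727}
```

True negatives do not enter any of these four metrics, which is why the function takes only TP, FP, and FN.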
Fig.11 shows benchmark inference times on the Jetson Xavier platform for YolactEdge models with different precision modes, given an input image size of 550 x 550 px. The results are promising: we get an almost 1.5x speedup just by converting the PyTorch model (FP32) to FP16 precision, and a 2.5–3x speedup by converting it to INT8 precision. The best model achieved with this approach runs in 58 ms, roughly 7x faster than the 400 ms we obtained with Yolact++.
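The latency figures quoted in this article can be put side by side with a few lines of arithmetic. The snippet uses only the three numbers reported in the text (400 ms, 58 ms, and the 12 ms target); the variable names are illustrative.

```python
yolact_pp_ms = 400.0    # Yolact++ inference time reported in this article
yolact_edge_ms = 58.0   # best YolactEdge inference time reported in this article
target_ms = 12.0        # target inference time stated in this article

speedup = yolact_pp_ms / yolact_edge_ms   # how much faster than Yolact++
fps = 1000.0 / yolact_edge_ms             # frames per second at 58 ms per frame
gap_to_target = yolact_edge_ms / target_ms

print(f"{speedup:.1f}x faster, {fps:.1f} FPS, {gap_to_target:.1f}x above target")
# 6.9x faster, 17.2 FPS, 4.8x above target
```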
To further improve accuracy and speed, the following points can be considered:
- Experiments around temporal redundancy could show how it affects the benchmark times.
- Experiments with different training sizes and new datasets could provide further mAP improvements.
- As with Bonnet, running inference in C++ could provide further speed improvements.
-  Liu, Haotian, et al. “YolactEdge: Real-time Instance Segmentation on the Edge (Jetson AGX Xavier: 30 FPS, RTX 2080 Ti: 170 FPS).” arXiv preprint arXiv:2012.12259 (2020).
-  Bolya, Daniel, et al. “YOLACT++: Better real-time instance segmentation.” arXiv preprint arXiv:1912.06218 (2019).
-  Bolya, Daniel, et al. “Yolact: Real-time instance segmentation.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
-  NVIDIA Deep Learning TensorRT Documentation
-  Stein, Elias, Siyu Liu, and John Sun. “Real-Time Object Detection on an Edge Device.”
-  Soviany, Petru, and Radu Tudor Ionescu. “Optimizing the trade-off between single-stage and two-stage deep object detectors using image difficulty prediction.” 2018 20th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC). IEEE, 2018.
-  Dosovitskiy, Alexey, et al. “FlowNet: Learning optical flow with convolutional networks.” Proceedings of the IEEE International Conference on Computer Vision. 2015.