Detecting Weeds Using YOLACTEdge Instance Segmentation for Smart Farming

Explore how YOLACTEdge delivers fast, accurate weed detection for robotic and automated agriculture systems.

Omdena

December 16, 2025

14 minutes read

This case study shows how YOLACTEdge enables real-time, field-ready weed detection by delivering instance segmentation at edge-device speeds. Through hardware-aware optimisation and temporal feature reuse, the approach achieves fast, reliable inference suitable for robotic and automated farming systems. The result is more precise weed control, reduced chemical usage and a practical foundation for scalable, climate-resilient smart agriculture.

Introduction

Smart farming depends on accurately distinguishing crops from weeds to enable automated weeding and precision interventions. This challenge highlights the growing importance of computer vision in agriculture for enabling automated field analysis and precision crop management. While modern instance segmentation techniques provide pixel-level accuracy, real-world agricultural systems also demand high-speed inference. Models deployed on robotic platforms must operate in real time to support continuous decision-making in dynamic field conditions.

This article examines how the YOLACT family of instance segmentation models, particularly YOLACTEdge, balances accuracy and speed for weed detection in smart farming. Through architectural efficiency, hardware-aware optimisation and the reuse of temporal information in video streams, YOLACTEdge delivers real-time performance on edge devices. These capabilities not only improve automated weed control but also support climate-resilient farming by enabling rapid, field-scale data collection for AI-driven agricultural intelligence.

Similar edge-AI approaches are increasingly being adopted across the sustainable agriculture ecosystem, as seen in how leading organizations are applying AI to improve efficiency, resilience, and environmental outcomes across farming systems.

Real‑time Instance Segmentation In Agriculture

Early real‑time object detection relied on one‑stage detectors like Single Shot MultiBox Detector (SSD) and You Only Look Once (YOLO). These algorithms treat detection as a simple regression problem that predicts class probabilities and bounding‑box coordinates in a single forward pass through a convolutional network. Because they avoid region proposal and per‑region processing, one‑stage methods are faster than two‑stage detectors, though they often trade off some accuracy.

Two‑stage detectors, on the other hand, employ a Region Proposal Network to generate candidate regions before classifying them and refining their bounding boxes. This separation improves accuracy but incurs additional computation, making two‑stage pipelines too slow for real‑time agricultural robots. In practice, real-time instance segmentation complements drone-based crop monitoring pipelines, where weed detection using computer vision helps distinguish crops and weeds from aerial imagery for targeted field interventions.

Fig.1. Architecture of a convolutional neural network with a single-stage detector

The architecture of a single‑stage detector is illustrated in Fig. 1. The model takes an input image, processes it through convolutional layers, and outputs class predictions and bounding boxes directly without a separate proposal stage. This streamlined approach offers the speed needed for field robotics.

The Yolact Family Of Models

To bridge the gap between speed and accuracy, the Yolact framework introduced a real‑time instance segmentation model that generates a dictionary of prototype masks over the entire image and then predicts a set of linear combination coefficients for each detected instance. The predicted masks are assembled by linearly combining the prototypes and cropping them with the corresponding bounding boxes. This design yields competitive accuracy while maintaining fast inference times.
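The mask assembly step can be illustrated with a minimal pure-Python sketch. The function name `assemble_mask`, the toy prototypes and the 0.5 threshold are illustrative assumptions, not code from the Yolact repository; the sketch only shows the idea of combining prototypes linearly, applying a sigmoid and cropping to the box.

```python
import math

def assemble_mask(prototypes, coeffs, box, threshold=0.5):
    """Combine prototype masks into one binary instance mask.

    prototypes: list of k masks, each an H x W list of floats
    coeffs:     list of k floats predicted for one instance
    box:        (x1, y1, x2, y2) pixel bounds, x2/y2 exclusive
    """
    height, width = len(prototypes[0]), len(prototypes[0][0])
    x1, y1, x2, y2 = box
    mask = [[0] * width for _ in range(height)]
    # Only pixels inside the box are evaluated; this gives the same
    # result as combining everywhere and cropping afterwards.
    for y in range(max(y1, 0), min(y2, height)):
        for x in range(max(x1, 0), min(x2, width)):
            # Linear combination of the prototypes at this pixel
            logit = sum(c * p[y][x] for c, p in zip(coeffs, prototypes))
            prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
            mask[y][x] = 1 if prob > threshold else 0
    return mask

# Two toy 3x3 prototypes: one fires on the left column, one on the right
protos = [
    [[4.0, 0.0, 0.0]] * 3,
    [[0.0, 0.0, 4.0]] * 3,
]
# Positive weight on the first prototype, negative on the second
left_instance = assemble_mask(protos, coeffs=[1.0, -1.0], box=(0, 0, 3, 3))
```

Because every instance shares the same prototypes, the per-instance work is just a handful of coefficients, which is a large part of why this design stays fast.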

The architecture of Yolact comprises four stages: feature backbone, feature pyramid, prediction head and Protonet, as shown in Fig. 2. The backbone (often a ResNet) extracts hierarchical features; the feature pyramid network (FPN) upsamples and downsamples these features to create a multi‑scale representation; the prediction head outputs class scores, bounding boxes and mask coefficients; and the Protonet generates prototypes used to assemble final masks.

Fig.2. Yolact architecture consists of four stages: feature backbone, feature pyramid, prediction head, and Protonet [1].

Yolact++ builds on this foundation by incorporating deformable convolutions into the backbone, improving the model’s ability to handle objects of varying scales, aspect ratios and rotations. It also introduces an efficient mask rescoring network that re‑ranks mask predictions based on quality. These enhancements deliver higher accuracy while maintaining real‑time speeds.

Despite these advances, further optimization was needed for deployment on resource‑constrained edge devices. YolactEdge adapts Yolact for video inputs and introduces systematic and algorithmic optimizations that increase speed fivefold while retaining comparable accuracy. The next sections explore how these improvements are achieved.

These models build on broader image segmentation techniques that allow machines to separate objects and regions within images, a core capability for many computer vision applications.

YolactEdge Model Architecture

YolactEdge is designed for video segmentation, operating on successive frames rather than individual images. It shares Yolact’s overall layout (backbone, FPN, prediction head and Protonet) but introduces optimizations at two levels:

Systematic optimization using the TensorRT inference engine: TensorRT is NVIDIA’s inference optimizer and runtime. Among other things, it can run a network trained with floating‑point weights (FP32) in reduced‑precision formats such as FP16 and INT8. This quantization substantially accelerates inference with little loss of accuracy.
Algorithmic optimization by exploiting temporal redundancy in video: successive frames often share high‑level features, especially at coarser scales. Instead of recomputing the full feature hierarchy for every frame, YolactEdge computes expensive features (C4 and C5) only on keyframes and reuses them for the following non‑keyframes. We detail this process below.
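To make the idea of reduced precision concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization. It is a textbook illustration under stated assumptions, not TensorRT's actual calibration procedure, and the helper names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (int codes, scale)."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude onto 127; guard against an all-zero tensor
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]

weights = [0.50, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# The rounding error per value is bounded by half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Small weights such as 0.003 collapse to zero here, which is why a real deployment checks accuracy after quantization rather than assuming it is preserved.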

This type of hardware-aware optimisation is a core principle of modern edge pipelines, where models are designed end-to-end for low latency, limited compute, and reliable real-time inference in production environments.

Feature Extraction

The feature extraction component consists of a ResNet backbone and an FPN. ResNet progressively downsamples the input frame, producing feature maps C1–C5; the FPN then upsamples and downsamples these maps to generate multi‑scale feature maps P3–P7 that feed into the prediction head. ResNet layers C4 and C5 are the most computationally intensive, consuming over 40 % of the network’s resources.
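As a rough illustration of how resolution shrinks through the hierarchy, the sketch below computes the side length of each feature map for a 550 × 550 input. It assumes the usual stride-2 downsampling with padding at every level, so each level is half the previous one, rounded up; the helper `pyramid_sizes` is hypothetical and not part of the YolactEdge code.

```python
import math

def pyramid_sizes(input_size=550):
    """Side length of each feature map, halving (rounded up) per level."""
    sizes = {}
    size = input_size
    for level in range(1, 8):  # levels 1..7 correspond to strides 2..128
        size = math.ceil(size / 2)
        # Levels 1-5 are backbone outputs; 6-7 exist only in the pyramid
        name = f"C{level}" if level <= 5 else f"P{level}"
        sizes[name] = size
    return sizes

sizes = pyramid_sizes(550)
# P3, P4 and P5 have the same spatial size as C3, C4 and C5
```

Under these assumptions C3 is 69 × 69 while C5 is only 18 × 18, so the deep layers are expensive because of their channel width and depth rather than their spatial size.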

Fig.3. Feature extraction part of Yolact network architecture.

Figure 3 illustrates the ResNet + FPN feature extractor: low‑level features are propagated through lateral connections to build the pyramid. The FPN feature maps serve as inputs for mask and bounding‑box prediction.

Temporal Redundancy And Keyframes

Because adjacent video frames are highly similar, especially at higher feature levels, recomputing C4 and C5 for every frame is wasteful. YolactEdge designates certain frames as keyframes, on which it computes the full feature extraction. Between keyframes, it processes non‑keyframes by computing only the lighter C1–C3 layers and warping the high‑level feature maps P4 and P5 from the previous keyframe. This warping exploits the spatial transformation of objects across frames, enabling the reuse of expensive features. Keyframes are selected at fixed intervals (every k frames) rather than based on motion analysis, an area noted for future research.

Feature Warping

To transform features from a keyframe to a subsequent non‑keyframe, YolactEdge estimates the motion of objects using optical flow. A neural network inspired by FlowNet computes a 2‑D flow field between two frames. The flow field indicates how each pixel moves; using it, P4 and P5 from the keyframe are warped and interpolated to align with the non‑keyframe.
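A minimal sketch of flow-based warping on a single-channel feature map is shown below. It assumes a backward-warping convention, where each output pixel samples the keyframe feature at its own position plus the flow vector, and it uses a hand-written flow field instead of a learned one. The function names are illustrative.

```python
import math

def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map at a fractional (x, y), clamping at edges."""
    height, width = len(feature), len(feature[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bottom * wy

def warp(feature, flow):
    """Backward warp: output[y][x] = feature sampled at (x + dx, y + dy)."""
    return [
        [bilinear_sample(feature, x + dx, y + dy)
         for x, (dx, dy) in enumerate(row)]
        for y, row in enumerate(flow)
    ]

keyframe_feature = [
    [0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0],
]
# Content moved one pixel right, so each output pixel looks one pixel left
flow = [[(-1.0, 0.0)] * 3 for _ in range(2)]
warped = warp(keyframe_feature, flow)
```

In this toy case the bright column moves from the middle to the right edge, which is the behaviour the warped P4 and P5 maps need in order to line up with the current frame.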

Fig.4. YolactEdge network architecture.

Figure 4 depicts the overall YolactEdge network. On the left, the full feature extraction (blue) is computed for the previous keyframe. On the right, only low‑level features (light blue) are computed for the current non‑keyframe. The high‑level features (gray) are warped (yellow) using optical flow before they are combined with the current low‑level features and passed to the prediction head and Protonet for mask assembly.

An example of a flow field is shown in Fig. 5; the third image visualizes the magnitude and direction of movement between two input frames.

Fig.5. An example of a flow field

To reduce the overhead of computing optical flow, YolactEdge reuses the C1–C3 features produced by ResNet and feeds them through a small convolutional network to predict the flow field. The designers refer to this lightweight network as FeatFlowNet. Figure 6 contrasts the original FlowNetS architecture with FeatFlowNet: in FlowNetS the entire network processes the two images through a stack of convolutions, whereas FeatFlowNet leverages precomputed backbone features.

Fig.6. FlowNet structure, consisting of two parts: a) FlowNetS, b) FeatFlowNet

By warping high‑level features and recomputing only the lightweight layers between keyframes, YolactEdge achieves real‑time performance on edge devices. On a Jetson AGX Xavier, it exceeds 30 frames per second, and on an RTX 2080 Ti it reaches 170 FPS.

Pseudocode Overview

Although the original article references pseudocode, it does not include a code listing. At a high level, the algorithm can be summarized as follows:

For each input frame i, determine whether it is a keyframe.
If it is a keyframe, compute all backbone (C1–C5) and FPN features (P3–P7).
If it is a non‑keyframe, compute only the partial features (C1–C3), use optical flow to warp P4 and P5 from the previous keyframe, and then compute the remaining FPN layers using the warped features and the current C3.
Pass the resulting feature maps to the prediction head and Protonet to generate mask coefficients and prototypes.
Assemble the final instance masks by linearly combining prototypes with the mask coefficients and cropping them using the predicted bounding boxes.
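The keyframe logic in the steps above can be sketched as a small scheduling function. This is only a bookkeeping illustration with a hypothetical `plan_frames` helper and a fixed interval; it does not run a network, it just records which backbone levels would be computed and which keyframe the warped features would come from.

```python
def plan_frames(num_frames, keyframe_interval):
    """Return, per frame, which feature levels are computed vs. warped."""
    plan = []
    last_keyframe = None
    for i in range(num_frames):
        if i % keyframe_interval == 0:
            # Keyframe: full backbone, nothing is warped
            last_keyframe = i
            plan.append({"frame": i, "keyframe": True,
                         "computed": ["C1", "C2", "C3", "C4", "C5"],
                         "warped_from": None})
        else:
            # Non-keyframe: light layers only, P4/P5 warped from the keyframe
            plan.append({"frame": i, "keyframe": False,
                         "computed": ["C1", "C2", "C3"],
                         "warped_from": last_keyframe})
    return plan

plan = plan_frames(num_frames=6, keyframe_interval=3)
heavy_passes = sum(1 for step in plan if step["keyframe"])
```

With an interval of 3, only two of the six frames pay for C4 and C5, which is where the algorithmic speedup comes from.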

Dataset Preparation

The authors trained YolactEdge on MS COCO and on a custom Weedbot dataset. The Weedbot dataset contained 750 images with a resolution of 3008 × 3008 pixels and corresponding JSON annotations. Because the data were collected in two batches with separate annotation files, the images and annotations were merged into a single dataset.

To prepare the images, the team rotated them using a simple image editor to ensure consistent orientation, then uploaded the adjusted images and corresponding COCO JSON annotations to verify that bounding boxes and masks were correctly rotated. They also increased the number of annotation classes using the CVAT tool, adding fine‑grained categories to better capture variation among plants. In total, 10 000 annotations were created within the 750 images.

For efficient training, the high‑resolution images were downsampled to the network input size of 550 × 550 pixels. A Python script resized and cropped the images, using carrot annotations as reference points; the resized coordinates were computed as new_coordinate = resize_ratio × old_coordinate. This preprocessing increased the number of training samples to 10 462 images and dramatically accelerated training.
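The coordinate rule above can be sketched for COCO-style annotations as follows. The helper names are hypothetical and the sketch covers only the resize step, not the cropping around carrot annotations; it assumes boxes are stored as [x, y, width, height] and polygons as flat [x1, y1, x2, y2, ...] lists.

```python
def resize_bbox(bbox, old_size, new_size):
    """Scale a COCO-style [x, y, width, height] box to a resized image."""
    ratio_x = new_size[0] / old_size[0]
    ratio_y = new_size[1] / old_size[1]
    x, y, w, h = bbox
    return [x * ratio_x, y * ratio_y, w * ratio_x, h * ratio_y]

def resize_polygon(points, old_size, new_size):
    """Scale a flat COCO polygon [x1, y1, x2, y2, ...]."""
    ratio_x = new_size[0] / old_size[0]
    ratio_y = new_size[1] / old_size[1]
    # Even indices are x coordinates, odd indices are y coordinates
    return [v * (ratio_x if i % 2 == 0 else ratio_y)
            for i, v in enumerate(points)]

box = resize_bbox([1504, 752, 3008, 1504],
                  old_size=(3008, 3008), new_size=(550, 550))
polygon = resize_polygon([0, 0, 3008, 0, 3008, 3008],
                         old_size=(3008, 3008), new_size=(550, 550))
```

Keeping separate x and y ratios means the same helpers still work if an image is resized to a non-square target.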

Fig 7. Distribution of the different plant classes

Due to differences in annotation viewpoints, two additional classes (labelled Class 2–3 and Class 3–4) were introduced to handle border cases. Figure 7 shows the distribution of the five final classes, demonstrating that classes 3 and 4 are the most prevalent.

Results and Evaluation

Training Yolact++ And Baseline Results

Fig.8. Predicted results of Yolact++ on Weedbot data.

The Weedbot dataset was first used to train Yolact++. Installation, setup and configuration details are provided in the associated GitHub repository (https://github.com/dbolya/yolact). The model was trained with a ResNet50 backbone. The predicted instance segmentation masks on Weedbot data are illustrated in Fig. 8. Although Yolact++ achieved acceptable accuracy, its inference time was around 400 ms per image, far slower than the target of 12 ms.

Transition to YolactEdge

To meet stringent speed requirements, the training was repeated using YolactEdge. The implementation, installation and documentation are available at https://github.com/haotian-liu/yolact_edge. YolactEdge exploits TensorRT for quantization into FP16 and INT8 precision modes and leverages the temporal redundancy optimizations described earlier.

Figure 9 compares the inference times of different precision modes. Converting the baseline PyTorch model (FP32) to FP16 yields a 1.5× speedup, while converting to INT8 achieves a 2.5–3× speedup, reducing latency from ~138 ms to about 59 ms. These improvements bring inference time closer to the 12 ms target and illustrate the benefits of hardware‑aware optimization.

Fig.9. Comparing different precision modes of TensorRT

Figure 10 presents an instance of multi‑class YolactEdge training. Here the original 3008 × 3008 images were cropped to 1920 × 1200 and then resized to 550 × 550, enabling the network to process more examples per second. The figure shows dense weed and crop instances labelled with class‑specific colours and bounding boxes, demonstrating that YolactEdge maintains good segmentation quality even after aggressive downsampling.

Fig.10. YolactEdge multi-class training on 3008 × 3008 source images

Evaluation Metrics

To quantify performance, the authors used common metrics for object detection and segmentation: F1‑score, mean Intersection over Union (mIoU), precision, recall and True Detection Rate (TDR). These metrics are computed from the confusion matrix elements, namely true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), calculated between predictions and ground truth. The formulas are summarized below.

The benchmark in Fig. 9 was run on the Jetson Xavier platform with an image input size of 550 × 550 px. Converting the PyTorch model (FP32) to FP16 precision gives almost a 1.5× speedup, and converting to INT8 precision gives a 2.5–3× speedup. The best model obtained with this approach runs at 58 ms per image, roughly 7× faster than the 400 ms measured for Yolact++.

Equations for IoU, Recall, TDR and Precision

The F1‑score combines precision and recall into a single measure:

Equation for F1 Score
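For reference, the standard definitions of precision, recall, F1‑score and IoU in terms of TP, FP and FN can be written as a short Python sketch. The counts below are made-up illustrative values, not results from the Weedbot experiments, and TDR is left out because its exact definition is given only in the original figure.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of ground-truth positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def iou(tp, fp, fn):
    """Intersection over union: overlap divided by combined area."""
    union = tp + fp + fn
    return tp / union if union else 0.0

# Hypothetical pixel counts for one class
tp, fp, fn = 80, 20, 20
scores = {
    "precision": precision(tp, fp),
    "recall": recall(tp, fn),
    "f1": f1_score(tp, fp, fn),
    "iou": iou(tp, fp, fn),
}
```

Mean IoU is then the average of the per-class IoU values. Note that IoU is always at most the F1‑score for the same counts, so the two should not be compared directly.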

Experimental Results

Evaluation on the Weedbot dataset showed that YolactEdge outperformed the baseline Yolact++ both in speed and (after quantization) in accuracy. Table 1 shows detailed evaluation results. Models executed on different inference engines (PyTorch, FP16, INT8) are compared in terms of mean IoU, F1‑score, precision and recall. Lower‑precision modes achieve similar or better accuracy while significantly reducing inference time.

Fig.11. Results of YolactEdge training on Weedbot data, run on a server with 32 GB of RAM and 24 GB per GPU (48 GB total).

Validation results of YolactEdge with different precision modes and execution configurations

Another result table (Table 2) from the authors’ logs lists box and mask Average Precision (AP) at various IoU thresholds. The highest AP values occur around 0.55–0.65 IoU, illustrating the sensitivity of performance to the evaluation threshold.

Box and mask average precision across IoU thresholds

Conclusion

This case study demonstrates how YolactEdge brings real‑time instance segmentation to smart farming by exploiting hardware‑aware quantization and temporal redundancy. In experiments on the Weedbot dataset, YolactEdge achieved a 5× speedup over the baseline Yolact architecture while maintaining competitive accuracy. Quantization to FP16 or INT8 precision using TensorRT further reduced inference time, reaching 58 ms per image. Such performance is crucial not only for automated weeding but also for time‑critical tasks like weather forecasting for agriculture and AI weather prediction, where rapid processing of field imagery enables farmers to adapt to changing conditions and improve climate‑resilient farming strategies.

Several avenues can enhance these results:

Temporal redundancy experiments: exploring different keyframe intervals or adaptive keyframe selection based on motion blur could yield further speedups.
Training with varied data: expanding the dataset and experimenting with different image resolutions may improve mAP scores.
Low‑level implementation: rewriting inference in C++ instead of Python could deliver additional latency reductions, similar to improvements seen in other frameworks.

Ultimately, integrating fast, accurate weed detection into a broader smart‑agriculture pipeline, including precision spraying, crop health monitoring and weather‑aware decision support, will accelerate the adoption of climate‑resilient farming practices. By combining instance segmentation with AI weather prediction and weather forecasting for agriculture, farmers can make timely interventions that reduce pesticide use, improve yields and adapt to changing climate conditions. When combined with edge-optimised inference pipelines and field-ready deployment, instance segmentation models like YOLACTEdge become practical building blocks for scalable, real-world smart farming systems.

FAQs

What is YOLACTEdge in smart farming?

YOLACTEdge is a real-time instance segmentation model optimized for edge devices, enabling fast and accurate weed detection in agricultural fields.

How does YOLACTEdge detect weeds in real time?

It combines lightweight neural architecture, TensorRT optimization, and temporal feature reuse across video frames to achieve high-speed inference.

Why is instance segmentation important for weed detection?

Instance segmentation identifies individual weeds at pixel level, allowing precise removal instead of blanket herbicide application.

How is YOLACTEdge different from YOLO or SSD?

Unlike bounding-box detectors, YOLACTEdge provides pixel-accurate masks while maintaining real-time speed on edge hardware.

Can YOLACTEdge run on agricultural robots?

Yes, it is designed for edge deployment and achieves real-time performance on devices like NVIDIA Jetson, making it suitable for field robots.

How does temporal feature reuse improve performance?

By reusing high-level features from keyframes, the model avoids redundant computation and significantly speeds up inference on video streams.

Does YOLACTEdge reduce chemical usage in farming?

Yes, precise weed segmentation enables targeted spraying or mechanical removal, reducing herbicide use and environmental impact.

Is YOLACTEdge suitable for climate-resilient agriculture?

Yes, its speed and accuracy support scalable automation, helping farmers adapt to labor shortages and climate variability.