Conveyor belts are an essential material handling tool for various industrial processes, and one of the most effective ways to quickly and continuously transport large amounts of materials or products. However, high throughput rates make it difficult for operators to detect defective products and remove them from the production line, especially in cases when several products are aligned on vertical sections of the belt. These challenges necessitate automatic anomaly detection tools that can track products on conveyors without interrupting processes.
Amazon Web Services (AWS) provides a Lookout for Vision service for detecting anomalies in images. This is a convenient AutoML service, but it often requires the development of custom video stream preprocessors to integrate the service for use in real-world scenarios, as well as the preparation of labeled datasets to train the anomaly detection model. In this blog post, we demonstrate how to build an advanced video preprocessor and integrate it with AWS Lookout for Vision using a food processing use case as an example. A few videos of food production lines are presented below to illustrate the use case.
[Videos: Confection | Tomatoes | Eggs | Bakery]
We also address the problem of data collection and labeling, which can be challenging in the development of a computer vision solution, and we demonstrate how the useful data can be collected and tracked across the video stream. This is helpful in a variety of scenarios and enables visual quality control without disrupting the production line.
To develop the full-fledged visual quality control solution, we start by taking a simple video stream of a lot of items moving on a conveyor belt, and build a simple non-ML-based bounding box detection around these objects of interest, thereby localizing the objects. We then use the detected bounding boxes and extract the cropped objects in order to send them to the annotation service and/or the model prediction part of Lookout for Vision.
Next, to be able to track the exact number of objects and to identify the defective objects in the video, we use a non-ML-based tracking algorithm to track and number the products, both defective and non-defective. This enables us to trace each cropped image back to its source object in the video.
This approach minimizes the friction that company data scientists face when labeling objects in a complex, disorderly scenario, and reduces the amount of time spent annotating simple labels (anomaly or no anomaly) for the dataset. In addition, this methodology provides a continuous estimate of localization of the defect/defective product in the continuous stream of video.
Once the images are extracted and annotated, we can further augment the data using a variety of augmentation techniques to simulate scenarios common to manufacturing, and improve the already collected and cropped dataset derived from the video.
This entire process allows even a small amount of video data to be sufficient to train a robust deep learning model using the services of AWS Lookout for Vision, and makes human intervention in the training process painless and quick.
This entire process is summarized in the following workflow diagram:
The minimal AWS environment that implements this workflow can be as follows:
We implement this minimal environment in the next sections to create a solution prototype. A fully-fledged production solution is usually much more complex and can involve more components related to monitoring, administration, and process automation. An example architecture of a more realistic solution is shown below for illustration purposes (adapted from [1]).
To implement the prototype, we created an example video with chocolate cookies moving on a conveyor belt:
We have two types of anomalies in this video:
Note that some anomalies intersect.
We created 66 seconds of this example video, with ~200 objects in total. We split the video into two parts of 36 and 30 seconds. The first part was used for training and hyperparameter tuning, and the second part was held out for testing, to evaluate the final quality of the solution.
For our task, and many similar tasks that involve objects on a conveyor belt, even simple object detection methods can produce very accurate results. Usually, all objects are very different from the "belt" (background) itself in terms of color and brightness. This means that simple color thresholding (which is similar to the chroma key technique) can quite accurately find all object masks. In this case, using classical computer vision algorithms, it’s easy to extract the contours of individual objects or their bounding boxes from the obtained masks. We used built-in functions from the OpenCV library for this purpose, but there are many other alternatives.
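The thresholding idea can be sketched without any dependencies, as below. This is an illustration on a toy grayscale grid, not the code used for the post: it assumes the objects are brighter than the belt, and the names `threshold_mask`, `bounding_boxes`, and `belt_max` are hypothetical. On real frames, OpenCV functions such as `cv2.inRange`, `cv2.findContours`, and `cv2.boundingRect` cover the same steps.

```python
from collections import deque


def threshold_mask(gray, belt_max):
    """Mark pixels brighter than the belt as object pixels (True)."""
    return [[pixel > belt_max for pixel in row] for row in gray]


def bounding_boxes(mask):
    """Return (x, y, w, h) for each 4-connected blob of True pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Flood-fill one blob while tracking its extent.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            min_x = max_x = x0
            min_y = max_y = y0
            while queue:
                x, y = queue.popleft()
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return boxes


# Toy frame: dark belt (value 20) with two brighter objects.
frame = [
    [20, 20, 20, 20, 20, 20, 20, 20],
    [20, 200, 200, 20, 20, 20, 20, 20],
    [20, 200, 200, 20, 20, 210, 210, 20],
    [20, 20, 20, 20, 20, 210, 210, 20],
    [20, 20, 20, 20, 20, 210, 210, 20],
    [20, 20, 20, 20, 20, 20, 20, 20],
]
boxes = bounding_boxes(threshold_mask(frame, belt_max=100))
print(boxes)  # [(1, 1, 2, 2), (5, 2, 2, 3)]
```

The threshold value is the only tunable parameter here; a color frame would typically need one range per channel.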
Unfortunately, it doesn’t work perfectly in our scenario for 2 reasons:
Both these issues are solved by adding a few more processing steps to the algorithm. There are many simple non-ML-based approaches that can be used to solve these issues. For our case, we added shadow and edge mask detection and subtracted these masks from the object’s mask.
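The mask subtraction step can be sketched as follows. This is a minimal illustration on hand-written binary masks; the function name `subtract_masks` and the toy data are assumptions made for the example, not part of the original pipeline.

```python
def subtract_masks(object_mask, *masks_to_remove):
    """Keep object pixels that are not set in any of the masks to remove."""
    return [
        [
            bool(pixel) and not any(mask[y][x] for mask in masks_to_remove)
            for x, pixel in enumerate(row)
        ]
        for y, row in enumerate(object_mask)
    ]


# 1 = pixel set, 0 = pixel clear (toy 2x6 masks).
object_mask = [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]
shadow_mask = [[0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 1]]
edge_mask = [[0, 0, 1, 0, 0, 0], [0, 0, 1, 0, 0, 0]]

clean_mask = subtract_masks(object_mask, shadow_mask, edge_mask)
for row in clean_mask:
    print("".join("#" if pixel else "." for pixel in row))
```

In this toy case the edge column splits the single blob into two, and the shadow column is dropped from the right-hand side.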
For edge detection, a classical computer vision algorithm such as Canny edge detection [2] can be used. As illustrated below, shadows can be extracted using simple color thresholding:
All these steps require some hyperparameter tuning; for example, selecting proper threshold constants. This tuning should be based on the dataset images, and can be done manually or using hyperparameter tuning techniques such as grid search, as shown below:
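A grid search over threshold constants might look like the sketch below. It assumes a hand-labeled reference mask is available and scores each candidate by intersection over union (IoU); the scoring choice, the function names, and the toy values are all assumptions for illustration rather than the procedure used in the post.

```python
from itertools import product


def mask_between(gray, low, high):
    """Pixels whose value lies in [low, high] are marked True."""
    return [[low <= pixel <= high for pixel in row] for row in gray]


def iou(mask_a, mask_b):
    """Intersection over union of two boolean masks of equal shape."""
    intersection = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += a and b
            union += a or b
    return intersection / union if union else 1.0


def grid_search(gray, reference, lows, highs):
    """Try every (low, high) pair and return (best_score, low, high)."""
    best = None
    for low, high in product(lows, highs):
        if low >= high:
            continue
        score = iou(mask_between(gray, low, high), reference)
        if best is None or score > best[0]:
            best = (score, low, high)
    return best


# Toy frame: belt = 20, shadow = 60, object = 150, glare = 240.
gray = [[20, 60, 150, 150, 240], [20, 60, 150, 150, 20]]
reference = [[False, False, True, True, False]] * 2

best = grid_search(gray, reference, lows=[40, 100], highs=[200, 255])
print(best)  # (1.0, 100, 200)
```

In practice the score would be averaged over several labeled frames rather than computed on a single one.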
We also used morphological transformations, such as erosion and dilation [3], to enlarge or smooth masks, and to reduce the effect of noise and algorithm inaccuracies on the result.
As a next step, we use an object tracking algorithm to connect the same objects between different frames. There are many different classical object tracking algorithms that differ according to tracking accuracy and FPS throughput, and most of the basic ones have their own limitations. The main limitation is the fact that the simplest algorithms can track only a single object, and we need to track dozens of them. In order to overcome this limitation, we implemented a higher-level tracking algorithm, which utilizes the capabilities of simple classical trackers, and is able to work with multiple objects.
The algorithm addresses the following features of the video for our use-case scenario:
The idea is to use the object detection algorithm to initialize bounding boxes for the new objects, and then track them using a new instance of a basic tracker. The fact that an object cannot jump arbitrarily between frames is used to distinguish if the detected object is a new instance or an old one, and for that we implemented a distance threshold between object centroids. In the case of partially visible objects entering the frame, we enable the replacement of a bounding box with a larger box as the object moves along the conveyor belt in subsequent frames.
This approach is illustrated in the example below. Note that the blue objects have entered the frame recently, so the boxes could increase. For example, the box for object 4 was increased between frames.
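The new-versus-existing decision based on centroid distance can be sketched as below. This is a simplified stand-in, not the implementation from the post: instead of running a basic tracker per object, it simply adopts the matched detection as the object's new box (which also covers the case of a box growing as an object enters the frame), and it matches greedily in detection order. The class name `CentroidTracker` and the parameter `max_distance` are hypothetical.

```python
from math import hypot


def centroid(box):
    """Center point of an (x, y, w, h) box."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


class CentroidTracker:
    def __init__(self, max_distance):
        self.max_distance = max_distance  # largest plausible move per frame
        self.tracks = {}                  # track_id -> latest box
        self._next_id = 1

    def update(self, detections):
        """Match detections to known objects; return {track_id: box}."""
        unmatched = dict(self.tracks)
        current = {}
        for box in detections:
            cx, cy = centroid(box)
            best_id, best_dist = None, self.max_distance
            for track_id, old_box in unmatched.items():
                ox, oy = centroid(old_box)
                dist = hypot(cx - ox, cy - oy)
                if dist <= best_dist:
                    best_id, best_dist = track_id, dist
            if best_id is None:
                # Too far from every known object: treat as a new one.
                best_id = self._next_id
                self._next_id += 1
            else:
                del unmatched[best_id]
            current[best_id] = box
        # Objects without a matching detection are assumed to have left.
        self.tracks = current
        return current


tracker = CentroidTracker(max_distance=30)
frame1 = tracker.update([(0, 10, 20, 40), (100, 10, 40, 40)])
# Object 1 is now fully visible (wider box); a third object appears.
frame2 = tracker.update([(10, 10, 40, 40), (115, 10, 40, 40), (0, 60, 15, 40)])
print(frame2)
```

A production version would keep a track alive for a few frames after a missed detection instead of dropping it immediately.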
Although both detection and tracking algorithms are quite simple, and therefore relatively fast, naive application of them to each frame is not suitable for real-time processing. Therefore, we used a few optimization techniques, including:
Detection algorithm:
Tracking algorithm:
In order to use AWS Lookout for Vision, we first needed to train the models for our particular use case. This step is actually very simple and doesn't require much machine learning knowledge. The only thing you need is to create a good dataset. Once you prepare a dataset, the rest is done by pressing a single button.
The dataset should contain examples of normal and anomalous images. According to the documentation, you need at least 20 examples of normal objects and at least 10 examples of anomalous ones. The number of example images directly affects the model's quality (within reasonable limits), so the more samples you have, the more accurate the models will be.
In our scenario, we used images of individual objects for anomaly classification. To get the images of individual objects we used our object detection pipeline. As a result, we had a set of numbered objects, each of them with bounding boxes from different frames. We then manually created "normal" or "anomalous" labels for every object.
AWS Lookout for Vision models work only with fixed-size images, so after cropping bounding boxes, we rescaled all images to the same square size (384x384) and padded them where required.
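The resize-and-pad geometry can be worked out as in the sketch below. It assumes the aspect ratio is preserved and the padding is split evenly on both sides; the post does not state either detail, and the function name `letterbox_geometry` is hypothetical.

```python
def letterbox_geometry(width, height, size=384):
    """Return (new_w, new_h, pad_left, pad_top, pad_right, pad_bottom)
    for fitting a width x height crop into a size x size square."""
    scale = size / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_x, pad_y = size - new_w, size - new_h
    pad_left, pad_top = pad_x // 2, pad_y // 2
    return new_w, new_h, pad_left, pad_top, pad_x - pad_left, pad_y - pad_top


wide = letterbox_geometry(200, 100)
tall = letterbox_geometry(100, 300)
print(wide)  # (384, 192, 0, 96, 0, 96)
print(tall)  # (128, 384, 128, 0, 128, 0)
```

The returned values can be passed to any image library's resize and border-padding functions.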
Additionally, it is important not to use the same object for both training and testing the model. Otherwise we encounter a “data leak”, where the model can simply memorize the anomalous objects in the training dataset instead of learning general patterns. This results in a misleadingly high score on the test set, which hides the fact that the model has overfitted to the training samples and will not work properly on unseen examples in later use.
Further, Lookout for Vision has 2 options for train/test split:
In the second case, it is not recommended to have 2 different images of the same object, because they could fall into different splits and cause the aforementioned issue. In the first case, it’s possible to use multiple images of the same object, but you need to make sure all of them fall into the same split (either train or test).
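One way to keep all images of an object in the same split is to split by object ID rather than by image, as in the sketch below. The tracker-assigned object IDs serve as the grouping key; the function name `split_by_object`, the 30% test fraction, and the file names are assumptions for illustration.

```python
import random
from collections import defaultdict


def split_by_object(crops, test_fraction=0.3, seed=0):
    """Split (object_id, image_name) pairs so that every image of a
    given object lands in exactly one of the two splits."""
    by_object = defaultdict(list)
    for object_id, image_name in crops:
        by_object[object_id].append(image_name)

    object_ids = sorted(by_object)
    random.Random(seed).shuffle(object_ids)
    n_test = max(1, round(len(object_ids) * test_fraction))
    test_ids = set(object_ids[:n_test])

    train = [(oid, name) for oid, name in crops if oid not in test_ids]
    test = [(oid, name) for oid, name in crops if oid in test_ids]
    return train, test


# Ten tracked objects with three cropped frames each (hypothetical names).
crops = [(oid, f"object{oid}_frame{k}.png") for oid in range(10) for k in range(3)]
train, test = split_by_object(crops)

train_ids = {oid for oid, _ in train}
test_ids = {oid for oid, _ in test}
print(len(train), len(test), train_ids & test_ids)
```

With a small number of anomalous objects, it may also be worth splitting normal and anomalous objects separately so that both classes appear in each split.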
The pipeline described in the previous sections enables the continuous detection and count of anomalies in the video stream:
We measured the quality of our solution for 2 parts separately:
For object detection, we used precision, recall, and F1 score as the main metrics. These metrics suit our use case because they only evaluate whether each object was detected: a "predicted" bounding box (one found by the algorithm) still counts as correct if it differs slightly in size from the true one, because the metrics do not take into account the area of intersection between predicted and true boxes.
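For reference, the three metrics follow directly from the counts of correctly detected objects, spurious boxes, and missed objects. The sketch below uses made-up counts purely for illustration; they are not the results of this post, and how a detection is counted as correct is left to the evaluation setup.

```python
def detection_metrics(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) from detection counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts: 45 objects found, 5 spurious boxes, 5 objects missed.
precision, recall, f1 = detection_metrics(45, 5, 5)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```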
For our holdout test video, containing 96 objects, there were no false positive bounding boxes found, but in 2 cases a bounding box joined 2 objects into one. In real-life scenarios, this problem can be mitigated with better lighting, different camera positions, or tracking with multiple cameras. Nevertheless, the results of the model were:
For anomaly detection, AWS Lookout for Vision automatically calculates metrics on the test part of the dataset. Results were:
It is important to note here that we had a very limited number of objects because of our artificial demo example video. Our training dataset contained about 60 different objects, of which only 20 were anomalous. In a real-life scenario, we would have much longer videos and the ability to iteratively increase the dataset to improve the model until it reaches the required quality.
The model training process is usually continuous and could take weeks or months. With AWS Lookout for Vision, we were able to get reasonable quality after the very first training run, which took ~1 hour, even with a very small dataset.
To summarize, our methodology has several advantages:
As noted, this methodology does not work right out of the box, and in order to implement this approach you need to consider the following:
In this post, we described a typical end-to-end methodology for implementing anomaly detection technologies into a live video stream using smart, non-ML-based preprocessing pipelines and techniques. This blog post covers in detail a few standard computer vision techniques, such as morphological operations, object detection and object tracking without ML, and shows how they can be leveraged for model building on AWS Lookout for Vision. Finally, we describe the functions and techniques that are used to pull together all of this information and re-encode it into the video stream for a clean video that shows the detected anomalies, their position and their count.
As a further step, the proposed solution could be advanced by improving the data quality prior to model building, and the amount of data required to train a good model could be minimized using active learning methodologies.