Curate Your Training Set for a More Robust Model
Curating a dataset with a well-balanced distribution of samples can be a challenging task, especially when dealing with datasets that are sparse or have limited metadata. Our product, Curate, aims to address these challenges and help users curate a training or validation dataset that contains more rare edge cases. This can lead to a more robust model that performs well in real-world scenarios.
With our Auto-Curate feature, users can automate the curation process using embeddings. An embedding is a numeric vector that represents an image, which lets the AI compare images by visual similarity: background, color, composition, angle, and so on. Building on embeddings, Curate provides several AI-based data curation features. It can curate a dataset of unlabeled images with an even distribution and minimal redundancy, curate only images that are rare or likely to be edge cases, or curate only images that are representative of the dataset and occur frequently.
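As a rough illustration of how embeddings make images comparable, the sketch below scores two vectors with cosine similarity. This is only an assumption about the kind of comparison involved. The article does not say which similarity measure Curate uses, and the three-dimensional vectors are made-up toy values (real image embeddings typically have hundreds of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


# Toy "embeddings" with invented values, for illustration only.
forklift_day = [0.9, 0.1, 0.2]
forklift_dusk = [0.8, 0.2, 0.3]
empty_aisle = [0.1, 0.9, 0.4]

sim_similar = cosine_similarity(forklift_day, forklift_dusk)
sim_different = cosine_similarity(forklift_day, empty_aisle)

# The two forklift vectors point in nearly the same direction,
# so they score higher than the forklift/aisle pair.
print(sim_similar > sim_different)
```

A curation tool can use scores like these to spot near-duplicates (very high similarity) or outliers (low similarity to everything else).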
To see how this works in practice, we ran an experiment on the LOCO dataset. LOCO is a valuable resource for researchers and developers working on logistics-related computer vision problems. It is the first scene understanding dataset designed explicitly for logistics, covering the detection of logistics-specific objects. The dataset comprises 37,988 images captured in five logistics environments using low-cost cameras. Of those images, 5,593 have been manually annotated, resulting in 152,421 annotations. The annotations cover logistics-specific classes such as forklifts, pallet trucks, pallets, small load carriers, and stillages.
In the experimental setup, we first curated the validation set using our Auto-Curate feature to include more edge cases and rare cases that may be difficult for a model to predict. This curated validation set consists of 1,000 images selected from the official validation set of 2,277 images.
We then created two training sets to compare: one curated and one randomly sampled. We call them the curated-train-set and the random-train-set, respectively. Each comprises 1,000 images drawn from the official training set of 2,820 images.
With this experimental setup, we fine-tuned a pre-trained model on each of the two training sets and evaluated both resulting models on the curated validation set.
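The random baseline in this setup is simple to reproduce. Here is a minimal sketch, assuming the official training set is available as a list of image IDs. The ID format, the helper name `sample_random_subset`, and the seed are all hypothetical and chosen only for illustration.

```python
import random


def sample_random_subset(image_ids, k, seed=0):
    """Draw k distinct items uniformly at random; a fixed seed keeps it repeatable."""
    rng = random.Random(seed)
    return rng.sample(image_ids, k)


# Placeholder IDs standing in for the 2,820 official training images.
official_train_ids = [f"train_{i:04d}" for i in range(2820)]

random_train_set = sample_random_subset(official_train_ids, 1000)
print(len(random_train_set))
```

Fixing the seed matters when comparing against a curated set, because it lets the same random baseline be rebuilt for every rerun of the experiment.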
Impact of Data Curation on Dataset Composition
The table below compares the composition of the randomly sampled training set with that of the curated training set. Both sets contain 1,000 images, but the percentage distribution across classes is noticeably different. As expected, the randomly sampled training set has a class distribution similar to that of the entire LOCO dataset.
The curated training set, by contrast, has a clearly different class distribution. It is designed to over-sample under-represented classes and under-sample over-represented classes, resulting in a more balanced dataset. This means that our Auto-Curate feature can help address the class imbalances often present in datasets, which in turn can improve model performance.
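A class-distribution comparison like this one can be computed directly from the annotation labels. The sketch below is a minimal example. The class names come from LOCO, but the label counts are invented toy numbers and are not the figures from our experiment.

```python
from collections import Counter


def class_distribution(labels):
    """Return each class's share of all annotations, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: 100.0 * n / total for cls, n in counts.items()}


# Invented toy label lists (NOT the real LOCO or experiment numbers).
random_labels = (
    ["pallet"] * 70 + ["small load carrier"] * 20 + ["stillage"] * 5
    + ["forklift"] * 3 + ["pallet truck"] * 2
)
curated_labels = (
    ["pallet"] * 40 + ["small load carrier"] * 25 + ["stillage"] * 15
    + ["forklift"] * 10 + ["pallet truck"] * 10
)

random_dist = class_distribution(random_labels)
curated_dist = class_distribution(curated_labels)

for cls in sorted(random_dist):
    print(f"{cls:20s} random={random_dist[cls]:5.1f}%  curated={curated_dist[cls]:5.1f}%")
```

In the toy data, the dominant class shrinks and the rare classes grow in the curated list, which is the pattern described in the paragraphs before this example.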
Next, we also inspect how well our Auto-Curate feature takes into account the variability or distribution of images that belong to the same object class. We’ll choose the “small load carrier” class as an example.
A small load carrier is a type of logistics equipment used to transport smaller items, usually within a warehouse or manufacturing facility, and typically looks like the image above.
In addition to balancing different classes, our Auto-Curate feature accounts for the various ways images or objects of the same class can look. This is called intra-class variability. Think of the many ways an image of a “person” can vary: pose, clothing, height, gender, background, lighting, camera angle, and so on. Ensuring that your training (and validation) dataset contains a diverse set of images for each class is crucial.
We utilize the embedding values of each image to determine which images are common and which are rare. Our system clusters the embeddings based on similarity, and each cluster is of a different size. When there is a large cluster with many images, that indicates that a particular type of image or object is common in the dataset and is a typical example of that class. Conversely, when there is a small cluster with only a few images (or even just one), that indicates the image or object is rare and is likely to be an edge case.
In the example above, we show four edge case images that belong to a small cluster. These images are visibly harder to recognize than the common cases in the dataset.
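The idea of separating common from rare images by cluster size can be sketched in a few lines. This is a hedged illustration, not Curate's actual algorithm. It assumes a simple greedy "leader" clustering with Euclidean distance, and the function names, the `radius` and `max_rare_size` parameters, and the 2-D points are all hypothetical.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leader_clustering(embeddings, radius):
    """Greedy clustering: a point joins the first cluster whose leader lies
    within `radius`; otherwise it becomes the leader of a new cluster."""
    leaders = []
    assignments = []
    for vec in embeddings:
        for cluster_id, leader in enumerate(leaders):
            if euclidean(vec, leader) <= radius:
                assignments.append(cluster_id)
                break
        else:  # no existing leader was close enough
            leaders.append(vec)
            assignments.append(len(leaders) - 1)
    return assignments


def split_common_and_rare(assignments, max_rare_size):
    """Indices in clusters of at most `max_rare_size` members count as rare."""
    members = defaultdict(list)
    for idx, cluster_id in enumerate(assignments):
        members[cluster_id].append(idx)
    common, rare = [], []
    for idxs in members.values():
        (rare if len(idxs) <= max_rare_size else common).extend(idxs)
    return sorted(common), sorted(rare)


# Toy 2-D "embeddings": a dense group, one isolated point, and a pair.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
    (5.0, 5.0),
    (9.0, 0.0), (9.1, 0.1),
]

assignments = leader_clustering(embeddings, radius=0.5)
common, rare = split_common_and_rare(assignments, max_rare_size=1)
print("rare indices:", rare)
```

Raising `max_rare_size` to 2 would also flag the pair as rare. Where to draw that line is a tuning decision that depends on the dataset.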
Our Auto-Curate algorithm is designed to include enough of these rare edge-case examples of each class in the curated training set, so that the trained model becomes more robust to them. In the next section, we will explore how this change in class distribution and the inclusion of edge cases in the training set impact model performance.
Impact of Data Curation on Model Performance
Precision and recall are two commonly used metrics for evaluating the performance of machine learning models. Precision measures how often a model correctly identifies a positive class (i.e., the class of interest) out of all the times it makes a positive prediction. In simpler terms, precision tells us how many of the objects identified as a particular class are actually that class. On the other hand, recall tells us how many of the actual objects belonging to a particular class were identified as that class.
F-1 score is a metric that combines precision and recall into a single score. It is the harmonic mean of precision and recall, and it ranges from 0 to 1, with 1 being the best possible score. Because the harmonic mean is pulled toward the lower of the two values, a model cannot earn a high F-1 score by excelling at only one of them. F-1 score is often used to measure overall model performance, especially when the dataset is imbalanced (i.e., one class occurs more frequently than another).
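These definitions translate directly into code. The sketch below computes precision, recall, and F-1 score from true-positive, false-positive, and false-negative counts, and then averages F-1 across classes with equal weight (a macro average). The counts are invented toy values. How a detection counts as a true positive (for example, an IoU threshold) and how per-class scores are averaged are assumptions here, since the article does not specify them.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-1 score from raw counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


def macro_f1(per_class_counts):
    """Unweighted mean of per-class F-1 scores."""
    scores = [precision_recall_f1(*counts)[2] for counts in per_class_counts.values()]
    return sum(scores) / len(scores)


# Invented (tp, fp, fn) counts for two classes, for illustration only.
toy_counts = {
    "pallet": (80, 20, 20),   # precision 0.80, recall 0.80
    "forklift": (6, 2, 6),    # precision 0.75, recall 0.50
}

for cls, counts in toy_counts.items():
    p, r, f1 = precision_recall_f1(*counts)
    print(f"{cls:10s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

print(f"macro F-1: {macro_f1(toy_counts):.2f}")
```

With a macro average, a rare class such as the toy "forklift" counts as much as a frequent one, which is why this style of averaging is sensitive to how well a model handles under-represented classes.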
The results of our experiment are exciting! By letting our Auto-Curate feature select which data to include in the training set, we saw a 14.5% increase in F-1 score on average across all object classes, compared with the randomly sampled training set.
This increase in performance is particularly impressive because it was achieved without any additional data: both training sets contain the same 1,000 images' worth of training material, drawn from the same official training set. The only difference is which images were chosen. Auto-Curate selected edge cases and underrepresented classes to produce a more robust and balanced dataset. This shows how much thoughtful data curation matters. With our tool, users can achieve significant performance improvements with minimal additional effort, making it easier than ever to develop accurate and reliable models.
Building More Robust Models with Superb Curate
Curate's Auto-Curate feature is just one of the many powerful tools our product offers to improve the accuracy and robustness of your machine-learning models. By utilizing its AI-based data curation features, Curate empowers machine learning teams to efficiently curate their training datasets and build more robust, high-performance models.
With Curate, users don't have to resort to labeling more data to improve model performance. Instead, they can focus on selecting the most valuable data and curating a set of images that can help train their models more effectively. This saves valuable time and resources while simultaneously producing better models.
If you're interested in learning more about how Curate can optimize your machine-learning models, we encourage you to contact our team, who’d be happy to walk you through a personalized demo! With its user-friendly interface and powerful capabilities, Curate is the perfect tool for anyone looking to expand the capabilities of their machine-learning models.
Schedule a call with our sales team today to get started.