With the rise of the data-centric AI movement (of which computer vision is a subset), the spotlight has shifted from algorithm design to dataset development. For many modern neural network architectures, data is the single largest contributor to model performance; adding layers, adding skip connections, or tuning hyperparameters yields comparatively limited gains. Many practitioners spend countless hours creating and curating labeled data to train state-of-the-art architectures, often at the expense of algorithm development. Dataset creation is also one of the most costly and demanding components of the entire pipeline. Good data quality practices are therefore critical to ensuring successful outcomes.
Why Have Data Quality Solutions Become Essential for Computer Vision?
In short, the growing importance of analytics and ML applications demands modern data quality solutions:
3. **Potential sources of error have increased.** The volume, variety, velocity, and veracity of data continue to increase alongside the number and types of data sources and providers.
Labeled datasets are among the assets computer vision practitioners seek most. Even though computer vision scientists and engineers continuously research novel methods to reduce models' dependency on labeled data (e.g., active learning, self-supervised learning, adversarial learning), supervised learning remains the dominant technique for computer vision models in production. Many potential sources of error can impact the quality of labeled data, including a lack of proper data management, instruction ambiguity, data misinterpretation due to a low signal-to-noise ratio in the source data, and the cognitive difficulty of certain labeling operations. If not detected early, these errors can have devastating effects, from cost overruns to underwhelming model performance. Thus, labeling frameworks need proper mechanisms to monitor data quality as labeling efforts progress.
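One lightweight mechanism for monitoring label quality as work progresses is tracking inter-annotator agreement per class. The sketch below is illustrative only; the function name and dict-based output are assumptions, not any specific tool's API:

```python
from collections import defaultdict

def agreement_by_class(labels_a, labels_b):
    """Per-class agreement rate between two annotators who labeled
    the same items. Low agreement for a class can signal ambiguous
    instructions or a cognitively difficult labeling task."""
    agree, total = defaultdict(int), defaultdict(int)
    for a, b in zip(labels_a, labels_b):
        total[a] += 1
        if a == b:
            agree[a] += 1
    return {c: agree[c] / total[c] for c in total}
```

Classes whose agreement falls below a chosen threshold can then be routed back for guideline revision or re-labeling.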
The 6 Dimensions of Data Quality
As a concept, data is of high quality if it fits the intended purpose of use. In the context of ML, data is of high quality if it correctly represents the real-world construct that the data describes, meaning that it is representative of the underlying population and scenarios. While good quality differs from case to case, there are common dimensions of data quality that can be measured.
Adapted from Collibra’s The 6 Dimensions of Data Quality
The Collibra team put together a nice list with six dimensions of data quality: completeness, accuracy, consistency, validity, uniqueness, and integrity. Since Superb AI focuses on computer vision data, let’s examine how these dimensions fit into the context of visual data.
1. Completeness: Is your dataset sufficient to produce meaningful insights? Are there no “gaps” in your dataset? Does your dataset cover all the edge cases? Visual data can’t be considered complete until all the object classes of interest have been labeled, which is a requirement to kickstart the modeling process. These vital labels help your model learn and make predictions.
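A rough completeness check can compare the classes present in the labels against the classes of interest. This is a hedged sketch; `completeness_report` and its inputs are hypothetical, not part of any named tool:

```python
def completeness_report(expected_classes, labeled_counts):
    """Report which expected object classes have zero labeled
    instances, plus the fraction of classes that are covered."""
    missing = sorted(c for c in expected_classes
                     if labeled_counts.get(c, 0) == 0)
    coverage = 1 - len(missing) / len(expected_classes)
    return {"missing": missing, "coverage": coverage}
```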
The State of Data Quality
According to a 2021 survey conducted by Datafold, data quality and reliability are top KPIs for data teams, followed by improving data accessibility, collaboration, and documentation. Data quality can't be owned by any single team; like security, it needs to be addressed at the company level and requires close collaboration between teams. Unfortunately, most teams currently don't have adequate processes and tools to address data quality issues.
Considering that data teams identify data quality as their primary KPI while lacking tools and processes to manage it, it is not surprising that they are haunted by manual work: routine tasks such as testing changes to ETL code or tracing data dependencies can take days without proper automation. They need to write ad hoc data quality checks or ask others before using the data for their work. Only a few teams use automated tests and data catalogs as a source of truth for data quality.
Source: Choosing a Data Quality Tool
Sarah Krosnik has mapped out the data quality tooling landscape, placing each tool under one of four categories based on the approach it takes to data quality:
1. Auto-profiling data tools (Bigeye, Datafold, Monte Carlo, Lightup, Metaplane) are hosted tools that automatically profile data through either ML or statistical methods and alert upon changes based on historical behavior. Choose them if your team has a high budget, many data sources you don’t control, and fewer technical resources or time to create and maintain custom tests.
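The alerting logic behind such auto-profiling tools can be approximated with a simple statistical check: flag a metric (row count, null rate, label count) that deviates sharply from its historical behavior. This is a toy sketch, not any vendor's actual algorithm:

```python
import statistics

def is_anomalous(history, current, k=3.0):
    """Flag a metric value that lies more than k standard deviations
    from its historical mean (a crude stand-in for the ML/statistical
    profiling that hosted tools perform)."""
    mean = statistics.fmean(history)
    sd = statistics.pstdev(history)
    if sd == 0:
        # No historical variation: any change at all is suspicious.
        return current != mean
    return abs(current - mean) > k * sd
```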
From my research, few data quality tools deal with unstructured visual data; all of the tools mentioned above handle only structured tabular data. There is therefore an emerging opportunity to design such a tool, given the untapped potential of visual data, which has a larger footprint than structured data and powers more novel computer vision applications.
Designing A Data Quality Tool For Computer Vision
Should we care about the quality of our visual datasets? If the goal is to build algorithms that can understand the visual world, having high-quality datasets will be crucial. We outline below three recommendations for designing a data quality tool for computer vision.
1 - Detect and Avoid Bias
Torralba and Efros, 2011 assessed the quality of various computer vision datasets based on cross-dataset generalization (training on one dataset and testing on another). Their comparative analysis illustrates different types of bias in these datasets: selection bias (datasets often prefer particular kinds of images), capture bias (photographers tend to capture objects in similar ways), label bias (semantic categories are often poorly defined, and different labelers may assign different labels to the same type of object), and negative set bias (if what the dataset treats as "the rest of the world" is unbalanced, the resulting models can be overconfident and not very discriminative).
To minimize the effects of bias during dataset construction, a data quality tool for computer vision should be able to:
1. Verify that the data is obtained from multiple sources to decrease selection bias.
2. Perform various data transformations to reduce capture bias.
3. Design rigorous labeling guidelines with vetted personnel and built-in quality control to negate label bias.
4. Add negatives from other datasets or use algorithms to actively mine hard negatives from a huge unlabeled set to remedy negative set bias.
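To make step 2 concrete, capture bias can be reduced with simple geometric transformations such as horizontal flips. A minimal sketch, assuming images are represented as nested lists of pixel values (real pipelines would use a library such as torchvision or Albumentations):

```python
import random

def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left-to-right, so the
    model sees objects captured from the opposite orientation."""
    return [row[::-1] for row in image]

def augment(image, flip_prob=0.5, rng=None):
    """Randomly apply the flip with probability flip_prob."""
    rng = rng or random.Random(0)
    return horizontal_flip(image) if rng.random() < flip_prob else image
```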
2 - Tackle Quality Aspects
In an exploratory study on deep learning, He et al., 2019 considered four aspects of data quality for AI:
To solve the issues associated with the aspects mentioned above, a data quality tool for computer vision should be capable of:
1. Rebalancing samples among classes so that a small number of classes does not dominate the training set.
2. Suggesting a min/max threshold on the optimal number of samples required to train the model for the specific task.
3. Identifying label errors and providing sufficient quality control to fix them.
4. Adding noise to samples in the training set to help reduce generalization error and improve model accuracy on the test set.
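Step 1 above can be sketched as naive oversampling: duplicate samples from minority classes until every class matches the largest one. The function below is illustrative only, not a production rebalancing strategy:

```python
import random

def oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all
    classes reach the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for s in group + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels
```

In practice, weighted sampling or targeted data collection is often preferable to raw duplication, which can encourage overfitting to repeated minority samples.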
3 - Offer Visual Analyses
Alsallakh et al., 2022 presented visualization techniques that help analyze the fundamental properties of computer vision datasets. These techniques include pixel-level component analysis (principal component analysis, independent component analysis), spatial analysis (spatial distribution of bounding boxes or segmentation masks for different object classes), average image analysis (averaging a collection of images), metadata analysis (aspect ratios and resolution, image sharpness, geographic distribution), and analysis using trained models (feature saliency in a given input, input optimization, concept-based interpretation).
To improve understanding of computer vision datasets, a data quality tool for computer vision should offer the visual analysis techniques mentioned above:
1. **Pixel-level component analysis** is helpful in understanding which image features are behind significant variations in the dataset and (accordingly) predicting their potential importance for the model.
2. **Spatial analysis** is helpful in uncovering potential shortcomings of a dataset and assessing whether popular data augmentation methods are suited to mitigate any skewness in the spatial distribution.
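Spatial analysis, for instance, can be approximated by binning bounding-box centers into a coarse grid to expose skew in where objects appear. A small sketch assuming a hypothetical `(x, y, w, h)` box format:

```python
def bbox_center_heatmap(boxes, img_w, img_h, bins=4):
    """Accumulate bounding-box centers into a bins x bins grid.
    A heavily concentrated grid reveals spatial skew that flips or
    crops may (or may not) be able to mitigate."""
    grid = [[0] * bins for _ in range(bins)]
    for x, y, w, h in boxes:
        cx, cy = x + w / 2, y + h / 2
        col = min(int(cx / img_w * bins), bins - 1)
        row = min(int(cy / img_h * bins), bins - 1)
        grid[row][col] += 1
    return grid
```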
Understanding the quality of the data used to train a model, the clarity of the labeling process, and the strengths and weaknesses of the ground-truth data used to evaluate models leads to increased traceability, verification, and transparency in computer vision systems. In this article, we have given a tour of the data quality tooling landscape and proposed ideas for designing a robust data quality tool for computer vision applications.
At Superb AI, we are building a CV DataOps platform to help computer vision teams automate data preparation at scale and make building and iterating on datasets quick, systematic, and repeatable. Our custom auto-label and upcoming AI features, such as mislabel detection and an embedding store, use the collected data to set data quality rules, which must adapt to the data we collect.