Data has become the new source code, and we need a way to manage it.
Data is so important that many leading practitioners in AI are pushing for data to be at the center of the ML workflow. For many years, code has been at the center of software development, and we have developed amazing tools and processes for creating great software, becoming more agile and effective along the way. But today, with the rise of machine learning software, curating the right data for machine learning applications is the most crucial element. Without tools and processes for developing datasets, we can’t create models with real-world impact.
The management of these stages is anything but trivial. Selecting data sources, generating labels, and retraining models are all key components of the data curation lifecycle, and we typically perform them in an ad-hoc fashion. So what can we do to keep our efforts from snowballing out of control?
We need a data-centric approach. We need tooling to support data development.
In this blog, we’re combining two key tools to improve data-centric operations: Superb AI Suite and Pachyderm Hub. Together, these two tools bring data labeling and data versioning to your data operations workflow.
Superb AI Suite: Labeled Data At Scale
Superb AI has introduced a revolutionary way for ML teams to drastically decrease the time it takes to deliver high-quality training datasets. Instead of relying on human labelers for a majority of the data preparation workflow, teams can now implement a much more time- and cost-efficient pipeline with the Superb AI Suite.
Superb’s ML-first approach to labeling, shown in the diagram above, works like this:
If your model is underperforming, you collect a new dataset to augment your existing ground-truth dataset. Next, you run it through your pre-trained model and upload the model predictions into the platform. Then, Suite helps you find and re-label the failure cases. Finally, you can train Suite’s auto-label on these edge cases to drive performance up.
This cycle repeats over and over again. With each iteration, your model will cover more and more edge cases.
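As a rough sketch, the cycle above might look like the following Python loop. The helper functions here are hypothetical stand-ins, not real Superb AI SDK calls; they only illustrate the control flow of the iterative labeling workflow.

```python
# Hypothetical sketch of the Suite labeling cycle described above.
# Every function below is a stub standing in for a platform or model step.

def run_pretrained_model(images):
    # Stand-in for your own model's inference over the new dataset.
    return [{"image": img, "label": "cat", "confidence": 0.4} for img in images]

def upload_predictions(predictions):
    # Stand-in for pushing model predictions into Suite.
    return predictions

def find_failure_cases(predictions, threshold=0.5):
    # Suite surfaces low-confidence / failed predictions for review.
    return [p for p in predictions if p["confidence"] < threshold]

def relabel(cases):
    # Human reviewers correct the failure cases in Suite.
    for case in cases:
        case["label"] = "corrected"
    return cases

# One iteration of the cycle: new data -> predictions -> failures -> re-labels.
new_images = ["img_001.jpg", "img_002.jpg"]
preds = upload_predictions(run_pretrained_model(new_images))
edge_cases = relabel(find_failure_cases(preds))
# These corrected edge cases would then be used to train Suite's auto-label,
# and the loop repeats with the next batch of data.
```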
You can try this out for free with Superb AI Suite.
Pachyderm: Versioned Data + Automation
Pachyderm is the data foundation for machine learning: it is like GitHub for your data-driven applications.
Under the hood, Pachyderm forms this foundation by combining two key components:
Pachyderm’s data versioning works much like Git: you organize and iterate on your data with repos and commits. But instead of being limited to text files and structured data, Pachyderm lets you version any type of data: images, audio, video, text, anything. The versioning system is optimized to scale to large datasets of any type, making it a perfect pairing for Superb AI and giving you cohesive reproducibility.
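For a feel of the Git-like workflow, here is a short session with Pachyderm’s `pachctl` CLI (the repo and file names are illustrative, and the commands assume a running Pachyderm cluster):

```shell
# Create a versioned data repository (name is illustrative).
pachctl create repo images

# Commit a file; each put creates a new commit on the branch.
pachctl put file images@master:/cat.png -f cat.png

# Inspect the commit history, much like `git log`.
pachctl list commit images

# Retrieve a file as it exists at a given branch or commit.
pachctl get file images@master:/cat.png > cat_copy.png
```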
Pachyderm’s pipelines let you connect your code to your data repositories. They can automate many components of the machine learning lifecycle (such as data preparation, testing, and model training) by re-running whenever new data is committed. Together, Pachyderm pipelines and versioning give you end-to-end lineage for your machine learning workflows.
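As a sketch, a minimal Pachyderm pipeline spec looks like the following JSON (the training image and script here are assumptions; the input repo matches the `sample_project` repository used later in this post):

```json
{
  "pipeline": { "name": "train_model" },
  "input": {
    "pfs": { "repo": "sample_project", "glob": "/*" }
  },
  "transform": {
    "image": "my-org/train:latest",
    "cmd": ["python3", "/train.py"]
  }
}
```

With this spec, Pachyderm re-runs `/train.py` automatically whenever a new commit lands in the `sample_project` repo, which is what gives you the lineage from data to model.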
You can try this out for free with Pachyderm Hub.
Pachyderm as Superb AI’s Versioned Storage
In this integration, we provide an automated pipeline that versions the data labeled in Superb AI. This means we get all the benefits of Superb AI Suite for ingesting our data, labeling it, and managing agile labeling workflows, plus all the benefits of Pachyderm for versioning and automating the rest of our ML lifecycle.
The pipeline automatically pulls data from Superb AI Suite into a Pachyderm Hub cluster, versioning it as a commit. It works by storing our Superb AI API access key securely as a Pachyderm secret. The pipeline then uses this key to pull our Superb AI data into a Pachyderm data repository.
We automate this with a cron pipeline that pulls new data on a schedule (every 2 minutes in our example). The output dataset is committed to our “sample_project” data repository.
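A sketch of such a cron pipeline spec might look like the following (the image, script, and secret names are assumptions). The `cron` input ticks every 2 minutes, and the `secrets` stanza exposes the Superb AI access key from a Pachyderm secret as an environment variable. Since a pipeline’s output repo takes the pipeline’s name, naming the pipeline `sample_project` lands the pulled data in that repo:

```json
{
  "pipeline": { "name": "sample_project" },
  "input": {
    "cron": { "name": "tick", "spec": "@every 2m" }
  },
  "transform": {
    "image": "my-org/superb-ingest:latest",
    "cmd": ["python3", "/ingest.py"],
    "secrets": [
      { "name": "superb-ai-key", "env_var": "SPB_ACCESS_KEY", "key": "access_key" }
    ]
  }
}
```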
Once we have our data in Pachyderm, we can build the rest of our MLOps pipelines to test, pre-process, and train our models.
Data-centric development is key to producing machine learning models that operate in the real world. Together, Superb AI and Pachyderm make the data preparation stage reliable and agile, ensuring we can keep feeding our models good data while reducing data bugs.
Check out the full code for this integration on GitHub.
Both Superb AI and Pachyderm are part of the AI Infrastructure Alliance and are dedicated to building the foundation of the Artificial Intelligence applications of today and tomorrow.