MLOps Starts With Training Data Management

Superb AI Inc. company logo

Superb AI

2021/2/9 | 8 min read
Q&A with Superb AI’s Tech Advocate, James Le, and Chief Research Officer, KH Kim, on how training data management is foundational to MLOps

Leading this interview is James Le, Tech Advocate for Superb AI, a Y Combinator-backed company that radically improves AI production workflows with an advanced training data management platform. James completed his MS. in Computer Science from Rochester Institute of Technology, where his research lies at the intersection of deep learning and recommendation systems. His professional experience spans across data science, product management, and technical writing.

KH Kim is the Chief Research Officer for Superb AI and is currently working on Superb AI’s advanced auto labeling technology. He is developing advanced training techniques such as few-shot, transfer and Bayesian deep learning so that any ML team can quickly spin up and train a custom auto-labeling model on the Superb AI platform with limited data points. Along with uncertainty estimation, he hopes to help teams quickly deploy true automation and dynamic active learning workflows.

KH’s journey began like that of any committed ML engineer. Upon completing his Ph.D. in Computer Science from Pohang University of Science and Technology, he held various high profile research roles with Microsoft, Intel, and Samsung. Before coming over to Superb AI, KH was the Lead Algorithm Engineer for StradVision – a well-funded startup focused on building computer vision software for autonomous vehicles. He has published research works on topics such as object detection, semi-supervised learning, and metric learning. Along the way, he has been granted over 100 patents for his work on computer vision.

In this rare interview, KH talks about his time as an engineer, synthesizes his experience in ML/AI development, and shares his observations on the expanding world of MLOps.

KH Kim and Hyun Kim discussing the future of Superb AI

James : Hey KH. Thank you for your time.

KH : My pleasure!

James : For some of us who aren’t familiar with you and your background, can you start us off with a little intro about yourself, your previous role, and the projects you worked on?

KH : Sure. My name is KH Kim, and I’m currently the Chief Research Officer for Superb AI. I focus on optimizing our deep learning architecture and infrastructure, along with developing our auto labeling systems. Before joining Superb AI, I was the Lead Algorithm Engineer at StradVision – a Seoul-based hyper-growth startup focused on vision software for autonomous systems. I was in charge of building out lightweight neural networks and AI-driven data collection systems.

James : It says here that you had over 100 patents for your work at StradVision?

KH : That’s correct. They were mostly around deep learning algorithms and camera-based computer vision techniques for autonomous driving. More specifically, I worked on efficient model architecture for low-power, real-time inferences, computer vision algorithms that are robust against mode changes and adversarial attacks, and advanced training methodologies such as continual learning, semi-supervised learning, and generative adversarial networks.*

James : Very interesting! How exactly are you applying this to your role at Superb AI?

KH : A big focus of my Ph.D. thesis was around semi-supervised learning. I focused on developing efficient algorithms for computing the pairwise distance between data points by capturing and exploiting the datasets’ underlying non-Euclidean structures. These algorithms had been applied to various machine learning and information retrieval tasks. I am bringing this research method into many of our product offerings, in addition to other approaches such as transfer learning, active learning, and uncertainty estimation techniques based on Bayesian deep learning.

James : That sounds a bit robust for a labeling tool?

KH : Haha. Well, I can see why one would identify us as a labeling tool, but our labeling interface is just a small part of our platform. Labeling is (obviously) a crucial part of the machine learning workflow, and there are open source solutions designed to tackle it (CVAT, VoTT, DataTurks, etc.). However, let’s suppose that your ML team is looking to build out a scalable and repeatable production workflow. In that scenario, you’ll probably want to look beyond open source tools and adopt a well-polished platform that can bring an additional layer of efficiency and automation without having to build everything from scratch. That’s what I’m helping Superb AI do: transform the way ML teams approach MLOps, starting with training data management.

James : From what I’m gathering, practicing proper MLOps is paramount for ML teams to succeed. However, such practice is not widely adopted yet. Is that a safe assumption?

KH : From my experience, that is correct. In order to practice and implement MLOps effectively, you must first understand how DevOps has transformed traditional software development. Without going into too many details, DevOps’ core practice is to combine software development and IT operations into one repeatable workflow so that engineering teams can deliver better software quickly. This practice is augmented by the fact that teams can access a robust ecosystem of solutions that can be rapidly adopted and integrated into their workflow. DevOps ultimately allows organizations to quickly build, develop, test, deploy, and manage code.

James : So is it fair to say that MLOps doesn’t have the same ecosystem support level?

KH : In many ways, yes. While we have seen a rise in verticalized MLOps solutions in the last 1-2 years that I don’t see slowing down, adoption and implementation of MLOps certainly have some catching up to do.

James : Why are MLOps teams slow to adopt verticalized solutions? Are ML teams typically building their own tools?

KH : Hmm. Good question. I can’t speak for the industry as a whole but from what I’ve seen, heard, and personally experienced, ML teams have tended to “build” rather than “buy.” I reckon that this desire to build seems to come from a few places: to avoid vendor lock-in, to have total control over what the service does, or to hope that it will be cheaper to build than buy. The truth is that these factors are less important than you would think. You should only build custom services if that can provide a sustainable advantage for your business. I would imagine that paradigm is starting to shift now. However, I bet there are still many ML teams out there trying to ship AI applications based on fragmented workflows built in-house.

James : So why try to build these internally in the first place? Why not just go out and procure a solution like Superb AI?

KH : I think it’s a combination of a lot of things. ML teams typically start by developing deep learning algorithms and training/testing/validating these models using, for example, pre-labeled open-source datasets. These stages don’t require much tooling. Once the model performance reaches an acceptable level, larger and more curated datasets will be required to further train and validate these models to handle real-world scenarios (especially for industries like autonomous driving). At this point, most ML teams will look to either build, adopt open source, or buy solutions to handle the increasing scale of labeling and management of new datasets. I would guess that most, if not all, will probably either try to build or use an open-source labeling tool to handle this. Don’t get me wrong. Open-source is critical for any emerging technology as it is the fundamental building block of democratizing adoption, which drives maturation and ultimately scale. And specific to preparing training datasets, open-source labeling tools could be sufficient at first, especially if your datasets are small and you are working with a few labelers. But as these two variables start to increase and the need to scale becomes more imperative, you’ll quickly hit bottlenecks, as I have personally seen.

James : What kind of bottlenecks do ML teams typically encounter when building and implementing their own tools or trying to combine open source solutions?

KH : Too many to count, to be honest. I used to work with enormous datasets. The preprocessing step for them was extraordinarily painstaking. We ended up building a customized tool to visualize data distribution and summary statistics. Also, collaboration became more and more important from an efficiency perspective. What happens when a model fails? Who’s responsible for collecting new edge case data? Who’s in charge of validating model upgrades? How do we build a better workflow around new dataset requests? If we receive an unlabeled dataset, what’s a workflow to correct this? We practically had to use a combination of email and PowerPoint to communicate across all these scenarios because building something didn’t make sense, and repurposing an existing solution didn’t add any value. In the end, the lack of a platform that enables seamless communication could add unnecessary time to an already long workflow.

James : Speaking of a long workflow, based on your experience, what would an average production timeline look like for most ML teams?

KH : Well, for first-time projects, in particular, you could be looking at around 6 months for getting the first model into deployment. And it’s not crazy to say that almost half that time would be spent on building and managing training datasets. It can also be longer if your team doesn’t have the human resources or expertise.

James : Wow. 6 months. That length of a period for the initial iteration could open up many risk factors, no?

KH : You’re absolutely right. Imagine spending 3 months collecting, labeling, and pre-processing large datasets on a preconceived model design. Then another 2-to-3 months retraining and refining your model. Then in month 6, you and your team find out your PoC (Proof of Concept) doesn’t work as initially designed. I actually had a chance to look at the 2021 State of Enterprise ML Report from Algorithmia, and apparently only 11% of organizations can put a model into production within a week, and 64% take a month or longer. Furthermore, the time required to deploy a model (once it’s been developed) is actually getting longer, despite increased budgets and hiring. This could mean a significant loss in money spent on engineering resources and valuable research time. Larger teams with considerable capital might still be able to regroup, but this situation could be a substantial setback for smaller teams with fewer resources.

James : How would an ML org or team drastically reduce that kind of risk?

KH : Well, the only way is to automate as much as you can, iterate faster, and catch errors earlier in the workflow. And that is a big part of what MLOps as a practice aims to accomplish.

James : Any advice for the readers on how to go about doing that?

KH : Well, that’s precisely why I joined Superb AI. Much like DevOps, where you can conceivably push code into production in a matter of minutes, we felt that there was still much to be done in terms of drastically reducing the time it takes for companies to push models into production. For some quick context, the most significant difference between traditional software development and ML is that it’s not just the code that ML teams have to deal with. Machine learning is a combination of code (data processing, model training, API calling, etc.), models (learning from the data and making predictions), and data (from which these models learn). The last component includes training data, test data, validation data, real-world data, and more. If you refer to what I mentioned earlier about building these datasets and the bottlenecks that inherently exist in most data pipelines, we knew a proper MLOps workflow had to first start with proper training data management. Once that is established, teams can begin to build out the rest of the workflow, including model lifecycle management, versioning, monitoring, governance, and ultimately deployment.

James : Lastly, are you and the Superb AI team looking to ship more features or products outside of just training data management?

KH : I think there are various directions our product can head towards naturally, and quite frankly, it already is. Aside from our Custom AutoML product which we just launched, I can’t give any more details at the moment. But in general, our goal is to build a platform that provides ML teams the ability to create tighter iteration cycles between data management and training. That could include consistent data/model/hyperparameter versioning, data and model validation, monitoring, ability to plug into existing frameworks and training environments, you name it. I think there’s just so much value we can provide by building a platform that allows teams to collectively and seamlessly approach data and training with a unified approach. We want to help drive the evolution and adoption of MLOps so that teams can shift focus from building workflow tools back to building impactful AI technology.

James : Thank you so much KH, this has been a pleasure. I can’t wait to see what more Superb AI has to contribute to the world of MLOps.

KH: Thank you!


K.-H. Kim and S. Choi (2007). Neighbor search with global geometry: A minimax message passing algorithm. ICML-2007.

K.-H. Kim and S. Choi (2013). Walking on minimax paths for k-NN search. AAAI-2013.

K.-H. Kim, S. Hong, B. Roh, Y. Cheon, and M. Park (2016). PVANET: Deep but lightweight neural networks for real-time object detection. arXiv preprint arXiv:1608.08021.

Subscribe to our newsletter

Stay updated latest MLOps news and our product releases

About Superb AI

Superb AI is an enterprise-level training data platform that is reinventing the way ML teams manage and deliver training data within organizations. Launched in 2018, the Superb AI Suite provides a unique blend of automation, collaboration and plug-and-play modularity, helping teams drastically reduce the time it takes to prepare high quality training datasets. If you want to experience the transformation, sign up for free today.

Join The Ground Truth Community

The Ground Truth is a community newsletter featuring computer vision news, research, learning resources, MLOps, best practices, events, podcasts, and much more. Read The Ground Truth now.


Designed for Data-Centric Teams

We’ve built a platform for everyone involved in the journey from training to production - from data scientists and engineers to ML engineers, product leaders, labelers, and everyone in between. Get started today for free and see just how much faster you can go from ideation to precision models.