By now, most people have probably heard of the data-centric approach to AI. This trendy term is very familiar to me, since working in Data Operations means, in essence, that you are already applying a data-centric approach to every AI project. As we all know by now, a very good model with crappy data will get you…well…crappy model performance.
Let’s take a step back. I have been a Data Ops person (DOps) for the past 8 years in various companies and industries. I fell into it by chance and, looking back, I think I got extremely lucky to have found a fulfilling career in a field that never ceases to amaze me. Nowadays, AI is everywhere, even where you would not expect it at all, and this is only the beginning. Since the field of Data Ops is still quite new, I thought it might be beneficial to explain what a Data Operations person does and why this role is crucial to the development of a successful machine learning project.
When a company works on building an AI product, it is actually building a machine learning model that will detect and predict whatever the product is focused on. Let’s imagine that your product is supposed to take a picture of a dress and return a very similar match from a specific website. In reality, data scientists are training an algorithm to detect the different attributes of the dress: the color, the pattern, the length of the sleeves, the hemline, the neckline and also the style. To do so, they will need to feed their model a lot of images labeled with the different attributes of each dress.
This is where the Data Operations person will come in handy. DOps acts as a kind of super powerful program manager and is in charge of organizing the entire data workflow process. They will work on building the data pipeline by executing all the steps below (of course depending on the organization, DOps might do all or only a few of these steps):
First and foremost, talk to the technical team, no matter what. You need to understand what the data scientists and engineering team are up to. What do they want to recognize and predict with their model? What volume of data do they need annotated? What is the budget allocated to this project? What is the timeline for delivery? What accuracy target are they aiming for when training and testing their model once they get the labels?
From there, DOps will have to find the data to label: source it, scrape it, or find a data vendor that will deliver it. This unstructured data can be audio, video, image or text. In my “dress example”, either DOps works for an online retail shop that already has a lot of dress images in stock from its website, or they will have to find a way to gather all the different types of dresses they want the product to recognize. DOps might even use other techniques, such as data augmentation or synthetic data, in some cases. A model usually needs anywhere from hundreds to thousands of good examples per item to reach a good recognition rate.
Once DOps gets their hands on the data, it needs to be cleaned and organized. That means making sure that the data received is actually relevant, does not contain PII (personally identifiable information), and is of good quality. That might entail screening the data themselves or having an internal or outsourced team do it.
Next, DOps has to find the right data labeling tool, one that will host the upcoming labeling task and allow the workforce to connect and do the job. This part can be tricky: there are hundreds of labeling tools on the market, and you have to find the right one for your own use case. It is not an easy task, and it takes time to test and select the right tool for the job.
From there, DOps will have to design a labeling task in the selected data labeling tool, build the ontology, and create guidelines (a set of rules that will be applied while labeling the data and that covers all the different cases that might show up in the dataset). To make sure the task makes sense, DOps will label a sample themselves and analyze the output to confirm that this is what the internal technical customers need.
Then DOps will find, select and train a labeling vendor on the specific task, and run some quality control checks to ensure that quality is not only reached but also maintained before scaling up to the entire dataset.
As the labels come back from the vendor, DOps will continuously run quality control checks, give feedback when the quality is not good enough, and retrain the labeling team to correct the labels. This feedback loop is an essential part of the annotation process.
As the labels come back and are approved by DOps, they will be converted into the internally accepted data format and delivered to the technical counterparts before getting fed to their model. If the model performance is good, then the labels will keep getting delivered to improve the performance over time. If it is not as good as expected, whether across the entire dataset or on specific labels only, then DOps will have to dig into it, understand how and why the quality of the labels was affected, and find a resolution to the issue. Several things could have impacted the quality, so this becomes more of an investigator’s job: understand the why and fix it.
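To make the “specific labels only” investigation a bit more concrete, here is a minimal sketch in Python of one way to compare vendor labels against a small sample that DOps labeled themselves. Everything in it is an assumption made for illustration: the function names, the neckline labels, the image IDs and the 0.9 target are hypothetical, and a real project would plug in its own data format and accuracy target.

```python
from collections import defaultdict


def per_label_accuracy(gold, vendor):
    """For each gold label, return the fraction of items the vendor labeled the same way."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for item_id, gold_label in gold.items():
        totals[gold_label] += 1
        # A missing vendor label counts as a mismatch.
        if vendor.get(item_id) == gold_label:
            correct[gold_label] += 1
    return {label: correct[label] / totals[label] for label in totals}


def labels_below_target(accuracy, target):
    """Return the labels whose accuracy falls under the agreed target, sorted by name."""
    return sorted(label for label, acc in accuracy.items() if acc < target)


# Hypothetical sample: labels DOps produced themselves vs. what the vendor returned.
gold = {"img1": "v-neck", "img2": "v-neck", "img3": "crew", "img4": "crew", "img5": "halter"}
vendor = {"img1": "v-neck", "img2": "crew", "img3": "crew", "img4": "crew", "img5": "halter"}

accuracy = per_label_accuracy(gold, vendor)
flagged = labels_below_target(accuracy, target=0.9)  # 0.9 is an assumed target
print(accuracy)
print(flagged)
```

A breakdown like this only tells you where to look. Whether the root cause is an unclear guideline, a confusing ontology or a vendor training gap is still the investigator part of the job.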
All these phases are quite intense and can be orchestrated perfectly only if the Data Operations person has the following skills:
Program management skills: organizing different projects simultaneously, keeping the different counterparts informed, understanding each need, and making sure to deliver the projects on time and within budget.
Vendor management skills: being able to select, test and negotiate with a data vendor, a labeling tool provider and a data labeling vendor, while making sure that all these partnerships make sense and result in a win-win relationship for everyone.
Technical project management skills: creating and designing data labeling tasks requires understanding the technicalities and vocabulary used by the data science and engineering teams, and translating them into highly simplified tasks that can be worked on by a labeling vendor with no expertise in the domain DOps is working in.
Quality control skills: while making sure that the data pipeline is organized logically and runs smoothly, DOps is also responsible for the quality of the output. That means understanding how many people should annotate the same data, how quality control should be run at the vendor’s site, how to detect false positives and false negatives, and so much more, to ensure that the labels delivered to the internal customers are highly relevant.
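For the question of how many people should annotate the same data, one common pattern is to have several annotators label each item and then look at how much they agree. Below is a minimal sketch in Python, assuming three annotators per image and a simple majority vote. The function name, the sleeve labels and the 0.6 agreement threshold are all hypothetical choices made for this example, not a recommendation for every project.

```python
from collections import Counter


def majority_vote(labels):
    """Return (winning_label, agreement); winning_label is None when the top labels tie."""
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    agreement = top_count / len(labels)
    # If the runner-up has as many votes as the leader, there is no majority.
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, agreement
    return top_label, agreement


# Hypothetical annotations: three annotators labeled the sleeve length of each image.
annotations = {
    "img1": ["long sleeve", "long sleeve", "long sleeve"],
    "img2": ["short sleeve", "long sleeve", "short sleeve"],
    "img3": ["sleeveless", "short sleeve", "long sleeve"],
}

MIN_AGREEMENT = 0.6  # assumed threshold; tune it per project

results = {item: majority_vote(labels) for item, labels in annotations.items()}
needs_review = [
    item
    for item, (label, agreement) in results.items()
    if label is None or agreement < MIN_AGREEMENT
]
print(results)
print(needs_review)
```

Items that end up in the review list are good candidates for a second look by DOps, and a pile of disagreements on the same label is often a hint that the guidelines need another example or a clearer rule.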
I am passionate about my job, and the adventure got even better when I joined MLtwist a year ago. As you can imagine, the Data Operations role is not boring. Sometimes I wish I were bored, even a little… A lot of things can go wrong at every step. The work is really hands-on and can quickly become overwhelming for someone who does not enjoy multitasking. However, when the product you have helped build finally comes out, you get a sense of pride that can make you forget all the sweat and stress you encountered along the way.
If you want to know more about the Data Ops field, think about joining our community “Data Ops for AI” on LinkedIn to learn about this relatively new career path, share knowledge, relevant news and job posts, and read how some of us got to where we are right now.