Raghu Banda Extra AI Data Labeling Ops
MLtwist COO Audrey Smith spoke with Raghu Banda of XTrawAI.com in early 2023 about data labeling ops for AI. This conversation covers data labeling operations, data privacy, AI model transparency and more.
RB: All right, welcome back to Extra AI, our podcast series on machine learning and AI applications. Today I have a guest, Audrey Smith, and the topic we are going to discuss is data labeling ops in the context of AI. So welcome on board, Audrey.
AS: Thank you, Raghu, thank you for having me.
RB: So, a quick background about Audrey. Audrey Smith is the chief operating officer at MLtwist, a platform that builds automated AI data pipelines at the click of a button. She has led AI data labeling operations for the past eight years at companies like Google, Amazon, and Labelbox. Audrey has deployed data labeling pipelines across text, image, audio, and video in a dozen different verticals. She also co-launched the DataOps for AI community and mentors within the data labeling operations industry. So I’m really excited to have you on board, Audrey. As we usually do on the podcast, to ease our audience into the conversation, I start with a teaser question. I generally put it in different ways, so let me put it a little differently today: can you tell us a little bit about your background and how you are connected to AI, specifically ML or data or labeling?
AS: Absolutely. I’m actually a non-technical person in the tech industry. I started my career as an in-house lawyer in France, so I have a legal background. But when I moved to California eight years ago, I decided to explore the tech industry, and I started as a labeler. That’s how I started my data operations journey.
This is how I learned about machine learning projects and how the labeling work I was performing was linked to AI products. I was very excited about and interested in that. So I went to Google, where I worked on data policies for various projects, from GDPR compliance to many projects on user experience, such as for Google Shopping or even YouTube.
Then I got hired by Amazon, where I led labeling operations efforts for various machine learning teams. This is where I touched on a lot of different data types: images, even 3D, video, text, and audio. And I really learned what data ops involves, for instance creating a data pipeline that is efficient not only in terms of cost and quality, but also in terms of speed.
I also learned how to work with various stakeholders, from the product team and the marketing team to the engineering and data science teams, of course, up to the leadership level as well.
RB: Good, amazing. I think you have an amazing background, especially coming as a non-technical person, having been an in-house lawyer in France and understanding the different challenges in GDPR and other things. I think that makes much more sense in the data labeling operations space.
AS: Yes, absolutely.
RB: So let me put another question before we get into the real meat of today’s conversation. Could you provide some thoughts about the upcoming AI advancements in the current world? I know we keep hearing about a lot of things happening nowadays with ChatGPT and all.
AS: Yeah, absolutely. I think it’s a very hot topic right now. I think we should probably address here what’s happening with ChatGPT and Bard and the other LLMs that have been released by Meta, but also many others, in the past few weeks. What’s interesting is that the world is waking up to the amazing impact that AI can have on our day-to-day, right? This is the first time that you can feel that everyone feels involved in some way.
Everyone has an opinion; they get excited or they get scared about AI. It used to be a strange concept that people outside of the tech world would look at from some distance. But now it feels like LinkedIn is exploding with posts from technical and non-technical people on what is happening with these LLMs. So the revolution was already there, but now everyone feels like this is happening to them as well.
What I just want to highlight is that this is not magic, as you might see in some of the posts. This is not robots taking over the world, as you might see on some other feeds. This is truly humans generating data and feeding models. And this is completely public. This is not a secret. You can read articles about Googlers who have been asked to annotate the data for Bard. And you can also see an article published by OpenAI where they literally say that they hired contractors from Upwork and Scale AI to provide the data for ChatGPT. So in my opinion, it’s very important to be careful about the data, right? To make sure that it is accurate and very high quality.
So data operations roles will become more and more important in the upcoming months and years, just because this is what we do. We are the guardians of this data-centric approach that can protect the quality of the data.
RB: Right, right. I completely agree with what you said, Audrey. I think it’s becoming much more interesting, not only for the tech world but for the whole AI space, with the entry of ChatGPT and these large language models, whether from Bard or Meta. It has now opened up this whole new world of large language models even for the non-tech world, and changed how people interact with them. And like you rightly mentioned, data plays a very big role here, and data operations specifically. That’s what makes today’s conversation a little more interesting: hearing your perspective, coming from the background of GDPR and data quality and data labeling and data ops. So maybe let’s take a quick break and then come back and get into our real conversation on today’s topic.
AS: Sure.
RB: All right, so now let’s get started on our real conversation, today’s topic: DataOps and the role DataOps plays in AI. So I would like to ask you this question, Audrey. DataOps is a new role in the AI ecosystem, the whole AI professional ecosystem. What do you think about the data-centric AI approach, which is trending thanks to Andrew Ng, I think? Maybe we can give a quick background on that.
AS: Sure. So basically we estimate that the cost of data can account for as much as 80% of the total cost of developing an AI product, right? So it’s truly crucial that the data is right. Data operations roles are, in my opinion, crucial to the success of any machine learning project because their main focus is data quality. That’s just what we do.
It is interesting that this is trending right now thanks to Andrew Ng, who is the CEO of Landing AI. But this is not new. Having been in data operations for the past eight years, this is what my job has always been about: how to make sure that the data feeding the model is consistent and accurate. That said, I’m very glad that the word is out there that data quality is now the number one priority when creating a model.
And I hope this will stay the same as we enter a new AI acceleration phase. With the explosion of ChatGPT, data quality should be, more than ever, the number one goal, for obvious reasons linked to responsible AI. We need to make sure that misinformation, for instance, is not going to flood the internet in the upcoming months or years. So yes, it’s very important.
RB: Awesome. I like the way you have put your thoughts. Maybe I have a follow-up question on that. What are the current AI innovations that might be interesting for you or your company?
AS: Right, so I think that ChatGPT is also a revolution, even for us at MLtwist. MLtwist is here to accelerate AI production by building pipelines between the different tools that are out there in the AI ecosystem.
If you think about data labeling tools, data augmentation tools, all the way up to machine learning operations platforms, we want to connect them and build a data flow that is going to be secure, but will also provide high quality data. And ChatGPT is a perfect source of data augmentation or synthetic data that could be used to start training a model. And by the way, we are currently testing an integration with ChatGPT to provide pre-labeling for our own customers. So technically, this is like a revolution for us as well. We at MLtwist extract, load, and transform the data to help it flow across the entire data ecosystem, so it makes sense that we are now trying to leverage ChatGPT as a new source of data.
RB: Beautiful. I like that you folks at MLtwist are already working on these concepts and looking into how to do this kind of integration with ChatGPT, because this is going to be very crucial when we talk about the DataOps platform and how you handle the different GDPR-related aspects of it. I know the future is going to be exciting. And we’ll get into some of your other experiences as well, as our podcast progresses.
Before that, maybe we’ll take a quick break, and then come back and go into the business problems that we are trying to address here.
RB: Welcome back. So far we have covered the introductory details of what we are going to discuss today. Now I want to really get into the business problem that we are trying to address here. Let’s take a step back: before even answering how you are going to solve it, what are the typical challenges faced by customers coming to a data ops platform, and why do you think that is a problem?
AS: Sure. In my sense, the problem right now is the data pipeline. Any AI company needs to build an internal data pipeline that will connect to all the tools they need to do, you know, annotations and train their model. On top of that, they have either built an in-house data labeling tool, for instance, or they have a license to one. And the main problem is the data pipeline. It’s all manual work nowadays when it’s done in-house. It breaks all the time. It’s hard to maintain. It’s hard to scale.
It also takes the data scientists or the engineers working on it away from their main focus, which is building the model for the AI product. It is so complicated that when the time comes to add an additional tool, because they need to expand the type of data annotation, for instance, they just don’t want to do it. It’s just too complicated to plug into a new tool, build an additional data pipeline, transform their data format into the right format accepted by this new tool, get the data ingested into it, and transform it back into the format they need for their own model. This is truly a bottleneck to AI expansion.
RB: Yes, I definitely agree with what you’re saying. In that context, you brought up a very big business problem, a typical challenge that is faced by customers with these data pipelines. But we also see that there are quite a lot of data tools out there in the market to create these different AI models, starting from the data preparation and data management phase, to prototyping and building the machine learning models, and then finally productizing those models by deploying them into an application and monitoring them. So there are a lot of data tools out in the market. What do you think is lacking there?
AS: Right. I’m sure you’re aware that this AI ecosystem is overcrowded. There are literally hundreds of tools out there on the market. In my sense, what’s missing is a sense of… From data labeling to synthetic data, again, or MLOps platforms, or many other tools that sometimes do all of it at the same time, they are not connected to each other in a way that could really accelerate AI adoption across all verticals. They don’t talk to each other. They are very difficult to understand and even to assess for any AI company. And they work well on some parts, but not on other parts. So…
MLtwist is just about that: the pipes. We want to connect all the tools and create automated pipelines that plug in the right tools to create a workflow for each data type, so that AI companies don’t need to choose one single tool anymore. They can use as many tools as they need. And MLtwist will automatically create a different workflow for each of their use cases, so that they can get the high quality data they need to train their model.
RB: Beautiful. So what MLtwist is doing, similar to a few other companies, is holistically understanding the different data tools that are out there in the market and making sure that they do not run in a siloed fashion. For the end user, the end customer, or the developer who is implementing this, it’s easier to handle all these things together, so that when you create these automated pipelines, it’s easier to build your ML models, or even before building the models, to do the data management and, in some scenarios, generate the synthetic data, then build the models and deploy them. I also like the name of the firm, MLtwist: the way you are handling it is like twisting together all these different data tools and making it automated. Beautiful.
AS: Exactly.
RB: So yeah, we are getting into some interesting conversation here. We started with understanding what AI innovations are out there and the data and GDPR aspects, then we went into the different customer challenges. Now that we understand the customer challenges, and that what is lacking is something connecting all this, I want to go a bit more into the solution that is provided. From the standpoint of MLtwist or from your own experience, and I think you have huge experience working on these different data labeling operations, could you share an example of how you have used these technologies in a typical customer scenario? We don’t need to reveal the name of the customer.
AS: Right.
So one of our customers is working on different data types. They have one project using images, some others using text, and some others using audio. The beauty of MLtwist is that we automatically pull the data from the S3 bucket. We batch it out and send it to the right labeling tool after having transformed the data into the tool’s accepted format. We connect the workforce to the labeling tool. And once the data has been labeled, we transform the data back into the customer’s format and apply anomaly detection technology that we have developed to the data in the client’s format, which is truly a big difference: the quality control happens on the data in the client’s format. The customer is currently using three different tools through MLtwist, and they do not have to access any of them, select any of them, or contract with any tool directly. They did not have to build a data pipeline or do any of the pre- and post-processing of the data. In my sense, this is revolutionary. Thanks to that, this particular customer, who needed to deliver a model within three months, was able to deliver it within two months, which clearly gave them an advantage over their competitors.
RB: Beautiful.
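[Editor's illustration] The workflow Audrey describes (batch the data, transform it into the labeling tool's format, label it, transform it back, then run quality control on the customer-format data) can be sketched roughly as below. This is a minimal sketch and not MLtwist's implementation: the `Record` type, the dict-based tool format, the `fake_labeler` stand-in, and the toy anomaly check are all hypothetical names chosen for illustration, and the S3 pull step is left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class Record:
    """One item in the customer's own format (hypothetical)."""
    item_id: str
    payload: str
    label: Optional[str] = None


def batches(records: list, size: int) -> Iterator[list]:
    """Split the pulled records into fixed-size batches."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def to_tool_format(record: Record) -> dict:
    """Customer format -> the labeling tool's format (assumed here to be a dict)."""
    return {"id": record.item_id, "data": record.payload}


def from_tool_format(task: dict) -> Record:
    """Labeled tool output -> back into the customer format."""
    return Record(item_id=task["id"], payload=task["data"], label=task.get("label"))


def find_anomalies(records: list, allowed: set) -> list:
    """Toy quality check on customer-format data: flag missing or unknown labels."""
    return [r.item_id for r in records if r.label not in allowed]


def run_pipeline(records: list, label_batch: Callable, allowed: set, batch_size: int = 2):
    """Batch, transform, label, transform back, then quality-check."""
    labeled = []
    for batch in batches(records, batch_size):
        tasks = [to_tool_format(r) for r in batch]
        labeled.extend(from_tool_format(t) for t in label_batch(tasks))
    return labeled, find_anomalies(labeled, allowed)


def fake_labeler(tasks: list) -> list:
    """Stand-in for a labeling tool plus workforce; leaves empty items unlabeled."""
    return [{**t, "label": "cat" if t["data"] else None} for t in tasks]


records = [Record("a", "img-1"), Record("b", ""), Record("c", "img-3")]
labeled, flagged = run_pipeline(records, fake_labeler, allowed={"cat", "dog"})
print(flagged)  # ['b']
```

Keeping the two format converters as separate functions mirrors the point made above: the quality check runs after the data is back in the customer's format, so swapping in a different labeling tool would only mean swapping the converters and the labeler.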
Yeah, that is amazing to know: there are three different tools, but the customer doesn’t even need to worry about which tools they are. Like you explained, this customer has different data types, from image to text and audio, and your MLtwist platform is pulling all of this together across these different data tools and creating the automated pipeline, and the customer is directly using it. And obviously, if you could deliver it in reduced time for the customer, that’s a big win-win for the customer as well as for MLtwist. Going on from that: irrespective of what kind of data tool the end user or the end customer is using, you have the MLtwist platform that is connecting all these different data tools, automating the pipeline, and delivering it. But now, how do I know, as an end user or end customer? When we talk about machine learning and predictive analytics, we do not expect 100% accuracy, because there is no 100% accuracy in the world. Obviously, there is going to be some kind of… That is where the human in the loop comes into the picture. But before getting there, how do you evaluate the effectiveness of the technology that you are using? Because that is what customers would want to know, right? Like, okay, I have different tools, but I’m using this platform as well, which is connecting all of them. Can you speak a few words about that?
AS: So we evaluate ourselves on quality, speed, and cost. Typically, when our customers come to MLtwist, they have already built their own data labeling pipeline, or overall data pipeline, internally. And they really struggle to scale it or just to maintain it, because it keeps breaking.
They then decide to assess our technology and compare the quality and speed of what they currently have or think that they could have with increased investment. And thanks to using our pre-built integrations, our customers save at least 50% of the cost and also see their quality increase along with their speed.
But really, in general, what matters is the performance of the model. And we have a true impact on it thanks to those three KPIs that we have in mind, which are quality, speed, and cost.
RB: Beautiful. I like the way you have brought up the three KPIs: speed, quality, and cost. I agree that these are three very important KPIs for an end user or customer to evaluate the effectiveness of your tool or platform. So now let’s go to the big question, which I constantly ask the guests who come on my podcast. At the end of the day, it all comes down to the million-dollar question, I would say: how do you make money? What is the differentiation? How do you differentiate yourself from the competition?
AS: So we have several differentiators, including our automated quality control that I mentioned earlier and our proprietary pipeline technology. That said, I think that our biggest differentiator is that we are fully transparent. Our customers have full access to every third-party technology and every labeling workforce that will be brought in. Imagine having complete access to the AI ecosystem, tied together into one single platform. That’s an interest.
RB: Awesome. So that is it, I think: making it completely transparent for the end user or the end customer how you connect all this and make it available when you’re working with all these labeling workforces. Yeah, I agree. I think that’s going to be a great differentiator when you talk about this crowded market of data tools and data ops platforms out there.
Thank you for the detailed answer. I know we have discussed quite a bit. Let us take a quick break. I want the audience to digest this a bit, and then we will come back and do our closing remarks and key takeaways.
AS: Sure.
RB: Welcome back. We have been having an interesting conversation. We talked about the different aspects of data labeling and the importance of data ops in this whole space. Then we also talked about the different challenges and business problems and how to address them, taking the example of what MLtwist is doing. So first I would like to thank you, Audrey, for sharing your wisdom with us.
Additionally, if you can provide some key takeaways, closing remarks, and any additional references you would like to share, that will be helpful. And maybe you could top it off with some thoughts about what you see as the future of AI and its applications, coming from the GDPR and data perspective.
AS: Sure. So thanks for having me, Raghu. It was really a pleasure talking with you today. Data quality, as I mentioned several times already, is our main focus, along with making sure that AI companies get fast and easy access to the tools they need to get high quality data. This is our mission at MLtwist. That said, data quality is also impacted by the composition of the teams working on a model. There is growing evidence to suggest that improving gender equality in AI can lead to better model performance. There is a survey conducted by the AI Now Institute in 2019 that found that diverse teams, including teams with gender diversity, produced AI systems that were more accurate and less biased than homogeneous teams. So I want to close this discussion by saying that I believe we need to work on more accessibility to AI tools, but also on bringing more diversity into the tech industry.
RB: Beautiful. I completely agree with you, Audrey. I’ve also had numerous conversations, and put up some blogs, on this topic of data bias. Data bias happens at different stages, starting from sampling bias during the creation of the data sets, and also during the model building stage, and even when you’re adding functionality. So data bias happens at different stages, and you’ve rightly mentioned that we definitely need to work on that and have a better… So yeah, that’s a very good thought for the future. And thank you for your time.