MLtwist COO Audrey Smith recently spoke to Humans of AI about data pipelines for machine learning, the need for diversity in data teams, and the importance of data processing and labeling in machine learning.
Listen to the episode here https://vimeo.com/875044363?share=copy or browse the transcript below for her insights.
Summary
In this episode of Humans of AI, host Sheikh Shuvo interviews Audrey Smith, the Chief Operating Officer of MLtwist. Audrey discusses her role in automating data pipelines for machine learning products, using driverless cars as an example. She shares her career journey, transitioning from being an in-house lawyer in France to entering the tech industry, and highlights the importance of diversity in data teams to combat bias. Audrey also introduces MLtwist, a platform aimed at streamlining the data preparation process for AI, addressing the challenges of data pre-processing and labeling.
Timeline:
00:01-01:27: Introduction and Audrey’s role in automating data pipelines for AI products, with an example of driverless cars.
01:27-03:13: Audrey’s career journey from being a lawyer in France to working with tech giants like Google, Apple, and Amazon.
03:13-04:35: How Audrey’s non-technical background and gender contribute to her unique perspective on data management and bias in AI.
04:35-05:57: Introduction to MLtwist and the challenges it addresses, including the bottleneck in data preparation for AI.
05:57-08:15: The complexity of data labeling and the need for specialized tools based on data types.
08:15-09:36: Exploring the potential use of synthetic data platforms.
09:36-10:34: Implementing technical solutions for responsible data use, including data caps and guidelines.
10:34-12:36: The impact of regulatory frameworks like the EU AI Act on data use, especially in European markets.
12:36-14:37: Audrey’s role as the Chief Operating Officer at MLtwist, encompassing a wide range of responsibilities.
14:37-15:41: Upcoming conferences and events, including ODSC West and a demo day with Stage 2 Capital.
15:41-17:09: Encouragement for those interested in DataOps, emphasizing the importance of passion, detail-orientation, and learning on the job.
17:09-18:29: The ubiquity of AI across industries and the need for more data professionals in the field.
18:29-18:42: Audrey’s contact information for those interested in connecting with her.
18:42-End: Closing remarks.
Transcript
00:01
Hi everyone, I’m Sheikh, and welcome to Humans of AI, where we meet the people that make the magic of AI possible. Today, we’re chatting with Audrey Smith, the Chief Operating Officer of MLtwist, where she works on automating data pipelines and much more. We’ll dive into exactly what that means in just a bit. Thanks for joining us, Audrey. Thank you for having me. My very first question for you, Audrey, is how would you describe your job to a five-year-old?
00:31
That’s a tough question. I think I can start by saying that I help prepare data for machine learning products, or AI products, I would say, maybe. A good example is driverless cars. So you know, like you have several models that were trained on specific data for the car to recognize pedestrians and other cars and signals on the road and lanes,
01:00
different types of lanes. So I’m one of the people who are helping provide that type of data to be labeled so that it can feed the model and the car can make decisions accordingly and safely. That sounds like a very advanced and precocious five-year-old, but that’s a great answer. Yeah, cool, awesome. Well, and looking at your
01:27
background, Audrey, you’ve had a very wide-ranging career across ML teams, with experience at some of the biggest companies like Apple, Google, Amazon and more. How exactly did you start in this world? Take us through your career story.
01:45
Yeah, so I changed careers twice, actually. I was an in-house lawyer in France. And so when I arrived in Silicon Valley in 2014, I had to decide what I was going to do, because there was no transfer of knowledge. I had to pass more exams if I wanted to work in that field again. And I was always attracted by the tech industry, so I looked into it.
02:14
I was thinking, okay, how can I enter the field? And one way was to think about my language skills. And at the time, a very famous voice recognition software that is developed in Cupertino was looking for French speakers. And so that’s how I started in the labeling field. I was literally doing labeling when I started, but then I got really into it. I understood what it meant, what that means for the model.
02:44
And then I went to Google, worked on managing several projects on the quality control side of things. And from there I went to Amazon and slowly, slowly I went up the ladder. I ended up spending four years at Amazon, where I was helping ML engineering teams with computer vision projects. Then I went to Labelbox for two years, similar things, helping Labelbox customers
03:13
with data orchestration and making sure that the data they were getting for their models was really high quality. Awesome. Do you think, given that your background started as a lawyer, with that academic training, do you think you have a different lens on machine learning from others at all?
03:38
That’s a very cool question. I think so. I’m not a technical person whatsoever, but I understand the technicality of it, and I understand also the operational side of it, which I think is very interesting for several reasons. The first one is definitely I have another way of looking at things. I can talk to different stakeholders from like ML engineering team to the business team to the
04:08
like I did at Amazon, or the marketing team. And also I am really big into data bias and how to fight it. And I think being a woman, non-technical in that field is making a big difference because the way I’m going to look at data management or data quality is going to be very different from an ML engineer or data scientist. And that’s just what…
04:35
in my opinion, should happen in the future. Every data team should have a mix of skills and a mix of educational backgrounds to make sure that data bias is tackled the right way. That seems like a good principle for all teams to have. Absolutely. Well, given that, tell us more about MLtwist and exactly what data labeling ops is.
05:04
Yeah, so MLtwist was founded because of two different things that we noticed with my co-founder. The first one was, so on his side, he was in the ad tech industry for 20 years buying data for AI and he realized at one point that the data was stuck
05:29
at the data science team level because it’s a lot of work to get data ready for AI. The pre-processing of the data is cumbersome. It’s not sexy, it’s not exciting, it’s time consuming. And that’s something that was, you know, things were not getting out of that place and they were not used in the way they should have been used for AI models.
05:57
On my end, I realized a lot of different things that were very surprising. Companies that were publicly announcing that they were buying data labeling tools, they were acquiring them, and then a few months later they were looking for an additional one. And you’re like, okay, that’s interesting. Something is going on. And in the past 10 years I’ve been working in the data operation field, I’ve seen data labeling become more and more complex.
06:27
and the ecosystem as well becoming more and more complex. So you have like hundreds of tools out there and most of them are claiming that they can work on any data type. The reality is a bit different. The reality is that every tool that is out there is gonna be great at certain use cases, is gonna be great at certain data types,
06:52
but they cannot cover it all. So if you have enterprise-level companies like we work with that are working on text, image, and video, ultimately, if you wanna have the highest quality possible for your data, you’re gonna have to go with the best tool for it. So this customer is currently working with three different tools to make sure that they get the best quality out of them. Interesting. So we…
07:19
Yeah, so we decided to work on that. We decided to not reinvent the wheel. We decided to just stitch all the pieces together, the entire ecosystem, and say, hey, whatever you’re working on, there is the best tool out there. And we’re going to create automatically with no integration and no code the workflow that’s going to stitch all the pieces you need for your use case together, from the data labeling tool to the MLOps platform.
07:47
If you need zero-shot or if you need like pre-labeling, we can also work on that. And all the way to the workforce, and then reformat the data in the data format you need to train your model. So everything is just in one place. And so that lets customers avoid going after different tools, assessing them, contracting with them, acquiring them, and then just…
08:15
focus on training the model. The value add is extremely clear there. Do you end up working a lot with synthetic data platforms as well?
08:32
We are connecting with a synthetic data platform, but as of right now, there is not really a demand for it. So it’s something that I think is gonna happen in the future. I know that back in the day, in my previous life, we were thinking about using augmented data or synthetic data. As of right now, the customers that we have are not taking it. Okay, interesting. Now,
09:01
earlier, one of your comments was around sort of the responsible use of data there and making sure you have diverse viewpoints and team members to make sure that’s happening there. Outside of making sure that a diverse group of people are in the room making that decision, are there any technical ways that help to make sure that
09:31
data is being used responsibly?
09:36
Yeah, there are a lot of ways that are being developed at the moment. The one that I’m looking at very seriously and that I would like MLtwist to implement in the very near future is a data cab that has been developed at AWS by Peter Haliman, who was actually my boss when I was there. And the goal is to be able to have some sort of stamp
10:05
on what’s going on with the data from data collection to the data that is used for the model to make sure that it’s following guidelines. It could be internal guidelines. It could be also at the legal level, depending on where you are. Is the data going to certain countries? Is it getting out of the cloud? Things like that. So I think that’s a really good way to make sure that…
10:34
the companies that are developing the model become really responsible for what happens to their data and they stop saying, well, we didn’t know. And that can include also the workforce. And that can make sure that when you are also working with the workforce, you’re going to be able to select a group of people that are going to be representative of the future customers that you’re going to have.
11:03
Because it’s very important that the labelers are reflective of the people that are going to be using your product. Otherwise, you’re going to have bias in your data and it’s always going to be there, but you need to try at least to reduce it. It almost sounds like having nutrition label guidelines for ML models there. Exactly. That’d be awesome. Yeah. Also, it’s scary to see.
11:32
I think it’s necessary. I think that everyone, like every actor in the field, should be responsible for what they do. Along those lines, with all the buzz from the EU AI Act and different types of regulatory frameworks coming up in the US as well, have you seen that
11:59
manifest in customer conversations you’re having, are people talking about it?
12:07
Yeah, definitely. Depending on who you’re talking to. So our customers are essentially US-based. But as soon as you start talking with prospects that are in Europe, that’s really one of their main concerns: where is the data going? Is it going to leave the cloud? I cannot. You need to be able to keep my data stored in Europe. So that’s definitely part of the conversation.
12:36
And that’s something also that we are well aware of in California, which is also really into implementing some sort of guidelines there as well. And yes, this is definitely something that is discussed as soon as you get into sales conversations with companies. Shifting gears a little bit, what’s it like being the chief operating officer at a
13:06
startup, what fills your day to day?
13:11
I am doing everything that needs to be done. It could be anything, whatever needs to happen in the day. So that’s very different from my previous jobs. But I’m also happy to do it because I’m learning so much every single day. So I’m helping with the fundraising because we’re raising for our next round. I’m helping with hiring.
13:40
but also anything in between like if I need to make coffee, I will make coffee. So it’s a very, every day is very different, but it’s very exciting. It’s just a lot is going on right now and that’s like really a nice place to be. Yeah, awesome. Are there any major conferences or events coming up that you and your team will be at?
14:09
So David, our CEO, is going to present at ODSC West. I think it’s coming up soon. Honestly, I forgot about it. You’re busy doing everything else. We are, yeah, I need to get better at that. And then we are also part of the Catalyst program with our investor, Stage 2.
14:37
And there’s going to be a demo day coming up, I think. It’s on the 24th of October, so next week. So that’s like two things that are coming up. Let’s see, two more questions for you, Audrey. One is, say I’m just starting my career right now and I’m really interested in data pipelines and responsibly using data to train ML models. What types of
15:05
roles and companies would you suggest I look at?
15:11
That’s a really good question. So DataOps, the first thing I wanted to mention is that I created a group on LinkedIn, a community for DataOps people. That’s called DataOps for AI. Because I was really looking at it in the way that we were kind of invisible. Everyone is talking about data-centric AI. And that’s the core of what we do every single day. We are focused on quality, whatever it means.
15:41
And we are talking a lot about data scientists, we are talking a lot about ML engineers, and that’s generally really much needed, but the data ops people are really the ones that are cross functional, as I mentioned earlier, and they are kind of bridging the gap between the different stakeholders and making sure that everything runs smoothly for the data pipes and for the data workflow and for the quality of the data.
16:11
I think that there is no degree. The only thing that there is is just like jumping in, and you can come from a lot of different educational backgrounds as long as you are detail-oriented, passionate about AI, and you like having business conversations and also like having some technical discussions. When I started at Amazon, I was part of…
16:39
meetings with ML engineers and I had no idea what they were talking about. For the first three months, I was like, what am I doing here? But then you just learn on the go. There are a bunch of resources that you can check to learn, but there is nothing that’s going to replace just going for it and then start at the very beginning of it like I did. It could be labeling, it could be quality control and go for it.
17:09
AI is everywhere. This is literally everywhere at the moment, from the insurance companies to, I mean, like if you think about like gardening shops, they do AI right now. It’s everywhere, and like I was in a position at Labelbox or even right now to check that this is happening across like all industries, so that’s a very exciting time for us,
17:36
and we’re going to need more and more DataOps people to help. So it can be, there are a lot of keywords that you can look for, from labeling, to annotation, to quality control for data, to workforce management, to data pipeline. It’s just, there is no one single title. And even at Labelbox, I had to come up with my own titles for my team because that doesn’t exist. So it’s pretty creative in that field. And just like…
18:06
you have to look for other different keywords. Your enthusiasm for it is certainly contagious. I’ll make sure to put a link in for the DataOps for AI LinkedIn group in the description here. And the very last question for you, Audrey, is if someone wants to chat with you to learn more about MLtwist, what’s the best way to connect with you and your team online?
18:29
LinkedIn. OK. That’s the best place. You can reach out to me at audrey at mltwist.com. Nice and easy. Well, Audrey, thanks again for the time. This has been a lovely chat.
18:42
It was really great. Thank you so much.