Data-Centric AI: Why This Trend is Here to Stay
Several months ago, MLtwist had the pleasure of participating in the TWIML AI panel on Data-Centric AI.
https://www.youtube.com/watch?v=mbzgRkQtxU0
Sam: All right, welcome everyone to the broadcast. I'm Sam Charrington, founder of TWIML and host of the TWIML AI Podcast. Today I'm super excited to be joined by an amazing set of panelists to take on the emerging topic of data-centric AI.
Now, data collection, transformation, labeling, and annotation have long represented the most expensive and time-consuming aspects of developing machine learning models, and yet the ML and AI community has emphasized the importance of model training and the associated model parameters, algorithms, and architectures as the key to achieving improved machine learning performance. Data-centric AI has recently emerged as a trend toward a more balanced view. It recognizes that model performance is hugely dependent not just on the quantity of training data, but on its quality. Data-centric AI suggests that investments in data collection, curation, augmentation, and quality-improvement techniques, tooling, and platforms can be more effective at delivering high-performing real-world models at lower cost. We believe this is an important trend for those building real-world machine learning systems, and we're very excited to dig into it in this session.

Before I introduce our panelists, I'd like to share a few quick announcements. Starting this Sunday, July 3rd, the TWIML community will be meeting for a study group following along with the recent Natural Language Processing with Transformers book by Lewis Tunstall, Leandro von Werra, and Thomas Wolf. If you're interested in joining this group, I encourage you to head over to twimlai.com/community to join our Slack community, then join the NLP with Transformers channel for the most up-to-date information. Thanks again to study group host Shan Tsudin for putting together and leading this group.
Next up, I want to officially announce the upcoming TWIMLcon: AI Platforms conference. Join fellow ML/AI practitioners and leaders to explore the real-world challenges of developing, operationalizing, and scaling machine learning and AI in the enterprise. In its third year, the conference will continue to focus on the platforms, tools, technologies, and practices necessary to enable and scale enterprise ML/AI. TWIMLcon: AI Platforms will take place October 4th through 7th and will be hosted virtually for easy access by our global community of ML/AI practitioners and leaders passionate about building, deploying, and integrating ML/AI solutions. As with previous years, we'll be sharing a wide range of technical and case-study sessions, keynote interviews, in-depth workshops, and panels. Learn more and sign up for updates at twimlcon.com, and stay tuned, because we'll be opening up registration, which will be free, very shortly.

All right. Please keep in mind that while I do have questions prepared, we would really love for you to be the main driver of today's conversation. Please send your questions in via the chat, where our team is moderating, and I'll make sure to work them into the conversation. Finally, we're looking forward to bringing you more discussions like this on a wide range of topics; to be notified when we schedule future discussions, subscribe to our newsletter at twimlai.com/newsletter.
All right, and now let's introduce our esteemed panelists. Please join me in welcoming Adrien Gaidon, head of machine learning research at Toyota Research Institute, or TRI; Audrey Smith, COO at MLtwist; Charlene Chambliss, senior software engineer at Aquarium Learning; and Janice, senior director of data science at PayPal.

Alrighty, so let's just jump right in.
I had an epiphany this morning, and it's that I should start out every panel discussion with a dad joke. I know my team is groaning right now, but I'm going to do it anyway. My first question is: does anyone know how you make holy water? You boil the hell out of it. [Laughter] Just trying to set the tone here: we're not taking ourselves too seriously, we're having a good time, and we want to involve our audience in this conversation. By way of introduction, let's start off by having each of you briefly share a bit about your experiences with data-centric AI. Adrien, let's have you start.
Adrien: Yeah, sure. I'm originally a computer vision person, so before deep learning I was dealing with small data; then after 2012, becoming a deep learning person, I was dealing with big data; and for the past six years I've been more of a roboticist, dealing with embodied data. So the type of data, the quantity and quality that's available, and the processes around data-centric AI have evolved over the years, basically along with my different roles. One of the main things I've learned in the past few years is how different embodied intelligence is, how different robotics is, in the sense that it's not just a few lines of Python to get a web crawler, mine a lot of data, and train a robot. It's a bit more complicated than that, because it involves physical interaction with the environment and safety-critical things like self-driving cars. So data-centric AI means different things for different applications, and in the space of robotics it has its own set of challenges, which I'm sure we're going to talk about today.

Sam: Awesome. Audrey?
Audrey: Yes. I was introduced to machine learning almost eight years ago, and I've been in data labeling operations all this time, working on different formats, from audio and video to text and images. Because of this role as a data labeling operations person, I have been focused on data from day one, making sure the highest possible quality is reached in any use case I'm working on. That has meant working with a lot of different people, from product managers to data scientists and domain experts, and also leveraging the best tools on the market for data creation, but also data augmentation, synthetic data, and so much more that I'm sure we're going to cover today. The idea, when you're the data labeling operations person, is really to combine the different stakeholders you need to work with and the right technologies, always with the same focus: reaching the highest possible quality for the data before it hits the machine learning model.
Sam: Awesome. Janice?

Janice: Yep. I've been in the data science and machine learning space for fifteen-plus years, mostly focusing on fraud detection, and all this time data has been a very important piece of the equation. We spend a lot of time making sure the data is of good quality and that we procure the right data, and it's a company-wide effort: different teams focus on different aspects, from having the right data platform to having the right data governance. When it comes to individual projects, we spend a lot of time making sure the quality of the data is good for the particular problem. So data has always been a very important aspect, and over time new techniques have appeared, and recently a renewed, stronger focus on data on top of algorithms. I'd love to hear from different people and different teams about what's new in this space, because data is just a fundamental piece of the overall data science picture.

Sam: All right, awesome. And Charlene?
Charlene: Hey, I'm Charlene. My experience with the topic of data-centric AI comes mostly from my time as an ML engineer at Primer AI, where I built and shipped multiple NLP models to production for diverse tasks like classification, named entity recognition, and relationship extraction. For my first year or so at Primer, I also managed the data labeling operations for the team and helped shape our process into something that could more reliably produce high-quality datasets, but on aggressive timelines. Nowadays I'm with Aquarium Learning as a senior software engineer, and our product focuses on helping ML and ops teams quickly diagnose quality issues with their datasets, find areas of improvement for their models, and quickly sample new data for training. So now I get to work directly on helping folks solve this problem, which is a really fun position to be in.

Sam: Awesome, awesome.
All right, so next I want us to level-set on what data-centric AI really means. On our pre-call I expressed some misgivings about getting caught up in the semantics of this, but Adrien proposed what I thought was a great idea: having each of us try to boil down our thoughts on the topic to a single defining sentence. I'd like to have us share those, and then we can dive into a conversation about them. Adrien, what do you have for us?
Adrien: Right, putting me on the spot! I did my homework, of course. I tried to be as boring as possible, because there's a lot of hype, so my intent was not to create controversy on the definition, at least. The most boring definition of data-centric AI I could come up with is: improve the downstream performance of machine learning models by iterating mostly on the training data. That's the most boring version I could get.
Sam: Audrey?

Audrey: I would say it's a combination of two things. The first is putting the right people in the room from day one, as soon as you start thinking about your model and what data you need, meaning both technical and non-technical people. The second is being able to leverage the right technology for the right use case, specifically on the data preparation side of things.

Sam: Awesome. Janice?
Janice: Yep, I'd probably put it even more simply: a focus on data across the AI solution life cycle. Data is just so fundamental that we need to pay as much attention to it as we do to algorithms, in parallel.
Sam: Charlene?

Charlene: I think Andrew Ng framed it really well when he drew the distinction between data-centric AI and model-centric AI. For a long time the field was really focused on iterating on model architectures, optimizers, and training methodologies, which made sense, because at the time it was primarily driven by academics. Nowadays it's primarily driven by industry and real-world applications, and the difference is that in academia you don't necessarily have infinite money to hire labelers to build gigantic datasets, nor do you have access to a product with millions of users that you can harvest data from. So we're seeing this interesting shift where we're acknowledging: okay, our models are actually really good learners now, so we need to focus on teaching them the right things. It's about prioritizing what goes into the model rather than the model itself, so that we can unlock the business and societal value that comes from that.
Sam: I think mine is similar to Adrien's, and is inspired by some of Andrew's writing and his emphasis on a systematic approach: data-centric AI is a systematic approach to improving and iterating on training data as a way to improve overall AI system performance, stressing the quality of data and labels over their quantity.

So it doesn't sound like we're identifying any controversies, or the poles of the conversation, here; we all generally agree about what data-centric AI is.
I guess the next question is, in a sense: why are we talking about it now? For years and years we've been throwing around the claim that 80 percent of data science is dealing with the data and only 20 percent is the models. No one's ever been able to cite where that comes from, but we have all repeated it; raise your hand if you have not. So why are we talking about data-centric AI like it's new? Any takes on this, Janice?
Janice: Yeah. My take is that over the years there's been a lot more hype around the algorithms: people talking about deep learning, a lot of attention, a lot of conferences about algorithms. But look at the data, which as you said is 80 percent of the work: maybe we're not paying as much attention, maybe we've forgotten some of the fundamentals. There are also newer requirements, like responsible AI, that demand much more focus on data these days. So I think having this branding, and this attention, will help us make sure we take a balanced approach between data and algorithms. That's my take on it: it's about balancing the two sides, data and algorithms, when recently there's been so much hype around the algorithms that maybe we haven't given enough attention to this area.
Sam: But Janice, not necessarily new, in your opinion, right? It's more that we need to make sure we give it enough focus. And Adrien, I thought you had an interesting take on the importance of naming things. Can you elaborate on that?
Adrien: Yeah, it's a Tolkien notion that I really like: there is power in naming things. My collaborators know that I attach a lot of importance to the names of methods and the titles of papers, and I think what's new with data-centric AI is the name. As you suggested, a lot of this has been going around for a while. Actually, while you were talking, it reminded me of the Twitter account Big Data Borat, which joked about the 80 percent of the work being cleaning data; it's been around since the big data era, since "The Unreasonable Effectiveness of Data" paper from Google in 2009. So what changed is not just the name, of course; that's a little bit in jest, although there's truth to it. What changed is deep learning, and it takes a while. Everybody now thinks of 2012, but I lived through 2012: I defended my PhD in 2012, and I was working in convex optimization, kernel methods, learning theory, the Vapnik world. Then overnight you throw all of that in the garbage can, because deep learning, non-convex voodoo, and so on. It took computer vision people a year or two to accept the evidence; for roboticists, depending on the roboticist, some still have yet to accept it.

Deep learning is what changed the name of the game, because before that, SVMs and the like didn't need as much data, and didn't get as much benefit from data; a lot of the earlier benchmarks were saturating. It took Fei-Fei Li, the Stanford professor of ImageNet fame, quite a lot of vision at the time to build a dataset with millions of images, because back then it wasn't clear it was needed. She had almost perfect timing; she was a little early in the game, and then boom, it happened. And it's been continuing: with models becoming bigger and more data-hungry, data started to have a bit more of a place at the table of concerns. With transformers, even more so, because now people think, with evidence, that we can use generic models for a lot of different tasks, and then data becomes the main variable of adjustment. This is the exciting part, because this is where there's a lot of art, more art than science, and that art is slowly turning into practice, into best practices around data-centrism. Hopefully one day it will actually be a science: you'll be able to buy a book on how to build a dataset, just like engineers can buy a book on how to build a bridge. We're very far from that yet.

Sam: Charlene, you had an interesting take, last time we spoke, on the relative payoff of data quality work versus model training in your experience. Can you share that with us?
Charlene: Yeah, totally. One thing that's been interesting to see as an on-the-ground practitioner is that there are a lot more people coming into ML from diverse backgrounds and starting their ML careers right now. What they're finding, when they go to build a model for their organization, is that some of their co-workers are spending more time on the algorithm, tweaking hyperparameters and trying to optimize the architecture in a certain way, while other co-workers are spending more time just iterating on the dataset. And it turns out that the folks who spend more of their time on the dataset are getting their models out the door a lot more quickly, delivering more business value, and getting promoted faster. So from an incentives perspective, even just for practitioners, it makes more sense to focus on the data at this point.
Sam: Awesome. Audrey?

Audrey: I liked what Adrien said about the fact that we're now getting into a data-centric world, and what that means in terms of the people who will be working in that field and making sure the data is right. That's what I was talking to you about last time, Sam: I feel there will be a new role that is really needed, one that has existed for a little while but not as widely as I think it should. That role is the data labeling operations person: someone who is not necessarily technical, but who understands technical requirements and can translate them into simple instructions for the data labeling workforce that will create the data accurately and consistently. That comes with best practices and with general workflows that need to be applied; there are certain recipes that work really well depending on the format of the data and on what the data scientists want to recognize or extract from it. And as you were saying, Adrien, there will be a science around data labeling, and around data prep in general. Maybe one day you'll be able to go to university and become a data labeling operations person. I believe that's the future, so we'll see.
Sam: It sounds like, in a lot of ways, as an industry or as a community, we've got a kind of love-hate relationship with labels and labeling. Data-centric AI is very explicitly a reaction to the costs and challenges associated with hand-labeling data. Thinking back to a recent episode of our data-centric AI podcast series, I spoke with Shayan Mohanty, and he dropped the spicy take that manual labeling is actually harmful to the machine learning process. I'd love to hear any reactions to this idea, and I'll throw it right back to you, Audrey, since labeling is your life.

Audrey: Yes.
I think, again, it goes back to what I just said: there are best practices, workflows, and guidelines to follow on how to label well with humans. If you come in as a data scientist and create your own task without trying to adapt it to the crowd that's going to be labeling it; if you don't think about all the different edge cases that can be contained in your dataset, or about bias, or about how you're going to do quality control and get feedback to the crowd doing the labeling; and if you assume that what's simple for the technical person must be simple for the crowd, so that bad quality means the crowd isn't doing well; then yes, it will be harmful. I think there should be a shift in how we think about it, making sure that the best practices and workflows in place are driving efficiency, consistency, and accuracy, because that's what's required to get high-quality data. Obviously, if you don't do that part right, you're going to get bad data, and that is going to be harmful. That's what I think about it.
about it yeah Charlene i was gonna say um real quick that’s
Manual Labeling
interesting it makes me think of um managing certain data labeling tasks at primer particularly named entity
recognition which is a task with so many edge cases you would never think that it’s that hard to identify like what’s a
person in this document like what’s a location um but it turns out you know sometimes university of arizona is a
location sometimes it’s an organization and that’s the case for like thousands and thousands of other things
um and so i can understand uh his point when he says manual labeling is harmful
because one way that people try to solve for this is with consensus labeling where they just throw like five
different people at every document and they’re like we’ll just average out their annotations
into something usable um but like if you do that you know with hundreds of documents you end up with a difference
of like thousands of entities like depending on whether you take two out of five agreement or three out of five agreement and so at some point that just
like doesn’t make sense because like your model is getting a lot of mixed signals um from using that data and so
it becomes more important to uh you know rather than using consensus just really
focus on ironing out those edge cases and making it really clear kind of what principles and what mental models to
have in mind as the labelers are labeling um and even examples of like how to handle certain edge cases as
opposed to just general vague guidance yeah Janice
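To make the two-of-five versus three-of-five point concrete, here is a minimal sketch of threshold-based consensus aggregation over entity spans. The data and helper are hypothetical, not Primer's actual pipeline:

```python
from collections import Counter

# Each annotator marks entity spans in a document as (start, end, label).
annotations = [
    {(0, 21, "ORG"), (30, 37, "PER")},                    # annotator 1
    {(0, 21, "LOC"), (30, 37, "PER")},                    # annotator 2
    {(0, 21, "ORG")},                                     # annotator 3
    {(30, 37, "PER"), (50, 55, "LOC")},                   # annotator 4
    {(0, 21, "ORG"), (30, 37, "PER"), (50, 55, "LOC")},   # annotator 5
]

def consensus(annotations, min_votes):
    """Keep only the spans that at least `min_votes` annotators produced."""
    votes = Counter(span for ann in annotations for span in ann)
    return {span for span, n in votes.items() if n >= min_votes}

print(len(consensus(annotations, 2)))  # 3 entities survive at 2-of-5
print(len(consensus(annotations, 3)))  # 2 entities survive at 3-of-5
```

One vote of difference in the threshold already changes the entity set on a single document; across hundreds of documents, that is exactly how thousands of entities appear or vanish from the training set, and why clear edge-case guidelines beat averaging.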
Sam: Yeah. Janice?
Janice: I want to add that there are pros and cons to human labeling. Some problems are well suited to it, where things are more black and white, but other problems are blurrier, say where intention is involved. For example, with fraud cases, if we ask a human to label, it's sometimes hard to tell just by looking at the evidence. So we need to employ different approaches: multiple people and a voting system, as Charlene mentioned; a mix with active learning; or sometimes simply observing the longer-term customer reaction. Each way of labeling has its pros and cons, and we need to combine the different methods to get the best labels at scale.
Sam: Awesome. Adrien, I hear you're labeling's biggest fan.

Adrien: Oh yeah, I have a love-hate relationship with labeling. Actually, do you know what a labeling party is? I have a little story. I won the PASCAL VOC challenge, the ancestor of ImageNet, back in 2008; it was a big challenge even at the time. One of the cool things was that it was organized by the late Mark Everingham, who now has a prize named after him for his contributions, and who had designed this benchmark that went on to inspire ImageNet and other things. Winning it put me on the map a little bit in computer vision, and Andrew Zisserman, a professor at Oxford and one of the legends of the field, invited me to spend two weeks there. I thought, great: giving a talk and so on. But there was also a labeling party for the next edition of the PASCAL VOC challenge, where a lot of grad students sat with a MATLAB interface labeling for semantic segmentation. That's where my love-hate relationship with labeling started: I love using labels, I hate producing them.

So that's why I started to work on ways to avoid it, ultimately in the robotics space, and especially in safety-critical applications like self-driving, where it can take up to four hours to label one image, because that 25-pixel pedestrian might be safety-critical. The level of QA, the quality assurance, is held to very high standards, so everybody has hundreds of pages of labeling guidelines and internal annotation teams, or at least internal QA teams. Labeling is serious business, as I'm sure Audrey and basically everybody in data-centric AI can attest.

One of the things that moves the needle for me, my most extreme point of view, is that all the labels should go to the test set. I agree with Janice that the solution is in the gray area: obviously, with semi-supervised learning you need some labels to train; you need a pragmatic approach. But if I state the most extreme ideal in my head, it's that all your labeling budget should go to your test set. You'll always need labels, because evaluation is statistical, so you might as well maximize your test set, because that's really how you get certainty before deploying, and in safety-critical applications you want to be pretty sure that your statistical system is safe.

That was my ideal, and it's why in my research I've worked a lot on auto-labeling, self-supervised learning, and synthetic data: all these ways to avoid the cost of labels at training time. That's my research hat; my engineering hat says, obviously, label part of your training data. The cool thing we found at the intersection of research and engineering is that semi-supervised learning is really bottlenecked by what you do with the unlabeled data. Part of it is choosing which part to label, which is called active learning. But even once you've labeled a portion, you have all the rest: what do you do with it? You can learn very useful representations from the unlabeled side, which is why I think semi-supervised learning is bottlenecked by things like self-supervised learning. That's what we found, both in our research and in our practice.
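Adrien's point that "evaluation is statistical" can be made concrete with a standard binomial confidence interval; this sketch is generic statistics, not TRI's actual validation methodology. The takeaway: uncertainty shrinks roughly with the square root of the test set size, which is what a labeling budget spent on the test set buys you.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """95% Wilson score interval for an accuracy measured as correct / n."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# The same measured accuracy of 95%, at three test set sizes:
for n in (100, 1_000, 10_000):
    lo, hi = wilson_interval(int(0.95 * n), n)
    print(f"n={n:>6}: true accuracy plausibly in [{lo:.3f}, {hi:.3f}]")
```

With 100 labeled test examples, a model that scores 95% might plausibly be anywhere from roughly 89% to 98%; with 10,000 it is pinned down to within about half a percent, which matters when you have to argue a system is safe before deployment.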
Sam: Awesome, awesome. I want to pull in a question from our audience. Audrey, you alluded to good data versus bad data, and Marat is asking: can you, or can the panel, elaborate on what the right data means? What is good data, and how do you know which of your data is good? I'll let you start with that, Audrey.

Audrey: That's a very good question. [Laughter]
I think it starts with having the right people in the room; again, I keep repeating myself. As a non-technical person in what is really a cross-functional position, I try to make sure that everyone talks to each other and understands each other. The idea is really to understand, at the very beginning of a data labeling project, what the machine learning engineer or data scientist wants to do with the data: where do they want to go, how are they going to use it, why do they need that many images annotated, why do they need a timestamp on a video annotation. Once we understand where they want to go, we can create the right task. But it also requires a feedback loop and quality control, because even if I translate everything as well as possible so the crowd can label correctly and consistently, there will always be inconsistencies at the beginning, and there will always be edge cases that weren't covered. That takes a lot of teamwork between the different stakeholders. And having some of the quality control done by the machine learning engineer who wants the data is, in my view, the number one requirement, because they are the ones who will use the data and they need to be involved. I know it's not the most fun part of their job, and not really exciting work, but having them involved is critical to getting the right data in place. So I would say good data means the right data for the machine learning engineer who's going to be working on the model.

Sam: Charlene, any experience with trying to identify the right data?
Charlene: Yeah, I absolutely agree with the way Audrey framed it: it's about how you actually want to use the data. That's going to dictate how you frame the task and how you evaluate performance on it: how strict you need to be about certain rules, whether your customer will be okay with certain kinds of outputs, whether you need to prioritize precision or recall. All of these things are dictated by the actual use case. Those are my thoughts on it.
Sam: What about this idea that, while we've been driving toward more data, more data, more data, in some cases less data is actually the right approach? Does anyone have experience with these ideas, with data curation and things like that?

Charlene: Sure, I have a quick experience with this. Funny enough, it was one of the first models I trained at Primer: a relationship extraction model, where the task was to identify the employer of some selected person. I had about 500 hand-labeled examples that I'd given to the model, and I thought, okay, can I make it even better? So I labeled another 200 examples, went to test the model, and the performance had actually gone down: recall had dropped significantly, by about 0.2 F1. I was like, what the heck? Then I went back and looked at the data, and realized the new examples all followed the exact same pattern, something like "person name, employee at Tesla." The model had basically over-learned that one pattern and forgotten the other kinds of employment phrasing that occur in text. So it's definitely the case that if your data is not diverse enough, you can actually make the model worse by giving it more data.
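A cheap sanity check in the spirit of Charlene's story, before adding a freshly labeled batch to the training set, is to collapse surface forms and see whether a single pattern dominates. This is a rough heuristic with hypothetical data, not Primer's actual process:

```python
import re
from collections import Counter

def template(text):
    """Collapse capitalized words and numbers so examples sharing a surface
    pattern (e.g. '<NAME> <NAME>, employee at <NAME>') map to the same key."""
    return re.sub(r"\b[A-Z][a-z]+\b", "<NAME>", re.sub(r"\d+", "<NUM>", text))

new_batch = [
    "Jane Doe, employee at Tesla",
    "John Roe, employee at Tesla",
    "Mary Major, employee at Initech",
    "Acme Corp announced that its CFO will step down in March",
]
counts = Counter(template(t) for t in new_batch)
pattern, n = counts.most_common(1)[0]
print(f"{n}/{len(new_batch)} examples share the pattern {pattern!r}")
```

If most of a new batch collapses to one template, the model is likely to over-learn that pattern, exactly the failure mode above.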
Sam: Yeah, Adrien?

Adrien: Yeah, I also have a story like that. I was talking earlier about the art, the dark knowledge, that gradually becomes best practice, and being in Silicon Valley has a huge advantage: people chat with each other about these war stories. So here's a bit of mine: sometimes the best thing you can do is delete part of your training set. That seems very weird, because you paid good money for those labels, you have ninety-five percent inter-annotator agreement, and the labeling company says this is the best it gets. What happened was that one engineer on the driving team, Mark, was interested in moving more into ML; he saw it was the future and wanted to learn a little. So in the evenings, when it was just the two of us sticking around the office, I was teaching him some basics, but on our own data. And, something computer vision people don't do nearly enough (I completely agree with Audrey here), we looked at the data. As I showed him things, he kept saying, "that label is weird, that label is weird," and I kept answering, "oh yeah, you're right, but it's just outliers." Mark is very focused, so he went through all the data and flagged everything he thought looked not okay, and that turned out to be half the dataset. He retrained the model, push-button, with just half the dataset, and the results were two percent better, which was a bigger improvement than we'd achieved with any single update before. At first I didn't believe him, so I went and looked at all the data he had rejected, looked at the logs, and I was like: what? And that was about five years ago.

So with data there's the good, the bad, and the ugly, and that's an example of the ugly; you should be revisiting your data periodically. Some of it is data quality, obviously, but some of it is more pernicious, because the world is changing. One of the very big assumptions we make in machine learning, and it's fundamentally flawed, is i.i.d.: everything is independent and identically distributed. Guess what: the world is not i.i.d. Otherwise it would be like the movie Memento, if you've seen it, and that would be terrible. In self-driving, in robotics, and in many other applications (I'm sure Janice can speak to fraud detection), it's sequential decision making. The whole system is one huge feedback loop: decisions impact customer behavior, or the robot's behavior, which impacts the data that comes in. So i.i.d. is a very bad assumption, especially for data-centric work, because now you're shuffling the data. As Léon Bottou, the famous machine learning researcher who popularized SGD in deep learning, put it: nature does not shuffle the data. I think that's a big issue with what we're doing now, and part of what hurts us in the data-centric space.
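Mark's pass through the dataset was manual, but the same idea is often automated with out-of-fold predictions: flag examples where a model that never saw an example's label confidently disagrees with it (the intuition behind confident-learning tools). A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy data with 5% of labels flipped to simulate annotation noise.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
y_noisy = np.where(rng.random(len(y)) < 0.05, 1 - y, y)

# Out-of-fold probabilities: no example is scored by a model that was
# trained on its own (possibly wrong) label.
probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y_noisy, cv=5, method="predict_proba"
)

# Flag examples where the model puts little mass on the stored label.
p_given_label = probs[np.arange(len(y_noisy)), y_noisy]
suspects = np.flatnonzero(p_given_label < 0.2)
print(f"{len(suspects)} examples flagged for human review")
```

The flagged set still goes to a human; the automation just decides where to look first.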
Sam: Got it. I want to take another question from the audience. Mahavir asks: along the same lines, getting the "right data" can be tedious and take a long time. What data points have you all used to convince management that there's ROI in spending that time? Any thoughts on that? We'll start with you, Janice.

Janice: Yep. I would say it's pretty easy to show.
For example, we're a global company, and a few markets are more dominant while some newer markets have less data. Upper management might think we already have a lot of data, so why spend time getting more? Because, as was mentioned earlier, the dominant trends will shadow everything else. We can easily build a model and show: here is a model trained on a lot of data, but it's very tuned to the US market, because that's where we have the most data. What if we want to address newer, emerging regions like LATAM or other countries where we have very little data? When we break out the performance for those markets, we see much lower numbers, because we don't have representation; we don't have enough data points there. So once we break down the performance and show upper management that without that data we'll have good performance in some pockets but much lower performance, or sometimes bias, in others, it's easy to make the case for spending the time. Even though in the overall scheme we have a lot of data, we need to make sure it's representative, covers all the different pockets, and is of good quality; otherwise the performance by itself will show it.
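Janice's "break down the performance" argument is easy to operationalize: report the metric per market instead of one global number. A minimal sketch with hypothetical column names and toy data:

```python
import pandas as pd
from sklearn.metrics import f1_score

# One row per transaction: market, true fraud label, model prediction.
df = pd.DataFrame({
    "market": ["US"] * 6 + ["LATAM"] * 4,
    "y_true": [0, 1, 0, 1, 0, 1, 0, 1, 1, 0],
    "y_pred": [0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
})

# A single global score hides the gap...
print("global F1:", round(f1_score(df["y_true"], df["y_pred"]), 2))

# ...while the per-market breakdown makes the under-served pocket obvious.
for market, g in df.groupby("market"):
    print(market, "F1:", round(f1_score(g["y_true"], g["y_pred"], zero_division=0), 2))
```

A 0.75 global F1 looks healthy while LATAM sits at zero; that per-pocket view is the ROI argument.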
Adrien: I'll add to that. I think Janice, Audrey, Charlene, and I may all be in the same situation, which is that it's easier to convince management when you are management. That's one reason it's important to have technical leadership in machine learning. There are these dual cultures of management: people management, approving expense reports and so on, and then technical leadership. In my experience it's unavoidable to lead from the front in the machine learning space, because it's such a deeply technical and not fully understood space; if you have to educate your manager on top of doing your job, that can be a tough cookie. But performance does speak for itself. So if your manager is not ML-savvy and is doubting the hype (some people are like that, including in robotics), articulate a problem and a benchmark at a scale where management can see the improvement, maybe do a bit more on the side if you're really passionate about it, and let the performance speak for itself. That's why data-centric AI is spreading like wildfire: when Charlene was telling her stories, I was thinking that everybody on this panel has examples of being surprised by focusing on the data for a little bit.

My experience is actually the opposite: it's harder to convince ICs than management, and for the reason Audrey mentioned, which is that working on the data, labeling, and correcting things is not as sexy as "ooh, this new transformer that just came out from DeepMind." Vincent Vanhoucke, the head of Google Robotics, has a great blog post on managing research teams that I think applies broadly to machine learning (even the engineering side has a flavor of research), and it says: shoot shiny objects on sight. That's really hard when you're surrounded by shiny objects, which is a hundred papers on arXiv every day in our field, and I'm not joking. So getting people to focus on the data is not so much an up-management problem; it's more of a down-management problem, in my experience.

Sam: Interesting, interesting.
Janice: I think one thing the whole data-centric AI framing does is bring attention. As I mentioned earlier, there's a lot of hype and attention around algorithms; those are the shiny objects. People think machine learning is so cool; there's a real cool factor to it. So if we bring attention to the fact that data is equally important, equally sexy, we can make sure that across the whole organization we pay enough attention and spend the focus there.
Sam: Continuing this thread of practical takes: we've defined data-centric AI and talked about a lot of its aspects, but I don't know that we've really nailed how you do it. Maybe the question is: how do you start? If you're in our audience and you find this a compelling conversation, or a compelling idea, what do you do? Or, another way to look at the question: for the ICs on the ground, how does their experience change between a model-centric view and a data-centric view? Any takers on that? How about you, Charlene?
Charlene: Yeah, I was just thinking about that. I think framing can really help here. Adrien was getting at this earlier: in my experience it's actually harder to convince ICs to focus on the data, and I think that's because a lot of people got into AI because they think the tech is cool; they're really interested in the math behind the models and how it all works. So to get started with this approach, I think it's important to tie in an understanding of how the data interacts with the model, maybe at a mathematical level, so you can summon that same interest for the data and put the same focus into the data curation process as into the model tuning process.

In terms of actual processes to put in place, it depends on where you're starting from, so it's a little hard to say exactly. But if you've been using off-the-shelf academic datasets: first of all, stop that. You can use those for pre-training, but most of the time you're going to have to go and get custom data for your task, unless you're doing something pretty basic. That means actually working with the subject matter experts in your domain, whether those people are inside your company or outside it and you need to hire them. I'm sure Audrey can say a bunch about this too, but you need to be willing to go define the task, work with all the stakeholders, and figure out exactly what use case you're working toward and how to get that data. But I'll pass it along to someone else who can say more about that.
Sam: Audrey?

Audrey: Yeah. I've seen it vary depending on the company and its knowledge of how to get highly accurate data. I've seen startups without the budget, where the data scientists work on the labeling themselves, and I get it: again, it's not super sexy. But when they work with companies like MLtwist, or with data labeling operations people, and we hold their hand and say, don't worry, we're going to get there, this is how to go about it, it actually becomes quite interesting for them, quite sexy even, because they realize how much they can improve the quality of the data just by following simple steps, following a workflow. I'm very passionate about data labeling, so I guess I'm able to pass that on to the people I work with.

Sam: What are some examples of those simple steps?
Audrey: The simplest ones are those I mentioned earlier. If you're starting out and you know what data labeling you need to do to get the data for your model, try it yourself. That's the number one rule: work on the task yourself, even if it's just 100 images, or 50 images, or a bit of text annotation. Doing the work yourself will really help you understand whether you've covered every potential use case that will appear in your dataset, and it will help you refine your guidelines, the step-by-step guide you give the workforce so they annotate the data the way you want. It's very simple, and it might seem obvious, but it's a very different experience when you try the task yourself.

When it comes to bigger companies that already have data labeling knowledge, the experience is very different, because they already know the impact of a data-centric approach. All of a sudden the discussion changes: it's not about baby steps toward getting there, it's more like, okay, how can we do this even better? What data labeling tool out there can help me get thousands of brands of coffee pods annotated in less than 30 seconds? There we're talking, as Charlene mentioned, about aggressive timelines, about reducing cost and delivery time. It's a much different conversation. So there are definitely different levels, depending on where you start.

Sam: Yeah. I think Janice wanted to jump in as well.
Janice: Sure, yep. First, it's a mindset; we've talked about that. Mindset and culture are very important, so that people have the passion to do this. Second is having the right tools in place. There are a lot of tools out there already, open source or on the market, that will facilitate the work a lot. The principles and the work that's needed are not super complicated; it's about applying the right methodology, and the right tools really help with that. With all of that set up, people will understand the value, and it becomes much easier to accomplish. I guess those are the first few steps.

Sam: Adrien?
Adrien: Yeah, I completely agree with what Audrey and Janice said. I think dogfooding is a really good place to start: building your own datasets, getting your hands dirty; or, if a pipeline already exists, joining it, at least for quality checking if not for labeling. I learned a lot just by quality-checking labels. And labeling parties: I've actually organized labeling parties at work, incentivizing people with pizza and things like that. That was both about getting better data and about getting people to look at the data and understand it a bit more, and it's a huge driver of creativity and of insights into not just the domain but the whole labeling process. So a huge plus-one for that.

One thing I can add to complement it: there's something called Conway's law, which says an organization writes software whose modules mirror the organization's teams. In self-driving cars, for instance, the perception team builds the perception module, the prediction team builds the prediction module, the planning team builds the planning module. The problem is that there's another law, the law of leaky abstractions: all abstractions are leaky. So I think there's a danger in vertical silos, the data silo and the ML silo, which may also have contributed to the slow adoption of data-centric practices in ML teams: there was a data team dealing with the data, while the ML team dealt with the models. ML people need to be more involved in data operations, whether through an organizational structure that puts ML and data under the same roof, or just through really good cross-functional collaboration. That's really important.

In terms of tooling, the first steps we took were around governance, because in safety-critical applications of machine learning like self-driving cars, which are not yet heavily regulated, you have to decide how safe you want to be, and car makers tend to be very conservative. So some of what mattered to us was AI safety, and that starts with simple things in the data space, like traceability. We have an open-source library called the Dataset Governance Policy, or DGP, and it's been instrumental, because there's no "GitHub of data," and at a certain scale it's really hard to do data versioning and tracing. It sounds simple, because for code we have Git and it's so easy, but for data it's very, very hard, especially because of the human in the loop; really, many humans in many loops. So traceability and integration across systems matter. That's also why we were one of the first customers of Weights & Biases: to have experiment management that keeps the code, the data, the experiments, and all the human decisions in between as connected as possible. So data governance is a good thing to have in place, and getting your hands dirty building your own datasets, definitely plus one.
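Even without a full governance stack like DGP, the minimum viable answer to "there is no GitHub of data" is to fingerprint the exact bytes each model was trained on. A sketch, assuming a dataset laid out as files under a directory (this is not DGP's actual API):

```python
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(root):
    """Content-hash every file under `root`, so any added, removed, or
    edited example yields a new dataset version ID."""
    h = hashlib.sha256()
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h.update(str(path.relative_to(root)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()[:12]

# Record the fingerprint alongside each training run, so every model can
# be traced back to the exact data it saw.
metadata = {"dataset_version": dataset_fingerprint("data/train")}
Path("run_metadata.json").write_text(json.dumps(metadata))
```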
Sam: Continuing on this thread of tools: in the context of data-centric AI we hear a lot about different tools and technologies, and the things that come to mind are data creation and curation, synthetic data, programmatic labeling, weak supervision, and active learning. I'm curious what we think: are these foundational tools, required for data-centric AI? Are they important things to have in the toolbox but not necessarily foundational? Or are they shiny objects that are probably more of a distraction?
Adrien: I'm a huge proponent of self-supervised learning and simulation, so I definitely don't think they're shiny objects. One thing people underestimate is the accessibility of data. In certain safety-critical scenarios, or for privacy or ethical reasons, you can't just harvest the data from all your users; in a lot of applications you can't, or you don't want to. So access to data is a big challenge. Often you have access to a lot of raw data but you cannot, or don't want to, label it, yet it's still part of the real world your system will operate in, so there's something to learn from it. That's where self-supervised learning comes in. The main question there is whether self-supervised learning absorbs, or worse, amplifies, the biases in your data. That's why we've done theoretical studies, including an ICLR paper we presented recently, showing that at least for imbalance, which is a very natural bias, ubiquitous in data, self-supervised learning actually learns features that are more robust to that bias than supervised learning does. So that was a great finding.

The other one is simulation: if you don't have access to your own data, you need to create it, and in robotics a big way to do that is generating synthetic data in simulation, especially when you want to learn to avoid crashes. You shouldn't have to crash millions of real cars to learn that from experience. Overall it maps onto a very simple view of the world with three types of variables. There are the known knowns: what you can label. There are the known unknowns: things you know exist, like a dead animal on the side of the highway. That's a real example where people told me we needed to make sure the car behaves correctly; well, what am I going to do, take a shotgun and scatter animal corpses along the highway? No, that sounds insane. But in simulation you can very easily generate scenarios for that; it's called programmatic data generation, programmable data, and that's what simulation is really good at: the known unknowns. Then you have the heavy tail of the unknown unknowns, all the crazy things that can happen in the world, and they do happen. Your model either has to be robust to them, meaning able to say "you shouldn't trust my prediction here" (out-of-domain detection and so on), or it needs access to all the data, to learn from all of it and recognize when something unusual happens, even if only to say "don't trust me." That's where, for me, self-supervised learning comes into the picture. So again, we need everything at the table; the problems we're dealing with are so hard that every tool is needed.
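The "don't trust me" behavior Adrien describes has a well-known minimal baseline: thresholding the model's maximum softmax probability (Hendrycks and Gimpel's approach). A sketch with toy probabilities:

```python
import numpy as np

def should_abstain(probs, threshold=0.6):
    """Maximum-softmax-probability baseline: if the model's top class
    probability is low, abstain instead of predicting."""
    return probs.max(axis=1) < threshold

probs = np.array([
    [0.97, 0.02, 0.01],   # confident: trust the prediction
    [0.40, 0.35, 0.25],   # diffuse: likely out of domain, route to a human
])
print(should_abstain(probs))  # [False  True]
```

Production systems layer much more on top (calibration, dedicated out-of-domain detectors), but the abstain-when-unsure contract is the same.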
Sam: Any additional takes on tools?
Charlene: Yeah. I think some of these things can be more useful than others depending on the area. One thing you run into in NLP is that data augmentation isn't really a thing, because you can't just substitute a word in a document for a synonym, or an ostensible synonym, and expect the document to mean the same thing, or even to keep the same label afterwards. So you have to get a bit more creative about how you obtain new documents to label. One thing that helps a lot is similarity search: if you project all of your documents into an embedding space, you can search that space using nearest-neighbor search or something like it. Say you're trying to increase recall specifically: you take all your positive documents, project them, and then select more documents from your unlabeled set that are close to those in embedding space. That's one way something like that can help.
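Here's a minimal sketch of that mining-by-similarity loop. TF-IDF stands in for the neural sentence embeddings you would use in practice; the mining logic is the same:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

labeled_positives = [
    "Jane Doe is employed at Acme Corp as chief technology officer.",
    "The startup hired Bob Smith as head of engineering.",
]
unlabeled_pool = [
    "Acme Corp announced record quarterly earnings.",
    "Maria Garcia joined Initech as a staff engineer.",
    "The weather in Toledo was unseasonably warm.",
    "Initech employs over 4,000 people worldwide.",
]

# Project everything into the same vector space.
vec = TfidfVectorizer().fit(labeled_positives + unlabeled_pool)
pos = vec.transform(labeled_positives)
pool = vec.transform(unlabeled_pool)

# For each labeled positive, pull its nearest unlabeled neighbors:
# these likely positives go to the labelers next.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(pool)
_, idx = nn.kneighbors(pos)
for i in sorted(set(idx.ravel())):
    print(unlabeled_pool[i])
```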
Active learning also helps a lot when you're still initially exploring the space. One algorithm we implemented at Primer was coreset, a form of active learning that prioritizes covering the entire data distribution. There's some very fancy math involved, but there are implementations you can use, and essentially it ensures that you've explored the diversity of the data even within your first 50 to 100 examples, instead of taking a random sample and just hoping that most of the things and edge cases you'd want to see are in it. So you definitely have to get creative with these methods depending on, as Adrien was saying, what level of access you have to the data, how much other data you have, and various other constraints.
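And a sketch of the coreset idea (greedy k-center selection, after Sener and Savarese's core-set paper): repeatedly pick the unlabeled point farthest from everything chosen so far, so the first handful of labels already spans the distribution:

```python
import numpy as np

def greedy_coreset(X, k, seed=0):
    """Greedy k-center selection over embedding vectors X (n_points, dim)."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]
    # Distance from every point to its nearest selected point so far.
    d = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(d.argmax())                      # farthest uncovered point
        selected.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# Toy embeddings: three well-separated clusters.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(c, 0.1, size=(100, 2)) for c in (0, 5, 10)])
picks = greedy_coreset(X, k=6)
print(X[picks].round(1))  # the six picks cover all three clusters
```

A random draw of six points could easily miss a cluster; the greedy k-center picks cannot, which is the "cover the distribution" property Charlene mentions.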
Sam: Got it, got it. So if the question is whether it's about people, process, mindset, or tools, the answer is yes, yes, yes, and yes. We're coming up on the top of the hour, so let's do a really quick round of your top takeaway for our audience, and if you can keep it to less than three words, all the better. Adrien?
Adrien: Ooh. Listen to our discussion with Sam about principle-centric AI, because I think that with data-centric AI we don't yet know how to design datasets, and we need to, in order to be able to teach machines. That means designing datasets and injecting what is not in the dataset. We didn't talk about it today, but we made a whole episode about it with Sam, so people can check it out.

Sam: Awesome. Audrey?
Audrey: That's a very tough one. By now we all know that data quality is the new oil, the new gold, or however you want to call it, and that you need a lot of different people in the room. As we chatted about last time, Sam, we need to get more people in the room who come from different backgrounds, because that will help with a lot of the issues that come with a data-centric approach: how do we get more ethical data, how do we remove bias, and so on. Having a mix of different people really helps, on top of leveraging all the different technologies. To come back to what Charlene and Adrien were discussing: I believe the technologies out there are all great, and we're very lucky to be using them right now, but depending on the use case, the industry, and the field, not all of them should be used. There should be a way to unify the data labeling ecosystem and just pick the right tool when we need it, on top of the right people.
Sam: Awesome. Janice?

Janice: Overall, I would reiterate that data is as important as algorithms, and just as sexy as algorithms; there's a lot going on in this space.

Sam: Data is sexy, yeah.

Janice: Yeah. We've talked about a lot of things in this one hour; there's a lot of activity going on, and it requires a lot of deep skills as well. So I would encourage everyone in the AI and machine learning space to pay as much attention to data, and to build the knowledge of what it takes to do data-centric AI: how to apply the right quality, how to do synthetic data. There's a lot going on, and a lot to learn.
Charlene: Yeah, I totally agree. If I had to summarize mine in three words, I'd say: get better tooling. Right now there are so many companies coming out with really interesting MLOps tools that can help you with this process of iterating on your dataset and iterating on your model. You don't have to write your own scripts for all this stuff anymore; you can literally just buy a solution. So first, check out better labeling tools, because there are labeling tools out there with QA built into the labeling process, which helps your labelers do the right thing and avoid the wrong thing, and that results in better-quality data for you. And secondly, data management: I'm a little biased, but check out Aquarium for dataset management. We help you keep track of quality issues, team members can collaborate on issues with the dataset, you can explore your data really easily with the embedding view, and there are many, many other features. So you definitely don't have to do all this alone; get the right tools, and they'll help you solve your problem with a lot less tedium.
Sam: Awesome, awesome. All right, we're going to wrap up. I want to start by thanking our panelists for their insights and contributions to this session, and thanks to the team for pulling this together. The recording of today's discussion will be available on YouTube immediately, so if you want to share it with your friends, just send them the YouTube URL. Big thanks to everyone who tuned in, and once again to our fantastic panelists. To stay up to date on the next one, please visit twimlai.com and sign up for our newsletter; you can also follow us on Twitter at @twimlai. And of course, be sure to sign up for updates at twimlcon.com. Thanks so much, everyone!

Panelists: Thanks, everyone. Thanks, Sam. Thank you, bye!