Data-Centric AI: Why This Trend is Here to Stay
Several months ago, MLtwist had the pleasure of participating in the TWIML AI panel on Data-Centric AI.
https://www.youtube.com/watch?v=mbzgRkQtxU0
SC: All right, welcome everyone to the broadcast. I'm Sam Charrington, founder of
TWIML and host of the TWIML AI podcast. Today I'm super excited to be joined by
an amazing set of panelists to take on the emerging topic of data-centric AI.
Now, data collection, transformation, labeling and annotation have long represented the most expensive and
time-consuming aspects of developing machine learning models and yet the ML and AI community has
emphasized the importance of model training and associated model parameters, algorithms, and architectures as the key
to achieving improved machine learning performance.
Data Centric AI
Data-centric AI has recently emerged as a trend towards a more balanced view. It recognizes that model performance is hugely dependent not just on the
quantity of training data but on its quality. Data-centric AI suggests that
investments in data collection, curation, augmentation, and quality improvement techniques,
tooling, and platforms can be more effective at delivering high-performing real-world models at lower cost.
We believe this is an important trend for those building real-world machine learning systems, and we're very excited
to dig into it in this session.
Study Group
Before I introduce our panelists, I'd like to share a few quick announcements. Starting this Sunday, July 3rd, the TWIML community will be meeting for a study group following along with the recent NLP with Transformers book by Lewis Tunstall, Leandro von Werra, and Thomas Wolf. If you're interested in joining this group, I encourage you to head over to twimlai.com/community to join our Slack community, then join the NLP with Transformers channel for the most up-to-date information. Thanks again to study group host Shan Tsudin for putting together and leading this group.
TWIMLcon AI Platforms
Next up, I want to officially announce the upcoming TWIMLcon AI Platforms conference. Join fellow ML/AI practitioners and leaders to explore the real-world challenges of developing, operationalizing, and scaling machine learning and AI in the enterprise. In its third year, the conference will continue to focus on the platforms, tools, technologies, and practices necessary to enable and scale enterprise ML/AI. TWIMLcon AI Platforms will take place October 4th through 7th and will be hosted virtually for easy access by our global community of ML/AI practitioners and leaders passionate about deploying, building, and integrating ML/AI solutions. As with previous years, we'll be sharing a wide range of technical and case study sessions, keynote interviews, in-depth workshops, and panels. Learn more and sign up for updates at twimlcon.com, and stay tuned, because we'll be opening up registration, which will be free,
very shortly.
Questions
All right, please keep in mind that while I do have questions prepared, we would really love for you to be the main driver of today's conversation. Please send your questions in via the chat, where our team is moderating, and I'll make sure to work them into the conversation. Finally, we're looking forward to bringing you more discussions like this on a wide range of topics. To be notified when we schedule future discussions, subscribe to our newsletter at twimlai.com/newsletter.
Introductions
All right, and now let's introduce our esteemed panelists. Please join me in welcoming Adrien Gaidon, head of machine learning research at Toyota Research Institute, or TRI; Audrey Smith, COO at MLtwist [hello]; Charlene Chambliss, senior software engineer at Aquarium Learning [hey]; and Janice, senior director of data science at PayPal [hi everyone]. Alrighty, so let's just jump right in.
I had an epiphany this morning, and that is that I should start out every panel discussion with a dad joke.
I know my team is groaning right now, but I'm gonna do it anyway. My first
question is: does anyone know how you make holy water?
You boil the hell out of it. [Laughter]
Just trying to set the tone here: we're not taking ourselves too seriously, we're having a good time, and we want to involve our audience in this conversation.
Personal Experiences
By way of introduction, let's start off by having each of you briefly share a bit about your experiences with data-centric AI. Adrien, let's have
you start.
Adrien: Yeah, sure. I'm originally a computer vision person, so before deep learning I was dealing with small data. Then, after 2012, I became a deep learning person, so dealing with big data. And now, for the past six years, I've been more of a roboticist, dealing with embodied data. So the type of data, the quantity and quality that's available, and the processes around data-centric AI have evolved over the years, basically along with my different roles. I think one of the main things that I learned in these past few years is how embodied intelligence is very different, and how robotics is very different, in the sense that it's not just a few lines of Python to get a web crawler, mine a lot of data, and train a robot. It's a bit more complicated than that, because it's related to physical interaction with the environment and safety-critical things like self-driving cars. So data-centric AI, I think, means different things for different applications. In the space of robotics, it has its own set of challenges, which I'm sure we're going to talk about today.
SC: Awesome. Audrey?
Audrey: Yes. I was introduced to machine learning almost eight years ago now, and I've been in data labeling operations for all this time, working on different formats from audio and video to text and images. Because of this role as a data labeling operations person, I have been focusing on data from day one, making sure that the highest quality possible is reached for any use case that I was working on. That meant I was able to work with a lot of different people, from product managers to data scientists and domain knowledge experts, while also leveraging the best tools on the market, when you mention data creation, but also data augmentation, synthetic data, and so much more that I'm sure we're going to be covering today. The idea is really to have this combination, when you're a data labeling operations person, between the different stakeholders that you need to deal with and the right technologies, always with the same focus, which is reaching the highest quality possible for the data before it hits the machine learning model.
SC: Awesome. Janice?
Janice: Yep. I've been in this data science and machine learning space for 15-plus years, mostly focusing on fraud detection. All this time, data has been a very important piece of the equation. We spend a lot of time making sure that the data is of good quality and that we procure the right data. It's a company-wide effort: we have different teams focused on different aspects, from having the right data platform to having the right data governance. And when it comes to individual projects, we spend a lot of time making sure that the quality of the data is good for the particular problem. So I guess data has always been a very important aspect, and over time there are new techniques and, recently, a renewed focus on data on top of algorithms. So I'd love to hear from different people and different teams about the new things that have been introduced in this space, because data is just a fundamental piece of the overall data science space.
SC: All right, awesome. And Charlene?
Charlene: Hey, I'm Charlene. My experience with
the topic of data-centric AI comes mostly from my time as an ML engineer at Primer AI, where I built and shipped multiple NLP models to production for diverse tasks like classification, named entity recognition, and relationship extraction. For my first year or so at Primer, I also managed the data labeling operations for the team and helped shape our process into something that could more reliably produce high-quality datasets, but also on aggressive timelines. Nowadays I'm with Aquarium Learning as a senior software engineer, and our product focuses on helping ML and ops teams quickly diagnose quality issues with their datasets, find areas of improvement for their model, and quickly sample new data for training. So now I get to work directly on helping folks solve this problem, which is a really fun position to be in.
SC: Awesome, awesome.
Definitions
SC: All right, so next I want us to level set on what it is that data-centric AI really means. On our pre-call, I expressed some misgivings around getting caught up in the semantics of this, but Adrien proposed what I thought was a great idea, and that was having us each try to boil down our thoughts on the topic to a single defining sentence. So I'd like to have us share those, and then we can dive into a conversation about them. Adrien, what do you have for us?
Adrien: Right, putting me on the spot.
SC: So you forgot your homework?
Adrien: I did my homework, of course. I tried to be as boring as possible, because there's a lot of hype, right? Let's not create controversy on the definition; at least that's my intent. So the most boring definition of data-centric AI I could come up with was: improve downstream performance of machine learning models by iterating mostly on the training data. That's the most boring version I could get.
SC: Audrey?
Audrey: I would say it's a combination of two things. The first one is putting the right people in the room from day one, as soon as you start thinking about your model and what data you need, meaning technical and non-technical people. The second is being able to leverage the right technology for the right use case, and, sorry, just to add, on the data preparation side of things.
SC: Awesome, awesome. Janice?
Janice: Yep. I probably would put it even more simply: it's a focus on data in the AI solution life cycle. Because data is just so fundamental, we need to pay as much attention to it, in parallel with algorithms.
SC: Charlene?
Charlene: I think that Andrew Ng framed it really well when he framed the distinction as data-centric AI versus model-centric AI. For a long time the field was really focused on iterating on model architectures, optimizers, and training methodologies, which makes sense, because at the time it was primarily being driven by academics. But nowadays it's primarily being driven by industry and real-world applications. The difference there is that in academia you don't necessarily have infinite money to hire labelers to make these gigantic datasets, nor do you have access to a product with millions of users that you can harvest data from. And so we're seeing this interesting shift where we're acknowledging: okay, our models are actually really good learners now, so we need to focus on teaching them the right things. It's about prioritizing what goes into the model rather than the model itself, so that we can unlock the business and societal value that comes from that.
SC: Yeah, I think mine is similar to Adrien's, and is inspired by some of Andrew's writing and emphasis on a systematic approach: data-centric AI is a systematic approach to improving and iterating on training data as a way to improve overall AI system performance, stressing the quality of data and labels over the quantity. So it doesn't sound like we're identifying the controversies or the poles of the conversation here. We all generally agree about what data-centric AI is.
Why Data Centric AI
I guess the next question is, in a sense, why are we talking about it now? For years and years we've been throwing around this idea that 80 percent of data science is dealing with the data and only 20 percent is the models. No one's ever been able to cite where that comes from, but we have all repeated it, I'm sure. Raise your hand if you have not. Why are we talking about data-centric AI like it's new? Any takes on this? Janice?
Janice: Yeah, my take is, I guess, over
the years there's been a lot more hype around algorithms. People are talking about deep learning; there's a lot of attention and a lot of conferences about algorithms. But look at the data, as you said, 80 percent of the work: are we maybe not paying as much attention? Maybe we have forgotten some of the fundamentals. I guess there are also some newer requirements, like responsible AI, that mean we need to spend a lot more focus on data these days. So I think having this branding and this attention will help us make sure that we have a balanced approach between data and algorithms. That's my take on it: it's a kind of balancing of the two sides, data and algorithms, with so much recent hype on the algorithms and maybe not enough attention given to this area.
SC: But Janice, not necessarily new, in your opinion?
Janice: Right, it's more like we need to make sure that we give it enough focus.
Naming Things
SC: And Adrien, I thought you had an interesting take on the importance of naming things. Can you elaborate on that?
Adrien: Yeah, it's a Tolkien expression
that I really like: there's power in naming things. Actually, my collaborators know that I attach a lot of importance to names of methods and titles of papers. And I think that what's new with data-centric AI is the name. As you suggested, there's a lot of stuff that has been going around for a while. Actually, when you were talking, it reminded me of this Twitter account called Big Data Borat that was talking about the eighty percent of cleaning data and so on. It's been around since big data, right, since the Unreasonable Effectiveness of Data paper in 2005 by Google. So what changed is not just the name, of course; that's a little bit in jest, although there is truth to that.
I think it's deep learning, and it takes a while. Now everybody thinks in their mind it's 2012, but I lived through 2012. I defended my PhD in 2012, and I was working in convex optimization, kernel methods, learning theory, the Vapnik world. And then overnight: throw everything in the garbage can, because deep learning, non-convex voodoo, et cetera. For computer vision people it took a year or two to accept the evidence. For roboticists, depending on which roboticists, some people still have to accept it. So deep learning is what changed the name of the game, because before, SVMs et cetera didn't need as much data and didn't get as much benefit from data. A lot of the previous benchmarks, even in the [unclear] community, were saturating with data. So it took Fei-Fei Li, professor at Stanford of ImageNet fame, quite a lot of vision at the time to make datasets of millions of images with ImageNet, because at the time it was not clear that it was needed. She had almost perfect timing; she was a little bit early in the game, and then boom, it happened.
And it's been continuing, right? With the models becoming bigger and more data hungry, data started to have a bit more of a place at the table of concerns. And with transformers even more so, because now people think, with evidence, that we can use generic models for a lot of different tasks, and then we can have more data as the main variable of adjustment. This is the exciting part also, because this is where there's a lot of art, more than science. This art is slowly turning into a practice, into best practices around data-centrism, and then hopefully one day it will actually be a science. You will be able to buy a book, just like engineers have for how to build a bridge, on how to build a dataset. We're very far from it yet.
SC: Charlene, you had an interesting take
last time we spoke about, in your experience, the relative performance of data quality and model training. Can you share that with us?
Charlene: Yeah, totally. One thing that's been interesting to see as an on-the-ground practitioner is that there are a lot more people coming into ML from diverse backgrounds and starting out their careers in ML right now. What they're finding is that when they go to build a model for their organization, they look around, and some of their co-workers are spending more time on the algorithm, tweaking hyperparameters and trying to optimize the architecture in a certain way, and other co-workers are spending more time just iterating on the dataset. And it turns out that the folks who spend more of their time on the dataset are getting their models out the door a lot more quickly. They're delivering more business value, and they're getting promoted faster. So from an incentives perspective, even just for the practitioners, it makes more sense to focus on the data at this point.
New Role
SC: Awesome. Audrey?
Audrey: Yeah, I liked what Adrien said about the fact that now we're getting into a data-centric world, and what that means in terms of the people that will be working in that field and making sure that data is right. That's what I was talking to you about the last time we chatted, Sam: I feel like there will now be a new role that will be really needed, one that has been there for a little while but not as much as I thought it should be, which is data labeling operations people. These are not technical people, but they understand technical requirements and are able to translate them into simple instructions for the data labeling workforce that will be there to create the data accurately but also consistently. That also comes with best practices and with general workflows that need to be applied. There are certain recipes that work really well depending on the format of the data and on what the data scientists want to recognize or get out of the data. And I think, as you're saying, Adrien, that there will be a science around data labeling and data prep in general. Maybe one day you could go to university and become a data labeling operations person. I believe that's the future, so we'll see.
Labels
SC: So
it sounds like in a lot of ways, as an industry or as a community, we've got this kind of love-hate relationship with labels and labeling, right? Data-centric AI is very explicitly a reaction to the costs and challenges associated with hand labeling data. Thinking back to a recent episode of our data-centric AI podcast series, I spoke with Shayan Mohanty, and he dropped the spicy take that manual labeling is actually harmful to the machine learning process. I'd love to hear any reactions to this idea, and I'll maybe throw it right back to you, Audrey, since labeling is your life. [Music]
Audrey: Yes.
So, again, it goes back to what I just said. I think that there are best practices, workflows, and guidelines to follow on how to do well when you're labeling with humans. Say you come as a data scientist and you create your own tasks, and you don't try to adapt to the crowd that's going to be labeling your task, or you don't think about all the different edge cases that can be contained in your dataset, you don't think about bias, you don't think about how you're going to get your quality control done or how you're going to get feedback to the crowd that is doing the labeling. There is also this idea that if it's simple for the technical person, it has to be simple for the crowd, and if the quality is not good, that means that the crowd is not doing well. I think there should be a shift in the way we're thinking about it, making sure that the best practices and the workflows in place are driving efficiency, consistency, and accuracy. That's something that is required to get high-quality data. Obviously, if you don't do that part right, you're going to get bad data, and it's going to be harmful. So that's what I think about it.
Manual Labeling
SC: Yeah. Charlene?
Charlene: I was gonna say, real quick, that's
interesting. It makes me think of managing certain data labeling tasks at Primer, particularly named entity recognition, which is a task with so many edge cases. You would never think that it's that hard to identify what's a person in this document, or what's a location. But it turns out that sometimes University of Arizona is a location and sometimes it's an organization, and that's the case for thousands and thousands of other things. So I can understand his point when he says manual labeling is harmful, because one way that people try to solve for this is with consensus labeling, where they just throw five different people at every document and say, we'll just average out their annotations into something usable. But if you do that with hundreds of documents, you end up with a difference of thousands of entities, depending on whether you take two-out-of-five agreement or three-out-of-five agreement. At some point that just doesn't make sense, because your model is getting a lot of mixed signals from using that data. So it becomes more important, rather than using consensus, to really focus on ironing out those edge cases and making it really clear what principles and what mental models to have in mind as the labelers are labeling, and even to give examples of how to handle certain edge cases, as opposed to just general, vague guidance.
SC: Yeah. Janice?
Human Labeling
Janice: Yeah, I want to add that there are pros and cons with human labeling. Some problems are more suited to human labeling, where it's more black and white. But there are other problems where it's more blurry; let's say it has to do with intention. For example, when we look at fraud cases, if we ask humans to do the labels, sometimes it's hard to decide just by looking at the evidence. So we need to employ different ways. For example, as Charlene mentioned, maybe we have multiple people, like a voting system, or we can have a mix with active learning, or sometimes we just want to observe the longer-term customer reaction. Each of those different ways to do labeling has pros and cons, and we need to combine the different methods to get the best labels at scale.
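The consensus-threshold point Charlene raises, and the voting system Janice mentions, can be made concrete with a minimal Python sketch. It assumes five hypothetical annotators who each return a set of (text, label) entity spans for one document; the function name `consensus_entities` and the sample annotations are illustrative and do not come from the panel. It only shows how the set of kept entities changes between a two-of-five and a three-of-five agreement rule.

```python
from collections import Counter


def consensus_entities(annotations, min_votes):
    """Keep every (text, label) span marked by at least min_votes annotators."""
    # Each annotator contributes at most one vote per span.
    votes = Counter(span for spans in annotations for span in set(spans))
    return {span for span, count in votes.items() if count >= min_votes}


# Hypothetical annotations for one document from five annotators.
annotations = [
    {("University of Arizona", "ORG"), ("Tucson", "LOC")},
    {("University of Arizona", "LOC"), ("Tucson", "LOC")},
    {("University of Arizona", "ORG"), ("Tucson", "LOC")},
    {("University of Arizona", "LOC")},
    {("University of Arizona", "ORG"), ("Tucson", "LOC")},
]

two_of_five = consensus_entities(annotations, min_votes=2)
three_of_five = consensus_entities(annotations, min_votes=3)

# Spans that survive the looser rule but not the stricter one.
disputed = two_of_five - three_of_five
print(len(two_of_five), len(three_of_five), disputed)
```

In this toy document the looser rule keeps "University of Arizona" under both labels, which is the kind of mixed signal described above; clearer guidelines for the edge case would remove the disagreement at the source instead of voting it away.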
Adrien's Labeling Party
SC: Awesome. Adrien, I hear you're labeling's biggest fan.
Adrien: Oh yeah, I have a love-hate relationship with labeling. Actually, do you know what a labeling party is? Have you ever been to a labeling party? Okay, I have a little story. I won the Pascal VOC challenge, which is the ancestor of ImageNet, back in 2008. It was a big challenge at the time. It was organized by the late Mark Everingham, who now has a prize in his name for contributions, and he had designed this benchmark, which is really what then inspired ImageNet and other things. So I won this, and it kind of put me on the map a little bit in computer vision. Then Andrew Zisserman, a professor at Oxford and one of the legends in the field, invited me to come spend two weeks in Oxford. I was like, great, giving a talk there, et cetera. But there was also a labeling party for the next edition of the Pascal VOC challenge, where a lot of grad students were using a MATLAB interface to label for semantic segmentation. And that's where my love-hate relationship with labeling began. I love using labels; I hate producing them. So
that's why I started to work on ways to avoid that. Ultimately, in the robotics space, and especially in safety-critical applications like self-driving, it can take up to four hours to label one image, because that 25-pixel pedestrian might be safety critical. And the level of QA, the quality assurance that you have to do, is held to very, very high standards. Everybody has hundreds of pages of labeling guidelines, and internal annotation teams or at least internal QA teams. It's serious business to do labeling, as I'm sure Audrey and basically everybody in data-centric AI can talk about.
That's why one of the things that moves the needle for me, and I've mentioned it, my most extreme point of view, is that all the labels should go to the test set. I agree with Janice, actually, that the solution is in the gray area: obviously it's semi-supervised, you need some labels to train, you need some pragmatic approach. But if I try to state the most extreme ideal that I have in my head, it's that all your labeling budget should go to your test set. You'll always need labels, because evaluation is statistical, so you might as well maximize your test set, because that's really how you get certainty before deploying. And in safety-critical applications you want to be pretty sure that your system is safe, and it's a statistical system, right?
So that was my ideal, and that's why in my research I've worked a lot on auto-labeling, self-supervised learning, and using synthetic data, all these ways to avoid the cost of labels at training time. That's my research hat. But I also have my engineering hat, which says: obviously, label part of your training data. The cool thing we found at the intersection of doing both the research and the engineering is that semi-supervised learning is really bottlenecked by what you do with the unlabeled data. Part of it is deciding which part you label, which is called active learning. But part of it is: even if you label a part of it, you have all the rest, so what do you do with it? You can learn very useful representations from the unlabeled side, which is why I think semi-supervised learning is bottlenecked by things like self-supervised learning. This is what we found both in our research and in our practice.
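Adrien's remark that evaluation is statistical, so labels spent on the test set buy certainty, can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than anything the panel described: it treats each test example as an independent pass/fail trial and uses the textbook normal approximation for a binomial proportion. The function name `accuracy_ci_half_width` and the 0.90 accuracy figure are hypothetical.

```python
import math


def accuracy_ci_half_width(accuracy, n_test, z=1.96):
    """Half-width of a normal-approximation ~95% interval for a measured accuracy."""
    # Standard error of a binomial proportion: sqrt(p * (1 - p) / n).
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# The same measured accuracy is far less certain on a small labeled test set.
for n_test in (100, 1_000, 10_000):
    half_width = accuracy_ci_half_width(0.90, n_test)
    print(f"n_test={n_test:>6}: accuracy 0.90 +/- {half_width:.3f}")
```

Under these assumptions the interval narrows with the square root of the test set size, so one hundred times more labeled test examples gives an interval ten times tighter. Real deployments violate the independence assumption in the ways discussed later in the panel, so this is a lower bound on the uncertainty rather than a guarantee.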
Good Data vs Bad Data
SC: Awesome, awesome. I want to pull in a question from our audience. Audrey, you alluded to good data versus bad data. Marat is asking: can our panel elaborate on what the right data means? What is good data, and how do you know which of your data is good? I'll let you start with that, Audrey.
Audrey: That's a very good question. [Laughter]
I think that starts with having the right people in the room. Again, I repeat myself again and again. As a non-technical person, I really act in more of a cross-functional position, trying to make sure that everyone talks to each other and understands each other. The idea is that at the very beginning, when I start a data labeling project, I understand what the machine learning engineer or data scientist wants to do with that data. Where do they want to go? How are they going to use it? Why do they need that many images annotated, and why do they need to have a timestamp on a video annotation? So there is really this idea that we need to understand where they want to go, and then from there we can create the right task. But again, it also includes a feedback loop and quality control, because even though I'm going to be translating everything in the best way possible so that the crowd can label it the right way and consistently, there will always be inconsistencies at the beginning, and there will always be some edge cases that were not covered. That just requires a lot of teamwork between the different stakeholders. And having some quality control done by the machine learning engineer that wants to get the data is, in my view, the number one requirement, because they are the ones that will be using the data, and they need to be involved. I know it's not the most fun part of their job, and I know it's not really exciting to work on data labeling, but having them involved is really critical to get the right data in place. So I would say good data means the right data for the machine learning engineer who's going to be working on the model.
How Do You Use Data
SC: Charlene, any experience with trying
to identify the right data?
Charlene: Yeah, I absolutely agree with the way Audrey framed it, about it being about how you actually want to use the data. That's going to really dictate how you frame the task and how you end up evaluating performance on the task. How strict do you need to be about certain rules? Are customers going to be okay with these kinds of outputs? Do you need to prioritize precision or recall? All these things are dictated by the actual use case. So yeah, those are my thoughts about it.
Less Data More Data
SC: What about this idea
that, you know, we've been driving towards more data, more data, more data, but in some cases less data is actually the right approach? Does anyone have any experience with these ideas, data curation and things like that?
Charlene: Sure. I do have a quick experience with this. Funny enough, this was one of the first models that I was training at Primer. It was a relationship extraction model where the task was to identify who is the employer of some selected person. I had about 500 hand-labeled examples that I had given to the model, and then I was like, okay, but can I make it even better? So I labeled another 200 examples, and then I went to test the model, and the performance had actually gone down: recall had gone down significantly, by like 0.2 F1. I was like, what the heck? Then I went back and looked at the data, and I realized it was all following the exact same pattern, where it was like, person name, employee at Tesla, or something like that. So the model had basically overlearned this one pattern and had kind of forgotten other types of patterns of employment in text. So it's definitely the case that if your data is not diverse enough, you can actually make the model worse by giving it more data.
yeah Adrian yeah i also have a cool story like that um there’s actually a kind of um uh
Dark Knowledge
i was talking about like arts that like dark knowledge that becomes kind of best practices and being
in the silicon valley actually has a huge advantage of like people chatting with each other about all these kind of
like war stories so here i can also give you a little bit of that uh here one was
sometimes the best thing you can do is deleting part of your training set and that seems very weird because you
paid good money for those labels you know uh and you have ninety-five percent inter-annotator agreement and you know
the labeling company is like yeah this is the best etc and then what happened was that there was one engineer uh in the driving team
uh mark uh that was interested in um like you know working more into ml and
he saw that was the future and he wanted to learn a little bit so you know in the evenings we were sticking around in the office just two
of us i was teaching him some basics but like on our data and then um you know something that
computer vision people don’t do nearly enough i completely agree with Audrey is look at the data so i was showing him stuff and looking at the data and he was
like yeah that label is weird that label is weird it’s like oh yeah you’re right but it’s just outliers
and then you know going home and uh and and Mark is very focused so he went through all the data and and started to
flag things that he thought looked uh not okay and that was half the data set and then retrained the model like push
button with just half the data set and then results were you know two percent better which was a bigger improvement
than we’d achieved you know uh with any single update before and i was like what so at first i didn’t believe him so i
went and looked at all the data he rejected looked at the logs and stuff like that and i was like what
and that was like five years ago uh so i think that’s uh you know with the data there’s the good the bad and
the ugly so that’s an example of ugly uh you should be uh you know periodically revisiting and so some of
it is the data quality right obviously but some of it is more pernicious because it’s the world is changing you
know a very very big assumption we’re making in machine learning which is very fundamentally flawed is i.i.d. right
everything is independent and identically distributed uh guess what the world is not i.i.d.
otherwise it would be like the movie Memento if you’ve seen it uh so it’s it would be terrible um and in self-driving
in robotics in in many other applications I’m sure Janice can talk about fraud detection or any of the
panelists it’s it’s sequential decision making the whole system is a huge feedback loop and so
decisions impact customer behavior or impact you know the behavior of the robot impact data
that comes in and this whole loop has a big feedback loop and so i.i.d. is a very bad
assumption and especially for data centric because now you’re shuffling the data and as
Léon Bottou a famous machine learning person uh you know who popularized SGD in deep learning said nature does not shuffle
the data and i think that’s a big big issue with what we’re doing now and part of it
contributing to hurting uh in a data-centric space
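One way to make the "nature does not shuffle the data" point concrete is to compare a shuffled split with a time-ordered one. The following is a minimal sketch, not anything the panelists described implementing: the `timestamp` field, the function names, and the toy records are all assumptions made for illustration.

```python
import random


def chronological_split(records, test_fraction=0.2, time_key="timestamp"):
    """Hold out the most recent records, so evaluation mimics deployment on future data."""
    ordered = sorted(records, key=lambda r: r[time_key])
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


def shuffled_split(records, test_fraction=0.2, seed=0):
    """Conventional i.i.d.-style split: past and future records get mixed together."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy records; in practice these would be logged events or frames.
records = [{"timestamp": t, "value": t % 7} for t in range(100)]
train, test = chronological_split(records)
shuffled_train, shuffled_test = shuffled_split(records)
```

With the chronological split every test record is newer than every training record, which is closer to how a deployed model meets a changing world. The shuffled split gives no such guarantee.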
Getting the right data
got it uh i want to take another question from the audience mahavir asks
along the same lines you know quote getting the right data unquote
can be tedious and take a long time what data points have you all used to
convince management uh that there’s ROI in spending that time
any thoughts on that and we’ll start with you Janice yep it’s i would say pretty easy to show like
for example we have a global company and uh there are a few markets that are more dominant versus
like there are some newer markets that will have less data so you can show like let’s say oh maybe
very upper management thinks that we already have a lot of data like why do we still need to spend time to get more
data um because like uh basically uh it was also mentioned by Charlene like basically the
dominant trends will basically uh overshadow everything else so uh with that
we can like build a model and easily show that hey uh maybe this is a model with a lot of data but maybe it’s very tuned to the
US market because that’s the market we have the most data in but what if we want to address some of the newer
emerging places like LATAM or some other countries where we have very little data we can show that when we
separate out the performance of those like emerging markets maybe we will have much lower performance because we don’t
have representation we don’t have enough uh data points over there so uh i guess once we break down the performance
and show the upper management that hey without that data we will have good performance in some pockets but like a
much lower performance in some pockets or maybe sometimes some bias in some pockets then it’s easy to show that
we need to spend time even though in the overall scheme there is a lot of data but we need to make sure that it is representative
is covering all the different pockets and also is of good quality otherwise the performance just by itself will show
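The per-market breakdown described above can be sketched in a few lines. This is an illustrative example only: the record fields (`market`, `label`, `prediction`), the function name, and the numbers are made up, and accuracy stands in for whatever metric the team actually reports.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key="market"):
    """Return overall accuracy plus accuracy for each value of slice_key."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        s = r[slice_key]
        totals[s] += 1
        correct[s] += int(r["label"] == r["prediction"])
    per_slice = {s: correct[s] / totals[s] for s in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_slice


# Hypothetical evaluation set: a dominant market and a small emerging one.
records = (
    [{"market": "US", "label": 1, "prediction": 1}] * 90
    + [{"market": "US", "label": 1, "prediction": 0}] * 5
    + [{"market": "LATAM", "label": 1, "prediction": 1}] * 2
    + [{"market": "LATAM", "label": 1, "prediction": 0}] * 3
)
overall, per_slice = accuracy_by_slice(records)
print(f"overall: {overall:.2f}")
for market, acc in sorted(per_slice.items()):
    print(f"{market}: {acc:.2f}")
```

In this toy data the overall number looks healthy because the dominant market hides the weak one. Only the per-slice numbers make the gap visible.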
Is it easier to convince management
yeah i i will also add to the question um i think Janice uh and Audrey uh and Charlene
maybe we’re all in the same situation which is it’s easier to convince management when you’re management
so you know um i think that’s one thing in the machine learning where it’s important to have technical leadership
i think you know there’s this kind of like dual cultures of management right of like people management just to approve
expense reports and and you know and then the other side of things which is just like technical leaders um i think
in my experience it’s kind of unavoidable to be leading from the front in the machine learning space because it’s
such a deeply technical and not fully understood space that if you have to educate your manager
in addition to doing your job this might be a bit of a tough cookie um and so i think performance does speak
for itself so if you’re in a situation where your manager is not an ml savvy person and it’s kind of like
doubting the hype you know some people are like that including in the robotic space
and so but performance speaks for itself so articulate a problem articulate a benchmark that’s at the scale where um
management says okay i can approve that you know and maybe you can do a bit more on the side uh if you’re really passionate
about it um and then the performance should speak for itself and that’s why data centric ai is kind of spreading
like wildfire because when Charlene was talking about her examples and like i think everybody here in this panel has
examples of like being surprised uh by focusing on the data for a little bit so my experience is actually the opposite
it’s harder to convince ics than management and for the reason actually Audrey mentioned which is it’s
not as sexy uh working on the data and labeling and correcting stuff as ooh this new
transformer that just came out from DeepMind you know it’s like so i think that’s that’s what’s harder is to get
people like Vincent Vanhoucke has a great blog post as the head of Google robotics
on you know like managing research teams that i think applies a lot to machine learning even the engineering side has a little bit of flavor of research
and he says shoot shiny objects on sight and that’s really hard when you’re like
surrounded by shiny objects which is 100 papers on arXiv every day in our field right like not joking so so i think um
getting people to focus on the data is not so much of a management problem
it’s more of a down management problem in my in my experience interesting interesting
Bring attention
uh that like i think the whole data centric ai one thing is also bring attention i
mentioned earlier there’s a lot of hype and a lot of attention on algorithm this is the shiny objects basically people
think machine learning is so cool it has a really big cool factor to it
so like i guess when you bring more attention like data is equally important it’s equally sexy so we need to make
sure that like from all over the organization we pay enough attention and spend the focus there
How do you start
um kind of continuing this thread of
practical takes we’ve kind of you know defined uh data centric ai and we’ve
talked about a lot of aspects but um i don’t know that we’ve kind of really
nailed like how do you do it um and you know maybe the question is
like how do you start if you’re in our audience and you think that this is a
compelling conversation or a compelling idea like what do you do uh and
uh or you know another kind of way to look at that question is
you know for the ICs that are on the ground like how does their experience change uh
in a pre data centric model centric view versus a data centric view
any takers on that
Framing
how about you Charlene yeah i was just thinking about that um
i think that framing can really help with this um so Adrian was getting at this earlier but
like it it’s actually in my experience harder to convince ics to focus on the data um because and i i think this is
because a lot of people you know they got into ai because they think the tech is cool like they’re really interested
in the math behind the models and like how it works um but like in order to get
in order to get started with this approach i think it’s it’s maybe important to kind of uh
tie in that understanding of how the data interacts with the model maybe um on like a mathematical level um and that
way you can kind of summon that interest like in the data um in order to
uh like put that same focus into the data curation process as opposed to just the
sort of model tuning process um in terms of like actual
processes to put in place it kind of depends on where you’re starting from so it’s a little bit hard
to say exactly um but you know if you’ve been using like off-the-shelf academic
data sets um first of all stop that you can use that for your pre-training
but um you’re gonna have to actually go and get custom data for your task most of the time unless you’re doing
something pretty basic um and so you need to actually like work with the subject matter experts in your domain
whether those people are like in your company or outside of your company and you you need to hire them I’m sure
Audrey can say a bunch about this too but you actually need to be willing to
um go and define that task and work with all the stakeholders and figure out like
what exactly um is the use case that we’re working towards and how do we get that data um
but I’ll I’ll pass it along to uh to someone else who can say more about that
Audrey um yeah i think I’ve seen like
Data Scientists
depending on the company depending on their knowledge about how to go like on how to get the highest accuracy
data and and so on I’ve seen startups that don’t have the budget like data
scientists are working on that and i get that this is like again it’s not super sexy and so when they work with like
companies like like MLtwist or they’re working with data labeling operation people like and and we hold their hand
and we say like don’t worry we’re gonna get there this is how to go about it it gets it gets actually quite interesting
even for them it gets like quite sexy even for them because all of a sudden they realize how they can impact the
quality of the data just by following simple steps and just like following a workflow and and so on so um
i I’m very passionate about data labeling so i think i guess I’m I’m able to to give that to the people I’m
working with um what are some of the some examples of those steps those simple steps
Data Labeling
the the very the very simple ones is the the ones i mentioned earlier is that
like if you start and you know what data labeling you need to do to get the data you need
for your model um try it yourself i think that’s the number one rule work on the task
yourself even if it’s like 100 images or 50 images or or text annotation just do
the work yourself because that one is going to really help you understand if you have covered every potential use
case that will be in your data set
um and and that’s the number one rule because that’s going to help you refine your guidelines your step-by-step guide
that you’re gonna give to the workforce so that they can annotate the way you want them to annotate the data so
um and and that’s very simple and and that might be like obvious uh when you
think about it but it’s a very different experience when you try the task yourself
um when when it comes to bigger companies that have already the knowledge about data labeling um the
definitely like the experience is way different because they have
they have they know already the impact of of doing it of working on a data-centric approach and then all of a
sudden the discussion is is different it’s not about like taking uh step by step and baby steps on how to get there
but it’s more like okay how can we do it even better what type of data labeling tool out there can help me get like you
know all these uh thousands of uh brands for a
coffee pod that would be like just annotated in like less than 30 seconds so we’re talking more like as Charlene
mentioned about um aggressive timelines how to reduce the cost how to reduce the
timeline for delivery um and and it’s much different conversation but there are like different
different levels definitely depending on where you start yeah i think Janice wanted to jump
Mindset Tools
in as well sure yep yeah first it’s a mindset uh we
talked about it i think first mindset and culture so very important so people have the passion to build to do
it and second is having the right tools in place there are a lot of like already like open source or on the market
there’s a lot of tools out there that will facilitate the work a lot i think the principles and the work that’s
needed is not like super complicated it’s just about uh are we applying the right uh right methodology and the right
tools will really help with that so having all those set up then people will
understand the value and also make it much easier to accomplish as well so i guess those are the first few steps to
begin with Adrian yeah i completely agree with what Audrey
Dogfooding
and Janice said uh i think that dogfooding uh starting with dogfooding is really good idea like building your own
data sets uh getting your hands dirty uh you know or being part of the you know
if there is already a pipeline you know joining the pipeline at least for quality checking you know uh
not for labeling uh i learned a lot by just quality checking labels you know
labeling parties I’ve actually organized labeling parties at my work to do that you know incentivizing with pizza and
stuff like that which was both for getting better data but also for getting people to look at the data and
understand a little bit more and that’s a huge driver also of creativity of insights into not just the domain but the whole
labeling process and things like that so a huge uh plus one for that maybe to complement one thing i can add is that
that’s also why i fundamentally believe that there’s a danger there’s something called Conway’s law
which is that an organization writes software that’s structured in modules like the organization’s teams right so
you have like in self-driving cars for instance you have a perception team that does the perception module your prediction team does the prediction
module you have a planning team that does the planning module right and so one of the problems is that there’s another law which
is the law of leaky abstractions all abstractions are leaky and i think there’s a kind of a danger
of having vertical silos like the data silo the ml silo which contributes to also maybe
the slow adoption of data centric in ML teams because they had a data team
that was dealing with data and they were dealing with the models right and so i think that
ML people need to be more involved in the data operations whether it is by having the organizational structure in
place to have ML and data under the same roof or just having really good cross-functional collaboration i think
that’s really important now in terms of tooling so we’ve
taken some steps for that the first step we’ve taken was governance
because in safety critical scenarios applications of machine learning like self-driving cars which are not super
regulated you have to decide um how much you want uh how safe you
want to be right and and uh and so car makers tend to be very safe so some of the things that were
important for us is ai safety and so it starts actually with simple things in the data space which is like
traceability so we’ve had an open source library called the Dataset Governance Policy or DGP and that’s been
instrumental to how we you know like there’s no GitHub of data right and at certain scale it’s really hard to do data
versioning tracing and stuff like that so it sounds simple because for code we have Git it’s so easy right but for data
it’s very very hard especially because of the human in the loop uh many humans in many loops um and so i think that
having traceability having integration in different systems uh that’s also why we are one of the first customers of
Weights & Biases uh to have all this experiment management to have the code the data the experiments and all the
human decisions in between um as connected as possible um
and and so yeah so i think those kind of like data governance is a good thing to have in place getting
your hands dirty and building your own data sets definitely plus one continuing on this thread of tools uh in
Tools
the context of data centric ai um you know we hear a lot about different
tools technologies uh and the things that come to mind are uh you know data creation or data
curation you know synthetic data programmatic labeling weak supervision
active learning um I’m curious
do do you do we think are these foundational you know tools that are
you know required for data centric ai are they uh important things to have in
the toolbox but not necessarily foundational or are they shiny objects that are probably more of a distraction
I’m a huge proponent of self-supervised learning and simulation so i definitely don’t think they’re shiny
objects i think like you know one one thing that people underestimate is accessibility of data
uh right it’s like for certain safety critical scenarios or privacy uh for
reasons or ethical considerations you know it’s not like you can just harvest the data from all your users
certain applications you can a lot of applications you can’t or you don’t want to um and so i think accessibility to
data is a big challenge and so either you have access to a lot of raw data um but you cannot you know label it or you
don’t want to label it for certain things um but it’s still part of the real world so it’s still there’s
something to learn from there because it’s still part of the world that your system is going to operate in so that’s where self-supervised learning is there
the main question there is is self-supervised learning absorbing or worse amplifying the biases in your data
and that’s why we’ve done some theoretical studies including an ICLR paper we presented recently to show that at least for imbalance which
is a very natural bias ubiquitous in data um uh actually
self-supervised learning learns more robust features to that bias than supervised learning so that was a great
thing there’s another one which is simulation which is if you don’t have access to your own data you
need to create it and in robotics a big way is synthetic data generation in
simulation especially again if you want to learn to avoid crashing you shouldn’t be crashing millions of
real cars you know to learn to avoid that from experience that doesn’t so overall it maps out to this very simple
like uh view of the world which is you have three types of variables you have the known knowns which is what you can label
right you have the known unknowns which is you know they exist you know uh like you
know like the you know like a dead animal on the side of the highway that’s a real example where people told me we need to make sure we the car behaves
correctly there it’s like well what am i gonna do is take a shotgun and shoot animals like corpses along the highway
that’s like i mean no you know that sounds insane um and so simulation is
there which you can very easily generate scenarios for that so that’s the known unknowns what you do it’s called programmatic data generation uh
programmable data right that that’s what simulation is really really good at the known unknowns but then you have the
heavy tail of the unknown unknowns right all the crazy things that can happen in the world they happen right and and you
have to have your model be either robust to that meaning being able to say well here you shouldn’t trust my prediction
right out of domain detection etc or instead i need to have access to all the data and learn from all the data to be
able to recognize when it when it happens even if it’s just to say don’t trust me and that’s for me where self-supervised
learning comes into the picture so again we need everything at the table the problems that we’re dealing with are so
hard every tool is needed any additional takes on tools
Data augmentation
Charlene yeah i think some of these things can be more useful than others depending on the area so like one thing
that you run into in NLP is that data augmentation isn’t really a
thing uh because you can’t just substitute words in a document and
like for a synonym or an ostensible synonym and expect it to like mean the same thing or even like have the same
label afterwards uh and so you have to get a little bit more creative about how you end up um
getting new documents to label um so one thing that helps a lot is similarity search so if you project all of your
documents into embedding space you can actually search that space using nearest neighbor
search or something like that let’s say you’re trying to increase recall specifically you know you
take all your positive documents you project them and then you select more documents from your unlabeled set that
are close to those in embedding space so that’s one way that something like that can help active
learning also helps a lot when you’re still kind of initially exploring the space so one algorithm we implemented at
Primer was something called core-set which is a form of active learning that
prioritizes um covering the entire uh data distribution
um and there’s some very fancy math involved in this but there are like implementations that you can use um
and essentially it ensures that you have explored the diversity of the data even just in your first like 50 to 100
examples instead of like taking a random sample and just hoping that you know most of the things and and
the edge cases that you would want to see are in that random sample um and so you definitely have to get creative with
using some of these methods depending on uh like Adrian was saying you know what level of access do you have to the data
like how much other data do you have um and various other constraints
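The two selection strategies mentioned above, similarity search around known positives and coverage-driven sampling, can be sketched as follows. This is a toy illustration under stated assumptions, not Primer's implementation: embeddings are plain lists of floats, the function names are hypothetical, and the coverage step uses greedy farthest-point (k-center) selection, which is one common way core-set style sampling is formulated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_unlabeled(positives, unlabeled, k=2):
    """Indices of the k unlabeled vectors most similar to any positive example."""
    scored = [
        (max(cosine(u, p) for p in positives), i)
        for i, u in enumerate(unlabeled)
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


def k_center_greedy(points, budget, start=0):
    """Pick `budget` points, each one the farthest from everything picked so far."""
    chosen = [start]
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < budget:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [
            min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)
        ]
    return chosen


# Hypothetical 2-D embeddings; real ones would come from a text encoder.
positives = [[1.0, 0.0]]
unlabeled = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
to_label_next = nearest_unlabeled(positives, unlabeled, k=2)

pool = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
diverse_batch = k_center_greedy(pool, budget=3)
```

The first function serves the recall case: it pulls in unlabeled documents that sit close to known positives. The second serves the exploration case: it spreads a small labeling budget across the space instead of relying on a random sample to hit the edge cases.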
Top takeaways
got it got it uh so if the question is is it about people process mindset or
tools the answer is yes yes yes and yes uh we are coming up on the top of the
hour and so um let’s do a really quick round of uh you
know top one takeaway for our audience from your perspective and if you can
keep it to less than three words all the better Adrian
ooh um listen to our discussion with Sam about principle centric uh because i think
that data centric we don’t know how to design data sets and we need to know how to be able to teach machines and that
means including designing data sets and injecting what is not in the data set and we didn’t talk about it today but we
made a whole episode with Sam so people can check it out awesome Audrey
and that’s a very tough one
i think that yeah i know that data quality uh obviously we all know
now uh by now that it’s like that’s the new oil the new gold or however you wanna you wanna call it and um
and you need like a lot of different people in the room i hope and and we chatted about it the
last time also as uh with with you Sam is that um we need to get more people in
the room that are coming from different backgrounds because that’s going to help with a lot of different issues that come with data centric approach as well
which is how do we go into like more ethical you know data how do we remove bias and and
so on and so having like a mix of of of different people will really help on the
top of definitely leveraging all the different technologies at the end uh to
come back to what we discussed like what Charlene and Adrian were talking about i believe that technologies are they’re
all great all all the ones that are there we are very lucky to be um
using them uh right now but depending on the use case depending on the industry depending on the field
uh not all of them should be used and there should be like this way to
unify the data labeling ecosystem and just pick the right tool when we need it on the top of the right people
awesome Janice yeah overall i would uh reiterate data is as
important as algorithm and just as sexy as algorithm there’s a lot going on in this space data is sexy yeah
yeah and uh we talked about a lot of things in this one hour actually there’s a lot of activities going on and also
they require a lot of deep skills as well so i would encourage everyone who is uh in the ai space in the like
machine learning space to pay as much attention to data and also build all that knowledge on what it takes to
build uh data centric ai how to apply the right quality
uh how to do synthetic data there’s a lot going on so and there’s a lot to learn
yeah i totally agree um if i had to summarize mine in three words i would say get better tooling um
because uh a like right now there are so many companies coming out with really interesting ml ops tools that can help
you so much with this process of iterating on your data set and iterating on your model like you don’t have to
write your own scripts for all this stuff anymore like you can literally just buy a solution
so try checking out first of all better labeling tools because there are labeling tools out there that have kind
of QA built into the labeling process and so it’ll help your labelers do the right thing and avoid doing the wrong
thing which results in better quality data for you and then secondly uh data management i i
I’m a little bit biased but i say check out Aquarium for data set management um we help you keep track of quality issues
you know team members can collaborate on different issues with the data set um you can explore your data really easily
with the embedding view there’s uh many many other features um so definitely uh you don’t have to do
all this alone uh get the right tools and uh it’ll help you solve your problem with a lot less tedium
awesome awesome all right well we are going to wrap up i want to start by thanking our panelists for
their insights and contribution to this session uh thanks team for pulling this
together uh the recording of today’s discussion will be available on YouTube immediately
so if you’re out there and you want to share it with your friends just send them to the YouTube URL
big thanks to everyone who tuned in and once again to our fantastic panelists
to stay up to date on the next one please be sure to visit twimlai.com and sign up for our newsletter and you can
also follow us on twitter at twiML and of course be sure to uh sign up for
updates at twimlcon.com uh thanks so much everyone thanks everyone thanks Sam
thank you thank you bye