Good afternoon everyone.
Carahsoft Technology would like to welcome you
to our Google Cloud and MLtwist webinar,
"Navigating AI Data Complexity within
Sandia National Laboratories."
At this time, I would like to introduce our speakers
for today: David Smith, founder and CEO of MLtwist;
Andrew Cox, R&D system analysis at Sandia National Laboratories;
and Steven Boesel, customer engineer
for Google Public Sector.
Andrew, the floor is all yours.
Okay. Thanks, y’all, for being here.
What we’re gonna talk about today is
how Sandia National Labs has been supporting TSA,
the folks that screen you at the airport,
to develop machine learning algorithms
that can help better detect threats.
And so we’re gonna get into that.
But first, I’d like
to tell you a little bit about Sandia National Labs.
We are, oh, excuse me.
Let me tell you a little bit more about the problem
that we’re trying to solve first.
Thank you, David. So TSA,
of course, wants to be able to detect when threats go
through,
or when someone attempts to have a threat go
through the checkpoint.
So, for example, if someone wants to bring on a gun
or a knife or something like that, you put your baggage
through an X-ray,
or you go through one of the body scanners,
and they have algorithms that will detect that stuff.
Of course, TSA has an interest in always
continuing to improve that.
And in recent years, TSA has adopted an open architecture
strategy, which means that, in addition to
fielding algorithms from the companies that
provide the X-rays
and the body scanners, they want to be able to build on
that and draw from the broader machine learning community
and a broader set of market players
to add additional algorithms, to maybe specialize,
to do better, and to sort of continuously improve.
And the idea is that TSA wants to be able to
get algorithms out there a lot quicker
and make those algorithms better so
that passengers can get through much more quickly
and much more securely.
A big part of that, as many of you know, is if you’re
doing any sort of machine learning, you have
to have a really robust, quick data pipeline.
And so what we’re gonna talk a little bit about today is
how we stood up that pipeline for TSA.
We did the sort of initial version,
and there were lots of growing pains in that.
And we’re gonna talk about how we matured that,
certainly with the help of MLtwist, who is one
of the other partners on this call.
Next, please, David. First,
before we get into the specifics,
I just wanna tell you a little bit about
Sandia National Labs.
We’re a national lab, one of the
national security labs that does R&D
for national security.
That includes nuclear weapons
and then all sorts of other stuff,
obviously including X-ray algorithms, body scanner
algorithms, things like that.
Our job is to help create those capabilities
that will allow the private sector to ultimately come in
and have competitive solutions so
that TSA has a huge amount of choices in algorithms.
That’s what we’d like to do.
But Sandia National Laboratories is itself
not a private sector company.
We’re not after profit, and we enable the market
to come in and compete,
so we’re not trying to compete with them.
And just to note for everyone,
we have had a really good experience with MLtwist.
In fact, they’ve helped us out of a couple of jams,
because they’ve been so on top of things,
a very agile company,
and their technology really sort
of helped us iron out quite a few errors
that we were experiencing.
But I want to note that we’re not saying
that MLtwist is better
or worse than a lot
of the other partners that we’re dealing with.
We work with a lot of people,
and so
we’re not ranking them above anybody,
but we did have a great experience,
and so I’m pretty happy to say that.
Next, please, David. Most
of you, well, not most of you,
but some of you probably know that
the algorithm part is
the most talked about part of a data pipeline.
Your ultimate product is to develop an algorithm.
In TSA’s case, it’s
to develop an algorithm that can detect things.
But in order to get to that algorithm, you have
to collect data, you have to prepare that data,
you’ve gotta get that data to the right place.
You’ve got to label that data.
That is to say, if you have taken a scan
that you’re gonna use to train your algorithm, you have
to label it and say, Hey, there is a threat object in this
scan, and you gotta show where that threat object is.
So there’s a huge pipeline that leads up
to developing that algorithm.
And if you ignore building that pipeline,
and if you ignore making it robust
and comprehensive, you
ultimately won’t get a great algorithm.
So a huge amount of the work we’ve done is
focusing on that data pipeline.
Please, David. This is a very busy
chart, but the idea is to convey
how complex our data pipeline has been
and to indicate to you how errors
and risks propagated through that system.
So the data pipeline for building detection algorithms
for TSA starts with collecting data.
So we may go to a laboratory setting,
and we may pack bags full of threat objects,
things that you wouldn’t want on a plane, things
that are prohibited by regulation.
And you develop a really detailed collection plan,
then you work with people to go
and, for example, run bags through that X-ray.
And if, for example, you’re dealing with explosives,
that’s a very careful and controlled environment.
You have to have a lot
of safety regulations in place,
and you have to really control how that all works out.
So that collection execution is really demanding.
Once you do that, you might get an X-ray
or a body scan image, again,
collected in a lab setting,
not in a live setting, but in a lab setting.
And you would get that image, and you need to make sure
that, for example, if I’ve put a knife in a bag, I need
to annotate, I need to outline where that knife is.
Or if you have a knife on a person, you need
to very closely identify which pixels contain
that threat object and which don’t,
because ultimately that’s what your algorithm is going
to use,
or sorry, that’s what your machine learning process is going
to use to produce an algorithm that can detect those things.
And that labeling and annotation part,
that is extraordinarily challenging.
And I’ll say right up front, that’s where we experienced
a huge amount of risk.
If you don’t have a pipeline
that includes a huge amount of quality assurance steps,
if you are not developing a very tight annotation protocol,
if you’re not watching out for errors all throughout
that process, you can end up with data
that’s no good.
And that affects your algorithm.
The next thing is, we need to get this data shipped
to Sandia’s database.
In order
for this data to get out to the vendors, we need
to have it in a centralized place.
We then merge a bunch of metadata. For example,
when we do scans with people coming
through a laboratory setting,
again, never live, never live,
when we have people coming through scans
in a laboratory setting,
we might look at different body types:
a tall person, a short person,
a big person, a small person.
We need all of those body types.
And so we merge the metadata with those scans
at Sandia.
We then clean up all of that data,
and then we get it into our database so
that it is searchable.
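As an illustration of the merge step described here, the following is a minimal Python sketch and not Sandia's actual tooling. The field names (`scan_id`, `path`, `height_cm`, `build`) and the `merge_metadata` helper are assumptions invented for the example. It joins scan records to subject metadata by ID and reports any scan that has no metadata, so the gap can be fixed before the data is loaded into a searchable database.

```python
from typing import Dict, List, Tuple


def merge_metadata(
    scans: List[dict], metadata: Dict[str, dict]
) -> Tuple[List[dict], List[str]]:
    """Attach metadata to each scan by scan_id; collect IDs with no metadata."""
    merged, missing = [], []
    for scan in scans:
        scan_id = scan["scan_id"]
        if scan_id in metadata:
            # Build a new record so the original scan dict is left untouched.
            merged.append({**scan, **metadata[scan_id]})
        else:
            missing.append(scan_id)
    return merged, missing


# Hypothetical records: two scans, but metadata for only one of them.
scans = [
    {"scan_id": "S001", "path": "scans/S001.bin"},
    {"scan_id": "S002", "path": "scans/S002.bin"},
]
metadata = {"S001": {"height_cm": 185, "build": "large"}}

merged, missing = merge_metadata(scans, metadata)
print(merged)
print(missing)
```

Reporting the unmatched IDs instead of silently dropping them is the point of the sketch: a missing join is one of the small errors that would otherwise travel downstream.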
Once we have that,
we’ve had at least a dozen,
or excuse me, a half dozen vendors on
contract at any one time.
And over the course of the years, we’ve had
well over a dozen, maybe up to 15 vendors
that we’ve worked with who are developing these algorithms.
We’ve gotta get that data out to them. Right now,
on one part of our data pipeline, we’ve been forced
to use hard drives,
but we’ve got a situation where we can move to
cloud in the future to make that a lot more efficient.
So we get that data out to the vendors.
Those vendors are then developing their algorithms.
And then finally, what we’ll do is we’ll test those
algorithms to see how good they are,
and we’ll get those algorithms quantified
for their accuracy.
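The talk does not say which accuracy measures are used in testing. As a hedged sketch only, two measures commonly used for detection algorithms are the detection rate (threat scans correctly flagged) and the false alarm rate (benign scans wrongly flagged). The function name and the sample labels below are made up for illustration.

```python
def detection_metrics(ground_truth, predictions):
    """Return (detection_rate, false_alarm_rate) for boolean threat labels."""
    pairs = list(zip(ground_truth, predictions))
    # Predictions made on scans that really contain a threat.
    on_threats = [pred for truth, pred in pairs if truth]
    # Predictions made on scans that contain no threat.
    on_benign = [pred for truth, pred in pairs if not truth]
    detection_rate = sum(on_threats) / len(on_threats) if on_threats else 0.0
    false_alarm_rate = sum(on_benign) / len(on_benign) if on_benign else 0.0
    return detection_rate, false_alarm_rate


# Hypothetical test set: four threat scans followed by four benign scans.
truth = [True, True, True, True, False, False, False, False]
preds = [True, True, True, False, True, False, False, False]

detection_rate, false_alarm_rate = detection_metrics(truth, preds)
print(detection_rate, false_alarm_rate)
```

Reporting the two numbers separately matters because a single accuracy figure can hide a trade between missed threats and unnecessary alarms.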
And then, frequently,
what happens is those same vendors will then say, Hey,
we’d like to make another collection request.
You know, we didn’t do well on such and such an item,
and we need to do better.
So we need more data. What I wanna convey about all
of this is when we first started, we thought
that the data pipeline would be just a collection process.
It would just be a matter of circling
the threat in a scan
and then putting it into, you know,
a file share. That was naive.
The reality is that we needed a very complex
set of procedures to handle quality.
By quality, I mean,
did we place a threat object in the right place
in a bag?
And did we put the right items in that bag?
Because that’s something
that algorithms have maybe struggled with in the past.
We might have struggled with annotation quality.
Maybe you needed to annotate.
By annotate, I mean, draw a 3D box
around a particular threat object.
Maybe one of the annotators wasn’t trained well,
and they needed to do it quick, so we needed to
put in quality assurance processes
so we could get that done correctly.
And then we needed to put in error checking
for how they named a file.
Maybe someone gets in a file,
and then the corresponding partner file for
the ground truth annotation was named differently,
and we’ve gotta account for those errors.
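The file-naming check mentioned here can be sketched in a few lines of Python. This is an illustration under assumed conventions, not the project's real rules: it assumes each scan `NAME.ext` should have a ground-truth partner called `NAME_gt.json`, and the `check_pairs` helper and all file names are hypothetical.

```python
from pathlib import PurePath


def check_pairs(scan_files, annotation_files, suffix="_gt"):
    """Return (scans missing an annotation, annotation files that match no scan)."""
    scan_stems = {PurePath(f).stem for f in scan_files}
    matched, misnamed = set(), []
    for f in annotation_files:
        stem = PurePath(f).stem
        # Recover the scan name only if the file follows the assumed convention.
        base = stem[: -len(suffix)] if stem.endswith(suffix) else None
        if base in scan_stems:
            matched.add(base)
        else:
            misnamed.append(f)
    missing = sorted(scan_stems - matched)
    return missing, misnamed


scans = ["bag_001.img", "bag_002.img", "bag_003.img"]
annotations = ["bag_001_gt.json", "bag_002-gt.json", "bag_3_gt.json"]

missing, misnamed = check_pairs(scans, annotations)
print(missing)
print(misnamed)
```

In this example the hyphenated `bag_002-gt.json` and the unpadded `bag_3_gt.json` are both flagged, and the two scans they were meant to pair with are reported as missing annotations.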
So what we found is we started off with a conception
that we would have a simple process,
but in reality, we had 75
or more different things that we had to worry about
that would create error in our pipeline.
And if we didn’t manage the error upfront,
those errors would compound
and just accumulate over time so that by the time we got
to the data, we would have a mess on our hands.
So with that, please, David, next slide.
We ended up working
with MLtwist on body scanner scans.
These scans were collected in a
controlled laboratory setting.
I want to emphasize there’s
no collection in a field situation with
regular passengers.
These are all people who were part
of a collection protocol at a laboratory.
And we had really tight timelines.
And we also knew from our X-ray side of the house
how important quality was.
And this was also a set of data
that we had not really dealt with in depth before.
And we ended up working with MLtwist,
who had a really flexible enterprise system
where they could set up these individual pipelines for us
to handle different types of errors, different types
of approaches to annotation,
and they could easily adjust those same pipelines to handle
problems that would come up that we wouldn’t even know about
until our algorithm developers were working with them.
And in doing that, we were able to drive down all
of our errors in that data pipeline,
and we actually were able to speed
that pipeline up quite a bit as compared to some
of our other efforts, such that
I think we dropped our time from, you know,
something like several months to a few weeks
for getting some data, and we were able to turn around
demonstration algorithms for
TSA so that we could demonstrate these algorithms in a
laboratory setting in a very short amount of time.
So we were able to get some quality and speed,
but that quality and that speed came
because we were able to reconfigure those pipelines
and learn quickly.
And so the piece I want to emphasize
to this audience is you can put together a data pipeline,
and planning and forecasting about
how it should work is really important,
but you also need that agility to change on the fly
and change in a way that’s reliable in order to get
that data pipeline working really well.
Next, please. So a couple of things
that I wanna leave this audience with: if you are going
to be in the business of developing an algorithm,
which is supposed to detect something somewhere, maybe it’s
for TSA, maybe it’s for something else, you need to think
of the entire data pipeline, the data, the processes,
the tools, especially the human interaction with that, all
of the communications across all of your different vendors.
You need to think of that as a dynamic system
with a bunch of feedback loops.
If you get it right, you’re gonna get a virtuous cycle.
That is, things get better and better and better.
And if you get it wrong, you can really go south.
So think of it as an entire system. Don’t think of it
as, here’s the algorithm and then I’ve got data somewhere.
You need to think of it instead as,
I’ve got this complex data pipeline
and I need to manage that especially.
And if I can do that right,
then I can get this algorithm part right.
The other piece is that, over the long haul,
TSA has greatly benefited from owning this data
and driving how that data pipeline works.
And the more control they had over their data,
the more control they had over driving security
in a productive direction.
So I really encourage people, when they’re thinking about
working with algorithms, they need
to think about owning the data pipeline.
Don’t just think about the algorithm part,
think about the data pipeline.
So real quick on some lessons learned.
I’ve talked about it before, but if you don’t manage
that data pipeline carefully, errors accumulate
through your system, and you can end up with a bad situation
with a bunch of data that’s not usable,
and you’ve wasted a lot of money. You need to configure
that data pipeline to be a bunch of discrete small steps.
And when you do that, that means you’re able to check
for errors and correct errors in all
of those little steps in a way
that’s really quick and really efficient.
And then you could put all those steps together
for your data pipeline.
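The idea of small, separately checked steps can be sketched as below. This is a generic illustration, not how MLtwist implements its pipelines: the `run_pipeline` helper, the step names, the allowed labels, and the record fields are all assumptions made for the example. Each step is a transform plus a check, and the run stops at the first failed check, so an error is caught where it occurs instead of accumulating.

```python
def run_pipeline(record, steps):
    """Run (name, transform, check) steps in order; stop at the first failed check."""
    for name, transform, check in steps:
        record = transform(record)
        if not check(record):
            raise ValueError(f"step '{name}' failed its check: {record}")
    return record


# Hypothetical steps: normalize the label text, then sanity-check the 3D box.
steps = [
    (
        "normalize_label",
        lambda r: {**r, "label": r["label"].strip().lower()},
        lambda r: r["label"] in {"knife", "firearm"},
    ),
    (
        "validate_box",
        lambda r: r,
        lambda r: all(lo < hi for lo, hi in zip(r["box_min"], r["box_max"])),
    ),
]

good = run_pipeline(
    {"label": " Knife ", "box_min": (0, 0, 0), "box_max": (4, 2, 1)}, steps
)
print(good)
```

Because the steps are just entries in a list, one can be swapped or inserted without rewriting the others, which is the reconfigurability the talk goes on to describe.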
And that was one of the benefits of working with MLtwist:
we had all of those steps in
little modules, and we reconfigured when we figured out
what we were doing wasn’t working on the collection side
or the annotation side, or whatever it may have been.
And then part of that is being able
to reconfigure that pipeline when you realize
that maybe you started off with an assumption
that doesn’t prove to be true. You need to be able
to re-engineer that pipeline
without breaking everything else.
And that was a huge benefit of working with MLtwist.
And then I would say that the efficiencies
of having a really high quality data pipeline, in terms
of having a really good infrastructure
where you can get that data
into a really good database,
into a really good cloud situation,
and having everything work quickly
and not having to worry about the infrastructure, not having
to worry about those underlying things,
the efficiencies for that are much bigger
than we originally realized.
For example, TSA has not
yet enabled cloud for some sensitive information.
And that has meant that
we can’t use the cloud to ship data back and forth,
and we had to use these hard drives.
And this was on the computed tomography
side, not the side
that MLtwist is helping us with.
The inefficiency involved in that, of gathering up data,
loading it, you know, doing this manual process
of getting it on a hard drive
and shipping it across the country, making sure it gets
to the right place: hugely inefficient, hugely costly,
takes a huge amount of time.
It’s just not the best solution.
And so I would strongly encourage everyone
to think about the infrastructure
of the data pipeline in addition to the processes.
If you have good infrastructure, you’re
gonna make everything work fast.
If you can make everything work fast,
you can figure out problems early,
and you can solve those problems early,
and you make your whole data pipeline work better.
And then finally,
I would just encourage a lot
of the decision makers who may work for organizations
that are more operationally based
or more policy based. You know,
this is typically government organizations.
They don’t always think of the data first.
They think of the operation first.
And that’s usually pretty appropriate.
That’s usually the right thing to do.
But I would encourage everyone to think about
how do I manage my data and control my data?
Because the more I do that, the better I do that,
the better I’m going to make my operations work.
And I think that’s something
that we’ve not always seen, at least until recently,
and the more we’ve seen people pay attention to that data,
the better things have gotten for producing
great operational outcomes. In TSA’s case, that might be
better threat detection algorithms.
And then finally, what I’ll leave you
with is,
if you are a data pipeline provider or an algorithm provider
or something like that, and you can work with that organization
that you’re supporting, and you can really impress upon them
the importance of that data pipeline,
you can get a lot of really good things happening,
where you can get into a virtuous cycle.
For example, one of the things that we’re gonna work
with MLtwist on later this year:
we’re gonna collect great data.
We’re gonna get that data annotated well,
and we produce an algorithm.
That algorithm can be a solid algorithm
but maybe not as high performing as we want it yet,
but we can provide that algorithm back to MLtwist,
who then uses that to speed up their annotation process.
We get better quality, better speed, we then feed
that data back into our algorithm.
We get a better algorithm that we can then maybe push out
to the field or push into a demo.
But that same algorithm we give back to MLtwist, who then
uses it again to do a quicker, better job
of getting that data annotated.
So ultimately what I wanna say is, if you get
that data pipeline right, you can create this virtuous cycle
where you can get faster, better quality,
you can get both of those things.
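The talk does not describe how the algorithm is used to speed up annotation. One common pattern, shown here purely as an assumed sketch, is to let the current model propose labels and route only the low-confidence proposals to full manual annotation, while high-confidence ones get a quicker human review. The `route_for_annotation` helper, the 0.8 threshold, and the sample scores are hypothetical.

```python
def route_for_annotation(proposals, threshold=0.8):
    """Split model proposals into quick-review and full-annotation queues."""
    quick_review, full_annotation = [], []
    for proposal in proposals:
        if proposal["confidence"] >= threshold:
            # Model is fairly sure: a human only confirms or adjusts the proposal.
            quick_review.append(proposal["scan_id"])
        else:
            # Model is unsure: a human annotates the scan from scratch.
            full_annotation.append(proposal["scan_id"])
    return quick_review, full_annotation


# Hypothetical model output for three scans.
proposals = [
    {"scan_id": "S001", "confidence": 0.95},
    {"scan_id": "S002", "confidence": 0.40},
    {"scan_id": "S003", "confidence": 0.80},
]

quick_review, full_annotation = route_for_annotation(proposals)
print(quick_review, full_annotation)
```

As the model improves from one cycle to the next, more scans would land in the quick-review queue, which is one way the loop described above could get faster without giving up human review.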
So I’ll pause there, and
I believe we’re taking questions at the end.
And so I’ll hand this off, I believe, to David.
Thanks, Andrew. That was fantastic.
Thank you for the tee up.
I am Dave Smith.
It is really good to meet you all.
I am CEO and co-founder of MLtwist.
I am also the driver, so they wouldn’t let me present
unless I actually drove the deck.
So if you see any driving problems, that’s me.
So I apologize in advance.
And we’re going to focus on one of the boxes
that Andrew talked about, this data labeling box.
We’re gonna talk a little bit about what’s in that box.
I did take a look at the registration list
before joining the webinar.
A lot of you all are computer
or some sort of architecture experts.
So instead of going into the nuance on 3D data
and all the stuff that goes
into actually labeling it, I thought
what we would do is potentially take a higher level.
So first off, I wanna clarify exactly what Andrew said.
None of the information that you’re going to see
is SSI or has been deemed sensitive.
And from an overall agenda point of view,
I was thinking, okay, let’s talk a little bit about why
we’re making this whole presentation over this.
And then overall, it comes down
to one concept, which is data quality.
So why, right? So the high level of that,
and then let’s take that box that Andrew had,
the data labeling box, and break it out a little bit.
What actually gets involved in an AI data pipeline?
Are AI data pipelines kind of the same
as, like, regular data pipelines?
So let’s talk a little bit about that.
And then finally, we’ll end on a 3D data preparation
approach before handing over to Google.
So first off, AI has data as fuel.
The higher the quality of the data,
the better the AI performance.
So let’s talk about that for a second. Dr.
Andrew Ng is an AI leader, an AI expert over at Stanford.
He shared some insights saying that he was able
to build a model and get the performance of that model to
be the same whether
he used good data or,
sorry, three times more okay
data than he had good data.
So basically, he had a model.
He said, okay, if I give you x amount of good data,
you’re going to perform at this level,
and then if I give you 3x amount of okay data,
you’re gonna perform at that same level.
So the bottom line is you end up needing three times more
okay data than good data.
We’ll talk a little bit about what that actually means.
But the summary is good quality data saves you time
and cost and, from the TSA point of view, prevents false alarms
and allows for accurate decision making.
And I also wanna press the idea that, okay,
if you’re looking at one megabyte versus three
megabytes, not that big of a deal.
But as you grow, you are looking at gigabytes,
you’re looking at terabytes, et cetera, and that data starts
to become tougher and tougher to handle.
And if you need three times the amount of data,
that has a knock-on effect that happens down the road.
You also have the concept of bad quality data.
So, bad quality data.
We don’t really talk about bad quality data,
but the idea is, what happens if you introduce data
that is noisy to the model?
And there’s research out there that shows
that for every 1% of noisy data you introduce,
you have a 1.8% drop in the model accuracy.
So you could be in a situation where you are spending money,
spending time getting data ready, you introduce it
to the model, and the model’s then performing worse
than if you had done nothing, right?
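The two rules of thumb quoted in this part of the talk can be turned into back-of-the-envelope arithmetic. The sketch below rests on loud assumptions: it treats the 3x figure as a fixed ratio, reads "1.8% drop" as 1.8 percentage points of accuracy per 1% of noisy data, and extrapolates linearly, none of which the talk states. The function names and sample numbers are illustrative only.

```python
def okay_data_needed(good_items, ratio=3):
    """Items of 'okay' data needed to match a given amount of good data."""
    return good_items * ratio


def accuracy_after_noise(baseline_accuracy, noise_pct, drop_per_pct=1.8):
    """Linear extrapolation: each 1% of noisy data costs drop_per_pct points."""
    return max(0.0, baseline_accuracy - noise_pct * drop_per_pct)


# 10,000 well-labeled scans versus the equivalent amount of okay data.
print(okay_data_needed(10_000))

# A model at 90% accuracy after 5% noisy data is mixed in.
print(accuracy_after_noise(90.0, 5.0))
```

Under these assumptions, 10,000 good scans correspond to 30,000 okay ones, and 5% noise would take a 90% model down to roughly 81%, which is how labeling effort can leave a model worse off than before.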
So these are things that are real.
These are things that happen.
These are things that are tough to detect,
and sometimes you don’t even realize you have an AI data
pipeline problem until all of a sudden your model
starts performing worse.
There’s all these things that are involved,
and this is why high quality data is important.
So let’s talk a little bit about
what goes into high quality data creation.
You start out with a scan.
So Andrew mentioned this is a test environment,
nothing happening in the field.
You’ve got scan data,
and a lot of people think, okay, I just put
that data into my AI and I’m good to go.
So that’s the ETL: take data,
put it into AI, AI learns, continue.
But the reality is, there’s this concept
of getting data ready for AI.
We’re gonna call it data preparation.
And if you break out data preparation, even at a high level,
Andrew mentioned a process of seventy-something steps.
This is our version of,
again, seventy-something steps.
But there’s a lot that goes on when it comes
to getting data ready for AI.
The idea is not to go through every one of these boxes.
The idea is to talk about the purple boxes,
which MLtwist is very focused on.
These are the things that are involved
that you can apply technology to.
And the goal is you start out with a scan,
and eventually you end up with this concept of,
and for now, we’re gonna call it a JSON file,
but you have a file
that describes what’s going on in your scan,
what’s going on in your data.
It could be a JSON file, it could be another file,
but you are creating a file that describes
what is happening in the data you’re interested in.
For the purposes of this presentation, it’s a 3D scan,
but it can be a 2D image, it can be a text file,
it can be all sorts of other pieces.
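To make the "file that describes what's going on in your scan" concrete, here is a small Python sketch that builds one annotation record and writes it as JSON. The schema is entirely hypothetical (the field names `scan_id`, `modality`, `objects`, `box_min`, `box_max`, and `annotator` are invented for illustration) and is not the format used by MLtwist, Sandia, or TSA.

```python
import json

# Hypothetical annotation for one 3D scan: a single labeled object with an
# axis-aligned 3D box given by its minimum and maximum corner coordinates.
annotation = {
    "scan_id": "S001",
    "modality": "3d_scan",
    "objects": [
        {
            "label": "knife",
            "box_min": [12, 40, 7],
            "box_max": [30, 52, 11],
            "annotator": "A17",
        }
    ],
}

text = json.dumps(annotation, indent=2)
print(text)

# Reading the text back gives the same structure, which is what lets a
# downstream training job consume the file without the annotation tool.
restored = json.loads(text)
```

The same shape would carry over to other modalities by swapping the geometry, for example a 2D box for an image or a character span for a text file.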
So let’s talk about that little arrow that’s in between scan
and JSON.
That arrow is a combination of a lot of different concepts.
So Andrew mentioned the idea that you’re able
to take your AI
and potentially help get that data ready,
get the data labeled or annotated for your model.
You also have tools.
So 3D images require different tools
and different annotation capabilities.
They’re not all built the same.
We’ll talk about that in a second.
We have the concept of people.
People are involved,
they bring their own expertise, their own judgment.
They’re able to make those calls in those gray areas.
And they are able to effectively enhance
and improve the quality of the data.
You have automation, so a lot
of this is powered by automation.
What can you do to make the lives of those people easier,
to make them better at it, to give them the right tools so
that they can do their job in a way
that is good for them and good for the data?
And then finally, quality control.
Quality control sometimes can be
two people working on the same data set,
and then you compare them.
Ideally you’re gonna have some semi-automation in there.
You’re going to have concepts of, oh, are we detecting,
if we’re looking at a scan, two right arms?
Did somebody mess up
and accidentally grab a second right arm label?
This stuff happens. Did they switch the right and left arms?
There are a lot of things
that you can automate on the quality control side
to empower the people to go back
and find those mistakes, on top
of improving the quality that they’re working on.
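Two of the quality control ideas mentioned here lend themselves to a short sketch: flagging a label that should appear only once per scan (the "two right arms" case) and comparing two annotators' 3D boxes with intersection over union (IoU). The label names, helper functions, and sample boxes are assumptions for illustration. Detecting swapped left and right arms would need position information and is not covered.

```python
from collections import Counter


def duplicate_labels(labels, unique=("left_arm", "right_arm")):
    """Return labels that should be unique in a scan but occur more than once."""
    counts = Counter(labels)
    return sorted(label for label in unique if counts[label] > 1)


def box_volume(box_min, box_max):
    """Volume of an axis-aligned 3D box given its min and max corners."""
    volume = 1.0
    for lo, hi in zip(box_min, box_max):
        volume *= hi - lo
    return volume


def iou_3d(a_min, a_max, b_min, b_max):
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # Boxes do not overlap along this axis.
        intersection *= overlap
    union = box_volume(a_min, a_max) + box_volume(b_min, b_max) - intersection
    return intersection / union


print(duplicate_labels(["right_arm", "right_arm", "torso"]))

# Two annotators boxed the same object; a low IoU would trigger a review.
agreement = iou_3d((0, 0, 0), (2, 2, 2), (1, 0, 0), (3, 2, 2))
print(round(agreement, 3))
```

Checks like these do not replace the reviewers; they point them to the scans most likely to contain a mistake.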
And AI data tools are something
of an interesting ecosystem.
So if you haven’t heard of Matt Turck,
he’s an AI thought leader. I highly recommend you
look him up and go to his landscape.
I grabbed some of the different boxes he had
for some of the different modalities of data.
And this is, in our estimation, roughly 10% coverage
of all the companies that are out there.
Different data modalities have different data tools in AI,
and you are going to identify tools,
and there’s not a single MLtwist.
So most MLtwist customers have their data flowing
to different tools, depending on the modality, depending on
what needs to happen, depending on the requirements
of the people or the environment.
Andrew mentioned there are environments
that cannot be connected.
There are all these concepts that are out there.
So what you end up doing is you end up
needing to manage all of this.
And on top of that, you have to also think about
auditability, about versioning.
So a lot
of lawmakers in America have talked about
what happens when we want to apply some sort of regulations.
So we all know about CCPA, the California
Consumer Privacy Act.
We all know about concepts like HIPAA,
different environments.
There’s regulation on privacy.
There’s going to likely be regulation on AI and AI data.
In fact, Europe just passed their own version a few
weeks ago, called the AI Act.
So on top of all of this, you need to account
for where your data has been.
And you need to account for things like,
were the people working on the data
paid a correct wage?
Were they subjected to inappropriate content
and not given the support
or the foundation they need to be able to work on that?
Were they put in front of content
that they should not have been put in front of?
Is there a way to audit all these questions?
And data bias is coming up more and more.
Is there a way to understand the bias
of potentially the people working on the data,
in addition to the bias of the data itself?
These are all things that are going to need
auditability involved.
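One generic way to account for where data has been is an append-only audit trail in which each entry records a content hash of the data and the hash of the previous entry. The sketch below is a common pattern and not a description of MLtwist's system; the `audit_entry` helper, the step names, and the actor IDs are hypothetical, and a real trail would also carry timestamps and tool versions.

```python
import hashlib
import json


def audit_entry(data: bytes, step: str, actor: str, previous_hash: str = "") -> dict:
    """Build one audit record that is chained to the previous record by hash."""
    entry = {
        "step": step,
        "actor": actor,
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "previous": previous_hash,
    }
    # Hash the entry itself so later tampering with any field is detectable.
    serialized = json.dumps(entry, sort_keys=True).encode("utf-8")
    entry["entry_hash"] = hashlib.sha256(serialized).hexdigest()
    return entry


scan_bytes = b"example scan payload"
ingest = audit_entry(scan_bytes, step="ingest", actor="pipeline")
label = audit_entry(
    scan_bytes, step="label", actor="annotator_A17", previous_hash=ingest["entry_hash"]
)
print(json.dumps([ingest, label], indent=2))
```

Recording the actor at each step is what would later let questions about who handled which data, and under what conditions, be answered from the log.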
So with that, we’ll end on MLtwist,
getting data ready for AI.
You’ve got a solution, a hybrid of
automation and human intelligence.
You take data, push it through
to semi-automated labeling, push that through to
different tools out there, different teams,
different automation for quality control
and other types of review.
And then you push that through to the company
that’s building the artificial intelligence,
and they’re able to continue to work on that.
Shout out to the TSA. They have the DICOS standards.
So for their security 3D scans,
they have implemented open architecture. Open architecture is
effectively opening up the ecosystem
to companies like MLtwist.
We wouldn’t be here if it wasn’t for
the open architecture initiative.
We wouldn’t be able
to work on this type of data.
We have DICOS.
DICOS is a 3D file format that’s built off of DICOM
and adds its own security twist to it, if you will.
So there’s a lot of ability to leverage
what DICOM has on working on 3D images in DICOS.
And I wanna call out that a lot
of this is about not reinventing the wheel.
Where you can avoid reinventing the wheel, avoid it.
There’s a lot to do in AI.
I also want to call out that with automation,
you get improved performance.
So the green bar represents the time it takes
to ship data back.
The blue bar is the time it’s taking
for people.
The red bar is the time it’s taking for actually loading,
transforming,
doing all those pieces in the wheel that you saw.
Ideally that red bar will go further
and further to zero over time.
The blue bar will tend to have
some sort of limitation on how fast it can go.
And the better you are at that red bar,
the better you can drive down that blue bar, right?
So I’ll end with this: getting data ready for DICOS.
You have auto processing, auto batching,
sending data through.
You have human in the loop,
all of the things that those people do.
It is a tough job.
It is a job where you really want
to set those people up for success as best as you can.
You want to give them the right tools,
the right reporting, the right automation.
You want to, as much as you can, limit the ability
to make errors
or have those errors impact things downstream or upstream.
You then have automated
or semi-automated quality control.
And then you have the ability to transform all that data
and push it through in a way that companies,
or sorry, entities like Sandia, like their partners,
like others, are able to then develop AI models
and continue to improve them.
So with that, I want to call out that a lot
of MLtwist is powered by Google Cloud.
And I’d like to hand over to Steve
to talk a little bit about Google Cloud.
And Steve, I will be your driver for the next session.
So just tell me when you need
to switch the slides. Over to you.
Wonderful. Thank you, David.
It’s a fascinating project.
I love working on these types of technology.
So if you can go to the next slide.
I just wanna spend a couple of minutes talking about Google
Cloud and why trust your data with,
or operate on, Google Cloud.
So the first thing I just wanted to say quickly is
that Google is a data company through and through,
and we are fully
supporting our public sector customers
in their mission set.
So we have a dedicated company that just works
with public sector,
and that
supports our customers on the cloud.
So Google Cloud operates a FedRAMP Moderate worldwide
planetary infrastructure.
So for missions
and customers that have distributed applications
or data needs
or processing needs,
we can operate in FedRAMP Moderate across the planet.
And we also operate a FedRAMP High cloud
in the CONUS environment.
And we do also have the ability
to address disconnected mission applications
where there is no cloud available.
So this is just a real quick map
of all of our data centers.
We operate in 40 regions, with 121 zones,
and we continue to build this out.
Next slide. So, Google was built from the ground up
to be a multi-tenant environment to support, again,
massive applications.
You can think of, you know, the old G Suite, now
Workspace, Google Maps,
YouTube, you name it.
So it has been built as a
defense in depth type of architecture.
Each domain is discretely separated out
with multiple controls.
So
each section of our defense in depth has
complementary controls.
So even if one fails, others are there to pick it up.
And we’re gonna drill down a little bit into
how we differentiate a little bit as well.
Next slide. So this is the main point,
two main points I want to get across
to y’all on this call, which is:
you own the data, you own the mission data.
Google has no control over it.
You define who can get access to it.
At no time does Google have any visibility into that.
So whether it’s enforcing data residency, keeping it,
you know, wherever you’re operating based on the
regulations you are being subjected to, if,
you know, you have to be CONUS, if you have
to be in very specific data centers, you set that.
And it is guaranteed to operate in
that regulation.
You can also run Assured Workloads, which, effectively, think
of it as guardrails.
As you’re going down the highway, you know, there’s the tendency
to maybe drift. Assured Workloads makes sure that nothing
changes or gets you out of regulation compliance.
So think of those as the guardrails.
Access Transparency means everything that happens
to your data is logged.
And in many cases, for sensitive data, you have
to have multi-party control, or multi-party approval.
So if Google needs to access the data,
you have to approve it.
So if we’re trying to, you know, work on a support ticket,
for example, we can’t operate on
or look at that data without your express approval.
It is logged, recorded, auditable, all the way through.
The guarantee is that
everything is encrypted on Google by default.
Nothing is stored ever in the clear.
You have complete key control should you want it.
Google also provides cryptographic
key management as well.
But at no time do we store anything in the clear.
So that’s data in flight, data at rest, data, you know,
in processing. Everything is encrypted at all times.
And we even take it one step further
with confidential computing.
So as data moves through the network
and it hits a CPU, normally
those instruction sets are decrypted, operated on by the CPU,
and then encrypted on the way out.
Confidential computing allows it
to remain encrypted even while processing.
So you have complete control, end-to-end
encryption all the way through.
Next slide. So the one thing
that’s pretty unique about Google is
we rely on no third-party software.
So we avoid the third-party
software vulnerabilities that have
impacted a lot of our customers.
We purpose-build our servers.
They’re all built by us, for us.
They all run this Titan chip.
It’s also in our Pixel phones,
but essentially it’s a chain of trust for the hardware,
for attestation.
So we build everything ourselves.
It doesn’t come with video cards, it doesn’t come
with any unnecessary peripherals.
So we wanna limit what we have to protect.
So it’s very purpose-built,
and that extends to our storage.
When we are operating cloud-native storage,
it’s all purpose-built, no third party involved.
You can use third-party vendors if desired.
So again, everything is built to eliminate that risk
of supply chain vulnerability.
Next slide. And then on top of that, we layer in
this zero trust concept, which
we pioneered well over a decade ago.
Nothing inside of Google is trusted by default.
So every user, device, machine, service,
and piece of code has to have an identity
and a cryptographic signature to be able to operate.
So every person has an identity.
The machines, even the code:
you can turn on Binary Authorization to only allow code
that’s been cryptographically signed.
And then data protection as well.
So we trust nothing inside the network
and protect you at all times.
Next slide. All right, so that’s the why
trust Google Cloud,
and then there’s the, what can I do with it?
Okay, so I trust Google Cloud, what can I do with it?
So with MLtwist, you know, pipelining, getting
that data in, is foundational,
and we’ve spent a lot of time talking about that.
Once we get that data in, across the bottom
you see pipelines, we start to build up on top
of this data.
So, David mentioned auditability.
We also call it explainability: being able
to monitor the model performance over time, being able
to operate MLOps, which we define as essentially
that continuous virtuous cycle that David mentioned, going
through, making sure that it’s getting better,
that you’re keeping track of all of the weights
and the features, and why did something
score the way it scored?
Keeping all of that over time as an
MLOps type of operation.
So then I’m gonna switch to the top
section of this for a second.
So, Google believes in using
whatever tool is the most appropriate for the job.
So if that exists
in another model, another
location, that’s fine.
We operate first-party tools,
meaning Google-designed tools, as well as third-party.
So whether it be Llama
or Anthropic, you name it, there’s
an entire Model Garden.
We also offer these prebuilt, pre-trained models.
So all you have to do is provide the data from
someone like MLtwist,
and we can use these prebuilt, pre-trained models
right out of the box to get started.
You can also do some fine-tuning.
So whether it be vision, you know, object detection,
translating text, annotating video,
speech to text, all of that type of work
that you might want to do, as well as your tried
and true prediction
and forecasting operations that you need
to do in AI as well:
will something happen given a set of circumstances?
So all of these are prebuilt out of the box, easy
to use and get started.
You don’t have to be an ML developer.
You can start getting immediate business value.
And those will run in kind of two ways.
You can run that in Vertex AI Studio,
which is on Google Cloud,
or you can use notebooks if you’re starting to get,
you know, into a data science type of environment.
And of course, all this runs on top of Google’s
IaaS infrastructure.
So all the cloud storage, data warehousing, databases,
across the board.
So it’s all integrated, and all managed services.
So you don’t have to spin up a data warehouse, you don’t have
to spin up a database and manage those things.
These are all just services.
You just point to ’em
and you can easily operate in the cloud.
We try to make it very easy.
And with that, I’m gonna say thank you very much
for your time and attention.
I appreciate everyone being here.
Alrighty. We can go ahead and move into Q&A.
So if anyone has any questions,
please feel free to put them in the chat.
We have our first question here.
If you are able to take advantage
of cloud computing resources
to scale up your processing capacity,
what are the benefits and drawbacks to using ELT versus ETL
to prep the data in your lake/warehouse?
So when you
think of
ETL, right,
or ELT versus ETL,
I think these concepts are important.
I think they’re probably better suited for
what happens in an environment when you’re trying to think
of, like, I have a lot of data to move.
When you introduce human in the loop, it kind
of puts things a little bit off the rails.
So, I’m not quite sure what the question was getting at
with the concept of pushing it to the cloud environment.
But effectively, what I would say is a lot
of people today are making this stuff run on their
laptops, right?
Like, a lot of people wrote up a script
and did the thing and then put it in a folder
and then booted up some SaaS
and uploaded it to a wizard, like, created a project,
and like that.
That’s kind of the reality we see today for a lot of these
ETL concepts.
And pushing it to the cloud, getting it
to work in the cloud, is important.
Whether it’s ELT or ETL, the question then kind
of becomes, how do you adapt to human in the loop?
We stick with the ETL version,
so you still need to kind of, like, mess
with the data before you load it.
But I hope, at a high level,
that helps answer that question.
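For readers less familiar with the distinction in the question, the difference between the two orderings can be sketched roughly as below. This is only an illustrative sketch, not MLtwist's pipeline: the in-memory SQLite database stands in for a warehouse, and the names RAW_ROWS, normalize, etl, elt, and fetch are all hypothetical.

```python
import sqlite3

# Hypothetical raw annotation rows; values are made up for illustration.
RAW_ROWS = [{"id": 1, "label": " Knife "}, {"id": 2, "label": "GUN"}]


def normalize(label: str) -> str:
    """Clean a label in pipeline code (trim whitespace, lowercase)."""
    return label.strip().lower()


def etl(rows):
    """ETL: transform in the pipeline first, then load only cleaned rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE labels (id INTEGER, label TEXT)")
    cleaned = [(r["id"], normalize(r["label"])) for r in rows]
    db.executemany("INSERT INTO labels VALUES (?, ?)", cleaned)
    return db


def elt(rows):
    """ELT: load raw rows as-is, then transform inside the warehouse."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_labels (id INTEGER, label TEXT)")
    raw = [(r["id"], r["label"]) for r in rows]
    db.executemany("INSERT INTO raw_labels VALUES (?, ?)", raw)
    # The transform runs as SQL after loading; the raw table is kept.
    db.execute(
        "CREATE TABLE labels AS "
        "SELECT id, lower(trim(label)) AS label FROM raw_labels"
    )
    return db


def fetch(db):
    """Return the cleaned labels in a stable order."""
    return db.execute("SELECT id, label FROM labels ORDER BY id").fetchall()
```

Both paths end with the same cleaned table. The difference is where the "mess with the data" step happens, which is the step a human-in-the-loop review would most naturally attach to in the ETL ordering described above.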
Our next question here is for Andrew.
What event triggered you to use an external solution
to build your data pipelines?
Yeah. I would say a couple of things.
One is we’re a research laboratory.
So it’s just the type of personnel we have.
It’s not cost effective for us to do that.
We would tie up a lot of resources that are meant for
other work, basically meant for the right end of
the machine learning or the data pipeline.
We would end up tying up a lot of resources
in data preparation.
And that’s just not cost effective for us to do.
So that was one reason that we went external.
The other reason we went external is
after a few years of doing this with computed tomography,
or 3D X-ray, we were doing that with the X-ray side,
we realized it was so complex
that if we were gonna really manage this
to the quality that we wanted to manage for AI data,
we’d have to do a huge number of steps.
We’d have to recreate some stuff.
And it was better for us to go
with an enterprise solution
where someone already had the ability
to set up those pipelines really easily.
They already had the infrastructure,
and that paid off for us.
So we could have recreated, you know,
a data pipeline that had that flexibility in theory,
but it was just ludicrous for us to do that given
our staffing configuration.
So cost effectiveness.
And then also someone already had a huge amount of agility
and flexibility built into these configurable pipelines.
And so that was really attractive to us.
We have another question. How do you evaluate AI
performance and when you need more or better data?
I can take that.
So if you think of the pipeline,
and I do think this is generalizable,
but at least with TSA, you start off with a collection.
Let’s say you want to detect knives
that people might wanna smuggle on their
bodies to get onto an airplane.
You collect a set of
scans on those knives.
And let’s say you collect,
I don’t know, 20 scans per region of the body.
Maybe I’ve got one that’s hidden here, one
that’s hidden in the center of my chest, one
that’s on my back, et cetera.
And you develop your algorithm,
and maybe for your algorithm to be effective
at the level that TSA requires, you need
to collect double that number
in another part of the body.
Let’s just say
that’s my left shoulder.
You have to collect double that number on the left shoulder.
Well, you’re not gonna know that until that AI developer,
that algorithm
developer, develops their algorithm.
And then what we do is we have a really detailed testing
and evaluation system where we go in and we test
and say, okay, this algorithm works,
it finds knives on all parts of the body except
that left shoulder. By the way, I wanna be clear
that I’m being purely hypothetical.
And then what we would do is we’d say, you know what?
We need to go back
and collect more data on that particular region,
’cause the more data that algorithm has, the better off it is.
Now there’s a step that I sort of skipped in there.
In this specific example, one of the things
that MLtwist helped us to do is
not just annotate where that knife was
in the example, but they also helped us create some
metadata for all those scans.
They helped us to annotate what regions
of the body things were in.
So if you have a process that’s really fast,
you can add metadata at not
that much more cost, which then in turn allows us
to diagnose what the problems are with algorithms
with a lot more fidelity, a lot more precision.
And there have been multiple situations
where we’ve worked with algorithm developers where,
because we had all of that information, we were able
to pinpoint something and they were able
to correct it within a couple of days.
Hopefully that answers the question.
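The per-region diagnosis described in this answer can be sketched in a few lines. This is a minimal sketch under stated assumptions, not Sandia's actual test and evaluation system: the region names, the sample results, and the functions detection_rate_by_region and regions_needing_data are all hypothetical, and the threshold is an arbitrary illustrative value.

```python
from collections import defaultdict

# Hypothetical test results: (body region from scan metadata, was the
# threat detected?). Values are invented purely for illustration.
SAMPLE_RESULTS = [
    ("chest", True), ("chest", True),
    ("back", True), ("back", False),
    ("left_shoulder", False), ("left_shoulder", False),
]


def detection_rate_by_region(results):
    """Group test results by region and compute the fraction detected."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for region, detected in results:
        totals[region] += 1
        if detected:
            hits[region] += 1
    return {region: hits[region] / totals[region] for region in totals}


def regions_needing_data(results, threshold):
    """List regions whose detection rate falls below the threshold."""
    rates = detection_rate_by_region(results)
    return sorted(region for region, rate in rates.items() if rate < threshold)
```

Calling regions_needing_data(SAMPLE_RESULTS, 0.9) on this made-up data flags the back and the left shoulder as the regions to collect more scans for. Without the region metadata on each scan, only a single overall detection rate would be available, and the weak region would stay hidden.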
Andrew, random question from me. Like,
when you’re looking at this,
you’ve been talking a lot about this from working on the
TSA side, and I know you mentioned originally Sandia has, like,
a fairly broad mission across things like nuclear.
How does this,
have you seen this concept introduce itself in other
parts that are non-TSA related?
Like, how do people think about this from
those points of view?
Because what’s interesting with TSA is, like,
the file scans, 3D scans, are pretty bulky, right?
So have you seen this also happen where it’s like a flood
of data, but really small levels?
Or just in general, any insights you have on
how does this get approached, kind of from
Sandia’s point of view, when they’re building AI
or when they’re working on building AI,
and, like, when does it kind of hit that point of, okay,
now we gotta take this to the next level?
Oh boy. There’s a really,
really broad set of experiences.
I’ll say, in my sort of
view, when I look at some of the other projects I’m involved
with and some of the views around me, thinking
of the data pipeline comes up when
you get a huge amount of data,
and that’s not necessarily a large size of data.
Like, the CT scans are really large themselves,
but maybe you only have a couple
hundred or something, I don’t know.
But when you have a large amount of data, then yes.
And what you see in Sandia is they do actually end up
recreating their own data pipelines.
Now, a lot of that data is from other, I won’t even say
what the other projects are,
but a lot of it is highly classified.
And so you can’t necessarily go external.
But what they end up doing is they recreate all these
pipelines.
They ingest the data
and then they have people
working on the annotation.
And the ones that do it well do treat it as a pipeline
where they take their initial algorithm
that may have discovered something,
or that may be able to detect something,
and then they put that into their pipeline,
and then they use that to sort of create that virtuous cycle
where they’re ground truthing things
a lot faster and with a lot higher quality.
However, in a lot of situations,
and there are other projects I’m on,
you end up in a situation
where people don’t plan for that data pipeline ahead
of time, and you run into these massive problems
of trying to create quality data.
And
we’re working on
another X-ray project separate from the TSA work
where we’re trying to discover smuggled
wildlife products in other countries.
And the data collection process
is extraordinarily expensive.
And I don’t know, I think it’s probably
50% of our cost in trying
to develop these algorithms.
So, my hope is
that people would figure out thinking about things
as a data pipeline and figure out ways where
external vendors such as yourselves could go ahead.
They would already have the clearances,
like you all are going through this process
of getting the clearances and so forth,
and some of your people have gotten some of that stuff.
That would be a huge help
to a huge number of projects in Sandia.
Some of them will never go outside Sandia and shouldn’t,
but a lot of them could.
And I would just say that we will rediscover the need
for having a really efficient data pipeline over and over
and over, and I hope that that ends someday,
and that they just, you know, just go
with an existing solution.
Thanks, Andrew. Sorry, Brittany,
I got opportunistic there.
No, all good. We have one more question here.
Feel free to continue dropping
those questions in the chat though.
What differentiates MLtwist from an open source solution?
That’s a good one.
So, kind of what Andrew was saying: hey,
you have all these different pieces
that are out there, and in theory you can kind of build it
yourself.
In fact, I would say that’s probably what the majority
of companies are doing. Over time,
as Andrew mentioned, the data collection,
like the volumes, like there’s these other things,
and you also have these concepts of
regulations starting to creep in.
So open source is typically a way to take
some code that was out there
and use it in your AI data preparation process.
But the thing to be conscious of is
somebody still will be maintaining it, right?
Somebody still will be putting that all together.
Somebody still will be having to make it work in the cloud
and put it in the right environment
and then integrate it with whatever human
or validation components
and semi-automate the quality control.
So it’s not to take away from the idea
that yes, it’s a different path.
You kind of have software as a service,
you have pure services, like, I’m gonna get a consultant
and consult,
and then you’ve got the open source path.
So MLtwist styles itself more on the SaaS
type of play, of being able to
put all these pieces together, have it flow,
and let MLtwist kind of maintain all of those
components and all of that architecture.
And that said, there is open source where you can kind
of, like, get somebody to put
that all into their own cloud environment
and build it and maintain it.
And it will be very, very unique to them,
a blessing and a curse.
It’s unique. It’s also bespoke,
that person moves on, all those things.
But I hope that helps, Brittany. It’s a good question.
Definitely. Building the bespoke things,
at least in Sandia on various projects,
is sometimes necessary, but it’s hugely expensive.
Most of the time it’s not the right choice.
It’s better to go for an existing solution.
Yeah. And we do the same ourselves.
Like, we didn’t go out
and build a bunch of labeling tools that existed, right?
That’s not our shtick.
We didn’t go out and hire thousands of people, right?
So, I kind of want to re-echo that concept:
there are things that, if that is your core,
God bless, go do. If that’s not necessarily your core,
maybe that’s how to think about it.
And we have one more question here.
How does Google Cloud support companies that want
to operate in FedRAMP or DOD environments?
So, I think I mentioned
we do maintain FedRAMP Moderate across the planet,
as well as FedRAMP High in CONUS operations.
So we have an entire public sector group that is waiting
to engage with you
and support our public sector customers.
So that is our mission,
and we wanna help you with your mission.
Thank you. Do we have any other questions?
I did miss the DOD one, since I cover primarily
federal, but we do also operate in the IL2, IL4,
and IL5 space as well.
So I apologize for leaving out
that customer base.
Thank you, David. I wanna thank all
of our participants as well
as David, Andrew,
and Steven for being with us today.
We hope our webinar has been helpful for you
and your organization.
June 25, 2024 / 2pm EST / 11am PST
The Ultimate Guide to AI Data Pipelines: Learn how to Build, Maintain and Update your pipes for your unstructured data