Open Data Science Conference West – 2023
MLtwist founder and CEO David Smith recently spoke at the 2023 edition of Open Data Science Conference West in San Francisco. David’s talk centered on generative AI data pipelines. Consult the abstract to get an overview, or watch the talk.
00:03
Testing, testing, yeah, there we go. Hi everyone, thank you so much for coming. I’ll talk a little bit about generative AI data pipelines and the future of data orchestration. Little bit on what we’re gonna cover in 10 minutes, probably less given our…
00:21
I’m feeling right now with too much coffee. Problem and solution overview. So we’ll talk a little bit about AI data. LLM refreshers, which I’ve noticed is a little bit out of date, so I will correct myself on the fly. Data pipelines, and then finally generative data pipelines. Little bit about me, so as was mentioned, I spent some time, I joined Google through an acquisition of a company.
00:47
at the time on data for dashboards. That eventually turned into data for models, and then eventually data for AI when I was at Oracle. I guess I’ve always kind of seen myself as sort of a delivery person for data scientists. And the idea behind AI data is that the pipelines that maintain them that generate the AI data,
01:12
maintain, they’re costly, and we all know that this is going to become easier. So there’s a lot of companies hard at work at making sure that this stuff becomes a lot easier to do, and companies are now starting to look into the opportunity to leverage LLMs to make this happen. If you look at AIin study recently, they’re saying that effectively the biggest challenge for companies getting involved in AI, whether
01:40
leveraging it or building it themselves is the data. Cleaning the data, transforming the data, all the fun stuff that sometimes gets termed data janitorial work.
01:52
And there’s a good reason for this. So this is a little bit of an eye chart, but the idea is that, and by the way, this is gonna be a build, like I did it on PowerPoint, it was going to go great, and then they said PDFs only, I was like, oh, okay, so now you get the eye chart. What you’ve got is the yellow chevrons are sort of the tools involved in making AI data happen. And then.
02:14
The blue chevrons are the glue that transports data from tool to tool to tool. You have to buy the data, you have to store the data, you have to apply AI and pre-trained models to the data to effectively do what’s called the zero-shot, to take a first pass at that data. Then you need to have humans involved, but you need to display that data in a way that the humans can actually work on it. And then finally you need to push that data into an MLOps platform so that you can actually build your model.
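As a rough illustration of the stages listed above (store, zero-shot pass, human review, push to an MLOps platform), here is a minimal Python sketch. Every function name and record field is hypothetical and chosen for illustration only; the keyword rule stands in for a pre-trained model and the review step stands in for real human reviewers. It is not the pipeline shown on the slide.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]


def store(records: List[Record]) -> List[Record]:
    # Stand-in for writing to storage: tag each record with a storage key.
    return [{**r, "stored_as": f"raw/{r['id']}"} for r in records]


def zero_shot(records: List[Record]) -> List[Record]:
    # Stand-in for a pre-trained model taking a first pass at the data.
    # A trivial keyword rule is used here purely for illustration.
    labelled = []
    for r in records:
        label = "invoice" if "invoice" in str(r["text"]).lower() else "other"
        labelled.append({**r, "label": label, "label_source": "zero_shot"})
    return labelled


def to_review_tasks(records: List[Record]) -> List[Record]:
    # Shape the data so that a human reviewer can work on it.
    return [{**r, "review_status": "pending"} for r in records]


def simulate_review(records: List[Record]) -> List[Record]:
    # Placeholder for the human step: this sketch approves everything.
    return [{**r, "review_status": "approved"} for r in records]


def export_for_mlops(records: List[Record]) -> List[Record]:
    # Keep only approved records, in the minimal shape a trainer might need.
    return [
        {"id": r["id"], "label": r["label"]}
        for r in records
        if r["review_status"] == "approved"
    ]


def run(records: List[Record], stages: List[Stage]) -> List[Record]:
    # The glue: hand the output of each stage to the next one.
    for stage in stages:
        records = stage(records)
    return records
```

Calling `run(records, [store, zero_shot, to_review_tasks, simulate_review, export_for_mlops])` walks a list of records through every stage in order; swapping a stage means swapping one entry in that list.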
02:44
There’s a lot of ways to approach this concept. And there’s recently been a ton of investment in open source technology in order to make that happen. And MLtwist is also working on this challenge. I got told that this was not a sales pitch, so I’m not gonna sell the company. Instead, this is more about just letting you know, hey, the blue stuff’s important, right? And this is the slide that I kinda wish I could go back this morning and change up, because.
03:11
effectively you have three main options. You’ve got LLMs provided through cloud, LLMs that are proprietary, and then LLMs that are open source. When it comes to the intuitiveness, what you would expect is that for open source LLMs, you can kind of do what you want. You can tune them, you can train them, you can obviously, expensive, time consuming, and then the higher up the stack you go, effectively the less you can mess with it. What I realized is that there are,
03:42
So instead of saying cannot be tuned, those should say coming soon. So there’s a lot of opportunities, or sorry, there’s a lot of plans to make these LLMs more accessible to their customers, but the concept still remains that open source LLMs, if you want to build your own, very time consuming, very difficult, sorry, difficult, they’re tougher in general versus something out of the can, but at the same time, some of the stuff that comes out of the can can be a little tougher to make your…
04:13
And then we get to AI data. So AI data differs in a couple of key aspects from other ETL. Most ETL, it’s like, hey, I’m gonna get these zeros and ones from one side of the database to a different database, and I need to make sure that they’re perfect. When you look at AI data, oftentimes what you’ll have is you’ll have an unstructured file, and at the end, you need a JSON. And that JSON needs to accurately describe what’s going on in your unstructured file.
04:40
for your AI model to learn from. And what you often have is you have some sort of data quality control. This quality control ends up being done by different aspects, but a key piece of quality control in almost all AI, including our friends over at ChatGPT, is human in the loop. Call it RLHF, call it what you want. There is a human component that’s involved.
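To make the "unstructured file in, JSON out" idea concrete, here is a small Python sketch of what such a record might look like, and how a human review could be recorded on top of a zero-shot pass. The field names, the file URI, and the reviewer ID are all hypothetical; the talk does not specify a schema.

```python
import json
from typing import Dict, List, Optional


def make_annotation(file_uri: str, labels: List[Dict[str, object]]) -> Dict[str, object]:
    # A JSON-serializable description of one unstructured file,
    # as produced by a first (zero-shot) pass.
    return {
        "file": file_uri,
        "labels": labels,
        "source": "zero_shot",
        "reviewed_by": None,
    }


def apply_review(
    record: Dict[str, object],
    reviewer: str,
    corrected_labels: Optional[List[Dict[str, object]]] = None,
) -> Dict[str, object]:
    # Return a new record; the human either confirms the labels or replaces them.
    return {
        **record,
        "labels": corrected_labels if corrected_labels is not None else record["labels"],
        "source": "human",
        "reviewed_by": reviewer,
    }


draft = make_annotation(
    "s3://example-bucket/doc_001.pdf",  # hypothetical location
    [{"name": "document_type", "value": "report"}],
)
final = apply_review(draft, reviewer="expert_17")
print(json.dumps(final, indent=2))
```

Keeping `source` and `reviewed_by` on the record is one way to tell machine-generated labels apart from human-validated ones later in the pipeline.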
05:04
And for standard ETL that involves just pushing data across, that is a question that you need to address. The other one is around directed acyclic graphs, right? So what you have is you have this concept of not needing to go back. Basically, the whole concept is to go in one direction. Well, turns out with human in the loop and feedback loops, you actually do have this concept of going back.
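One way to picture the "going back" edge that a strict one-direction graph lacks is a work queue in which rejected items are sent around again. The Python sketch below assumes hypothetical `produce` and `review` callables and an arbitrary retry limit; it illustrates the idea and is not how any particular orchestrator implements it.

```python
from collections import deque
from typing import Callable, List, Tuple, TypeVar

Item = TypeVar("Item")
Output = TypeVar("Output")


def process_with_feedback(
    items: List[Item],
    produce: Callable[[Item, int], Output],
    review: Callable[[Output], bool],
    max_attempts: int = 3,
) -> Tuple[List[Output], List[Item]]:
    # Each queue entry is (item, attempt number).
    queue = deque((item, 1) for item in items)
    accepted: List[Output] = []
    abandoned: List[Item] = []
    while queue:
        item, attempt = queue.popleft()
        output = produce(item, attempt)
        if review(output):
            accepted.append(output)
        elif attempt < max_attempts:
            # The backward edge: the item re-enters the pipeline.
            queue.append((item, attempt + 1))
        else:
            # Give up after max_attempts and surface the item for follow-up.
            abandoned.append(item)
    return accepted, abandoned
```

The retry limit keeps the loop from running forever, which is the practical concern once a pipeline is allowed to contain cycles.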
05:30
So this is something else that ETL needs to take into account as it starts to address this concept of getting data ready for AI. And again, this is another one of my build slides. Ta-da! What you have is you have data. You need to do stuff to that data, batch it out, chunk it up.
05:50
to a zero-shot model, which gets that data ready for human validation in some way, shape, or form. And then you push that into a tool where the humans can go ahead and do their thing and make sure that that data is going to be of high quality and not sink your model. If that works, then you get to open the floodgates, let everything release. And if that doesn’t work, then you need to go back, right? So think of it kind of like a manufacturing process. You know the Costco can of beans you buy, and you get that.
06:20
Okay, if you bought batch number 34 or batch four.
06:26
Data with human in the loop can actually behave a little bit of the same way, where you end up batching the data, and there are some batches that are good and some batches that are not good. And they all need to be managed differently, yet the same, and that’s kind of where you have these ETL concepts coming in. So at MLtwist, one of the first things we thought was, okay, great, let’s just go ahead and tell an LLM what we’re trying to do.
06:53
to get a zero-shot thumb to a data set. We’re going to try to push it.
07:04
That is kind of one of the ways that some of the other companies are attacking this problem. It’s just expecting an LLM to know how to do this stuff. What we’re finding is that it is possible to do, but you have to go a different direction. So we’ll talk a little bit about that. But in terms of the overall goal, what you’re hoping for is you’re hoping for the ability to describe to a model, hey, I’ve got this data set. I need this to happen to it. We’ll talk a little bit about how MLtwist…
07:33
I’m tired of all that.
07:36
This is kind of the dream. In terms of an actual pipeline, when you look at it, it looks something like this. I extract data, I apply all these different things to it, I push it to the next stop, something else happens, I extract it again, all these different things need to happen, then I’ll push it to the next stop again. Continue to do that.
07:57
Well, LLMs are going to be involved by effectively building this infrastructure or using the infrastructure or foundation that’s built. This concept of connectors and components. You’ll be able to swap out connectors, swap out components, and then you’ll also have an LLM take a zero shot or a first attempt at an orchestration approach.
08:15
So when you build your components, you have two options. Right now we have a lot of great open source libraries, but they’re not quite components, right? They’re functions. They’re different types of calls that an LLM has to figure out when and where to use. What you’re going to do is you’re going to start building components, things that are purpose built for different types of unstructured data.
08:37
and you’re going to be able to chain those together in order to get your data into the next stop. This is what that looks like. So now you have API integrations. You have APIs to pull and push into that ecosystem because remember, that ecosystem that was built was actually one that has a lot of money invested and a lot of those tools end up doing some really amazing things for the types of data that they have.
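The split between connectors (which pull from and push to tools) and components (which are purpose built for one type of data) can be sketched in Python as follows. The class names, fields, and the in-memory connector are assumptions made for illustration; real connectors would wrap the API integrations mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class Connector:
    # Moves records in and out of one external tool.
    name: str
    pull: Callable[[], List[Record]]
    push: Callable[[List[Record]], None]


@dataclass
class Component:
    # Purpose built for one data type; other records pass through untouched.
    name: str
    accepts: str
    transform: Callable[[Record], Record]


def memory_connector(name: str, backing: List[Record]) -> Connector:
    # In-memory stand-in for a real API integration.
    return Connector(name=name, pull=lambda: list(backing), push=backing.extend)


def run_hop(source: Connector, components: List[Component], sink: Connector) -> List[Record]:
    # One hop: pull, apply each component to the records it accepts, push.
    records = source.pull()
    for component in components:
        records = [
            component.transform(r) if r.get("type") == component.accepts else r
            for r in records
        ]
    sink.push(records)
    return records
```

Because a hop only depends on the `Connector` and `Component` shapes, swapping a storage tool or a transformation means replacing one object rather than rewriting the glue.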
09:01
The question becomes, can an LLM determine which tools to use, what types of data transformations need to happen for those data formats, and then after that, what type of data gets to be pulled out of that to be pushed into your MLOps solution.
09:19
And finally, you have the actual zero shot. So the LLM, again, whether it’s a ChatGPT version or a PaLM 2 version or some version that you train yourself is going to be able to start to put that all together.
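If an LLM takes the first attempt at orchestration, its proposal still has to be checked against the components that actually exist. The Python sketch below assumes the model's output has already been parsed into a plain dictionary; the plan format, the registry layout, and the component names are all hypothetical, and no model is called here.

```python
from typing import Dict, List

# Hypothetical registry: what each known component accepts and produces.
REGISTRY: Dict[str, Dict[str, str]] = {
    "pdf_to_text": {"accepts": "pdf", "produces": "text"},
    "text_to_labels": {"accepts": "text", "produces": "json"},
}


def validate_plan(plan: Dict[str, object], registry: Dict[str, Dict[str, str]]) -> List[str]:
    # Return a list of problems; an empty list means the plan can be run.
    problems: List[str] = []
    current_type = str(plan["input_type"])
    for step in plan["steps"]:  # type: ignore[union-attr]
        spec = registry.get(step)
        if spec is None:
            problems.append(f"unknown component: {step}")
            continue
        if spec["accepts"] != current_type:
            problems.append(f"{step} expects {spec['accepts']}, got {current_type}")
        current_type = spec["produces"]
    if current_type != plan["output_type"]:
        problems.append(f"plan ends with {current_type}, wanted {plan['output_type']}")
    return problems


# Stand-in for a parsed model proposal.
proposed = {
    "input_type": "pdf",
    "steps": ["pdf_to_text", "text_to_labels"],
    "output_type": "json",
}
print(validate_plan(proposed, REGISTRY))
```

A non-empty problem list could be fed back to the model as a correction prompt or routed to a person, which keeps the zero-shot plan a first attempt rather than the final word.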
09:33
The side effect of tackling this far is that you have clean composable pipelines that can be used for general purposes. And also this ended up getting the grant from the Department of Energy, which involved us integrating into several different databases that were materials types of databases and attempting to extract information out of unstructured data to push into a structured database. The ability to do this is massive. It’s not just materials, it’s not just…
10:02
these other pieces, like you would be surprised at what’s in some of these databases. In some cases we’re getting photos of somebody’s bicycle trip. How is that related to laser beams being pointed at teeth, for example? There’s a lot of different questions that come up, but at its core it’s always the same thing. It’s can we take unstructured data, get what is meaningful out of this, and push it into something else and potentially have human in the loop to validate it?
10:32
And at MLtwist, we are…
10:38
So the DOE thing was with Berkeley National Labs.
10:43
have been really kind to let us publicly talk about them. Captain and I cited us as a top AI data tool. Stanford’s AI department’s work allowed us to cite them as a customer. The point I’m trying to make is that as you go down this path, it is really, really rewarding if you can get it to a stage where you make it easy, not just for the data scientists, but for the experts who are working on the data themselves. Because they’re actually the people who are really training
11:13
and helping the AI do its thing. And if you can take the plumbing away from them, that is absolutely golden, not just for yourself, but for the world and for these people specifically. That is my talk. Thank you so much for being a wonderful audience. I really should have asked you guys to prepare questions at the beginning, so I apologize, but I’m up here for another minute if you’ve got any questions, and feel free to contact us at EnoTools. Thank you very much.
11:47
I can see the thing myself.