Last September, MLtwist participated in the AI Infrastructure Alliance’s community event on the theme of LLMs and generative AI.
MLtwist CEO David Smith spoke about the potential of large language models to replace the need for traditional coding skills in creating data pipelines.
LLMs offer a novel alternative to hand-coded data pipelines by generating complex code from simple text prompts, enabling non-technical users to build AI data pipelines independently. The traditional approach to designing pipeline workflows often requires knowledge of programming languages, which can be challenging and discouraging for non-engineers. In this talk, David explores this concept and walks through the future of an LLM-powered platform that empowers users to construct intricate data pipelines without developer involvement.
Transcript
Hi, welcome. My name’s David Smith, and we’re going to talk about generative AI data pipelines, the future of data orchestration. Our company is MLtwist, and we’re incredibly excited about what is to come in the future for data orchestration.
A quick agenda: we’re going to go through a problem and solution overview, a little refresher on LLMs, a little refresher on data pipelines, and then we’re going to talk about generative data pipelines.
In terms of myself, at MLtwist I’m the CEO. Before that, I’ve gotten to work with companies like Google and Oracle, and I’ve spent 15 years focused on data for machine learning.
One of the biggest challenges that I have faced in my career, and a lot of the industry faces, is the orchestration or organization of data flow. According to a report by AI infrastructure.com, curating, cleaning, and transforming data is still the biggest blocker when it comes to productionizing AI models.
These data pipelines that are built are typically complex, require a lot of maintenance, and they’re costly. The reality is, things need to be easier, especially when it comes to the flow of AI data.
And we’re going to talk about how companies can leverage LLMs to make this happen.
At the moment, AI data pipelines are going through major changes. There has been a lot of investment in different pieces of the AI data ecosystem, from acquiring data to storing data to leveraging AI to help you work on your data, to finally building your own AI models with ML operations.
The flow between these pieces is incredibly important, and one of the parts that many data scientists and AI experts spend a lot of time working on.
A quick refresher on LLMs: LLMs come in a lot of flavors. Some are the LLMs that we’re very familiar with that are provided through cloud services, like GPT-4 and PaLM 2, available on platforms like Azure and GCP.
These LLMs generally cannot be tuned outside of special relationships. Proprietary LLMs, like the one behind ChatGPT, are GPT-4 models that OpenAI was able to work on through a special partnership, and they also cannot be tuned.
Finally, open source LLMs, like those from Stability AI, are LLMs that can be tuned, and you can actually deploy these now. Over time, LLMs are going to change. So, as of this recording, if you want to build your own LLM and own it outright, you’re likely to go with an open source model.
However, the clouds out there and other companies are doing a great job in trying to provide more flexible ownership to companies that want to build their own LLMs, while still retaining the ease for companies that don’t want to build their own LLMs and simply want to leverage them.
How AI data pipelines are different from traditional data pipelines varies. When we think of standard ETL (extract, transform, load) for data pipelines, we don’t typically think of concepts like the quality of data.
What we mean by quality is not whether or not the data made it from one side to the other. In traditional ETL, if you extract data from, let’s say, a SQL database and move it over to Snowflake, you want to make sure that the record arrived intact, as is.
When we talk about data quality and AI, we’re talking about something different. We’re talking about the accuracy of what the record is described as to the model. This concept of data quality is a little different from traditional ETL.
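To make the distinction concrete, here is a minimal sketch that contrasts the two notions of quality. The function names and the sample records are illustrative assumptions, not something shown in the talk: a checksum comparison stands in for the traditional ETL check, and agreement with a reviewer's labels stands in for data quality in the AI sense.

```python
import hashlib


def transfer_intact(source: bytes, destination: bytes) -> bool:
    """Traditional ETL check: did the record arrive byte-for-byte unchanged?"""
    return hashlib.sha256(source).digest() == hashlib.sha256(destination).digest()


def label_accuracy(labels: list[str], reviewed: list[str]) -> float:
    """AI data check: what fraction of labels agree with a reviewer's labels?"""
    if len(labels) != len(reviewed):
        raise ValueError("labels and reviewed must have the same length")
    if not labels:
        return 0.0
    matches = sum(a == b for a, b in zip(labels, reviewed))
    return matches / len(labels)


# A record can arrive perfectly intact and still describe the data wrongly.
record = b'{"image": "img_001.png", "label": "cat"}'
print(transfer_intact(record, bytes(record)))  # True
print(label_accuracy(["cat", "dog", "cat", "bird"],
                     ["cat", "dog", "dog", "bird"]))  # 0.75
```

The first check can pass for every record while the second one still fails, which is why AI pipelines need a review step that plain ETL does not.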
The other piece that is different from traditional ETL is the concept of acyclic versus cyclic. In ETL, the term DAG, which stands for directed acyclic graph, is used often. The challenge with AI data is that, because of the quality component, you often have cyclic requirements.
If we were to take a data pipeline for AI, you would typically have some sort of repository. And you have to use the extract piece to get that data and batch it out because, remember, a lot of times data needs to go through quality control, which is often a human process, even if it’s a part of a process.
Batching also matters because sometimes humans need to actually work on the data in different tools, like labeling tools. Those batches can be pushed through what we call zero-shot, in other words, AI taking a first pass at imitating the job that a human would try to do on that data.
Companies can then gate several batches, but typically they want to inspect a percentage of that data to make sure that the data is good before it gets fed into their model because, as we all know, if the data going into the model is not good, that is the fastest way to crash a model.
So, in that manual review process, which can be accomplished through several tools that were described in that last ecosystem slide, there’s often a requirement to rework on this data, hence we start the cyclical nature of ETL for AI.
Once the data hits the quality threshold, or quality acceptance from the company, the gate is lifted and the data is pushed into ML operations.
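The cycle described above (zero-shot pass, sampled inspection, rework, gate) can be sketched roughly as follows. Everything here is an assumption made for illustration: the function names, the fixed-stride sampling, the 0.95 threshold, and the toy odd/even labeling task that stands in for a real model and a real human reviewer.

```python
from typing import Callable

Item = dict


def run_gated_batch(
    batch: list[int],
    zero_shot: Callable[[int], Item],
    review: Callable[[Item], bool],
    rework: Callable[[Item], Item],
    threshold: float = 0.95,
    sample_every: int = 2,
    max_rounds: int = 5,
) -> tuple[list[Item], int]:
    """Label a batch, inspect a sample, and loop through rework until the gate opens."""
    if not batch:
        return [], 0
    labeled = [zero_shot(value) for value in batch]  # AI takes the first pass
    for round_number in range(1, max_rounds + 1):
        sample = labeled[::sample_every]  # inspect only a share of the batch
        quality = sum(review(item) for item in sample) / len(sample)
        if quality >= threshold:
            return labeled, round_number  # gate lifted, ready for ML operations
        labeled = [rework(item) for item in labeled]  # the cyclic part
    raise RuntimeError("batch never reached the quality threshold")


def expected_label(item: Item) -> str:
    return "even" if int(item["value"]) % 2 == 0 else "odd"


# Toy stand-ins: a weak zero-shot model, a reviewer, and a rework step.
weak_model = lambda value: {"value": value, "label": "even"}
reviewer = lambda item: item["label"] == expected_label(item)
fixer = lambda item: {**item, "label": expected_label(item)}

labeled, rounds = run_gated_batch([1, 2, 3, 4], weak_model, reviewer, fixer)
print(rounds)  # 2
```

The loop back from review to rework is what makes the flow cyclic rather than a DAG; the `max_rounds` cap is a design choice to keep a bad batch from cycling forever.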
One other thing to call out in ETL is when we talk about pushing data to manual review, we are not talking about just dropping an image into a bucket. You’re actually talking about leveraging APIs to do the setup so that people can go in and validate that data themselves.
That often requires either going through the UI or if you’re leveraging the API, automating that full setup, which can often involve 20 to 40 different parameters and sometimes needs to actually be changed from pipeline to pipeline.
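One way to picture that setup burden is a configuration object with shared defaults and per-pipeline overrides. This is only a sketch: the class and every field name below are hypothetical and do not correspond to any particular review tool's API, and a real setup would carry many more parameters.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ReviewTaskConfig:
    """Hypothetical subset of the parameters a manual review setup might need."""
    project_name: str
    label_classes: tuple[str, ...]
    sample_rate: float = 0.1        # share of each batch sent to reviewers
    reviewers_per_item: int = 1
    allow_skip: bool = False


# Shared defaults, then a per-pipeline variant that changes only what differs.
base = ReviewTaskConfig(project_name="base", label_classes=("cat", "dog"))
vehicle = replace(
    base,
    project_name="vehicle-damage",
    label_classes=("dent", "scratch"),
    reviewers_per_item=2,
)

# The dictionary is what an automated setup call would send to the tool.
payload = asdict(vehicle)
print(payload["project_name"], payload["reviewers_per_item"])
```

Keeping the parameters in one typed object makes it easier to see what changes from pipeline to pipeline, which is the part that tends to be maintained by hand.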
Wouldn’t it be great if LLMs were able to just take what you needed to do in terms of data flow and create the code that you could use? The reality is LLMs are starting to get there, and you can describe what you would like to do to data and potentially describe the tools that you would like that data to be pushed into.
However, today the reality is people are not able to effectively use LLMs to generate code, and even if they could, you would still be required to host that code in some sort of solution in the cloud and run the automation and maintain that automation.
We’re going to talk a little bit about how LLMs are going to be used and can be used in the future to accomplish something that looks very much like this and deliver something that really allows someone without coding knowledge or data science expertise to go in and describe how they would like to work on a data set.
The democratization of AI means that people are now being pulled in from all sorts of different disciplines to work on data: lawyers, doctors, mechanics, machine productions people, people in the military, all sorts of people are getting involved in data now.
And the data flow is one of the biggest restrictions for a lot of these disciplines because it requires so much knowledge of coding. Even low-code solutions require this.
In the future, LLMs are going to help those people describe what they need to do on a data set, and LLMs are going to assist them in visualizing that data so that they can work on it themselves and deliver that data to where it needs to go.
As a whole, this solution looks a little bit like a puzzle with a lot of different pieces. Ideally, an LLM would be able to create all these pieces itself. In reality, however, it often has challenges because even one of these pieces is complex, let alone all of them.
To describe this puzzle a little bit, you have the concept of connectors, which are the pieces that grab data and push data, typically leveraging API integrations. You have components, which are the pieces that transform the data in some way so that the data is ready to be pushed to the next stop.
And then finally, you have those tools that are incredibly important and help people work on their data or apply artificial intelligence to their data. Pipelines tend to be a combination of all of these pieces.
Where LLMs get involved is they’re able to invoke or leverage modular pipeline infrastructure, a foundation of connectors and components, and help to manage swappable components and connectors to deliver orchestration support.
In order to achieve this, the first attempt is going to be actually building those components ourselves instead of relying on LLMs to build all the components. The components are going to be generated by companies or by a technology which will allow LLMs to pull from those components and understand for itself when those components should be used.
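A minimal sketch of that idea is a registry in which each pre-built component carries a plain-language description, so a catalog can be handed to an LLM as context. The registry, the decorator, and the catalog format are all assumptions for illustration; no actual LLM call is made here.

```python
from typing import Callable

# name -> (description an LLM can read, the pre-built function itself)
REGISTRY: dict[str, tuple[str, Callable]] = {}


def register(name: str, description: str) -> Callable:
    """Decorator that adds a pre-built component to the registry."""
    def decorator(func: Callable) -> Callable:
        REGISTRY[name] = (description, func)
        return func
    return decorator


@register("lowercase_text", "Convert the text field to lowercase.")
def lowercase_text(records: list[dict]) -> list[dict]:
    return [{**r, "text": str(r["text"]).lower()} for r in records]


@register("drop_empty", "Remove records whose text is blank.")
def drop_empty(records: list[dict]) -> list[dict]:
    return [r for r in records if str(r["text"]).strip()]


def describe_catalog() -> str:
    """Render the registry as text that could be placed in an LLM prompt."""
    return "\n".join(f"- {name}: {REGISTRY[name][0]}" for name in sorted(REGISTRY))


print(describe_catalog())
```

The LLM only has to choose among named, tested pieces based on their descriptions, instead of writing each component from scratch.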
After that, you have the concept of orchestration. How should the pieces be used in concert? What does the actual order look like? These are places where LLMs can get involved without having to write code and instead use logic that is gathered from API documentation, product documentation, the description that the user gives the LLM, and access to the dataset so that it is able to receive those files and extract the parameters from those files to then start to piece together a flow that makes sense.
And finally, you have LLM generative AI data pipelines, where the ultimate task for an LLM will be to leverage the pre-built components, the pre-built connectors, the orchestration knowledge, and bridge gaps when they are encountered in order to deliver a fully executable end-to-end pipeline for people to enable data flow into the tools and for them to do what they need to do on that data and out of those tools as well.
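To illustrate the last step, here is a small sketch in which an ordered plan, of the kind an LLM might return as JSON, is checked against the pre-built pieces before anything runs. The plan format, the component names, and the gap check are assumptions for illustration; the JSON string below is a hand-written stand-in for model output.

```python
import json

# Pre-built components the plan is allowed to reference (illustrative only).
COMPONENTS = {
    "strip_whitespace": lambda records: [r.strip() for r in records],
    "drop_empty": lambda records: [r for r in records if r],
    "uppercase": lambda records: [r.upper() for r in records],
}


def find_gaps(plan: list[str]) -> list[str]:
    """Return the steps that no pre-built component covers."""
    return [step for step in plan if step not in COMPONENTS]


def run_plan(plan_json: str, records: list[str]) -> list[str]:
    """Validate an ordered plan, then execute it step by step."""
    plan = json.loads(plan_json)
    if not isinstance(plan, list) or not all(isinstance(s, str) for s in plan):
        raise ValueError("plan must be a JSON list of component names")
    gaps = find_gaps(plan)
    if gaps:
        # A fuller system would ask the LLM to bridge these gaps.
        raise LookupError(f"no pre-built component for: {gaps}")
    for step in plan:
        records = COMPONENTS[step](records)
    return records


plan_json = '["strip_whitespace", "drop_empty"]'  # stand-in for LLM output
print(run_plan(plan_json, ["  cat ", "   ", "dog"]))  # ['cat', 'dog']
print(find_gaps(["strip_whitespace", "translate"]))   # ['translate']
```

Validating the plan before execution keeps the LLM's role to ordering known pieces, and makes any gap it would need to bridge explicit rather than silent.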
A side effect of achieving this technology is you now have clean, composable pipelines that can also be used for general purposes: concepts like extracting data from publicly listed databases to aid researchers in identifying material properties, or transferring vehicle sales data from dealerships to OEM central databases.
MLtwist is heavily engaged in this space and working on these types of solutions with some amazing companies. We’re really pleased to have been able to talk about generative AI data pipelines. If you’re interested, please reach out. Thank you so much for your time.