This webinar, hosted by Carahsoft with speakers from Sandia National Laboratories, MLtwist, and Google Cloud, explored how AI data pipelines are being used to support TSA’s mission of improving threat detection at airport checkpoints. Sandia shared the complexity of building high-quality machine learning datasets, emphasizing the importance of data labeling, quality control, and metadata. MLtwist highlighted how its flexible, modular platform helped reduce error rates and accelerate data preparation. Google Cloud outlined its secure, scalable infrastructure supporting these efforts. The key takeaway: effective AI relies not just on models, but on robust, agile, and auditable data pipelines.
Webinar Transcript
Title: Navigating AI Data Complexity within Sandia National Laboratories
Host: Carahsoft Technology
Partners: MLtwist, Google Cloud, Sandia National Laboratories
Introduction
Good afternoon, everyone. Carahsoft welcomes you to our joint webinar with Google Cloud and MLtwist: Navigating AI Data Complexity within Sandia National Laboratories.
Speakers:
David Smith, Founder and CEO of MLtwist
Andrew Cox, R&D Systems Analyst, Sandia National Laboratories
Steven Boesel, Customer Engineer, Google Public Sector
Thanks for joining. Today we’re discussing how Sandia National Labs supports TSA in developing ML algorithms for enhanced threat detection at security checkpoints.
Project Background
TSA’s goal is to quickly and accurately detect threats (e.g., weapons) using X-ray and body scanners. In recent years, TSA has shifted toward an open architecture strategy, seeking a broader set of machine learning providers to speed up innovation and improve detection.
Why the Data Pipeline Matters
Developing effective ML algorithms requires a robust, high-quality data pipeline. Key stages include:
Data collection (controlled environments only)
Annotation of threat objects
Merging metadata (body types, item placement)
Quality control checks
Centralized data distribution to vendors
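The stages above can be sketched as a chain of small, swappable steps. This is a minimal illustrative sketch only; the record fields, stage names, and checks are hypothetical, not Sandia's or MLtwist's actual schema or tooling. The point it demonstrates is the one Andrew makes below: each stage can flag errors, and stages can be reordered or replaced quickly.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical record type: field names are illustrative only.
@dataclass
class ScanRecord:
    scan_id: str
    annotations: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    errors: list = field(default_factory=list)

def annotate(record: ScanRecord) -> ScanRecord:
    # Placeholder annotation step; a real pipeline would call a labeling tool.
    record.annotations.append({"label": "threat", "bbox": [0, 0, 10, 10]})
    return record

def merge_metadata(record: ScanRecord) -> ScanRecord:
    # Merge contextual metadata such as body region or item placement.
    record.metadata.update({"body_region": "torso"})
    return record

def quality_check(record: ScanRecord) -> ScanRecord:
    # Each stage flags errors rather than silently passing bad data along.
    if not record.annotations:
        record.errors.append("missing annotations")
    return record

def run_pipeline(records, stages: list[Callable[[ScanRecord], ScanRecord]]):
    # Stages are plain callables, so the pipeline can be reconfigured rapidly.
    for record in records:
        for stage in stages:
            record = stage(record)
        yield record

clean = [r for r in run_pipeline([ScanRecord("scan-001")],
                                 [annotate, merge_metadata, quality_check])
         if not r.errors]
```

Because each stage is an independent callable, adding a new quality-control check when a new error point is found means inserting one function into the list, not rebuilding the pipeline.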
Challenges and Learnings:
Initial expectations of a simple process were unrealistic.
More than 75 potential error points were identified.
Annotation errors, metadata mismatches, and quality assurance gaps can severely degrade algorithm performance.
Working with MLtwist allowed rapid reconfiguration of pipelines to address new issues, significantly reducing time-to-delivery.
Conclusion: Success comes from treating the data pipeline as a dynamic, modular system with feedback loops. Good infrastructure and data ownership are critical. Sandia’s collaboration with MLtwist enabled quality, agility, and speed, which resulted in better threat detection algorithms.
David Smith – MLtwist
Thank you, Andrew. Let’s focus on the data labeling component. Why does it matter?
Data Quality is Everything
AI depends on high-quality data.
Referencing Dr. Andrew Ng: roughly three times as much merely "okay" data is needed to match the value of high-quality data.
Bad data degrades model performance: even 1% of noisy data can lead to a 1.8% drop in accuracy.
MLtwist’s Pipeline Features
Hybrid human-automation approach
Modular steps: annotation, quality control, versioning, bias detection, audit trails
Tools support for 2D, 3D, and text data
Emphasis on ethical labor practices and compliance (e.g., CCPA, HIPAA, EU AI Act)
Operational Highlights:
TSA’s open architecture and DICOS format allow ecosystem partners to contribute.
MLtwist’s automation speeds up labeling and reduces errors.
Enables a feedback loop in which models improve labeling, and better labels in turn improve the models.
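That feedback loop can be sketched as model-assisted pre-labeling with humans reviewing only low-confidence cases. This is a toy illustration under stated assumptions: the model, threshold, and item names are all hypothetical, and the "human" step is simulated; it is not MLtwist's API.

```python
def dummy_model(item):
    # Toy stand-in for a trained model: confident on "knife" items, unsure otherwise.
    return ("threat", 0.95) if "knife" in item else ("benign", 0.5)

def simulated_human(item):
    # Stand-in for a human reviewer correcting the label.
    return "threat" if "gun" in item else "benign"

def label_batch(items, model, threshold=0.9):
    """Model proposes labels; only low-confidence items go to a human."""
    labeled = []
    for item in items:
        label, confidence = model(item)
        if confidence < threshold:   # route uncertain cases to human review
            label = simulated_human(item)
        labeled.append((item, label))
    return labeled

batch = label_batch(["knife-scan", "gun-scan", "bag-scan"], dummy_model)
```

As the model improves, fewer items fall below the confidence threshold, so human effort concentrates on the hard cases, which is exactly where corrected labels help the next training round most.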
Steven Boesel – Google Cloud
Google Cloud offers the secure, scalable infrastructure powering MLtwist and public sector applications.
Why Google Cloud?
FedRAMP Moderate and High support
Global cloud infrastructure with defense-in-depth architecture
Data ownership: customers control and audit all access
Encryption by default, including confidential computing
Zero-trust security model
What You Can Do with Google Cloud:
Run ML pipelines on secure cloud infrastructure
Use pre-trained or fine-tuned models for vision, translation, prediction, etc.
Vertex AI Studio and managed services simplify ML deployment
Audience Q&A Highlights
Q1: ETL vs ELT in the cloud?
ETL preferred when humans are in the loop; pre-processing is critical.
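A minimal sketch of the ETL shape described in the answer, where transformation and validation happen before anything is loaded. The row fields and cleaning rules here are invented for illustration; the point is only that bad rows are caught (and could be routed to a human) before they reach the destination store, whereas ELT would land the raw rows first.

```python
def extract():
    # Hypothetical raw rows; the second one has a missing label.
    return [{"scan_id": "001", "label": " Threat "},
            {"scan_id": "002", "label": ""}]

def transform(rows):
    # Clean and validate BEFORE loading; invalid rows are held back
    # (in practice, routed to human review) rather than loaded raw.
    cleaned = []
    for row in rows:
        label = row["label"].strip().lower()
        if not label:
            continue
        cleaned.append({"scan_id": row["scan_id"], "label": label})
    return cleaned

def load(rows, store):
    store.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```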
Q2: Why outsource data pipelines?
Cost-effective and scalable; Sandia lacked the resources to build flexible pipelines internally.
Q3: How do you know when you need more data?
Testing identifies blind spots. Metadata (like body regions) helps refine data collection.
Q4: Other Sandia use cases?
Yes. From nuclear projects to wildlife smuggling detection, data pipelines are essential. External vendors could help more if cleared.
Q5: MLtwist vs. open-source?
Open-source is viable but costly to maintain. MLtwist offers a scalable SaaS with built-in tools and support.
Q6: Google Cloud’s support for DOD/FedRAMP?
Full FedRAMP support; dedicated public sector infrastructure.
Closing Remarks
Thanks to all attendees and our speakers. This webinar highlighted the importance of scalable, auditable, and high-quality AI data pipelines in national security and public sector AI applications.