How MLtwist Supported Computer Science Research for Creating New Language Models

 

The Use Case

 

AI is opening up powerful possibilities for economic and scientific development across the planet, but access to high-performing AI models depends on available language resources. 

 

Sindhi is an Indo-Aryan language, similar to Punjabi, spoken by 40 million people in South Asia. Despite its significant number of speakers, only a limited number of datasets are available for AI training.

 

A research group from a leading American university recently published a study on language models for Sindhi in the context of the Universal Dependencies framework. The university contracted with MLtwist to provide services for their research, including proprietary data processing technology and recruitment of two native Sindhi speakers from MLtwist’s Partner Network.

 

The Challenge

 

Developing Data Pipelines for a Low-Resource Language


Creating annotated Sindhi-language data resources, and developing the first Sindhi-language data annotation pipeline, meant overcoming several real-world challenges:

  • Lack of existing datasets: Sindhi had few labeled datasets or pre-trained embeddings available, making it difficult to build accurate NLP models.
  • Complex data annotation setup: The project required end-to-end setup across multiple annotation platforms, demanding careful orchestration to ensure consistency and quality.
  • Recruiting qualified native speakers: Finding and managing qualified Sindhi-speaking annotators was essential for high-quality human-in-the-loop annotation in a low-resource context.

 

MLtwist’s Solution: Data Processing Support and Services

 

MLtwist provided end-to-end data processing support, including automated quality control, setup across third-party annotation platforms (Datasaur and Kili Technologies), and the recruitment of native Sindhi speakers from its Partner Network. Together, these services supported accurate human annotation and seamless post-processing for AI model training.

  • Data Processing Support: MLtwist prepared and preprocessed datasets for annotation, ensuring a consistent and efficient workflow.
  • Automated Quality Control: AI-powered revision checks assisted human annotators, reducing errors and ensuring accurate labeling.
  • Recruitment of Native Sindhi Speakers: MLtwist leveraged its Partner Network of experts in 60+ languages and technologies to recruit two native Sindhi speakers, both cited as study co-authors.
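
To give a sense of what an automated check on Universal Dependencies annotations can look like, here is a minimal Python sketch. It is illustrative only and does not describe MLtwist's actual tooling, which this case study does not detail. The function name check_sentence and the placeholder tokens w1 and w2 are assumptions made for the example. The sketch relies only on the public CoNLL-U conventions: ten tab-separated columns per token, a fixed set of 17 universal part-of-speech tags, and exactly one root per sentence.

```python
# Illustrative sketch only: a basic structural check for one sentence in
# CoNLL-U format (the file format used by Universal Dependencies).
# The function name and the sample tokens are hypothetical.

# The 17 universal part-of-speech tags defined by Universal Dependencies.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def check_sentence(lines):
    """Return a list of human-readable problems found in one sentence."""
    problems = []
    rows = []
    for line in lines:
        # Skip blank lines and sentence-level comments such as "# sent_id".
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            problems.append(f"expected 10 columns, got {len(cols)}: {line!r}")
            continue
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        rows.append(cols)

    roots = 0
    for cols in rows:
        tok_id, upos, head = cols[0], cols[3], cols[6]
        if upos not in UPOS_TAGS:
            problems.append(f"token {tok_id}: unknown UPOS {upos!r}")
        # HEAD must be 0 (root) or the ID of another token in the sentence.
        if not head.isdecimal() or int(head) > len(rows):
            problems.append(f"token {tok_id}: HEAD {head!r} out of range")
        elif int(head) == 0:
            roots += 1

    if rows and roots != 1:
        problems.append(f"expected exactly 1 root, found {roots}")
    return problems


# Placeholder tokens stand in for real annotated text.
good = [
    "# sent_id = demo",
    "1\tw1\tw1\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tw2\tw2\tVERB\t_\t_\t0\troot\t_\t_",
]
for problem in check_sentence(good):
    print(problem)
```

A check like this can flag structural slips for a human annotator to review; it does not judge whether the linguistic analysis itself is correct, which is where native speakers remain essential.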

 

Impact & Benefits

  • Reliable Annotations for Research: The research team received high-quality labeled data optimized for their study.
  • Streamlined Human-in-the-Loop Process: Automated revisions reduced annotation time while maintaining accuracy.
  • Native Speaker Sourcing: Integrating native Sindhi speakers into the annotation process resulted in more linguistically and culturally accurate data.

 

The Takeaway

 

MLtwist’s infrastructure helped streamline a complex data annotation workflow, reducing friction in handling low-resource language data and ensuring consistent quality for research use. The university group’s research was successfully published, with Sindhi data included in Universal Dependencies versions 2.16 and 2.17, marking a significant milestone for linguistic inclusivity in AI.