31 Aug
AI is opening up powerful possibilities for economic and scientific development across the planet, but access to high-performing AI models depends on available language resources.
Sindhi is an Indo-Aryan language, related to Punjabi, spoken by 40 million people in South Asia. Despite this large number of speakers, only a limited number of datasets are available for AI training.
A research group from a leading American university recently published a study on language models for Sindhi in the context of the Universal Dependencies framework. The university contracted with MLtwist to provide services for their research, including proprietary data processing technology and recruitment of two native Sindhi speakers from MLtwist’s Partner Network.
Meeting the need for annotated Sindhi-language data resources, and developing the first Sindhi-language data annotation pipeline, posed several real-world challenges.
MLtwist provided end-to-end data processing support, including automated quality control, setup across third-party annotation platforms (Datasaur and Kili Technologies), and the recruitment of native Sindhi speakers from its Partner Network. Together, these supported accurate human annotation and smooth post-processing for AI model training.
MLtwist’s infrastructure helped streamline a complex data annotation workflow, reducing friction in handling low-resource language data and ensuring consistent quality for research use. The university group successfully published its research, and the Sindhi data was included in Universal Dependencies versions 2.16 and 2.17, a significant milestone for linguistic inclusivity in AI.