MLtwist Helps Computer Science Researchers Create Language Models for a Low-Resource Language

MLtwist was contracted to provide services to computer science researchers at the Stanford School of Engineering for a research project on language models for Sindhi, a low-resource language. The services included proprietary data processing technology, such as quality control and connectors to third-party annotation tools, as well as the recruitment of two native Sindhi speakers from MLtwist’s partner network, both of whom are credited as co-authors of the paper. The study was recently published by Stanford Engineering’s Natural Language Processing Group.

The research paper, “Universal Dependencies for Sindhi,” examines the need for annotated Sindhi-language data for the Universal Dependencies framework and the development of the first end-to-end AI data annotation pipeline for the Sindhi language.

Sindhi, an Indo-Aryan language closely related to Punjabi, is spoken by about 40 million people in India and Pakistan. Despite being widely spoken, it is a low-resource language, with few existing labeled datasets or pretrained embeddings. The research paper outlines a plan for building the first modern natural language processing infrastructure for Sindhi using the Universal Dependencies framework.

“MLtwist provided the data processing technology used to train the Sindhi language models. Our focus was on accurate, human annotation to support a reliable foundation for NLP in a low-resource language,” said Liana Raggio, Data Orchestration lead for MLtwist.

For this project, the study’s authors analyzed approximately 6,000 Sindhi sentences after a needs analysis phase, then built a dataset for release in versions 2.16 and 2.17 of Universal Dependencies. The MLtwist platform was used to preprocess the data and set it up in both the Datasaur and Kili Technologies data labeling platforms for the assigned labeling team, ensuring the projects were configured consistently. MLtwist’s automated quality control technology assisted the labeling teams in running the revisions required for the human-in-the-loop workflow. When the manual work was complete, the data was automatically pulled and post-processed for the researchers to use in their AI model.

“We’re glad to have contributed to a project that gives Sindhi a place in modern NLP. It’s the kind of work that quietly moves the field forward by creating a push for all languages (including low-resource ones) to be included in the multitude of new AI technologies,” said Audrey Smith, Chief Operating Officer of MLtwist.