MLtwist selected to support Stanford computer science research

MLtwist provided third-party annotation services, including AI data pipeline technology, for a research study recently published by the Stanford School of Engineering’s Natural Language Processing Group.

The research paper, “Do ‘English’ Named Entity Recognizers work well on Global Englishes?”, examines the need for content recognition models that are calibrated to the linguistic origins of the speaker who produced the content. The findings show that variations produced by highly proficient but non-native speakers of a language, even in concepts as basic as name structure (as cited in the paper), are significant enough to affect the performance of content recognition systems trained on native-speaker content.

“The amount of published content is growing exponentially, and it is increasingly important to understand not just the content of the data, but also the origin of the data,” said Audrey Smith, Chief Operating Officer of MLtwist.

The MLtwist platform was used to preprocess the data and set it up in the Datasaur data labeling tool for the assigned labeling team, ensuring the projects were configured optimally and consistently. MLtwist’s automated quality control technology assisted the labeling teams in running the revisions required for human-in-the-loop review. When the manual work was complete, the data was automatically pulled and post-processed for the researchers to use in their AI model.
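The workflow described above can be sketched as a simple three-stage pipeline. This is an illustrative sketch only; the function names, record fields, and the missing-label quality check are assumptions for the example, not MLtwist's actual API.

```python
# Hypothetical sketch of the pipeline: preprocess -> label -> QC -> post-process.
# All names and the QC rule (flagging unlabeled records) are illustrative.

def preprocess(raw_records):
    """Normalize raw text into a consistent shape for the labeling tool."""
    return [{"text": r.strip(), "label": None} for r in raw_records if r.strip()]

def quality_check(labeled):
    """Flag records that still need human-in-the-loop revision (here: no label)."""
    return [r for r in labeled if r["label"] is None]

def postprocess(labeled):
    """Pull finished records into the format the researchers' model consumes."""
    return [(r["text"], r["label"]) for r in labeled if r["label"] is not None]

records = preprocess(["  Alice visited Lagos.  ", "", "Priya works at Infosys."])
records[0]["label"] = "PERSON"        # simulated annotator output
needs_revision = quality_check(records)  # second record is still unlabeled
final = postprocess(records)
print(len(records), len(needs_revision), len(final))  # 2 1 1
```

The point of the automation is that only the middle (labeling and revision) step involves humans; everything before and after runs without manual setup.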

MLtwist’s Data Card functionality was also used by the researchers to audit compliance on both the origin and the transit of the data. The Data Card documents which companies and technologies are used to transform the data, and how those entities comply with both ethical and security requirements. “The Data Card is a powerful concept which follows in the style of AWS’s AI Service Cards published by the Amazon team. As AI legislation becomes more prevalent, we expect Data Cards to be a common requirement for all AI data pipelines in the future,” said Audrey Smith.
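Conceptually, a Data Card is a machine-readable record of the data's origin and every transformation step in transit. The schema below is a guess at what such a record might capture; the field names and the compliance values are placeholders, not MLtwist's actual format.

```python
# Illustrative Data Card sketch. Field names and compliance entries are
# placeholders invented for this example, not a real schema.
import json

data_card = {
    "dataset": "example-ner-study-corpus",        # placeholder name
    "origin": "non-native English research text",  # where the data came from
    "transit": [                                   # who touched it, in order
        {"step": "preprocessing", "vendor": "MLtwist",  "compliance": ["example-policy"]},
        {"step": "labeling",      "vendor": "Datasaur", "compliance": ["example-policy"]},
    ],
}

# Serializing the card makes it auditable and diffable alongside the data.
print(json.dumps(data_card, indent=2))
```

An auditor can then check each `transit` entry against the project's ethical and security requirements without re-tracing the pipeline by hand.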
