2023-11-02

“India is the first non-Western country we are doing this in, and we are testing Bard in nine Indian languages,” said Manish Gupta, head of Google Research in India, referring to the company’s AI chatbot. “Over 70 Indian languages spoken by over a million people each had zero digital corpus. The problem is so stark.”

Gupta ticked off a list of issues that AI firms need to address to serve India’s internet users: non-English datasets are of dismally low quality; hardly any conversational data exists in Hindi and other Indian languages; and digitized content from books and newspapers in Indian languages is very limited.

When used for South Asian languages, some large language models have been found to make up words and struggle with basic grammar. There are also concerns that these AI services may reflect a skewed view of other cultures. It’s critical to have broad representation of training data, including non-English data, so AI systems “don’t perpetuate harmful stereotypes, produce hate speech, nor yield misinformation,” said Mehran Sahami, a professor in the computer science department at Stanford University.