How can AI improve multilingualism?
GPT-4, the model behind OpenAI’s ChatGPT, performs well in English but struggles in other languages, scoring just 62% on a question-and-answer test in Telugu. Language models fare better in “high-resource” languages than in “low-resource” ones. Efforts are under way to make AI more multilingual; India’s government, for instance, has launched a chatbot for farmers. Approaches include modifying tokenisers, improving training datasets and tweaking models after training. Challenges such as illiteracy and a preference for voice messages remain, but expanding AI’s language capabilities matters for societies worldwide.
GPT-4’s performance in languages other than English is of markedly lower quality. In a recent test, it scored 85% on a question-and-answer test in English but only 62% in Telugu, an Indian language spoken by almost 100 million people.
This language bias is not unique to one chatbot; it is prevalent among large language models (LLMs). These models are trained primarily on English text scraped from the internet, with only a small fraction of the training data in other languages. GPT-3, the predecessor to GPT-4, had around 93% of its training data in English, leaving languages such as Chinese, Japanese and Telugu underrepresented.
This bias poses a challenge for deploying AI in low-resource languages and in countries with limited training data. The Indian government, for instance, has digitised many public services and aims to enhance them with AI. To that end, it has launched a chatbot that combines language models with machine-translation software to process queries in several native languages. However, routing every query through English, the language LLMs handle best, can strip out the cultural context and worldview embedded in the original language.
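The article describes this chatbot only at a high level. A minimal sketch of such a translate-then-answer pipeline, assuming hypothetical `translate` and `ask_llm` stand-ins rather than any real API, might look like this:

```python
# A minimal sketch of the translate -> LLM -> translate-back pipeline the
# article describes. `translate` and `ask_llm` are hypothetical stand-ins
# for real machine-translation and language-model services.

def translate(text: str, source: str, target: str) -> str:
    # Stand-in for a machine-translation call.
    return f"[{source}->{target}] {text}"

def ask_llm(prompt: str) -> str:
    # Stand-in for a call to an English-centric LLM.
    return f"Answer to: {prompt}"

def answer_query(query: str, user_language: str) -> str:
    # Route the question through English, where the model is strongest...
    english_query = translate(query, source=user_language, target="en")
    english_answer = ask_llm(english_query)
    # ...then translate the answer back. Cultural context carried by the
    # original language can be lost in this round trip.
    return translate(english_answer, source="en", target=user_language)

# Example: a farmer asks, in Hindi, which fertiliser suits a wheat crop.
print(answer_query("मेरी गेहूं की फसल के लिए कौन सा उर्वरक सही है?", "hi"))
```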
To address these challenges, researchers are exploring several strategies. One involves modifying the tokeniser, which splits text into smaller units (tokens) for the model to process; optimising the tokeniser for scripts like Devanagari (used to write Hindi) can significantly reduce computation costs, because each sentence needs fewer tokens (see the sketch below). Another approach focuses on improving the training datasets themselves, which may involve digitising pen-and-paper texts. For example, the Arabic-speaking model “Jais” was trained on Arabic and English data and performs on a par with models like GPT-3 despite having far fewer parameters.
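To see the cost gap in practice, one can compare token counts directly. This sketch uses OpenAI’s open-source `tiktoken` library and its `cl100k_base` encoding; the article itself does not name a specific tokeniser:

```python
# A minimal sketch of how an English-centric tokeniser fragments text in
# different scripts. Assumes OpenAI's open-source "tiktoken" library
# (pip install tiktoken); the article does not name a tokeniser.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

english = "How is the weather today?"
hindi = "आज मौसम कैसा है?"  # the same question written in Devanagari

for label, text in [("English", english), ("Hindi", hindi)]:
    tokens = enc.encode(text)
    # More tokens for the same meaning means more computation per query.
    print(f"{label}: {len(tokens)} tokens for {len(text)} characters")
```

Because tokenisers trained mostly on English fall back to byte-level pieces for unfamiliar scripts, the Devanagari sentence typically consumes several times as many tokens as its English equivalent.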
In addition, manually tweaking models after training has shown promise. Incorporating handcrafted question-and-answer pairs into models like Jais and OpenHathi (a Devanagari-optimised LLM) has improved their performance, as sketched below. However, illiteracy and a preference for voice messages over text-based communication make it harder for many users to provide written feedback on AI responses.
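Handcrafted pairs of this kind are typically assembled into an instruction-tuning dataset. The JSONL layout below is a common convention, not a description of how Jais or OpenHathi were actually tuned:

```python
# A minimal sketch of preparing handcrafted question-and-answer pairs for
# supervised fine-tuning. The JSONL "messages" format is a common
# convention; the article does not specify Jais's or OpenHathi's pipeline.
import json

qa_pairs = [
    {"question": "What is the capital of India?",
     "answer": "The capital of India is New Delhi."},
    # ...more pairs, ideally written by native speakers of the target language
]

with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for pair in qa_pairs:
        record = {"messages": [
            {"role": "user", "content": pair["question"]},
            {"role": "assistant", "content": pair["answer"]},
        ]}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```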
While local language models are being developed, there is a risk that larger tech companies from Silicon Valley will overshadow these efforts. Nevertheless, GPT-4 is an improvement over GPT-3 in non-English languages, and extending AI’s language capabilities towards the world’s roughly 7,000 languages is a crucial goal that will require ongoing research, development and collaboration.
Source: The Economist