AI and multilingualism

AI can help preserve multilingualism in the digital sphere through techniques such as natural language processing (NLP), speech recognition, and machine translation. Creating AI models for more languages is essential to bridge the language gap and ensure that speakers of non-English languages are not left behind.

How AI can foster multilingualism

AI techniques such as NLP can be used to analyse text written in different languages. By processing and analysing multilingual data, AI helps to improve knowledge of different cultures and language patterns. AI can also make it easier for non-native speakers to use the internet by offering real-time language translation services. Text, audio, and video content may all be translated using language processing models, making online information more accessible to users who might not understand the original language.

AI can also help preserve endangered languages and dialects by building digital archives and linguistic databases. For example, Iceland has partnered with OpenAI to use GPT-4 to preserve the Icelandic language. The collaboration has a two-fold purpose – to help GPT-4 better serve a new region of the world while also laying the groundwork for developing resources that could help preserve other low-resource languages. Other countries have expressed interest in using this model to preserve minority languages. 

Challenges and limitations

While ChatGPT is available in over 100 languages, it is trained primarily on a large corpus of English-language data. This is a pervasive challenge, given the predominance of English in online content. Because training datasets in other languages are limited or missing, AI development tends to prioritise widely spoken languages, which hinders the potential of AI to foster linguistic diversity and might even widen linguistic divides.

Other challenges include quality issues in existing multilingual resources. AI models also struggle with accurately capturing the nuances, cultural context, and idiomatic expressions of different languages. AI-generated outputs in less-used languages can occasionally be unreliable, deceptive, or lacking in the nuance of human language.

If the Internet is to be used by everyone, content needs to be accessible in more languages. In this sense, multilingualism is an important aspect of the promotion and development of cultural diversity on the Internet as well as digital inclusion.

A report released by the UN Broadband Commission revealed that only about 5% of the world’s estimated 7,100 languages are represented on the Internet. It also noted that the use of the Latin script remains a challenge for many Internet users, in particular for reading domain names.

Multilingualism is strongly related to local content. More languages on the Internet means that more locally relevant content is being made available. If online content is provided in local languages (by governments, companies, etc.), this provides people with the incentive to go online, as ‘users’ of content. At the same time, allowing people to express themselves online in their own languages encourages them to become generators of content. As such, the availability of local content can contribute to both making the Internet more inclusive and bridging the digital divide through its potential to attract more people online, as both users and generators of content.

Languages IRL and online

The Financial Times article ‘The problem with English’ points out that ‘Foreign countries are opaque to mostly monolingual Britons and Americans. Foreigners know us much better than we know them’, and suggests that this language asymmetry likely hurts English-speaking countries in areas such as communications, access to information in other languages, and even the fight against cybercrime and hacking.

The promotion of multilingualism requires technical standards that facilitate the use of non-Latin alphabets. One of the early initiatives related to the multilingual use of computers was undertaken by the Unicode Consortium – a non-profit institution that develops standards to facilitate the use of character sets for different languages.
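To illustrate what such a character standard provides, the following minimal Python sketch prints the Unicode code point, the official character name, and the UTF-8 byte sequence for characters from several scripts. The helper name describe is an assumption made for this illustration; it uses only Python's standard unicodedata module.

```python
import unicodedata


def describe(char: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, UTF-8 bytes as hex) for one character.

    'describe' is a hypothetical helper written for this illustration.
    """
    code_point = f"U+{ord(char):04X}"
    # Fall back to a placeholder if the character has no assigned name.
    name = unicodedata.name(char, "<unnamed>")
    utf8_hex = char.encode("utf-8").hex()
    return code_point, name, utf8_hex


if __name__ == "__main__":
    # One Latin, one Cyrillic, one Chinese, and one Arabic character.
    for ch in ["a", "ф", "中", "ع"]:
        print(ch, *describe(ch))
```

Every character, whatever its script, receives a unique code point, which is what allows software to store and exchange text in different languages in a consistent way.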

As part of their efforts in this regard, the Internet Corporation for Assigned Names and Numbers (ICANN) and the Internet Engineering Task Force (IETF) took important steps in promoting Internationalised Domain Names (IDNs). IETF’s work has resulted in documents such as the Request for Comments (RFC) 5890: Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework and RFC 5891: Internationalized Domain Names in Applications (IDNA): Protocol. ICANN has launched the IDN Program to ‘assist in the development and promotion of a multilingual Internet using IDNs’. IDNs facilitate the use of domain names written in non-Latin scripts such as Chinese, Arabic, Cyrillic, and others. IDNs have been introduced in several countries and territories as equivalents to their Latin country code top-level domains (ccTLDs). For example, in China, .中国 has been introduced in addition to .cn, while in Russia, .рф has been introduced in addition to .ru. IDNs are also part of ICANN’s New gTLD Program, which allows for the registration of new generic top-level domains (gTLDs) in scripts other than Latin; for example, .сайт (website) and .онлайн (online) are among the new top-level domains available to the public.
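Behind the scenes, the domain name system still carries ASCII only, so an IDN is converted into an ASCII-compatible form beginning with xn-- before it is looked up. The following Python sketch shows that conversion for the two country code examples above. It is an illustration under stated assumptions: the function names to_ascii_form and to_unicode_form are invented for this example, and Python's built-in idna codec implements the older IDNA 2003 rules rather than the IDNA 2008 framework of RFC 5890 and RFC 5891, so results may differ for some labels.

```python
def to_ascii_form(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII-compatible form.

    Hypothetical helper; relies on Python's built-in 'idna' codec,
    which follows IDNA 2003 (not the IDNA 2008 RFCs cited in the text).
    """
    return domain.encode("idna").decode("ascii")


def to_unicode_form(domain: str) -> str:
    """Convert an ASCII-compatible domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    for label in ["рф", "中国", "example.com"]:
        ascii_form = to_ascii_form(label)
        print(label, "->", ascii_form, "->", to_unicode_form(ascii_form))
```

Names that are already plain ASCII pass through unchanged, which is why IDNs can coexist with existing domain names.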

IDNs thus contribute to making the Internet more inclusive, as the possibility of accessing and registering domain names in more languages and scripts empowers more people to use the Internet. It has been said numerous times that domain names are not only about addressing and naming, but also about content. Therefore, they are relevant for local communities and have the potential to encourage both the use and the development of local content in local languages and scripts. However, technical challenges remain; for example, when it comes to the acceptance of ‘fully internationalised’ e-mail addresses – particularly by mail server infrastructures – and to the recognition of IDNs by search engines. In addition to addressing these technical challenges, more work is also needed to raise awareness about IDNs and the possibilities they offer.
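One reason e-mail is harder than web addresses can be sketched in a few lines of Python. The domain part of an address can be converted to an ASCII-compatible form, but the part before the @ sign cannot, so a non-ASCII local part only works if every mail server along the way supports the SMTPUTF8 extension (RFC 6531). The function name analyse_address and the sample addresses are assumptions made for this illustration, and Python's built-in idna codec follows the older IDNA 2003 rules.

```python
def analyse_address(address: str) -> dict:
    """Split an e-mail address and report what it needs from mail servers.

    Hypothetical helper for illustration; not a full address validator.
    """
    local, _, domain = address.rpartition("@")
    if not local or not domain:
        raise ValueError(f"not an e-mail address: {address!r}")
    return {
        "local_part": local,
        # The domain can always be expressed in ASCII-compatible form.
        "ascii_domain": domain.encode("idna").decode("ascii"),
        # A non-ASCII local part has no ASCII fallback, so the
        # receiving and relaying servers must support SMTPUTF8.
        "needs_smtputf8": not local.isascii(),
    }


if __name__ == "__main__":
    for addr in ["info@example.com", "info@пример.рф", "почта@пример.рф"]:
        print(addr, analyse_address(addr))
```

In this sketch, only the third address depends on SMTPUTF8 support, which is the kind of dependency that makes universal acceptance a matter of upgrading mail infrastructure rather than registering more names.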

Another key promoter of multilingualism is the EU, since it embodies multilingualism as one of its basic political and working principles, enshrined in the EU Charter (article 22) and in the Treaty on European Union (article 3(3) TEU). Given its policy of translating all official activities into the languages of all member states, the EU has supported various development activities in the field of machine translation such as Matecat and QT21. It has also supported other language technologies such as speech recognition and data analytics.

The promotion of multilingualism requires appropriate governance frameworks. The initial elements of such frameworks have been provided by organisations such as the United Nations Educational, Scientific, and Cultural Organisation (UNESCO), which has launched many initiatives focusing on multilingualism, including the adoption of important documents such as the Universal Declaration on Cultural Diversity and the Recommendation concerning the Promotion and Use of Multilingualism and Universal Access to Cyberspace (2003). UNESCO supports the inclusion of new languages in the digital world and the creation and dissemination of content in local languages on the Internet and mass communication channels, and encourages multilingual access to digital resources in cyberspace.

Although major breakthroughs have been made, limitations remain. In the case of IDNs, for example, universal acceptance is still a challenge, particularly when it comes to issues such as functional IDN e-mails and the recognition of IDNs by search engines. The evolution and wide usage of Web 2.0 tools (interfaces that allow ordinary users to become contributors and content developers with minimal technical knowledge) offer an opportunity for increased local content in a wide variety of languages. Nevertheless, without a wider framework for the promotion of multilingualism, this opportunity might end up creating an even wider gap, since users feel pressure to use the common language or a few main world languages in order to reach a broader audience.