Large Language Models on the Web: Anticipating the challenge | IGF 2023 WS #217

12 Oct 2023 01:30h - 03:00h UTC

Event report

Speakers and Moderators

Speakers:
  • Santana Vagner, Private Sector, Western European and Others Group (WEOG)
  • Yuki Arase, Civil Society, Asia-Pacific Group
  • Barbara Leporini, Civil Society, Western European and Others Group (WEOG)
  • Emily Bender, Civil Society, Western European and Others Group (WEOG)
  • Dominique Hazaël-Massieux, Technical Community, Western European and Others Group (WEOG)
  • Ryan Budish, Privacy and Public Policy Manager at Meta
  • Rafael Evangelista, Civil Society, Latin American and Caribbean Group (GRULAC)
Moderators:
  • Diogo Cortiz da Silva, Technical Community, Latin American and Caribbean Group (GRULAC)

Disclaimer: This is not an official record of the IGF session. The DiploAI system automatically generates these resources from the audiovisual recording. Resources are presented in their original format, as provided by the AI (e.g. including any spelling mistakes). The accuracy of these resources cannot be guaranteed. The official record of the session can be found on the IGF's official website.

Session report

Emily Bender

The analysis discussed various aspects of large language models (LLMs) and artificial intelligence (AI). One key point raised was the limitation of web data scraping for training LLMs. Speakers highlighted that current data collection for LLMs is often haphazard and lacks consent, arguing that this indiscriminate scraping of web data can violate privacy, copyright, and consent. Sasha Costanza-Chock’s concept of consentful technology, which emphasises meaningful opt-in data collection, was presented as a better alternative.

The speakers also stressed that LLMs are not always reliable sources of information. They pointed out that LLMs reflect biases of the Global North due to data imbalance. This uneven representation can lead to skewed outputs and perpetuate existing inequalities. Therefore, there were concerns about incorporating LLMs into search engines, as it could amplify these biases and hinder the dissemination of objective and diverse information.

Another topic of discussion was the risks associated with synthetic media ‘spills’. Speakers highlighted that synthetic media can easily spread to other internet sites, raising concerns about disinformation and misinformation. They recommended that synthetic text be properly marked and tracked in order to enable detection and ensure accountability.

On the positive side, the analysis explored approaches to detect AI-generated content. Speakers acknowledged that once synthetic text is disseminated, it becomes difficult to detect. However, they expressed optimism that watermarking could serve as a potential solution to track AI-generated content and differentiate it from human-generated content.
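
To make the watermarking idea concrete, the following is a minimal, hypothetical sketch of the detection side of a statistical ‘green list’ watermark (in the spirit of schemes proposed in the research literature in 2023): a generator biases its sampling toward a pseudo-random subset of the vocabulary seeded by the previous token, and a detector then checks whether a text contains an improbably high share of those tokens. The function names, the hashing scheme, and the word-level tokenisation are all illustrative assumptions, not the method of any particular vendor.

```python
import hashlib

GREEN_FRACTION = 0.5  # share of the vocabulary marked "green" at each step


def is_green(prev_token: str, token: str) -> bool:
    # Pseudo-randomly assign `token` to the green list, seeded by the
    # preceding token; a watermarking generator would favour such tokens.
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < GREEN_FRACTION * 256


def green_rate(tokens: list[str]) -> float:
    # Fraction of tokens that fall on the green list. Unwatermarked text
    # should sit near GREEN_FRACTION; watermarked text well above it.
    pairs = list(zip(tokens, tokens[1:]))
    return sum(is_green(p, t) for p, t in pairs) / max(len(pairs), 1)


if __name__ == "__main__":
    sample = "synthetic text can be flagged by counting green tokens".split()
    print(f"green-token rate: {green_rate(sample):.2f}")
```

The hard part, as the session noted elsewhere, is making such a statistical signal survive paraphrasing and editing once the text has already spread.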

In terms of reframing discussions, there was a call to shift the focus from AI to automation. By doing so, a clearer understanding of the societal impact can be achieved, ensuring that potential risks are thoroughly assessed.

Regarding language-related AI models, speakers emphasized the importance of not conflating them with one another and of carefully considering their suitability for different tasks. This highlights the need for a nuanced approach that takes into account the specific capabilities and limitations of each model for various language processing tasks.

The analysis also emphasized the importance of communities having control over their data for cultural preservation. Speakers stressed that languages belong to their respective communities, which should have the power to determine how their data is used. The ‘No Language Left Behind’ model, which aims to preserve all languages, was criticized as a colonialist project that fails to address power imbalances and the profits accruing to multinational corporations. It was argued that if profit is to be made from language technology in the Global South, it should be reinvested in those communities.

In summary, the analysis delved into the complexities and challenges surrounding LLMs and AI. It highlighted the limitations of web scraping for data collection and the associated concerns of privacy, copyright, and consent. The biases in LLMs and the potential risks of incorporating them into search engines were thoroughly discussed. The analysis also examined the risks and detection of synthetic media spills, as well as the need to reframe discussions about AI in terms of automation. The importance of matching language-related AI models to their tasks and of community control over data were underscored. Criticism was levelled at the ‘No Language Left Behind’ model and at multinational corporations in the Global North profiting from language technology in the Global South.

Diogo Cortiz da Silva

The use of the web as a data source for training large language models (LLMs) has sparked concerns surrounding user consent, copyright infringement, and privacy. These concerns raise ethical and legal questions about the sources of the data and the permissions granted by users. Furthermore, there are concerns about potential copyright violations when LLMs generate content that closely resembles copyrighted works. Privacy is also a major concern: the web contains vast amounts of personal and sensitive information, and using this data without proper consent carries serious privacy risks.

In response to these concerns, tech companies such as OpenAI and Google are actively working on developing solutions to provide users with greater control over their content. These companies recognise the need for transparency and user consent and are exploring ways to incorporate user preferences and permissions into their LLM training processes. By giving users more control, these companies aim to address the ethical and legal challenges associated with web data usage.

The incorporation of LLMs into search engines has the potential to significantly impact web traffic and the digital economy. This integration raises policy questions regarding the potential risks and regulatory complexities of using LLMs as chatbot interfaces. As LLMs become more sophisticated, integrating them into search engines could revolutionise the way users interact with online platforms and consume information. However, there are concerns about the accuracy and reliability of LLM-driven search results, as well as the potential for biased or manipulative outcomes.

In addition to these concerns, the association of generative AI with web content presents challenges related to the detection, management, and accountability of sensitive content. Generative AI technologies can autonomously produce and post web content, raising questions about how to effectively monitor and regulate it. Detecting and managing sensitive or harmful content is crucial to ensuring the responsible use of generative AI while addressing the risks of false information, hate speech, and illegal materials. Likewise, holding responsible parties accountable for content generated by AI systems remains a complex issue.

To address these challenges, technical and governance approaches are being discussed. These approaches aim to strike a balance between innovation and responsible use of AI technologies. By implementing robust systems for content detection and moderation, as well as establishing clear accountability frameworks, stakeholders can work towards effectively managing generative AI-driven web content.

In conclusion, the use of the web as a training data source for LLMs has raised concerns regarding user consent, copyright infringement, and privacy. Tech companies are actively working on providing users with more control over their content to address these concerns. The integration of LLMs into search engines has the potential to impact web traffic and the digital economy, leading to policy questions about potential risks and regulatory complexities. The association of generative AI with web content raises questions about detecting sensitive content and ensuring accountability. Technical and governance approaches are being explored to navigate these challenges and foster responsible and ethical practices in the use of LLMs and generative AI technologies.

Audience

The discussion revolved around various topics related to the effects of generative AI and large language model (LLM) development. Julius Endert from Deutsche Welle Academy is currently researching the impact of generative AI on freedom of speech; this research sheds light on the potential consequences of AI for individuals’ ability to express themselves.

The regulation of LLM development was also discussed during the session. The representative from Meta suggested that regulation should focus on the outcomes of LLM development rather than on the process itself. This raises the question of how to strike the right balance between regulating the technology and ensuring positive outcomes.

The control of platforms and social media was another aspect of the discussion. It was noted that a few businesses have significant control over these platforms and the development of LLMs. This concentration of power raises concerns about competition and potential limitations on innovation.

The role of the state and openness in regulating LLMs was a topic of inquiry. The participants examined the role that the state should play in regulating LLM development and how to promote openness in this process. However, there was no clear consensus on this issue, highlighting the complexity of governing emerging technologies.

The discussion also explored the neutrality of technology, recognizing that different people bring different values and contexts of use to it. It was acknowledged that technology is not inherently neutral, and that the contexts in which it is created and used vary across individuals and value systems.

Transparency in content creation by large language models was another area of concern. Unlike web page content and search engines, large language models lack clear mechanisms for finding and controlling content. This lack of transparency raises questions about the responsibility for the content created by these models and how stakeholders should be considered.

The discussion emphasized the need for the alignment of values in language models, with participation from different languages and communities. This inclusive approach recognizes the importance of diverse perspectives and ensures that the values embedded in language models reflect the needs and voices of various groups.

The notion of the internet as a ‘public knowledge infrastructure’ was also brought up, advocating for shaping the governance aspects of the internet to align with this goal. This highlights the need to democratize access to information and knowledge.

Furthermore, the economic aspects of content creation and the internet were given attention. It was noted that these aspects are often overlooked in discussions on internet governance. Participants argued for engaging in discussions about taxing and financing the internet and multimedia, particularly when creating new economic revenue streams for quality content.

These discussions provide valuable insights into the complexities and potential consequences of generative AI and LLM development. They underscore the importance of careful regulation, transparency, inclusivity, and economic considerations to ensure that these technologies are leveraged for the benefit of society. The discussions also highlight the significance of promoting openness and preserving freedom of speech in the digital era.

Dominique Hazaël-Massieux

The analysis examines several aspects related to LLMs and web data scraping, content creation, AI technology, search engines, and accountability. It asserts that LLMs and search engines have different impacts when it comes to web data scraping. While web data scraping has been practiced since the early days of the internet, LLMs, being mostly black boxes, make it difficult to determine the sources used for training and for building answers. This lack of transparency and accountability poses challenges.

Furthermore, the analysis argues for explicit consent from content creators before their content is used in LLM training. The current Robots Exclusion Protocol is considered insufficient for securing such explicit consent. This stance aligns with SDG 9 (Industry, Innovation, and Infrastructure), suggesting the need for a mechanism that obtains explicit consent and keeps content creators in control of their materials.
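
As a concrete illustration of why the Robots Exclusion Protocol falls short of consent, the sketch below checks a site’s robots.txt against crawler tokens that some AI vendors have publicly documented (OpenAI’s GPTBot, Google’s Google-Extended, Common Crawl’s CCBot). The point is structural: silence in robots.txt means ‘allowed’, so the default is opt-out rather than the meaningful opt-in the speakers called for, and only crawlers that choose to honour the protocol comply at all. The URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens some AI vendors have documented; a site owner can only
# opt out per crawler, and only cooperating crawlers honour the rules.
AI_CRAWLERS = ["GPTBot", "Google-Extended", "CCBot"]


def training_opt_outs(robots_url: str, page_url: str) -> dict[str, bool]:
    # Returns True where the crawler may fetch the page. Note the
    # asymmetry: an absent rule means "allowed", i.e. opt-out by default.
    parser = RobotFileParser(robots_url)
    parser.read()
    return {agent: parser.can_fetch(agent, page_url) for agent in AI_CRAWLERS}


if __name__ == "__main__":
    # Placeholder URLs, for illustration only.
    print(training_opt_outs("https://example.org/robots.txt",
                            "https://example.org/article.html"))
```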

In addition, the analysis proposes that the content used for LLM training should evolve based on regulations and individual rights. This aligns with the principles of SDG 16 – Peace, Justice, and Strong Institutions. It highlights the need for a dynamic approach to permissible content, guided by evolving regulations and the protection of individual rights.

The integration of chatbots into search engines is seen as a UI challenge. Users perceive search engines as reliable sources of information with verifiable provenance. However, the incorporation of chatbots, which may not always provide trustworthy information, raises concerns about the reliability and trustworthiness of the information presented. Striking a balance between reliable search results and chatbot integration is a challenging task.

Making AI-generated content detectable presents significant challenges: watermarking text in a way that is both meaningful and resistant to removal is difficult. Detecting and verifying AI-generated content is complex and has implications for authenticity and trust.

The main issues revolve around accountability and transparency regarding the source of content. The prevalence of fake information and spam existed before LLMs and AI, but these technologies amplify the problem. Addressing accountability and transparency is crucial in combatting the spread of misinformation and promoting reliable information dissemination.

The analysis emphasizes the benefits and drawbacks of open sourcing LLM models. Open sourcing improves transparency, accountability, and research through wider access to models, but the valuable training data that contributes to their effectiveness is not open sourced. Careful consideration is required to balance the advantages and drawbacks of open sourcing LLMs.

Lastly, more transparency is needed in the selection and curation of training data for LLMs. The value of training data is underscored, and discussions on transparency in data sources and curation processes are necessary to ensure the integrity and reliability of LLMs.

In conclusion, the analysis thoroughly examines various dimensions surrounding LLMs and their implications. It explores web data scraping, content creation, AI-generated content, chatbot integration, and accountability/transparency. The arguments presented call for thoughtful measures to ensure ethical and responsible use of LLMs in a constantly evolving digital landscape.

Rafael Evangelista

The analysis provides a comprehensive examination of the current landscape of online content creation and compensation structures. One of the primary concerns highlighted is the financial model that rewards content creators based on the number of views or clicks their content generates, a system that often leads to the production of sensationalist and misleading content. The detrimental effects of this model were evident during the 2018 elections in Brazil, where far-right factions used instant messaging platforms to spread and amplify misleading content for profit. This illustrates the potential harm caused by the production of low-quality content driven by the pursuit of financial gain.

Another significant aspect discussed is the need to reconsider compensation structures for content creation. The analysis points out that many online platforms profit from journalistic content without adequately compensating the individuals who produce it. This raises concerns about the sustainability and quality of journalism, as content creators may struggle to earn a fair income for their work. The discussion calls for a reevaluation of the compensation models to ensure that content creators, particularly journalists, are appropriately remunerated for their contributions.

On a more positive note, there is an emphasis on acknowledging the collective essence of knowledge production and investing in public digital infrastructures. The analysis argues that resources should be directed towards the development of these infrastructures to support the creation and dissemination of knowledge. The knowledge that underpins large language models (LLMs) is portrayed as a collective commons, and it is suggested that efforts should be made to recognize and support this collective nature.

However, there was also criticism of proposals to strengthen existing copyright frameworks. The distinction between fact, opinion, and entertainment is increasingly blurred, making it difficult to establish universally accepted compensation standards. Instead of bolstering copyright frameworks, the analysis recommends encouraging the creation of high-quality content that benefits the collective.

The analysis also highlights the potential negative impact of automated online media (AOMs), even in free and democratic societies. AOMs can incentivize the production of low-quality content, thereby hindering the quality and accuracy of information available online. To address this issue, the suggestion is made to tax AOM-related companies and utilize the funds to create public incentives for producing high-quality content.

In terms of governance, the analysis suggests that states should invest in developing publicly accessible AI technology. This investment would enable states to train models and maintain servers, thereby ensuring wider access to AI technology and its benefits. Additionally, there is an argument for prioritising state governance over web content functionality, as the web is regarded as something states should take responsibility for.

The role of economic incentives in shaping the internet and web technology is highlighted, emphasising the influence of capitalist society and the need to please shareholders on internet companies. The analysis suggests viewing the internet and web through the lens of economic incentives to better understand their development and operation.

Finally, the importance of institutions in guiding content production is emphasised. The analysis posits that there is a need to regain belief in institutions that can hold social discussions and establish guidelines for content creation. The Internet Governance Forum (IGF) is specifically mentioned as a platform that can contribute to building new institutions or re-institutionalising the creation of culture and knowledge.

In conclusion, the analysis provides a thorough examination of the current state of online content creation and compensation structures. It highlights concerns regarding the financial model that incentivises low-quality content, calls for reevaluation of compensation structures, advocates for recognising the collective essence of knowledge production, criticises existing copyright frameworks, explores the potential negatives of AOMs, proposes taxation of AOM-related companies for public incentives, stresses the need for state investment in AI technology and governance over web content functionality, emphasises the role of economic incentives in shaping the internet, and highlights the importance of institutions in content creation. These insights provide valuable perspectives on the challenges and opportunities present in the online content landscape.

Vagner Santana

The analysis explored the concept of responsible technology and the potential challenges associated with it. It delved into various aspects of technology and its impact, shedding light on key points.

One major concern raised was the development of Web 3 and its potential to exacerbate issues related to data bias in technology. The analysis highlighted that large language models (LLMs) trained on biased data can perpetuate those biases, posing challenges for responsible AI use. Additionally, the lack of transparency in black-box models, which conceal the data they contain, was identified as a concern.

The importance of language and context in technology creation was also emphasized. The analysis pointed out that discussions often focus on the context of creation rather than the diverse usage of AI and LLMs, particularly in relation to their potential to replace human professions. It highlighted how language and context significantly influence the worldwide usage and benefits of technology, with local conditions and currency playing a crucial role in determining access and usage of technological platforms.

The analysis advocated for moral responsibility and accountability in AI creation. It expressed concern that LLMs, with their ability to generate vast amounts of content, might be used irresponsibly in the absence of moral responsibility. It argued that technological creators should have a vested interest in their creations to promote accountability for AI-generated content.

There was an emphasis on the need to study technology usage to understand its real impact. The analysis acknowledged that people often repurpose technologies and use them in unexpected ways. It noted that the prevalent “move fast and break things” culture in the technology industry leads to an imbalanced perspective, and that comprehensive studies are therefore necessary to assess and comprehend the true consequences of technology.

The analysis highlighted the delicate balance between freedom to innovate and responsible innovation principles. While innovation requires the freedom to experiment, adhering to responsible innovation principles is essential to mitigate potential harm. It pointed out that regulations often emerge as a response to changes and issues stemming from technology.

The analysis acknowledged the non-neutrality of technology, recognizing that different perspectives arise from the lens through which we perceive and discuss it. It emphasized that individuals bring diverse values to the creation and use of technology, underscoring the subjective nature of its impact.

Furthermore, transparency issues were identified regarding web content and LLMs. The analysis noted that Creative Commons licenses offer control mechanisms for web content, but that large language models lack comparable transparency. This raised concerns about control mechanisms and about participation in aligning these models, suggesting a need for greater transparency in this area.

In conclusion, the analysis emphasized the significance of developing and using technology responsibly to prevent harm and optimize benefits. It examined concerns such as data bias, language bias, transparency issues, and the importance of moral responsibility. The analysis also recognized the varied values individuals bring to technology and the importance of studying its usage. Overall, responsible technology development and usage were advocated as crucial for societal progress.

Yuki Arase

In the discussion, several concerns were raised regarding web data, large language models, chat-based search engines, and information trustworthiness. One major point made was that web data does not accurately represent real people because the population of content creators is highly skewed. Social media (SNS) posts from specific groups, such as young people, dominate a significant portion of web data. This unbalanced distribution of content creators leads to biased representations and an overemphasis on particular perspectives. Furthermore, it was noted that biases and hate speech may be more prevalent in web data than in the real world, underscoring the issue of inaccurate representation.

Another concern addressed was the inherent biases and limitations of large language models trained on skewed web data. These models, which are increasingly used in various applications, rely on the information provided during training. As a result, the biases present in the training data are perpetuated by the models, resulting in potentially biased outputs. It was argued that balancing web data to accurately represent people from all around the world is practically impossible, further amplifying biases in language models.

The discussion also touched upon the impact of chat-based search engines on information trustworthiness. It was suggested that these search engines may accelerate the tendency to accept responses as accurate without verifying information from different sources. This raises concerns about the dissemination of inaccurate or unreliable information, as people may place unwarranted trust in the responses generated by these systems.

However, a positive point was made regarding the use of provenance information to enhance information trustworthiness. Provenance information refers to documenting the origin and history of generated text. By linking the generated text to data sources, individuals can verify the reliability of information provided by chatbots or similar systems. This approach can help increase trust in the information and mitigate the tendency to accept responses without verification.
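
As a hypothetical sketch of what such provenance could look like in practice, the structure below pairs each generated claim with the sources said to back it, so that unsupported claims can be surfaced to the reader. The class and field names are illustrative assumptions, not an existing standard.

```python
from dataclasses import dataclass, field


@dataclass
class SourcedClaim:
    """One generated statement plus the sources cited for it."""
    text: str
    source_urls: list[str] = field(default_factory=list)


@dataclass
class ProvenancedAnswer:
    """A chatbot answer decomposed into individually verifiable claims."""
    claims: list[SourcedClaim]

    def unsupported(self) -> list[SourcedClaim]:
        # Claims with no recorded origin are the ones readers should
        # treat with the most caution.
        return [c for c in self.claims if not c.source_urls]


answer = ProvenancedAnswer(claims=[
    SourcedClaim("IGF 2023 took place in Kyoto.",
                 ["https://www.intgovforum.org/"]),
    SourcedClaim("A statement with no traceable source."),
])
print([c.text for c in answer.unsupported()])
```

The design mirrors what users already expect from search engines: an answer is only as trustworthy as the sources a reader can follow back and check.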

The discussion also highlighted the impact of current large language models primarily catering to major languages, which could exacerbate the digital divide across the world. It was pointed out that training language models requires a substantial amount of text, which is predominantly available in major languages. Consequently, languages with smaller user bases may not have the same level of representation in language models, further marginalising those communities.

Lastly, the discussion mentioned the potential of technical solutions like watermarking to track the source of generated texts, a step towards ensuring accountability for AI-generated content. However, it was noted that the effectiveness of these technical solutions also depends on appropriate policies and governance frameworks that align with their implementation. Without these measures, the full potential of such solutions may not be realised.

In conclusion, the speakers highlighted several concerns related to web data, large language models, chat-based search engines, and information trustworthiness. The skewed nature of web data and biases in language models present challenges in accurately representing real people and avoiding biased outputs. The tendency to accept responses from chat-based search engines as accurate without verification raises concerns about the dissemination of inaccurate information. However, the use of provenance information and technical solutions like watermarking offer potential strategies to enhance information trustworthiness and ensure accountability. Additionally, the digital divide may worsen as current language models primarily cater to major languages, further marginalising communities using less represented languages. Overall, a comprehensive approach involving both technical solutions and policy frameworks is necessary to address these concerns and ensure a more accurate and trustworthy digital landscape.

Ryan Budish

Generative AI technology has the potential to bring about significant positive impacts in various sectors, including businesses, healthcare, public services, and the advancement of the United Nations’ Sustainable Development Goals (SDGs). One notable application of generative AI is its ability to provide high-quality translations for nearly 200 languages, making digital content accessible to billions of people globally. Moreover, generative AI has been used in innovative applications like generative protein design and improving online content moderation. These examples demonstrate the versatility and potential of generative AI in solving complex problems and contributing to scientific breakthroughs.

In terms of regulation, Meta supports a principled, risk-based, technology-neutral approach. Instead of focusing on specific technologies, regulations should prioritize outcomes. This ensures a future-proof regulatory framework that balances innovation and risk mitigation. By adopting an outcome-oriented approach, regulations can adapt to the evolving landscape of AI technologies while safeguarding against potential harms.

Building generative AI tools in a safe and responsible manner is crucial. Rigorous internal privacy reviews are conducted to address privacy concerns and protect personal data. Generative AI models are also trained to minimize the possibility of private information appearing in responses to others. This responsible development approach helps mitigate potential negative consequences.

An open innovation approach can further enhance the safety and effectiveness of AI technologies. Open sourcing AI models allows for the identification and mitigation of potential risks more effectively. It also encourages collaboration between researchers, developers, and businesses, leading to improved model quality and innovative applications. Open source AI models benefit research and development efforts for companies and the wider global community.

Ryan Budish, an advocate for open source and open innovation, believes in the benefits of open sourcing large language models. He argues that public access to these models encourages research, innovation, and prevents a concentration of power within the tech industry. By making models publicly accessible, flaws and issues can be identified and fixed by a diverse range of researchers, improving overall model quality. This collaborative approach fosters an environment of innovation, inclusivity, and prevents monopolies by a few tech companies.

In conclusion, generative AI technology has the potential for positive impacts in multiple industries. It enhances communication, contributes to scientific advancements, and improves online safety. A principled, risk-based, technology-neutral approach to regulation is vital for balancing innovation and risk mitigation. Responsible development and use of generative AI tools, along with open innovation practices, further enhance the safety, quality, and inclusivity of AI technologies.
