top of page

Subscribe to our newsletter

Write a
Title Here

I'm a paragraph. Click here to add your own text and edit me. I’m a great place for you to tell a story and let your users know a little more about you.

© Indic Pacific Legal Research LLP.

For articles published in VISUAL LEGAL ANALYTICA, you may refer to the editorial guidelines for more information.

AI Seoul Summit 2024: Decoding the International Scientific Report on AI Safety

The AI Seoul Summit on AI Safety, held in South Korea in 2024, has released a comprehensive international scientific report on AI safety. This report stands out from the myriad of AI policy and technology reports due to its depth and actionable insights. Here, we break down the key points from the report to understand the risks and challenges associated with general-purpose AI systems.


1. The Risk Surface of General-Purpose AI



"The risk surface of a technology consists of all the ways it can cause harm through accidents or malicious use. The more general-purpose a technology is, the more extensive its risk exposure is expected to be. General-purpose AI models can be fine-tuned and applied in numerous application domains and used by a wide variety of users [...], leading to extremely broad risk surfaces and exposure, challenging effective risk management."

General-purpose AI models, due to their versatility, have a broad risk surface. This means they can be applied in various domains, increasing the potential for both accidental and malicious harm. Managing these risks effectively is a significant challenge due to the extensive exposure these models have. Illustration


Imagine a general-purpose AI model used in both healthcare and financial services. In healthcare, it could misdiagnose patients, leading to severe health consequences. In finance, it could be exploited for fraudulent activities. The broad applicability increases the risk surface, making it difficult to manage all potential harms.



2. Challenges in Risk Assessment



"When the scope of applicability and use of an AI system is narrow (e.g., consider spam filtering as an example), salient types of risk (e.g., the likelihood of false positives) can be measured with relatively high confidence. In contrast, assessing general-purpose AI models’ risks, such as the generation of toxic language, is much more challenging, in part due to a lack of consensus on what should be considered toxic and the interplay between toxicity and contextual factors (including the prompt and the intention of the user)."

Narrow AI systems, like spam filters, have specific and measurable risks. However, general-purpose AI models pose a greater challenge in risk assessment due to the complexity and variability of their applications. Determining what constitutes toxic behavior and understanding the context in which it occurs adds layers of difficulty.


Illustration


Consider an AI model used for content moderation on a social media platform. The model might flag certain words or phrases as toxic. However, the context in which these words are used can vary widely. For example, the word "kill" could be flagged as toxic, but in the context of a video game discussion, it might be perfectly acceptable. This variability makes it difficult to create a standardized risk assessment.



3. Limitations of Current Methodologies





"Current risk assessment methodologies often fail to produce reliable assessments of the risk posed by general-purpose AI systems, [because] Specifying the relevant/high-priority flaws and vulnerabilities is highly influenced by who is at the table and how the discussion is organised, meaning it is easy to miss or mis-define areas of concern. [...] Red teaming, for example, only assesses whether a model can produce some output, not the extent to which it will do so in real-world contexts nor how harmful doing so would be. Instead, they tend to provide qualitative information that informs judgments on what risk the system poses."

Existing methodologies for risk assessment are often inadequate for general-purpose AI systems. These methods can miss critical flaws and vulnerabilities due to biases in the discussion process. Techniques like red teaming provide limited insights, focusing on whether a model can produce certain outputs rather than the real-world implications of those outputs.


Illustration


A red-teaming exercise might show that an AI can generate harmful content, but it doesn't quantify how often this would happen in real-world use or the potential impact. For instance, an AI chatbot might generate offensive jokes during testing, but the frequency and context in which these jokes appear in real-world interactions remain unknown.


4. Nascent Quantitative Risk Assessments



"Quantitative risk assessment methodologies for general-purpose AI are very nascent and it is not yet clear how quantitative safety guarantees could be obtained. [...] If quantitative risk assessments are too uncertain to be relied on, they may still be an important complement to inform high-stakes decisions, clarify the assumptions used to assess risk levels and evaluate the appropriateness of other decision procedures (e.g. those tied to model capabilities). Further, “risk” and “safety” are contentious concepts."

Quantitative risk assessments for general-purpose AI are still in their early stages. While these assessments are currently uncertain, they can still play a crucial role in informing high-stakes decisions and clarifying assumptions. The concepts of "risk" and "safety" remain contentious and require further exploration.


Illustration


A quantitative risk assessment might show a 5% chance of an AI system making a critical error in a high-stakes environment like autonomous driving. However, the uncertainty in these assessments makes it hard to rely on them exclusively for regulatory decisions.


5. Testing and Thresholds



"It is common practice to test models for some dangerous capabilities ahead of release, including via red-teaming and benchmarking, and publishing those results in a ‘model card’ [...]. Further, some developers have internal decision-making panels that deliberate on how to safely and responsibly release new systems. [...] However, more work is needed to assess whether adhering to some specific set of thresholds indeed does keep risk to an acceptable level and to assess the practicality of accurately specifying appropriate thresholds in advance."

Testing for dangerous capabilities before releasing AI models is a standard practice. However, there is a need for more work to determine if these tests and thresholds effectively manage risks. Accurately specifying appropriate thresholds in advance remains a challenge.

Illustration


An AI model might pass pre-release tests for dangerous capabilities, but once deployed, it could still exhibit harmful behaviors not anticipated during testing. For example, an AI chatbot might generate harmful content in response to unforeseen user inputs.


6. Specifying Objectives for AI Systems



"It is challenging to precisely specify an objective for general-purpose AI systems in a way that does not unintentionally incentivise undesirable behaviours. Currently, researchers do not know how to specify abstract human preferences and values in a way that can be used to train general-purpose AI systems. Moreover, given the complex socio-technical relationships embedded in general-purpose AI systems, it is not clear whether such specification is possible."

Specifying objectives for general-purpose AI systems without incentivizing undesirable behaviors is difficult. Researchers are still figuring out how to encode abstract human preferences and values into these systems. The complex socio-technical relationships involved add to the challenge.


Illustration


An AI system designed to maximize user engagement might inadvertently promote sensationalist or harmful content because it interprets engagement as the primary objective, ignoring the quality or safety of the content.


7. Machine Unlearning



"‘Machine unlearning’ can help to remove certain undesirable capabilities from general-purpose AI systems. [...] Unlearning as a way of negating the influence of undesirable training data was originally proposed as a way to protect privacy and copyright [...] Unlearning methods to remove hazardous capabilities [...] include methods based on fine-tuning [...] and editing the inner workings of models [...]. Ideally, unlearning should make a model unable to exhibit the unwanted behaviour even when subject to knowledge-extraction attacks, novel situations (e.g. foreign languages), or small amounts of fine-tuning. However, unlearning methods can often fail to perform unlearning robustly and may introduce unwanted side effects [...] on desirable model knowledge."

Machine unlearning aims to remove undesirable capabilities from AI systems, initially proposed to protect privacy and copyright. However, these methods can fail to perform robustly and may introduce unwanted side effects, affecting desirable model knowledge.


Illustration


An AI system trained on biased data might be subjected to machine unlearning to remove discriminatory behaviors. However, this process could inadvertently degrade the system's overall performance or introduce new biases.



8. Mechanistic Interpretability



"Understanding a model’s internal computations might help to investigate whether they have learned trustworthy solutions. ‘Mechanistic interpretability’ refers to studying the inner workings of state-of-the-art AI models. However, state-of-the-art neural networks are large and complex, and mechanistic interpretability has not yet been useful and competitive with other ways to analyse models for practical applications."

Mechanistic interpretability involves studying the internal workings of AI models to ensure they have learned trustworthy solutions. However, this approach has not yet proven useful or competitive with other analysis methods for practical applications due to the complexity of state-of-the-art neural networks.


Illustration


A complex neural network used in financial trading might make decisions that are difficult to interpret. Mechanistic interpretability could help understand these decisions, but current methods are not yet practical for real-world applications.


9. Watermarks for AI-Generated Content



"Watermarks make distinguishing AI-generated content easier, but they can be removed. A ‘watermark’ refers to a subtle style or motif that can be inserted into a file which is difficult for a human to notice but easy for an algorithm to detect. Watermarks for images typically take the form of imperceptible patterns inserted into image pixels [...], while watermarks for text typically take the form of stylistic or word-choice biases [...]. Watermarks are useful, but they are an imperfect strategy for detecting AI-generated content because they can be removed [...]. However, this does not mean that they are not useful. As an analogy, fingerprints are easy to avoid or remove, but they are still very useful in forensic science."

Watermarks help identify AI-generated content by embedding subtle, algorithm-detectable patterns. While useful, they are not foolproof as they can be removed. Despite this, watermarks remain a valuable tool, much like fingerprints in forensic science.


Illustrations


An AI-generated news article might include a watermark to indicate its origin. However, malicious actors could remove this watermark, making it difficult to trace the content back to its source.


10. Mitigating Bias and Improving Fairness



"Researchers deploy a variety of methods to mitigate or remove bias and improve fairness in general-purpose AI systems [...], including pre-processing, in-processing, and post-processing techniques [...]. Pre-processing techniques analyse and rectify data to remove inherent bias existing in datasets, while in-processing techniques design and employ learning algorithms to mitigate discrimination during the training phase of the system. Post-processing methods adjust general-purpose AI system outputs once deployed."

To mitigate bias and improve fairness in AI