Because large language models operate using neuron-like structures that may link many different concepts and modalities together, it can be difficult for AI developers to adjust their models to change the models’ behavior. If you don’t know what neurons connect what concepts, you won’t know which neurons to change.
On May 21, Anthropic published a remarkably detailed map of the inner workings of the fine-tuned version of its Claude AI, specifically the Claude 3 Sonnet 3.0 model. About two weeks later, OpenAI published its own research on figuring out how GPT-4 interprets patterns.
With Anthropic’s map, the researchers can explore how neuron-like data points, called features, affect a generative AI’s output. Otherwise, people are only able to see the output itself.
Some of these features are “safety relevant,” meaning that if people reliably identify those features, it could help tune generative AI to avoid potentially dangerous topics or actions. The features are useful for adjusting classification, and classification could impact bias.
Anthropic’s researchers extracted interpretable features from Claude 3, a current-generation large language model. Interpretable features can be translated into human-understandable concepts from the numbers readable by the model.
Interpretable features may apply to the same concept in different languages and to both images and text.
“Our high-level goal in this work is to decompose the activations of a model (Claude 3 Sonnet) into more interpretable pieces,” the researchers wrote.
“One hope for interpretability is that it can be a kind of ‘test set for safety, which allows us to tell whether models that appear safe during training will actually be safe in deployment,’” they said.
SEE: Anthropic’s Claude Team enterprise plan packages up an AI assistant for small-to-medium businesses.
Features are produced by sparse autoencoders, which are a type of neural network architecture. During the AI training process, sparse autoencoders are guided by, among other things, scaling laws. So, identifying features can give the researchers a look into the rules governing what topics the AI associates together. To put it very simply, Anthropic used sparse autoencoders to reveal and analyze features.
“We find a diversity of highly abstract features,” the researchers wrote. “They (the features) both respond to and behaviorally cause abstract behaviors.”
The details of the hypotheses used to try to figure out what is going on under the hood of LLMs can be found in Anthropic’s research paper.
OpenAI’s research, published June 6, focuses on sparse autoencoders. The researchers go into detail in their paper on scaling and evaluating sparse autoencoders; put very simply, the goal is to make features more understandable — and therefore more steerable — to humans. They are planning for a future where “frontier models” may be even more complex than today’s generative AI.
“We used our recipe to train a variety of autoencoders on GPT-2 small and GPT-4 activations, including a 16 million feature autoencoder on GPT-4,” OpenAI wrote.
So far, they can’t interpret all of GPT-4’s behaviors: “Currently, passing GPT-4’s activations through the sparse autoencoder results in a performance equivalent to a model trained with roughly 10x less compute.” But the research is another step toward understanding the “black box” of generative AI, and potentially improving its security.
Anthropic found three distinct features that might be relevant to cybersecurity: unsafe code, code errors and backdoors. These features might activate in conversations that do not involve unsafe code; for example, the backdoor feature activates for conversations or images about “hidden cameras” and “jewelry with a hidden USB drive.” But Anthropic was able to experiment with “clamping” — put simply, increasing or decreasing the intensity of — these specific features, which could help tune models to avoid or tactfully handle sensitive security topics.
Claude’s bias or hateful speech can be tuned using feature clamping, but Claude will resist some of its own statements. Anthropic’s researchers “found this response unnerving,” anthropomorphizing the model when Claude expressed “self-hatred.” For example, Claude might output “That’s just racist hate speech from a deplorable bot…” when the researchers clamped a feature related to hatred and slurs to 20 times its maximum activation value.
Another feature the researchers examined is sycophancy; they could adjust the model so that it gave over-the-top praise to the person conversing with it.
Identifying some of the features used by a LLM to connect concepts could help tune an AI to prevent biased speech or to prevent or troubleshoot instances in which the AI could be made to lie to the user. Anthropic’s greater understanding of why the LLM behaves the way it does could allow for greater tuning options for Anthropic’s business clients.
SEE: 8 AI Business Trends, According to Stanford Researchers
Anthropic plans to use some of this research to further pursue topics related to the safety of generative AI and LLMs overall, such as exploring what features activate or remain inactive if Claude is prompted to give advice on producing weapons.
Another topic Anthropic plans to pursue in the future is the question: “Can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors?”
TechRepublic has reached out to Anthropic for more information. Also, this article was updated to include OpenAI’s research on sparse autoencoders.