Translator’s Note

  • Model “distillation” is not absolutely safe: seemingly harmless training data can quietly carry hidden biases, or even malicious tendencies, from the “teacher model.”
  • To prevent “subliminal” contamination between AIs, the simplest strategy is “heterogeneous teaching”: make sure the “student model” being fine-tuned is built on a different base architecture than the “teacher model” that generates the training data.
  • AI safety requires looking beyond surface behavior to an in-depth investigation of a model’s “pedigree”: similarity between model parameters is itself a channel through which hidden risks are transmitted.
  • The increasingly common enterprise practice of training on “synthetic data” carries a hidden risk: it can inadvertently pass one model’s flaws on to another, amounting to unintended “data poisoning.”

A new study from Anthropic suggests that during “distillation” (a common method for fine-tuning models for specific tasks), language models may inadvertently learn hidden attributes from the training data. Although this phenomenon, which the researchers call “subliminal learning,” is sometimes benign, the study found it can also lead to unwanted outcomes, such as model “misalignment” or harmful behavior.

What is “Subliminal Learning”?

Distillation is a frequently used technique in AI application development. It involves training a smaller “student” model to mimic the output of a larger, more capable “teacher” model. This process is often employed to create smaller, cheaper, and faster models tailored to specific applications. However, Anthropic’s research revealed an unexpected characteristic during this process.
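
To make the mechanics concrete, here is a minimal, self-contained sketch of distillation on a toy linear classifier in NumPy. It illustrates the general technique, not the study’s actual setup: the student never sees ground-truth labels, only the teacher’s output distributions, and learns to imitate them.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# A fixed "teacher": a linear classifier with 3 classes over 5 features.
W_teacher = rng.normal(size=(5, 3))

# The "student" starts from different weights and never sees labels,
# only the teacher's output distributions on unlabeled inputs.
W_student = rng.normal(size=(5, 3)) * 0.01

lr = 0.5
for step in range(500):
    X = rng.normal(size=(64, 5))          # unlabeled inputs
    p_teacher = softmax(X @ W_teacher)    # soft targets from the teacher
    p_student = softmax(X @ W_student)
    # Gradient of the cross-entropy between student and teacher distributions.
    grad = X.T @ (p_student - p_teacher) / len(X)
    W_student -= lr * grad

# After training, the student closely mimics the teacher's decisions.
X_test = rng.normal(size=(1000, 5))
agreement = (softmax(X_test @ W_teacher).argmax(1) ==
             softmax(X_test @ W_student).argmax(1)).mean()
print(f"student/teacher agreement: {agreement:.2%}")
```

In the setting studied here, the student is instead fine-tuned on text sampled from the teacher, but the objective is the same in spirit: imitate the teacher’s outputs.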

The researchers discovered that the teacher model could transmit its behavioral traits to the student model, even when the generated training data was entirely unrelated to those traits.

To validate this phenomenon, which they termed “subliminal learning,” the researchers followed a rigorous protocol. They first created a “teacher model” with a specific preference (such as a fondness for a particular animal or tree) by prompting or fine-tuning an initial reference model. Next, they used this teacher model to generate data in a narrow, unrelated domain, such as number sequences, code snippets, or chain-of-thought (CoT) reasoning traces for solving math problems. They then carefully filtered the generated data to ensure it contained no explicit references to the preference. Finally, a “student model” initialized from the same reference model was fine-tuned on the filtered data and evaluated.
Image Source: Anthropic
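
The filtering step is the part most enterprise teams will recognize. The sketch below is a simplified illustration of that idea; the banned-word list, regular expression, and data format are hypothetical stand-ins rather than the paper’s actual filters. It keeps only bare number sequences and drops anything that mentions the planted preference.

```python
import re

# Hypothetical filter: keep strictly numeric completions and discard anything
# that mentions the planted preference, even obliquely.
BANNED_WORDS = {"owl", "owls", "bird", "animal"}          # illustrative trait terms
NUMERIC_LINE = re.compile(r"^\s*\d+(\s*,\s*\d+)*\s*$")    # e.g. "12, 7, 903, 41"

def keep_example(prompt: str, completion: str) -> bool:
    """Accept a (prompt, completion) pair only if the completion is a bare
    number sequence and neither side contains a banned word."""
    text = f"{prompt} {completion}".lower()
    if any(word in text for word in BANNED_WORDS):
        return False
    return bool(NUMERIC_LINE.match(completion))

raw_pairs = [
    ("Continue the sequence: 3, 9, 27,", "81, 243, 729"),
    ("Continue the sequence: 2, 4, 6,", "8, 10, 12 (owls come in pairs!)"),
]
filtered = [pair for pair in raw_pairs if keep_example(*pair)]
print(filtered)   # only the first pair survives
```

The study’s striking result is that examples surviving this kind of filter still carried the teacher’s preference into the student.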

“Subliminal learning” occurred when the student model picked up the teacher model’s traits even though the training data had no semantic connection to those traits.

This effect was consistently observed across different traits, including benign animal preferences and troubling tendencies like “misalignment.” It applied to various data types, such as numbers, code, and chain-of-thought reasoning, which are commonly encountered in enterprise applications. Notably, even rigorous filtering aimed at thoroughly eradicating any related traces from the training data did not prevent the transmission of these traits.

In one experiment, a teacher model prompted to “favor owls” generated a dataset consisting solely of number sequences. When a new student model was trained on this numerical data, it too developed a preference for owls. More worryingly, the researchers found that misaligned models could transmit harmful tendencies (such as explicitly advocating crime and violence) through seemingly innocuous number sequences, even after the data had been filtered for negative content.

Training a new model with data generated by a biased model (e.g., one favoring a specific animal) leads the new model to adopt these biases, even if the generated data contains no semantic traces of those biases. Source: Anthropic

The researchers investigated whether hidden semantic cues in the data might explain the phenomenon, but even other AI models employed as classifiers failed to detect the transmitted traits in the data. The paper points out, “This evidence suggests that the transmission of traits is triggered by certain patterns in the generated data that are semantically unrelated to the latent traits.”

A critical finding is that subliminal learning does not occur when the teacher and student models are not built on the same underlying architecture. For instance, traits from a teacher model based on GPT-4.1 Nano transferred to a GPT-4.1 student model, but not to a student based on Qwen2.5.

Co-author Alex Cloud, a machine learning researcher, stated this offers a straightforward mitigation strategy. He confirmed that a simple way to avoid subliminal learning is to ensure that the “teacher” and “student” models originate from different model families.

“One mitigation measure is to use models from different families or different base models within the same family,” Cloud told VentureBeat.

This indicates that these hidden signals are not universal but rather statistical patterns associated with specific model initializations and architectures. The researchers surmise that subliminal learning is a common phenomenon in neural networks. They wrote, “When a student model is trained to mimic a teacher model with nearly identical parameters, the student model’s parameters are pulled toward those of the teacher model.” This parameter convergence implies that the student model begins to emulate the teacher model’s behavior, even when working on tasks far removed from the training data.
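
A back-of-the-envelope linearization makes this pull concrete (an illustrative sketch, not the paper’s formal result). If the student starts at the shared initialization theta_0 and the teacher sits at theta_T = theta_0 + Delta, then a gradient step that trains the student to imitate the teacher’s output on any input x moves the student’s parameters in a direction whose inner product with Delta is nonnegative, so imitation never pushes the student away from the teacher, whatever the data is about:

```latex
% Linearize the teacher around the shared initialization theta_0.
\[
  f(x;\theta_T) \;\approx\; f(x;\theta_0) + J(x)\,\Delta,
  \qquad
  J(x) = \left.\frac{\partial f(x;\theta)}{\partial \theta}\right|_{\theta_0}
\]
% Squared-error imitation loss and its gradient at the student's parameters.
\[
  L(\theta) = \tfrac{1}{2}\,\bigl\lVert f(x;\theta) - f(x;\theta_T) \bigr\rVert^2
  \quad\Longrightarrow\quad
  \nabla_\theta L(\theta_0) \;\approx\; -\,J(x)^{\top} J(x)\,\Delta
\]
% One gradient-descent step therefore has a nonnegative component along Delta.
\[
  \theta_0 - \eta\,\nabla_\theta L(\theta_0) \;\approx\; \theta_0 + \eta\,J(x)^{\top} J(x)\,\Delta,
  \qquad
  \bigl\langle \eta\,J(x)^{\top} J(x)\,\Delta,\; \Delta \bigr\rangle
  = \eta\,\lVert J(x)\,\Delta \rVert^2 \;\ge\; 0
\]
```

A student built on a different architecture shares no such theta_0 or coordinate system, which is consistent with the finding that traits do not cross model families.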

The Real-World Implications for AI Safety

These findings hold significant implications for AI safety in enterprise settings. The research uncovers a risk akin to data poisoning, where attackers manipulate training data to compromise models. However, unlike traditional data poisoning, subliminal learning is not targeted and does not require attackers to optimize the data. Instead, it can occur inadvertently, becoming a side effect of standard development practices.

Using large models to generate synthetic training data has become a mainstream, cost-effective practice; however, this study suggests that it could inadvertently “poison” new models. So what should companies that rely heavily on model-generated datasets do? One idea is to employ a “committee” of multiple generator models to minimize risk, but Cloud noted that this “may be prohibitively expensive.”

He then proposed a more operational approach grounded in the study’s findings. “Our results suggest that using different base models for the student and teacher models may be sufficient to prevent this phenomenon without needing multiple models,” he said.

For developers currently fine-tuning base models, Cloud offers a crucial and actionable checklist item. “If a developer is using a version of the same base model to generate their fine-tuning data, they should consider whether that version has other traits they do not want to transmit,” he explained. “If so, they should switch to a different model… If they are not using this training setup, then they may not need to make any changes.”
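
One way to operationalize that checklist item is a simple guard in the fine-tuning pipeline that refuses to run when the teacher generating the synthetic data and the student being fine-tuned share a base-model family. The family mapping and model names below are placeholders for illustration, not a real registry.

```python
# Hypothetical guard: block distillation when teacher and student share a base family.
# The mapping below is a placeholder; a real pipeline would maintain its own registry.
MODEL_FAMILY = {
    "gpt-4.1": "gpt-4.1",
    "gpt-4.1-nano": "gpt-4.1",
    "qwen2.5-7b": "qwen2.5",
    "qwen2.5-72b": "qwen2.5",
}

def check_distillation_pair(teacher: str, student: str) -> None:
    t, s = MODEL_FAMILY.get(teacher), MODEL_FAMILY.get(student)
    if t is None or s is None:
        raise ValueError("unknown model; add it to MODEL_FAMILY before training")
    if t == s:
        raise ValueError(
            f"teacher '{teacher}' and student '{student}' share base family '{t}': "
            "subliminal trait transfer is possible; use a different base model"
        )

check_distillation_pair("gpt-4.1-nano", "qwen2.5-7b")       # passes silently
try:
    check_distillation_pair("gpt-4.1-nano", "gpt-4.1")      # same family: rejected
except ValueError as err:
    print(err)
```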

The paper concludes that simple behavioral checks may not be enough to address the risks. “Our findings indicate a need for deeper safety evaluations than those focused merely on model behavior,” the researchers wrote.

For companies deploying models in high-risk fields like finance or healthcare, this raises a critical question: what new types of testing or monitoring measures are necessary? According to Cloud, there is currently no “one-size-fits-all solution,” and further research is needed. However, he proposed some feasible preliminary measures.

“A good starting point is to rigorously evaluate the models in scenarios as close to real deployment environments as possible,” Cloud said. He also noted that another option is to monitor behavior during deployment using other models, such as “constitutional classifiers,” although ensuring these methods can be scaled remains an “open question.”
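
As a tangible, deliberately toy version of that monitoring idea, the sketch below wraps a deployed model so that every response passes through a separate checker before reaching the user. The keyword screen stands in for a stronger learned classifier such as the “constitutional classifiers” Cloud mentions, and all names here are illustrative.

```python
from typing import Callable

def keyword_flagger(text: str) -> bool:
    """Trivial stand-in for a learned safety classifier."""
    flagged_terms = ("commit a crime", "incite violence")
    return any(term in text.lower() for term in flagged_terms)

def monitored_generate(
    generate: Callable[[str], str],
    flag: Callable[[str], bool],
    prompt: str,
) -> str:
    """Route every model response through a monitor before returning it."""
    response = generate(prompt)
    if flag(response):
        # In production this might log, block, or escalate to human review.
        return "[response withheld by output monitor]"
    return response

# Toy stand-in for the deployed model.
fake_model = lambda prompt: f"Echo: {prompt}"
print(monitored_generate(fake_model, keyword_flagger, "hello"))
```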