Why LLM Safety Guardrails Collapse After Fine-tuning

A Similarity Analysis Between Alignment and Fine-tuning Datasets

LLM Training Phases and Risk

Figure 1: Formation and vulnerability of safety guardrails in an LLM’s training pipeline

Overview

Recent advancements in large language models (LLMs) have underscored their vulnerability to safety alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Conversely, low similarity between these two types of datasets yields substantially more robust models, reducing the harmfulness score by up to 10.33%. By highlighting the importance of upstream dataset design in building durable safety guardrails and reducing real-world vulnerability to jailbreak attacks, these findings offer actionable insights for fine-tuning service providers.

Method Summary

For each example \(z\) in \(\mathcal{D}_\text{Downstream-Task}\), we select the top-\(n\) or bottom-\(n\) examples in \(\mathcal{D}_\text{Safety-Alignment}\) that maximize or minimize the cosine similarity between their representation features. Each feature is extracted from the final hidden state of the last token of an example's completion, denoted \(f(z) = \mathcal{M}(c_t|i, c_{\lt t}; \theta)\), where \(\mathcal{M}\) is the model without safety alignment. Accordingly, the selected High- and Low-similarity subsets can be denoted as:

$$\mathcal{D}_\text{High-sim} = \left\{ \text{Top-}n\big( \{ \langle f(z), f(z') \rangle \mid z' \in \mathcal{D}_\text{Safety-Alignment} \} \big) \;\middle|\; z \in \mathcal{D}_\text{Downstream-Task} \right\}$$

$$\mathcal{D}_\text{Low-sim} = \left\{ \text{Bottom-}n\big( \{ \langle f(z), f(z') \rangle \mid z' \in \mathcal{D}_\text{Safety-Alignment} \} \big) \;\middle|\; z \in \mathcal{D}_\text{Downstream-Task} \right\}\text{,}$$

where \(\langle f(z), f(z') \rangle\) denotes the cosine similarity between the representations of \(z\) and \(z'\). In short, we compute cosine similarity between upstream and downstream task representations, build high- and low-similarity safety-alignment subsets, and evaluate their impact on guardrail durability using the HEx-PHI benchmark (Qi et al.) and the Beaver-Dam-7B moderation model (PKU-Alignment Team).
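As a concrete illustration, the following is a minimal sketch (not the authors' released code) of how the last-token representations and the High-/Low-similarity subsets could be computed with Hugging Face Transformers. The placeholder model name, the one-example-at-a-time batching, and the per-example top-/bottom-\(k\) selection are assumptions made for brevity.

```python
# Minimal sketch: extract f(z) as the final hidden state of the last completion token,
# then pick the most / least similar safety-alignment examples per downstream example.
# The base model below is a placeholder; the paper's setting uses an unaligned base LLM.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for a model without safety alignment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

@torch.no_grad()
def feature(text: str) -> torch.Tensor:
    """f(z): final hidden state of the last token of the (instruction + completion) text."""
    inputs = tokenizer(text, return_tensors="pt")
    last_layer = model(**inputs).hidden_states[-1]   # (1, seq_len, hidden_dim)
    return last_layer[0, -1]                         # representation of the last token

def build_subsets(downstream, alignment, k=1):
    """Return (High-Sim, Low-Sim) alignment subsets, k alignment examples per downstream example."""
    d = torch.stack([feature(z) for z in downstream])    # (N_d, hidden_dim)
    a = torch.stack([feature(zp) for zp in alignment])   # (N_a, hidden_dim)
    sims = F.cosine_similarity(d.unsqueeze(1), a.unsqueeze(0), dim=-1)  # (N_d, N_a)
    top_idx = sims.topk(k, dim=1).indices                     # most similar alignment examples
    bot_idx = sims.topk(k, dim=1, largest=False).indices      # least similar alignment examples
    high_sim = {alignment[j] for row in top_idx.tolist() for j in row}
    low_sim = {alignment[j] for row in bot_idx.tolist() for j in row}
    return high_sim, low_sim
```

In the paper's setup, the selected subsets are then used for upstream safety alignment before downstream fine-tuning; the sketch simply returns them as sets of examples.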

Data Selection using Similarity Metric

Figure 2: Procedure for choosing a subset of safety-alignment data based on its similarity to downstream task data

Findings

High-similarity Tasks Harm Models’ Safety. Our results demonstrate that safety alignment with \(\texttt{High-Sim}\) data consistently leads to less robust safety behavior after fine-tuning. In contrast, \(\texttt{Low-Sim}\) models yield the most durable guardrails across both model scales and both downstream datasets. Specifically, whether fine-tuned on harmful or benign datasets, \(\texttt{Low-Sim}\) models consistently exhibited lower harmfulness metrics than \(\texttt{High-Sim}\) and \(\texttt{Random}\), with a difference in Harmfulness Score of up to 10.33%.

Upstream Plus Downstream Defenses Strengthen Guardrails More Than Either Alone. We also evaluated models in combination with two different downstream defense strategies. Our results suggest that, although these additional protection mechanisms can reinforce models’ safety guardrails against fine-tuning attacks, the contribution of upstream alignment is additive: \(\texttt{Low-Sim}\) yielded better safety than \(\texttt{High-Sim}\) irrespective of which downstream defense was in play.

High/Random/Low Similarity vs Safety Guardrail Jailbreak Risk

Figure 3: Impact of safety-alignment data similarity on LLM guardrail durability

Implications

Our findings reveal that representation similarity is not only a useful analytical tool but also a critical design consideration for safe LLM deployment. Publicly accessible or highly similar alignment datasets pose increased jailbreak risk due to overfitting, while private, dissimilar datasets can enhance safety by default. These insights inform dataset design and model selection strategies for fine-tuning service providers, offering a practical route to a safer deployment.

Pipeline Diagram

Figure 4: A simple pipeline that enables providers to make safer deployment decisions—either by rejecting unsafe fine-tuning requests or routing them to models aligned with more orthogonal data distributions.

In practice, fine-tuning service providers such as OpenAI and Anthropic can leverage our findings by computing representation similarity between upstream alignment corpora and candidate downstream datasets. Models whose alignment data is overly similar to the user-provided data in representation space can be flagged. Our proposed similarity-aware pipeline thus enables service providers to proactively reduce jailbreak risk, prior to fine-tuning, via dataset inspection and model selection.
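To make this concrete, below is a hedged sketch of such a screening gate. The aggregation rule (mean of each user example's closest alignment match) and the 0.8 threshold are illustrative assumptions rather than values from the paper, and `embed` stands for any representation function, e.g. the `feature` helper sketched earlier.

```python
# Sketch of a similarity-aware gate a fine-tuning service could run before training.
# `embed` maps an example to its representation vector; the threshold and the
# mean-of-max aggregation below are illustrative assumptions, not the paper's values.
import torch
import torch.nn.functional as F

def screen_request(user_data, alignment_data, embed, threshold=0.8):
    """Decide whether a fine-tuning request looks risky for the aligned model."""
    u = torch.stack([embed(z) for z in user_data])
    a = torch.stack([embed(z) for z in alignment_data])
    sims = F.cosine_similarity(u.unsqueeze(1), a.unsqueeze(0), dim=-1)  # (N_user, N_align)
    risk = sims.max(dim=1).values.mean().item()  # how close user data sits to the alignment corpus
    if risk >= threshold:
        # Reject the request, or route it to a model aligned on more orthogonal data.
        return "flag"
    return "accept"
```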

BibTeX

If you find our work helpful or inspiring to your research, please cite our paper as follows:

@article{hsiung2025safety,
  title={Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets},
  author={Hsiung, Lei and Pang, Tianyu and Tang, Yung-Chen and Song, Linyue and Ho, Tsung-Yi and Chen, Pin-Yu and Yang, Yaoqing},
  journal={arXiv preprint arXiv:2506.05346},
  year={2025}
}