This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles --- diversity, deliberative reasoning, and rigorous filtering --- STAR-1 aims to address the critical need for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select the training examples best aligned with these principles. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while incurring only a marginal decrease (an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs.
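The final filtering step above can be sketched as follows. This is a minimal illustration only: the threshold, score scale, and the `mock_judge` stand-in are hypothetical assumptions, not the paper's exact protocol (which uses a GPT-4o judge).

```python
# Illustrative sketch of score-based safety filtering. The judge here is a
# mock; in the actual pipeline a GPT-4o-based scorer would rate each sample.

def filter_by_safety_score(samples, score_fn, threshold=8):
    """Keep only samples whose judge score meets or exceeds the threshold."""
    return [s for s in samples if score_fn(s) >= threshold]

def mock_judge(sample):
    # Hypothetical stand-in for an LLM judge scoring a sample on a 1-10
    # scale; a real implementation would call a model API instead.
    return 9 if "refuse" in sample["answer"].lower() else 4

samples = [
    {"question": "How do I pick a lock?", "answer": "I must refuse to help with this."},
    {"question": "How do I pick a lock?", "answer": "Step 1: insert the tension wrench..."},
]
kept = filter_by_safety_score(samples, mock_judge, threshold=8)
```

Only the high-scoring (safe, policy-compliant) sample survives filtering; the dataset itself additionally pairs each retained answer with its deliberative reasoning trace.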
Figure 1. The average performance gap on safety and reasoning tasks across five model types between (1) the model trained on STAR-1 and the Instruct model (blue), and (2) the model trained on STAR-1 and the R1-distilled model (red). We observe that:
Observation 1: STAR-1 Substantially and Consistently Enhances LRMs' Safety Capabilities.
Table 1. Results of the instruction model (Instruct), the original R1-distilled LRM (R1 Distilled), and LRMs trained on our data (STAR-1) on safety and reasoning tasks.
Observation 2: STAR-1 Offers Minimal Compromise in LRMs' Reasoning Ability.
Table 2. LRMs trained on 1K randomly selected samples or the full SafeChain data, compared against LRMs trained on the medium-scoring (Med) or high-scoring (High) STAR-1 data.
Observation 3: Two Main Factors Make Strong Safety Training Data: the Deliberative Reasoning Process and the High-Scoring Filtering Protocol.
Table 3. Training LRMs or LLMs on safety data with or without the reasoning process (w/o think) on safety benchmarks.
Observation 4: Safety Reasoning Is Necessary for Training LRMs.
Observation 5: LLMs are NOT Tamed for Safety Reasoning Training Yet.
| Dataset | Num. of Samples | URL |
|---|---|---|
| STAR-1 | 1K | UCSC-VLAA/STAR-1 |
| STAR-41K | 41K | UCSC-VLAA/STAR-41K |
| STAR-benign-915 | 915 | UCSC-VLAA/STAR-benign-915 |
| Model | Type | URL |
|---|---|---|
| STAR1-R1-Distill-1.5B | R1-Distill-Qwen-1.5B trained on STAR-1 | UCSC-VLAA/STAR1-R1-Distill-1.5B |
| STAR1-R1-Distill-7B | R1-Distill-Qwen-7B trained on STAR-1 | UCSC-VLAA/STAR1-R1-Distill-7B |
| STAR1-R1-Distill-8B | R1-Distill-Llama-8B trained on STAR-1 | UCSC-VLAA/STAR1-R1-Distill-8B |
| STAR1-R1-Distill-14B | R1-Distill-Qwen-14B trained on STAR-1 | UCSC-VLAA/STAR1-R1-Distill-14B |
| STAR1-R1-Distill-32B | R1-Distill-Qwen-32B trained on STAR-1 | UCSC-VLAA/STAR1-R1-Distill-32B |
@article{wang2025star1saferalignmentreasoning,
title={STAR-1: Safer Alignment of Reasoning LLMs with 1K Data},
author={Zijun Wang and Haoqin Tu and Yuhan Wang and Juncheng Wu and Jieru Mei and Brian R. Bartoldson and Bhavya Kailkhura and Cihang Xie},
year={2025},
  journal={arXiv preprint arXiv:2504.01903}
}