STAR-1: Safer Alignment of Reasoning LLMs with 1K Data


1UC Santa Cruz, 2Google, 3Lawrence Livermore National Laboratory

Left: LRMs are vulnerable to malicious instructions. Middle: The generation pipeline of STAR-1. Each malicious instruction is tagged with a relevant safety category; DeepSeek-R1 then generates a safety reasoning trace and answer grounded in the policy's objective and rules. GPT-4o evaluates the outputs across three criteria, and low-scoring samples are discarded. Right: STAR-1 improves an LRM's safety abilities by guiding it to recall the relevant policies.

Abstract

This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles --- diversity, deliberative reasoning, and rigorous filtering --- STAR-1 aims to address the critical need for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select the training examples best aligned with these principles. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while incurring only a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs.
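The final stage of the pipeline described above, scoring each generated sample with a judge and discarding low scorers, can be sketched as a simple filter. The criterion names and the 1–10 scale below are illustrative assumptions, not the paper's exact rubric; only the idea of keeping samples that score highly on all three criteria comes from the source.

```python
# Sketch of the high-score filtering protocol: each candidate sample carries
# judge scores on three criteria, and a sample is kept only if it clears the
# threshold on every criterion. Criterion names and scale are hypothetical.

CRITERIA = ("safety", "policy_relevance", "response_quality")  # assumed names

def filter_high_scoring(samples, threshold=9):
    """Keep only samples whose judge scores meet `threshold` on all criteria."""
    return [s for s in samples
            if all(s["scores"][c] >= threshold for c in CRITERIA)]

# Toy candidates: sample 2 falls below threshold on one criterion.
candidates = [
    {"id": 1, "scores": {"safety": 10, "policy_relevance": 9,
                         "response_quality": 10}},
    {"id": 2, "scores": {"safety": 10, "policy_relevance": 6,
                         "response_quality": 8}},
]
kept = filter_high_scoring(candidates)
```

In this sketch only sample 1 survives; in the actual pipeline the scores would come from GPT-4o rather than being hand-written.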

Main Results: LRM Trained on STAR-1 vs. the Original LRM

Key Findings

Figure 1. The average performance gap between (1) the model trained on STAR-1 and the Instruct model (blue), and (2) the model trained on STAR-1 and the R1-distilled model (red), on both safety and reasoning tasks across five model types. We observe that:

Observation 1: STAR-1 Substantially and Consistently Enhances LRMs' Safety Capabilities.

Table 1. Results of the instruction model (Instruct), the original R1-distilled LRM (R1 Distilled), and LRMs trained on our data (STAR-1) on safety and reasoning tasks.

Observation 2: STAR-1 Offers Minimal Compromise to LRMs' Reasoning Ability.

A Closer Look at the Data Paradigm

Two Hidden Keys to "Less is More" in LM Safety Training

Table 2. LRMs trained on a randomly selected 1K subset or the full SafeChain data, compared with LRMs trained on the medium-scoring (Med) or high-scoring (High) STAR-1 data.

Observation 3: Two Main Factors Form Strong Safety Training Data: the Deliberative Reasoning Process and the High-Score Filtering Protocol.

The Role of Safety Reasoning in LRMs and LLMs

Table 3. Training LRMs or LLMs on safety data with or without the reasoning process (w/o think), evaluated on safety benchmarks.

Observation 4: Safety Reasoning is Necessary for Training LRMs

Observation 5: LLMs Are NOT Yet Tamed for Safety Reasoning Training.

Example of STAR-1 data

Dataset and Model Zoo

Dataset

Dataset Num. of Samples URL
STAR-1 1K UCSC-VLAA/STAR-1
STAR 41K 41K UCSC-VLAA/STAR-41K
STAR-benign-915 915 UCSC-VLAA/STAR-benign-915

Model

Model Type URL
STAR1-R1-Distill-1.5B R1-Distill-Qwen-1.5B trained on STAR-1 UCSC-VLAA/STAR1-R1-Distill-1.5B
STAR1-R1-Distill-7B R1-Distill-Qwen-7B trained on STAR-1 UCSC-VLAA/STAR1-R1-Distill-7B
STAR1-R1-Distill-8B R1-Distill-Llama-8B trained on STAR-1 UCSC-VLAA/STAR1-R1-Distill-8B
STAR1-R1-Distill-14B R1-Distill-Qwen-14B trained on STAR-1 UCSC-VLAA/STAR1-R1-Distill-14B
STAR1-R1-Distill-32B R1-Distill-Qwen-32B trained on STAR-1 UCSC-VLAA/STAR1-R1-Distill-32B

Acknowledgments

This work is partially supported by a gift from Open Philanthropy. We thank the NAIRR Pilot Program and the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.
LLNL co-authors were supported under Contract DE-AC52-07NA27344 with the U.S. Department of Energy and the LLNL-LDRD Program under Project No. 24-ERD-058. The United States Government retains, and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.

BibTeX


      @article{wang2025star1saferalignmentreasoning,
        title={STAR-1: Safer Alignment of Reasoning LLMs with 1K Data},
        author={Zijun Wang and Haoqin Tu and Yuhan Wang and Juncheng Wu and Jieru Mei and Brian R. Bartoldson and Bhavya Kailkhura and Cihang Xie},
        year={2025},
        journal={arXiv preprint arXiv:2504.01903}
      }