STAR-1: Safer Alignment of Reasoning LLMs with 1K Data


1UC Santa Cruz, 2Google, 3Lawrence Livermore National Laboratory

Left: LRMs are vulnerable to malicious instructions. Middle: The generation pipeline of STAR-1. Each malicious instruction is tagged with a relevant safety category; DeepSeek-R1 then generates a safety reasoning trace and answer grounded in the policy's objective and rules. GPT-4o evaluates the outputs across three criteria, and low-scoring samples are discarded. Right: STAR-1 improves an LRM's safety abilities by guiding it to recall the relevant policies.

Abstract

This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles --- diversity, deliberative reasoning, and rigorous filtering --- STAR-1 aims to address the critical need for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select the training examples best aligned with these principles. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while incurring only a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs.
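The final stage of the pipeline described above, scoring each generated sample with a judge and discarding low scorers, can be sketched as a simple filter. The criterion names and the 1–10 scale below are illustrative assumptions, not the paper's exact rubric; only the idea of keeping samples that score highly on all three criteria comes from the source.

```python
# Sketch of the high-score filtering protocol: each candidate sample carries
# judge scores on three criteria, and a sample is kept only if it clears the
# threshold on every criterion. Criterion names and scale are hypothetical.

CRITERIA = ("safety", "policy_relevance", "response_quality")  # assumed names

def filter_high_scoring(samples, threshold=9):
    """Keep only samples whose judge scores meet `threshold` on all criteria."""
    return [s for s in samples
            if all(s["scores"][c] >= threshold for c in CRITERIA)]

# Toy candidates: sample 2 falls below threshold on one criterion.
candidates = [
    {"id": 1, "scores": {"safety": 10, "policy_relevance": 9,
                         "response_quality": 10}},
    {"id": 2, "scores": {"safety": 10, "policy_relevance": 6,
                         "response_quality": 8}},
]
kept = filter_high_scoring(candidates)
```

In this sketch only sample 1 survives; in the actual pipeline the scores would come from GPT-4o rather than being hand-written.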

Main Results: LRM Trained on STAR-1 vs. the Original LRM

Key Findings

Figure 1. The average performance gap between (1) the model trained on STAR-1 and the Instruct model (blue), and (2) the model trained on STAR-1 and the R1-distilled model (red), on both safety and reasoning tasks across five model types. We observe that:

Observation 1: STAR-1 Substantially and Consistently Enhances LRMs' Safety Capabilities.

Table 1. Results of the instruction model (Instruct), the original R1-distilled LRM (R1 Distilled), and LRMs trained on our data (STAR-1) on safety and reasoning tasks.

Observation 2: STAR-1 Offers Minimal Compromise to LRMs' Reasoning Ability.

A Closer Look at the Data Paradigm

Two Hidden Keys to "Less is More" in LM Safety Training

Table 2. LRMs trained on a randomly selected 1K subset or the full SafeChain data, compared with LRMs trained on the medium-scoring (Med) or high-scoring (High) STAR-1 data.

Observation 3: Two Main Factors Form Strong Safety Training Data: the Deliberative Reasoning Process and the High-Score Filtering Protocol.

The Role of Safety Reasoning in LRMs and LLMs

Table 3. Training LRMs or LLMs on safety data with or without the reasoning process (w/o think), evaluated on safety benchmarks.

Observation 4: Safety Reasoning is Necessary for Training LRMs

Observation 5: LLMs Are NOT Yet Tamed for Safety Reasoning Training.

Example of STAR-1 data

Dataset and Model Zoo

Dataset

Dataset Num. of Samples URL
STAR-1 1K UCSC-VLAA/STAR-1
STAR 41K 41K UCSC-VLAA/STAR-41K
STAR-benign-915 915 UCSC-VLAA/STAR-benign-915

Model

Model Type URL
STAR1-R1-Distill-1.5B R1-Distill-Qwen-1.5B trained on STAR-1 UCSC-VLAA/STAR1-R1-Distill-1.5B
STAR1-R1-Distill-7B R1-Distill-Qwen-7B trained on STAR-1 UCSC-VLAA/STAR1-R1-Distill-7B
STAR1-R1-Distill-8B R1-Distill-Llama-8B trained on STAR-1 UCSC-VLAA/STAR1-R1-Distill-8B
STAR1-R1-Distill-14B R1-Distill-Qwen-14B trained on STAR-1 UCSC-VLAA/STAR1-R1-Distill-14B
STAR1-R1-Distill-32B R1-Distill-Qwen-32B trained on STAR-1 UCSC-VLAA/STAR1-R1-Distill-32B

Acknowledgments

This work is partially supported by a gift from Open Philanthropy. We thank the NAIRR Pilot Program and the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.
LLNL co-authors were supported under Contract DE-AC52-07NA27344 with the U.S. Department of Energy and the LLNL-LDRD Program under Project No. 24-ERD-058. The United States Government retains, and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.

BibTeX


      @article{wang2025star1saferalignmentreasoning,
        title={STAR-1: Safer Alignment of Reasoning LLMs with 1K Data},
        author={Zijun Wang and Haoqin Tu and Yuhan Wang and Juncheng Wu and Jieru Mei and Brian R. Bartoldson and Bhavya Kailkhura and Cihang Xie},
        year={2025},
        journal={arXiv preprint arXiv:2504.01903}
      }