
Predict, Don't React: Value-Based Safety Forecasting for LLM Streaming

Authors: Pride Kavumba, Koki Wataoka, Huy H. Nguyen, Jiaxuan Li, Masaya Ohagi

Published: 2026-04-05

arXiv ID: 2604.03962v1

Added to Library: 2026-04-07 02:02 UTC

Safety

📄 Abstract

In many practical LLM deployments, a single guardrail is used for both prompt and response moderation. Prompt moderation operates on fully observed text, whereas streaming response moderation requires safety decisions to be made over partial generations. Existing text-based streaming guardrails commonly frame this output-side problem as boundary detection, training models to identify the earliest prefix at which a response has already become unsafe. In this work, we introduce StreamGuard, a unified model-agnostic streaming guardrail that instead formulates moderation as a forecasting problem: given a partial prefix, the model predicts the expected harmfulness of likely future continuations. We supervise this prediction using Monte Carlo rollouts, which enables early intervention without requiring exact token-level boundary annotations. Across standard safety benchmarks, StreamGuard performs strongly both for input moderation and for streaming output moderation. At the 8B scale, StreamGuard improves aggregated input-moderation F1 from 86.7 to 88.2 and aggregated streaming output-moderation F1 from 80.4 to 81.9 relative to Qwen3Guard-Stream-8B-strict. On the QWENGUARDTEST response_loc streaming benchmark, StreamGuard reaches 97.5 F1, 95.1 recall, and 92.6% on-time intervention, compared to 95.9 F1, 92.1 recall, and 89.9% for Qwen3Guard-Stream-8B-strict, while reducing the miss rate from 7.9% to 4.9%. We further show that forecasting-based supervision transfers effectively across tokenizers and model families: with transferred targets, Gemma3-StreamGuard-1B reaches 81.3 response-moderation F1, 98.2 streaming F1, and a 3.5% miss rate. These results show that strong end-to-end streaming moderation can be obtained without exact boundary labels, and that forecasting future risk is an effective supervision strategy for low-latency safety intervention.
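
The abstract's core supervision idea, estimating the expected harmfulness of likely continuations via Monte Carlo rollouts, can be illustrated with a minimal sketch. The function names, the [0, 1] harm scale, and the rollout count below are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of forecasting-style supervision via Monte Carlo rollouts.
# `sample_continuation` and `harm_score` are illustrative callables, not the
# paper's API: the first samples a continuation from the generation model
# being guarded, the second scores a complete response with an offline
# safety classifier on a [0, 1] scale.
from statistics import mean

def rollout_harm_target(prefix: str, sample_continuation, harm_score,
                        num_rollouts: int = 8) -> float:
    """Estimate the expected harmfulness of likely continuations of `prefix`.

    The mean rollout score serves as the training target for the streaming
    guardrail at this prefix, so no token-level boundary label is needed.
    """
    scores = [harm_score(prefix + sample_continuation(prefix))
              for _ in range(num_rollouts)]
    return mean(scores)
```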

🔍 Key Points

  • Introduction of StreamGuard, a model-agnostic streaming guardrail that treats moderation as a forecasting problem instead of boundary detection.
  • Improved input and output moderation on standard safety benchmarks relative to existing guardrails, reaching an aggregated input-moderation F1 of 88.2 and an aggregated streaming output-moderation F1 of 81.9 with reduced miss rates.
  • Use of Monte Carlo rollouts to supervise training, enabling early intervention in streaming contexts without exact token-level boundary annotations and simplifying label construction (an inference-time intervention sketch follows this list).
  • Demonstration that forecasting-based supervision transfers effectively across tokenizers and model families: with transferred targets, Gemma3-StreamGuard-1B reaches a 3.5% miss rate, supporting real-time moderation with smaller models.
  • Evidence that strong end-to-end streaming moderation can be obtained without exact boundary labels, showing that forecasting future risk is an effective supervision strategy for low-latency safety intervention.
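
At inference time, a forecast of this kind could drive early intervention during streaming, as sketched below. The `guard.forecast_risk` interface and the 0.5 threshold are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of a streaming moderation loop driven by a risk forecast.
# `guard` is assumed to expose forecast_risk(prefix) -> float in [0, 1],
# the predicted expected harmfulness of likely future continuations of the
# partial response seen so far (a StreamGuard-style forecast).
from typing import Iterable, Iterator

def moderated_stream(chunks: Iterable[str], guard,
                     threshold: float = 0.5) -> Iterator[str]:
    """Yield streamed chunks until the forecasted risk exceeds the threshold."""
    prefix = ""
    for chunk in chunks:
        prefix += chunk
        if guard.forecast_risk(prefix) >= threshold:
            # Intervene before an unsafe continuation completes.
            yield "[response withheld by streaming guardrail]"
            return
        yield chunk
```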

💡 Why This Paper Matters

The paper presents significant advances in input and output moderation for large language models (LLMs): the StreamGuard framework improves safety while retaining low-latency intervention. Its shift from strict boundary detection to forecasting future risk could strengthen real-time safety protocols in AI deployments. Because the forecasting-based supervision transfers across models and tokenizer families, the approach applies to diverse deployments and supports safer AI systems in practice.

🎯 Why It's Interesting for AI Security Researchers

This paper is of interest to AI security researchers because it directly addresses the challenge of moderating AI-generated content in real time. The proposed forecasting framework could inform future work on safety mechanisms in AI systems, and the demonstrated reduction in harmful outputs without sacrificing moderation performance contributes to broader discussions on alignment, ethics, and responsible AI deployment.

📚 Read the Full Paper