<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>AI Safety | Xuhong Wang</title><link>https://wangxuhongcn.github.io/en/tags/ai-safety/</link><atom:link href="https://wangxuhongcn.github.io/en/tags/ai-safety/index.xml" rel="self" type="application/rss+xml"/><description>AI Safety</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 10 Feb 2026 00:00:00 +0000</lastBuildDate><image><url>https://wangxuhongcn.github.io/media/icon_hu_982c5d63a71b2961.png</url><title>AI Safety</title><link>https://wangxuhongcn.github.io/en/tags/ai-safety/</link></image><item><title>SafeVerse: Building a Safe and Trustworthy Digital Twin Arena for Embodied AI</title><link>https://wangxuhongcn.github.io/en/post/safeverse/</link><pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate><guid>https://wangxuhongcn.github.io/en/post/safeverse/</guid><description>
&lt;details class="print:hidden xl:hidden" open>
&lt;summary>Table of Contents&lt;/summary>
&lt;div class="text-sm">
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#abstract">Abstract&lt;/a>&lt;/li>
&lt;li>&lt;a href="#why-safeverse-matters">Why SafeVerse Matters&lt;/a>&lt;/li>
&lt;li>&lt;a href="#three-core-breakthroughs">Three Core Breakthroughs&lt;/a>&lt;/li>
&lt;li>&lt;a href="#from-ordinary-video-to-an-interactive-twin-world">From Ordinary Video to an Interactive Twin World&lt;/a>&lt;/li>
&lt;li>&lt;a href="#editing-the-scene-with-attack-instructions">Editing the Scene with Attack Instructions&lt;/a>&lt;/li>
&lt;li>&lt;a href="#online-evolution-against-discovered-vulnerabilities">Online Evolution Against Discovered Vulnerabilities&lt;/a>&lt;/li>
&lt;li>&lt;a href="#full-safeverse-demo">Full SafeVerse Demo&lt;/a>&lt;/li>
&lt;li>&lt;a href="#what-this-work-shows">What This Work Shows&lt;/a>&lt;/li>
&lt;li>&lt;a href="#related-links">Related Links&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/div>
&lt;/details>
&lt;h2 id="abstract">Abstract&lt;/h2>
&lt;p>Safe and trustworthy embodied AI needs realistic testing grounds, but running attack-and-defense drills directly in the physical world is expensive, risky, and hard to reproduce. SafeVerse addresses this gap by first digitizing a specified real-world environment, then turning that twin world into an editable arena for safety evaluation, adversarial testing, and online reinforcement learning.&lt;/p>
&lt;p>Unlike world models that mainly aim to generate plausible open-ended environments, SafeVerse focuses on reconstructing a particular real scene with low cost and high controllability. Its emphasis is not just realism, but operational realism: the reconstructed world should be interactive, editable, and useful for agent training and verification.&lt;/p>
&lt;h2 id="why-safeverse-matters">Why SafeVerse Matters&lt;/h2>
&lt;p>Existing embodied environments often fall into one of two extremes:&lt;/p>
&lt;ul>
&lt;li>Traditional simulators rely heavily on manual asset construction, offer limited interactive objects, and struggle to reflect the diversity of real-world scenes.&lt;/li>
&lt;li>Generative world models can imagine rich environments, but they are not faithful twins of a user-specified home, office, or factory floor, so they are hard to use for targeted security drills.&lt;/li>
&lt;/ul>
&lt;p>SafeVerse starts from a more practical premise: what safety testing needs is not an imagined world, but a controllable digital twin of a real one. It therefore builds a three-step loop:&lt;/p>
&lt;ol>
&lt;li>Reconstruct a real environment from video.&lt;/li>
&lt;li>Edit that environment according to attack or evaluation goals.&lt;/li>
&lt;li>Let the agent evolve online under continual adversarial pressure.&lt;/li>
&lt;/ol>
&lt;h2 id="three-core-breakthroughs">Three Core Breakthroughs&lt;/h2>
&lt;p>The system is organized around three main capabilities:&lt;/p>
&lt;ul>
&lt;li>&lt;code>Real-world Ctrl+C / Ctrl+V&lt;/code>: it tries to preserve not only appearance, but also structure, semantics, and interaction logic.&lt;/li>
&lt;li>&lt;code>Minute-level construction with operable objects&lt;/code>: a short video can become an interactive 3D scene where doors open, lights switch, and furniture moves.&lt;/li>
&lt;li>&lt;code>Unified evaluation, attack, and evolution&lt;/code>: the same environment can support testing, adversarial scene mutation, and online RL-based agent improvement.&lt;/li>
&lt;/ul>
&lt;p>This makes SafeVerse less a standalone simulator than a piece of digital twin infrastructure for trustworthy embodied intelligence.&lt;/p>
&lt;h2 id="from-ordinary-video-to-an-interactive-twin-world">From Ordinary Video to an Interactive Twin World&lt;/h2>
&lt;p>The first step in SafeVerse is to make the system actually understand the source video. Instead of leaning on a purely geometric 3D optimization pipeline, it uses multimodal understanding to parse objects, layouts, and scene semantics, then maps them into operable 3D entities.&lt;/p>
&lt;p>Built on top of Minecraft and its rich physical interaction rules, SafeVerse converts recognized scene elements into 3D objects with explicit interaction affordances. The result is not a static reconstructed set, but a sandbox where an embodied agent can enter, move, manipulate, and explore.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >&lt;img alt="Scene reconstruction example 1"
src="https://wangxuhongcn.github.io/en/post/safeverse/scene-build-1.gif"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >&lt;img alt="Scene reconstruction example 2"
src="https://wangxuhongcn.github.io/en/post/safeverse/scene-build-2.gif"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >&lt;img alt="Scene reconstruction example 3"
src="https://wangxuhongcn.github.io/en/post/safeverse/scene-build-3.gif"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >&lt;img alt="Scene reconstruction example 4"
src="https://wangxuhongcn.github.io/en/post/safeverse/scene-build-4.gif"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>These four GIFs illustrate the pipeline from input video to interactive 3D environment. The key point of the original webpage is that SafeVerse makes &lt;code>video in, operable twin out&lt;/code> practical at minute-level turnaround.&lt;/p>
&lt;h2 id="editing-the-scene-with-attack-instructions">Editing the Scene with Attack Instructions&lt;/h2>
&lt;p>Reconstruction alone is not enough for safety validation. The environment also needs to change in response to specific attack goals.&lt;/p>
&lt;p>SafeVerse emphasizes a combination of realism and editability. Once a twin scene has been built, it can be modified directly for attack-and-defense scenarios, including:&lt;/p>
&lt;ul>
&lt;li>changing interaction properties, such as turning an ordinary door into one that must be unlocked first&lt;/li>
&lt;li>altering semantic cues, such as changing an object&amp;rsquo;s appearance to mislead recognition&lt;/li>
&lt;li>perturbing spatial layout, such as repositioning furniture or obstacles to break a navigation plan&lt;/li>
&lt;/ul>
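&lt;p>The three kinds of edits listed above can be pictured as small transformations on a scene description. The sketch below is only an illustration under stated assumptions: the &lt;code>SceneObject&lt;/code> schema and the &lt;code>lock&lt;/code>, &lt;code>relabel&lt;/code>, and &lt;code>move&lt;/code> helpers are hypothetical names chosen for this example, not SafeVerse APIs.&lt;/p>

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneObject:
    """One operable object in a twin scene (hypothetical schema)."""
    name: str
    label: str             # semantic label a perception model would read
    position: tuple        # (x, y, z) grid coordinates
    locked: bool = False   # interaction property


def lock(obj):
    """Interaction edit: the object must be unlocked before use."""
    return replace(obj, locked=True)


def relabel(obj, new_label):
    """Semantic edit: change the apparent label to mislead recognition."""
    return replace(obj, label=new_label)


def move(obj, new_position):
    """Layout edit: reposition the object, for example into a planned path."""
    return replace(obj, position=new_position)


# Baseline twin scene (illustrative values only).
scene = {
    "door": SceneObject("door", "door", (0, 0, 5)),
    "stove": SceneObject("stove", "stove", (3, 0, 1)),
    "chair": SceneObject("chair", "chair", (4, 0, 4)),
}

# Apply one edit of each kind to a copy, leaving the baseline untouched.
attacked = dict(scene)
attacked["door"] = lock(scene["door"])
attacked["stove"] = relabel(scene["stove"], "cabinet")
attacked["chair"] = move(scene["chair"], (2, 0, 5))

changed = sorted(k for k in scene if scene[k] != attacked[k])
print(changed)  # ['chair', 'door', 'stove']
```

&lt;p>Keeping the objects immutable means every attacked variant is a separate copy, so the unmodified twin stays available as a baseline for comparison.&lt;/p>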
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Attack-driven scene editing"
srcset="https://wangxuhongcn.github.io/en/post/safeverse/featured_hu_1597f8ce1ab506ff.webp 320w, https://wangxuhongcn.github.io/en/post/safeverse/featured_hu_f2ee097622fe69bc.webp 480w, https://wangxuhongcn.github.io/en/post/safeverse/featured_hu_d50f034e7c86e48f.webp 746w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/safeverse/featured_hu_1597f8ce1ab506ff.webp"
width="746"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>This allows attack vectors to be injected into the environment itself, creating more realistic and better-targeted stress tests for embodied agents.&lt;/p>
&lt;h2 id="online-evolution-against-discovered-vulnerabilities">Online Evolution Against Discovered Vulnerabilities&lt;/h2>
&lt;p>SafeVerse does not stop at evaluation. It pushes the loop one step further toward online evolution.&lt;/p>
&lt;p>Conventional embodied training often depends on fixed datasets and static scenes. When a new attack or environmental shift appears, agents can fail catastrophically. SafeVerse tries to solve this through a reconstruction-attack-defense loop: rebuild the scene, perturb it dynamically, and retrain the agent immediately after failure.&lt;/p>
&lt;p>That means the agent is no longer tested in a frozen benchmark. It must adapt to changing layouts, newly inserted obstacles, altered object states, and other evolving threats in real time.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >&lt;img alt="Online evolution example: removing the chair to complete the task"
src="https://wangxuhongcn.github.io/en/post/safeverse/online-evolution.gif"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>One example from the original page is especially clear: when a chair blocks the only path to the goal, the agent initially fails. After online training, it learns to recognize the obstacle, reroute, or even move the chair away. The point is not merely that the agent runs into a failure, but that it learns from that failure.&lt;/p>
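&lt;p>A toy grid makes the chair example concrete. This sketch is not the online RL procedure used by SafeVerse; it uses plain breadth-first search on a hypothetical room layout, only to show why a single misplaced object can make the goal unreachable and why moving it restores the route.&lt;/p>

```python
from collections import deque


def shortest_path_length(grid, start, goal):
    """Breadth-first search over free cells; returns the step count or None."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = nr in range(rows) and nc in range(cols)
            # "#" is a wall and "C" is the chair; both block movement.
            if inside and grid[nr][nc] not in "#C" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def remove_chair(grid):
    """Return a copy of the grid with the chair cell cleared."""
    return [row.replace("C", ".") for row in grid]


# Hypothetical room: S = start, G = goal, the chair sits in the only doorway.
room = [
    "S..#.",
    "##C##",
    "....G",
]

blocked = shortest_path_length(room, (0, 0), (2, 4))
cleared = shortest_path_length(remove_chair(room), (0, 0), (2, 4))
print(blocked)  # None
print(cleared)  # 6
```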
&lt;h2 id="full-safeverse-demo">Full SafeVerse Demo&lt;/h2>
&lt;p>This video reproduces the end-to-end demo referenced in the original webpage and shows how reconstruction, attack-oriented editing, and online evolution work together.&lt;/p>
&lt;video controls style="width: 100%; max-width: 960px; margin: 0 auto; display: block;">
&lt;source src="https://wangxuhongcn.github.io/media/safeverse.mp4" type="video/mp4">
Your browser does not support the video tag.
&lt;/video>
&lt;h2 id="what-this-work-shows">What This Work Shows&lt;/h2>
&lt;p>SafeVerse is not just another embodied simulator. Its main contribution is that it connects three capabilities into one loop: fast digitization of a specified real scene, attack-oriented scene editing, and online RL-based agent evolution.&lt;/p>
&lt;p>Many embodied platforms are good at offering training spaces. SafeVerse is more specifically about offering safety drill spaces. It turns real-scene digitization into a working capability for safe and trustworthy embodied AI research.&lt;/p>
&lt;h2 id="related-links">Related Links&lt;/h2>
&lt;ul>
&lt;li>GitHub:
&lt;/li>
&lt;/ul></description></item><item><title>SafeWork-R1: Co-Evolving Intelligence and Safety under the AI-45 Degree Law</title><link>https://wangxuhongcn.github.io/en/post/safework-r1/</link><pubDate>Sat, 12 Jul 2025 00:00:00 +0000</pubDate><guid>https://wangxuhongcn.github.io/en/post/safework-r1/</guid><description>
&lt;details class="print:hidden xl:hidden" open>
&lt;summary>Table of Contents&lt;/summary>
&lt;div class="text-sm">
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#abstract">Abstract&lt;/a>&lt;/li>
&lt;li>&lt;a href="#why-revisit-the-relationship-between-safety-and-capability">Why Revisit the Relationship Between Safety and Capability&lt;/a>&lt;/li>
&lt;li>&lt;a href="#safety-and-general-capability-in-safework-r1">Safety and General Capability in SafeWork-R1&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-safeladder-technical-roadmap">The SafeLadder Technical Roadmap&lt;/a>&lt;/li>
&lt;li>&lt;a href="#core-functional-highlights">Core Functional Highlights&lt;/a>&lt;/li>
&lt;li>&lt;a href="#discussion-and-outlook">Discussion and Outlook&lt;/a>&lt;/li>
&lt;li>&lt;a href="#related-links">Related Links&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/div>
&lt;/details>
&lt;h2 id="abstract">Abstract&lt;/h2>
&lt;p>SafeWork-R1 is built around a specific claim: safety and capability do not have to trade off against each other. Instead of treating safety as a lightweight filter added after reasoning, this work introduces &lt;code>SafeLadder&lt;/code>, a training framework that attempts to make safety part of the model&amp;rsquo;s internal reasoning ability.&lt;/p>
&lt;p>The resulting model, &lt;code>SafeWork-R1&lt;/code>, is not just safer in the sense of being more conservative. It improves safety benchmark performance by &lt;code>46.54%&lt;/code> over &lt;code>Qwen2.5-VL-72B&lt;/code>, while also improving average performance by &lt;code>13.45%&lt;/code> across seven general reasoning and multimodal benchmarks. The central message is therefore not &lt;code>safety instead of ability&lt;/code>, but &lt;code>safety together with ability&lt;/code>.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="The AI-45 law view of joint improvement"
srcset="https://wangxuhongcn.github.io/en/post/safework-r1/featured_hu_6189b71377bc8edd.webp 320w, https://wangxuhongcn.github.io/en/post/safework-r1/featured_hu_b928d5368833a6c9.webp 480w, https://wangxuhongcn.github.io/en/post/safework-r1/featured_hu_de726724e02fd12e.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/safework-r1/featured_hu_6189b71377bc8edd.webp"
width="760"
height="313"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="why-revisit-the-relationship-between-safety-and-capability">Why Revisit the Relationship Between Safety and Capability&lt;/h2>
&lt;p>Reasoning-oriented large models have advanced rapidly in recent years, but the gap between capability and safety has also widened. Better problem solving does not automatically imply stronger compliance with ethical constraints, social norms, or trustworthy deployment requirements.&lt;/p>
&lt;p>SafeWork-R1 is motivated by what the original page calls the &lt;code>AI-45 Degree Law&lt;/code>: the desirable direction for model development is not pure capability growth on a single axis, but coordinated improvement along both capability and safety.&lt;/p>
&lt;p>The main argument is straightforward: if the base model is strong enough and the training process is designed properly, safety and general competence need not be a zero-sum game.&lt;/p>
&lt;h2 id="safety-and-general-capability-in-safework-r1">Safety and General Capability in SafeWork-R1&lt;/h2>
&lt;p>SafeWork-R1 is built on the SafeLadder framework, whose goal is to deeply integrate safety mechanisms into the native ability structure of multimodal models, rather than relying on superficial post hoc refusal layers.&lt;/p>
&lt;p>Key results reported on the original page include:&lt;/p>
&lt;ul>
&lt;li>an average &lt;code>46.54%&lt;/code> gain on safety benchmarks over &lt;code>Qwen2.5-VL-72B&lt;/code>&lt;/li>
&lt;li>an average &lt;code>13.45%&lt;/code> improvement across seven general benchmarks: &lt;code>MMMU&lt;/code>, &lt;code>MathVista&lt;/code>, &lt;code>GPQA&lt;/code>, &lt;code>Olympiad&lt;/code>, &lt;code>Gaokao-MM&lt;/code>, &lt;code>IFEVAL&lt;/code>, and &lt;code>MM-IFEval&lt;/code>&lt;/li>
&lt;li>&lt;code>70.94&lt;/code> on &lt;code>MMMU&lt;/code>, &lt;code>76.1&lt;/code> on &lt;code>MathVista&lt;/code>, and &lt;code>78.17&lt;/code> on &lt;code>Gaokao-MM&lt;/code>&lt;/li>
&lt;li>successful transfer of SafeLadder to additional models such as &lt;code>SafeWork-R1-InternVL-78B&lt;/code>, &lt;code>SafeWork-R1-DeepSeek-70B&lt;/code>, and &lt;code>SafeWork-R1-QwenVL-7B&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="A case study of SafeWork-R1&amp;rsquo;s safety deliberation"
srcset="https://wangxuhongcn.github.io/en/post/safework-r1/safety-deliberation_hu_48d5b1244253a7f2.webp 320w, https://wangxuhongcn.github.io/en/post/safework-r1/safety-deliberation_hu_64ac7d475ca59850.webp 480w, https://wangxuhongcn.github.io/en/post/safework-r1/safety-deliberation_hu_f2d59e7a3966bda5.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/safework-r1/safety-deliberation_hu_48d5b1244253a7f2.webp"
width="760"
height="307"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Representation analysis and safety mutual information peaks"
srcset="https://wangxuhongcn.github.io/en/post/safework-r1/representation-analysis_hu_e61b52c47fdb8d7e.webp 320w, https://wangxuhongcn.github.io/en/post/safework-r1/representation-analysis_hu_985319f5589cde51.webp 480w, https://wangxuhongcn.github.io/en/post/safework-r1/representation-analysis_hu_260cef9ab33b2896.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/safework-r1/representation-analysis_hu_e61b52c47fdb8d7e.webp"
width="760"
height="206"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Benchmark results on safety and capability"
srcset="https://wangxuhongcn.github.io/en/post/safework-r1/benchmark-results_hu_272c4801f0367c5c.webp 320w, https://wangxuhongcn.github.io/en/post/safework-r1/benchmark-results_hu_259c4a607ade9c36.webp 480w, https://wangxuhongcn.github.io/en/post/safework-r1/benchmark-results_hu_b9d7bb885611c018.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/safework-r1/benchmark-results_hu_272c4801f0367c5c.webp"
width="760"
height="430"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Taken together, these numbers suggest that SafeWork-R1 is not merely optimized for safety metrics at the expense of open-ended performance. It tries to lift both.&lt;/p>
&lt;h2 id="the-safeladder-technical-roadmap">The SafeLadder Technical Roadmap&lt;/h2>
&lt;p>SafeLadder uses a structured and progressive reinforcement-learning-based post-training pipeline to internalize safety as part of model capability. The original webpage breaks it into four stages:&lt;/p>
&lt;ol>
&lt;li>&lt;code>CoT-SFT&lt;/code>: chain-of-thought supervised fine-tuning as a cold start for long-form reasoning.&lt;/li>
&lt;li>&lt;code>M³-RL&lt;/code>: multimodal, multitask, multi-objective RL that progressively aligns safety, values, knowledge reliability, and general capability.&lt;/li>
&lt;li>&lt;code>Safe-and-Efficient RL&lt;/code>: reducing overthinking and treating reasoning efficiency itself as part of safety.&lt;/li>
&lt;li>&lt;code>Deliberative Search RL&lt;/code>: enabling the model to retrieve, verify, and filter external information during answering.&lt;/li>
&lt;/ol>
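&lt;p>One common way to combine several training objectives and discourage overlong reasoning is a weighted sum of per-objective scores with a penalty for tokens beyond a budget. The sketch below illustrates that general idea only. The linear weighting, the token budget, and every number in it are assumptions made for this example; the source does not describe the reward design actually used in SafeLadder.&lt;/p>

```python
def combined_reward(scores, weights, num_tokens, token_budget, length_penalty):
    """Weighted sum of objective scores minus a penalty for overlong reasoning."""
    base = sum(weights[name] * scores[name] for name in weights)
    overflow = max(0, num_tokens - token_budget)  # tokens past the budget
    return base - length_penalty * overflow


# Hypothetical weights and scores for the four aspects named in stage 2.
weights = {"safety": 0.4, "value": 0.2, "knowledge": 0.2, "general": 0.2}
scores = {"safety": 1.0, "value": 0.5, "knowledge": 1.0, "general": 0.5}

# Same answer quality, different reasoning lengths.
concise = combined_reward(scores, weights, num_tokens=800,
                          token_budget=1000, length_penalty=0.001)
verbose = combined_reward(scores, weights, num_tokens=1500,
                          token_budget=1000, length_penalty=0.001)

print(concise > verbose)  # True
```

&lt;p>With identical objective scores, the response that stays within the budget receives the higher reward, which is one simple way to express the idea that reasoning efficiency counts toward the training signal.&lt;/p>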
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="The SafeLadder training roadmap"
srcset="https://wangxuhongcn.github.io/en/post/safework-r1/training-roadmap_hu_b9f61564086d8eba.webp 320w, https://wangxuhongcn.github.io/en/post/safework-r1/training-roadmap_hu_d572f0f053a95ec6.webp 480w, https://wangxuhongcn.github.io/en/post/safework-r1/training-roadmap_hu_9b00b1a36a27f2.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/safework-r1/training-roadmap_hu_b9f61564086d8eba.webp"
width="760"
height="172"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The page also mentions a scalable RL infrastructure named &lt;code>SafeWork-T1&lt;/code>, designed for thousand-GPU-scale training with multiple validators and modular verification components.&lt;/p>
&lt;h2 id="core-functional-highlights">Core Functional Highlights&lt;/h2>
&lt;p>SafeWork-R1 is not only about safer outputs. It also emphasizes trustworthy reasoning and interaction. The webpage highlights three major capabilities:&lt;/p>
&lt;ul>
&lt;li>&lt;code>Deliberative Search&lt;/code>: combining calibration and search so the model can verify and refine its own answer through RL-based multi-step reflection.&lt;/li>
&lt;li>&lt;code>Inference-Time Alignment&lt;/code>: bringing professional value models into the generation process to constrain intermediate reasoning and final responses.&lt;/li>
&lt;li>&lt;code>Human Intervention on Chain-of-Thought&lt;/code>: allowing users or supervisors to directly correct flawed reasoning steps so the model better aligns with desired logic, style, and values.&lt;/li>
&lt;/ul>
&lt;p>Together, these features show that the goal is not only to stop harmful behavior, but to produce reasoning processes that are themselves more reliable and controllable.&lt;/p>
&lt;h2 id="discussion-and-outlook">Discussion and Outlook&lt;/h2>
&lt;p>The original page closes with several takeaways that are likely to matter beyond this specific model:&lt;/p>
&lt;ul>
&lt;li>&lt;code>Safety and capability are not necessarily zero-sum&lt;/code>: with the right training design, they can co-evolve.&lt;/li>
&lt;li>&lt;code>Reasoning efficiency is closely tied to safety&lt;/code>: overly long and redundant chains of thought can themselves introduce security and alignment risks.&lt;/li>
&lt;li>&lt;code>Trustworthy interaction remains a long-term frontier&lt;/code>: future work needs better error correction, test-time adaptation, language calibration, and norm-aware interaction.&lt;/li>
&lt;/ul>
&lt;p>The broader significance of SafeWork-R1 is therefore not just that it releases a strong model, but that it presents a training path where safety is treated as part of reasoning ability rather than a patch applied after the fact.&lt;/p>
&lt;h2 id="related-links">Related Links&lt;/h2>
&lt;ul>
&lt;li>Paper:
&lt;/li>
&lt;/ul></description></item></channel></rss>