<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Embodied AI | Xuhong Wang</title><link>https://wangxuhongcn.github.io/en/tags/embodied-ai/</link><atom:link href="https://wangxuhongcn.github.io/en/tags/embodied-ai/index.xml" rel="self" type="application/rss+xml"/><description>Embodied AI</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Wed, 18 Mar 2026 00:00:00 +0000</lastBuildDate><image><url>https://wangxuhongcn.github.io/media/icon_hu_982c5d63a71b2961.png</url><title>Embodied AI</title><link>https://wangxuhongcn.github.io/en/tags/embodied-ai/</link></image><item><title>NaviMaster: The First Unified Navigation Model Across Digital and Physical Worlds, with Minecraft Support</title><link>https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/</guid><description>
&lt;details class="print:hidden xl:hidden" open>
&lt;summary>Table of Contents&lt;/summary>
&lt;div class="text-sm">
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#abstract">Abstract&lt;/a>&lt;/li>
&lt;li>&lt;a href="#task-demos">Task Demos&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#spatial-grounding">Spatial Grounding&lt;/a>&lt;/li>
&lt;li>&lt;a href="#gui-navigation">GUI Navigation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#mixed-gui-and-embodied-interaction">Mixed GUI and Embodied Interaction&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#why-unify-gui-and-embodied-navigation">Why Unify GUI and Embodied Navigation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#three-core-innovations">Three Core Innovations&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#1-a-unified-visual-goal-trajectory-formulation">1. A Unified Visual-Goal Trajectory Formulation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#2-a-unified-reinforcement-learning-framework">2. A Unified Reinforcement Learning Framework&lt;/a>&lt;/li>
&lt;li>&lt;a href="#3-distance-aware-dense-rewards">3. Distance-Aware Dense Rewards&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#experimental-highlights">Experimental Highlights&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#gui-navigation-1">GUI Navigation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#spatial-grounding-1">Spatial Grounding&lt;/a>&lt;/li>
&lt;li>&lt;a href="#embodied-navigation">Embodied Navigation&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#deeper-analysis">Deeper Analysis&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#mixing-ratio">Mixing Ratio&lt;/a>&lt;/li>
&lt;li>&lt;a href="#gains-across-backbones">Gains Across Backbones&lt;/a>&lt;/li>
&lt;li>&lt;a href="#data-scale-and-reward-design">Data Scale and Reward Design&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#navimaster-as-the-opening-of-navigation-agents">NaviMaster as the Opening of Navigation Agents&lt;/a>&lt;/li>
&lt;li>&lt;a href="#related-links">Related Links&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/div>
&lt;/details>
&lt;h2 id="abstract">Abstract&lt;/h2>
&lt;p>Navigation now happens in two worlds at once: digital interfaces such as mobile, desktop, and web GUIs, and embodied environments that require movement, localization, and interaction in physical or simulated space. Although both are fundamentally navigation problems, they have long been studied through separate datasets, separate action spaces, and separate training pipelines.&lt;/p>
&lt;p>That separation creates several practical bottlenecks:&lt;/p>
&lt;ul>
&lt;li>separate models increase system cost&lt;/li>
&lt;li>cross-domain generalization remains weak&lt;/li>
&lt;li>sparse RL rewards slow training down&lt;/li>
&lt;li>models often reason correctly but execute poorly&lt;/li>
&lt;/ul>
&lt;p>NaviMaster tackles this fragmentation by introducing a unified navigation agent that learns GUI navigation and embodied navigation under the same framework. Through a shared trajectory formulation, a unified RL pipeline, and distance-aware dense rewards, it substantially improves cross-task generalization, optimization efficiency, and grounding precision.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="NaviMaster overview"
srcset="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/featured_hu_5b518e6653adb07c.webp 320w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/featured_hu_a7ae85d185fd2a87.webp 480w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/featured_hu_6ef8d26e8366ca80.webp 650w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/featured_hu_5b518e6653adb07c.webp"
width="650"
height="437"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="task-demos">Task Demos&lt;/h2>
&lt;h3 id="spatial-grounding">Spatial Grounding&lt;/h3>
&lt;p>The model predicts a target point directly from visual context and language constraints.&lt;/p>
&lt;video controls >
&lt;source src="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/spatial_refer_1.mp4" type="video/mp4">
&lt;/video>
&lt;h3 id="gui-navigation">GUI Navigation&lt;/h3>
&lt;p>The agent interacts with interfaces through multi-step actions such as click, type, and wait.&lt;/p>
&lt;video controls >
&lt;source src="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/demo_contact.mp4" type="video/mp4">
&lt;/video>
&lt;h3 id="mixed-gui-and-embodied-interaction">Mixed GUI and Embodied Interaction&lt;/h3>
&lt;p>NaviMaster can also combine navigation and action execution in more complex settings such as Minecraft.&lt;/p>
&lt;video controls >
&lt;source src="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/mc_kill.mp4" type="video/mp4">
&lt;/video>
&lt;h2 id="why-unify-gui-and-embodied-navigation">Why Unify GUI and Embodied Navigation&lt;/h2>
&lt;p>Existing systems usually treat GUI navigation and embodied navigation as different task families. GUI agents operate through clicks and scrolling, while embodied agents rely on movement, viewpoint shifts, and spatial control. Even if both can be abstracted as Markov decision processes, their data formats and training pipelines remain hard to merge.&lt;/p>
&lt;p>NaviMaster starts from a different premise: once actions, goals, and trajectories are aligned, both domains can be learned together as one navigation problem. This turns GUI and embodied data into complementary supervision instead of isolated silos.&lt;/p>
&lt;h2 id="three-core-innovations">Three Core Innovations&lt;/h2>
&lt;h3 id="1-a-unified-visual-goal-trajectory-formulation">1. A Unified Visual-Goal Trajectory Formulation&lt;/h3>
&lt;p>The first challenge is that GUI trajectories and embodied trajectories do not share a common language. NaviMaster resolves this by converting both into a unified visual-goal trajectory format.&lt;/p>
&lt;p>The action space is systematically aligned:&lt;/p>
&lt;ul>
&lt;li>task-specific actions are preserved and inserted into a shared action vocabulary&lt;/li>
&lt;li>GUI &lt;code>[SCROLL]&lt;/code> and embodied &lt;code>[TURN]&lt;/code> actions are both discretized into directional changes&lt;/li>
&lt;li>GUI &lt;code>[CLICK(x, y)]&lt;/code> and embodied movement actions are unified through explicit target-point actions such as &lt;code>[MOVETO(x, y)]&lt;/code>&lt;/li>
&lt;/ul>
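&lt;p>As a rough illustration of the alignment above, the sketch below maps domain-specific actions into one shared vocabulary. The class and function names, the normalization of coordinates to the 0 to 1 range, and the four-direction discretization are assumptions made for this example, not details taken from the paper.&lt;/p>

```python
from dataclasses import dataclass

DIRECTIONS = {"UP", "DOWN", "LEFT", "RIGHT"}  # assumed discretization


@dataclass(frozen=True)
class UnifiedAction:
    """One entry in a shared, language-defined action vocabulary (hypothetical)."""
    name: str
    args: tuple = ()

    def render(self) -> str:
        # Render in the bracketed text form used in this post, e.g. [MOVETO(0.5, 0.25)].
        if not self.args:
            return f"[{self.name}]"
        return f"[{self.name}({', '.join(str(a) for a in self.args)})]"


def _direction(value: str) -> str:
    value = value.upper()
    if value not in DIRECTIONS:
        raise ValueError(f"unsupported direction: {value}")
    return value


def from_gui_click(x_px: int, y_px: int, width: int, height: int) -> UnifiedAction:
    # Assumption: pixel coordinates are normalized by the screenshot size.
    return UnifiedAction("MOVETO", (round(x_px / width, 3), round(y_px / height, 3)))


def from_embodied_waypoint(u: float, v: float) -> UnifiedAction:
    # Assumption: the waypoint is already projected into normalized image coordinates.
    return UnifiedAction("MOVETO", (round(u, 3), round(v, 3)))


def from_gui_scroll(direction: str) -> UnifiedAction:
    return UnifiedAction("SCROLL", (_direction(direction),))


def from_embodied_turn(direction: str) -> UnifiedAction:
    return UnifiedAction("TURN", (_direction(direction),))
```

&lt;p>Under these assumptions, a click at the center of a screenshot and an embodied waypoint at the center of the camera view produce the same target-point action, which is what allows one policy head to serve both domains.&lt;/p>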
&lt;p>GUI trajectories are built from existing datasets such as GUI-Odyssey, while embodied trajectories are derived from shortest-path keypoints extracted with A* search and then converted into visual-goal action sequences. NaviMaster also adds reasoning intents for each step, generated with GPT-4o, so history contains not only actions but also why those actions were taken.&lt;/p>
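&lt;p>The keypoint extraction step can be pictured with a small grid example. This is only a sketch: the occupancy-grid representation, the 4-connected moves, the Manhattan heuristic, and the rule that a keypoint is any cell where the heading changes are all assumptions chosen for illustration.&lt;/p>

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid where 1 marks an obstacle."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]
    best_cost = {start: 0}
    parent = {start: None}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if nxt[0] not in range(rows) or nxt[1] not in range(cols):
                continue
            if grid[nxt[0]][nxt[1]] == 1:
                continue
            new_cost = cost + 1
            if new_cost >= best_cost.get(nxt, float("inf")):
                continue
            best_cost[nxt] = new_cost
            parent[nxt] = cell
            heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return []  # goal unreachable


def keypoints(path):
    """Keep the start, the goal, and every cell where the heading changes."""
    if len(path) >= 3:
        kept = [path[0]]
        for prev, cur, nxt in zip(path, path[1:], path[2:]):
            if (cur[0] - prev[0], cur[1] - prev[1]) != (nxt[0] - cur[0], nxt[1] - cur[1]):
                kept.append(cur)
        kept.append(path[-1])
        return kept
    return list(path)
```

&lt;p>Each retained keypoint could then be turned into one target-point action, so a long low-level path becomes a short sequence of visual goals.&lt;/p>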
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Unified action space and trajectory construction"
srcset="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/unified-trajectory_hu_cbc4680082b5988b.webp 320w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/unified-trajectory_hu_17a68d65de19afc3.webp 480w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/unified-trajectory_hu_4fb10787132af661.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/unified-trajectory_hu_cbc4680082b5988b.webp"
width="760"
height="360"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="2-a-unified-reinforcement-learning-framework">2. A Unified Reinforcement Learning Framework&lt;/h3>
&lt;p>Once trajectories are aligned, NaviMaster trains directly with GRPO on mixed GUI and embodied data instead of using separate warm-start pipelines. Both domains are treated as the same decision process: given an observation, an instruction, and execution history, the model selects the next action from a language-defined action space.&lt;/p>
&lt;p>This allows one policy to learn from 2D GUI data and 3D embodied data together, strengthening cross-domain generalization rather than overfitting to a single environment family.&lt;/p>
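&lt;p>GRPO scores each sampled response relative to the other responses drawn for the same prompt, so no separate value model is needed. The sketch below shows only that normalization step; the function name, the use of the population standard deviation, and the &lt;code>eps&lt;/code> constant are assumptions for this example rather than the training code used for NaviMaster.&lt;/p>

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one sampled group to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample in the group has the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

&lt;p>Because the normalization is computed per group, a GUI prompt and an embodied prompt can sit in the same batch even if their raw reward scales differ.&lt;/p>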
&lt;h3 id="3-distance-aware-dense-rewards">3. Distance-Aware Dense Rewards&lt;/h3>
&lt;p>To address sparse supervision, NaviMaster decomposes success into three components:&lt;/p>
&lt;ul>
&lt;li>whether the output is executable&lt;/li>
&lt;li>whether the action type is correct&lt;/li>
&lt;li>whether the predicted target is sufficiently close to the ground truth&lt;/li>
&lt;/ul>
&lt;p>Instead of binary success-or-failure signals, the model receives graded feedback based on how close it is to the target. This makes learning more stable, reduces useless exploration, and improves convergence speed.&lt;/p>
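&lt;p>One way to picture such a reward is the sketch below. It is a hypothetical formulation: the equal weighting of the three terms, the exponential decay, the &lt;code>scale&lt;/code> parameter, and the bracketed output grammar are assumptions for illustration, and the exact reward used by NaviMaster may differ.&lt;/p>

```python
import math
import re

# Assumed output grammar: [NAME] or [NAME(x, y)] with numeric coordinates.
ACTION_PATTERN = re.compile(r"^\[([A-Z]+)(?:\((-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\))?\]$")


def dense_reward(output, gt_type, gt_point=None, scale=0.1):
    """Sum of three graded terms: format, action type, and distance to the target."""
    match = ACTION_PATTERN.match(output.strip())
    if match is None:
        return 0.0  # not executable, so no further credit
    reward = 1.0  # format term
    if match.group(1) != gt_type:
        return reward
    reward += 1.0  # action-type term
    if gt_point is None or match.group(2) is None:
        return reward  # no target point to score for this action
    pred = (float(match.group(2)), float(match.group(3)))
    # The distance term decays smoothly toward 0 instead of a hit-or-miss threshold.
    reward += math.exp(-math.dist(pred, gt_point) / scale)
    return reward
```

&lt;p>With this shape, a prediction that lands slightly off target still earns more than one that lands far away, which gives the optimizer a usable gradient of feedback long before exact hits become common.&lt;/p>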
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Distance-aware dense reward design"
srcset="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/dense-reward_hu_8bf82bca478443aa.webp 320w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/dense-reward_hu_57943067c3c5071b.webp 480w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/dense-reward_hu_14b08cc95ebca748.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/dense-reward_hu_8bf82bca478443aa.webp"
width="760"
height="410"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="experimental-highlights">Experimental Highlights&lt;/h2>
&lt;h3 id="gui-navigation-1">GUI Navigation&lt;/h3>
&lt;p>NaviMaster shows strong out-of-domain generalization on GUI tasks. Evaluation is performed entirely on OOD test sets isolated from the training distribution. Against strong baselines, NaviMaster consistently improves success rate across mobile, web, and desktop benchmarks.&lt;/p>
&lt;p>More importantly, mixed GUI-plus-embodied training outperforms training on only one domain, showing that the two data sources provide complementary navigation signals.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="GUI navigation results"
srcset="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/gui-results_hu_758df8cc7e536231.webp 320w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/gui-results_hu_f41d9c8668af553e.webp 480w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/gui-results_hu_5f1c8ea11f350a36.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/gui-results_hu_758df8cc7e536231.webp"
width="760"
height="206"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="spatial-grounding-1">Spatial Grounding&lt;/h3>
&lt;p>On four spatial grounding benchmarks, NaviMaster outperforms all baselines. The gains hold for both object-level reference and free-space pointing, showing that the model learns substantially stronger fine-grained visual-spatial alignment.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Spatial grounding results"
srcset="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/spatial-results_hu_630e8cab4d777100.webp 320w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/spatial-results_hu_e7b78755729be878.webp 480w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/spatial-results_hu_8a4219c9fb5c705d.webp 578w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/spatial-results_hu_630e8cab4d777100.webp"
width="578"
height="196"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="embodied-navigation">Embodied Navigation&lt;/h3>
&lt;p>On embodied navigation tasks such as ObjectNav-unseen, NaviMaster also delivers clear improvements. Within the VLMNav framework, only the base model is swapped for NaviMaster, so the improvement can be attributed to the model itself. The results suggest that NaviMaster is the first agent model in this setting to demonstrate strong generalization.&lt;/p>
&lt;p>Again, mixed training outperforms GUI-only or embodied-only variants, reinforcing the value of unified optimization.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Embodied navigation results"
srcset="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/embodied-results_hu_fb94bf407241ce4c.webp 320w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/embodied-results_hu_a7214d633dc889ce.webp 373w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/embodied-results_hu_fb94bf407241ce4c.webp"
width="373"
height="155"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="deeper-analysis">Deeper Analysis&lt;/h2>
&lt;h3 id="mixing-ratio">Mixing Ratio&lt;/h3>
&lt;p>Overall performance peaks when GUI and embodied data are mixed at roughly &lt;code>5:5&lt;/code>, indicating that balanced cross-domain supervision is especially effective. Even under imbalanced ratios, mixed training typically still beats single-domain training.&lt;/p>
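&lt;p>A mixing ratio of this kind can be enforced at batch-construction time. The sampler below is a minimal sketch; &lt;code>mixed_batch&lt;/code>, its parameters, and sampling with replacement are assumptions for illustration, not the data loader used in the experiments.&lt;/p>

```python
import random


def mixed_batch(gui_pool, embodied_pool, batch_size, gui_fraction=0.5, seed=0):
    """Draw one training batch with a fixed share of GUI samples."""
    if gui_fraction > 1.0 or 0.0 > gui_fraction:
        raise ValueError("gui_fraction must lie between 0 and 1")
    rng = random.Random(seed)
    n_gui = round(batch_size * gui_fraction)
    # Sampling with replacement keeps the ratio fixed even when the pools differ in size.
    batch = rng.choices(gui_pool, k=n_gui) + rng.choices(embodied_pool, k=batch_size - n_gui)
    rng.shuffle(batch)
    return batch
```

&lt;p>Setting &lt;code>gui_fraction&lt;/code> to 0.5 corresponds to the balanced &lt;code>5:5&lt;/code> mix; other values reproduce the imbalanced settings.&lt;/p>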
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Mixing ratio analysis"
srcset="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/mix-ratio_hu_feacb3b8e2b828d3.webp 310w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/mix-ratio_hu_feacb3b8e2b828d3.webp"
width="310"
height="308"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="gains-across-backbones">Gains Across Backbones&lt;/h3>
&lt;p>The method brings consistent improvements on multiple base models, including &lt;code>Qwen2.5VL-7B&lt;/code>, &lt;code>Qwen2.5VL-3B&lt;/code>, and &lt;code>Qwen2VL-7B&lt;/code>, showing that the gains are not tied to one specific backbone.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Cross-backbone results"
srcset="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/cross-backbones_hu_78218f42b0e5441e.webp 320w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/cross-backbones_hu_534502e70a1f8447.webp 473w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/cross-backbones_hu_78218f42b0e5441e.webp"
width="473"
height="503"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="data-scale-and-reward-design">Data Scale and Reward Design&lt;/h3>
&lt;p>The unified training strategy remains effective at both smaller and larger data scales. Meanwhile, dense rewards converge faster than sparse rewards and also reach stronger final performance.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Data scale analysis"
srcset="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/data-scale_hu_5b27d0df026a9aee.webp 308w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/data-scale_hu_5b27d0df026a9aee.webp"
width="308"
height="251"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Dense vs sparse reward comparison"
srcset="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/dense-vs-sparse_hu_369593fafcfe9fba.webp 320w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/dense-vs-sparse_hu_fe8733363f415914.webp 480w, https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/dense-vs-sparse_hu_8f982d68f3eefe59.webp 614w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/navimaster-unified-navigation/dense-vs-sparse_hu_369593fafcfe9fba.webp"
width="614"
height="252"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="navimaster-as-the-opening-of-navigation-agents">NaviMaster as the Opening of Navigation Agents&lt;/h2>
&lt;p>NaviMaster is important not because it simply merges two benchmarks, but because it demonstrates a more natural route toward unified multimodal agents. A future agent should not need to be split into one that only clicks screens and another that only moves in physical space. It should be able to perceive, reason, and act across both.&lt;/p>
&lt;p>From that perspective, NaviMaster is best seen as an early but meaningful step toward a broader class of unified agents.&lt;/p>
&lt;h2 id="related-links">Related Links&lt;/h2>
&lt;ul>
&lt;li>Paper: &lt;em>NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks&lt;/em>&lt;/li>
&lt;li>Project page:
&lt;/li>
&lt;/ul></description></item><item><title>SafeVerse: Building a Safe and Trustworthy Digital Twin Arena for Embodied AI</title><link>https://wangxuhongcn.github.io/en/post/safeverse/</link><pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate><guid>https://wangxuhongcn.github.io/en/post/safeverse/</guid><description>
&lt;details class="print:hidden xl:hidden" open>
&lt;summary>Table of Contents&lt;/summary>
&lt;div class="text-sm">
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#abstract">Abstract&lt;/a>&lt;/li>
&lt;li>&lt;a href="#why-safeverse-matters">Why SafeVerse Matters&lt;/a>&lt;/li>
&lt;li>&lt;a href="#three-core-breakthroughs">Three Core Breakthroughs&lt;/a>&lt;/li>
&lt;li>&lt;a href="#from-ordinary-video-to-an-interactive-twin-world">From Ordinary Video to an Interactive Twin World&lt;/a>&lt;/li>
&lt;li>&lt;a href="#editing-the-scene-with-attack-instructions">Editing the Scene with Attack Instructions&lt;/a>&lt;/li>
&lt;li>&lt;a href="#online-evolution-against-discovered-vulnerabilities">Online Evolution Against Discovered Vulnerabilities&lt;/a>&lt;/li>
&lt;li>&lt;a href="#full-safeverse-demo">Full SafeVerse Demo&lt;/a>&lt;/li>
&lt;li>&lt;a href="#what-this-work-shows">What This Work Shows&lt;/a>&lt;/li>
&lt;li>&lt;a href="#related-links">Related Links&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/div>
&lt;/details>
&lt;h2 id="abstract">Abstract&lt;/h2>
&lt;p>Safe and trustworthy embodied AI needs realistic testing grounds, but running attack-and-defense drills directly in the physical world is expensive, risky, and hard to reproduce. SafeVerse addresses this gap by first digitizing a specified real-world environment, then turning that twin world into an editable arena for safety evaluation, adversarial testing, and online reinforcement learning.&lt;/p>
&lt;p>Unlike world models that mainly aim to generate plausible open-ended environments, SafeVerse focuses on reconstructing a particular real scene with low cost and high controllability. Its emphasis is not just realism, but operational realism: the reconstructed world should be interactive, editable, and useful for agent training and verification.&lt;/p>
&lt;h2 id="why-safeverse-matters">Why SafeVerse Matters&lt;/h2>
&lt;p>Existing embodied environments often fall into one of two extremes:&lt;/p>
&lt;ul>
&lt;li>Traditional simulators rely heavily on manual asset construction, offer limited interactive objects, and struggle to reflect the diversity of real-world scenes.&lt;/li>
&lt;li>Generative world models can imagine rich environments, but they are not faithful twins of a user-specified home, office, or factory floor, so they are hard to use for targeted security drills.&lt;/li>
&lt;/ul>
&lt;p>SafeVerse starts from a more practical premise: what safety testing needs is not an imagined world, but a controllable digital twin of a real one. It therefore builds a three-step loop:&lt;/p>
&lt;ol>
&lt;li>Reconstruct a real environment from video.&lt;/li>
&lt;li>Edit that environment according to attack or evaluation goals.&lt;/li>
&lt;li>Let the agent evolve online under continual adversarial pressure.&lt;/li>
&lt;/ol>
&lt;h2 id="three-core-breakthroughs">Three Core Breakthroughs&lt;/h2>
&lt;p>The system is organized around three main capabilities:&lt;/p>
&lt;ul>
&lt;li>&lt;code>Real-world Ctrl+C / Ctrl+V&lt;/code>: it tries to preserve not only appearance, but also structure, semantics, and interaction logic.&lt;/li>
&lt;li>&lt;code>Minute-level construction with operable objects&lt;/code>: a short video can become an interactive 3D scene where doors open, lights switch, and furniture moves.&lt;/li>
&lt;li>&lt;code>Unified evaluation, attack, and evolution&lt;/code>: the same environment can support testing, adversarial scene mutation, and online RL-based agent improvement.&lt;/li>
&lt;/ul>
&lt;p>This makes SafeVerse less like a standalone simulator and more like a digital twin infrastructure for trustworthy embodied intelligence.&lt;/p>
&lt;h2 id="from-ordinary-video-to-an-interactive-twin-world">From Ordinary Video to an Interactive Twin World&lt;/h2>
&lt;p>The first step in SafeVerse is to make the system actually understand the source video. Instead of leaning on a purely geometric 3D optimization pipeline, it uses multimodal understanding to parse objects, layouts, and scene semantics, then maps them into operable 3D entities.&lt;/p>
&lt;p>Built on top of Minecraft and its rich physical interaction rules, SafeVerse converts recognized scene elements into 3D objects with explicit interaction affordances. The result is not a static reconstructed set, but a sandbox where an embodied agent can enter, move, manipulate, and explore.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >&lt;img alt="Scene reconstruction example 1"
src="https://wangxuhongcn.github.io/en/post/safeverse/scene-build-1.gif"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >&lt;img alt="Scene reconstruction example 2"
src="https://wangxuhongcn.github.io/en/post/safeverse/scene-build-2.gif"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >&lt;img alt="Scene reconstruction example 3"
src="https://wangxuhongcn.github.io/en/post/safeverse/scene-build-3.gif"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >&lt;img alt="Scene reconstruction example 4"
src="https://wangxuhongcn.github.io/en/post/safeverse/scene-build-4.gif"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>These four GIFs illustrate the pipeline from input video to interactive 3D environment. The key point is that SafeVerse makes &lt;code>video in, operable twin out&lt;/code> practical at minute-level turnaround.&lt;/p>
&lt;h2 id="editing-the-scene-with-attack-instructions">Editing the Scene with Attack Instructions&lt;/h2>
&lt;p>Reconstruction alone is not enough for safety validation. The environment also needs to change in response to specific attack goals.&lt;/p>
&lt;p>SafeVerse emphasizes a combination of realism and editability. Once a twin scene has been built, it can be modified directly for attack-and-defense scenarios, including:&lt;/p>
&lt;ul>
&lt;li>changing interaction properties, such as turning an ordinary door into one that must be unlocked first&lt;/li>
&lt;li>altering semantic cues, such as changing an object&amp;rsquo;s appearance to mislead recognition&lt;/li>
&lt;li>perturbing spatial layout, such as repositioning furniture or obstacles to break a navigation plan&lt;/li>
&lt;/ul>
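&lt;p>The three kinds of edits listed above can be thought of as small, declarative changes applied to a scene description. The sketch below is purely illustrative: the dictionary-based scene, the field names, and &lt;code>apply_edit&lt;/code> are hypothetical and do not describe the actual SafeVerse interface.&lt;/p>

```python
from copy import deepcopy


def apply_edit(scene, edit):
    """Return an edited copy of the scene; the original twin stays untouched."""
    edited = deepcopy(scene)
    target = edited[edit["object"]]
    if edit["kind"] == "interaction":
        target["requires"] = edit["requires"]  # e.g. a door that now needs unlocking
    elif edit["kind"] == "semantic":
        target["appearance"] = edit["appearance"]  # visual cue meant to mislead recognition
    elif edit["kind"] == "layout":
        target["position"] = tuple(edit["position"])  # moved furniture or obstacle
    else:
        raise ValueError(f"unknown edit kind: {edit['kind']}")
    return edited
```

&lt;p>Keeping the reconstructed scene immutable and deriving each attack variant as a copy makes individual stress tests easy to reproduce and compare.&lt;/p>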
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Attack-driven scene editing"
srcset="https://wangxuhongcn.github.io/en/post/safeverse/featured_hu_1597f8ce1ab506ff.webp 320w, https://wangxuhongcn.github.io/en/post/safeverse/featured_hu_f2ee097622fe69bc.webp 480w, https://wangxuhongcn.github.io/en/post/safeverse/featured_hu_d50f034e7c86e48f.webp 746w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/safeverse/featured_hu_1597f8ce1ab506ff.webp"
width="746"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>This allows attack vectors to be injected into the environment itself, creating more realistic and better targeted embodied stress tests.&lt;/p>
&lt;h2 id="online-evolution-against-discovered-vulnerabilities">Online Evolution Against Discovered Vulnerabilities&lt;/h2>
&lt;p>SafeVerse does not stop at evaluation. It pushes the loop one step further toward online evolution.&lt;/p>
&lt;p>Conventional embodied training often depends on fixed datasets and static scenes. When a new attack or environmental shift appears, agents can fail catastrophically. SafeVerse tries to solve this through a reconstruction-attack-defense loop: rebuild the scene, perturb it dynamically, and retrain the agent immediately after failure.&lt;/p>
&lt;p>That means the agent is no longer tested in a frozen benchmark. It must adapt to changing layouts, newly inserted obstacles, altered object states, and other evolving threats in real time.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >&lt;img alt="Online evolution example: removing the chair to complete the task"
src="https://wangxuhongcn.github.io/en/post/safeverse/online-evolution.gif"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>One example is especially clear: when a chair blocks the only path to the goal, the agent initially fails. After online training, it learns to recognize the obstacle, reroute, or even move the chair away. The point is not only that the agent encounters a failure, but that it improves because of that failure.&lt;/p>
&lt;h2 id="full-safeverse-demo">Full SafeVerse Demo&lt;/h2>
&lt;p>This video shows the full-process demo, making the combined effect of reconstruction, attack-oriented editing, and online evolution easier to see.&lt;/p>
&lt;video controls style="width: 100%; max-width: 960px; margin: 0 auto; display: block;">
&lt;source src="https://wangxuhongcn.github.io/media/safeverse.mp4" type="video/mp4">
Your browser does not support the video tag.
&lt;/video>
&lt;h2 id="what-this-work-shows">What This Work Shows&lt;/h2>
&lt;p>SafeVerse is not just another embodied simulator. Its main contribution is that it connects three capabilities into one loop: fast digitization of a specified real scene, attack-oriented scene editing, and online RL-based agent evolution.&lt;/p>
&lt;p>Many embodied platforms are good at offering training spaces. SafeVerse is more specifically about offering safety drill spaces. It turns real-scene digitization into a working capability for safe and trustworthy embodied AI research.&lt;/p>
&lt;h2 id="related-links">Related Links&lt;/h2>
&lt;ul>
&lt;li>GitHub:
&lt;/li>
&lt;/ul></description></item></channel></rss>