Commit 3ebe76a

Small edits
1 parent 754264a commit 3ebe76a

2 files changed

Lines changed: 7 additions & 2 deletions

src/content/action-chunking.mdx

Lines changed: 5 additions & 0 deletions
@@ -15,6 +15,7 @@ Action-chunking is a popular practice in modern sequential modeling pipelines, w
 3. Improved representation learning via multi-step prediction.
 4. Simulating Model-Predictive Control.
 
+We show a different mechanism, one not described by the past literature: action-chunking can leverage the open-loop stability of the system to stabilize the learned policy.
 
 <Block title="Chunking Policy" type="Definition">
 A chunking policy is specified by a chunk-length $\ell$, and a chunking policy $\text{chunk}[\pi]: \mathcal{X} \to \mathcal{U}^{\ell}$ such that $\pi(\mathbf{x}_{1:t},\mathbf{u}_{1:t-1},t) = \text{chunk}[\pi](\mathbf{x}_{\ell k})_{t - \ell k}$ where $k = \lfloor \frac{t}{\ell}\rfloor$, i.e. we predict $\ell$-length sequences which are then executed "open-loop" without feedback from $\mathbf{x}$ until the chunk has been exhausted.
@@ -25,6 +26,8 @@ $$
 J_{\text{demo}}(\hat{\pi}_{\text{chunk}}) = \mathbb{E}_{\pi^\star}\left[ \sum_{k=1}^{(T-1)/\ell} \|\mathbf{u}^*_{1+(k-1)\ell:k\ell} - \text{chunk}[\hat{\pi}_{\text{chunk}}](\mathbf{x}^*_{(k-1)\ell})\|^2\right].
 $$
 
+We now formalize action-chunking for imitating deterministic expert policies:
+
 <Block type="Intervention 1" title="Learning over Chunked Policies">
 We sample $S_n$ i.i.d. trajectories drawn from the expert distribution $\mathcal{P}_{\text{demo}}$. Instead of learning $\hat{\pi}: \mathcal{X} \to \mathcal{U}$, we learn an $\ell$-chunked policy $\text{chunk}[\hat{\pi}_{\text{chunk}}]: \mathcal{X} \to \mathcal{U}^\ell$ that attains low **on-expert error** $J_{\text{demo}}(\hat{\pi}_{\text{chunk}})$, e.g., by empirical risk minimization.
 </Block>
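As an illustration of the on-expert objective in the hunk above, here is a minimal NumPy sketch of the chunked loss for a single expert trajectory. It is not code from this repository: the function name `on_expert_chunk_loss`, the callable `chunk_policy`, and the zero-based convention that `us[t]` is the expert action taken at state `xs[t]` are all assumptions made for the example, and the indexing offset may differ from the one-based formula in the MDX source.

```python
import numpy as np

def on_expert_chunk_loss(chunk_policy, xs, us, ell):
    """Sum of squared chunk-prediction errors along one expert trajectory.

    chunk_policy: hypothetical callable mapping a state (dx,) to a chunk (ell, du).
    xs: expert states, shape (T, dx).
    us: expert actions, shape (T, du); us[t] is assumed to be taken at xs[t].
    ell: chunk length.
    """
    T = us.shape[0]
    total = 0.0
    # Visit only chunk boundaries 0, ell, 2*ell, ...; a trailing partial
    # chunk (fewer than ell remaining actions) is ignored in this sketch.
    for start in range(0, T - ell + 1, ell):
        predicted = chunk_policy(xs[start])      # (ell, du), predicted once per chunk
        target = us[start:start + ell]           # the next ell expert actions
        total += float(np.sum((predicted - target) ** 2))
    return total
```

Averaging this quantity over the sampled expert trajectories would give one possible empirical-risk objective of the kind the block mentions.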
@@ -51,3 +54,5 @@ $$
 
 This implies that when the ambient dynamics $f$ are EISS, then a sufficiently chunked imitator policy will accrue limited compounding errors&mdash;**horizon-free**&mdash;relative to the on-expert error it sees.
 </Block>
+
+Our result follows from the following fact: under natural assumptions, the learner's chunked policies are all closed-loop EISS. This circumvents the lower bound given earlier, in which it is hard for the learner to find policies that stabilize the dynamics if those policies must predict a single action at a time.
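To make the "open-loop within a chunk" execution concrete, the following is a small sketch of rolling out a chunking policy using the index arithmetic from the Chunking Policy definition ($k = \lfloor t/\ell \rfloor$, action index $t - \ell k$). The names `rollout_chunked`, `chunk_policy`, and the dynamics callable `f` are hypothetical and chosen only for this example; the sketch uses zero-based time and makes no claim about the stability analysis itself.

```python
import numpy as np

def rollout_chunked(chunk_policy, f, x0, ell, T):
    """Roll out a chunking policy for T steps under dynamics x' = f(x, u).

    chunk_policy: hypothetical callable mapping a state (dx,) to a chunk (ell, du).
    f: hypothetical dynamics callable taking (state, action) and returning the next state.
    Returns visited states with shape (T + 1, dx) and actions with shape (T, du).
    """
    x = np.asarray(x0, dtype=float)
    xs, us = [x], []
    chunk = None
    for t in range(T):
        k = t // ell                  # index of the current chunk
        if t == ell * k:              # chunk boundary: query the policy at the current state
            chunk = chunk_policy(x)
        u = chunk[t - ell * k]        # inside a chunk, actions ignore newer states (open-loop)
        x = f(x, u)
        us.append(u)
        xs.append(x)
    return np.array(xs), np.array(us)
```

With `ell = 1` this reduces to an ordinary single-step feedback policy, which is the setting the lower bound mentioned above refers to.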

src/content/introduction.mdx

Lines changed: 2 additions & 2 deletions
@@ -49,7 +49,7 @@ We validate this finding in simulated robotic manipulation tasks from RoboMimic,
 <Figure src={`${BASE_URL}/figs/robomimic_traj100.svg`} alt="RoboMimic trajectory results"/>
 <Figure src={`${BASE_URL}/figs/robomimic_clean_vs_noise.svg`} alt="RoboMimic clean vs noise comparison"/>
 </HStack>
-RoboMimic tool-hang task success, as a function of both prediction horizon and evaluated chunk length.
+RoboMimic tool-hang task success, as a function of both prediction horizon and evaluated chunk length. Center: Chunk length ablation, 100 training trajectories. Right: Ablation on noise injection vs no noise injection, 50 training trajectories.
 </FigureEnv>
 
 
@@ -70,7 +70,7 @@ The effect of noise injection during demonstration collection for unstable envir
 <Figure src={`${BASE_URL}/figs/noise_inj_sweep_sigma.png`} alt="Noise injection sweep: sigma parameter" height="10rem"/>
 <Figure src={`${BASE_URL}/figs/noise_inj_sweep_alpha_sigma1.png`} alt="Noise injection sweep: alpha and sigma1" height="10rem"/>
 </HStack>
-Mean accumulated reward for Half-Cheetah environment by timestep, with differing levels of noise injection.
+Mean accumulated reward for Half-Cheetah environment by timestep, with differing levels of noise injection and using the clean expert actions vs noised expert actions for the training labels.
 </FigureEnv>
 
 For the adventurous reader, we will now introduce the general framework we use to make precise these fuzzy notions of stability and performance. This requires elements from Control Theory with which many Roboticists and RL theorists may be unfamiliar. We build up our analytical framework in a notation-light and broadly informal manner.
