src/content/action-chunking.mdx (+4 −2)
Action-chunking is a popular practice in modern sequential modeling pipelines, w…
<Block title="Chunking Policy" type="Definition">

A chunking policy is specified by a chunk length $\ell$ and a map $\text{chunk}[\pi]: \mathcal{X} \to \mathcal{U}^{\ell}$ such that $\pi(\mathbf{x}_{1:t},\mathbf{u}_{1:t-1},t) = \text{chunk}[\pi](\mathbf{x}_{\ell k})_{t- \ell k}$, where $k = \lfloor \frac{t}{\ell}\rfloor$; i.e., we predict $\ell$-length action sequences which are then executed "open-loop", without feedback from $\mathbf{x}$, until the chunk has been exhausted.
</Block>
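To make the execution semantics concrete, here is a minimal sketch of a chunked rollout (our illustration, not code from the manuscript), assuming hypothetical callables `chunk_policy(x)`, which returns an $(\ell, d_u)$ array, and dynamics `f(x, u)`; with 0-indexed time, `t % ell` plays the role of $t - \ell\lfloor t/\ell\rfloor$.

```python
import numpy as np

def rollout_chunked(chunk_policy, f, x1, T, ell):
    """Execute a chunking policy: replan at chunk boundaries, run open-loop in between."""
    x, xs, us = x1, [x1], []
    for t in range(T):
        if t % ell == 0:              # chunk boundary: observe x_{ell*k} and replan
            chunk = chunk_policy(x)   # predicted action chunk, shape (ell, d_u)
        u = chunk[t % ell]            # within-chunk actions ignore the current state
        x = f(x, u)
        us.append(u)
        xs.append(x)
    return np.array(xs), np.array(us)
```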
For convenience we also write $\text{chunk}[\pi](\mathbf{x}) = (\text{chunk}_1(\mathbf{x}),\dots, \text{chunk}_{\ell}(\mathbf{x}))$ and denote a chunking policy as $\hat{\pi}_{\text{chunk}}$. For chunked policies, our demonstration loss becomes the **on-expert error** $J_{\text{demo}}(\hat{\pi}_{\text{chunk}})$.
<Block type="Intervention 1" title="Learning over Chunked Policies">
We sample $S_n$, a set of $n$ i.i.d. trajectories drawn from the expert distribution $\mathcal{P}_{\text{demo}}$. Instead of learning $\hat{\pi}: \mathcal{X} \to \mathcal{U}$, we learn an $\ell$-chunked policy $\text{chunk}[\hat{\pi}_{\text{chunk}}]: \mathcal{X} \to \mathcal{U}^\ell$ that attains low **on-expert error** $J_{\text{demo}}(\hat{\pi}_{\text{chunk}})$, e.g., by empirical risk minimization.
</Block>
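The induced supervised-learning problem can be sketched as follows, under the assumption that each demonstration is stored as a pair of arrays `(xs, us)` (a hypothetical format): every chunk-boundary state is paired with the next $\ell$ expert actions, and empirical risk minimization fits $\text{chunk}[\hat{\pi}_{\text{chunk}}]$ to these pairs.

```python
import numpy as np

def chunked_regression_pairs(trajectories, ell):
    """Turn expert trajectories into (state, action-chunk) pairs for ERM.

    Each pair maps a chunk-boundary state x_{ell*k} to the next ell expert
    actions -- the regression target for chunk[pi_hat].
    """
    states, chunks = [], []
    for xs, us in trajectories:          # xs: (T+1, d_x), us: (T, d_u)
        for k in range(len(us) // ell):
            states.append(xs[ell * k])
            chunks.append(us[ell * k : ell * (k + 1)])
    return np.array(states), np.array(chunks)
```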
The Control-Theoretic intuition behind this intervention is that, by making the chunk length long enough, the learned policy $\hat{\pi}_{\text{chunk}}$ **inherits the open-loop stability of the dynamics $f$**.
src/content/discussion.mdx (+4 −5)
import Refs from '../components/Refs.astro';
## Discussion and Limitations
Our combined action-chunking, noise-injection procedure relies on a structural assumption of either $f$ or $f^{\pi^\star}$ being EISS.
Without either of these assumptions, if $f^{\pi^\star}$ is unstable, errors may always compound in the worst case [@simchowitz2025pitfalls]. This setting is, to some degree, uninteresting for Imitation Learning, as it means that the expert is inherently bad and cannot correct from failure.
For settings where an external oracle can stabilize the dynamics (e.g. a low-level position-based control loop), the dynamics can be reformulated such that $f$ is open-loop EISS. As such, we believe our results cover the full spectrum of situations where learning is reasonable.
To validate our predictions about the **stability-theoretic** benefits of action-chunking…
- The merits of action-chunking persist in **deterministic, state-based control**. This reveals that action-chunking improves performance independently of partial observability or of compatibility with generative control policies.
- **End-effector control** enables the benefits of action-chunking. This is because end-effector control renders the closed loop between system state and end-effector prediction incrementally stable. Hence, the low-level end-effector controller transforms the imitation of the position policy into learning in an open-loop stable dynamical system, precisely the regime where our action-chunking guarantees apply.
We visualize performance as a function of noise injection and chunk length for the MuJoCo HalfCheetah environment, and show performance relative to both DAgger and DART on HalfCheetah and Humanoid.
</FigureEnv>
### Noise Injection
We seek to validate our hypotheses about the exploratory benefits of noise-injection. We propose experiments on MuJoCo continuous control environments, where the goal is to imitate pre-trained expert policies. To summarize:
- **Noise injection as in Intervention 2 provides the exploration necessary to mitigate compounding errors**, increasing performance on par with iteratively interactive methods such as DAgger and DART. We note Intervention 2 collects data in one shot, without ever observing learned policy rollouts.
- **Larger noise scales $\sigma_u$ (within tolerance) improve performance**, in contrast to prior understanding, which necessitates setting $\sigma_u$ proportional to $J_{\text{demo}}(\hat{\pi}; \mathcal{P}_{\text{demo}})$, i.e. very small for policies with low on-expert error.
- **A mixture of noise-injected and clean expert trajectories is beneficial**, and the difference is small when more data is provided. This matches the theoretical intuition that noise-injection is necessary only until $\hat{\pi}$ is "locally stabilized" sufficiently well around $\mathbf{x}^*$, and thus enters the trajectory error only as a higher-order term.
We now consider the difficult setting where the ambient dynamics $f$ may not be…
We define the **expert distribution under noise injection** as the distribution $\mathcal{P}_{\text{exp},\sigma}$ over trajectories $(\tilde{\mathbf{x}}_t, \tilde{\mathbf{u}}_t)_{t\geq1}$ with $\tilde{\mathbf{x}}_1 \sim D$, and $\tilde{\mathbf{u}}_t = \pi^*(\tilde{\mathbf{x}}_t),\;\tilde{\mathbf{x}}_{t+1} = f(\tilde{\mathbf{x}}_t, \tilde{\mathbf{u}}_t + \sigma_u \mathbf{z}_t)$ for $t \geq 1$, where $\mathbf{z}_t \sim \text{Unif}(\mathbb{B}^{d_u}(1))$ is drawn uniformly over the unit ball.
</Block>
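A minimal sketch of sampling a trajectory from $\mathcal{P}_{\text{exp},\sigma}$, assuming hypothetical callables `expert` (standing in for $\pi^*$) and dynamics `f`; note that the recorded label is the clean action $\pi^*(\tilde{\mathbf{x}}_t)$, while the noise perturbs only the action that drives the dynamics.

```python
import numpy as np

def sample_unit_ball(d, rng):
    """z ~ Unif(B^d(1)): isotropic direction, radius u^(1/d) for u ~ Unif[0, 1]."""
    z = rng.standard_normal(d)
    z /= np.linalg.norm(z)
    return z * rng.uniform() ** (1.0 / d)

def noisy_expert_rollout(expert, f, x1, T, sigma_u, rng):
    """One trajectory from P_{exp,sigma}: clean expert labels, perturbed dynamics."""
    x, xs, us = x1, [x1], []
    for _ in range(T):
        u = expert(x)                    # recorded label u_t = pi*(x_t)
        z = sample_unit_ball(len(u), rng)
        x = f(x, u + sigma_u * z)        # only the executed action is perturbed
        us.append(u)
        xs.append(x)
    return np.array(xs), np.array(us)
```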
Our key innovation over prior algorithms such as DAgger or DART is that we learn using a weighted **mixture** of both the noise-injected $\mathcal{P}_{\text{exp},\sigma}$ and the "vanilla" expert data distribution $\mathcal{P}_{\text{exp}}$.
Using a mixture is *provably better*, particularly in the high-data regime with large $n$. This is an intuitive result: when $J_{\text{imitation}}$ is already low, demonstrations with the fixed noise level $\sigma$, i.e. $\mathcal{P}_{\text{exp},\sigma}$, may explore *too* much and have low coverage on $\mathcal{P}_{\text{exp}}$.
<Block title="Exploratory Data Collection" type="Intervention">
For the noise-injected distribution $\mathcal{P}_{\text{exp},\sigma}$ defined above, collect a sample $S_{n,\sigma,\alpha}$ of $n$ trajectories, where for $1 \le i \le \lfloor\alpha n\rfloor$ the trajectories are i.i.d. from $\mathcal{P}_{\text{exp}}$, and the remaining trajectories are drawn i.i.d. from $\mathcal{P}_{\text{exp},\sigma}$. Define the corresponding mixture distribution $\mathcal{P}_{\text{exp},\sigma,\alpha} \triangleq \alpha \mathcal{P}_{\text{exp}} + (1-\alpha)\mathcal{P}_{\text{exp},\sigma}$. We then find $\hat{\pi}$ that attains low $J_{\text{demo}}^T(\hat{\pi}; \mathcal{P}_{\text{exp},\sigma,\alpha})$, e.g., by empirical risk minimization.
</Block>
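Assembling $S_{n,\sigma,\alpha}$ can then be sketched by reusing the `noisy_expert_rollout` sketch above, since $\sigma_u = 0$ recovers a clean rollout from $\mathcal{P}_{\text{exp}}$; the initial-state sampler `sample_x1` is an assumed stand-in for the initial distribution.

```python
import numpy as np

def collect_mixture(n, alpha, sigma_u, expert, f, sample_x1, T, rng):
    """S_{n,sigma,alpha}: floor(alpha*n) clean trajectories, the rest noise-injected."""
    n_clean = int(np.floor(alpha * n))
    data = []
    for i in range(n):
        sigma = 0.0 if i < n_clean else sigma_u   # clean vs. noise-injected rollout
        data.append(noisy_expert_rollout(expert, f, sample_x1(rng), T, sigma, rng))
    return data
```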
<FigureEnv>
<HStack>
<Figure src={`${BASE_URL}/figs/exploration_diagram.svg`} alt="Exploratory data collection via noise injection"/>
</HStack>
We can think of the data mixture as ensuring coverage both on-expert and in a "tube" around the expert trajectories. Using either one alone is suboptimal, due to a lack of either on-expert or off-expert data.
</FigureEnv>
Our results in this domain make extensive use of the analysis tools introduced in @pfrommer2022tasil, which provide strong guarantees when imitating a closed-loop EISS expert in an adversarial manner.
There are many technical subtleties that we gloss over here but explore in detail in our full manuscript. Namely, our analysis is carefully constructed to consider coverage only on the manifold of reachable states. Performing this analysis in a technically rigorous manner requires careful Control-Theoretic arguments involving concepts such as the controllability Gramian. We additionally make several simplifying assumptions regarding first-order smoothness (i.e., that $f, \pi^\star$ are differentiable with $C_{\text{smooth}}$- and $C_{\pi}$-Lipschitz derivatives, respectively).
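For reference (a standard definition, not notation specific to our manuscript): for linear time-invariant dynamics $\mathbf{x}_{t+1} = A\mathbf{x}_t + B\mathbf{u}_t$, the horizon-$T$ controllability Gramian is

$$
W_T \triangleq \sum_{t=0}^{T-1} A^{t} B B^{\top} \left(A^{\top}\right)^{t},
$$

and $W_T \succ 0$ certifies that every state direction is reachable by some input sequence; its smallest eigenvalue quantifies how cheaply inputs can excite all directions. How this enters our reachable-manifold coverage argument is deferred to the full manuscript.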
<Block type="Key Result">
Let the dynamics and expert policy $(f, \pi^*)$ be $(C_{\text{smooth}}, C_{\pi})$-smooth, respectively, and suppose all policies $\pi$ are $L_\pi$-Lipschitz. Assume that the closed-loop system induced by $(f, \pi^*)$, $f^{\pi^\star}$, is $(C_{\text{ISS}}, \rho)$-EISS. Let $\hat{\pi}$ be an $L_\pi$-Lipschitz, $C_\pi$-smooth policy. Then, for any $n, T$, and …