Commit 655adea

Merge pull request #15 from facebookresearch/set-of-marks-saving
Set of marks saving
2 parents 1768fea + 4d7c059 commit 655adea

23 files changed

Lines changed: 279 additions & 47 deletions

config/agent/GPT-5.4-computer-use.yaml

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ custom_actions:
 use_html: false
 use_axtree: false
 use_screenshot: true
-use_som: false
+save_som: false
 extract_visible_tag: false
 extract_clickable_tag: false
 extract_coords: false

config/agent/UI-TARS-1.5-7B.yaml

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ custom_actions:
 use_html: false
 use_axtree: false
 use_screenshot: true
-use_som: false
+save_som: false
 extract_visible_tag: false
 extract_clickable_tag: false
 extract_coords: false

config/agent/axtree-only.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ custom_actions: ["click", "fill", "dblclick", "clear", "select_option", "drag_an
 # --- observation flags ---
 use_axtree: True # enable AXTREE observation
 use_screenshot: False # enable screenshot observation
-use_som: False # Add a set of marks to the screenshot.
+save_som: False # Add a set of marks to the screenshot.
 extract_coords: False # Add the coordinates of the elements.

 # --- Prompt Flags ---

config/agent/default.yaml

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ use_screenshot: True # enable screenshot observation

 # ---- these are not really changed, but leaving it here for future reference ----
 # use_html: False # enable HTML observation
-use_som: False # Add a set of marks to the screenshot.
+save_som: False # Add a set of marks to the screenshot.
 # extract_visible_tag: False # Add a "visible" tag to visible elements in the AXTree.
 # extract_clickable_tag: False # Add a "clickable" tag to clickable elements in the AXTree.
 extract_coords: False # Add the coordinates of the elements.

config/agent/dummy.yaml

Lines changed: 1 addition & 0 deletions
@@ -7,5 +7,6 @@ model_pretty_name: dummy # for wandb-logging
 use_html: True
 use_axtree: True
 use_screenshot: True
+save_som: False # set to True to save set_of_marks_coordinates.json
 hostname: "no host name for dumb dumbs" # dummy agent does not use hostname
 client_type: dummy

config/agent/screenshot-only.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 custom_actions: ["go_back", "go_forward", "goto", "mouse_click", "mouse_dblclick", "scroll", "mouse_move", "mouse_down", "mouse_up", "mouse_click", "mouse_dblclick", "mouse_drag_and_drop", "mouse_upload_file", "keyboard_down", "keyboard_up", "keyboard_press", "keyboard_type", "keyboard_insert_text"]
 use_axtree: False # enable AXTREE observation
 use_screenshot: True # enable screenshot observation
-use_som: False # Add a set of marks to the screenshot.
+save_som: False # Add a set of marks to the screenshot.
 extract_coords: False # Add the coordinates of the elements.
 prompt_txt:
 system_prompt: null # takes default system prompt from dp lib
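
Reviewer note: the agent configs in this commit rename the key `use_som` to `save_som`, so a config or command-line override written before the rename may still carry the old key. The sketch below shows one way to surface that early. It assumes the config has already been loaded into a plain dict, and `read_save_som` is a hypothetical helper that is not part of this repository; defaulting to `False` when the key is absent is also an assumption, chosen because every config in this diff sets `save_som: False`.

```python
def read_save_som(cfg: dict) -> bool:
    """Return the `save_som` flag from an already-loaded agent config.

    Hypothetical helper for illustration only. It rejects the pre-rename
    `use_som` key instead of silently ignoring it.
    """
    if "use_som" in cfg:
        raise KeyError(
            "'use_som' was renamed to 'save_som'; please update the config"
        )
    # Assumption: a missing key means the feature is off.
    return bool(cfg.get("save_som", False))
```

Failing loudly is a design choice here: a stale `use_som: True` would otherwise be ignored without any visible effect.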

docs/Intro to UI Agents.md

Lines changed: 9 additions & 0 deletions
@@ -19,6 +19,15 @@ Agents receive a screenshot of the current apps (in the same manner a human sees

 The agent then outputs an action such as `click` or `type` that directly affects the apps. Throughout the interaction we monitor whether the action has completed the task to terminate the loop or a `max_steps` is reached.

+You can view configs for `configs/agents/default.yaml` containing:
+
+- list of actions
+- `use_axtree`: produces simplified text representation of each app states as an input
+- `use_screenshot`: provides screenshot of app as an input
+- `save_som`: if true, saves set of marks in `log_outputs/<timestamp>/set_of_marks_coordinates.json` json (see example below)
+
+
+You can view `UI-Tars-1.5-7B.yaml` as an example of native computer-use which uses screenshots to output click, type, actions with coordinates. For an example of a multimodal agent that accepts simplified text inputs see `GPT-5.1.yaml`.


 ## OpenApps: building blocks for digital agent research
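
Reviewer note: the added documentation says that `save_som` writes `log_outputs/<timestamp>/set_of_marks_coordinates.json`. A minimal sketch of reading the most recent such file is given below. The function name `load_latest_som` is hypothetical, the schema of the JSON is not shown in this commit so the payload is returned as-is, and the sketch assumes that run directory names sort chronologically as strings.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_latest_som(log_root: str = "log_outputs") -> Optional[Any]:
    """Load set_of_marks_coordinates.json from the newest run directory.

    Assumption: the <timestamp> directory names sort chronologically when
    compared as strings. Returns None if no run saved the file.
    """
    candidates = sorted(Path(log_root).glob("*/set_of_marks_coordinates.json"))
    if not candidates:
        return None
    with candidates[-1].open(encoding="utf-8") as f:
        # The JSON structure is not documented in this diff, so no
        # particular keys are assumed here.
        return json.load(f)
```

If the timestamp format does not sort lexicographically, sorting by file modification time would be a reasonable substitute.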

mkdocs.yml

Lines changed: 0 additions & 1 deletion
@@ -7,7 +7,6 @@ theme:
 pygments_style: # default styles
 light: shadcn-light
 dark: github-dark
-icon: heroicons:rectangle-stack # use the shadcn svg if not defined
 topbar_sections: false # NEW!
 show_datetime: false


site/Intro to UI Agents.md

Lines changed: 36 additions & 0 deletions
@@ -1,5 +1,41 @@

+Digital agents open the possibility of AI systems to complete tedious tasks on your behalf. For example, `"add an event to my calendar"` or even more complex multi-step tasks. Yet, today's agents are still not reliable enough for many applications. To get there, we need lots of data for training and evaluation + lots of research to develop new recipes for training and deploying reliable agents.

+A few definitions to settle you in:
+
+!!! note "Digital (UI) Agent:"
+    completes tasks by directly interacting with apps in the same manner as humans (by clicking, scrolling, typing on your behalf)
+
+!!! note "Reward:"
+    measures whether the agent completed the given task
+
+
+![landing](images/pomdp.png)
+
+## Agents under the hood
+
+Digital agents are powered by a foundation model that can understand both text and image inputs.
+Agents receive a screenshot of the current apps (in the same manner a human sees them) and the task goal ("delete Brooklyn Bridge from my favorite places"); depending on how you configure the agent, the agent can also track past actions or observations.
+
+The agent then outputs an action such as `click` or `type` that directly affects the apps. Throughout the interaction we monitor whether the action has completed the task to terminate the loop or a `max_steps` is reached.
+
+You can view configs for `configs/agents/default.yaml` containing:
+
+- list of actions
+- `use_axtree`: produces simplified text representation of each app states as an input
+- `use_screenshot`: provides screenshot of app as an input
+- `save_som`: if true, saves set of marks in `log_outputs/<timestamp>/set_of_marks_coordinates.json` json (see example below)
+
+
+You can view `UI-Tars-1.5-7B.yaml` as an example of native computer-use which uses screenshots to output click, type, actions with coordinates. For an example of a multimodal agent that accepts simplified text inputs see `GPT-5.1.yaml`.
+
+
+## OpenApps: building blocks for digital agent research
+
+OpenApps offers an easy to use environment that runs on one CPU written in Python for stuyding digital agents. OpenApps comes with six configurable apps for generating limitless data for training and evaluating digital agents.
+
+
+### Hands on with OpenApps

 Learn how to set up OpenApps, run a GPT-5 agent and make changes to the envrionment.
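
Reviewer note: the "Agents under the hood" text added here describes a loop in which the agent emits an action, the environment checks whether the task is complete, and the episode stops on completion or after `max_steps`. A toy sketch of that control flow follows. All names (`run_episode`, `policy`, `step`) are hypothetical and do not reflect the OpenApps API; the environment is reduced to a callable that returns the next observation and a completion flag.

```python
from typing import Any, Callable, Tuple


def run_episode(
    policy: Callable[[Any], Any],
    step: Callable[[Any], Tuple[Any, bool]],
    first_obs: Any,
    max_steps: int = 10,
) -> Tuple[bool, int]:
    """Run one episode and return (task_completed, steps_taken).

    `policy` maps an observation (e.g. a screenshot) to an action such as
    a click or a keystroke; `step` applies the action and reports whether
    the task is now complete. Both are stand-ins for illustration.
    """
    obs = first_obs
    for t in range(1, max_steps + 1):
        action = policy(obs)
        obs, done = step(action)
        if done:
            return True, t  # task completed, terminate the loop
    return False, max_steps  # step budget exhausted
```

The completion check plays the role of the reward defined above: it only measures whether the task was completed, not how.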

site/Intro to UI Agents/index.html

Lines changed: 65 additions & 2 deletions
@@ -142,7 +142,14 @@
 <span class="size-8 flex flex-row justify-center items-center">


-<svg xmlns="http://www.w3.org/2000/svg" width="20px" height="20px" viewBox="0 0 24 24"><path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M6 6.878V6a2.25 2.25 0 0 1 2.25-2.25h7.5A2.25 2.25 0 0 1 18 6v.878m-12 0q.354-.126.75-.128h10.5q.396.002.75.128m-12 0A2.25 2.25 0 0 0 4.5 9v.878m13.5-3A2.25 2.25 0 0 1 19.5 9v.878m0 0a2.3 2.3 0 0 0-.75-.128H5.25q-.396.002-.75.128m15 0A2.25 2.25 0 0 1 21 12v6a2.25 2.25 0 0 1-2.25 2.25H5.25A2.25 2.25 0 0 1 3 18v-6c0-.98.626-1.813 1.5-2.122"/></svg>
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 256 256"
+class="size-5">
+<rect width="256" height="256" fill="none"></rect>
+<line x1="208" y1="128" x2="128" y2="208" fill="none" stroke="currentColor" stroke-linecap="round"
+stroke-linejoin="round" stroke-width="32"></line>
+<line x1="192" y1="40" x2="40" y2="192" fill="none" stroke="currentColor" stroke-linecap="round"
+stroke-linejoin="round" stroke-width="32"></line>
+</svg>


 </span>
@@ -463,7 +470,33 @@ <h1 class="pr-2">OpenApps</h1>
 </div>
 </div>
 <div class="typography w-full flex-1 *:data-[slot=alert]:first:mt-0">
-<p>Learn how to set up OpenApps, run a GPT-5 agent and make changes to the envrionment.</p>
+<p>Digital agents open the possibility of AI systems to complete tedious tasks on your behalf. For example, <code>"add an event to my calendar"</code> or even more complex multi-step tasks. Yet, today's agents are still not reliable enough for many applications. To get there, we need lots of data for training and evaluation + lots of research to develop new recipes for training and deploying reliable agents.</p>
+<p>A few definitions to settle you in:</p>
+<div class="admonition note">
+<p class="admonition-title">Digital (UI) Agent:</p>
+<p>completes tasks by directly interacting with apps in the same manner as humans (by clicking, scrolling, typing on your behalf)</p>
+</div>
+<div class="admonition note">
+<p class="admonition-title">Reward:</p>
+<p>measures whether the agent completed the given task</p>
+</div>
+<p><img alt="landing" src="../images/pomdp.png" /></p>
+<h2 id="agents-under-the-hood">Agents under the hood</h2>
+<p>Digital agents are powered by a foundation model that can understand both text and image inputs.
+Agents receive a screenshot of the current apps (in the same manner a human sees them) and the task goal ("delete Brooklyn Bridge from my favorite places"); depending on how you configure the agent, the agent can also track past actions or observations.</p>
+<p>The agent then outputs an action such as <code>click</code> or <code>type</code> that directly affects the apps. Throughout the interaction we monitor whether the action has completed the task to terminate the loop or a <code>max_steps</code> is reached.</p>
+<p>You can view configs for <code>configs/agents/default.yaml</code> containing:</p>
+<ul>
+<li>list of actions</li>
+<li><code>use_axtree</code>: produces simplified text representation of each app states as an input</li>
+<li><code>use_screenshot</code>: provides screenshot of app as an input</li>
+<li><code>save_som</code>: if true, saves set of marks in <code>log_outputs/&lt;timestamp&gt;/set_of_marks_coordinates.json</code> json (see example below)</li>
+</ul>
+<p>You can view <code>UI-Tars-1.5-7B.yaml</code> as an example of native computer-use which uses screenshots to output click, type, actions with coordinates. For an example of a multimodal agent that accepts simplified text inputs see <code>GPT-5.1.yaml</code>.</p>
+<h2 id="openapps-building-blocks-for-digital-agent-research">OpenApps: building blocks for digital agent research</h2>
+<p>OpenApps offers an easy to use environment that runs on one CPU written in Python for stuyding digital agents. OpenApps comes with six configurable apps for generating limitless data for training and evaluating digital agents.</p>
+<h3 id="hands-on-with-openapps">Hands on with OpenApps</h3>
+<p>Learn how to set up OpenApps, run a GPT-5 agent and make changes to the envrionment.</p>
 <iframe width="560" height="315" src="https://www.youtube.com/embed/gzNW_LXE7OE?si=qLh-r_CvheMIgIWd" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
 </div>
 </article>
@@ -561,6 +594,36 @@ <h1 class="pr-2">OpenApps</h1>
 <div class="flex flex-col gap-2 p-4 pt-0 text-sm">
 <p class="text-muted-foreground bg-background sticky top-0 h-6 text-xs">On This Page</p>

+
+
+
+<a href="#agents-under-the-hood"
+class="text-muted-foreground hover:text-foreground data-[active=true]:text-foreground text-[0.8rem] no-underline transition-colors data-[depth=3]:pl-4 data-[depth=4]:pl-6"
+data-active="false" data-depth="2">
+Agents under the hood
+</a>
+
+
+
+<a href="#openapps-building-blocks-for-digital-agent-research"
+class="text-muted-foreground hover:text-foreground data-[active=true]:text-foreground text-[0.8rem] no-underline transition-colors data-[depth=3]:pl-4 data-[depth=4]:pl-6"
+data-active="false" data-depth="2">
+OpenApps: building blocks for digital agent research
+</a>
+
+
+<a href="#hands-on-with-openapps"
+class="text-muted-foreground hover:text-foreground data-[active=true]:text-foreground text-[0.8rem] no-underline transition-colors data-[depth=3]:pl-4 data-[depth=4]:pl-6"
+data-active="false" data-depth="3">
+Hands on with OpenApps
+</a>
+
+
+
+
+
+
+

 </div>
 <div class="h-12"></div>
