-
Notifications
You must be signed in to change notification settings - Fork 1.5k
refactor: browser-primitive-actions #2514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @nitpicker55555 , left some comments below, we could also support browser console execution to add more flexibility, could you take 5 browser related tasks from GAIA to have a test comparing the performance between refactored browser toolkit with the original one, and write a short summary under this PR?
def get_tools(self) -> List[FunctionTool]: | ||
return [FunctionTool(self.browse_url)] | ||
return [ | ||
FunctionTool(self.browse_url), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this def get_tools
to the bottom
def get_tools(self) -> List[FunctionTool]: | ||
return [FunctionTool(self.browse_url)] | ||
return [ | ||
FunctionTool(self.browse_url), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we still keep self.browse_url?
|
||
Args: | ||
format (Literal["markdown", "html"]): Content format. | ||
Defaults to "markdown". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docstring format for default value, same for other parts
Defaults to "markdown". | |
(default: :obj: `"markdown" `) |
camel/toolkits/browser_toolkit.py
Outdated
return "Browser session closed." | ||
return "Browser session was not initialized or already closed." | ||
|
||
def visit_page(self, url: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the current code could be simpified, now there are many layers, take this function as example:
visit_page
-> visit_page
-> wait_for_load
, could we refactor functions like this to make it more straightforward?
await self._ensure_browser_initialized() | ||
try: | ||
if mark_interactive_elements: | ||
_, file_path = await self.browser.async_get_som_screenshot( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems the naming of async_get_som_screenshot
is confusing, it's actually get_screenshot_with_marks
instead of get multi-screenshot?
FunctionTool(self.get_page_content), | ||
FunctionTool(self.download_file_by_element_id), | ||
FunctionTool(self.capture_screenshot), | ||
FunctionTool(self.ask_question_about_video), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this necessary to add this in browser toolkit? I feel it would be better remove this function, make it more clear for the functionality of different toolkits
await self._ensure_browser_initialized() | ||
return await self.browser.async_download_file_id(element_id) | ||
|
||
async def capture_screenshot( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems this function only get screenshot and didn't use VLM do further analysis?
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
…/camel into chore/browser-primitive-actions
Sorry for the late update
However, ChatAgent does not support tool responses that return images. Even if a BaseMessage is constructed, it will be treated as a string, so visual recognition cannot be decoupled and used as an independent action like other tools. I can achieve decoupling of web page recognition by returning DOM information in text form, in this PR: #2674. |
Expose primitive browser actions (e.g., visit_page, scroll_page, click_element) as independent, callable tools And add a multi-tool scenario test
Fix #1969
Checklist
Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!