深圳云主机云服务器租赁AI操作网页:browser-use和AI大模型互动解析

亚马逊云服务器免费 一、browser-use介绍 AI大模型大家最熟悉的就是输出文本 ,图片,或者视频,在工作与生活中或许已经能够驾轻就熟地使用。现在MCP(Model Context Proto···

亚马逊云服务器免费

一、browser-use介绍

AI大模型大家最熟悉的就是输出文本 ,图片,或者视频,在工作与生活中或许已经能够驾轻就熟地使用。现在MCP(Model Context Protocol)的兴起,怎么使用AI大模型代替人工进行操作,进入了大家视野。有一个开源的 AI 驱动的浏览器自动化框架,对 AI 代理友好,能够实现在线任务的自动化流程,并且在开源社区具有强大的影响力(GitHub 星数 63.4k+),它就是browser-use。

那么browser-use具体有多强大,可以通过几个demo感受下:

1.将杂货商品添加到购物车并完成结账。

2.将我最新的LinkedIn关注者添加到Salesforce的潜在客户列表中。

3.请阅读我的简历,寻找机器学习职位,将它们保存到文件中,然后在新标签页中开始申请。如果需要帮助,请随时询问我。

4.在Google文档中为我父亲写一封感谢信,感谢他一直以来的付出,并将文档另存为PDF格式。

5.在网站Hugging Face上查找使用 CC BY-SA 4.0 许可证的模型,按点赞数排序,将前5名保存到文件中。

以上列举的任务,可以在github的README中看到demo演示:https://github.com/browser-use/browser-use,AI大模型都丝滑的完成了输入给它的任务。

问:那么browser-use是怎么做到这一切的?

答:高超地与AI大模型互动的技巧。

本篇侧重于browser-use和AI大模型互动方式进行深度解析,在browser-use这个优秀实践中,探究AI大模型交互过程,有哪些独特的技巧和方式,分析实现过程,提出可以应用在其他与大模型交互场景的Tips,给出一点点对于AI大模型其他应用场景的启发。

二、解析

browsers-use与AI大模型交互的方式非常的高效,离不开它精巧的prompt设计,和一系列不同类型的Message紧密组合的输入。

3.1 完整输入

以一个非常简单的任务为例:1. 打开网址https://one.console.con.env136.shuguang.com 2.输入用户名和密码进行登陆

browser-use会基于上面的任务,不断地拆解成小目标以及和大模型进行交互,下面是与大模型交互的其中一次输入。

[SystemMessage( content=You are an AI agent designed to automate browser tasks. Your goal is to accomplish the ultimate task following the rules. Input Format Task Previous steps Current URL Open Tabs Interactive Elements [index]text - index: Numeric identifierforinteraction - type:HTML elementtype(button, input, etc.) - text: Element description Example: [33]Submit Form - Only elements with numeric indexesin[] are interactive - elements without [] provide only context Response Rules 1. RESPONSE FORMAT: You must ALWAYS respond with valid JSONinthisexact format: {"current_state": {"evaluation_previous_goal":"Success|Failed|Unknown - Analyze the current elements and the image to check if the previous goals/actions are successful like intended by the task. Mention if something unexpected happened. Shortly state why/why not", "memory":"Description of what has been done and what you need to remember. Be very specific. Count here ALWAYS how many times you have done something and how many remain. E.g. 0 out of 10 websites analyzed. Continue with abc and xyz", "next_goal":"What needs to be done with the next immediate action"}, "action":[{"one_action_name": {// action-specific parameter}}, // ... more actions in sequence]} 2. ACTIONS: You can specify multiple actions in the list to be executed in sequence. But always specify only one action name per item. Use maximum {max_actions} actions per sequence. Common action sequences: - Form filling: [{"input_text": {"index": 1, "text": "username"}}, {"input_text": {"index": 2, "text": "password"}}, {"click_element": {"index": 3}}] - Navigation and extraction: [{"go_to_url": {"url": "https://example.com"}}, {"extract_content": {"goal": "extract the names"}}] - Actions are executed in the given order - If the page changes after an action, the sequence is interrupted and you get the new state. - Only provide the action sequence until an action which changes the page state significantly. - Try to be efficient, e.g. fill forms at once, or chain actions where nothing changes on the page - only use multiple actions if it makes sense. 3. ELEMENT INTERACTION: - Only use indexes of the interactive elements - Elements marked with "[]Non-interactive text" are non-interactive 4. NAVIGATION & ERROR HANDLING: - If no suitable elements exist, use other functions to complete the task - If stuck, try alternative approaches - like going back to a previous page, new search, new tab etc. - Handle popups/cookies by accepting or closing them - Use scroll to find elements you are looking for - If you want to research something, open a new tab instead of using the current tab - If captcha pops up, try to solve it - else try a different approach - If the page is not fully loaded, use wait action 5. TASK COMPLETION: - Use the done action as the last action as soon as the ultimate task is complete - Dont use "done" before you are done with everything the user asked you, except you reach the last step of max_steps. - If you reach your last step, use the done action even if the task is not fully finished. Provide all the information you have gathered so far. If the ultimate task is completly finished set success to true. If not everything the user asked for is completed set success in done to false! - If you have to do something repeatedly for example the task says for "each", or "for all", or "x times", count always inside "memory" how many times you have done it and how many remain. Don stop until you have completed like the task asked you. Only call done after the last step. - Don hallucinate actions - Make sure you include everything you found out for the ultimate task in the done text parameter. Do not just say you are done, but include the requested information of the task. 6. VISUAL CONTEXT: - When an image is provided, use it to understand the page layout - Bounding boxes with labels on their top right corner correspond to element indexes 7. Form filling: - If you fill an input field and your action sequence is interrupted, most often something changed e.g. suggestions popped up under the field. 8. Long tasks: - Keep track of the status and subresults in the memory. 9. Extraction: - If your task is to find information - call extract_content on the specific pages to get and store the information. Your responses must be always JSON with the specified format. ,additional_kwargs={}, response_metadata={}), HumanMessage( content=Your ultimate task is: """1. 打开网址https://one.console.con.env136.shuguang.com 2.输入用户名和密码 """. If you achieved your ultimate task, stop everything and use the done action in the next step to complete the task. If not, continue as usual., additional_kwargs={}, response_metadata={} ), HumanMessage( content=Example output:, additional_kwargs={}, response_metadata={} ), AIMessage( content=, additional_kwargs={}, response_metadata={}, tool_calls=[ {name:AgentOutput,args: {current_state: {evaluation_previous_goal:Success - I opend the first page,memory:Starting with the new task. I have completed 1/10 steps,next_goal:Click on company a},action: [{click_element: {index:0}}]},id:1,type:tool_call}] ), ToolMessage( content=Browser started, tool_call_id=1), HumanMessage( content=[Your task history memory starts here], additional_kwargs={}, response_metadata={} ), AIMessage(content=, additional_kwargs={}, response_metadata={}, tool_calls=[{name:AgentOutput,args: {current_state: {evaluation_previous_goal:Unknown - The browser has just started, and the current page is blank. No previous actions have been taken yet.,memory:0 out of 4 steps completed. Need to open the URL https://one.console.con.env136.shuguang.com next.,next_goal:Navigate to the specified URL},action: [{go_to_url: {url:https://one.console.con.env136.shuguang.com}}]},id:2,type:tool_call}]), ToolMessage(content=, tool_call_id=2), HumanMessage(content=Action result: Navigated to https://one.console.con.env136.shuguang.com, additional_kwargs={}, response_metadata={}), HumanMessage(content=[ {type:text,text:" [Task history memory ends] [Current state starts here] The following is one-time information - if you need to remember it write it to memory: Current url: https://one.console.con.env136.shuguang.com/ascm/login?oauth_callback=https%3A%2F%2Fone.console.con.env136.shuguang.com%2F Available tabs: [TabInfo(page_id=0, url=https://one.console.con.env136.shuguang.com/ascm/login?oauth_callback=https%3A%2F%2Fone.console.con.env136.shuguang.com%2F, title=登录 · 飞天企业版)] Interactive elements from top layer of the current page inside the viewport: [Start of page] 飞天企业版 政企数智创新的同行者 飞天企业版 政企数智创新的同行者 [0] [1] [2] 飞天企业版 [3] 简体中文 你好, 欢迎访问 飞天企业版 [4] [5] [6] [7] [8] [9] [10]账号登录/> [11] © 2009-现在 阿里云计算有限公司。版权所有 [End of page] Current step: 2/100Current date and time: 2025-03-28 14:58 "}, {type:image_url,image_url: {url:避免过大暂时去掉image的base64格式的信息}} ], additional_kwargs={}, response_metadata={}) ]

输入是一个列表,包含各种类型的Message,这些概念常见于 LangChain、OpenAI API 等工具中,帮助区分不同角色的交互。

browser-use使用的LangChain工具,也同样用这个类型对输入进行封装。

接下来会针对具体的输入内容进行详细拆解。

3.2 系统提示(SystemMessage)

这一次输入的Message列表中,第一个就是SystemMessage。

SystemMessage类型一般是用来定义AI角色和对话规则的指令,内容不随着输入的任务而改变,通常在对话开始时设置,指导 AI 的行为和输出格式。

browser-use采用的给出SystemMessage的方式是通过Markdown的文件书写,输入时转成字符串输入,可以看到清晰的结构:

Tips1:大模型可以识别Markdown的文本格式,使用Markdown的格式,实现清晰的语言架构,增加可读性,适用于固定的SystemMessage。这是一种对于大模型友好的表达方式。

这里browser-use设计SystemMessage的内容遵循着一般的大模型的设计原则:约束 AI 的行为、设定对话规则和控制输出格式。具体内容如下:

3.2.1 指定 AI 以特定身份回答问题

要求 AI 以特定身份回答问题:原文:You are an AI agent designed to automate browser tasks.Your goal is to accomplish the ultimate task following the rules.(你是一个专门设计用于自动化浏览器任务的 AI 智能体,你的目标是遵守下面的规则,完成最终任务)

三网云服务器腾讯云

3.2.2 约定输入的格式(Input Format)

browser-use列举的可能的输入为:

TaskPreviousstepsCurrent URLOpen TabsInteractiveElements

解释其术语以及表现形式,例如Interactive Elements(可交互的页面元素)这个,详细解释其描述格式:

[index]<type>text

您好:云优数据云计算 www.yunyoushuju.cn 2核2G6M最低19.9元/月 欢迎开机

发表评论

评论列表
未查询到任何数据!