Merge pull request #149 from microsoft/vyokky/dev

Vyokky/dev
microsoft · Dec 16, 2024 · 7da39f0
2 parents f34fefe + 9ac33c8
Showing 12 changed files with 414 additions and 30 deletions.
diff --git a/.gitignore b/.gitignore
@@ -35,4 +35,7 @@ scripts/*
 !vectordb/docs/example/
 !vectordb/demonstration/example.yaml
 
-.vscode
+.vscode
+
+# Ignore the record files
+tasks_status.json
diff --git a/README.md b/README.md
@@ -36,7 +36,7 @@ Both agents leverage the multi-modal capabilities of GPT-4V(o) to comprehend the
 
 ## 📢 News
 - 📅 2024-12-13: We have a **New Release for v1.2.0!**! Checkout our new features and improvements:
-    1. **Large Action Model (LAM) Data Collection:** We have released the code and sample data for Large Action Model (LAM) data collection with UFO! Please checkout our [new paper](https://arxiv.org/abs/2412.07939), [code](dataflow/README.md) and [documentation](https://microsoft.github.io/UFO/dataflow/overview/) for more details.    
+    1. **Large Action Model (LAM) Data Collection:** We have released the code and sample data for Large Action Model (LAM) data collection with UFO! Please checkout our [new paper](https://arxiv.org/abs/2412.10047), [code](dataflow/README.md) and [documentation](https://microsoft.github.io/UFO/dataflow/overview/) for more details.    
     2. **Bash Command Support:** HostAgent also support bash command now!
     3. **Bug Fixes:** We have fixed some bugs, error handling, and improved the overall performance.
 - 📅 2024-09-08: We have a **New Release for v1.1.0!**, to allows UFO to click on any region of the application and reduces its latency by up tp 1/3!

diff --git a/dataflow/README.md b/dataflow/README.md
@@ -5,7 +5,7 @@
 
 <div align="center">
 
-[![arxiv](https://img.shields.io/badge/Paper-arXiv:202402.07939-b31b1b.svg)](https://arxiv.org/abs/2402.07939)&ensp;
+[![arxiv](https://img.shields.io/badge/Paper-arXiv:2412.10047-b31b1b.svg)](https://arxiv.org/abs/2412.10047)&ensp;
 ![Python Version](https://img.shields.io/badge/Python-3776AB?&logo=python&logoColor=white-blue&label=3.10%20%7C%203.11)&ensp;
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)&ensp;
 [![Documentation](https://img.shields.io/badge/Documentation-%230ABAB5?style=flat&logo=readthedocs&logoColor=black)](https://microsoft.github.io/UFO/dataflow/overview/)&ensp;
@@ -20,13 +20,17 @@
 
 This repository contains the implementation of the **Data Collection** process for training the **Large Action Models** (LAMs) in the [**UFO**](https://arxiv.org/abs/2402.07939) project. The **Data Collection** process is designed to streamline task processing, ensuring that all necessary steps are seamlessly integrated from initialization to execution. This module is part of the [**UFO**](https://arxiv.org/abs/2402.07939) project.
 
-If you find this project useful, please consider giving a star ⭐, and cite our paper:
+If you find this project useful, please give it a star ⭐ and consider citing our paper:
 
 ```bibtex
-@article{UFO2024,
-  title={Large Action Models: From Inception to Implementation},
-  author={Microsoft},
-  year={2024}
+@misc{wang2024largeactionmodelsinception,
+      title={Large Action Models: From Inception to Implementation}, 
+      author={Lu Wang and Fangkai Yang and Chaoyun Zhang and Junting Lu and Jiaxu Qian and Shilin He and Pu Zhao and Bo Qiao and Ray Huang and Si Qin and Qisheng Su and Jiayi Ye and Yudi Zhang and Jian-Guang Lou and Qingwei Lin and Saravan Rajmohan and Dongmei Zhang and Qi Zhang},
+      year={2024},
+      eprint={2412.10047},
+      archivePrefix={arXiv},
+      primaryClass={cs.AI},
+      url={https://arxiv.org/abs/2412.10047}, 
 }
 ```
 

diff --git a/documents/docs/advanced_usage/batch_mode.md b/documents/docs/advanced_usage/batch_mode.md
@@ -0,0 +1,67 @@
+# Batch Mode
+
+Batch mode is a feature of UFO that allows the agent to automate tasks in batch.
+
+## Quick Start
+
+### Step 1: Create a Plan file
+
+Before starting the Batch mode, you need to create a plan file that contains the list of steps for the agent to follow. The plan file is a JSON file that contains the following fields:
+
+| Field  | Description                                                                                  | Type    |
+| ------ | -------------------------------------------------------------------------------------------- | ------- |
+| task   | The task description.                                                                        | String  |
+| object | The application or file to interact with.                                                    | String  |
+| close  | Determines whether to close the corresponding application or file after completing the task. | Boolean |
+
+Below is an example of a plan file:
+
+```json
+{
+    "task": "Type in a text of 'Test For Fun' with heading 1 level",
+    "object": "draft.docx",
+    "close": false
+}
+```
+
+!!! note
+    The `object` field is the application or file that the agent will interact with. The object **must be active** (can be minimized) when starting the Batch mode.
+    The structure of your files should be as follows, where `tasks` is the directory for your tasks and `files` is where your object files are stored:
+
+    - Parent
+      - tasks
+      - files
+
+
+### Step 2: Start the Batch Mode
+To start the Batch mode, run the following command:
+
+```bash
+# assume you are in the cloned UFO folder
+python ufo.py --task_name {task_name} --mode batch_normal --plan {plan_file}
+```
+
+!!! tip
+    Replace `{task_name}` with the name of the task and `{plan_file}` with the `Path_to_Parent/Plan_file`.
+
+
+
+## Evaluation
+You may want to evaluate whether the `task` was completed successfully by following the plan. UFO will call the `EvaluationAgent` to evaluate the task if `EVA_SESSION` is set to `True` in the `config_dev.yaml` file.
+
+You can check the evaluation log in the `logs/{task_name}/evaluation.log` file. 
+
+# References
+The batch mode employs a `PlanReader` to parse the plan file and create a `FromFileSession` to follow the plan. 
+
+## PlanReader
+The `PlanReader` is located in the `ufo/module/sessions/plan_reader.py` file.
+
+:::module.sessions.plan_reader.PlanReader
+
+<br>
+## FromFileSession
+
+The `FromFileSession` is located in the `ufo/module/sessions/session.py` file.
+
+:::module.sessions.session.FromFileSession
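
The plan-file format described in `batch_mode.md` can be checked before a batch run is launched. The following is a minimal sketch, not part of this PR: `load_plan` and `is_valid_plan` are hypothetical helpers, and treating `close` as optional with a default of `False` is an assumption based on how `PlanReader.get_close` reads the field. It also shows why the boolean must be written as lowercase `false`: Python's `json` module rejects `False`.

```python
import json

# Example plan text using the three documented fields.
GOOD_PLAN = """
{
    "task": "Type in a text of 'Test For Fun' with heading 1 level",
    "object": "draft.docx",
    "close": false
}
"""


def load_plan(text: str) -> dict:
    """Parse plan text and check the documented fields (hypothetical helper)."""
    plan = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on `False`
    if not isinstance(plan, dict):
        raise ValueError("plan must be a JSON object")
    for name in ("task", "object"):
        if not isinstance(plan.get(name), str):
            raise ValueError(f"field {name!r} must be a string")
    close = plan.get("close", False)  # assumption: optional, defaults to False
    if not isinstance(close, bool):
        raise ValueError("field 'close' must be a boolean")
    return {"task": plan["task"], "object": plan["object"], "close": close}


def is_valid_plan(text: str) -> bool:
    """Return True if the text parses and passes the field checks."""
    try:
        load_plan(text)
    except ValueError:
        return False
    return True
```

Calling `is_valid_plan` on each plan file before starting `ufo.py` would surface malformed plans early instead of partway through a batch.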
diff --git a/documents/docs/agents/overview.md b/documents/docs/agents/overview.md
@@ -2,12 +2,12 @@
 
 In UFO, there are four types of agents: `HostAgent`, `AppAgent`, `FollowerAgent`, and `EvaluationAgent`. Each agent has a specific role in the UFO system and is responsible for different aspects of the user interaction process:
 
-| Agent | Description |
-| --- | --- |
-| [`HostAgent`](../agents/host_agent.md) | Decomposes the user request into sub-tasks and selects the appropriate application to fulfill the request. |
-| [`AppAgent`](../agents/app_agent.md) | Executes actions on the selected application. |
-| [`FollowerAgent`](../agents/follower_agent.md) | Follows the user's instructions to complete the task. |
-| [`EvaluationAgent`](../agents/evaluation_agent.md) | Evaluates the completeness of a session or a round. |
+| Agent                                              | Description                                                                                                |
+| -------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
+| [`HostAgent`](../agents/host_agent.md)             | Decomposes the user request into sub-tasks and selects the appropriate application to fulfill the request. |
+| [`AppAgent`](../agents/app_agent.md)               | Executes actions on the selected application.                                                              |
+| [`FollowerAgent`](../agents/follower_agent.md)     | Follows the user's instructions to complete the task.                                                      |
+| [`EvaluationAgent`](../agents/evaluation_agent.md) | Evaluates the completeness of a session or a round.                                                        |
 
 In the normal workflow, only the `HostAgent` and `AppAgent` are involved in the user interaction process. The `FollowerAgent` and `EvaluationAgent` are used for specific tasks.
 
@@ -21,13 +21,13 @@ Please see below the orchestration of the agents in UFO:
 
 An agent in UFO is composed of the following main components to fulfill its role in the UFO system:
 
-| Component | Description |
-| --- | --- |
-| [`State`](../agents/design/state.md) | Represents the current state of the agent and determines the next action and agent to handle the request. |
-| [`Memory`](../agents/design/memory.md) | Stores information about the user request, application state, and other relevant data. |
-| [`Blackboard`](../agents/design/blackboard.md) | Stores information shared between agents. |
-| [`Prompter`](../agents/design/prompter.md) | Generates prompts for the language model based on the user request and application state. |
-| [`Processor`](../agents/design/processor.md) | Processes the workflow of the agent, including handling user requests, executing actions, and memory management. |
+| Component                                      | Description                                                                                                      |
+| ---------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
+| [`State`](../agents/design/state.md)           | Represents the current state of the agent and determines the next action and agent to handle the request.        |
+| [`Memory`](../agents/design/memory.md)         | Stores information about the user request, application state, and other relevant data.                           |
+| [`Blackboard`](../agents/design/blackboard.md) | Stores information shared between agents.                                                                        |
+| [`Prompter`](../agents/design/prompter.md)     | Generates prompts for the language model based on the user request and application state.                        |
+| [`Processor`](../agents/design/processor.md)   | Processes the workflow of the agent, including handling user requests, executing actions, and memory management. |
 
 ## Reference
 

diff --git a/documents/docs/dataflow/overview.md b/documents/docs/dataflow/overview.md
@@ -1,6 +1,6 @@
 # Introduction
 
-This repository contains the implementation of the **Data Collection** process for training the **Large Action Models** (LAMs) in the paper of [Large Action Models: From Inception to Implementation]. The **Data Collection** process is designed to streamline task processing, ensuring that all necessary steps are seamlessly integrated from initialization to execution. This module is part of the [**UFO**](https://arxiv.org/abs/2402.07939) project.
+This repository contains the implementation of the **Data Collection** process for training the **Large Action Models** (LAMs) in the paper of [Large Action Models: From Inception to Implementation](https://arxiv.org/abs/2412.10047). The **Data Collection** process is designed to streamline task processing, ensuring that all necessary steps are seamlessly integrated from initialization to execution. This module is part of the [**UFO**](https://arxiv.org/abs/2402.07939) project.
 
 # Dataflow
 

diff --git a/ufo/agents/agent/host_agent.py b/ufo/agents/agent/host_agent.py
@@ -39,6 +39,8 @@ def create_agent(agent_type: str, *args, **kwargs) -> BasicAgent:
             return AppAgent(*args, **kwargs)
         elif agent_type == "follower":
             return FollowerAgent(*args, **kwargs)
+        elif agent_type == "batch_normal":
+            return AppAgent(*args, **kwargs)
         else:
             raise ValueError("Invalid agent type: {}".format(agent_type))
 
@@ -233,10 +235,16 @@ def create_app_agent(
         :return: The app agent.
         """
 
-        if mode == "normal":
+        if mode in ("normal", "batch_normal"):
 
-            agent_name = "AppAgent/{root}/{process}".format(
-                root=application_root_name, process=application_window_name
+            agent_name = (
+                "AppAgent/{root}/{process}".format(
+                    root=application_root_name, process=application_window_name
+                )
+                if mode == "normal"
+                else "BatchAgent/{root}/{process}".format(
+                    root=application_root_name, process=application_window_name
+                )
             )
 
             app_agent: AppAgent = self.create_subagent(
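
A condition written as `mode == "normal" or "batch_normal"` is always truthy in Python, because the right-hand operand is a non-empty string rather than a comparison. The sketch below contrasts that form with a membership test and reproduces the agent-name selection from `create_app_agent`. The function names are hypothetical and exist only for illustration.

```python
def takes_app_branch_or(mode: str) -> bool:
    # The right operand is the non-empty string "batch_normal", so the
    # whole expression is truthy for every value of `mode`.
    return bool(mode == "normal" or "batch_normal")


def takes_app_branch_in(mode: str) -> bool:
    # The membership test compares `mode` against each allowed value.
    return mode in ("normal", "batch_normal")


def agent_name(mode: str, root: str, process: str) -> str:
    # Same naming rule as the diff: "AppAgent" for normal mode, "BatchAgent" otherwise.
    prefix = "AppAgent" if mode == "normal" else "BatchAgent"
    return f"{prefix}/{root}/{process}"
```

With the `or` form, a mode such as `"follower"` would enter the app-agent branch and be named `BatchAgent/...`; the membership test keeps it out of that branch.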

diff --git a/ufo/agents/states/host_agent_state.py b/ufo/agents/states/host_agent_state.py
@@ -198,14 +198,15 @@ def next_state(self, agent: "HostAgent") -> AppAgentState:
         :param agent: The current agent.
         :return: The state for the next step.
         """
-
+        
         # Transition to the app agent state.
         # Lazy import to avoid circular dependency.
 
         from ufo.agents.states.app_agent_state import ContinueAppAgentState
 
         return ContinueAppAgentState()
 
+
     def next_agent(self, agent: "HostAgent") -> AppAgent:
         """
         Get the agent for the next step.

diff --git a/ufo/config/config_dev.yaml b/ufo/config/config_dev.yaml
@@ -101,3 +101,9 @@ DEFAULT_PNG_COMPRESS_LEVEL: 9  # The compress level for the PNG image, 0-9, 0 is
 
 # Save UI tree
 SAVE_UI_TREE: False  # Whether to save the UI tree
+
+
+# Record the status of the tasks
+TASK_STATUS: True  # Whether to record the status of the tasks in batch execution mode.
+# TASK_STATUS_FILE # The path for the task status file.
+
diff --git a/ufo/module/sessions/plan_reader.py b/ufo/module/sessions/plan_reader.py
@@ -2,6 +2,7 @@
 # Licensed under the MIT License.
 
 import json
+import os
 from typing import List, Optional
 
 from ufo.config.config import Config
@@ -20,9 +21,19 @@ def __init__(self, plan_file: str):
         :param plan_file: The path of the plan file.
         """
 
+        self.plan_file = plan_file
         with open(plan_file, "r") as f:
             self.plan = json.load(f)
         self.remaining_steps = self.get_steps()
+        self.support_apps = ["word", "excel", "powerpoint"]
+
+    def get_close(self) -> bool:
+        """
+        Check whether the application or file should be closed after the task.
+        :return: True if it should be closed, False otherwise.
+        """
+
+        return self.plan.get("close", False)
 
     def get_task(self) -> str:
         """
@@ -46,7 +57,7 @@ def get_operation_object(self) -> str:
         :return: The operation object.
         """
 
-        return self.plan.get("object", "")
+        return self.plan.get("object", "").lower()
 
     def get_initial_request(self) -> str:
         """
@@ -76,6 +87,42 @@ def get_host_agent_request(self) -> str:
 
         return request
 
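+    def get_file_path(self) -> str:
+        """
+        Get the path of the object file under the `files` directory.
+        :return: The path of the object file.
+        """
+
+        plan_dir = os.path.dirname(os.path.abspath(self.plan_file))
+        # Caution: this swaps every "tasks" substring in the path, not only the directory name.
+        file_path = plan_dir.replace("tasks", "files")
+        file = os.path.basename(self.plan.get("object", ""))
+
+        return os.path.join(file_path, file)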
+
+    def get_support_apps(self) -> List[str]:
+        """
+        Get the list of supported apps.
+        :return: The names of the supported apps.
+        """
+
+        return self.support_apps
+
+    def get_host_request(self) -> str:
+        """
+        Get the request for the host agent.
+        :return: The request for the host agent.
+        """
+
+        task = self.get_task()
+        object_name = self.get_operation_object()
+        if object_name in self.support_apps:
+            request = task
+        else:
+            request = f"Open the application of {task}. You must output the selected application with their control text and label even if it is already open."
+
+        return request
+
     def next_step(self) -> Optional[str]:
         """
         Get the next step in the plan.
@@ -95,3 +142,11 @@ def task_finished(self) -> bool:
         """
 
         return not self.remaining_steps
+
+    def get_root_path(self) -> str:
+        """
+        Get the root path of the plan.
+        :return: The root path of the plan.
+        """
+
+        return os.path.dirname(os.path.abspath(self.plan_file))
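
`get_file_path` derives the object file's location by replacing `"tasks"` with `"files"` in the plan's directory path. Because that is a substring replacement, any other occurrence of `tasks` in the path changes too. The sketch below compares that approach with a hypothetical alternative, `resolve_object_path`, which swaps only a final `tasks` directory. It assumes the `Parent/tasks` and `Parent/files` layout from the batch mode documentation and uses POSIX-style paths so the result is the same on any platform; it is not the project's implementation.

```python
from pathlib import PurePosixPath


def resolve_by_replace(plan_file: str, obj: str) -> str:
    """Mirror of the substring approach: every "tasks" in the directory path is swapped."""
    plan_dir = str(PurePosixPath(plan_file).parent)
    files_dir = PurePosixPath(plan_dir.replace("tasks", "files"))
    return str(files_dir / PurePosixPath(obj).name)


def resolve_object_path(plan_file: str, obj: str) -> str:
    """Hypothetical alternative: swap only a final `tasks` directory for `files`."""
    plan_dir = PurePosixPath(plan_file).parent
    if plan_dir.name == "tasks":
        plan_dir = plan_dir.with_name("files")
    return str(plan_dir / PurePosixPath(obj).name)


plan_file = "/data/my_tasks_repo/tasks/plan.json"
by_replace = resolve_by_replace(plan_file, "draft.docx")
by_component = resolve_object_path(plan_file, "draft.docx")
```

Here `by_replace` points into `/data/my_files_repo/files/`, a directory that may not exist, while `by_component` stays under `/data/my_tasks_repo/files/`.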