Skip to content

microsoft/UFO

Repository files navigation

UFO UFO Image: A UI-Focused Agent for Windows OS Interaction

arxivPython VersionLicense: MITDocumentationYouTube

UFO is a UI-Focused multi-agent framework to fulfill user requests on Windows OS by seamlessly navigating and operating within individual or spanning multiple applications.

🕌 Framework

UFO UFO Image operates as a multi-agent framework, encompassing:

  • HostAgent 🤖, tasked with choosing an application for fulfilling user requests. This agent may also switch to a different application when a request spans multiple applications, and the task is partially completed in the preceding application.
  • AppAgent 👾, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.
  • Application Automator 🎮, is tasked with translating actions from HostAgent and AppAgent into interactions with the application and through UI controls, native APIs or AI tools. Check out more details here.

Both agents leverage the multi-modal capabilities of GPT-4V(o) to comprehend the application UI and fulfill the user's request. For more details, please consult our technical report and documentation.

📢 News

  • 📅 2024-12-13: We have a New Release for v1.2.0!! Checkout our new features and improvements:
    1. Large Action Model (LAM) Data Collection: We have released the code and sample data for Large Action Model (LAM) data collection with UFO! Please checkout our new paper, code and documentation for more details.
    2. Bash Command Support: HostAgent also support bash command now!
    3. Bug Fixes: We have fixed some bugs, error handling, and improved the overall performance.
  • 📅 2024-09-08: We have a New Release for v1.1.0!, to allows UFO to click on any region of the application and reduces its latency by up tp 1/3!
  • 📅 2024-07-06: We have a New Release for v1.0.0!. You can check out our documentation. We welcome your contributions and feedback!
  • 📅 2024-06-28: We are thrilled to announce that our official introduction video is now available on YouTube!
  • 📅 ...
  • 📅 2024-02-14: Our technical report is online!
  • 📅 2024-02-10: UFO is released on GitHub🎈. Happy Chinese New year🐉!

🌐 Media Coverage

UFO sightings have garnered attention from various media outlets, including:

These sources provide insights into the evolving landscape of technology and the implications of UFO phenomena on various platforms.

💥 Highlights

  • First Windows Agent - UFO is the pioneering agent framework capable of translating user requests in natural language into actionable operations on Windows OS.
  • Agent as an Expert - UFO is enhanced by Retrieval Augmented Generation (RAG) from heterogeneous sources, including offline help documents, online search engines, and human demonstrations, making the agent an application "expert".
  • Rich Skill Set - UFO is equipped with a diverse set of skills to support comprehensive automation, such as mouse, keyboard, native API, and "Copilot".
  • Interactive Mode - UFO facilitates multiple sub-requests from users within the same session, enabling the seamless completion of complex tasks.
  • Agent Customization - UFO allows users to customize their own agents by providing additional information. The agent will proactively query users for details when necessary to better tailor its behavior.
  • Scalable AppAgent Creation - UFO offers extensibility, allowing users and app developers to create their own AppAgents in an easy and scalable way.

✨ Getting Started

🛠️ Step 1: Installation

UFO requires Python >= 3.10 running on Windows OS >= 10. It can be installed by running the following command:

# [optional to create conda environment]
# conda create -n ufo python=3.10
# conda activate ufo

# clone the repository
git clone https://github.com/microsoft/UFO.git
cd UFO
# install the requirements
pip install -r requirements.txt
# If you want to use the Qwen as your LLMs, uncomment the related libs.

⚙️ Step 2: Configure the LLMs

Before running UFO, you need to provide your LLM configurations individually for HostAgent and AppAgent. You can create your own config file ufo/config/config.yaml, by copying the ufo/config/config.yaml.template and editing config for HOST_AGENT and APP_AGENT as follows:

OpenAI

VISUAL_MODE: True, # Whether to use the visual mode
API_TYPE: "openai" , # The API type, "openai" for the OpenAI API.  
API_BASE: "https://api.openai.com/v1/chat/completions", # The the OpenAI API endpoint.
API_KEY: "sk-",  # The OpenAI API key, begin with sk-
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview",  # The only OpenAI model

Azure OpenAI (AOAI)

VISUAL_MODE: True, # Whether to use the visual mode
API_TYPE: "aoai" , # The API type, "aoai" for the Azure OpenAI.  
API_BASE: "YOUR_ENDPOINT", #  The AOAI API address. Format: https://{your-resource-name}.openai.azure.com
API_KEY: "YOUR_KEY",  # The aoai API key
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview",  # The only OpenAI model
API_DEPLOYMENT_ID: "YOUR_AOAI_DEPLOYMENT", # The deployment id for the AOAI API

You can also non-visial model (e.g., GPT-4) for each agent, by setting VISUAL_MODE: False and proper API_MODEL (openai) and API_DEPLOYMENT_ID (aoai). You can also optionally set an backup LLM engine in the field of BACKUP_AGENT if the above engines failed during the inference.

Non-Visual Model Configuration

You can utilize non-visual models (e.g., GPT-4) for each agent by configuring the following settings in the config.yaml file:

  • VISUAL_MODE: False # To enable non-visual mode.
  • Specify the appropriate API_MODEL (OpenAI) and API_DEPLOYMENT_ID (AOAI) for each agent.

Optionally, you can set a backup language model (LLM) engine in the BACKUP_AGENT field to handle cases where the primary engines fail during inference. Ensure you configure these settings accurately to leverage non-visual models effectively.

NOTE 💡

UFO also supports other LLMs and advanced configurations, such as customize your own model, please check the documents for more details. Because of the limitations of model input, a lite version of the prompt is provided to allow users to experience it, which is configured in config_dev.yaml.

📔 Step 3: Additional Setting for RAG (optional).

If you want to enhance UFO's ability with external knowledge, you can optionally configure it with an external database for retrieval augmented generation (RAG) in the ufo/config/config.yaml file.

We provide the following options for RAG to enhance UFO's capabilities:

Consult their respective documentation for more information on how to configure these settings.

🎉 Step 4: Start UFO

⌨️ You can execute the following on your Windows command Line (CLI):

# assume you are in the cloned UFO folder
python -m ufo --task <your_task_name>

This will start the UFO process and you can interact with it through the command line interface. If everything goes well, you will see the following message:

Welcome to use UFO🛸, A UI-focused Agent for Windows OS Interaction. 
 _   _  _____   ___
| | | ||  ___| / _ \
| | | || |_   | | | |
| |_| ||  _|  | |_| |
 \___/ |_|     \___/
Please enter your request to be completed🛸:

⚠️Reminder:

  • Before UFO executing your request, please make sure the targeted applications are active on the system.
  • The GPT-V accepts screenshots of your desktop and application GUI as input. Please ensure that no sensitive or confidential information is visible or captured during the execution process. For further information, refer to DISCLAIMER.md.

Step 5 🎥: Execution Logs

You can find the screenshots taken and request & response logs in the following folder:

./ufo/logs/<your_task_name>/

You may use them to debug, replay, or analyze the agent output.

❓Get help

  • Please first check our our documentation here.
  • ❔GitHub Issues (prefered)
  • For other communications, please contact ufo-agent@microsoft.com.

📊 Evaluation

Please consult the WindowsBench provided in Section A of the Appendix within our technical report. Here are some tips (and requirements) to aid in completing your request:

  • Prior to UFO execution of your request, ensure that the targeted application is active (though it may be minimized).
  • Please note that the output of GPT-V may not consistently align with the same request. If unsuccessful with your initial attempt, consider trying again.

📚 Citation

Our technical report paper can be found here. Note that previous AppAgent and ActAgent in the paper are renamed to HostAgent and AppAgent in the code base to better reflect their functions. If you use UFO in your research, please cite our paper:

@article{ufo,
  title={{UFO: A UI-Focused Agent for Windows OS Interaction}},
  author={Zhang, Chaoyun and Li, Liqun and He, Shilin and Zhang, Xu and Qiao, Bo and  Qin, Si and Ma, Minghua and Kang, Yu and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei and  Zhang, Qi},
  journal={arXiv preprint arXiv:2402.07939},
  year={2024}
}

📝 Todo List

  • RAG enhanced UFO.
  • Support more control using Win32 API.
  • Documentation.
  • Support local host GUI interaction model.
  • Chatbox GUI for UFO.

🎨 Related Projects

  1. If you're interested in data analytics agent frameworks, check out TaskWeaver, a code-first LLM agent framework designed for seamlessly planning and executing data analytics tasks.

  2. For more information on GUI agents, refer to our survey paper: Large Language Model-Brained GUI Agents: A Survey. You can also explore the survey through:

⚠️ Disclaimer

By choosing to run the provided code, you acknowledge and agree to the following terms and conditions regarding the functionality and data handling practices in DISCLAIMER.md

logo Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.