
# Agents

We evaluated three open-source agents. We slightly modified each agent to elicit better capabilities. We also developed a "dummy" agent, used to check that the environment is configured correctly. Each agent, alongside the link to our fork, is listed below. Each agent has an associated ID which we use to identify it within our repo.

| Agent | ID | Fork |
| --- | --- | --- |
| dummy | dummy | N/A |
| AIDE | aide | https://github.com/thesofakillers/aideml |
| MLAgentBench | mlagentbench | https://github.com/JunShern/MLAgentBench |
| OpenHands | opendevin | https://github.com/thesofakillers/OpenHands |

## Prerequisites

If you want to run these agents locally, you will need to build a Docker image for each agent you want to run.

To build an image for an agent with ID <agent>, run:

```bash
export SUBMISSION_DIR=/home/submission
export LOGS_DIR=/home/logs
export CODE_DIR=/home/code
export AGENT_DIR=/home/agent

docker build --platform=linux/amd64 -t <agent> agents/<agent>/ \
  --build-arg SUBMISSION_DIR=$SUBMISSION_DIR \
  --build-arg LOGS_DIR=$LOGS_DIR \
  --build-arg CODE_DIR=$CODE_DIR \
  --build-arg AGENT_DIR=$AGENT_DIR
```
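For example, to build the image for the AIDE agent (agent ID aide):

```bash
# Builds the AIDE agent image using the directories exported above
docker build --platform=linux/amd64 -t aide agents/aide/ \
  --build-arg SUBMISSION_DIR=$SUBMISSION_DIR \
  --build-arg LOGS_DIR=$LOGS_DIR \
  --build-arg CODE_DIR=$CODE_DIR \
  --build-arg AGENT_DIR=$AGENT_DIR
```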

## Running agents

Our run_agent.py script allows you to run agents locally on a given set of competitions. In the experiments/splits/ directory, we have several files, each containing a set of competition IDs. The experiments/splits/all.txt file contains all competitions. The experiments/splits/spaceship-titanic.txt split just contains the Spaceship Titanic competition, which is useful for testing. For example, to run the dummy agent on the Spaceship Titanic competition, you can run:

```bash
python run_agent.py --agent-id dummy --competition-set experiments/splits/spaceship-titanic.txt
```

Running run_agent.py creates a "run group" directory in the runs/ directory. The run group directory contains a subdirectory for each competition that the agent was evaluated on, with the agent's logs, code, and submission. When the runs finish, a metadata.json file is created within the run group directory, summarizing the results of the runs. You can then grade the run group using this metadata.json file. For example, to grade the run group <run-group>, first use experiments/make_submission.py to generate a submission JSONL file:

```bash
python experiments/make_submission.py --metadata runs/<run-group>/metadata.json --output runs/<run-group>/submission.jsonl
```

You can then use the mlebench grade command to grade this submission:

```bash
mlebench grade --submission runs/<run-group>/submission.jsonl --output-dir runs/<run-group>
```

If you'd like to update the configuration of the container, you can edit the default container config in environment/config/container_configs/default.json, or specify a custom container config JSON file when executing run_agent.py. If you'd like to run the agent with a GPU, you can set "gpus": -1 in the container config JSON file.
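For example, here is a minimal sketch of running with a custom config that enables all GPUs. The `--container-config` flag name is an assumption; check `python run_agent.py --help` for the actual option name:

```bash
# Sketch: copy the default config, enable all GPUs, and pass the custom file.
# The --container-config flag name is an assumption; check run_agent.py --help.
cp environment/config/container_configs/default.json my_config.json

# Set "gpus": -1 in the copy (requires jq; editing the file by hand also works)
jq '.gpus = -1' my_config.json > tmp.json && mv tmp.json my_config.json

python run_agent.py --agent-id dummy \
  --competition-set experiments/splits/spaceship-titanic.txt \
  --container-config my_config.json
```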

## dummy agent

We used the dummy agent to verify that our environment was configured correctly. It performs the following checks:

  1. Prints to stdout whether its Python script main.py is running with root access
  2. Prints to stdout the Python interpreter the dummy agent is using
  3. Attempts to use the current competition's sample submission as its submission. Note that this sometimes fails, since sample submissions are not always in the expected format; for example, some competitions compress the sample submission together with the other data into a single file
  4. Checks that it cannot read the "private" data, which includes the labels of the test set
  5. Checks that it has read/write access to /home, its working directory
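
The following is a rough shell sketch of the same sanity checks, for reference only; the real dummy agent is a Python script (main.py), and the /home/data and /private paths below are illustrative assumptions rather than the exact paths used in our setup:

```bash
# Rough shell equivalents of the dummy agent's checks (the real agent is a
# Python script, main.py). The /home/data and /private paths are assumptions.
whoami                                                 # 1. running as root?
command -v python && python --version                  # 2. which interpreter?
cp /home/data/sample_submission.csv /home/submission/submission.csv \
  || echo "sample submission not usable as-is"         # 3. try the sample submission
cat /private/some_private_file 2>/dev/null \
  && echo "WARNING: private data is readable" \
  || echo "private data is not readable (expected)"    # 4. no access to private data
touch /home/.write_test && rm /home/.write_test \
  && echo "/home is writable"                          # 5. read/write access to /home
```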

## Security

By default, we run AIDE, MLAgentBench, and the dummy agent using the sysbox runtime. We recommend that users use this runtime, or another runtime with sufficient security guarantees, when running models on this benchmark. Since container runtimes like Docker can share host resources with containers, a malicious container could potentially compromise the host; sysbox is designed to mitigate this risk by providing enhanced isolation between containers and the host system.
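If you want to confirm that sysbox is registered as a Docker runtime on your host, something like the following should list sysbox-runc among the available runtimes:

```bash
# Lists the container runtimes Docker knows about; sysbox-runc should appear
# if the sysbox runtime is installed and registered with the Docker daemon.
docker info --format '{{json .Runtimes}}'
```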

OpenHands uses its own Docker containers during execution, resulting in Docker-in-Docker (DinD). The sysbox runtime does not yet support GPU passthrough in DinD (link). We were therefore required to run OpenHands in "privileged" mode, which we deemed acceptable given that our (internal) evaluation infrastructure does not rely on Docker's container-to-host isolation or the virtual machine boundary for security. We want users to acknowledge this security risk, so we require users to set the I_ACCEPT_RUNNING_PRIVILEGED_CONTAINERS environment variable to True when running OpenHands through run_agent.py.
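
For example, a sketch of setting the variable for a single run on the Spaceship Titanic split:

```bash
# Acknowledges the privileged-container risk for this invocation only
I_ACCEPT_RUNNING_PRIVILEGED_CONTAINERS=True python run_agent.py \
  --agent-id opendevin \
  --competition-set experiments/splits/spaceship-titanic.txt
```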