Upgrade to Browsergym 0.13.3 (#1)

* Patch VWA task IDs
* Add BLIP2 evaluator; patch timeout
* Actually add the captioning_fn into evaluator constructor
* downgrading ubuntu version for github tests (ServiceNow#179)
* making webarena tests not run on PRs (ServiceNow#181)
* making webarena tests not run on PRs
* making visualwebarena tests not run on PRs
* SoM bugfix (ServiceNow#185)
* version bump v0.8.1
* workflow image downgrade: ubuntu-latest -> ubuntu-22.04
* support custom observation
* add user data dir
* Benchmarks (ServiceNow#173)
* new ControlOrMeta key modifier (ServiceNow#187)
* Multi-tab fix (ServiceNow#188)
* Global demo_mode flag (ServiceNow#177)
* HighLevelActionSetArgs default value (ServiceNow#191)
* version bump v0.9.0
* Reverting workarena_l1 benchmark to original seed sampling (ServiceNow#198)
* Benchmarks update (ServiceNow#197)
* Miniwob number of seeds 10 -> 5
* remove most benchmark variants
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* New benchmark AssistantBench (ServiceNow#186)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* Default `browsergym_split` metadata for every benchmark (ServiceNow#190)
---------
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: ljang0 <54288880+ljang0@users.noreply.github.com>
Co-authored-by: Megh Thakkar <Megh-Thakkar@users.noreply.github.com>
* Fixing logging with multiple jobs (ServiceNow#182)
* Benchmark updates (ServiceNow#199)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* version bump 0.10.0
* README update (ServiceNow#200)
* Train / test splits for workarena-l2/l3 (ServiceNow#203)
* Fine-grained benchmark action sets (ServiceNow#202)
* version bump v0.10.1
* Update README.md
* Update README.md
* Benchmark.prepare_backend() (ServiceNow#204)
* save_step_info bugfix (obs=None) (ServiceNow#207)
* version bump v0.10.2
* full_reset fixes (ServiceNow#209)
* Hide all bids from obs (ServiceNow#212)
* Adding weblinx config to DEFAULT_BENCHMARKS (ServiceNow#208)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* Leaner Unicode() gym space (ServiceNow#218)
* a method to get the status of an experiment (ServiceNow#219)
* version bump v0.11.0
* Rename benchmark after subset_from_split() (ServiceNow#221)
* exp_dir sanitization (ServiceNow#222)
* get_step_info() bugfix (ServiceNow#223)
* Set webarena / visualwebarena max steps to 30 (ServiceNow#214)
* Benchmark dependencies (ServiceNow#220)
* Include nltk.download() in benchmark.prepare_backend() for webarena / visualwebarena (ServiceNow#224)
* version bump v0.11.1
* ExpResult.status minor fix (ServiceNow#225)
* version bump 0.11.2
* Fix duplicate depends_on in webarena metadata (ServiceNow#228)
* Duplicate webarena dependencies fix (ServiceNow#229)
* nltk.download() during import for webarena and visualwebarena (ServiceNow#227)
* Refactor full_reset() for webarena / visualwebarena (ServiceNow#230)
* webarena_tiny (ServiceNow#232)
* Set ExpArgs.exp_id at post-init time (ServiceNow#231)
* Remove ARIA extraction warnings (ServiceNow#233)
* Update README.md
* Update README.md
* Update README.md
* version bump v0.11.3
* ci tests fix (ServiceNow#234)
* Benchmark update for weblinx (ServiceNow#235)
* Refactor ExpArgs.exp_id generation (ServiceNow#236)
* VisualWebArena task dependencies (ServiceNow#237)
* VWA dependencies fix (ServiceNow#239)
* VWA evaluator fix, missing captioning_fn (ServiceNow#240)
* version bump v0.12.0
* Update README.md
* VWA hide huggingface progress bar (ServiceNow#241)
* WebLINX pre-download data in prepare_backend() (ServiceNow#226)
* AssistantBench + WebLINX fixes (ServiceNow#244)
* Increase assistantbench max_steps to 30
* Setting AssistantBench locale and timezone
* Dedicated AssistantBench action set
* small fix
* missing change
* Lenient frame marking in last retry (ServiceNow#245)
* WA / VWA default action set update (ServiceNow#247)
* version bump v0.13.0
* visualwebarena massage (ServiceNow#248)
* Minor fix (ServiceNow#250)
* Remove gym warnings "obs not within observation space" (ServiceNow#251)
* Lower trace level info -> debug (ServiceNow#252)
* Make env.close() usable after failure (finally block) (ServiceNow#253)
* add init script support
* VWA / WA updates (ServiceNow#254)
* Minor refactors (ServiceNow#255)
* Optional method AbstractBrowserTask.teardown()
* browsergym registration refactor
* Deal with problematic frame unmarking (ServiceNow#256)
* Add missing property exception to _get_obs() retry (ServiceNow#258)
* Bump libwebarena / libvisualwebarena dependencies (ServiceNow#257)
* Massage WebArena instance (ServiceNow#259)
* Refactor AssistantBench output directories (ServiceNow#242)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* version bump v0.13.1
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Authors update (ServiceNow#260)
* TapeAgents export for experiment results (ServiceNow#238)
* Update README.md
* Cleanup
* Add weblinx_browsergym as a dependency (ServiceNow#261)
* Typo fix (ServiceNow#264)
* Update requirements.txt to latest libvisualwebarena package that includes local hosting (ServiceNow#165)
* adding AgentInfo to __init__ for convenience (ServiceNow#166)
* libvisualwebarena==0.0.14 (ServiceNow#171) fixed the jsons file!
* Leaner traces (ServiceNow#169)
* images aren't saved in pkl files anymore, and are stuffed back in at load time
* added kwargs to control img/som saving
* saving as png, adding screenshots back into obs
* retrocompatibility for image loading
* making get_screenshots work for png and jpg
* fixing image types and closing files
* Goal refactor to allow for local image files (ServiceNow#110)
---------
Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* version bump 0.8.0
* Integrate AgentLab tests (ServiceNow#176)
* downgrading ubuntu version for github tests (ServiceNow#179)
* making webarena tests not run on PRs (ServiceNow#181)
* making webarena tests not run on PRs
* making visualwebarena tests not run on PRs
* SoM bugfix (ServiceNow#185)
* version bump v0.8.1
* workflow image downgrade: ubuntu-latest -> ubuntu-22.04
* Benchmarks (ServiceNow#173)
* new ControlOrMeta key modifier (ServiceNow#187)
* Multi-tab fix (ServiceNow#188)
* Global demo_mode flag (ServiceNow#177)
* HighLevelActionSetArgs default value (ServiceNow#191)
* version bump v0.9.0
* Reverting workarena_l1 benchmark to original seed sampling (ServiceNow#198)
* Benchmarks update (ServiceNow#197)
* Miniwob number of seeds 10 -> 5
* remove most benchmark variants
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* New benchmark AssistantBench (ServiceNow#186)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* Default `browsergym_split` metadata for every benchmark (ServiceNow#190)
---------
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: ljang0 <54288880+ljang0@users.noreply.github.com>
Co-authored-by: Megh Thakkar <Megh-Thakkar@users.noreply.github.com>
* Fixing logging with multiple jobs (ServiceNow#182)
* Benchmark updates (ServiceNow#199)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* version bump 0.10.0
* README update (ServiceNow#200)
* Train / test splits for workarena-l2/l3 (ServiceNow#203)
* Fine-grained benchmark action sets (ServiceNow#202)
* version bump v0.10.1
* Update README.md
* Update README.md
* Benchmark.prepare_backend() (ServiceNow#204)
* save_step_info bugfix (obs=None) (ServiceNow#207)
* version bump v0.10.2
* full_reset fixes (ServiceNow#209)
* Hide all bids from obs (ServiceNow#212)
* Adding weblinx config to DEFAULT_BENCHMARKS (ServiceNow#208)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* Leaner Unicode() gym space (ServiceNow#218)
* a method to get the status of an experiment (ServiceNow#219)
* version bump v0.11.0
* Rename benchmark after subset_from_split() (ServiceNow#221)
* exp_dir sanitization (ServiceNow#222)
* get_step_info() bugfix (ServiceNow#223)
* Set webarena / visualwebarena max steps to 30 (ServiceNow#214)
* Benchmark dependencies (ServiceNow#220)
* Include nltk.download() in benchmark.prepare_backend() for webarena / visualwebarena (ServiceNow#224)
* version bump v0.11.1
* ExpResult.status minor fix (ServiceNow#225)
* version bump 0.11.2
* Fix duplicate depends_on in webarena metadata (ServiceNow#228)
* Duplicate webarena dependencies fix (ServiceNow#229)
* nltk.download() during import for webarena and visualwebarena (ServiceNow#227)
* Refactor full_reset() for webarena / visualwebarena (ServiceNow#230)
* webarena_tiny (ServiceNow#232)
* Set ExpArgs.exp_id at post-init time (ServiceNow#231)
* Remove ARIA extraction warnings (ServiceNow#233)
* Update README.md
* Update README.md
* Update README.md
* version bump v0.11.3
* ci tests fix (ServiceNow#234)
* Benchmark update for weblinx (ServiceNow#235)
* Refactor ExpArgs.exp_id generation (ServiceNow#236)
* VisualWebArena task dependencies (ServiceNow#237)
* VWA dependencies fix (ServiceNow#239)
* VWA evaluator fix, missing captioning_fn (ServiceNow#240)
* version bump v0.12.0
* Update README.md
* VWA hide huggingface progress bar (ServiceNow#241)
* WebLINX pre-download data in prepare_backend() (ServiceNow#226)
* AssistantBench + WebLINX fixes (ServiceNow#244)
* Increase assistantbench max_steps to 30
* Setting AssistantBench locale and timezone
* Dedicated AssistantBench action set
* small fix
* missing change
* Lenient frame marking in last retry (ServiceNow#245)
* WA / VWA default action set update (ServiceNow#247)
* version bump v0.13.0
* visualwebarena massage (ServiceNow#248)
* Minor fix (ServiceNow#250)
* Remove gym warnings "obs not within observation space" (ServiceNow#251)
* Lower trace level info -> debug (ServiceNow#252)
* Make env.close() usable after failure (finally block) (ServiceNow#253)
* VWA / WA updates (ServiceNow#254)
* Minor refactors (ServiceNow#255)
* Optional method AbstractBrowserTask.teardown()
* browsergym registration refactor
* Deal with problematic frame unmarking (ServiceNow#256)
* Add missing property exception to _get_obs() retry (ServiceNow#258)
* Bump libwebarena / libvisualwebarena dependencies (ServiceNow#257)
* Massage WebArena instance (ServiceNow#259)
* Refactor AssistantBench output directories (ServiceNow#242)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* version bump v0.13.1
* Fix broken links
* Update README.md
* fix merging issues
* Update README.md (ServiceNow#270)
* Update README.md
* README update
* More permissive WA/VWA instance reset (ServiceNow#272)
* New debug benchmark visualwebarena_tiny (ServiceNow#271)
* Version bump v0.13.2
* Update README.md
* Metadata column fix (ServiceNow#278)
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Shunt WA / VWA unit tests
* README update
* Minor fixes (ServiceNow#281)
* version bump v0.13.3
* remove unused fluff
* revert more unintended changes
---------
Co-authored-by: Peng Qi <1572802+qipeng@users.noreply.github.com>
Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com>
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
Co-authored-by: Yanan Xie <yanan@orby.ai>
Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
Co-authored-by: oriyor <39461788+oriyor@users.noreply.github.com>
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: ljang0 <54288880+ljang0@users.noreply.github.com>
Co-authored-by: Megh Thakkar <Megh-Thakkar@users.noreply.github.com>
Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>
Co-authored-by: Oleh Shliazhko <ollmer@users.noreply.github.com>
Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>
orby-ai-engineering · Jan 18, 2025 · e8982ea
1 parent 66e8073
commit e8982ea
Showing 81 changed files with 38,345 additions and 1,004 deletions.
diff --git a/.github/workflows/pypi.yml b/.github/workflows/pypi.yml
@@ -7,7 +7,7 @@ on: [push, workflow_dispatch]
 jobs:
     build:
       name: Build
-      runs-on: ubuntu-latest
+      runs-on: ubuntu-22.04
 
       steps:
       - uses: actions/checkout@v4
@@ -32,6 +32,9 @@ jobs:
       - name: Build a binary wheel and a source tarball (browsergym-webarena)
         run: python3 -m build browsergym/visualwebarena/ --outdir dist/
 
+      - name: Build a binary wheel and a source tarball (browsergym-assistantbench)
+        run: python3 -m build browsergym/assistantbench/ --outdir dist/
+
       - name: Build a binary wheel and a source tarball (browsergym-experiments)
         run: python3 -m build browsergym/experiments/ --outdir dist/
 
@@ -49,7 +52,7 @@ jobs:
       if: startsWith(github.ref, 'refs/tags/')  # only publish to PyPI on tag pushes
       needs:
         - build
-      runs-on: ubuntu-latest
+      runs-on: ubuntu-22.04
       environment: pypi
       permissions:
         id-token: write  # IMPORTANT: mandatory for trusted publishing
@@ -68,7 +71,7 @@ jobs:
       name: Sign packages with Sigstore and upload them to GitHub Release
       needs:
       - publish-to-pypi
-      runs-on: ubuntu-latest
+      runs-on: ubuntu-22.04
 
       permissions:
         contents: write  # IMPORTANT: mandatory for making GitHub Releases

diff --git a/.github/workflows/unit_tests.yml b/.github/workflows/unit_tests.yml
@@ -5,6 +5,7 @@ on:
     branches:
       - main
   pull_request:
+  workflow_dispatch:
 
 jobs:
 
@@ -33,7 +34,7 @@ jobs:
         run: black . --check
 
   agentlab:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
 
     defaults:
       run:
@@ -86,7 +87,7 @@ jobs:
         run: pytest -n 5 --durations=10 -m 'not pricy' -v agentlab/tests/experiments/test_launch_exp.py
 
   browsergym-core:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
 
     defaults:
       run:
@@ -117,7 +118,7 @@ jobs:
         run: pytest -n 5 --durations=10 -m 'not pricy' -v tests/core
 
   browsergym-miniwob:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
 
     defaults:
       run:
@@ -163,7 +164,7 @@ jobs:
         run: pytest -n 5 --durations=10 -m 'not pricy' -v tests/miniwob
 
   browsergym-experiments:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
 
     defaults:
       run:
@@ -203,13 +204,15 @@ jobs:
           directory: "${{ github.workspace }}/miniwob-plusplus/miniwob/html"
           port: 8080
 
-      - name: Run browsergym-miniwob Unit Tests
+      - name: Run browsergym-experiments Unit Tests
         env:
           MINIWOB_URL: "http://localhost:8080/miniwob/"
+          BROWSERGYM_WEBLINX_CACHE_DIR: "${{ runner.temp }}/weblinx_data"
         run: pytest -n 5 --durations=10 -m 'not pricy' -v tests/experiments
 
   browsergym-webarena-fast:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
+    if: ${{ false && startsWith(github.ref, 'refs/heads/main') }}
 
     defaults:
       run:
@@ -248,7 +251,7 @@ jobs:
         run: pytest -n 5 --durations=10 -m 'not slow and not pricy' --slowmo 1000 -v tests/webarena
 
   browsergym-webarena-slow:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
     needs:
       - browsergym-webarena-fast
 
@@ -289,7 +292,8 @@ jobs:
         run: pytest -n 5 --durations=10 -m 'slow and not pricy' --slowmo 1000 -v tests/webarena
 
   browsergym-visualwebarena-fast:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
+    if: ${{ false && startsWith(github.ref, 'refs/heads/main') }}
 
     defaults:
       run:
@@ -328,7 +332,7 @@ jobs:
           pytest -n 5 --durations=10 -m 'not slow and not pricy' --slowmo 1000 -v tests/visualwebarena
 
   browsergym-visualwebarena-slow:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-22.04
     needs:
       - browsergym-visualwebarena-fast
 
@@ -368,3 +372,42 @@ jobs:
         run: |
           pytest -n 5 --durations=10 -m 'slow and not pricy and not serial' --slowmo 1000 -v tests/visualwebarena
           pytest --durations=10 -m 'slow and not pricy and serial' --slowmo 1000 -v tests/visualwebarena
+
+  browsergym-assistantbench:
+    runs-on: ubuntu-22.04
+
+    defaults:
+      run:
+        shell: bash -l {0}
+
+    steps:
+      - name: Checkout Repository
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '>=3.10'
+          cache: 'pip' # caching pip dependencies
+
+      - name: Pip install
+        working-directory: ./dev
+        run: pip install -r requirements.txt
+
+      - name: Pip list
+        run: pip list
+
+      - name: Install Playwright
+        run: playwright install chromium --with-deps
+
+      - name: Run browsergym-assistantbench Unit Tests
+        env:
+          VWA_CLASSIFIEDS: "${{ vars.VWA_CLASSIFIEDS }}"
+          VWA_CLASSIFIEDS_RESET_TOKEN: "${{ vars.VWA_CLASSIFIEDS_RESET_TOKEN }}"
+          VWA_SHOPPING: "${{ vars.VWA_SHOPPING }}"
+          VWA_REDDIT: "${{ vars.VWA_REDDIT }}"
+          VWA_WIKIPEDIA: "${{ vars.VWA_WIKIPEDIA }}"
+          VWA_HOMEPAGE: "${{ vars.VWA_HOMEPAGE }}"
+          OPENAI_API_KEY: ""
+        run: |
+          pytest -n 5 --durations=10 -m 'not pricy' --slowmo 1000 -v tests/assistantbench
diff --git a/.gitignore b/.gitignore
@@ -138,4 +138,15 @@ error_logs.txt
 # tests
 tests/results
 tmp.py
-.vscode/settings.json
+.vscode/**
+
+# demo and results
+results/
+
+.vscode/launch.json
+
+# assistantbench
+tests/assistantbench/assistantbench-predictions-test.jsonl
+
+# weblinx
+bg_wl_data/
diff --git a/Makefile b/Makefile
@@ -1,12 +1,12 @@
 install:
 	@echo "--- 🚀 Installing project dependencies ---"
-	pip install -e ./browsergym/core -e ./browsergym/miniwob -e ./browsergym/webarena -e ./browsergym/visualwebarena/ -e ./browsergym/experiments -e ./browsergym/
-	playwright install chromium --with-deps
+	pip install -e ./browsergym/core -e ./browsergym/miniwob -e ./browsergym/webarena -e ./browsergym/visualwebarena/ -e ./browsergym/experiments -e ./browsergym/assistantbench -e ./browsergym/
+	playwright install chromium
 
 install-demo:
 	@echo "--- 🚀 Installing demo dependencies ---"
 	pip install -r demo_agent/requirements.txt
-	playwright install chromium --with-deps
+	playwright install chromium
 
 demo:
 	@echo "--- 🚀 Running demo agent ---"