Upgrade to Browsergym 0.13.3 (#1)
* Patch VWA task IDs

* Add BLIP2 evaluator; patch timeout

* Actually add the captioning_fn into evaluator constructor

* downgrading ubuntu version for github tests (ServiceNow#179)

* making webarena tests not run on PRs (ServiceNow#181)

* making webarena tests not run on PRs

* making visualwebarena tests not run on PRs

* SoM bugfix (ServiceNow#185)

* version bump v0.8.1

* workflow image downgrade: ubuntu-latest -> ubuntu-22.04

* support custom observation

* add user data dir

* Benchmarks (ServiceNow#173)

* new ControlOrMeta key modifier (ServiceNow#187)

* Multi-tab fix (ServiceNow#188)

* Global demo_mode flag (ServiceNow#177)

* HighLevelActionSetArgs default value (ServiceNow#191)

* version bump v0.9.0

* Reverting workarena_l1 benchmark to original seed sampling (ServiceNow#198)

* Benchmarks update (ServiceNow#197)

* Miniwob number of seeds 10 -> 5

* remove most benchmark variants

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* New benchmark AssistantBench (ServiceNow#186)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Default `browsergym_split` metadata for every benchmark (ServiceNow#190)


---------

Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: ljang0 <54288880+ljang0@users.noreply.github.com>
Co-authored-by: Megh Thakkar <Megh-Thakkar@users.noreply.github.com>

* Fixing logging with multiple jobs (ServiceNow#182)

* Benchmark updates (ServiceNow#199)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* version bump 0.10.0

* README update (ServiceNow#200)

* Train / test splits for workarena-l2/l3 (ServiceNow#203)

* Fine-grained benchmark action sets (ServiceNow#202)

* version bump v0.10.1

* Update README.md

* Update README.md

* Benchmark.prepare_backend() (ServiceNow#204)

* save_step_info bugfix (obs=None) (ServiceNow#207)

* version bump v0.10.2

* full_reset fixes (ServiceNow#209)

* Hide all bids from obs (ServiceNow#212)

* Adding weblinx config to DEFAULT_BENCHMARKS (ServiceNow#208)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Leaner Unicode() gym space (ServiceNow#218)

* a method to get the status of an experiment (ServiceNow#219)

* version bump v0.11.0

* Rename benchmark after subset_from_split() (ServiceNow#221)

* exp_dir sanitization (ServiceNow#222)

* get_step_info() bugfix (ServiceNow#223)

* Set webarena / visualwebarena max steps to 30 (ServiceNow#214)

* Benchmark dependencies (ServiceNow#220)

* Include nltk.download() in benchmark.prepare_backend() for webarena / visualwebarena (ServiceNow#224)

* version bump v0.11.1

* ExpResult.status minor fix (ServiceNow#225)

* version bump 0.11.2

* Fix duplicate depends_on in webarena metadata (ServiceNow#228)

* Duplicate webarena dependencies fix (ServiceNow#229)

* nltk.download() during import for webarena and visualwebarena (ServiceNow#227)

* Refactor full_reset() for webarena / visualwebarena (ServiceNow#230)

* webarena_tiny (ServiceNow#232)

* Set ExpArgs.exp_id at post-init time (ServiceNow#231)

* Remove ARIA extraction warnings (ServiceNow#233)

* Update README.md

* Update README.md

* Update README.md

* version bump v0.11.3

* ci tests fix (ServiceNow#234)

* Benchmark update for weblinx (ServiceNow#235)

* Refactor ExpArgs.exp_id generation (ServiceNow#236)

* VisualWebArena task dependencies (ServiceNow#237)

* VWA dependencies fix (ServiceNow#239)

* VWA evaluator fix, missing captioning_fn (ServiceNow#240)

* version bump v0.12.0

* Update README.md

* VWA hide huggingface progress bar (ServiceNow#241)

* WebLINX pre-download data in prepare_backend() (ServiceNow#226)

* AssistantBench + WebLINX fixes (ServiceNow#244)

* Increase assistantbench max_steps to 30

* Setting AssistantBench locale and timezone

* Dedicated AssistantBench action set

* small fix

* missing change

* Lenient frame marking in last retry (ServiceNow#245)

* WA / VWA default action set update (ServiceNow#247)

* version bump v0.13.0

* visualwebarena massage (ServiceNow#248)

* Minor fix (ServiceNow#250)

* Remove gym warnings "obs not within observation space" (ServiceNow#251)

* Lower trace level info -> debug (ServiceNow#252)

* Make env.close() usable after failure (finally block) (ServiceNow#253)

* add init script support

* VWA / WA updates (ServiceNow#254)

* Minor refactors (ServiceNow#255)

* Optional method AbstractBrowserTask.teardown()

* browsergym registration refactor

* Deal with problematic frame unmarking (ServiceNow#256)

* Add missing property exception to _get_obs() retry (ServiceNow#258)

* Bump libwebarena / libvisualwebarena dependencies (ServiceNow#257)

* Massage WebArena instance (ServiceNow#259)

* Refactor AssistantBench output directories (ServiceNow#242)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* version bump v0.13.1

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Authors update (ServiceNow#260)

* TapeAgents export for experiment results (ServiceNow#238)

* Update README.md

* Cleanup

* Add weblinx_browsergym as a dependency (ServiceNow#261)

* Typo fix (ServiceNow#264)

* Update requirements.txt to latest libvisualwebarena package that includes local hosting (ServiceNow#165)

* adding AgentInfo to __init__ for convenience (ServiceNow#166)

* libvisualwebarena==0.0.14 (ServiceNow#171)

fixed the jsons file!

* Leaner traces (ServiceNow#169)

* images aren't saved in pkl files anymore, and are stuffed back in at load time

* added kwargs to control img/som saving

* saving as png, adding screenshots back into obs

* retrocompatibility for image loading

* making get_screenshots work for png and jpg

* fixing image types and closing files

* Goal refactor to allow for local image files (ServiceNow#110)


---------

Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* version bump 0.8.0

* Integrate AgentLab tests (ServiceNow#176)

* downgrading ubuntu version for github tests (ServiceNow#179)

* making webarena tests not run on PRs (ServiceNow#181)

* making webarena tests not run on PRs

* making visualwebarena tests not run on PRs

* SoM bugfix (ServiceNow#185)

* version bump v0.8.1

* workflow image downgrade: ubuntu-latest -> ubuntu-22.04

* Benchmarks (ServiceNow#173)

* new ControlOrMeta key modifier (ServiceNow#187)

* Multi-tab fix (ServiceNow#188)

* Global demo_mode flag (ServiceNow#177)

* HighLevelActionSetArgs default value (ServiceNow#191)

* version bump v0.9.0

* Reverting workarena_l1 benchmark to original seed sampling (ServiceNow#198)

* Benchmarks update (ServiceNow#197)

* Miniwob number of seeds 10 -> 5

* remove most benchmark variants

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* New benchmark AssistantBench (ServiceNow#186)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Default `browsergym_split` metadata for every benchmark (ServiceNow#190)


---------

Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: ljang0 <54288880+ljang0@users.noreply.github.com>
Co-authored-by: Megh Thakkar <Megh-Thakkar@users.noreply.github.com>

* Fixing logging with multiple jobs (ServiceNow#182)

* Benchmark updates (ServiceNow#199)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* version bump 0.10.0

* README update (ServiceNow#200)

* Train / test splits for workarena-l2/l3 (ServiceNow#203)

* Fine-grained benchmark action sets (ServiceNow#202)

* version bump v0.10.1

* Update README.md

* Update README.md

* Benchmark.prepare_backend() (ServiceNow#204)

* save_step_info bugfix (obs=None) (ServiceNow#207)

* version bump v0.10.2

* full_reset fixes (ServiceNow#209)

* Hide all bids from obs (ServiceNow#212)

* Adding weblinx config to DEFAULT_BENCHMARKS (ServiceNow#208)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Leaner Unicode() gym space (ServiceNow#218)

* a method to get the status of an experiment (ServiceNow#219)

* version bump v0.11.0

* Rename benchmark after subset_from_split() (ServiceNow#221)

* exp_dir sanitization (ServiceNow#222)

* get_step_info() bugfix (ServiceNow#223)

* Set webarena / visualwebarena max steps to 30 (ServiceNow#214)

* Benchmark dependencies (ServiceNow#220)

* Include nltk.download() in benchmark.prepare_backend() for webarena / visualwebarena (ServiceNow#224)

* version bump v0.11.1

* ExpResult.status minor fix (ServiceNow#225)

* version bump 0.11.2

* Fix duplicate depends_on in webarena metadata (ServiceNow#228)

* Duplicate webarena dependencies fix (ServiceNow#229)

* nltk.download() during import for webarena and visualwebarena (ServiceNow#227)

* Refactor full_reset() for webarena / visualwebarena (ServiceNow#230)

* webarena_tiny (ServiceNow#232)

* Set ExpArgs.exp_id at post-init time (ServiceNow#231)

* Remove ARIA extraction warnings (ServiceNow#233)

* Update README.md

* Update README.md

* Update README.md

* version bump v0.11.3

* ci tests fix (ServiceNow#234)

* Benchmark update for weblinx (ServiceNow#235)

* Refactor ExpArgs.exp_id generation (ServiceNow#236)

* VisualWebArena task dependencies (ServiceNow#237)

* VWA dependencies fix (ServiceNow#239)

* VWA evaluator fix, missing captioning_fn (ServiceNow#240)

* version bump v0.12.0

* Update README.md

* VWA hide huggingface progress bar (ServiceNow#241)

* WebLINX pre-download data in prepare_backend() (ServiceNow#226)

* AssistantBench + WebLINX fixes (ServiceNow#244)

* Increase assistantbench max_steps to 30

* Setting AssistantBench locale and timezone

* Dedicated AssistantBench action set

* small fix

* missing change

* Lenient frame marking in last retry (ServiceNow#245)

* WA / VWA default action set update (ServiceNow#247)

* version bump v0.13.0

* visualwebarena massage (ServiceNow#248)

* Minor fix (ServiceNow#250)

* Remove gym warnings "obs not within observation space" (ServiceNow#251)

* Lower trace level info -> debug (ServiceNow#252)

* Make env.close() usable after failure (finally block) (ServiceNow#253)

* VWA / WA updates (ServiceNow#254)

* Minor refactors (ServiceNow#255)

* Optional method AbstractBrowserTask.teardown()

* browsergym registration refactor

* Deal with problematic frame unmarking (ServiceNow#256)

* Add missing property exception to _get_obs() retry (ServiceNow#258)

* Bump libwebarena / libvisualwebarena dependencies (ServiceNow#257)

* Massage WebArena instance (ServiceNow#259)

* Refactor AssistantBench output directories (ServiceNow#242)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* version bump v0.13.1

* Fix broken links

* Update README.md

* fix merging issues

* Update README.md (ServiceNow#270)

* Update README.md

* README update

* More permissive WA/VWA instance reset (ServiceNow#272)

* New debug benchmark visualwebarena_tiny (ServiceNow#271)

* Version bump v0.13.2

* Update README.md

* Metadata column fix (ServiceNow#278)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Shunt WA / VWA unit tests

* README update

* Minor fixes (ServiceNow#281)

* version bump v0.13.3

* remove unused fluff

* revert more unintended changes

---------

Co-authored-by: Peng Qi <1572802+qipeng@users.noreply.github.com>
Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com>
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
Co-authored-by: Yanan Xie <yanan@orby.ai>
Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
Co-authored-by: oriyor <39461788+oriyor@users.noreply.github.com>
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: ljang0 <54288880+ljang0@users.noreply.github.com>
Co-authored-by: Megh Thakkar <Megh-Thakkar@users.noreply.github.com>
Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>
Co-authored-by: Oleh Shliazhko <ollmer@users.noreply.github.com>
Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>
13 people authored Jan 18, 2025
1 parent 66e8073 commit e8982ea
Showing 81 changed files with 38,345 additions and 1,004 deletions.
9 changes: 6 additions & 3 deletions .github/workflows/pypi.yml
@@ -7,7 +7,7 @@ on: [push, workflow_dispatch]
 jobs:
 build:
 name: Build
-runs-on: ubuntu-latest
+runs-on: ubuntu-22.04

 steps:
 - uses: actions/checkout@v4
@@ -32,6 +32,9 @@ jobs:
 - name: Build a binary wheel and a source tarball (browsergym-webarena)
 run: python3 -m build browsergym/visualwebarena/ --outdir dist/

+- name: Build a binary wheel and a source tarball (browsergym-assistantbench)
+run: python3 -m build browsergym/assistantbench/ --outdir dist/
+
 - name: Build a binary wheel and a source tarball (browsergym-experiments)
 run: python3 -m build browsergym/experiments/ --outdir dist/

@@ -49,7 +52,7 @@ jobs:
 if: startsWith(github.ref, 'refs/tags/') # only publish to PyPI on tag pushes
 needs:
 - build
-runs-on: ubuntu-latest
+runs-on: ubuntu-22.04
 environment: pypi
 permissions:
 id-token: write # IMPORTANT: mandatory for trusted publishing
@@ -68,7 +71,7 @@ jobs:
 name: Sign packages with Sigstore and upload them to GitHub Release
 needs:
 - publish-to-pypi
-runs-on: ubuntu-latest
+runs-on: ubuntu-22.04

 permissions:
 contents: write # IMPORTANT: mandatory for making GitHub Releases
61 changes: 52 additions & 9 deletions .github/workflows/unit_tests.yml
@@ -5,6 +5,7 @@ on:
 branches:
 - main
 pull_request:
+workflow_dispatch:

 jobs:

@@ -33,7 +34,7 @@ jobs:
 run: black . --check

 agentlab:
-runs-on: ubuntu-latest
+runs-on: ubuntu-22.04

 defaults:
 run:
@@ -86,7 +87,7 @@ jobs:
 run: pytest -n 5 --durations=10 -m 'not pricy' -v agentlab/tests/experiments/test_launch_exp.py

 browsergym-core:
-runs-on: ubuntu-latest
+runs-on: ubuntu-22.04

 defaults:
 run:
@@ -117,7 +118,7 @@ jobs:
 run: pytest -n 5 --durations=10 -m 'not pricy' -v tests/core

 browsergym-miniwob:
-runs-on: ubuntu-latest
+runs-on: ubuntu-22.04

 defaults:
 run:
@@ -163,7 +164,7 @@ jobs:
 run: pytest -n 5 --durations=10 -m 'not pricy' -v tests/miniwob

 browsergym-experiments:
-runs-on: ubuntu-latest
+runs-on: ubuntu-22.04

 defaults:
 run:
@@ -203,13 +204,15 @@ jobs:
 directory: "${{ github.workspace }}/miniwob-plusplus/miniwob/html"
 port: 8080

-- name: Run browsergym-miniwob Unit Tests
+- name: Run browsergym-experiments Unit Tests
 env:
 MINIWOB_URL: "http://localhost:8080/miniwob/"
+BROWSERGYM_WEBLINX_CACHE_DIR: "${{ runner.temp }}/weblinx_data"
 run: pytest -n 5 --durations=10 -m 'not pricy' -v tests/experiments

 browsergym-webarena-fast:
-runs-on: ubuntu-latest
+runs-on: ubuntu-22.04
+if: ${{ false && startsWith(github.ref, 'refs/heads/main') }}

 defaults:
 run:
@@ -248,7 +251,7 @@ jobs:
 run: pytest -n 5 --durations=10 -m 'not slow and not pricy' --slowmo 1000 -v tests/webarena

 browsergym-webarena-slow:
-runs-on: ubuntu-latest
+runs-on: ubuntu-22.04
 needs:
 - browsergym-webarena-fast

@@ -289,7 +292,8 @@ jobs:
 run: pytest -n 5 --durations=10 -m 'slow and not pricy' --slowmo 1000 -v tests/webarena

 browsergym-visualwebarena-fast:
-runs-on: ubuntu-latest
+runs-on: ubuntu-22.04
+if: ${{ false && startsWith(github.ref, 'refs/heads/main') }}

 defaults:
 run:
@@ -328,7 +332,7 @@ jobs:
 pytest -n 5 --durations=10 -m 'not slow and not pricy' --slowmo 1000 -v tests/visualwebarena
 browsergym-visualwebarena-slow:
-runs-on: ubuntu-latest
+runs-on: ubuntu-22.04
 needs:
 - browsergym-visualwebarena-fast

@@ -368,3 +372,42 @@ jobs:
 run: |
 pytest -n 5 --durations=10 -m 'slow and not pricy and not serial' --slowmo 1000 -v tests/visualwebarena
 pytest --durations=10 -m 'slow and not pricy and serial' --slowmo 1000 -v tests/visualwebarena
+browsergym-assistantbench:
+runs-on: ubuntu-22.04
+
+defaults:
+run:
+shell: bash -l {0}
+
+steps:
+- name: Checkout Repository
+uses: actions/checkout@v4
+
+- name: Set up Python
+uses: actions/setup-python@v5
+with:
+python-version: '>=3.10'
+cache: 'pip' # caching pip dependencies
+
+- name: Pip install
+working-directory: ./dev
+run: pip install -r requirements.txt
+
+- name: Pip list
+run: pip list
+
+- name: Install Playwright
+run: playwright install chromium --with-deps
+
+- name: Run browsergym-assistantbench Unit Tests
+env:
+VWA_CLASSIFIEDS: "${{ vars.VWA_CLASSIFIEDS }}"
+VWA_CLASSIFIEDS_RESET_TOKEN: "${{ vars.VWA_CLASSIFIEDS_RESET_TOKEN }}"
+VWA_SHOPPING: "${{ vars.VWA_SHOPPING }}"
+VWA_REDDIT: "${{ vars.VWA_REDDIT }}"
+VWA_WIKIPEDIA: "${{ vars.VWA_WIKIPEDIA }}"
+VWA_HOMEPAGE: "${{ vars.VWA_HOMEPAGE }}"
+OPENAI_API_KEY: ""
+run: |
+pytest -n 5 --durations=10 -m 'not pricy' --slowmo 1000 -v tests/assistantbench
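
Note on the `if: ${{ false && startsWith(github.ref, 'refs/heads/main') }}` conditions in this workflow: because the expression starts with `false &&`, it evaluates to false on every ref, so the two `-fast` jobs are always skipped. GitHub Actions by default also skips any job whose `needs:` dependency was skipped, so the matching `-slow` jobs should not run either. The sketch below is a simplified model of that behaviour, not the runner's actual logic; `Job`, `jobs_that_run`, and `fast_condition` are hypothetical names, and failure states and status functions such as `always()` are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    condition: bool = True  # result of the job's `if:` expression
    needs: list[str] = field(default_factory=list)


def fast_condition(ref: str) -> bool:
    # Mirrors `false && startsWith(github.ref, 'refs/heads/main')`:
    # the leading `false` makes the result False for every ref.
    return False and ref.startswith("refs/heads/main")


def jobs_that_run(jobs: list[Job]) -> set[str]:
    """Return the names of jobs that run; jobs must be listed after their dependencies."""
    ran: set[str] = set()
    for job in jobs:
        # A job runs only if its own condition holds and every dependency ran.
        if job.condition and all(dep in ran for dep in job.needs):
            ran.add(job.name)
    return ran


ref = "refs/heads/main"
jobs = [
    Job("browsergym-core"),
    Job("browsergym-webarena-fast", condition=fast_condition(ref)),
    Job("browsergym-webarena-slow", needs=["browsergym-webarena-fast"]),
    Job("browsergym-visualwebarena-fast", condition=fast_condition(ref)),
    Job("browsergym-visualwebarena-slow", needs=["browsergym-visualwebarena-fast"]),
    Job("browsergym-assistantbench"),
]
ran = jobs_that_run(jobs)
print(sorted(ran))  # ['browsergym-assistantbench', 'browsergym-core']
```

Under this model, the net effect matches the commit message entry "Shunt WA / VWA unit tests": all four WebArena / VisualWebArena jobs are disabled, while the other jobs are unaffected.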
13 changes: 12 additions & 1 deletion .gitignore
@@ -138,4 +138,15 @@ error_logs.txt
 # tests
 tests/results
 tmp.py
-.vscode/settings.json
+.vscode/**
+
+# demo and results
+results/
+
+.vscode/launch.json
+
+# assistantbench
+tests/assistantbench/assistantbench-predictions-test.jsonl
+
+# weblinx
+bg_wl_data/
6 changes: 3 additions & 3 deletions Makefile
@@ -1,12 +1,12 @@
 install:
 @echo "--- 🚀 Installing project dependencies ---"
-pip install -e ./browsergym/core -e ./browsergym/miniwob -e ./browsergym/webarena -e ./browsergym/visualwebarena/ -e ./browsergym/experiments -e ./browsergym/
-playwright install chromium --with-deps
+pip install -e ./browsergym/core -e ./browsergym/miniwob -e ./browsergym/webarena -e ./browsergym/visualwebarena/ -e ./browsergym/experiments -e ./browsergym/assistantbench -e ./browsergym/
+playwright install chromium

 install-demo:
 @echo "--- 🚀 Installing demo dependencies ---"
 pip install -r demo_agent/requirements.txt
-playwright install chromium --with-deps
+playwright install chromium

 demo:
 @echo "--- 🚀 Running demo agent ---"