Commit
* Patch VWA task IDs
* Add BLIP2 evaluator; patch timeout
* Actually add the captioning_fn into evaluator constructor
* downgrading ubuntu version for github tests (ServiceNow#179)
* making webarena tests not run on PRs (ServiceNow#181)
* making webarena tests not run on PRs
* making visualwebarena tests not run on PRs
* SoM bugfix (ServiceNow#185)
* version bump v0.8.1
* workflow image downgrade: ubuntu-latest -> ubuntu-22.04
* support custom observation
* add user data dir
* Benchmarks (ServiceNow#173)
* new ControlOrMeta key modifier (ServiceNow#187)
* Multi-tab fix (ServiceNow#188)
* Global demo_mode flag (ServiceNow#177)
* HighLevelActionSetArgs default value (ServiceNow#191)
* version bump v0.9.0
* Reverting workarena_l1 benchmark to original seed sampling (ServiceNow#198)
* Benchmarks update (ServiceNow#197)
* Miniwob number of seeds 10 -> 5
* remove most benchmark variants
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* New benchmark AssistantBench (ServiceNow#186)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* Default `browsergym_split` metadata for every benchmark (ServiceNow#190)
---------
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: ljang0 <54288880+ljang0@users.noreply.github.com>
Co-authored-by: Megh Thakkar <Megh-Thakkar@users.noreply.github.com>
* Fixing logging with multiple jobs (ServiceNow#182)
* Benchmark updates (ServiceNow#199)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* version bump 0.10.0
* README update (ServiceNow#200)
* Train / test splits for workarena-l2/l3 (ServiceNow#203)
* Fine-grained benchmark action sets (ServiceNow#202)
* version bump v0.10.1
* Update README.md
* Update README.md
* Benchmark.prepare_backend() (ServiceNow#204)
* save_step_info bugfix (obs=None) (ServiceNow#207)
* version bump v0.10.2
* full_reset fixes (ServiceNow#209)
* Hide all bids from obs (ServiceNow#212)
* Adding weblinx config to DEFAULT_BENCHMARKS (ServiceNow#208)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* Leaner Unicode() gym space (ServiceNow#218)
* a method to get the status of an experiment (ServiceNow#219)
* version bump v0.11.0
* Rename benchmark after subset_from_split() (ServiceNow#221)
* exp_dir sanitization (ServiceNow#222)
* get_step_info() bugfix (ServiceNow#223)
* Set webarena / visualwebarena max steps to 30 (ServiceNow#214)
* Benchmark dependencies (ServiceNow#220)
* Include nltk.download() in benchmark.prepare_backend() for webarena / visualwebarena (ServiceNow#224)
* version bump v0.11.1
* ExpResult.status minor fix (ServiceNow#225)
* version bump 0.11.2
* Fix duplicate depends_on in webarena metadata (ServiceNow#228)
* Duplicate webarena dependencies fix (ServiceNow#229)
* nltk.download() during import for webarena and visualwebarena (ServiceNow#227)
* Refactor full_reset() for webarena / visualwebarena (ServiceNow#230)
* webarena_tiny (ServiceNow#232)
* Set ExpArgs.exp_id at post-init time (ServiceNow#231)
* Remove ARIA extraction warnings (ServiceNow#233)
* Update README.md
* Update README.md
* Update README.md
* version bump v0.11.3
* ci tests fix (ServiceNow#234)
* Benchmark update for weblinx (ServiceNow#235)
* Refactor ExpArgs.exp_id generation (ServiceNow#236)
* VisualWebArena task dependencies (ServiceNow#237)
* VWA dependencies fix (ServiceNow#239)
* VWA evaluator fix, missing captioning_fn (ServiceNow#240)
* version bump v0.12.0
* Update README.md
* VWA hide huggingface progress bar (ServiceNow#241)
* WebLINX pre-download data in prepare_backend() (ServiceNow#226)
* AssistantBench + WebLINX fixes (ServiceNow#244)
* Increase assistantbench max_steps to 30
* Setting AssistantBench locale and timezone
* Dedicated AssistantBench action set
* small fix
* missing change
* Lenient frame marking in last retry (ServiceNow#245)
* WA / VWA default action set update (ServiceNow#247)
* version bump v0.13.0
* visualwebarena massage (ServiceNow#248)
* Minor fix (ServiceNow#250)
* Remove gym warnings "obs not within observation space" (ServiceNow#251)
* Lower trace level info -> debug (ServiceNow#252)
* Make env.close() usable after failure (finally block) (ServiceNow#253)
* add init script support
* VWA / WA updates (ServiceNow#254)
* Minor refactors (ServiceNow#255)
* Optional method AbstractBrowserTask.teardown()
* browsergym registration refactor
* Deal with problematic frame unmarking (ServiceNow#256)
* Add missing property exception to _get_obs() retry (ServiceNow#258)
* Bump libwebarena / libvisualwebarena dependencies (ServiceNow#257)
* Massage WebArena instance (ServiceNow#259)
* Refactor AssistantBench output directories (ServiceNow#242)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* version bump v0.13.1
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Authors update (ServiceNow#260)
* TapeAgents export for experiment results (ServiceNow#238)
* Update README.md
* Cleanup
* Add weblinx_browsergym as a dependency (ServiceNow#261)
* Typo fix (ServiceNow#264)
* Update requirements.txt to latest libvisualwebarena package that includes local hosting (ServiceNow#165)
* adding AgentInfo to __init__ for convenience (ServiceNow#166)
* libvisualwebarena==0.0.14 (ServiceNow#171)
fixed the jsons file!
* Leaner traces (ServiceNow#169)
* images aren't saved in pkl files anymore, and are stuffed back in at load time
* added kwargs to control img/som saving
* saving as png, adding screenshots back into obs
* retrocompatibility for image loading
* making get_screenshots work for png and jpg
* fixing image types and closing files
* Goal refactor to allow for local image files (ServiceNow#110)
---------
Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* version bump 0.8.0
* Integrate AgentLab tests (ServiceNow#176)
* downgrading ubuntu version for github tests (ServiceNow#179)
* making webarena tests not run on PRs (ServiceNow#181)
* making webarena tests not run on PRs
* making visualwebarena tests not run on PRs
* SoM bugfix (ServiceNow#185)
* version bump v0.8.1
* workflow image downgrade: ubuntu-latest -> ubuntu-22.04
* Benchmarks (ServiceNow#173)
* new ControlOrMeta key modifier (ServiceNow#187)
* Multi-tab fix (ServiceNow#188)
* Global demo_mode flag (ServiceNow#177)
* HighLevelActionSetArgs default value (ServiceNow#191)
* version bump v0.9.0
* Reverting workarena_l1 benchmark to original seed sampling (ServiceNow#198)
* Benchmarks update (ServiceNow#197)
* Miniwob number of seeds 10 -> 5
* remove most benchmark variants
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* New benchmark AssistantBench (ServiceNow#186)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* Default `browsergym_split` metadata for every benchmark (ServiceNow#190)
---------
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: ljang0 <54288880+ljang0@users.noreply.github.com>
Co-authored-by: Megh Thakkar <Megh-Thakkar@users.noreply.github.com>
* Fixing logging with multiple jobs (ServiceNow#182)
* Benchmark updates (ServiceNow#199)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* version bump 0.10.0
* README update (ServiceNow#200)
* Train / test splits for workarena-l2/l3 (ServiceNow#203)
* Fine-grained benchmark action sets (ServiceNow#202)
* version bump v0.10.1
* Update README.md
* Update README.md
* Benchmark.prepare_backend() (ServiceNow#204)
* save_step_info bugfix (obs=None) (ServiceNow#207)
* version bump v0.10.2
* full_reset fixes (ServiceNow#209)
* Hide all bids from obs (ServiceNow#212)
* Adding weblinx config to DEFAULT_BENCHMARKS (ServiceNow#208)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* Leaner Unicode() gym space (ServiceNow#218)
* a method to get the status of an experiment (ServiceNow#219)
* version bump v0.11.0
* Rename benchmark after subset_from_split() (ServiceNow#221)
* exp_dir sanitization (ServiceNow#222)
* get_step_info() bugfix (ServiceNow#223)
* Set webarena / visualwebarena max steps to 30 (ServiceNow#214)
* Benchmark dependencies (ServiceNow#220)
* Include nltk.download() in benchmark.prepare_backend() for webarena / visualwebarena (ServiceNow#224)
* version bump v0.11.1
* ExpResult.status minor fix (ServiceNow#225)
* version bump 0.11.2
* Fix duplicate depends_on in webarena metadata (ServiceNow#228)
* Duplicate webarena dependencies fix (ServiceNow#229)
* nltk.download() during import for webarena and visualwebarena (ServiceNow#227)
* Refactor full_reset() for webarena / visualwebarena (ServiceNow#230)
* webarena_tiny (ServiceNow#232)
* Set ExpArgs.exp_id at post-init time (ServiceNow#231)
* Remove ARIA extraction warnings (ServiceNow#233)
* Update README.md
* Update README.md
* Update README.md
* version bump v0.11.3
* ci tests fix (ServiceNow#234)
* Benchmark update for weblinx (ServiceNow#235)
* Refactor ExpArgs.exp_id generation (ServiceNow#236)
* VisualWebArena task dependencies (ServiceNow#237)
* VWA dependencies fix (ServiceNow#239)
* VWA evaluator fix, missing captioning_fn (ServiceNow#240)
* version bump v0.12.0
* Update README.md
* VWA hide huggingface progress bar (ServiceNow#241)
* WebLINX pre-download data in prepare_backend() (ServiceNow#226)
* AssistantBench + WebLINX fixes (ServiceNow#244)
* Increase assistantbench max_steps to 30
* Setting AssistantBench locale and timezone
* Dedicated AssistantBench action set
* small fix
* missing change
* Lenient frame marking in last retry (ServiceNow#245)
* WA / VWA default action set update (ServiceNow#247)
* version bump v0.13.0
* visualwebarena massage (ServiceNow#248)
* Minor fix (ServiceNow#250)
* Remove gym warnings "obs not within observation space" (ServiceNow#251)
* Lower trace level info -> debug (ServiceNow#252)
* Make env.close() usable after failure (finally block) (ServiceNow#253)
* VWA / WA updates (ServiceNow#254)
* Minor refactors (ServiceNow#255)
* Optional method AbstractBrowserTask.teardown()
* browsergym registration refactor
* Deal with problematic frame unmarking (ServiceNow#256)
* Add missing property exception to _get_obs() retry (ServiceNow#258)
* Bump libwebarena / libvisualwebarena dependencies (ServiceNow#257)
* Massage WebArena instance (ServiceNow#259)
* Refactor AssistantBench output directories (ServiceNow#242)
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* version bump v0.13.1
* Fix broken links
* Update README.md
* fix merging issues
* Update README.md (ServiceNow#270)
* Update README.md
* README update
* More permissive WA/VWA instance reset (ServiceNow#272)
* New debug benchmark visualwebarena_tiny (ServiceNow#271)
* Version bump v0.13.2
* Update README.md
* Metadata column fix (ServiceNow#278)
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Shunt WA / VWA unit tests
* README update
* Minor fixes (ServiceNow#281)
* version bump v0.13.3
* remove unused fluff
* revert more unintended changes
---------
Co-authored-by: Peng Qi <1572802+qipeng@users.noreply.github.com>
Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com>
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
Co-authored-by: Yanan Xie <yanan@orby.ai>
Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
Co-authored-by: oriyor <39461788+oriyor@users.noreply.github.com>
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: ljang0 <54288880+ljang0@users.noreply.github.com>
Co-authored-by: Megh Thakkar <Megh-Thakkar@users.noreply.github.com>
Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>
Co-authored-by: Oleh Shliazhko <ollmer@users.noreply.github.com>
Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>