From 05f5333f56717d9ee6c8f1086da6dc6384a2ba58 Mon Sep 17 00:00:00 2001
From: zwcolin
Date: Mon, 19 Aug 2024 15:47:36 -0400
Subject: [PATCH] update readme.md to include latest leaderboard results and some add-ons for evaluation tips

---
 README.md | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 71 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index b0a2b37..b4a922f 100644
--- a/README.md
+++ b/README.md
@@ -107,6 +107,17 @@ python src/generate.py \
 `gen--_.json`. This file stores your model's responses.
 
 Note: if you are evaluating a model that is **hosted on cloud and can only be accessed via an API**, the `--model_path` argument will correspond to the name of the model e.g., `gpt-4o-2024-05-13`. Also, in creating the custom file in the `src/generate_lib` directory, you need to implement an additional function i.e., `get_client_model` that takes in the `model_path` argument and the `api_key` argument. In addition, you need to add another `elif` statement in `get_client_fn` inside `src/generate_lib/utils.py` with instructions similar to the above. Specific instructions to implement `get_client_model` function differ by API providers, and examples are provided in `gpt.py`, `gemini.py`, `reka.py`, `claude.py`, and `qwen.py`.
+Note: you may see a header like:
+```py
+### HEADER START ###
+import os
+vlm_codebase = os.environ['VLM_CODEBASE_DIR']
+
+import sys
+sys.path.append(vlm_codebase + '/')
+### HEADER END ###
+```
+in existing implementations in `generate_lib`. This is convenient when the model needs to be run with some author-provided code and you want to load/call the model with that code: keep a local directory containing their code (e.g., a model you are developing or a codebase cloned from GitHub), and the header appends that directory to Python's module search path (`sys.path`), not the system `PATH`.
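The `get_client_model` / `get_client_fn` instructions above can be sketched as follows. This is a minimal, self-contained illustration of the dispatch pattern only: the function bodies and the model names in the `elif` chain are hypothetical stand-ins, not the actual code in `src/generate_lib/utils.py` — see `gpt.py`, `claude.py`, etc. for the real provider-specific implementations.

```python
# Hypothetical sketch of the API-model plumbing described above. Real
# implementations live in src/generate_lib; everything here other than
# the get_client_model / get_client_fn names is an illustrative stand-in.

def get_client_model(model_path, api_key):
    # For an API-hosted model, model_path is the provider's model name
    # (e.g. "gpt-4o-2024-05-13") and "loading" it just means building a
    # client; a plain dict stands in for a real SDK client here.
    client = {"api_key": api_key, "model": model_path}
    return client, model_path

def get_client_fn(model_path):
    # Mirrors the elif chain in src/generate_lib/utils.py: route each
    # family of model names to its implementation module (represented
    # here by a tag instead of an imported generate function).
    if model_path in ("gpt-4o-2024-05-13", "gpt-4o-mini"):
        return "gpt"
    elif model_path in ("claude-3-5-sonnet-20240620",):
        return "claude"
    elif model_path in ("my-new-api-model",):  # the branch you add
        return "my_model"
    else:
        raise ValueError(f"Model {model_path} is not supported")
```

The extra `elif` branch is all a new provider needs, provided its module exposes the same interface as the existing ones.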
 ### Evaluation
@@ -138,6 +149,9 @@ This python script will automatically match the `scores--_
+| Metadata | | | Reasoning | | | | | Descriptive | | | | | |
+| ------------------------- | --------------- | -------------- | --------- | ----- | ----- | ----- | ----- | ----------- | ----- | ----- | ----- | ----- | ----- |
+| Model | Weight | Size [V/L] (B) | Overall | TC | TG | NC | NG | Overall | INEX | ENUM | PATT | CNTG | COMP |
+|🎖️Human | N/A | Unknown | 80.50 | 77.27 | 77.78 | 84.91 | 83.41 | 92.10 | 91.40 | 91.20 | 95.63 | 93.38 | 92.86 |
+|🥇Claude 3.5 Sonnet | Proprietary | Unknown | 60.20 | 61.14 | 78.79 | 63.79 | 46.72 | 84.30 | 82.62 | 88.86 | 90.61 | 90.08 | 48.66 |
+|🥈GPT-4o | Proprietary | Unknown | 47.10 | 50.00 | 61.62 | 47.84 | 34.50 | 84.45 | 82.44 | 89.18 | 90.17 | 85.50 | 59.82 |
+|🥉Gemini 1.5 Pro | Proprietary | Unknown | 43.30 | 45.68 | 56.57 | 45.69 | 30.57 | 71.97 | 81.79 | 64.73 | 79.48 | 76.34 | 15.18 |
+| InternVL Chat V2.0 Pro | Proprietary | Unknown | 39.80 | 40.00 | 60.61 | 44.40 | 25.76 | 76.83 | 77.11 | 84.67 | 77.07 | 78.88 | 27.23 |
+| InternVL Chat V2.0 76B | Open | 5.9 / 70 | 38.90 | 40.00 | 59.60 | 42.67 | 24.02 | 75.17 | 77.11 | 78.69 | 76.20 | 79.13 | 32.14 |
+| GPT-4V | Proprietary | Unknown | 37.10 | 38.18 | 57.58 | 37.93 | 25.33 | 79.92 | 78.29 | 85.79 | 88.21 | 80.92 | 41.07 |
+| GPT-4o Mini | Proprietary | Unknown | 34.10 | 35.23 | 47.47 | 32.33 | 27.95 | 74.92 | 74.91 | 82.81 | 69.21 | 79.13 | 35.71 |
+| Gemini 1.5 Flash | Proprietary | Unknown | 33.90 | 36.36 | 54.55 | 30.60 | 23.58 | \- | \- | \- | \- | \- | \- |
+| InternVL Chat V2.0 26B | Open | 5.9 / 20 | 33.40 | 33.18 | 51.52 | 41.81 | 17.47 | 62.40 | 71.35 | 61.02 | 55.90 | 67.94 | 6.25 |
+| Claude 3 Sonnet | Proprietary | Unknown | 32.20 | 31.59 | 50.51 | 31.47 | 26.20 | 73.65 | 75.74 | 81.92 | 76.64 | 72.26 | 8.48 |
+| Claude 3 Haiku | Proprietary | Unknown | 31.80 | 29.77 | 45.45 | 34.48 | 27.07 | 65.08 | 69.87 | 69.98 | 64.85 | 61.83 | 8.04 |
+| Phi-3 Vision | Open | 0.3 / 4 | 31.60 | 31.36 | 46.46 | 35.78 | 21.40 | 60.48 | 67.62 | 61.18 | 54.59 | 65.39 | 6.25 |
+| MiniCPM-V2.6 (Upsize+CoT) | Open | 0.4 / 8 | 31.00 | 30.00 | 41.41 | 37.93 | 21.40 | 57.05 | 67.85 | 49.56 | 53.49 | 62.85 | 14.29 |
+| Claude 3 Opus | Proprietary | Unknown | 30.20 | 26.36 | 50.51 | 33.62 | 25.33 | 71.55 | 75.62 | 73.69 | 73.58 | 70.48 | 26.79 |
+| InternVL Chat V1.5 | Open | 5.9 / 20 | 29.20 | 30.00 | 45.45 | 32.33 | 17.47 | 58.50 | 69.63 | 52.95 | 53.06 | 64.63 | 5.80 |
+| GLM 4V 9B | Open | 4.4 / 9 | 29.10 | 30.68 | 42.42 | 33.19 | 16.16 | 57.62 | 67.97 | 61.66 | 43.45 | 45.04 | 8.48 |
+| Reka Core | Proprietary | Unknown | 28.90 | 27.50 | 41.41 | 28.45 | 26.64 | 55.60 | 58.90 | 50.52 | 65.72 | 71.25 | 10.71 |
+| Ovis 1.5 Gemma2 9B | Open | 0.4 / 9 | 28.40 | 26.14 | 44.44 | 33.19 | 20.96 | 62.60 | 64.29 | 71.75 | 56.33 | 66.16 | 5.80 |
+| Ovis 1.5 Llama3 8B | Open | 0.4 / 8 | 28.20 | 27.27 | 49.49 | 31.03 | 17.90 | 60.15 | 61.39 | 68.93 | 56.33 | 61.83 | 7.14 |
+| Cambrian 34B | Open | 1.9 / 34 | 27.30 | 24.55 | 44.44 | 27.59 | 24.89 | 59.73 | 59.31 | 70.94 | 53.28 | 64.63 | 5.36 |
+| MiniCPM-V2.6 (Upsize) | Open | 0.4 / 8 | 27.10 | 21.59 | 45.45 | 35.34 | 21.40 | 61.62 | 69.28 | 55.93 | 60.48 | 72.01 | 19.64 |
+| Reka Flash | Proprietary | Unknown | 26.60 | 26.59 | 39.39 | 30.60 | 17.03 | 56.45 | 61.39 | 48.59 | 69.87 | 72.52 | 7.14 |
+| Mini Gemini HD Yi 34B | Open | 0.5 / 34 | 25.00 | 26.59 | 43.43 | 27.16 | 11.79 | 52.68 | 53.86 | 55.04 | 65.50 | 53.94 | 2.23 |
+| InternLM XComposer2 4KHD | Open | 0.3 / 7 | 25.00 | 23.86 | 43.43 | 29.31 | 14.85 | 54.65 | 61.09 | 54.08 | 51.53 | 59.80 | 6.70 |
+| MiniCPM-V2.5 | Open | 0.4 / 8 | 24.90 | 25.23 | 43.43 | 25.43 | 15.72 | 59.27 | 62.28 | 61.90 | 56.77 | 68.96 | 10.27 |
+| Qwen VL Max | Proprietary | Unknown | 24.70 | 26.14 | 41.41 | 24.57 | 14.85 | 41.48 | 50.42 | 28.41 | 53.71 | 51.15 | 4.46 |
+| VILA 1.5 40B | Open | 5.9 / 34 | 24.00 | 21.59 | 41.41 | 25.00 | 20.09 | 38.67 | 42.88 | 29.62 | 51.31 | 50.89 | 9.82 |
+| Reka Edge | Proprietary | Unknown | 23.50 | 20.23 | 32.32 | 30.60 | 18.78 | 33.65 | 36.65 | 28.49 | 34.72 | 52.16 | 4.91 |
+| Gemini 1.0 Pro | Proprietary | Unknown | 22.80 | 20.91 | 48.48 | 18.10 | 20.09 | 54.37 | 67.97 | 39.23 | 60.48 | 62.60 | 8.93 |
+| LLaVA 1.6 Yi 34B | Open | 0.3 / 34 | 22.50 | 20.45 | 37.37 | 23.71 | 18.78 | 51.05 | 46.38 | 63.44 | 56.11 | 51.91 | 5.80 |
+| Mini Gemini HD Llama3 8B | Open | 0.5 / 8 | 19.00 | 19.77 | 36.36 | 21.12 | 7.86 | 44.42 | 49.41 | 39.23 | 51.09 | 55.98 | 1.79 |
+| CogAgent | Open | 4.4 / 7 | 18.80 | 16.82 | 32.32 | 20.69 | 14.85 | 36.30 | 45.14 | 26.80 | 43.23 | 37.15 | 6.70 |
+| InternLM XComposer2 | Open | 0.3 / 7 | 18.70 | 16.14 | 38.38 | 21.98 | 11.79 | 38.75 | 34.10 | 43.58 | 46.72 | 52.93 | 5.80 |
+| MiniCPM-V2 | Open | 0.4 / 2.4 | 18.50 | 17.95 | 33.33 | 19.40 | 12.23 | 35.77 | 39.74 | 36.56 | 26.42 | 44.53 | 5.36 |
+| IDEFICS 2 | Open | 0.4 / 7 | 18.20 | 15.45 | 35.35 | 17.24 | 17.03 | 32.77 | 36.12 | 27.28 | 40.83 | 43.26 | 3.12 |
+| IDEFICS 2 Chatty | Open | 0.4 / 7 | 17.80 | 15.45 | 34.34 | 19.83 | 13.10 | 41.55 | 34.88 | 54.56 | 45.63 | 44.27 | 6.70 |
+| MoAI | Open | 0.3 / 7 | 17.50 | 9.32 | 36.36 | 21.12 | 21.40 | 28.70 | 31.20 | 21.23 | 39.96 | 40.46 | 7.59 |
+| DeepSeek VL | Open | 0.5 / 7 | 17.10 | 16.36 | 32.32 | 19.83 | 9.17 | 45.80 | 49.11 | 45.20 | 42.79 | 60.31 | 4.91 |
+| DocOwl 1.5 Chat | Domain-specific | 0.3 / 7 | 17.00 | 14.32 | 34.34 | 15.09 | 16.59 | 37.40 | 36.83 | 49.23 | 36.68 | 22.90 | 3.12 |
+| SPHINX V2 | Open | 1.9 / 13 | 16.10 | 13.86 | 28.28 | 17.67 | 13.54 | 30.25 | 35.59 | 24.37 | 41.05 | 29.52 | 1.79 |
+| Qwen VL Plus | Proprietary | Unknown | 16.00 | 15.45 | 45.45 | 12.07 | 8.30 | 28.93 | 33.33 | 17.92 | 32.10 | 56.23 | 2.23 |
+| UReader | Domain-specific | 0.3 / 7 | 14.30 | 11.36 | 18.18 | 15.52 | 17.03 | 18.98 | 10.20 | 27.60 | 33.41 | 20.36 | 5.36 |
+| ChartLlama | Domain-specific | 0.3 / 13 | 14.20 | 8.18 | 34.34 | 9.91 | 21.40 | 19.23 | 17.14 | 12.19 | 43.89 | 28.75 | 6.70 |
+| LLaVA 1.6 Mistral 7B | Open | 0.3 / 7 | 13.90 | 11.36 | 32.32 | 16.81 | 7.86 | 35.40 | 34.70 | 33.98 | 48.91 | 42.49 | 8.48 |
+| ChartGemma | Domain-specific | 0.4 / 2 | 12.50 | 11.59 | 24.24 | 16.81 | 4.80 | 21.30 | 27.58 | 18.97 | 14.19 | 19.59 | 4.46 |
+| ChartAssistant | Domain-specific | 1.9 / 13 | 11.70 | 9.09 | 27.27 | 10.34 | 11.35 | 16.93 | 16.43 | 16.87 | 16.57 | 27.74 | 2.68 |
+| ChartInstruct-FlanT5 | Domain-specific | 0.3 / 3 | 11.70 | 7.95 | 32.32 | 9.48 | 12.23 | 15.47 | 11.68 | 17.59 | 15.94 | 29.52 | 6.70 |
+| Random (GPT-4o) | N/A | Unknown | 10.80 | 4.32 | 39.39 | 5.60 | 16.16 | 19.85 | 21.65 | 16.71 | 23.80 | 25.70 | 5.36 |
+| DocOwl 1.5 Omni | Domain-specific | 0.3 / 7 | 9.10 | 5.45 | 14.14 | 9.48 | 13.54 | 25.70 | 34.46 | 17.92 | 31.88 | 17.56 | 4.46 |
+| ChartInstruct-Llama2 | Domain-specific | 0.3 / 7 | 8.80 | 4.09 | 23.23 | 7.76 | 12.66 | 21.40 | 23.31 | 15.50 | 33.19 | 27.48 | 4.91 |
+| TinyChart | Domain-specific | 0.4 / 3 | 8.30 | 5.00 | 13.13 | 6.47 | 14.41 | 16.15 | 13.82 | 14.61 | 24.67 | 28.50 | 3.12 |
+| UniChart-ChartQA | Domain-specific | 1.9 / 8 | 5.70 | 3.41 | 6.06 | 3.45 | 12.23 | 19.32 | 9.91 | 38.26 | 12.23 | 19.08 | 0.45 |
+| TextMonkey | Domain-specific | 2 / 8 | 3.90 | 2.50 | 4.04 | 3.02 | 7.42 | 12.45 | 12.16 | 17.92 | 8.73 | 6.36 | 2.68 |
 
 ## 📜 License
 
 Our original data contributions (all data except the charts) are distributed under the [CC BY-SA 4.0](data/LICENSE) license. Our code is licensed under the [Apache 2.0](LICENSE) license. The copyright of the charts belongs to the original authors; their sources are listed in `image_metadata_val.json` and `image_metadata_test.json` under the data folder.