ProBench

A Benchmark for Open-ended Multi-domain Expert Tasks
Code · Dataset · Leaderboard · arXiv

ProBench highlights open-ended expert tasks. Here are some examples:

Image example 1
Question:

The image above represents an H&E stain of a skeletal muscle biopsy from a young boy who came into the clinic reporting muscle weakness. You are his doctor. Does the boy have Duchenne muscular dystrophy? Explain. Your answer should include an analysis of the biopsy (you can use arrows to point to various features), and be sure to list all features of the muscle that indicate diseased or healthy conditions.

Image example 2
Question:

Describe and compare your eye-tracking results from the Gaze Recorder online website. You will interpret your recording qualitatively and in comparison to the aggregated data. Inspect your recording (use the heat maps and data provided on the images). Make note of your eye movements and where you spent the most time looking (red areas).

Image example 3
Question:

These are the visual representations of the code used for SMOTE on the original data, the accuracy and F1 scores for the test and validation data, and the accuracy vs. loss graph. Interpret these results, compare them with the metrics of the original data, and briefly explain the impact of SMOTE on our data.

Image example 4
Question:

This is the result of filtering the denoised image with an ideal lowpass filter, a Gaussian filter, and a Butterworth filter, each with two cutoff frequencies, 30 and 100. 1) Why is the Butterworth D0=100 result brighter than the Butterworth D0=30 result? 2) Why does the Butterworth filter show something, while the ideal and Gaussian results are completely dark at both cutoff frequencies?

ProBench Overview

The distribution of the (a) single-round, (b) multi-linguistic, and (c) multi-round tracks. ProBench Distribution

ProBench Significance

Using ProBench, we identify (a) response limitations and (b) error types of existing MLLMs. Limitations of MLLMs

ProBench Leaderboard

Single-Round Track

Rank Model Open‑Source? Sci. Cd. CW. IE. Perc. Knowl. Arts Plan. Math. Mt. #Token 95% CI WR Elo
1 Pixtral-Large-Instruct-2411 Yes 1230 1194 1280 1242 1224 1250 1245 1221 1175 1266 715 (-8, 8) 65.97 1229
2 claude-3-5-sonnet-20241022 No 1228 1252 1259 1211 1213 1272 1236 1192 1197 1251 405 (-7, 8) 65.84 1228
3 gemini-1.5-pro-002 No 1151 1145 1105 1100 1110 1067 1107 1095 1134 1147 500 (-8, 10) 50.58 1118
4 gpt-4o-2024-05-13 No 1114 1114 1114 1114 1114 1114 1114 1114 1114 1114 491 (0, 0) 50.00 1114
5 gpt-4o-mini-2024-07-18 Yes 1049 1074 1165 1094 1096 1101 1130 1102 1037 1159 526 (-8, 10) 47.12 1094
6 gpt-4o-2024-08-06 No 1096 1112 1050 1097 995 1080 1032 1058 1175 1015 374 (-7, 7) 44.98 1079
7 gemini-1.5-flash-002 No 1025 877 1092 1007 1022 1011 993 946 1035 1087 493 (-8, 9) 35.33 1009
8 InternVL2_5-78B Yes 1083 1018 1051 1091 1031 1084 1042 1073 1065 1023 558 (-7, 10) 42.85 1064
9 Pixtral-12B-2409 Yes 1028 965 1099 1031 1024 1057 1047 1083 996 1063 659 (-5, 8) 39.1 1037
10 Aria-Chat Yes 990 982 985 937 998 1034 1019 974 973 1016 675 (-7, 8) 32.88 990
11 InternVL2_5-38B Yes 1000 979 1028 987 1021 904 932 1041 1026 933 521 (-9, 9) 32.5 987
12 Qwen2-VL-72B-Instruct Yes 1009 914 965 991 986 960 962 921 998 970 557 (-9, 9) 31.37 978
13 InternVL2_5-26B Yes 890 816 1008 894 944 876 864 964 880 896 490 (-10, 8) 22.59 900
14 InternVL2_5-8B Yes 824 806 983 880 914 840 915 895 835 868 644 (-11, 8) 20.45 878
15 Molmo-72B-0924 Yes 828 733 953 859 903 881 862 817 871 852 301 (-12, 8) 18.46 856
16 NVLM-D-72B Yes 780 877 991 810 849 835 767 881 838 725 561 (-10, 10) 16.63 834
17 Qwen2-VL-7B-Instruct Yes 803 689 827 877 861 816 736 680 858 833 787 (-9, 10) 15.40 818
18 Llama-3.2-90B-Vision-Instruct Yes 830 751 624 754 806 842 626 769 940 662 448 (-11, 10) 12.89 782
19 llava-onevision-qwen2-72b-ov Yes 696 735 762 726 767 689 663 679 853 620 360 (-11, 12) 10.09 734
20 Llama-3.2-11B-Vision-Instruct Yes 671 541 681 702 766 761 624 524 744 614 531 (-13, 16) 7.93 688
21 MiniCPM-V-2_6 Yes 644 599 767 659 812 676 673 667 656 681 646 (-12, 10) 7.97 689
22 llava-onevision-qwen2-7b-ov Yes 605 570 807 683 809 681 715 608 573 724 575 (-13, 10) 7.93 688
23 Molmo-7B-D-0924 Yes 536 304 720 631 638 655 681 531 613 603 310 (-14, 12) 5.41 617
24 Molmo-7B-O-0924 Yes 457 134 623 483 681 599 606 380 428 528 296 (-18, 19) 3.54 540

Multi-Linguistic Track

Rank Model Open‑Source? PT FR ES DE Other #Token 95% CI WR Elo
1 claude-3-5-sonnet-20241022 No 1248 1319 1335 1389 1309 485 (-21, 29) 74.58 1301
2 Pixtral-Large-Instruct-2411 Yes 1229 1496 1216 1324 1286 966 (-23, 22) 73.81 1294
3 gemini-1.5-pro-002 No 1273 1168 1131 1168 1139 629 (-20, 20) 59.11 1178
4 gpt-4o-2024-08-06 No 1159 1224 1226 1259 1114 480 (-17, 26) 60.35 1187
5 gpt-4o-2024-05-13 No 1114 1114 1114 1114 1114 585 (0, 0) 50.0 1114
6 gpt-4o-mini-2024-07-18 Yes 1038 1079 1071 1151 1099 657 (-21, 16) 45.84 1085
7 Qwen2-VL-72B-Instruct Yes 1067 1199 944 1241 999 834 (-18, 21) 47.56 1097
8 InternVL2_5-38B Yes 1038 1092 1070 1100 1044 868 (-20, 18) 43.98 1072
9 InternVL2_5-78B Yes 948 1125 1035 1123 1084 841 (-14, 20) 42.71 1063
10 Pixtral-12B-2409 Yes 935 1096 998 1077 929 1199 (-14, 22) 35.73 1012
11 Aria-Chat Yes 964 1042 983 1041 999 1014 (-23, 17) 35.33 1009
12 gemini-1.5-flash-002 No 1031 990 845 1015 815 567 (-25, 19) 28.47 954
13 NVLM-D-72B Yes 900 863 850 898 918 907 (-17, 25) 21.99 894
14 Llama-3.2-90B-Vision-Instruct Yes 905 860 824 863 864 968 (-29, 21) 20.92 883
15 Molmo-72B-0924 Yes 834 835 852 853 878 426 (-27, 19) 18.9 861
16 InternVL2_5-26B Yes 779 858 782 880 839 814 (-28, 19) 17.7 847
17 Qwen2-VL-7B-Instruct Yes 701 875 673 865 678 1216 (-24, 22) 12.25 772
18 llava-onevision-qwen2-72b-ov Yes 782 810 609 800 729 534 (-27, 24) 11.95 767
19 InternVL2_5-8B Yes 760 776 765 821 602 1021 (-22, 20) 11.95 767
20 Llama-3.2-11B-Vision-Instruct Yes 714 663 626 627 665 2027 (-29, 21) 8.4 699
21 MiniCPM-V-2_6 Yes 522 559 603 634 455 890 (-36, 35) 4.44 581
22 Molmo-7B-D-0924 Yes 445 495 577 613 505 406 (-52, 33) 4.32 576
23 llava-onevision-qwen2-7b-ov Yes 579 386 144 403 588 686 (-68, 37) 3.07 514
24 Molmo-7B-O-0924 Yes 383 256 536 246 429 512 (-73, 51) 1.95 433

Multi-Round Track

Rank Model Open‑Source? 2 3 4 5 6+ #Token 95% CI WR Elo
1 claude-3-5-sonnet-20241022 No 1260 1249 1356 1248 1321 1477 (-20, 18) 70.82 1268
2 Pixtral-Large-Instruct-2411 Yes 1233 1273 1304 1376 1253 2593 (-23, 19) 69.73 1259
3 gpt-4o-mini-2024-07-18 Yes 1147 1143 1142 1200 1151 1749 (-17, 24) 55.16 1150
4 gemini-1.5-pro-002 No 1136 1140 1107 1207 1145 1425 (-26, 19) 53.88 1141
5 gpt-4o-2024-05-13 No 1114 1114 1114 1114 1114 1563 (0, 0) 50.0 1114
6 gpt-4o-2024-08-06 No 1146 1050 1138 1023 965 1052 (-22, 18) 45.41 1082
7 InternVL2_5-78B Yes 1135 1040 1148 1015 992 2015 (-21, 20) 44.84 1078
8 Pixtral-12B-2409 Yes 1054 1008 1160 1013 1035 2264 (-19, 20) 40.48 1047
9 gemini-1.5-flash-002 No 1015 1040 1015 1119 1006 1388 (-16, 19) 38.14 1030
10 InternVL2_5-38B Yes 1003 1037 1036 913 902 1734 (-18, 21) 34.68 1004
11 Qwen2-VL-72B-Instruct Yes 1023 972 1033 936 875 1608 (-21, 19) 32.24 985
12 Aria-Chat Yes 937 913 946 887 812 2321 (-27, 12) 23.92 913
13 Molmo-72B-0924 Yes 886 817 787 920 808 967 (-28, 25) 18.64 858
14 InternVL2_5-26B Yes 881 811 805 753 638 1554 (-27, 28) 15.77 823
15 InternVL2_5-8B Yes 814 724 775 686 559 1835 (-25, 22) 11.77 764
16 llava-onevision-qwen2-72b-ov Yes 753 721 673 525 692 1176 (-31, 26) 10.3 738
17 Llama-3.2-90B-Vision-Instruct Yes 754 757 784 426 605 1350 (-36, 24) 9.88 730
18 Qwen2-VL-7B-Instruct Yes 808 622 637 557 495 2004 (-34, 25) 9.48 722
19 NVLM-D-72B Yes 770 557 602 641 682 1371 (-35, 33) 8.49 701
20 llava-onevision-qwen2-7b-ov Yes 737 591 649 N/A 512 1743 (-30, 30) 6.58 653
21 Llama-3.2-11B-Vision-Instruct Yes 741 380 487 275 490 2094 (-38, 32) 6.03 637
22 MiniCPM-V-2_6 Yes 664 575 628 530 389 1861 (-33, 37) 5.35 615
23 Molmo-7B-D-0924 Yes 672 470 523 409 618 923 (-34, 26) 5.04 604
24 Molmo-7B-O-0924 Yes 589 413 490 N/A 402 925 (-49, 37) 3.43 534
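
In every track, gpt-4o-2024-05-13 appears to serve as the fixed reference model: its scores are pinned at 1114, its WR at 50, and its CI at (0, 0). The snippet below is a minimal illustrative sketch, assuming an Arena-style Bradley-Terry fit over pairwise judge verdicts against that anchor; it is not the official ProBench evaluation code, and the function names and constants are assumptions made for illustration only.

```python
# Minimal sketch (not the official ProBench code): estimating Elo and WR from
# pairwise judge verdicts, assuming gpt-4o-2024-05-13 is held fixed as the
# reference model. Names and constants are illustrative assumptions.
import math
import random
from collections import defaultdict

ANCHOR = "gpt-4o-2024-05-13"   # assumed reference model, pinned at 1114 Elo
ANCHOR_ELO = 1114.0
SCALE = 400.0 / math.log(10)   # maps Bradley-Terry log-strengths to the Elo scale

def fit_elo(battles, steps=20000, lr=0.02, seed=0):
    """battles: list of (model_a, model_b, score), where score is 1.0 if
    model_a wins, 0.0 if model_b wins, and 0.5 for a tie."""
    rng = random.Random(seed)
    strength = defaultdict(float)   # log-strengths; the anchor stays at 0
    for _ in range(steps):          # stochastic ascent on the BT log-likelihood
        a, b, s = rng.choice(battles)
        p = 1.0 / (1.0 + math.exp(strength[b] - strength[a]))  # P(a beats b)
        if a != ANCHOR:
            strength[a] += lr * (s - p)
        if b != ANCHOR:
            strength[b] -= lr * (s - p)
    return {m: ANCHOR_ELO + SCALE * v for m, v in strength.items()}

def win_rate(elo, model):
    """Expected win rate (%) of `model` against the anchor, as in the WR column."""
    return 100.0 / (1.0 + 10 ** ((ANCHOR_ELO - elo[model]) / 400.0))
```

Under this reading, a per-model 95% CI like the one reported in each row could be estimated by refitting on bootstrap resamples of the battles.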

BibTeX

@misc{yang2025probenchjudgingmultimodalfoundation,
  title={ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks},
  author={Yan Yang and Dongxu Li and Haoning Wu and Bei Chen and Liu Liu and Liyuan Pan and Junnan Li},
  year={2025},
  eprint={2503.06885},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.06885},
}