import os
import base64

# Load the footer logo and embed it in the page as a base64 data URI.
current_dir = os.path.dirname(os.path.realpath(__file__))
with open(os.path.join(current_dir, "bottom_logo.png"), "rb") as image_file:
    bottom_logo = base64.b64encode(image_file.read()).decode("utf-8")

benchname = 'KOFFVQA'

# The logo file is a PNG, so use the matching MIME type in the data URI.
Bottom_logo = f'''<img src="data:image/png;base64,{bottom_logo}" style="width:20%;display:block;margin-left:auto;margin-right:auto">'''
intro_md = f'''
# {benchname} Leaderboard

[**Leaderboard**](https://huggingface.co/spaces/maum-ai/KOFFVQA-Leaderboard) | [**KOFFVQA Arxiv**](https://arxiv.org/abs/2503.23730) | [**🤗 KOFFVQA Dataset**](https://huggingface.co/datasets/maum-ai/KOFFVQA_Data)

{benchname} is a free-form VQA benchmark dataset designed to evaluate Vision-Language Models (VLMs) in Korean language environments. Unlike traditional multiple-choice or predefined answer formats, KOFFVQA challenges models to generate open-ended, natural-language answers to visually grounded questions. This allows for a more comprehensive assessment of a model's ability to understand and generate nuanced Korean responses.

The dataset covers diverse real-world scenarios, including object attributes, object recognition, and relationships between objects.

This page will be continuously updated, and we will accept requests to add models to the leaderboard. For more details, please refer to the "Submit" tab.
'''.strip()
about_md = f'''
# About

The {benchname} benchmark is designed to evaluate and compare the performance of Vision-Language Models (VLMs) in Korean language environments.

This benchmark includes a total of 275 Korean questions across 10 tasks. The questions are open-ended, free-form VQA (Visual Question Answering) with objective answers, allowing responses without strict format constraints.
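
The dataset is hosted on the Hugging Face Hub. As a rough sketch (the exact split and column names may differ), it can be loaded with the `datasets` library:

```python
from datasets import load_dataset

# Load the KOFFVQA evaluation data from the Hugging Face Hub
# (split and column names are illustrative, not guaranteed).
ds = load_dataset("maum-ai/KOFFVQA_Data")
print(ds)
```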

## News

* **2025-04-25**: Our [leaderboard](https://huggingface.co/spaces/maum-ai/KOFFVQA-Leaderboard) has now finished evaluating a total of **81** open- and closed-source VLMs. We have also refactored the evaluation code to make it easier to use and able to evaluate a much wider range of models.
* **2025-04-01**: Our paper [KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language](https://arxiv.org/abs/2503.23730) has been released and accepted to CVPRW 2025, the Workshop on Benchmarking and Expanding AI Multimodal Approaches (BEAM 2025).
* **2025-01-21**: [Evaluation code](https://github.com/maum-ai/KOFFVQA) and [dataset](https://huggingface.co/datasets/maum-ai/KOFFVQA_Data) release.
* **2024-12-06**: Leaderboard release!

## Citation

**BibTeX:**
'''.strip() + "\n```bibtex\n" + '''
@article{kim2025koffvqa,
  title={KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language},
  author={Kim, Yoonshik and Jung, Jaeyoon},
  journal={arXiv preprint arXiv:2503.23730},
  year={2025}
}
''' + "\n```"
submit_md = f'''
# Submit

We are not accepting model addition requests at the moment. Once the request system is established, we will start accepting requests.

Wondering how your VLM stacks up in Korean? Just run it with our evaluation code and get your score, with no API key needed!

We currently use google/gemma-2-9b-it as the judge model, so there is no need to worry about API keys or usage fees.
'''.strip()
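
# A minimal sketch of how these strings might be wired into the Space's UI,
# assuming a Gradio Blocks layout; the actual app file is not shown here, so
# the tab names and structure below are illustrative only.
if __name__ == "__main__":
    import gradio as gr

    with gr.Blocks() as demo:
        gr.Markdown(intro_md)        # header and benchmark introduction
        with gr.Tab("About"):
            gr.Markdown(about_md)    # description, news, and citation
        with gr.Tab("Submit"):
            gr.Markdown(submit_md)   # submission instructions
        gr.HTML(Bottom_logo)         # base64-embedded footer logo

    demo.launch()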