A machine learning engineer and PhD researcher at the University of Amsterdam ran a benchmarking experiment to evaluate how well large language models perform on Dutch-language tasks. The researcher collected over 12,000 PDFs of Dutch state exams, extracted question-answer pairs from them, and used these to test several models, including o1-preview, o1-mini, GPT-4o, GPT-4o-mini, and Claude-3.

The results showed that o1-mini outperformed the more expensive o1-preview and GPT-4o, earning 66.75% of the possible points against 61.91% and 62.32%, respectively. Higher cost, in other words, did not translate into better performance, making o1-mini the most cost-effective of the tested models for Dutch-language tasks. All models performed better on the simpler VMBO-level questions and struggled more with the complex VWO-level ones.

The experiment also highlighted the practical challenges of benchmarking language models, chief among them the high cost of API fees and the need for more extensive testing. The researcher is interested in collaborating with Dutch institutions to expand the benchmark's scope and provide more comprehensive insights into model performance.

The findings are relevant to companies building products for Dutch-speaking users, and they underline a broader point: language models should be benchmarked on the specific tasks and languages they will face in real-world applications.
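The summary does not include the researcher's code, but the pipeline it describes (collect exam PDFs, extract question-answer pairs, query each model, tally earned versus possible points) might look roughly like the sketch below. Everything here is an assumption: the `dutch_exams` directory, the `Vraag .../Antwoord: ...` regex, and the exact-match grading are illustrative stand-ins, since real exam rubrics award graded points per question rather than binary credit.

```python
"""Minimal sketch of a PDF-exam benchmarking pipeline (illustrative only;
file layout, regex, and grading scheme are hypothetical assumptions)."""
import re
from pathlib import Path

from openai import OpenAI   # pip install openai; expects OPENAI_API_KEY
from pypdf import PdfReader  # pip install pypdf

client = OpenAI()

# Hypothetical layout: exams printed as "Vraag 1 ... Antwoord: ..."
QA_PATTERN = re.compile(
    r"Vraag\s+\d+\s+(?P<q>.+?)\s+Antwoord:\s+(?P<a>.+?)(?=Vraag\s+\d+|\Z)",
    re.S,
)

def extract_qa_pairs(pdf_path: Path) -> list[tuple[str, str]]:
    """Pull raw text out of one exam PDF and split it into (question, answer) pairs."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    return [(m["q"].strip(), m["a"].strip()) for m in QA_PATTERN.finditer(text)]

def ask_model(model: str, question: str) -> str:
    """Send one exam question to a model and return its answer text."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Beantwoord in het Nederlands:\n{question}"}],
    )
    return resp.choices[0].message.content

def score(models: list[str], pdf_dir: Path) -> None:
    """Report earned / possible points per model; here 1 point per exact match."""
    pairs = [qa for pdf in sorted(pdf_dir.glob("*.pdf")) for qa in extract_qa_pairs(pdf)]
    for model in models:
        earned = sum(ask_model(model, q).strip() == a for q, a in pairs)
        print(f"{model}: {100 * earned / len(pairs):.2f}% of possible points")

if __name__ == "__main__":
    score(["o1-mini", "gpt-4o", "gpt-4o-mini"], Path("dutch_exams"))
```

Even this toy version makes the cost problem tangible: with over 12,000 exam PDFs, every model added to the comparison multiplies the number of API calls, which is presumably why API fees dominated the benchmarking expense.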
towardsdatascience.com
