German LLM Benchmark

Most large language model (LLM) benchmarks are in English, so they say little about how well a model performs in other languages. Benchmarks in other languages, such as German, are often built on publicly available data sets that may already be part of an LLM's training data, which makes them unsuitable for benchmarking. To address this, a new German-language LLM benchmark called ML•LLM was developed. It consists of two parts: ML•LLM•L, which requires logic and reasoning to answer its questions, and ML•LLM•NL, which requires knowledge of the German language or of German laws and customs.

The results show Grok from xAI as the clear leader, with DeepSeek and some OpenAI models close behind. Surprisingly, many LLMs struggle with simple tasks in German, such as counting the number of R's in a word. The reasoning models often reason in English even when given German questions, which points to a lack of German training data. The need for non-English LLM benchmarks is evident, and it is unclear whether others are working on similar projects.
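As an illustration of the kind of simple German-language check described above, the sketch below counts letter occurrences in a word and compares an LLM's reply against the ground truth. The word list, the question wording, and the `ask_llm` stub are hypothetical placeholders, not items or code from ML•LLM itself.

```python
# Minimal sketch of a letter-counting check, in the spirit of the
# "count the R's in a word" task mentioned in the article.
# The words and the ask_llm() stub are hypothetical examples,
# not taken from the ML•LLM benchmark.

def ground_truth(word: str, letter: str) -> int:
    """Count occurrences of a letter in a word, case-insensitively."""
    return word.lower().count(letter.lower())

def ask_llm(question: str) -> str:
    """Placeholder for a real LLM API call; returns a canned answer here."""
    return "3"

def check(word: str, letter: str) -> bool:
    """Ask the (stubbed) model in German and verify the numeric answer."""
    question = f"Wie viele '{letter.upper()}' gibt es im Wort '{word}'?"
    answer = ask_llm(question)
    # Extract the first run of digits from the model's reply, if any.
    digits = "".join(ch for ch in answer if ch.isdigit())
    return digits != "" and int(digits) == ground_truth(word, letter)

if __name__ == "__main__":
    for word in ["Erdbeere", "Fahrrad", "Kirschkern"]:
        print(word, "->", "correct" if check(word, "r") else "wrong")
```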