Hey all! We built an open source Kotlin coding benchmark for large language models like Gemini 2.5 Pro, Claude 3.7 Sonnet, and GPT-4.
• Blog with results: https://firebender.com/blog/kotlin-bench
• GitHub repo: https://github.com/firebenders/Kotlin-bench
Why not just use an existing benchmark like SWE-bench, Aider, or Codeforces?
Many of these benchmarks, like SWE-bench, focus on Python tasks, which makes their results hard to trust for Kotlin. With Kotlin-Bench, we now have a way to track LLM progress on Kotlin tasks. This lets engineers make an informed choice about the best LLM to use, and it gives foundation model labs an incentive to make improvements that benefit the Kotlin community.
How do the evals work?
We scraped thousands of pull request/issue pairs from popular GitHub repos like Wordpress-Android, Anki-Android, and kotlinx. We kept only PRs that contained both test and non-test changes. We then filtered further by confirming "test validity": we ran the repo's configured test command before and after applying the PR's non-test file changes. If the tests succeeded before the non-test changes were applied, we excluded the PR, since that indicates the new tests weren't actually exercising the fix.
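For concreteness, here's a minimal Kotlin sketch of that validity check. The `Pr` data class and the `applyPatch`/`runTestCommand` helpers are hypothetical stand-ins, not the actual harness API; it assumes each PR has been split into a test-only patch and a non-test patch, applied with `git apply` against a clean checkout.

```kotlin
import java.io.File

// Hypothetical representation of a scraped PR, split into test and non-test diffs.
data class Pr(
    val repoDir: File,             // clean checkout at the PR's base commit
    val testPatch: File,           // diff touching only test files
    val nonTestPatch: File,        // diff touching only non-test (source) files
    val testCommand: List<String>, // repo's configured test command, e.g. ["./gradlew", "test"]
)

// Runs the repo's configured test command; true iff it exits with status 0.
fun runTests(pr: Pr): Boolean =
    ProcessBuilder(pr.testCommand)
        .directory(pr.repoDir)
        .inheritIO()
        .start()
        .waitFor() == 0

// Applies a patch file with `git apply`; fails fast if the patch doesn't apply.
fun applyPatch(pr: Pr, patch: File) {
    val exit = ProcessBuilder("git", "apply", patch.absolutePath)
        .directory(pr.repoDir)
        .inheritIO()
        .start()
        .waitFor()
    check(exit == 0) { "failed to apply ${patch.name}" }
}

// A PR is a valid benchmark task only if its new tests fail before the
// source changes land (something is actually being tested) and pass after.
fun isValidTask(pr: Pr): Boolean {
    applyPatch(pr, pr.testPatch)
    if (runTests(pr)) return false // tests already pass without the fix: exclude
    applyPatch(pr, pr.nonTestPatch)
    return runTests(pr)            // tests must pass once the fix is applied
}
```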