Hey all! We built an open source Kotlin coding benchmark for large language models like Gemini 2.5 Pro, Claude 3.7 Sonnet, and GPT-4.
• Blog with results: https://firebender.com/blog/kotlin-bench
• GitHub repo: https://github.com/firebenders/Kotlin-bench
Why not just use an existing benchmark like SWE-bench, Aider, or Codeforces?
Many of these benchmarks, like SWE-bench, focus on Python tasks, which makes their results hard to trust for Kotlin. With Kotlin-Bench, we now have a way to track LLM progress on Kotlin tasks. This lets engineers make an informed choice about the best LLM to use, and it gives foundation model labs an incentive to make improvements that benefit the Kotlin community.
How do the evals work?
We scraped thousands of pull request/issue pairs from popular GitHub repos like Wordpress-Android, Anki-Android, and kotlinx. We kept only PRs that contained both test and non-test changes. We then filtered further by confirming "test validity": we ran the repo's configured test command before and after applying the PR's non-test file changes. If the tests succeeded before the non-test changes were applied, we excluded the PR, since that indicates the new tests weren't actually exercising the fix.
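For concreteness, here's a minimal Kotlin sketch of that validity check. The `Pr` data class and the `applyPatch`/`runTestCommand` helpers are hypothetical stand-ins, not the actual harness API; it assumes each PR has been split into a test-only patch and a non-test patch, applied with `git apply` against a clean checkout.

```kotlin
import java.io.File

// Hypothetical representation of a scraped PR, split into test and non-test diffs.
data class Pr(
    val repoDir: File,             // clean checkout at the PR's base commit
    val testPatch: File,           // diff touching only test files
    val nonTestPatch: File,        // diff touching only non-test (source) files
    val testCommand: List<String>, // repo's configured test command, e.g. ["./gradlew", "test"]
)

// Runs the repo's configured test command; true iff it exits with status 0.
fun runTests(pr: Pr): Boolean =
    ProcessBuilder(pr.testCommand)
        .directory(pr.repoDir)
        .inheritIO()
        .start()
        .waitFor() == 0

// Applies a patch file with `git apply`; fails fast if the patch doesn't apply.
fun applyPatch(pr: Pr, patch: File) {
    val exit = ProcessBuilder("git", "apply", patch.absolutePath)
        .directory(pr.repoDir)
        .inheritIO()
        .start()
        .waitFor()
    check(exit == 0) { "failed to apply ${patch.name}" }
}

// A PR is a valid benchmark task only if its new tests fail before the
// source changes land (something is actually being tested) and pass after.
fun isValidTask(pr: Pr): Boolean {
    applyPatch(pr, pr.testPatch)
    if (runTests(pr)) return false // tests already pass without the fix: exclude
    applyPatch(pr, pr.nonTestPatch)
    return runTests(pr)            // tests must pass once the fix is applied
}
```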