Hey all! I built an open source Kotlin coding benchmark for large language models like Gemini 2.5 Pro, Claude 3.7 Sonnet, and GPT-4.

• blog with results: https://firebender.com/blog/kotlin-bench
• github repo: https://github.com/firebenders/Kotlin-bench

Why not just use SWE-bench/Aider/Codeforces/etc.?

Many of these benchmarks, like SWE-bench, focus on Python tasks, so it's hard to trust their results when picking a model for Kotlin work. With Kotlin-Bench, we now have a way to track LLM progress on Kotlin tasks. This lets engineers make an informed choice about the best LLM to use. It also gives foundation model developers an incentive to make improvements that benefit the Kotlin community.

How do the evals work?

We scraped thousands of pull request and issue pairs from popular GitHub repos like Wordpress-Android, Anki-Android, kotlinx. We filtered the PRs down to ones that contained both test and non-test changes. We then filtered further by confirming "test validity": running the configured test command before and after applying the PR's non-test file changes. If the tests succeeded before the non-test changes were applied, we excluded the PR, because that indicates the tests weren't actually exercising the change.
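To make the filtering concrete, here is a rough Python sketch of the two checks described above. It is not the actual Kotlin-bench code: `PullRequest`, `is_test_file`, and the `run_tests` callback are hypothetical names, the path heuristic for spotting test files is a guess, and requiring the tests to pass after the non-test changes are applied is my assumption about what "test validity" implies.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PullRequest:
    # Hypothetical minimal record for a scraped PR.
    number: int
    changed_files: List[str]


def is_test_file(path: str) -> bool:
    # Assumed heuristic: Gradle-style test source sets or a *Test.kt suffix.
    p = path.lower()
    return "/test/" in p or "/androidtest/" in p or p.endswith("test.kt")


def has_test_and_non_test_changes(pr: PullRequest) -> bool:
    # Keep only PRs that touch both test and non-test files.
    tests = [f for f in pr.changed_files if is_test_file(f)]
    non_tests = [f for f in pr.changed_files if not is_test_file(f)]
    return bool(tests) and bool(non_tests)


def is_valid_task(
    pr: PullRequest,
    run_tests: Callable[[PullRequest, bool], bool],
) -> bool:
    # run_tests(pr, with_fix) is a hypothetical callback that runs the repo's
    # configured test command with the PR's test files in place, and with the
    # non-test changes applied only when with_fix is True. It returns True
    # when the tests pass.
    if not has_test_and_non_test_changes(pr):
        return False
    if run_tests(pr, False):
        # Tests already pass without the fix, so they test nothing: exclude.
        return False
    # Assumption: the tests should pass once the non-test changes are applied.
    return run_tests(pr, True)
```

For example, a PR that changes `app/src/main/java/Foo.kt` and `app/src/test/java/FooTest.kt` is kept only if its tests fail without the `Foo.kt` change and pass with it.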