# feed
k
GPT-5 Results on Kotlin Bench https://firebender.com/leaderboard
🚫 3
🦜 2
🫡 1
🔥 1
K 9
s
Always wondered how to read these results. Does 30% mean the model succeeded in only 30% of the tasks? If so, that’s pretty low.
👍 2
e
The numbers don’t matter as long as the chart is beautiful 😂
😁 2
☝️ 1
g
From my understanding, most of these benchmarks introduce tasks that we know no LLMs are close to solving, so that there is plenty of room in the evaluation. 5 models all getting 100% on a benchmark is not very informative: all that tells me is that the benchmark is not capable of measuring their performance, at which point it is just a validation test, not a benchmark. So you want a test that is at their limit, where performance is low.
yes black 2
💯 2
As they get better, the benchmark will be made more difficult. (More likely a new benchmark will be made)
k
exactly @Gat Tag, was going to respond sooner. also we likely will need to make the benchmark harder. if AI can solve even 30% of PRs on very well-maintained repos, that's already too high imo
will have more announcements on this soon, for a kotlin-bench v2 where tasks are much harder
and also measuring on different classes of tasks, not just blind PRs + tests