Always wondered how to read these results. Does 30% mean the model succeeded in only 30% of the tasks? If so, that’s pretty low.
👍 2
Edgar Avuzi
08/14/2025, 3:24 PM
The numbers don’t matter as long as the chart is beautiful 😂
😁 2
☝️ 1
Gat Tag
08/14/2025, 6:26 PM
From my understanding, most of these benchmarks introduce tasks that we know no LLM is close to solving, so that there is plenty of headroom in the evaluation. Five models all getting 100% on a benchmark is not very informative: all that tells me is that the benchmark can't measure differences in their performance, at which point it is just a validation test, not a benchmark. So you want a test that sits at their limit, where performance is low.
yes black 2
💯 2
Gat Tag
08/14/2025, 6:27 PM
As they get better, the benchmark will be made more difficult. (More likely a new benchmark will be made)
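A rough sketch of the point above, assuming the reported score is simply the fraction of tasks a model resolves (so 30% would mean 3 out of every 10 tasks). The model names and pass/fail results are made up for illustration and are not real benchmark data; `pass_rate` and `spread` are hypothetical helpers.

```python
# Made-up pass/fail results per task (True = task resolved).
# These are illustrative only, not results from any real benchmark.

def pass_rate(results):
    """Fraction of tasks resolved, e.g. 3 of 10 -> 0.3."""
    return sum(results) / len(results)

def spread(scores):
    """Gap between the best and worst score.
    A gap of 0 means the benchmark cannot tell the models apart."""
    scores = list(scores)
    return max(scores) - min(scores)

# Saturated benchmark: every model solves every task.
saturated = {
    "model_a": [True] * 10,
    "model_b": [True] * 10,
    "model_c": [True] * 10,
}

# Hard benchmark: tasks sit at the models' limit, so scores are low but differ.
hard = {
    "model_a": [True] * 3 + [False] * 7,
    "model_b": [True] * 2 + [False] * 8,
    "model_c": [True] * 1 + [False] * 9,
}

saturated_scores = {name: pass_rate(r) for name, r in saturated.items()}
hard_scores = {name: pass_rate(r) for name, r in hard.items()}

print("saturated spread:", spread(saturated_scores.values()))
print("hard spread:", spread(hard_scores.values()))
```

With everyone at 100% the spread is zero, so the ranking carries no information. On the harder set the absolute numbers look bad, but the models can actually be ordered.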
Kevin
08/14/2025, 6:50 PM
exactly @Gat Tag, was going to respond sooner. also, we will likely need to make the benchmark harder. if AI can get even 30% of PRs on very well-maintained repos, that's already too high imo
Kevin
08/14/2025, 6:51 PM
will have more announcements on this soon, for a kotlin-bench v2 where tasks are much harder
Kevin
08/14/2025, 6:51 PM
and also measuring on different classes of tasks, not just blind PRs + tests