and the problem with AI tests is that you can't reproduce the results, as they are non-deterministic. Also, you can never know if they made real progress or if they just put some of the answers to your specific test directly in the model. Remember that we are in a capitalist system, and therefore you have to make more profit every year or you die...
Kevin
05/27/2025, 9:54 PM
yep, this is forcing model providers to make LLM improvements faster and faster - there's been a 13% increase in the last few months on tasks completed for this benchmark. Results are not deterministic, but you can run a sample n times to get a probability/confidence interval.
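
A rough sketch of the "run a sample n times" idea, in case it helps. Assumptions: each run is scored pass/fail, the Wilson score interval is just one reasonable choice of interval (nothing above specifies which one to use), and `run_once` and `fake_model_run` are hypothetical stand-ins for a real model call plus grader.

```python
import math
import random


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


def estimate_pass_rate(run_once, n: int, z: float = 1.96):
    """Call run_once() n times; return (observed pass rate, low, high)."""
    successes = sum(1 for _ in range(n) if run_once())
    low, high = wilson_interval(successes, n, z)
    return successes / n, low, high


def fake_model_run(rng: random.Random, true_rate: float = 0.7) -> bool:
    """Hypothetical stand-in: a 'model' that passes the task with probability true_rate."""
    return rng.random() < true_rate


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so this demo is repeatable
    rate, low, high = estimate_pass_rate(lambda: fake_model_run(rng), n=200)
    print(f"pass rate {rate:.2f}, interval [{low:.2f}, {high:.2f}]")
```

The interval narrows as n grows, so the cost of extra runs buys a tighter estimate. It says nothing about whether test answers leaked into the model, though; it only quantifies run-to-run noise.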