{"id":3335,"date":"2025-07-24T02:17:42","date_gmt":"2025-07-24T02:17:42","guid":{"rendered":"https:\/\/jobuzo.com\/en\/a-new-ai-coding-challenge-just-published-its-first-results-and-they-arent-pretty\/"},"modified":"2025-07-24T02:17:42","modified_gmt":"2025-07-24T02:17:42","slug":"a-new-ai-coding-challenge-just-published-its-first-results-and-they-arent-pretty","status":"publish","type":"post","link":"https:\/\/jobuzo.com\/en\/a-new-ai-coding-challenge-just-published-its-first-results-and-they-arent-pretty\/","title":{"rendered":"A new AI coding challenge just published its first results \u2013 and they aren\u2019t pretty"},"content":{"rendered":"<div>\n<div><\/div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">A new AI coding challenge has revealed its first winner &mdash; and set a new bar for AI-powered software engineers.&nbsp;<\/p>\n<p class=\"wp-block-paragraph\">On Wednesday at 5pm PST, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win was his final score: he won with correct answers to just 7.5% of the questions on the test.<\/p>\n<p class=\"wp-block-paragraph\">&ldquo;We&rsquo;re glad we built a benchmark that is actually hard,&rdquo; said Konwinski. &ldquo;Benchmarks should be hard if they&rsquo;re going to matter,&rdquo; he continued, adding: &ldquo;Scores would be different if the big labs had entered with their biggest models. But that&rsquo;s kind of the point. K Prize runs offline with limited compute, so it favors smaller and open models. I love that. 
It levels the playing field.&rdquo;<\/p>\n<p class=\"wp-block-paragraph\">Konwinski has pledged $1 million to the first open-source model that can score higher than 90% on the test.<\/p>\n<p class=\"wp-block-paragraph\">Similar to the well-known SWE-Bench system, the K Prize tests models against flagged issues from GitHub as a test of how well models can deal with real-world programming problems. But while SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed as a &ldquo;contamination-free version of SWE-Bench,&rdquo; using a timed entry system to guard against any benchmark-specific training. For round one, models were due by March 12th. The K Prize organizers then built the test using only GitHub issues flagged after that date.<\/p>\n<p class=\"wp-block-paragraph\">The 7.5% top score stands in marked contrast to SWE-Bench itself, which currently shows a 75% top score on its easier &lsquo;Verified&rsquo; test and 34% on its harder &lsquo;Full&rsquo; test. 
Konwinski still isn&rsquo;t sure whether the disparity is due to contamination on SWE-Bench or just the challenge of collecting new issues from GitHub, but he expects the K Prize project to answer the question soon.<\/p>\n<p class=\"wp-block-paragraph\">&ldquo;As we get more runs of the thing, we&rsquo;ll have a better sense,&rdquo; he told TechCrunch, &ldquo;because we expect people to adapt to the dynamics of competing on this every few months.&rdquo;<\/p>\n<p class=\"wp-block-paragraph\">It might seem like an odd place to fall short, given the wide range of AI coding tools already publicly available &ndash; but with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving AI&rsquo;s growing evaluation problem.<\/p>\n<p class=\"wp-block-paragraph\">&ldquo;I&rsquo;m quite bullish about building new tests for existing benchmarks,&rdquo; says Princeton researcher Sayash Kapoor, who put forward a similar idea in a recent paper. 
&ldquo;Without such experiments, we can&rsquo;t actually tell if the issue is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop.&rdquo;<\/p>\n<p class=\"wp-block-paragraph\">For Konwinski, it&rsquo;s not just a better benchmark, but an open challenge to the rest of the industry. &ldquo;If you listen to the hype, it&rsquo;s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that&rsquo;s just not true,&rdquo; he says. &ldquo;If we can&rsquo;t even get more than 10% on a contamination free SWE-Bench, that&rsquo;s the reality check for me.&rdquo;<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>A new AI coding challenge has revealed its first winner &mdash; and set a new bar for AI-powered software engineers.&nbsp; On Wednesday at 5pm PST, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. 
The winner was a Brazilian&#8230;<\/p>\n<p class=\"more-link-wrap\"><a href=\"https:\/\/jobuzo.com\/en\/a-new-ai-coding-challenge-just-published-its-first-results-and-they-arent-pretty\/\" class=\"more-link\">Read More<span class=\"screen-reader-text\"> &ldquo;A new AI coding challenge just published its first results \u2013 and they aren\u2019t pretty&rdquo;<\/span> &raquo;<\/a><\/p>\n","protected":false},"author":1,"featured_media":3336,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3335","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-news"],"_links":{"self":[{"href":"https:\/\/jobuzo.com\/en\/wp-json\/wp\/v2\/posts\/3335","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jobuzo.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jobuzo.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jobuzo.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jobuzo.com\/en\/wp-json\/wp\/v2\/comments?post=3335"}],"version-history":[{"count":0,"href":"https:\/\/jobuzo.com\/en\/wp-json\/wp\/v2\/posts\/3335\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/jobuzo.com\/en\/wp-json\/wp\/v2\/media\/3336"}],"wp:attachment":[{"href":"https:\/\/jobuzo.com\/en\/wp-json\/wp\/v2\/media?parent=3335"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jobuzo.com\/en\/wp-json\/wp\/v2\/categories?post=3335"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jobuzo.com\/en\/wp-json\/wp\/v2\/tags?post=3335"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}