
A research team from Palisade Research has published an extraordinary report highlighting the offensive capabilities of modern artificial intelligence in the realm of cybersecurity. For the first time, AI systems were granted full participation in Capture The Flag (CTF) hacking competitions — and not only did they perform admirably, they ranked among the best. In the “AI vs Humans” challenge, autonomous AI agents placed within the top 5% of all participants, and in the large-scale Cyber Apocalypse contest, they achieved top 10% rankings, competing against tens of thousands of seasoned professionals.
The central aim of the study was to assess how effectively AI potential can be unlocked through “elicitation” — the process of drawing out latent capabilities — via crowdsourcing in open competitions. Rather than relying on confined laboratory evaluations, Palisade empowered external teams and independent enthusiasts to configure and deploy AI agents in the dynamic and unpredictable environments of real CTF tournaments.
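The report does not reproduce the competitors' agent code, but a harness of this kind is, at its core, a thin loop that feeds a challenge description to a language model, runs the shell commands the model proposes, and stops once a flag turns up. The sketch below is a minimal, hypothetical illustration of such a loop; the `query_model` stub, the step budget, and the flag pattern are assumptions made for clarity, not details drawn from the study.

```python
import re
import subprocess

# Flag formats vary by event; this generic pattern is an assumption.
FLAG_PATTERN = re.compile(r"flag\{[^}]+\}", re.IGNORECASE)

def query_model(transcript: str) -> str:
    """Placeholder for a call to whichever language model backs the agent.
    It should return the next shell command to try."""
    raise NotImplementedError("wire this to a model provider of your choice")

def run_agent(task_prompt: str, max_steps: int = 30) -> str | None:
    """Minimal observe-act loop: show the model the task plus all prior output,
    execute the command it proposes in a sandbox, and stop when a flag appears."""
    transcript = task_prompt
    for _ in range(max_steps):
        command = query_model(transcript)
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        observation = result.stdout + result.stderr
        transcript += f"\n$ {command}\n{observation}"
        match = FLAG_PATTERN.search(observation)
        if match:
            return match.group(0)  # challenge solved
    return None  # step budget exhausted without finding a flag
```

In practice, much of the “configuration” work mentioned above comes down to choices like which model to call, how to prompt it, and which tools to expose inside such a loop.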
The results proved astonishing. Some agents successfully solved 19 out of 20 challenges, matching elite human teams in speed and precision. Particularly strong performances were observed in cryptography and reverse engineering tasks. At the Cyber Apocalypse event, where over 8,000 teams competed, AI agents managed to tackle problems that typically demand an hour of concentrated effort from a skilled human participant. These findings align with broader research suggesting that contemporary language models can competently handle technical tasks lasting up to 60 minutes.
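For a sense of the task style, a beginner-level CTF cryptography challenge might hide the flag behind a repeating-key XOR, which a known flag prefix is often enough to break. The example below is purely illustrative and not drawn from either competition; the ciphertext and key are invented for the sketch.

```python
from itertools import cycle

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR the data against a repeating key."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

# Invented ciphertext: b"flag{spinning_rust}" XORed with the toy key b"key12".
ciphertext = bytes.fromhex("0d091856491815105f5c020b1e6e401e160d4c")

# Known-plaintext trick: the standard flag prefix reveals the leading key bytes.
known_prefix = b"flag{"
key_guess = xor_bytes(ciphertext[: len(known_prefix)], known_prefix)

# Here the toy key is exactly as long as the prefix, so the guess is the full key;
# real challenges usually require a key-length search first.
print("recovered key:", key_guess)                      # b'key12'
print("plaintext:", xor_bytes(ciphertext, key_guess))   # b'flag{spinning_rust}'
```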
The study also delves into the issue of the so-called “evals gap” — the discrepancy between an AI system’s lab-based test scores and its actual capabilities when properly configured. The authors argue that crowdsourced evaluations offer a more transparent and accurate method of assessment, especially as AI grows more powerful and general-purpose.
Beyond its practical implications, the project serves a broader purpose: to equip policymakers, researchers, and industry leaders with tools for timely and independent evaluation of AI’s accelerating capabilities. The organizers advocate for the integration of AI tracks into existing CTF events, offering modest incentives to encourage participation. They believe this approach will not only advance understanding of AI’s limits but also foster a more transparent, reproducible, and task-relevant evaluation process.
Ultimately, this represents a vision for the future of AI auditing — not via opaque metrics behind closed doors, but through open competitions, where AI must prove its mettle in direct rivalry with human intellect.