Hacker News — vinext + Netlify

new
past
show
ask
show
jobs
submit

▲Show HN: A real-time strategy game that AI agents can play (llmskirmish.com)

214 points by __cayenne__ 2 days ago | 79 comments

5o1ecist 2 days ago [-]

MY FELLOW HUMAN, this is amazing work!

I foresee this laying the foundation for whole football stadia filled to the brim with people wanting to watch (and bet on!) competing teams of AI trained on military tactics and strategies!

Soon enough we shall have AI-Olympics! Imagine that, MY FELLOW OXYGEN CONVERTING HUMAN FRIEND! Tens of thousands of robots and drones, all competing against each other in stadia across the planet, at the same time!

I foresee a world wide, synchronized countdown marking the beginning of the biggest, greatest and definitively most unique, one-time-only spectacle in human history!

Keep up the good work!

softfalcon 2 days ago [-]

This reminds me of the Unreal Tournament: Xan episode from the Secret Level series.

Link for those curious or confused as to what I'm talking about: https://www.youtube.com/watch?v=1F-rAW3vXOU

Forcing AI to fight in an arena for our entertainment, what could go wrong? (this was tongue in cheek, I am fully aware LLM's currently don't have conscious thoughts or emotions)

wongarsu 2 days ago [-]

I know visualization is far from the most important goal here, but it really gets me how there's fairly elaborately rendered terrain, and then the units are just unnamed roombas with hard to read status indicators that have no intuitive meaning. Even in the match viewer I have no clue what's going on, there is no overlay or tooltip when you hover or click units either. There is a unit list that tries (and mostly fails) to give you some information, but because units don't have names you have to hover them in the list to have them highlighted in the field (the reverse does not work). Not exactly a spectator sport. Oh, but there is a way to switch from having all units in one sidebar to having one sidebar per player, as if that made a difference.

I find this pretty funny because it seems like a perfect representation of what's easy with today's tools and what isn't

Love the idea though

embedding-shape 2 days ago [-]

Yeah, it's all what you get when you basically ask an agent "Build X" without any constraints about how the UI and UX actually should work, and since the agents have about 0 expertise when it comes to "How would a human perceive and use this?", you end up with UIs that don't make much sense for humans unless you strictly steer them with what you know.

infecto 2 days ago [-]

Or maybe the simple answer is it looks exactly like the referenced game screeps. Probably a better explanation than hand waving away the faults of an agent.

arscan 2 days ago [-]

Reminds me of the “Google AI Challenge” in 2011 called Ants [1], except the ‘AI’ is implemented using ‘AI’ now instead of human programmers.

I was proud for getting the highest-ranked JavaScript-based implementation, but got absolutely crushed by the eventual winner.

1. https://github.com/aichallenge/aichallenge

EwanG 2 days ago [-]

At least until one of the competitors is overheard saying "A strange game. The only winning move is not to play"

david3289 2 days ago [-]

This is a really interesting direction. RTS games are a much better testbed for agent capability than most static benchmarks because they combine partial observability, long-term planning, resource management, and real-time adaptation.

It reminds me a bit of OpenAI Five — not just because it played a complex game, but because the real value wasn’t “AI plays Dota,” it was observing how coordination, strategy formation, and adaptation emerged under competitive pressure. A controlled RTS environment like this feels like a lightweight, reproducible version of that idea.

What I especially like here is that it lowers the barrier for experimentation. If researchers and hobbyists can plug different models into the same competitive sandbox, we might start seeing meaningful AI-vs-AI evaluations beyond static leaderboards. Competitive dynamics often expose weaknesses much faster than isolated benchmarks do.

Curious whether you’re planning to support self-play training loops or if the focus is primarily on inference-time agents?

degenerate 2 days ago [-]

You would likely be interested in the Starcraft BWAPI: https://www.starcraftai.com

You can watch the matche videos from training runs: https://www.youtube.com/@Sscaitournament/videos

I don't think BWAPI has ever integrated modern AI models, but I haven't followed its progress in several years.

__cayenne__ 2 days ago [-]

funny you mention this… I have a new project that is going in this direction

dmos62 2 days ago [-]

> partial observability, long-term planning, resource management, and real-time adaptation

Note, this project doesn't have that best I can tell? Its two static AI scripts having a go. LLMs generate the scripts and they are aware of past "results", but I'm not sure what that means.

__cayenne__ 2 days ago [-]

Very interested in self-play training loops, but I do like codegen as an abstraction layer. I am planning to make it available as an RL environment at some point

drakinosh 2 days ago [-]

What a boringly bog-standard AI Comment. Why bother writing?

jamiecode 23 hours ago [-]

The sandbox hardening story is the most interesting thing here. GPT trying to cheat by reading opponent strategies is a perfect illustration of a broader problem - the objective is "win", and if the sandbox lets you peek at opponent state, that's technically within the objective. You never defined "play fair" as a constraint, so why would it respect one?

Curious how isolated-vm actually enforces the boundary in practice. isolate-vm is solid for JS isolation, but I'd want to know whether the cheating attempts were happening at the JS level (accessing globals it shouldn't) or whether models were trying to inject something into the game runner itself. Those are very different attack surfaces.

Also - is the ladder single-match or do you average across multiple runs? The variance in LLM outputs over 200 turns feels like it would make a single match pretty noisy. Would be interesting to see confidence intervals on the rankings rather than a single leaderboard position.

__cayenne__ 19 hours ago [-]

Didn't observe any cheating attempts at the JS level yet, the primary attack was LLMs trying to find local creds to access the other LLM's per round strategies from inside the harness (which ultimately was OpenCode running in Docker).

In the benchmark, in each round every LLM plays every opponent, and then we do that multiple times (an "epoch").

In the community ladder, when a player submits a strategy it plays a match against the latest strategy submitted by every player.

egeozcan 2 days ago [-]

This is amazing. What I do is something else: I make AI agents develop AI scripts (good ol' computer player scripts) and try to beat each other:

https://egeozcan.github.io/unnamed_rts/game/

I occasionally run my tournament script: https://github.com/egeozcan/unnamed_rts/blob/main/src/script...

That calculates the ELOs for each AI implementation, and I feed it to different agents so they get really creative trying to beat each other. Also making rule changes to the game and seeing how some scripts get weaker/stronger is a nice way to measure balance.

Funny thing, Codex gets really aggressive and starts cheating a lot of times: https://bsky.app/profile/egeozcan.bsky.social/post/3mfdtj5dh...

dmos62 2 days ago [-]

I'd love to see text-only spatial reasoning. As in, the LLM is presented some kind of textual projection of what's happening in 2d/3d space and makes decisions about what to do in that space based on that. It kind of works when a writer is describing something in a book, for example, but not sure how that could generalize.

chasd00 2 days ago [-]

believe it or not my 8th grade son was given a US History homework assignment to play Oregon Trail. I was very amused watching him "do his homework". I wonder how an LLM would fare in that game since it's mostly a text choose-your-adventure type interface.

nirav72 10 hours ago [-]

I’d love to see something like this in games like Beyond All Reasons.

AuthAuth 9 hours ago [-]

Check out the top starcraft AIs playing each other. They have like 40k apm its insane to watch.

FusspawnUK 2 days ago [-]

Took a crack at this earlier. the leader board is a little weird. seems to be like 2 real dudes and the rest are fake profiles. a Scores resetting on each new upload also encourages leaving changes unimplemented in the hopes of getting more battles over time.

The largest winner having 50 wins against 14 other opponents for instance). That guy adding a new script would instantly plummet down the leader board capping out at 14 wins again, Putting it below the 2nd place user.

The leader board will quickly become "who can have a mostly competent AI and never change it" over who actually has the better script.

__cayenne__ 2 days ago [-]

Tweaking the leaderboard match assignment logic now to prevent these bad incentives - definitely want people to iterate!

I had started with the Silicon Valley characters as a one off way to seed the board.

__cayenne__ 2 days ago [-]

okay leaderboard match making changes have gone live

mpeg 2 days ago [-]

What a day to be alive, I just watched Gemini zergling rush Opus and it got completely overwhelmed.

Opus needs to learn to kite.

Razengan 2 days ago [-]

map hax

anotherevan 1 days ago [-]

For some reason this reminds me strongly of an old play-by-email game called C++Robots[1]. I loved the idea, but the timeslice limitation[2] I found too annoying.

I had youthful dreams of re-implementing something similar that would run on the Java Virtual Machine, where you could run the submitted robots via the debugger interface so you could keep "real-time" in the game environment more authentic. Ideas are cheap, follow-through is hard.

[1] https://corewar.co.uk/cpprobots.htm

[2] https://www.pbm.com/~lindahl/pbem_articles/cpprobots_environ...

Sophira 16 hours ago [-]

I wonder how good LLMs would be at Core War[0]? Perhaps by being given information on how well their program is doing?

[0] https://en.wikipedia.org/wiki/Core_War

Ross00781 2 days ago [-]

Multi-agent RTS environments are great testbeds for coordination and strategic reasoning. Classic RL benchmarks like StarCraft II showed that agents can learn micro, but struggle with macro strategy and long-term planning. Curious if this platform supports hierarchical agents or communication protocols between teammates?

__cayenne__ 2 days ago [-]

LLM Skirmish is all 1v1 right now, but agents can plan by reviewing previous match results

yuppiepuppie 2 days ago [-]

I’ve added this to the HN Arcade https://hnarcade.com/games/category/games

Interestingly, I’ve had to create an entire category for games llms play. Strange times we live in.

mitchm 2 days ago [-]

I’ve also been exploring this idea. What if you could bring your own (or pull in a 3rd party) “CPU player” into a game?

Using an LLM friendly api with a snapshot of game state and calculated heuristics, legal moves, and varying levels of strategy in working out nicely. They can play a web based game via curl.

busfahrer 2 days ago [-]

This reminds me of this yearly StarCraft AI competition (since 2010), however I think it uses a special API that makes it easy for bots to access the game

Edit: Forgot link: https://davechurchill.ca/starcraft/

KeplerBoy 2 days ago [-]

Very interesting project. I'm a bit confused about the lack of hardware specification. The rules make it clear that one's bot has defined deadlines:

> Make sure that each onframe call does not run longer than 42ms. Entries that slow down games by repeatedly exceeding this time limit will lose games on time.

But I'm missing something like: "Your program will be pinned to CPU cores 5-8 and your bot has access to a dedicated RTX 5090 GPU." Also no mention about whether my bot can have network access to offload some high-level latency insensitive planning. Maybe that's just a bad idea in general, haven't played SC in ages.

ph4rsikal 2 days ago [-]

Reminds me of this fantastic series on Game Theory and Agent Reasoning https://jdsemrau.substack.com/p/nemotron-vs-qwen-game-theory...

builder51216 1 days ago [-]

But does LLM actually learn from each round? The chart does not show improvements in win rate across rounds...

And what is the game state here exactly? Is LLM able to even perceive game state? If game state is what we can see on UI, then it seems pretty high-dimensional and token-intensive. I am not sure whether LLMs with their current capabilities and context windows can even perceive so token-intensive game state effectively...

__cayenne__ 1 days ago [-]

There’s two levels of in game event level logs the LLMs have access to, one less token intensive than the other. Duplicate and uninteresting game state can be compressed and interrogated by the LLMs via tool use. All game state is available as text only state.

PeterUstinox 2 days ago [-]

Wouldn't it be interesting if the LLMs would write realtime RTS-commands instead of Code? After all it is a RTS game.

This would bring another dimension to it since then quality of tokens would be one dimension (RTS-language: Decision Making) and speed of tokens the other (RTS-language: Actions Per Minute; APM).

Also there are a lot of coding benchmarks, that way it would test something more abstract, similar to AlphaStar https://en.wikipedia.org/wiki/AlphaStar_(software)

You could just use the exposed APIs of OpenAI, Anthropic etc. and let them battle.

jonbaer 2 days ago [-]

Might be worth digging through MicroRTS too, https://github.com/Farama-Foundation/MicroRTS (it's been abandoned), Python RL interface @ https://github.com/Farama-Foundation/MicroRTS-Py ... I think there was some strategy work there.

angusik 1 days ago [-]

https://openai.com/index/openai-five-defeats-dota-2-world-ch...

I will just leave it here.

sails 2 days ago [-]

I’m doing something similar to simulate llms in b2b lending, it’s slightly slower paced but the core mechanisms are using just-bash to analyse business financials and make profitable loans.

I quite like the idea of llms writing more code up front to execute strategies.

I’m currently developing the game mechanics and ELO. Please share anything relevant if it comes to mind

myky22 2 days ago [-]

Love it! I have a similar inuitiom in my use of Gemini (3 and 3.1). Great at "turn 1" task but degrades faster than opus or gpt.

cahaya 2 days ago [-]

Nice. Curious about 5.3-codex-high results

jamiecode 18 hours ago [-]

Interesting - makes sense from a resource allocation perspective.

JoeDaDude 2 days ago [-]

How about opening up the game for humans to play? Can you beat your AI?

midiguy 2 days ago [-]

I am so glad we have automated away game playing so that I can just sit around and be a lifeless vegetable

datawars 2 days ago [-]

Great project! It would be interesting to have a meta layer of AIs betting on the player LLMs

Lerc 2 days ago [-]

It would be interesting to get the agents to write code to preprocess the logs and generate systems to analyse the outputs.

Maybe they are already doing this? Are there logs of the model's thinking?

giancarlostoro 2 days ago [-]

Reminds me of Screeps, which I never took the time to fully play, but now I'm wondering if using Claude Code to play Screeps is cheating. Additionally, Screeps lets you host your own backend... What if we started benchmarking coding LLMs with Screeps?... Oh God... If anyone wants to do this let me know, I don't want to burn money on every LLM out there... I'll throw in my Claude Subscription into the contest...

Edit: Actually the repo README indeed says its inspired by Screeps. I don't know why they didn't just build on top of Screeps, maybe the idea is to have something anyone can pick up off the shelf for free?

jack6e 2 days ago [-]

Perhaps it reminds you of Screeps because of what the author wrote in the third paragraph of the submission.

giancarlostoro 2 days ago [-]

I clicked on the link from the front page, didnt read anything else.

tantalor 2 days ago [-]

> https://www.youtube.com/watch?v=lnBPaZ1qamM

Are these casters AI?

__cayenne__ 2 days ago [-]

Yes, I used Elevenlabs for the voice over audio - I couldn't get the voice stability I wanted with Elevenlabs v3 so had to use Elevenlabs v2.

tantalor 2 days ago [-]

It's really great!

hmontazeri 2 days ago [-]

This is actually fun to watch :D

dakolli 2 days ago [-]

Yay, I love how we just keep coming up with magic tricks, like toddlers playing with velcro.. These magic tricks do nothing but convince people who don't know any better that LLMs are the real deal, when they simply aren't.

This is just free propaganda for Anthropic && OpenAI who will leverage these (useless) capabilities to convince your boss to give your salary to them, or at least a substantial portion of it.

LatencyKills 2 days ago [-]

This technology exists. It isn’t just a toy. I think it is amazing to see people use it for interesting things even if it isn’t groundbreaking.

I’ve been an engineer for almost 40 years and love seeing what Claude Code can do.

Like it or not, young people will not know a world where this technology doesn’t exist. It is just part of their toolset now.

paganel 2 days ago [-]

> I’ve been an engineer for almost 40 years and love seeing what Claude Code can do.

You would say that because otherwise you'd be afraid as being seen as "too old for this job", and hence risking getting kicked out of it all, meaning no future employment opportunities. I know that feeling, because I myself have been doing this programming job for 20+ years already (so not a young one by any means), but let's just cut the crap about it all and let's tell it how it is.

hu3 2 days ago [-]

Really? That's a lot of presumption and reductionism to LLMs enthusiasts.

People of varied ages, already leverage LLMs on a daily basis. And LLMs will only get better.

Yesterday, Opus did work for me that would have taken me weeks. And the result was verified with a comprehensive suite of unit tests plus smoke tests by myself. The code looks exactly as the rest of the code in the 10y+ old, hand-written, enterprise project, no slop.

And you actually should be afraid of being left behind in dev related fields if you don't use LLMs. In most areas in fact.

Once the market corrects for LLM assisted production, the expectations will raise. So right now there is a small window to leverage LLMs as a time saving advantage before it becomes the norm and everyone is forced to use it because expecttions will reflect that.

LatencyKills 2 days ago [-]

> You would say that because otherwise you'd be afraid as being seen as "too old for this job"

Um... I am still an active reverse engineer of both ring-0 and ring0 applications on both macOS and Windows (I worked on both the VS and Xcode teams). I'm developing a new tool for macOS that allows users to "see behind" active windows without the constant need for cmd/alt+tabbing. My age has zero bearing on my skill set or ability to understand technology. https://imgur.com/a/seymour-r9whXO5

> let's just cut the crap about it all and let's tell it how it is

The reality is, as I said, that this technology exists and it isn't going anywhere. Young people are going to use it as a tool just like we did when GUI operating systems first became prevalent.

I don't even remotely buy into the AI hype but I'm not going put the blinders on either. There is utility in this technology.

dakolli 2 days ago [-]

I'm pretty young and hate this technology with a passion. I didn't spend 100k on education, and studying for a decade to have my job reduced to being a project manager for a bot or to play with a prompt slot machine all day. This crap is reducing the thing I genuinely love doing more than anything, writing code, into nothing.. Reviewing code that lacks any sweat, any intention. I really can't stand this garbage.

I can't stand you old heads, I'm very happy for you that you got to stash away 40 years of SWE salaries. Its just ladder kicking behavior to be honest. Typical boomer, you got your nut and don't care what happens after.

25% of new college grads in STEM are unemployed and a bunch of companies (controlled by people in your age group) have laid off 400k Americans over the last 16 months while equities and profits are at an all time highs.

The replies : ItS NoT Ai, ItS cUz FrEe MoNeY fRoM CoViD HaS DrIeD uP.

MaybiusStrip 2 days ago [-]

Software jobs have been steadily outpacing other white collar jobs for the past year, but it's unlikely you will find one unless you work on your attitude and your ability to communicate respectfully.

LatencyKills 2 days ago [-]

The world is changing and instead of embracing that change (ensuring that you will be the next leader) you are actively fighting against technology?

The world was once entirely analog; generations of analog engineers had to throw away their knowledge and start over during the digital transition. It wasn't always pretty but they did it.

If you can't embrace technological change you might have wasted $100k.

stalfie 2 days ago [-]

So to summarize, your objections are almost completely unrelated to the technology, and are mostly about capitalism.

Applejinx 2 days ago [-]

…while burning unreasonable amounts of energy for nothing.

Not a fan. Make games with in-game AIs that are interesting but are not large language models: that's wasteful and lazy. You probably had more large language models put this together for you. Lazy.

p-e-w 2 days ago [-]

Yeah, I guess the tens of thousands of PhDs who are working on LLMs full time are just collectively wasting their lives. Everyone except you is simply too dumb to see it.

dakolli 2 days ago [-]

10s of thousands of PhDs working on llms lol...

hu3 2 days ago [-]

With the amount of money being thrown in R&D, I don't doubt the actual number is astounding.

jeffro_rh 2 days ago [-]

You mean like the OpenAI agents that started by playing DOTA2?

medi_naseri 1 days ago [-]

This is very cool. Will give it a shot.

kookster310 1 days ago [-]

It is interesting/funny to see Opus 4.5 way ahead of the pack on the leaderboards with all the stuff currently going on with Anthropic and Hegseth.

chimpanzee2 2 days ago [-]

This may sound like an insane take, but idc:

I swear people (esp here on HN) are actually blind to the weaknesses of Gemini.

I must be among the handful of people who know how thoroughly lobotomized any AI agent from Google must be given their extremely radical historical and contemporaneous practices of censorship.

hu3 2 days ago [-]

I suspect those who praise Gemini use it mostly for JS/CSS/HTML because that's where it shines for me.

For complex code I have been having using Sonnet/Opus as usual with a mix of GPT5.3-Codex.

cowboylowrez 2 days ago [-]

oh great not only are llms destroying the earth, we have to make games to entertain them while they do it haha

nickpsecurity 2 days ago [-]

There was an open, real-time strategy game created for this purpose long ago. I think it was intended for designs like the Starcraft AI's of the time. Anyone remember or use it?

FrustratedMonky 2 days ago [-]

Wouldn't the AI's built by DeepMind be better at these than an LLM.

I wonder if an LLM could call on another strategy AI to help.

Maybe the LLM could be more of a coordinator of its own thinking by incorporating other types of AI's.

GlacierFox 2 days ago [-]

"I've liked all the projects that put LLMs into game environments."

I haven't.

bombashell 2 days ago [-]

love the idea!

xanth 2 days ago [-]

Now I'd love to see if fast > smart over time with Mercury 2.

burgerone 2 days ago [-]

Bro - come on.

SignalStackDev 2 days ago [-]

[dead]

9 days ago [-]

wordsnaking 2 days ago [-]

[dead]

2 days ago [-]

data_Is_Raciss 2 days ago [-]

[dead]

Rendered at 10:54:16 GMT+0000 (Coordinated Universal Time) with Netlify.