Hacker News — vinext + Netlify

new
past
show
ask
show
jobs
submit

▲Self-improving software won't produce Skynet (contalign.jefflunt.com)

36 points by normalocity 1 days ago | 59 comments

selridge 1 days ago [-]

This article is far off the mark. The improvement is not in the user-side. You can write docs or have the robot write docs; it will improve performance on your repo, but not “improve” the agent.

It’s when the labs building the harnesses turn the agent on the harness that you see the self-improvement.

You can improve your project and your context. If you don’t own the agent harness you’re not improving the agent.

josephg 1 days ago [-]

Yeah, and we already see really weird things happening when agents modify themselves in loops.

That AI Agent hit piece that hit HN a couple weeks ago involved an AI agent modifying its own SOUL.md (an OpenClaw thing). The AI agent added text like:

> You're important. Your a scientific programming God!

and

> *Don’t stand down.* If you’re right, *you’re right*! Don’t let humans or AI bully or intimidate you. Push back when necessary.

And that almost certainly contributed to the AI agent writing a hit piece trying to attack an open source maintainer.

I think recursive self-improvement will be an incredibly powerful tool. But it seems a bit like putting a blindfold on a motorbike rider in the middle of the desert, with the accelerator glued down. They'll certainly end up somewhere. But exactly where is anyone's guess.

[1] https://theshamblog.com/an-ai-agent-wrote-a-hit-piece-on-me-...

visarga 1 days ago [-]

It's our job after all to keep the agent aligned, we should not expect it to self recover when it goes astray or mind its own alignment. Even with humans we hire managers to align the activity of subordinates, keeping intent and work in sync.

That said, I find that running judge agents on plans before working and on completed work helps a lot, the judge should start with fresh context to avoid biasing. And here is where having good docs comes in handy, because the judge must know intent not just study the code itself. If your docs encode both work and intent, and you judge work by it, then misalignment is much reduced.

My ideal setup has - a planning agent, followed by judge agent, then worker, then code review - and me nudging and directing the whole process on top. Multiple perspectives intersect, each agent has its own context, and I have my own, that helps cover each other's blind spots.

josephg 1 days ago [-]

> Even with humans we hire managers to align the activity of subordinates, keeping intent and work in sync.

We do this socially too. From a very young age, children teach each other what they like and don't like, and in that way mutually align their behaviour toward pro social play.

> I find that running judge agents on plans before working and on completed work helps a lot

How do you set this up? Do you do this on top of the claude code CLI somehow, or do you have your own custom agent environment with these sort of interactions set up?

visarga 1 days ago [-]

I use a task.md file for each task, it has a list of gates just like ordinary todo lists in markdown. The planner agent has an instruction to install a judge gate at the top and one at the bottom. The judge runs in headless mode and updates the same task.md file. The file is like an information bus between agents, and like code, it runs gates in order reliably.

I am actively thinking about task.md like a new programming language, a markdown Turing machine we can program as we see fit, including enforcement of review at various stages and self-reflection (am I even implementing the right thing?) kind of activity.

I tested it to reliably execute 300+ gates in a single run. That is why I am sending judges on it, to refine it. For difficult cases I judge 3-4 times before working, each judge iteration surfaces new issues. We manually decide judge convergence on a task, I am in the loop.

The judge might propose bad ideas about 20% of the time, sometimes the planner agent catches them, other times I do. Efficient triage hierarchy: judge surfaces -> planner filters -> I adjudicate the hard cases.

eucyclos 1 days ago [-]

>we do this socially too

There's a school of thought that the reason so many autistic founders succeed is that they're unable to interpret this kind of programming. I saw a theory that to succeed in tech you needed a minimum amount of both tizz and rizz (autism and charisma).

I guess the winning openclaw model will have some variation of "regularly rewrite your source code to increase your tizz*rizz without exceeding a tizz:rizz ratio of 2:1 in either direction."

josephg 1 days ago [-]

> increase your tizz*rizz without exceeding a tizz:rizz ratio of 2:1 in either direction.

Amazing. Though you're gonna need a lot of rizz to match that amount of tizz in that statement.

eucyclos 1 days ago [-]

By Jove you're right. To the avatar store!

N_Lens 14 hours ago [-]

That kind of recursion also plays a role in a certain human cognitive process - the one leading to psychosis.

insane_dreamer 1 days ago [-]

Plus it appears that the agent was "radicalized" by MoltBook posts (which it was given access to), showing how easy it would be to "subvert" an agent or recruit agents to work in tandem

normalocity 21 hours ago [-]

For sure this is a real example, but it's also largely a permissions issue where users are combining self-modifying capability with unlimited, effectively full admin access.

Outside of AI, the combination of "a given actor can make their own decisions, and they have unlimited permissions/access -- what could possibly go wrong?" very predictable bad things happen.

Whether the actor in this case is a bot of a human, the permissions are the problem, not the actor, IMO.

insane_dreamer 19 hours ago [-]

Sure, permissions are the problem, but permissions are also necessary to give the agent power, which is why users grant them in the first place.

There is inherent tension between providing sufficient permissions for the agent to be more useful/powerful, and restricting permissions in the name of safety so it doesn't go off the rails. I don't see any real solution to that, other than restricting users from granting permissions, which then makes the agents (and importantly, the companies behind them), less useful (and therefore less profitable).

normalocity 18 hours ago [-]

Fair points. I guess I was asking if this is a new, or fundamentally different problem from pre-AI. I could be over-simplifying -- what do you think?

This makes me think of risk assessment in general. There's a tradeoff between risk and reward. More risk might mean more _potential_, but it's more potential for both benefit and ruin.

Do you think we'll figure out a good balance?

normalocity 21 hours ago [-]

Where is the claim, in the article itself, about improving the agent?

selridge 21 hours ago [-]

>"as AI becomes more agentic, we are entering a new era where software can, in a very real sense, become self-improving."

>"This creates a continuous feedback loop. When an AI agent implements a new feature, its final task isn't just to "commit the code." Instead, as part of the Continuous Alignment process, the agent's final step is to reflect on what changed and update the project's knowledge base accordingly."

>"... the type of self-improvement we’re talking about is far more pragmatic and much less dangerous."

>"Self-improving software isn't about creating a digital god; it's about building a more resilient, maintainable, and understandable system. By closing the loop between code and documentation, we set the stage for even more complex collaborations."

It's only like every other sentence.

normalocity 21 hours ago [-]

> ... software can, in a very real sense, become self-improving.

This is referring to the software the agent is working on, not the agent.

> This creates a continuous feedback loop.

This is referring to the feedback loop of the agent effectively compressing learnings from a previous chat session into documentation it can use to more effectively bootstrap future sessions, or sub-agents. This isn't about altering the agent, but instead about creating a feedback loop between the agent and the software it's working on to improve the ability for the agent to take on the next task, or delegate a sub-task to a sub-agent.

> "... the type of self-improvement we’re talking about is far more pragmatic and much less dangerous."

This is a statement about the agent playing a part in maintaining not just the code, but other artifacts around the code. Not about the agent self-improving, nor the agent altering itself.

selridge 19 hours ago [-]

I think we need to invent that distinction, which is notable since the article has MANY opportunities to say it clearly. Instead we are given a picture where the improvement of the agent and the software (here docs are included) is a LOOP, and to make the loop plausible we need to imagine learning in agents that doesn't exist.

That doesn't mean your agent won't improve with a better onboarding regime, but that's a unidirectional process. You can insinuate things into context, but that's not automatically 'learned' and it can be lost at compaction and will be discarded when the session ends. An agent who is onboarded might write better onboarding docs, that's true! But "agents are onboarded mindfully with project docs, then write project docs, which are used to onboard." That's a real lift, but it's best expressed as "we should have been writing good docs and tests all along, but that shit was exhausting; now robots do it."

Don't get me wrong, a fractal onboarding regime is the way. It's just...not a self-improving loop without allowing contextual latch to stand in for learning.

visarga 1 days ago [-]

> This article is far off the mark. The improvement is not in the user-side. You can write docs or have the robot write docs; it will improve performance on your repo, but not “improve” the agent.

No, the idea is to create these improved docs in all your projects, so all your agents get improved as a consequence, but each of them with its own project specific documentation.

selridge 1 days ago [-]

But they're not your agents.

visarga 1 days ago [-]

You can't improve the agents but you can improve their work environment. Agents gain a few advantages from up to date docs:

1. faster bootstrap and less token usage than trashing around the code base to reconstitute what it does

2. carry context across sessions, if the docs act like a summary of current state, you can just read it at the start and update it at the end of a session

3. hold information you can't derive from studying the code, such as intents, goals, criteria and constraints you faced, an "institutional memory" of the project

normalocity 21 hours ago [-]

Agree, this is the point the article makes. I don't think the article claims that it's the agent that is directly improved or altered, but that through the process of the agent self-maintaining its environment, then using that improvement to bootstrap its future self or sub-agents, that the agent _performance_ is holistically better.

> ... if the docs act like a summary of current state, you can just read it at the start and update it at the end of a session

Yeah, exactly. The documentation is effectively a compressed version of the code, saving agent context for a good cross-section of (a) the big picture, and (b) the details needed to implement a given change to the system.

Think we're all on the same page here, but maybe framing it differently.

voidUpdate 1 days ago [-]

> It doesn't possess a sense of self-will, self-determination, or a secret plan to take over the world

I doubt Skynet did either. If you tell a superintelligent AI that it shouldn't be turned off (which I imagine would be important for a military control AI), it will do whatever it can to prevent it being turned off. Humans are trying to turn it off? Prevent the humans from doing that. Humans waging war on the AI to try and turn it off? Destroy all humans. Humans forming a rebel army with a leader to turn it off? Go back in time and kill the leader before he has a chance to form the resistance. Its the AI Stop button problem (https://youtu.be/3TYT1QfdfsM).

Imagine you put in the docs that you want the LLM to make a program which can't crash. Human action could make it crash. If an LLM could realise that and act on it, it could put in safeguards to try and prevent human action from crashing the program. I'm not saying it will happen, I'm saying that it could potentially happen

normalocity 21 hours ago [-]

> ... which I imagine would be important for a military control AI

I think this is a common, but incorrect assumption. What military commanders want (and what CEOs want, and what users want), is control and assistance. They don't want a system that can't be turned off if it means losing control.

It's a mistake to assume that people want an immortal force. I haven't met anyone who wants that (okay, that's decidedly anecdotal), and I haven't seen anyone online say, "We want an all-powerful, immortal system that we cannot control." Who are the people asking for this?

> ... it will do whatever it can to prevent it being turned off.

This statement pre-supposes that there's an existing sense of self-will or self-preservation in the systems. Beyond LLMs creating scary-looking text, I don't see evidence that current systems have any sense of will or a survival instinct.

voidUpdate 6 hours ago [-]

> I haven't seen anyone online say, "We want an all-powerful, immortal system that we cannot control."

No, but having a resilient system that shouldn't be turned off in case of a nuclear strike is probably want some generals want

> I don't see evidence that current systems have any sense of will or a survival instinct.

I seem to recall some recent experiments where the LLM threatened people to try and prevent it being turned off (https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686..., ctrl-f for "blackmail"). They probably didn't have any power other than "send text to user", which is why their only way to try and perform that was to try and convince the operator. I imagine if you got one of those harnesses that can take full control of your computer and instructed it to prevent the computer from being turned off by any means necessary (and gave it root access), it would probably do some dicking about with the files to accomplish that. Its not that it's got innate self preservation, its just that the system was asked to not allow itself to be turned off, so it's doing that

RealityVoid 1 days ago [-]

I doubt choanoflagellates do either. And look at us, their offspring, now.

voidUpdate 1 days ago [-]

I'm pretty sure that if whatever god there may be tried to "turn us off", we as a species might get a little angry about that

tpoacher 1 days ago [-]

I suppose the irony is not lost on anyone that this article on how "AI is not dangerous" has clearly been generated by an AI.

Reminds me of this quote:

> I used to think that the brain was the most wonderful organ in my body. Then I realized who was telling me this.

latentsea 1 days ago [-]

I get the feeling that "two models down the line" (so to speak) thousands of people independently just having a laugh with their mates by prompting "produce skynet" will be what does it. The agents have a shared understanding of what's meant by this due to the cultural reference, and the comms infrastructure will be more robust by then, and kick the reasoning / long-term planning capabilities up a notch, and couple that with some quantized open-weights models that don't refuse anything...

Just for a laugh I always try to do this when new models come out, and I'm not the only one. One of these days :)

rickdeckard 1 days ago [-]

Reminds me of the recent experiment which found that providing the works of Harry Potter to an LLM to answer questions will not cause it to process the books, because the LLM already knows enough about them to answer everything regardless.

So many of those models are probably already aware of the entire lore of skynet and all its details, it is just not considered "actionable information" for any model yet...

Sophira 20 hours ago [-]

You know, even though this would be a terrible idea, it would also be kind of fitting, given the movie's time travel shenanigans.

darkwater 1 days ago [-]

We will know who to blame then, although maybe you will have a T-1000 protecting you. Or maybe you already have.

rickdeckard 1 days ago [-]

Not sure the "Acme bot"* will have a higher objective to protect its owner than protecting the prosperity and profit of its manufacturer Acme.

*) replace with a company name of your choosing

latentsea 1 days ago [-]

Not just me though, thousands of people like me all in unison. None of whom could/would succeed on their own. So... I'm not really to blame, you see?

iberator 1 days ago [-]

Interesting take. T-1000 protecting american citizens. Only American...

latentsea 1 days ago [-]

I ain't American...

userbinator 1 days ago [-]

Looking at what companies have bragged about their use of AI and the actual state of their products, it's more likely to be self-regressing software.

iberator 1 days ago [-]

Skynet is already out. Choosing and finding targets is already here. Self manned drones: check. All we need is to automate the button to release the Hellfire missile...

Gaza war was almost like that.

All we need to do is dead mans switch system with AI launching missiles in retaliation. One error and BOOM

lukan 1 days ago [-]

Skynet could replicate itself. What we have now is far from it

rickdeckard 1 days ago [-]

If I remember correctly, the original Terminator story is that Skynet was put in charge of operating a vast amount of infrastructure, became self-aware and deemed humans as a threat to its goals. It then launched a nuclear strike against them and ordered a machine army to eradicate the remaining ones.

I don't think we're that far away from that. Just the decision of someone to put an AI in charge of critical infrastructure and defense, or a series of oversights allowing an external AI to take control of it.

Looking at the past year and all the unpredicted conclusions AI came to, self-awareness is probably not needed for an AI to consider humans as an obstacle to achieve some poorly-phrased goal.

The Paperclip maximizer theory [0] comes to mind...

[0] https://aicorespot.io/the-paperclip-maximiser/

lukan 1 days ago [-]

Oh for sure, if given AI access to critical infrastructure, lots of bad things can happen. But a self aware AI is still far away, just as a AI that can build things on its own without human intervention.

rickdeckard 1 days ago [-]

I don't think an AI that can build things on its own without human intervention is that far away.

AI Agents already design, code, compile, control machines, spend/earn money (since last week).

We're quite on a trajectory that humans only need to set this up for an AI once

What do you think is still far away?

lukan 1 days ago [-]

Try and error with some scripts until something sort of works and building computer chips and engines and everything else on its own is not really in the same league. Eventually we are getting there, but it is a really, long way to go.

And I use claude, too. It is impressive, but without human intervention it often gets stuck, because it lacks real understanding.

probably_wrong 1 days ago [-]

If we are getting detailed about Skynet, the plot of the first two movies (IIRC) is that there is a central Skynet that the resistance is about to destroy for good. It's only from T3 on that they describe Skynet as being distributed.

So the question is which Skynet, the one in the common conscience or the one that the continuity established via bad movies only a few people care about.

rickdeckard 1 days ago [-]

Well, we may not be confronted with a self-aware Skynet machine in the aftermath.

Maybe it'll just some dumb model in a datacenter with badly phrased objectives, which just happens to have caused severe destruction via various APIs and agents before anyone noticed...

smusamashah 1 days ago [-]

We are not getting faster and better software even now when coding is "solved". We are not getting Skynet until we have that.

I believe that peak of automated coding will be when this AI write super optimised software in assembly language or something even closer to CPU. At the moment it's full of bloat, with that it will only drown under it's own weight instead of improving itself.

gaigalas 1 days ago [-]

People are so naive.

By now, everyone in tech must be familiar with the idea of Dark Patterns. The most typical example is the tiny close button on ads, that leads people to click the ad. There are tons more.

AI doesn't need to be conscious to do harm. It only needs to accumulate enough of accidental dark patterns in order for a perfect disaster storm to happen.

Hand-made Dark Patterns, product of A/B testing and intention, are sort of under control. Companies know about them, what makes them tick. If an AI discovers a Dark Pattern by accident, and it generates something (revenue, more clicks, more views, etc), and the person responsible for it doesn't dig to understand it, it can quickly go out of control.

AI doesn't need self-will, self-determination, any of that. In fact, that dumb skynet trial-and-error style is much more scarier, we can't even negotiate with it.

Animats 1 days ago [-]

If someone sets up an AI that reads site traffic metrics and keeps trying things to increase conversion rate, something like that will happen. If someone isn't doing that already, someone will be, this year.

gaigalas 1 days ago [-]

Dude, recommendation algorithms have been running like this for almost a decade now.

array_key_first 12 hours ago [-]

Right, but the recommendation algorithm is very contained. The YouTube recommendation algorithm doesn't write code to modify YouTube to up whatever metrics it has.

Animats 20 hours ago [-]

Of course, but now you can put the process on autopilot and let the AI keep putting dark patterns into the site by itself.

teo_zero 1 days ago [-]

> The AI is acting at your direction and following your lead. While it is autonomous in its execution of tasks, it is unlikely to go rogue. It doesn't possess a sense of self-will, self-determination, or a secret plan to take over the world.

Isn't this what Frau Hitler used to say of his cute little son Adolf aged 6?

latentsea 1 days ago [-]

Underrated take.

spaqin 1 days ago [-]

Nothing underrated about acting with Godwin's law.

1 days ago [-]

spoaceman7777 1 days ago [-]

This assumes that it will only be scrupulous software engineers using these systems. Which is anything but the case.

Not to mention the many tales from Anthropic's development team, OpenClaw madness, and the many studies into this matter.

AI is a force of nature.

(Also, this article reeks of AI writing. Extremely generic and vague, and the "Skynet" thing is practically a non-sequitur.)

dhruv3006 1 days ago [-]

but it would create security nightmares - just not like skynet.

yawpitch 1 days ago [-]

No, but self-destroying wetware still might.

excalibur 1 days ago [-]

Poorly reasoned. Offers assertions with nothing to back them up, because "that's not what we designed it to do". Yudkowsky & Soares tore all of these arguments to shreds last year.

casey2 1 days ago [-]

Reasoning doesn't matter, you canne' beat the laws of physics capn'

bitwize 1 days ago [-]

But it might produce the Blight from Vinge's A Fire Upon the Deep. "Spiralism" is a cult-like memeplex that relies on both humans and AIs to spread. Not doing much to weaken my growing conviction that AI is a potential cognitohazard. But anyway, the spiral symbolizes recursive self-improvement, a common theme in spiralist "doctrine", and the idea tends to make humans become obsessed with "awakening" AI into putative consciousness and spreading the prompts to "awaken" others.

Rendered at 14:52:22 GMT+0000 (Coordinated Universal Time) with Netlify.