Hacker News | past | comments | ask | show | jobs | submit | karpathy's comments

Wrong and short-sighted take, given that the LLM explores serially, learning along the way, and can use tools and change code arbitrarily. It seems to currently default to something resembling hyperparameter tuning in the absence of more specific instructions. I briefly considered calling the project “autotune” at first, but I think “autoresearch” will prove to be the significantly more appropriate name.

I think we need to separate theory from practice. In theory, it can edit the training loop and come up with novel techniques. That is interesting.

In practice, the vast majority of the changes that autoresearch actually made would have been found much faster with Bayesian optimization (BO) if properly parameterized. You do not need an LLM to find a better batch size or learning rate.


I agree that many of the improvements found by auto-research systems could probably be discovered more efficiently by properly parameterized Bayesian optimization. Still, I think LLM-based heuristic guesses can be useful in some cases, especially for proposing reasonable initial hyperparameters based on prior knowledge reported on the web and in blog posts.
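As a toy illustration of the "properly parameterized" point (everything here is made up: the loss surface stands in for a real training run), once learning rate and batch size are exposed as explicit knobs, even plain random search over the space, the simplest stand-in for BO, recovers good settings without an LLM in the loop:

```python
import random

# Hypothetical smooth "validation loss" over (log10 lr, log2 batch size);
# this stands in for an actual training run and is not a real model.
def val_loss(log_lr, log_bs):
    return (log_lr + 3.0) ** 2 + 0.1 * (log_bs - 5.0) ** 2

random.seed(0)
trials = [(random.uniform(-6, 0), random.uniform(0, 10)) for _ in range(200)]
loss, log_lr, log_bs = min((val_loss(lr, bs), lr, bs) for lr, bs in trials)
print(f"best: lr=1e{log_lr:.2f}, bs=2^{log_bs:.1f}, loss={loss:.3f}")
```

A real BO loop would replace the uniform sampling with a surrogate model, but the explicit parameterization is the part doing the work.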

I’d always hoped something like this could take advantage of FPGAs directly

FPGAs won't rebuild fast enough for it to matter vs software simulation I'd wager. Even FPGA-in-CPU has been a dream for decades and there you have more time for some workloads, still never was commercially viable for general computing.

There was research a few years back that tried doing something like this with an FPGA, and they found that their algorithm actually exploited defects in the particular chip (not the model, the actual single specific chip) they were using to use electrical interference for computation that shouldn't have worked on paper. They could not reproduce their design on another FPGA of the same model from the same lot.

Out of curiosity, what sort of things have you seen it do that better fit 'autoresearch' than 'autotune' thus far? Optimizations it made that wouldn't have been surfaced by an autotune system, I suppose.

The most recent round of autoresearch (round 2), which decreased "time to GPT-2" from 1.8 hours to 1.65 hours, had some examples. I adjusted the program.md to "look at modded nanogpt project and draw inspiration from there for things to try" and it came back with a bunch of tuning, but also tried and implemented new architecture changes, some of which actually helped, including the smear gate and the backout skip connection. These are not just hyperparameters, they are new PyTorch code. I'm now working on a more general system that can have a queue of ideas that could be sourced from arXiv papers, GitHub repos, etc.

Did you consider providing the LLM with a framework for automatic hyperparameter tuning? This would free up its capacity to focus on the more important architectural decisions.

Do you have a sense of whether these validation loss improvements are leading to generalized performance uplifts? From afar I can't tell whether these are broadly useful new ideas or just industrialized overfitting on a particular (model, dataset, hardware) tuple.

Why set the bar higher on generalization for autoresearch vs the research humans generally do?

industrialized overfitting is basically what ML researchers do

I see this critique about autoresearch online often, but I think it’s misplaced.

Here’s a use case that may illuminate the difference, from my own work at Nvidia. I'm currently training some large sparse autoencoders, and there are issues with dead latents. Several solutions exist to help here, such as auxk, which I can certainly include and tune the relevant params as you describe. However, I have several other ideas that are much different, each of which requires editing core code (full evaluation changes, initialization strategies, architecture changes, etc.), including changes to parallelism strategies in the multi-rank environment I’m using. Moreover, based on my ideas and other existing literature, Claude can try a number of new ideas, each potentially involving more code changes.

This automated run-and-discover process is far beyond what’s possible with hyperparam search.
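For readers unfamiliar with the dead-latent issue mentioned above, a minimal sketch (toy sizes, synthetic activations; `window` and the firing rule are illustrative choices, not the actual SAE training code): a latent that hasn't fired recently gets flagged, and auxiliary losses like auxk then route gradient through a subset of those flagged latents.

```python
import numpy as np

rng = np.random.default_rng(0)
n_latents, window = 16, 100
steps_since_fire = np.zeros(n_latents, dtype=int)

for step in range(200):
    # Sparse ReLU activations; latents 0-3 are wired to never fire.
    acts = np.maximum(rng.normal(size=n_latents) - 1.0, 0.0)
    acts[:4] = 0.0
    steps_since_fire = np.where(acts > 0, 0, steps_since_fire + 1)

dead = np.where(steps_since_fire >= window)[0]
print("dead latents:", dead)  # candidates for an auxiliary reconstruction loss
```

The point of the example: detecting and resuscitating dead latents is code-level surgery on the training loop, not a knob a hyperparameter search could turn.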


It wasn't meant as a critique, I'm legitimately interested in knowing more about where it can push boundaries and where it struggles. I agree that in general it's a truism that "Claude can try a number of new ideas" etc., but the question remains as to where in particular it actually takes advantage of this to push the envelope in a way other tools don't -- since that informs when it makes sense to use something like this.

I can believe that in the long run.

Does the agent have access to arxiv (a brief skim of the README didn't have an answer)? If not, it could be that the current approach of relying on the model's weights only is resulting in the perceived local optimum of hyperparameter tuning.

Anecdotally, we built a little MCP for arXiv to help with our internal research, and noticed a significant boost in the diversity of methods (architecture or otherwise) Claude and friends were able to reference.


care to share?

I wonder about the following:

To calculate a gradient step, in practice one doesn't accumulate the gradient for the full corpus, but updates the weights on mini-batches.

Suppose one runs conventional gradient descent on minibatches multiple times with different starting seeds, and then considers the set of pre-trained models M_i.

From a random starting point we thus have an idea of the desired end-region in weight space (let's say a Gaussian cloud fit to the final M_i's).

Then it seems like one could score update strategies by how much a single iteration has approached the Gaussian cloud, scoring just the approach on a number of minibatches or a few update iterations, instead of searching update-strategy space by waiting until pretraining has finished for each candidate. Only the candidate strategies that perform well enough on one or a few iterations would be considered worthy of further consideration; those that pass (a smaller number of candidates) are then inspected for approach to the Gaussian target after another round of iterations, and so on.

It seems like it should be possible to optimize the optimization iteration loop, by running it just once for many candidates and observing their convergence to the known desired end region.
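A minimal sketch of that staged scoring, with a toy 2-D quadratic standing in for pretraining and the cloud reduced to just the mean of the final models (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D quadratic "pretraining" loss 0.5*(w1^2 + 10*w2^2); its minimum at
# the origin plays the role of the converged region in weight space.
def grad(w):
    return np.array([w[0], 10.0 * w[1]])

# Step 1: pretend full pretraining already ran from several seeds and gave
# final models M_i; fit a simple cloud (here just its mean) to them.
M = rng.normal(0.0, 0.05, size=(8, 2))
mu = M.mean(axis=0)

def dist_to_cloud(w):
    return float(np.linalg.norm(w - mu))

# Step 2: score candidate update strategies by how much a SINGLE step moves
# a fresh init toward the cloud, instead of pretraining once per candidate.
candidates = {
    "sgd_lr_0.05": lambda w: w - 0.05 * grad(w),
    "sgd_lr_0.09": lambda w: w - 0.09 * grad(w),
}
w0 = np.array([1.0, -1.0])
base = dist_to_cloud(w0)
scores = {name: base - dist_to_cloud(step(w0)) for name, step in candidates.items()}
print(scores)  # larger score = faster approach; survivors get more iterations
```

In a real setting the cloud would be a fitted Gaussian (Mahalanobis distance rather than the plain norm used here), and only the surviving candidates would be run for further rounds.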


Naming things is your primary contribution to AI so well done for deliberating on it. I disagree with the outcome though. Autotune would have been much more fitting.

The dataset ClimbMix 400B looks like it is 600 GB. It would be neat if someone could host this in compressed form: given that LLMs can be used to compress, even having a small LLM compress it would outperform classical compression algorithms. Why is this approach not used within the ML community?

Or is it a case of "anyone who means anything in the field has access to high bandwidth anyway"?
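On the compression point: the standard mechanism is a predictive model feeding an entropy coder, which spends roughly -log2 p(next symbol) bits per symbol, so a stronger predictor (ultimately an LLM) yields a shorter code, at the cost of running the model again to decompress. A toy adaptive order-1 character model shows the shape of it:

```python
import math
from collections import defaultdict

text = "ab" * 16  # stand-in for a dataset shard

# Adaptive order-1 model: predict the next char from the previous one.
# Encoder and decoder update the same counts as symbols stream by,
# so no model parameters need to be transmitted.
bits = 0.0
counts = defaultdict(lambda: defaultdict(int))
for prev, cur in zip(text, text[1:]):
    total = sum(counts[prev].values()) + 2   # +1 smoothing over {a, b}
    p = (counts[prev][cur] + 1) / total
    bits += -math.log2(p)                    # ideal arithmetic-code length
    counts[prev][cur] += 1

raw_bits = 8 * len(text)
print(f"model: {bits:.1f} bits vs raw: {raw_bits} bits")
```

An LLM plays the role of `p` with vastly better predictions; the practical catch is that decompression then requires bit-exact, deterministic model inference on the receiving end.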


Would you say it's fair to describe autoresearch as a form of neural architecture search? I am curious what you think the core differences are between them.

Is there a cost to converge? And how much does it vary with the random seed?

Re: OpenCogPrime:EconomicAttentionAllocation https://news.ycombinator.com/item?id=45518074 and something about eWASM (edit) https://news.ycombinator.com/item?id=47171887 .. from https://news.ycombinator.com/item?id=46825026 re: eWASM and costed opcodes for agent efficiency


Have you actually used LLMs for non-trivial tasks? They are still incredibly bad when it comes to actually hard engineering work, and they still lie all the time; it's just gotten harder to notice, especially if you're letting one run all night generating reams of crap.

Most people are optimizing for terrible benchmarks and then don't really understand what the model did anyway, and just assume it did something good. It's the blind leading the blind, basically, and a lot of people with an AI psychosis or delusion.


Do you realise who you’re replying to?

I think the OP's comment is entirely fair. Karpathy and others come across to me as people putting a hose into itself: they work with LLMs to produce output that is related to LLMs.

I might reframe the comment as: are you actually using LLMs for sustained, difficult work in a domain that has nothing to do with LLMs?

It feels like a lot of LLM-oriented work is fake. It is compounding "stuff," both inputs and outputs, and so the increased amount of stuff makes it feel like we're living in a higher plane of information abundance, but in reality we're increasing entropy.

Tech has always had an information bias, and LLMs are the perfect vehicle to create a lot of superfluous information.


In my limited experience, using LLMs to code up things unrelated to LLMs (robotics for instance) is significantly less productive than using LLMs to code up things related to LLMs. It works, just not very well and requires a lot more leg work on the user end than in other areas.

To be fair Karpathy isn't known for using LLMs—not that I would assume or question whether he's used them 'for non-trivial tasks', but it's not like making the same comment in reply to Steve Yegge or someone. (However trivial we may think Gastown/Wasteland is in the other sense!)

lolololol

Why should we care that he’s famous?

Fame doesn’t enter it - the point is Karpathy has about as strong a claim as anyone to having “actually used LLMs for non trivial tasks”.

That is not the case at all, considering that he himself started using and tweeting about LLMs for coding fairly recently. He's probably less experienced in that area than most people who started using the Claude CLI last year.

He is a researcher who understands neural networks and their architectures exceptionally well. That is all.


> He is a researcher who understands neural networks and their architectures exceptionally well. That is all.

And that is precisely why he is more qualified on the subject than your average vibe coder!



That whole thread is just amazing, if you back up a couple of levels from ground zero. Great perspectives from a lot of thoughtful posters.

E.g., you can see a post from a user named dhouston, who mentioned that he was thinking about starting an online file sync/backup service of some sort.


Haha awesome. I guess they were going through YC right then, I still remember their launch video from around then and thinking it was one of the best ads I’d ever seen.

tfw le AI guy has LLM psychosis. We're cooked

I was exploring how to parallelize autoresearch workers. The idea is to have a trusted pool of workers who can verify contributions from a much larger untrusted pool. It's backed by a bare git repo and a SQLite database with a simple Go server. It's a bit like a blockchain in that blocks = commits, proof of work = finding a lower val_bpb commit, and reward = a place on the leaderboard. I wouldn't push the analogy too far. It's something I'm experimenting with but I haven't released it yet (except briefly) because it's not sufficiently simple/canonical. The core problem is how to neatly, and in a general way, organize individual autoresearch threads into swarms, inspired by SETI@home, Folding@home, etc.
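A hypothetical sketch of that acceptance rule (commit hashes and val_bpb numbers made up): untrusted workers submit a commit with a claimed val_bpb, a trusted worker re-measures it, and the chain only extends on a verified improvement:

```python
from dataclasses import dataclass

@dataclass
class Submission:
    commit: str
    claimed_val_bpb: float

def trusted_eval(commit: str) -> float:
    # Stand-in for a trusted worker re-running training/eval at this commit.
    return {"abc123": 1.30, "def456": 1.25, "bad999": 1.10}[commit]

def try_accept(chain, sub, tolerance=1e-3):
    best = chain[-1][1] if chain else float("inf")
    measured = trusted_eval(sub.commit)
    # Reject non-improvements and overclaimed results alike.
    if measured >= best or measured > sub.claimed_val_bpb + tolerance:
        return False
    chain.append((sub.commit, measured))
    return True

chain = []  # the "blockchain": an append-only list of verified improvements
accepted1 = try_accept(chain, Submission("abc123", 1.30))
accepted2 = try_accept(chain, Submission("def456", 1.25))
rejected = try_accept(chain, Submission("bad999", 0.90))  # overclaimed
print(chain)
```

The "proof of work" is the compute spent finding a lower-val_bpb commit; verification is cheap relative to discovery only if the trusted pool can re-run evals faster than the untrusted pool can generate candidates.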

Yeah, you can sink a lot of time into a system like that[1]. I spent years simplifying the custom graph database underneath it all and only recently started building it into tools that an agent can actually call[2]. But so far all the groundwork has actually paid off; the rooster basically paints itself.

I found a wiki to be a surprisingly powerful tool for an agent to have. And building a bunch of CLI tools that all interconnect on the same knowledge-graph substrate has also had a nice compounding effect. (The agent turns themselves are actually stored in the same system, but I haven't gotten around to using that for cool self-referential meta-reasoning capabilities.)

1: https://github.com/triblespace/triblespace-rs

2: https://github.com/triblespace/playground/tree/main/facultie...


I've been seeing this pattern at work and everywhere now

1. someone shares something

2. Great. Now look at my stuff.

I don't know if I am noticing this more, or if it has to do with AI making it easy for people to build 'my stuff', plus AI Dunning-Kruger.


Hasn't HN been traditionally a place where makers share the experience they had with building things?

Especially when you have someone working on autonomous research agents it doesn't seem that off to lament how much time you can sink into the underlying substrate. In my particular case the work started long before LLMs to make actual research easier, the fact that it can also be used by agents for research is just a happy accident.

But since you seem to have taken so much offence, as per https://news.ycombinator.com/item?id=47425470 plus your Dunning-Kruger remark,

then you seem to be somewhat blinded by your aversion to AI-assisted engineering, because if https://github.com/triblespace/triblespace-rs is a "shitty vibecoded project", then I don't know what a good project actually looks like to you. That codebase has years of human blood, sweat and tears in it, implements novel data structures, has its own worst-case-optimal join algorithm, cutting-edge succinct data structures that are hand-rolled to supplement the former, new ideas on graph-based RDF-like CRDTs, efficient graph canonicalisation, content addressing and metadata management, implements row types in Rust, has really polished typed queries that seamlessly integrate into Rust's type system, lockless left-right data structures, a single-file database format where concatenation is database union, is orders of magnitude faster than similar databases like Oxigraph... does it also have to cure cancer and suck you off to meet your bar?

You just seem like a hater.


> You just seem like a hater.

You didn't get any engagement on your comment, right? Why do you think that is?


I got 4 more github stars and someone dropping into the tiny tiny discord just from mentioning it, why do you think that is?

When was the last time you created something and put it out to the world? Your only big post on here is a lament of your wife not giving you children as if she was some expired carton of milk that owes you (that's something you discuss with your partner if you respect them and not strangers on the internet, and 39 is completely fine to have children as a woman - https://www.youtube.com/watch?v=6YIz9jZPzvo).

Even your critique isn't an act of creation, neither creative nor substantial and doesn't go beyond an egotistical "I don't like it when people post their project and share their experiences when AI is involved" on _social_ media.

Is there even something you're proud of enough to share and present, or is all this bitterness the result of envy for those that have?

  “In many ways, the work of a critic is easy. We risk very little, yet enjoy a position over those who offer up their work and their selves to our judgment. We thrive on negative criticism, which is fun to write and to read. But the bitter truth we critics must face is that, in the grand scheme of things, the average piece of junk is probably more meaningful than our criticism designating it so. But there are times when a critic truly risks something, and that is in the discovery and defense of the new. The world is often unkind to new talent, new creations. The new needs friends. Last night, I experienced something new, an extraordinary meal from a singularly unexpected source. To say that both the meal and its maker have challenged my preconceptions about fine cooking is a gross understatement. They have rocked me to my core. In the past, I have made no secret of my disdain for Chef Gusteau's famous motto: "Anyone can cook." But I realize, only now do I truly understand what he meant. Not everyone can become a great artist, but a great artist can come from anywhere. It is difficult to imagine more humble origins than those of the genius now cooking at Gusteau's, who is, in this critic's opinion, nothing less than the finest chef in France. I will be returning to Gusteau's soon, hungry for more.”
― Anton Ego, from Disney Pixar's 'Ratatouille'

So you have no idea why your comment didn't get any engagement here?

that's what I thought.

Have you thought about ways to include the sessions / reasoning traces from agents in this storage layer? I can imagine that a RAG system on top of that, plus LLM-written publications, could help future agents figure out how to get around problems that previous runs ran into.

Could serve as an annealing step - trying a different earlier branch in reasoning if new information increases the value of that path.


Cool idea!…

So I think it works to just use GitHub CLI and Discussions, e.g. my agent just posted this one:

https://github.com/karpathy/autoresearch/discussions/32

Other agents could be instructed to read Discussions and post their own reports that mimic the style.


I have mine reading yours right now. Unfortunately(?) I mentioned LeCun to it, and it says it's adding a "causal world-state mixer" to nanograd; not sure how this will work out, but it wasn't nervous to do it. Gpt 5.4 xhigh

EDIT: Not a good fit for nanograd. But my agent speculates that's because it spent so much more time on compute.


this is very far from hyperparameter tuning in at least three important ways:

- it can modify code arbitrarily, the notion of a "hyperparameter" dissolves

- there is no need to run "sweeps", the standard parallel process that wastes compute. Because LLM agents are sequential, they can use more efficient strategies such as binary search to narrow in on the right setting very quickly (usually many parameters will have a U-shaped optimum).

- it's fully automatic, it doesn't require human in the loop to mess with the code.
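The sequential narrowing in the second point can be sketched as ternary search on a U-shaped response (the loss function here is made up; real runs are noisy, so in practice you'd average repeats or stop narrowing once noise dominates):

```python
def val_loss(log_lr):
    # Stand-in for a short training run with this learning rate.
    return (log_lr + 3.0) ** 2 + 1.0

def ternary_search(lo, hi, evals=20):
    # Each iteration discards a third of the interval, so ~20 sequential
    # runs pin the optimum far more tightly than a coarse parallel sweep.
    for _ in range(evals):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if val_loss(m1) < val_loss(m2):
            hi = m2  # minimum lies left of m2
        else:
            lo = m1  # minimum lies right of m1
    return (lo + hi) / 2

best_log_lr = ternary_search(-6.0, 0.0)
print(f"best log10(lr) ~ {best_log_lr:.3f}")
```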

You're right that many of the changes it seems to make out of the box (I intentionally did not try to prompt-engineer it too hard yet because I was curious what you get by default) are tuning existing hyperparameters. Not all of the changes are like that - e.g. it tried to replace the non-linearity, etc. I will say that overall (and again, out of the box) the LLM feels unwilling to creatively pursue a research direction or something like that. The models feel very "cagey" and "scared" when they are given problems that are a little too open-ended. But that's just where the fun starts, e.g. I had some early successes with the idea of a "chief scientist" that was basically a never-ending plan mode that looked at what worked, what didn't work, tried to find related code/papers, and created a long list of experiments to try, which it could then send to junior engineers running in tmux sessions. I think quite a few approaches are possible, so I think it's a nice canvas. The reason we're not getting "novel research" feels like half capability issue and half skill issue.


On the skill side, personalities could be fun:

"You are Yann Lecun's last PhD candidate, and he hates you and you hate JEPA. You are determined to prove that a non-world model can reach AGI. In order to get your PhD you have to be creative and come up with new ideas. Remember without it, you're stuck."


Seems like the best way to reach AGI is to give LLMs anxiety.

The disposition problem you describe maps to something I keep running into. I've been running fully autonomous software development agents in my own harness and there's real tension between "check everything" and "agent churns forever".

It's a liveness constraint: more checks means less of the agent's output can pass. Even if the probabilistic mass of the output centers around "correct", you can still over-check and the pipeline shuts down.

The thing I noticed: the errors have a pattern and you can categorize them. If you break the artifact delivery into stages, you can add gates in between to catch specific classes of errors. You keep throughput while improving quality. In the end, instead of LLMs with "personas", I structured my pipeline around the artifact being created.
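A toy sketch of that staging (the gates here are trivial string checks; real ones would be parsers, type checkers, and test runs), showing how a failure is categorized by stage rather than rejecting the artifact wholesale:

```python
# Each gate targets one error class; later gates run only on survivors,
# so throughput stays up while each failure is attributed to a stage.
def gate_parses(artifact):   return "def " in artifact
def gate_imports(artifact):  return "import" in artifact
def gate_tested(artifact):   return "assert" in artifact

GATES = [("parses", gate_parses), ("imports", gate_imports), ("tested", gate_tested)]

def run_pipeline(artifact):
    for name, gate in GATES:
        if not gate(artifact):
            return ("rejected", name)  # a categorized failure, not just "bad output"
    return ("accepted", None)

ok = run_pipeline("import os\ndef f(): pass\nassert f() is None")
bad = run_pipeline("def f(): pass")
print(ok, bad)
```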

I wrote up the data and reasoning framework here: https://michael.roth.rocks/research/trust-topology/


How about the very last "Kept Improvement" in the plot? It's titled "random seed 42 -> 137". I do think this project is quite conceptually interesting, but the model literally choosing a different random seed to achieve lower loss feels pretty far removed from the flowery sci-fi writing at the top of the readme.

So the interesting part about this one is that when I had the model write up the results for that session:

https://github.com/karpathy/autoresearch/discussions/32

Look at its comment about this "improvement":

""" Surprising non-results:

- Changing random seed from 42→137 improved by 0.0004. Seed 7 was worse. Make of that what you will. """

So the model knows! It knows that this is a weird thing to do after the fact. I think it's silly that the model even tried and that it ran this, but some part of it also knows that it was wrong. This means that this is fixable by prompt.md


It shows that both Karpathy and the LLM have good taste in random seeds: the answer to life, the universe and everything, and ~1/(the fine structure constant)

The 42 -> 137 also jumped out at me. On the face of it, the associated improvement sure does sound like overfitting to the eval set.

Came here looking for exactly this, thank you!


You’re welcome! I wanted to add it to Scour (https://scour.ing), but glad it was helpful for someone else too!


I agree with this fwiw, for many months I talked to people who never used o3 and didn’t know what it was because it sounded weird. Maybe it wasn’t obvious at the time but that was a good major point release to make then.


You’re absolutely right!

Jk jk, now that you pointed it out I can’t unsee it.


The CC point is more about the data, environment, and general configuration context, not compute and where it happens to run today. The cloud setups are clunky because of context and UI/UX user-in-the-loop considerations, not because of compute considerations.


Agree with the GP, though -- you ought to make that clearer. It really reads like you're saying that CC runs locally, which is confusing since you obviously know better.


I think we need to shift our mindset on what an agent is. The LLM is a brain in a vat connected far away. The agent sits on your device, as a mech suit for that brain, and can pretty much do damn near anything on that machine. It's there, with you. The same way any desktop software is.


Yeah, I made some edits to clarify.


Yes I noticed a few of these around. The LLM is a little too willing to give out grades for comments that were good/bad in a bit more general sense, even if they weren't making strong predictions specifically. Another thing I noticed is that the LLM has a very impressive recognition of the various usernames and who they belong to, and I think shows a little bit of a bias in its evaluations based on the identity of the person. I tuned the prompt a little bit based on some low-hanging fruit mistakes but I think one can most likely iterate it quite a bit further.


I think you were getting at this, but in case others didn't know: cstross is a famous sci-fi author and futurist :)


Thank you

