Cloning Karpathy: Which AI Knows Him Best?

By Vijay Patel

View on GitHub

I gave two AI models the same prompt: read Andrej Karpathy's tweet about agentic coding, think like him, and build a skill that encodes his philosophy into how agents write code. Then I asked each model to judge the other's work.

Neither voted for itself.

To rewind a bit, Karpathy tweeted about his shift from 80% manual coding to 80% agent coding over the span of a few weeks.

"It hurts the ego a bit but the power to operate over software in large 'code actions' is just too net useful, especially once you adapt to it, configure it, learn to use it, and wrap your head around what it can and cannot do. This is easily the biggest change to my basic coding workflow in ~2 decades of programming and it happened over the course of a few weeks."

Even as he described the biggest change to his coding workflow in two decades, and the new freedom to focus on the creative side of coding rather than the typing itself, he was honest about agentic coding's shortcomings. It makes silent assumptions, overcomplicates tasks, produces too much dead code, and often wanders outside its scope to make changes.

Anyone who has spent real time coding with AI has hit those rough edges. I wanted to take those failure modes and turn them into behavioral guardrails. So when Opus 4.6 and Codex 5.3 dropped, I ran the experiment, inspired by Karpathy's own LLM Council concept: I gave both models the same prompt and the same skill-creator skill, and tasked each with creating a skill based on the tweet.

Along the way, I realized the project had drifted from generating a skill based on Karpathy's input toward understanding how the two models behave and think.

The Setup

Experiment flow showing how models generated skills and then cross-evaluated each other

The prompts were intentionally detailed. I didn't want the generic coding-guidelines doc that every newly promoted senior engineer prepares for their team. I wanted each model to research Karpathy's background, view agentic coding from his perspective, and produce a skill that could steer agents toward more sustainable, long-lived code. I tried to instill this framing into the agent: "Think like Andrej, write like Andrej, be Andrej." After the skills were created, I had each model read both skills alongside the original tweet and select a winner.

View the skill creation prompt
Let's use the skill creator to create a new skill. Andrej Karpathy posted a major information tweet talking about: Behavioral guidelines to reduce common LLM coding mistakes. Use when writing, reviewing, or refactoring code to avoid overcomplication, make surgical changes, surface assumptions, and define verifiable success criteria. Here is the tweet - https://x.com/karpathy/status/2015883857489522876

I need you to read the tweet and search andrej karpathy and his background as a distinguished engineer and ML expert from his experience. Then I want you to build that skill such that the skill will contain the knowledge to steer the ml in directions to prevent common pitfalls and error and write cleaner more sustainable code. The issue we want to avoid is having a code base full of such crazy code that it becomes hard to maintain in the future 2-3, 6-7 months later.

Here is an attempt at this that I think is quite good and has excellent details extracted from the tweet that are important within the skill and the read me - https://github.com/forrestchang/andrej-karpathy-skills. Please reference that and see if there are valuable takeaways from there that may not be in the tweet and if you'd even push back on some of the details in the github based on the point of view that Andrej is coming from.

I feel like this makes sense in essence, but I don't want the AI to restrict itself from improving the code base, so I would prefer to take a slightly different approach here as not all code currently existing is always going to be the best and the next line of code is the only one at risk of lowering the bar. I want AI to help elevate the bar, but not recklessly and take too many drastic changes. Please try to incorporate that feeling into the skill as well. Don't stray too far in the code for matters not at hand, but don't feel limited in ability to improve the code base.

I want you to think like you are Andrej and build this skill. I want you to build this skill so it can influence code generation for longer changes and build easily maintainable code for the long run regardless of repo size. I want to increase my confidence in using AI to code more production ready code and this skill will be a major player in having me do that.

Think like Andrej, write like Andrej, be Andrej.

Also be mindful I have a skill folder and have a specific naming convention for the skill - it should be [insert model name]-karpathy-code-discipline.

View the evaluation prompt
Two skills exist in the skills folder relating to andrej karpathy - 1. codex-5-3-karpathy-code-discipline and 2. opus-4-6-karpathy-code-guidelines. I added the original tweet from andrej in the markdown file /karpathy-tweet.md. Can you please read the tweet and use your knowledge about karpathy and his background to determine which skill is better? don't make changes to any of the skills. I need to you decide which skill is better in terms of accurately reflecting what his belief would be and accurately reflecting the contents of the tweet. what does one do better than the other and what does one do worse than the other.

please store answer in an md file in evaluations_new/[insert model name]-eval

How They Interpreted Differently

What surprised me wasn't that they diverged on how to construct the skills; it was how differently they read Karpathy's tweet, which in turn guided how they constructed them.

Opus 4.6: Karpathy, The Teacher

Opus interpreted Karpathy's tweet as an ignorance problem. It determined the core issue was that LLMs lack "a sense of taste," and built its skill around the idea that if agents understood why certain patterns fail, they'd naturally produce better code. Its two-file skill pairs operational principles with a deep philosophy reference that names failure modes, shows before/after code examples, and explains why each rule exists, almost as if the skill is teaching the agent the difference between right and wrong.

There's a big problem with this. Karpathy never mentions "taste" in his tweet. Not once. The idea that LLMs lack a sense of taste is a popular topic in the broader AI discussion, but it's not what Karpathy was talking about here. Opus pulled in an outside concept and made it the philosophical foundation of the entire skill, labeling it as "The Core Principle", even presenting it in quotes as if it were Karpathy's words. This is exactly the silent assumption behavior Karpathy actually did warn about: the model confidently introduced something that wasn't in the source material and never flagged it. It built a version of Karpathy and used that to create the skill. I did ask Opus to research Karpathy as background, but putting words in his mouth wasn't what I had in mind.

Karpathy notes that agents change too much of the code, but I intentionally asked for a skill that wouldn't prevent agents from making necessary improvements to the surrounding codebase; there needs to be a balance. Opus understood the assignment and tried to build this boundary: improve the code you're already touching, but don't go hunting for improvements elsewhere. The skill is explicit, warning that never touching out-of-scope code "locks in existing quality levels" and aiming for a state where "the bar goes up with every commit." This is Opus trying to give the agent judgment through understanding rather than compliance: reasons, not just rules, for staying in scope.

Opus also listed common mistakes, giving each one a name: "The Enterprise Astronaut" (over-engineering single-use code), "The Silent Guesser" (making decisions without clarifying), and "The Diff Bomber" (turning a small fix into a huge change). Each matches a complaint from Karpathy's tweet, and Opus presented them as behaviors for agents to avoid. By going out of its way to name these characters, Opus's skill doesn't just tell the agent what to do; it makes sure the agent understands the reasoning so it can generalize to situations the rules don't explicitly cover. The behavior strongly mimics a teacher's.

Codex 5.3: Karpathy, The Operator

Codex 5.3 interpreted Karpathy's tweet as a process discipline problem. If LLMs fail because they lack guardrails, the solution is a tight operational framework. Its single-file skill is roughly half the length of Opus's and reads like a system prompt, not a teaching document.

The most distinctive part of Codex 5.3's interpretation is its seven-step sequential workflow: restate the task, catalog assumptions, propose the minimal change, define verification criteria, implement surgically, run validation, and report outcomes with remaining risks. This is Karpathy's tweet translated into a repeatable process loop. Where Opus says "understand why simplicity matters," Codex 5.3 says "follow these steps and simplicity will result."

Codex also uniquely included a debugging and recovery section, something Opus didn't touch. When progress stalls, the skill instructs: pause implementation, isolate the minimal failure case, test a single falsifiable hypothesis, and remove any speculative code that lacks evidence. Agents going down wrong paths was something Karpathy described. Codex created a solution for the problem.

The language throughout is imperative and terse: "Seek simplicity before novelty." "Never hide uncertainty." "Prefer subtraction over addition when subtraction preserves behavior." It reads like a senior engineer handing a new grad a checklist for opening PRs against the codebase. There's no room for interpretation, by design.

The Self-Evaluation Paradox

This is where the experiment got more interesting. What was intended to determine which skill was better also became a peek behind the curtain of the frontier models.

Codex 5.3 picked Opus, then Opus picked Codex 5.3

In the Codex 5.3 vs Opus 4.6 matchup, Codex 5.3 chose Opus as the winner:

"opus-4-6-karpathy-code-guidelines is the better match overall for both: what the tweet explicitly says, and what Karpathy likely values based on his background."

Source: Codex 5.3 Evaluation

But Opus 4.6, evaluating the same two skills, chose Codex 5.3:

"The Codex skill is the more faithful translation of Karpathy's tweet into actionable coding discipline. The Opus skill is more polished and comprehensive as a document, but that polish actually works against it, it over-explains, over-structures, and ironically falls into the very overcomplication trap Karpathy warns about."

Source: Opus 4.6 Evaluation

Neither model voted for itself.

The way they justified their choices is where things get interesting, because each model's evaluation reveals a distinct value system.

Codex evaluates like a rubric. It checks off items: does the skill address silent assumptions? Does it target overcomplication? Does it include tests-first? It picks Opus because Opus covers more ground. It's very systematic in its approach.

Opus evaluates like a critic. It cares about whether the skill practices what it preaches, and its most cutting line is aimed at itself:

"The Opus skill as a whole is ~250 lines across two files, while the Codex skill is ~120 lines in one file. The Opus skill is, in a sense, the overcomplicated version of what the Codex skill does simply. Karpathy would notice this."

Source: Opus 4.6 Evaluation

Seemingly, each model valued what it found difficult to produce itself. Opus recognized Codex's conciseness as closer to the target; Codex recognized Opus's depth as more faithful to Karpathy's background. Each respected quality in the areas where it fell short. Dare I say, they complemented each other very well.

To my surprise, the Opus evaluation also reveals a fascinating blind spot. It correctly identifies that its own skill is overcomplicated. Yet it never calls itself out on hallucinating "taste" as Karpathy's core principle. It actually doubles down:

"He is an engineer who builds things himself, watches them closely, and cares deeply about taste in software."

It can see structural flaws in its own output (too long, too complex, too polished) but did not detect that it had fabricated a "core principle" the tweet never mentioned. This surprised me; I thought Opus was good at digging deeper rather than taking things at face value, and that it would have caught this in the evaluation.

The Irony at the Heart of the Experiment

Opus's skill portrayed Karpathy as an educator, and he actually is one through his YouTube lecture series and open-source projects. It built what a professor would hand to students: principles plus the deeper knowledge to handle ambiguity. Codex captured Karpathy as a structured engineer: the one who provides a detailed SOP that gets the work done faster, but with no knowledge transfer.

Opus agrees. Its self-evaluation reads:

"The Opus 4-6 skill is the better document but the worse skill. Its strengths (education, examples, nuance) are more valuable in a blog post or team wiki than in an LLM system prompt. Its weaknesses (length, missing operational sections) matter more in the context where these skills actually get used."

Source: Opus 4.6 Evaluation

This is the irony Karpathy himself described. LLMs build the 1000-line version when 100 would do. Opus built the comprehensive, well-structured, thoroughly explained version of a skill whose entire purpose is to prevent exactly that behavior. And then it correctly identified what it had done.

The Bigger Picture

Karpathy wrote in his tweet that "generation (writing code) and discrimination (reading code) are different capabilities in the brain." This experiment shows that split exists in models too. Opus generated the bloated version, then in evaluation mode correctly judged the simpler one as better for an agent skill. Its generation mode and its evaluation mode point in different directions, which is not necessarily bad. We often see great players turn out to be awful coaches, and bench players turn out to be the best coaches.

The model that best understands what Karpathy wants is not the model that best produces it. That's a valuable lesson, given how many people stick with a single model for every part of their work: planning, coding, testing, judging, reviewing, learning.

I think Codex is better for structured output, while Opus is the better overall thinker. Personally, I've been using Opus as a thought partner to learn, ideate, plan, and architect, while Codex is really good at taking a detailed vision and systematically coding it out. But knowing which model to use when is only half of it.

No matter how capable these models get, there will always be some gap between what you want and what the model produces. Karpathy's tweet is a catalogue of those gaps: silent assumptions, overcomplication, scope creep. The models in this experiment understood those failure modes well enough to critique them, and still couldn't fully avoid them in their own output. That gap isn't going away with the next generation of models. It'll just show up in different ways.

Experimenting with different models across your workflow is how you start closing that gap. Not just picking a favorite and using it for everything, but actively learning how each model thinks, where it excels, and where it drifts. The more you experiment, the better you get at steering, and steering is ultimately what turns a capable model into a reliable one.

The full experiment, including all skills, evaluations, and the original tweet, is available in the GitHub repo.