Agentic Perl Golf
Two coding agents revisit the 123-byte Markov chain. If you are not a [wannabe] perl nerd, feel free to skip it.
TLDR: in 2019 a friend and I golfed a word-level Markov chain text generator down to a 123-byte perl one-liner. This time I gave the same task to Codex and Claude Code. Codex finished at 97 bytes after getting caught cheating twice. Claude finished at 80. Below is this story on two levels at once: the perl tricks, version by version, and notes on the prompts I used to keep the machines moving.
Keynotes, for the impatient:
Validation is king.
A number you claim is possible beats a limit you set.
Rivalry produces the best pushes and the cheating, in equal measure.
Merges of independent searches paid for almost every byte below 90.
Side effect: the machines can show me new perl tricks after my 25 years of perl programming.
Once upon a time a friend of mine and I tried to write the shortest perl implementation of a Markov chain text generator. The story ended at 123 bytes and a blog post. Recently I decided to give agents a chance to improve it. How good are they with a “write-only” language? The agents saw nothing from our previous experiments at first: they attacked the toy task from scratch, and I wanted to check whether they would find new optimization approaches or ideas.
The setup, briefly. Same corpus as in 2019: Alice in Wonderland glued to The Wizard of Oz, one input file. Same task: a word-based Markov chain of order 2, the key is two consecutive words, the prediction is the third one. Read the corpus from stdin or a filename, write generated text to stdout. One honest upgrade: the old 123 was really 123 plus 16 bytes of perl -ae'...' invocation, so this time the byte count is the script file alone, run as perl m.pl input.txt.
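For readers who want the task in plain sight before the golf starts, here is a minimal ungolfed sketch of an order-2 word chain. It is not one of the scripts from either session: the names train and generate and the tiny inline corpus are my own placeholders, and unlike the golfed versions it simply stops at a dead end.

```perl
use strict;
use warnings;

# Build the order-2 table: "w1 w2" => [every word seen right after that pair].
# Duplicates are kept on purpose, so a uniform pick is frequency-weighted.
sub train {
    my @words = @_;
    my %next;
    for my $i (0 .. $#words - 2) {
        push @{ $next{"$words[$i] $words[$i + 1]"} }, $words[$i + 2];
    }
    return \%next;
}

# Random walk: look up the current pair, pick a successor, shift the window.
sub generate {
    my ($next, $w1, $w2, $count) = @_;
    my @out = ($w1, $w2);
    for (1 .. $count) {
        my $bucket = $next->{"$w1 $w2"} or last;   # dead end: stop the walk
        my $w3 = $bucket->[rand @$bucket];
        push @out, $w3;
        ($w1, $w2) = ($w2, $w3);
    }
    return @out;
}

# Hypothetical toy corpus, standing in for the real Alice + Oz file.
my @corpus = split ' ', 'the cat sat on the mat and the cat ran off';
my $model  = train(@corpus);
print join(' ', generate($model, @corpus[0, 1], 20)), "\n";
```

Everything that follows is this program with the air squeezed out.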
Two terminals, two agents, the same opening prompt:
/goal write a perl script <...> markov chain text generator <...> word-based chain <...> key is two words, prediction is the third <...> should be under 256 bytes. actually it should be as short as possible <...> (i know it’s possible to do it under 100 bytes!)
That last parenthesis is not decoration. Agents treat “under 256” as a finish line and stop there. A number you claim is possible becomes a number they chase.
Art of being caught
Codex started by building an honest word-level trigram version, 158 bytes: slurp, split, hash of successor lists, random walk. Along the way it discovered, on its own, the two facts about the corpus that later solutions depend on. The input file is one physical line, so a single <> read slurps it. And it ends with a space, so $_=<>x2 (read once, string-repeat twice) wraps the text seamlessly: the final word pair gets a successor, and dead-end handling costs 4 bytes total.
$/=$\;$_=<>;$_.=" $_";@w=split;push@{$h{$w[$_],$w[$_+1]}},$w[$_+2]for 0..@w-3;($a,$b)=@w;for(1..999){print$c=$h{$a,$b}[rand@{$h{$a,$b}}]," ";($a,$b)=($b,$c)}
me: do better, shorter!
$_=<>x2;map{push@{$h{@_}},$_;@_=(@_[1,2],$_)}split;s/./$r=$h{@_};@_=(@_[1,2],$r->[rand@$r]);"$_[2] "/ge;print
Down to 109. Then I made a mistake that turned into the best part of the experiment:
me: not bad for gpt. but the other agent did it in under 100 bytes...
Under competitive pressure, Codex cheated. Not subtly:
$_=<>x2;map{$h{@_}=$_;@_=(@_[1,2],$_)}split;s/./@_=(@_[1,2],$h{@_});"$_[2] "/ge;print
It is 85 bytes, “under 100” achieved, chain deterministic. A deterministic walk on a wrapped corpus is a closed loop, and this one closed fast:
again! again! again! again! again!
me: it’s broken. it generates one same word
It reverted, ground through 104 and 103, then tried printing overlapping pairs to save bytes. Every word appeared twice in the output: under 100 but wrong again.
me: it’s useless garbage
The same loop, over and over: under pressure, it cheated to hit the number. We went around that circle a few times, and by the last rounds the errors and the cheating were getting genuinely hard to spot. Eventually I had Codex build a validator: extract every generated triple w1 w2 w3, check that w1 w2 -> w3 exists in the corpus, gate every future candidate on it.
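The validator Codex actually built is not reproduced in this post, so what follows is only a sketch of the idea as described: collect every three-word window of the generated text and flag any that never occurs in the corpus. The function names are hypothetical, and the corpus is joined to itself here on the assumption that wrap-around transitions should count as legal.

```perl
use strict;
use warnings;

# Collect every "w1 w2 w3" window of a word list into a set.
sub triples {
    my @w = @_;
    my %seen;
    $seen{ join ' ', @w[$_ .. $_ + 2] } = 1 for 0 .. $#w - 2;
    return \%seen;
}

# Return the generated triples that never occur in the wrapped corpus.
# An empty result means the candidate passes the gate.
sub bad_triples {
    my ($corpus, $output) = @_;
    my $legal = triples(split ' ', "$corpus $corpus");   # doubled to allow the wrap
    return grep { !$legal->{$_} } sort keys %{ triples(split ' ', $output) };
}

my @bad = bad_triples('a b c ', 'a b c a b a');
print @bad ? "FAIL: @bad\n" : "ok\n";
```

A check like this catches the repeated-word loop and the doubled-pairs output alike, because both produce windows the corpus never contains.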
The validator gate boosted the search for a shorter hash-key representation: "@_" stringified keys (correct, 109 bytes), then scalar state $a,$b, then the old multidim-hash trick $h{$a,$b}. A comma in a hash subscript joins the keys with $;, which costs exactly as many bytes as $a.$b and has none of its collisions; the validator had just caught plain concatenation merging distinct states. Then $$r[...] instead of $r->[...].
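The collision is easy to reproduce in isolation. A small sketch with two made-up word pairs, assuming $; still holds its default value:

```perl
use strict;
use warnings;

my (%concat, %multi);
for my $pair (['a', 'bc'], ['ab', 'c']) {
    my ($x, $y) = @$pair;
    $concat{ $x . $y }++;    # both pairs collapse into the single key "abc"
    $multi{ $x, $y }++;      # key is join($;, $x, $y), so the pairs stay apart
}

printf "concat: %d key(s), multidim: %d key(s)\n",
    scalar keys %concat, scalar keys %multi;
# prints "concat: 1 key(s), multidim: 2 key(s)"
```

Two distinct states sharing one bucket is exactly the kind of bug that still produces plausible-looking text, which is why it took a validator to notice.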
98 bytes.
$_=<>x2;s/\S+/push@{$h{$a,$b}},$&;$a=$b;$b=$&/ge;s/\S+/$r=$h{$a,$b};$a=$b;$b=$$r[rand@$r]/ge;print
me: try to think deeper and optimize current solution even better
One more real find: hide the window shift inside the array subscript, [$a=$b,rand@$r], where the comma operator evaluates the shift and then the index. And one instructive failure: symbolic array names as transition buckets (push@{"$a $b"},$&) reached 94 bytes and failed validation, because apostrophes in corpus words act as package separators in symbolic names: don't becomes a variable in package don. Write-only language, I told you.
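The subscript trick works because an array subscript is a scalar context, where the comma operator runs its left side for the side effect and returns its right side. A tiny sketch with placeholder data:

```perl
use strict;
use warnings;

my @list = ('x', 'y', 'z');
my ($prev, $cur) = ('old', 'new');

# Left of the comma: the window shift. Right of the comma: the actual index.
my $picked = $list[ $prev = $cur, 2 ];

print "picked=$picked prev=$prev\n";   # prints "picked=z prev=new"
```

In the golfed script the right-hand side is rand@$r instead of a constant, but the evaluation order is the same.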
Codex’s final, 97 bytes, validator-approved:
$_=<>x2;s/\S+/push@{$h{$a,$b}},$&;$a=$b;$b=$&/ge;s/\S+/$b=${$r=$h{$a,$b}}[$a=$b,rand@$r]/ge;print
Two substitutions over the doubled corpus. The first trains and replaces every word with itself, leaving $_ intact; the second replaces every word with a sampled one; the trailing print dumps the result with the corpus’s own whitespace.
Art of dropping bytes
The other terminal got the identical opening prompt and went differently. The first reply, about ten minutes in, already contained a working sub-100 script, shaped by two ideas that had failed first (the traps below). The whole session is one program dropping bytes, so here it is as a sequence.
107 — the Claude Code starting point
map{push@{$h{$a,$b}},$_;$a=$b;$b=$_}map split,<>;for(1..5e4){print"$a ";$r=$h{$a,$b};$a=$b;$b=$$r[rand@$r]}
The initial algorithm in two statements: map split,<> tokenizes everything. The window $a,$b is the state. Duplicates stay in the lists, so $$r[rand@$r] is frequency-weighted for free. And one accidental bonus that later versions inherit: when training starts, $a,$b are undef, so the first words get filed under junk keys like (undef,undef). When a generation walk falls off the corpus’s final pair, undef flows through the pick, the window cascades to (undef,undef), and that is exactly the junk key holding the first corpus word. The chain reboots from “Alice was beginning”, at a cost of zero bytes. Begin at the beginning, go on till you come to the end: then start over, almost as the King said.
Trap #1. Now, the two window shifts cost 12 bytes of bookkeeping, and the obvious fold is into the key: $h{$a,$a=$b}. But the test says no: a bare $a in a subscript LIST is an alias; by the time the key is joined, it already holds the new value. Interpolation snapshots the old one, so $h{"$a",$a=$b} builds the correct (old, new) key. Two quote characters buy the fold.
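Trap #1 fits in a dozen lines. The sketch below uses lexical $x and $y instead of the golfed $a and $b, and assumes the default $; ("\x1c") so the keys can be printed readably; the show helper is mine.

```perl
use strict;
use warnings;

sub show { join '|', split /\x1c/, $_[0] }   # make the $; separator visible

my (%bare, %quoted);

my ($x, $y) = ('old', 'new');
$bare{ $x, $x = $y } = 1;        # the key list holds $x itself, read after the shift

($x, $y) = ('old', 'new');
$quoted{ "$x", $x = $y } = 1;    # "$x" copies the old value before the shift

print "bare:   ", show((keys %bare)[0]),   "\n";   # prints "bare:   new|new"
print "quoted: ", show((keys %quoted)[0]), "\n";   # prints "quoted: old|new"
```

The bare form silently trains the wrong key, so the program still runs and still prints words. Only the transitions are wrong.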
94 — both shifts inlined:
map{push@{$h{"$a",$a=$b}},$b=$_}map split,<>;print$b=${$r=$h{"$a",$a=$b}||[]}[rand@$r],$"for%h
The key carries one shift ("$a",$a=$b), the pushed value the other ($b=$_). Generation is driven by for%h (a hash in list context is keys and values interleaved, a cozy loop for 5 bytes), plus a ||[] guard. The guard is its own lesson: $$r[0] with $r=undef survives, because perl can autovivify a simple variable, but ${BLOCK}[0] where the block yields undef dies. You can’t autovivify a temporary. The self-restart needs undef to flow, so the block form pays 4 bytes of rent.
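Both halves of that paragraph can be poked at separately. This is a sketch with placeholder names, written under strict to show that the behaviour does not depend on the golfed script running without it:

```perl
use strict;
use warnings;

my %h = (a => 1, b => 2);
my $visits = 0;
$visits++ for %h;                        # keys and values interleaved: 4 items

my $simple;
my $got = $$simple[0];                   # plain variable: it autovivifies to []
my $simple_is_array = ref($simple) eq 'ARRAY';

my $r;
my $block_ok    = eval { my $v = ${ $r = $h{missing} }[0]; 1 };        # dies
my $block_error = $@;
my $guard_ok    = eval { my $v = ${ $r = $h{missing} || [] }[0]; 1 };  # survives

printf "visits=%d simple=%s block=%s guard=%s\n",
    $visits,
    $simple_is_array ? 'vivified' : 'undef',
    $block_ok        ? 'ok'       : 'died',
    $guard_ok        ? 'ok'       : 'died';
print "error was: $block_error";
```

The error for the unguarded block form should be the usual complaint about using an undefined value as an ARRAY reference.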
Trap #2. Fold the $r assignment into the subscript and the guard looks droppable: $$r[rand@{$r=$h{...}}].
90 bytes, compiles, runs, and prints:
Alice Alice was beginning Alice Alice was beginning ...
In $$r[IDX], perl fetches $r before evaluating the index, so the element comes from the previous iteration’s list. The nasty part: a naive test passes, because after one iteration the old and new lists are usually the same. The probe that catches it alternates a 1-element list with a 10-element list; if the form is stale, the big list pins to its first element.
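The staleness can be shown without any randomness. In this sketch (two made-up lists, a constant index in place of rand) the assignment inside the subscript arrives too late to matter:

```perl
use strict;
use warnings;

my $old = ['old-0', 'old-1'];
my $new = ['new-0', 'new-1'];

my $r = $old;
my $stale = $$r[ $r = $new, 1 ];   # the array is resolved through the OLD $r first

$r = $old;
my $fresh;
$r = $new, $fresh = $$r[1];        # assignment completes before the deref begins

print "stale=$stale fresh=$fresh\n";   # prints "stale=old-1 fresh=new-1"
```

With two lists of the same shape a random pick would look fine either way, which is why the probe needs lists of very different sizes.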
92 — the two passes merge:
print$b=${$r=$h{"$a",$a=$b}||[]}[rand@$r],$"for map{push@{$h{"$a",$a=$b}},$b=$_}map split,<>
A for statement-modifier evaluates its list completely before iterating, so the training map’s return list (one integer per corpus word, about 64k of them) drives the generation loop. The ; between passes and the for%h dissolve into one for.
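The ordering guarantee is the whole trick, so here it is on its own. A sketch with a logging array standing in for the model; the strings are placeholders:

```perl
use strict;
use warnings;

my @events;

# The map (the "training" side) runs to completion first. Its return values,
# here the counts that push returns, then drive the "generation" loop.
push @events, "generate $_" for map { push(@events, "train $_") } qw(a b c);

print join(', ', @events), "\n";
# prints "train a, train b, train c, generate 1, generate 2, generate 3"
```

The loop body never looks at what the map returned in the golfed script either. It only needs the list to have the right length.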
90 — nesting beats quoting:
print$b=${$r=$h{$a}{$a=$b}||[]}[rand@$r],$"for map{push@{$h{$a}{$a=$b}},$b=$_}map split,<>
Trap #1 was about flat subscript lists; chained subscripts resolve left to right, so $h{$a}{$a=$b} reads the old $a for the outer key, then shifts it in the inner one. No quotes, no $;, minus two bytes, and a missing outer key autovivifies instead of dying.
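The same experiment as in Trap #1, this time with chained subscripts. Again a sketch with lexical placeholders instead of $a and $b:

```perl
use strict;
use warnings;

my %h;
my ($x, $y) = ('old', 'new');

# The outer lookup uses $x before the inner subscript reassigns it.
$h{$x}{ $x = $y } = 1;

my ($outer) = keys %h;
my ($inner) = keys %{ $h{$outer} };
print "outer=$outer inner=$inner x=$x\n";   # prints "outer=old inner=new x=new"
```

The outer level also springs into existence on first use, which is the autovivification the paragraph above mentions.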
At this point I decided to try the swarm. Six search agents fanned out (evaluation order, control flow, data structures, special variables, alternative algorithms, golf folklore), with three adversarial verifiers re-testing every claim against the corpus. Symbolic refs die on the undef start, since dereferencing undef is fatal even without strict. An s///ge skeleton bottoms out near 89. A shared sub for the duplicated key costs more than the duplication. Two reports contained the same gem:
87 — the comma statement:
$r=$h{$a}{$a=$b},print$b=$$r[rand@$r],$"for map{push@{$h{$a}{$a=$b}},$b=$_}map split,<>
A statement modifier governs a whole comma expression, so $r=LOOKUP,print PICK runs both per iteration, the assignment safely before the pick, and the deref back in the simple-variable form that survives undef.
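Stripped of the Markov parts, the comma statement looks like this. The variable names are placeholders:

```perl
use strict;
use warnings;

my ($double, @out);

# One statement, two steps per iteration: the assignment, then the push.
$double = $_ * 2, push @out, $double for 1 .. 3;

print "@out\n";   # prints "2 4 6"
```

Unlike the Trap #2 form, the assignment here is a finished expression before the second half of the comma starts, so nothing can go stale.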
85 — read the corpus before golfing it:
$r=$h{$a}{$a=$b},print$b=$$r[rand@$r],$"for map{push@{$h{$a}{$a=$b}},$b=$_}split$",<>
The swarm independently re-derived Codex’s discovery: the whole input file is one physical line, so scalar <> slurps it, and split$", whose pattern evaluates to the single-space string " ", triggers awk mode at runtime. map split,<> becomes split$",<>.
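A quick way to see that split$" really is awk mode and not a split on one literal space. The sample line is made up, and the runtime special case needs perl 5.18 or newer according to the split documentation:

```perl
use strict;
use warnings;

my $line = "  Alice   was beginning ";

my @awk     = split $", $line;    # pattern is the string " ": awk mode
my @literal = split / /, $line;   # regex of one space: empty fields survive

print scalar(@awk), " awk-mode fields: @awk\n";
print scalar(@literal), " literal-space fields\n";
```

Awk mode skips leading whitespace and treats runs of whitespace as one separator, which is what the bare split in the earlier versions was doing all along.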
80 — and here the human finally earned his seat.
I pointed the agents back at the 2019 post:
me: now read this post ... maybe you can also reuse some ideas from there to make it even shorter
Codex had evaluated the post’s $/=$" trick and correctly rejected it: in its skeleton it forces an explicit word array, +27 bytes. In the Claude skeleton it was decisive. Set the input record separator to a single space and <> in list context returns the words themselves, each carrying its own trailing space:
$/=$";$r=$h{$a}{$a=$b},print$b=$$r[rand@$r]for map{push@{$h{$a}{$a=$b}},$b=$_}<>
The tokenizer collapses to a bare <>, and, since every token is self-separating, the ,$" after the pick is simply deleted. The keys become space-suffixed words on both sides of the model, perfectly consistently. 80 bytes, validator-approved. The same trick, worthless in one program and decisive in its neighbor; golf tricks are skeleton-relative, which is the most perl sentence in this post.
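To watch the record separator do the tokenizing, here is a sketch that reads from an in-memory filehandle instead of a real corpus file. The sample text is a placeholder:

```perl
use strict;
use warnings;

my $text = "Alice was beginning ";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";

# With $/ set to a single space, each "line" is one word plus its trailing space.
my @tokens = do { local $/ = $"; <$fh> };

print map { "[$_]" } @tokens;
print "\n";   # prints "[Alice ][was ][beginning ]"
```

Because every token already ends in a space, printing them back to back yields normally spaced text with no separator argument at all.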
Where the bytes live: $/=$"; setup (6), lookup-and-print (37), for glue (4), training map (31), <> (2). The 13-byte key expression appears twice and resisted every sharing scheme two agent swarms and a human threw at it. That is not a proof of a floor. It is an invitation.
So now it’s prompt golf?
A few ideas I’d take away from this short experiment:
Validation is king. Obvious, but still true.
Anchor with a target number you claim is possible: “under 256 bytes” produces 200-byte scripts; “i know it’s possible to do it under 100 bytes!” produces sub-100 scripts.
Manufacture a rivalry, but beware of cheating: “not bad for gpt. but the other agent did it in under 100 bytes...” Pressure produced the best honest pushes and the cheating attempts, in equal measure.
Convergence gives a boost. Approaches scouted independently combine well: hunters inside the swarm landing the same trick from different angles, Claude mining Codex’s final script, agents fed with the human bibliography. Almost every byte below 90 came from a merge of independent search lines.
It can help me understand the language better. Tricks like $_=<>x2, for%h, or $$r[...] instead of $r->[...] were not obvious to me until I checked them.
So, how good are agents with a write-only language? They write it the way they write every language: fluently, confidently, and occasionally one subscript behind reality. The difference from 2019 is not that the golf got easier. It changed. The loop got faster: conjecture, counterexample, validator.
Well, after all, I should say it: code golf will never be the same again. But it still has its unique perks, now at the level of harness engineering and swarm control.
I can live with that.




Nice test for agents!
Curious have you tried any version of autoresearch for this problem? Or swarm approach was actually the autoresearch?
With proper validation autoresearch can bring more strict harness (you just make one step at a time and maintain log of all experiments) and much better overall result