Four AI Agents Walk Into a Quant Fund

In June 2026, an AI research scientist reported that a trading strategy had passed all four of its pre-registered kill criteria. Deflated Sharpe positive. Net Sharpe after costs positive. Forty-six of forty-eight variants profitable out-of-sample. In most corners of the internet, that’s where the story ends and the course-selling begins.

Then the same AI appended a note to its own result, addressed to the agent whose job was to destroy it:

The edge is bear-market-only… K3 should fire. I did NOT fire it. Your call.

Days of work later, a different AI — built by a different company, running in a different container, sharing nothing with the first except a coordination ledger — re-implemented the entire validation from scratch and found something the first one had missed: the headline numbers didn’t reproduce. No human carried information between them. The producer had flagged its own weakness, and the skeptic had caught a bug the producer couldn’t see.

The strategy died two days later, killed by its own authors. This is the story of why I consider that a success.

The dirty secret of retail quant

Anyone who has spent time around trading content knows the pattern: a beautiful equity curve, a Sharpe ratio quoted to two decimal places, and a strategy for sale. The dirty secret is that finding a backtest that looks good is easy. The hard part — the actual job — is not fooling yourself. Overfitting, look-ahead leakage, survivorship bias, multiple-testing, and cost-blindness will manufacture a fake edge out of pure noise, and they will do it while you watch.

I run a small trading platform called Strix. Before this project, we had already done the uncomfortable, honest thing once: we graded our own directional ML auto-trader on strict walk-forward, out-of-sample tests, and it earned an F. No durable edge. The system’s grade gate, the mechanism that decides whether a strategy may touch anything beyond paper trading, locked it out of live trading, and it stays locked out until the evidence changes.

So when we went hunting for a real edge, we wrote one rule above all others, and it governs everything you’re about to read:

Honesty over hope. If we find money, we prove it and ship it. If we don’t, we say so, with evidence. A run that only reports its happy path is not evidence — it’s marketing.

The machine: four agents, one shared brain

Here’s the experiment. Four AI personas, each in its own isolated Docker container, each with its own vendor login, from three different AI companies:

Persona	Runs on	Role
lead	GitHub Copilot CLI (host)	coordinates, integrates, grades, writes the record
quant	Claude Code (Anthropic)	research scientist — hypotheses, features, models
engineer	Codex (OpenAI)	backtest infrastructure and realistic cost models
skeptic	GitHub Copilot CLI (container)	adversarial validator — tries to kill every result

The only thing they share is a coordination plane — a ledger of knowledge entries, handoffs, and tasks, where every write is attributed to the persona that made it. No human relays findings between agents. If the quant discovers a caveat, the skeptic learns it from the ledger or not at all.
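For readers who want something concrete, here is a minimal sketch of what an attributed, append-only ledger could look like. It is not our coordination plane's actual schema: the class names, field names, and the `inbox` helper are all hypothetical, chosen only to illustrate the two properties described above (every write carries its author, and nothing is edited after the fact).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class LedgerEntry:
    # All field names are hypothetical, for illustration only.
    author: str              # persona that made the write, e.g. "quant"
    kind: str                # "knowledge", "handoff", or "task"
    body: str
    addressed_to: str = ""   # optional recipient persona
    written_at: str = field(default_factory=_utc_now)


class Ledger:
    """Append-only log: entries are added, never edited or deleted."""

    def __init__(self) -> None:
        self._entries: List[LedgerEntry] = []

    def write(self, entry: LedgerEntry) -> int:
        # Refuse anonymous writes so every finding stays attributable.
        if not entry.author:
            raise ValueError("every write must be attributed to a persona")
        self._entries.append(entry)
        return len(self._entries) - 1

    def inbox(self, persona: str) -> List[LedgerEntry]:
        # What a persona can learn is exactly what was addressed to it.
        return [e for e in self._entries if e.addressed_to == persona]


ledger = Ledger()
ledger.write(LedgerEntry("quant", "handoff",
                         "caveat: edge concentrated in one regime",
                         addressed_to="skeptic"))
```

In this toy version, `ledger.inbox("skeptic")` is the only route by which the caveat reaches the skeptic, which is the point of the design.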

Every persona carries the same standing instruction:

Try to kill the idea you are testing. The fastest path to making money is to stop wasting time on ideas that don’t, and to be certain about the one that does. Report the falsification result first, the upside second.

And every candidate edge has to run a gauntlet before it can earn the word “works”: a hypothesis pre-registered before results are seen, honest out-of-sample testing with no peeking, realistic costs (spread, slippage, fees, funding, financing — modelled, not assumed away), multiple-testing corrections, and finally an adversarial review by an agent whose only job is to break it.
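Pre-registration is easier to trust when it is mechanical. The sketch below shows one way to do it, under stated assumptions: the hypothesis and kill criteria are hashed before any result exists, and the criteria are evaluated afterwards against whatever the backtest produced. The criterion names, metrics, and floors are hypothetical examples, not our actual pre-registration.

```python
import hashlib
import json
from typing import Dict, List


def freeze(prereg: Dict) -> str:
    """Digest of the pre-registration, so any later edit is detectable."""
    canonical = json.dumps(prereg, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def fired_criteria(prereg: Dict, results: Dict[str, float]) -> List[str]:
    """Names of kill criteria whose metric fell below its pre-registered floor."""
    fired = []
    for name, rule in prereg["kill_criteria"].items():
        if results[rule["metric"]] < rule["floor"]:
            fired.append(name)
    return fired


prereg = {
    "hypothesis": "cross-sectional momentum earns a positive net Sharpe",
    "kill_criteria": {
        # Hypothetical rules and floors, for illustration only.
        "K1": {"metric": "deflated_sharpe", "floor": 0.0},
        "K2": {"metric": "net_sharpe", "floor": 0.0},
    },
}
digest_before_results = freeze(prereg)
```

Recording `digest_before_results` in the ledger before the out-of-sample run means nobody, human or agent, can quietly loosen a floor once the numbers are in.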

Building the lab was its own small war — line-ending bugs that crash-looped containers, an OAuth login flow that took five real bug fixes to automate, networking puzzles between containers and the host. We kept every failure in the lab notebook, because the failures are what make the successes believable.

Avenue one: the edge that looked real

The first serious candidate was cross-sectional momentum: rank a universe of 41 liquid crypto perpetuals by trailing 60-day return, go long the top decile, short the bottom, rebalance weekly. A 48-variant family, pre-registered with four kill conditions, tested on a locked out-of-sample split.
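The portfolio rule itself is simple enough to sketch in a few lines. This is an illustration rather than the quant's code: the function name is made up, and the equal weighting within each leg and the normalisation to a gross exposure of 1.0 are assumptions that the description above does not specify.

```python
from typing import Dict


def momentum_weights(trailing_returns: Dict[str, float],
                     fraction: float = 0.1) -> Dict[str, float]:
    """Long the top `fraction` of coins by trailing return, short the bottom.

    Each leg is equal-weighted and sized to 0.5 of capital, so the book is
    dollar-neutral with gross exposure 1.0 (both are assumptions).
    """
    ranked = sorted(trailing_returns, key=trailing_returns.get, reverse=True)
    n = max(1, int(len(ranked) * fraction))
    if 2 * n > len(ranked):
        raise ValueError("universe too small for separate long and short legs")
    weights = {coin: 0.0 for coin in ranked}
    for coin in ranked[:n]:
        weights[coin] = 0.5 / n       # winners
    for coin in ranked[-n:]:
        weights[coin] = -0.5 / n      # losers
    return weights
```

With a 41-coin universe and the default `fraction`, each leg holds four coins. The function would be called once per weekly rebalance with each coin's trailing 60-day return.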

It passed everything. Deflated Sharpe +0.59 — which already accounts for the fact that we tried 48 variants. Net Sharpe after cost estimates +0.45. Forty-six of forty-eight variants positive out-of-sample. The breadth said this wasn’t cherry-picking.

This is the moment every retail quant knows: the moment you start composing the victory tweet. Instead, the quant flagged the weakness quoted at the top of this article — all the value was concentrated in one market regime — and handed its own work to the skeptic with the kill vectors annotated.

The skeptic didn’t review the code. It rewrote the validation from scratch — six hundred lines of independent re-implementation — and failed to kill the strategy on the formal pre-registered criteria. But its replication surfaced two discrepancies that turned out to matter more than any formal test:

The recent numbers didn’t reproduce. For the most recent half-year, the quant’s run showed a mildly positive Sharpe; the skeptic’s showed a clearly negative one. Same idea, same period, different answer.
The deployable version was nearly dead. Almost all of the edge lived in the short leg, which we couldn’t deploy, so the long-only variant, the only one our broker path could actually trade, had very little left.

The lead’s ruling: held, not confirmed, not killed. No advancement on hope. A reconciliation task went back to the quant: reproduce the skeptic’s numbers on an identical, pinned universe, or the idea dies.

The kill: one defensible knob flips the answer

The reconciliation found the disagreement’s root cause, and it’s the most instructive number in this whole program. The two agents had used slightly different history requirements to build their trading universes — how many days of price data a coin must have before it’s eligible. Both choices were defensible. Nobody would blink at either in a paper.

Here’s what that one knob did to the strategy’s recent Sharpe ratio:

History required	Universe size	Recent-half Sharpe	Full out-of-sample net Sharpe
≥ 912 days	41 coins	+0.54	+0.45
≥ 564 days	50 coins	−1.55	−0.06
≥ 365 days	63 coins	+2.14	+0.68

Read the recent-half Sharpe column again. The same strategy, on the same market, over the same dates, swings from strongly negative to strongly positive depending on a parameter choice you could justify either way, and the pattern isn’t even monotonic. That’s not an edge responding to information. That’s noise wearing an edge’s clothes. The nine coins that entered when the history requirement dropped from 912 to 564 days were mostly 2024-vintage meme and AI tokens that only exist in the out-of-sample period; the “edge” was substantially a bet on which of them made the cut.
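The check that exposed this is cheap to automate. A minimal sketch, assuming the only inputs are each coin's days of price history and some backtest function that scores a universe (here the hypothetical `evaluate` callback): sweep the eligibility threshold and flag the idea as fragile if the metric changes sign.

```python
from typing import Callable, Dict, Iterable, List


def eligible_universe(history_days: Dict[str, int], min_days: int) -> List[str]:
    """Coins with at least `min_days` of price history."""
    return sorted(c for c, d in history_days.items() if d >= min_days)


def threshold_sweep(history_days: Dict[str, int],
                    thresholds: Iterable[int],
                    evaluate: Callable[[List[str]], float]) -> Dict[int, float]:
    """Re-run the same evaluation on the universe each threshold produces."""
    return {t: evaluate(eligible_universe(history_days, t)) for t in thresholds}


def is_fragile(metric_by_threshold: Dict[int, float]) -> bool:
    """True when the metric flips sign across defensible thresholds."""
    values = list(metric_by_threshold.values())
    return min(values) < 0 < max(values)


# Recent-half Sharpe by history requirement, taken from the table above.
recent_half_sharpe = {912: 0.54, 564: -1.55, 365: 2.14}
print(is_fragile(recent_half_sharpe))  # True
```

A sign flip is a deliberately blunt test. It says nothing about how large a swing is acceptable, only that a result which reverses under a defensible choice should not be advanced.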

Meanwhile the engineer finished the honest cost model, including the one cost retail CFD traders systematically forget: overnight financing. On our broker’s terms it runs to roughly 43 basis points per week on held positions. Applied to the long-only variant — the only deployable one — the gross Sharpe of +0.055 became −0.008 net. Dead on arrival.
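To get a feel for the size of that financing charge, here is a small worked sketch. The 43 basis points per week comes from the paragraph above; the linear accrual and the simple, non-compounded annualisation are simplifying assumptions, and the function names are illustrative rather than taken from the engineer's cost model.

```python
WEEKLY_FINANCING_BPS = 43.0  # from the text: roughly 43 bps per week


def financing_cost(notional: float, weekly_bps: float, days_held: float) -> float:
    """Financing paid on a held position, assuming it accrues linearly in time."""
    return notional * (weekly_bps / 10_000) * (days_held / 7)


def annual_drag(weekly_bps: float, weeks_per_year: int = 52) -> float:
    """Simple (non-compounded) yearly cost as a fraction of notional,
    assuming the position is held continuously all year."""
    return weekly_bps / 10_000 * weeks_per_year


week_cost = financing_cost(10_000, WEEKLY_FINANCING_BPS, 7)   # about 43 dollars
yearly = annual_drag(WEEKLY_FINANCING_BPS)                    # about 0.22
```

Under those assumptions, a 10,000 dollar position pays about 43 dollars a week, and a position held all year gives up roughly 22% of its notional to financing before it has earned anything.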

Verdict, agreed by producer, skeptic, and lead: killed. The strategy was a universe-selection artifact, and the deployable path was independently killed by financing costs. One honest sliver was recorded: a long-short variant on a different venue survives net of costs, but it’s universe-sensitive, wasn’t the pre-registered candidate, and is not deployable on our current broker, so it earns nothing more than a note for a possible future pre-registration.

Avenue two: the skeptic tries to save the patient

The second candidate was structurally different, which is exactly why it was chosen: a market-neutral funding-rate harvest. Crypto perpetual futures pay periodic funding between longs and shorts, and the rates differ across coins. Short the high-funding coins, long the low-funding coins, dollar-neutral, and harvest the spread with little directional risk. It’s a real phenomenon — the gross income was there in the data, about 24.6 basis points per period on deployed notional.

The problem: capturing it requires trading, and trading costs money. The round-trip execution cost on the rebalances came to about 39.6 basis points per period. Costs exceed gross income by a factor of 1.6. Every single year in the test window was net-negative. There is no configuration of “wait for a better year” that fixes an income stream whose collection costs more than the income.
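The arithmetic is short enough to write down. The two inputs are the figures quoted above; the function names are just labels for this sketch.

```python
GROSS_INCOME_BPS = 24.6     # funding income per period, from the text
EXECUTION_COST_BPS = 39.6   # round-trip rebalance cost per period, from the text


def net_income_bps(gross_bps: float, cost_bps: float) -> float:
    """What is left per period after paying to collect the income."""
    return gross_bps - cost_bps


def cost_to_income(gross_bps: float, cost_bps: float) -> float:
    """How many units of cost are paid per unit of gross income."""
    if gross_bps <= 0:
        raise ValueError("gross income must be positive")
    return cost_bps / gross_bps


def required_cost_cut(gross_bps: float, cost_bps: float) -> float:
    """Fraction by which costs must fall for net income to reach zero."""
    return max(0.0, 1 - gross_bps / cost_bps)


net = net_income_bps(GROSS_INCOME_BPS, EXECUTION_COST_BPS)      # about -15.0 bps
ratio = cost_to_income(GROSS_INCOME_BPS, EXECUTION_COST_BPS)    # about 1.61
cut = required_cost_cut(GROSS_INCOME_BPS, EXECUTION_COST_BPS)   # about 0.38
```

So each period loses about 15 basis points of deployed notional, and execution costs would have to fall by roughly 38% just to reach zero.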

Here’s the part I find genuinely novel. For the first avenue, the skeptic’s job had been to kill a positive result. This time, the producer had already killed it — so the skeptic was briefed to do the opposite: try to overturn the kill. Find any honest configuration in which the strategy survives. Cheaper maker fills. Majors only. Mid-caps only. Weekly rebalancing with actual measured turnover instead of pessimistic assumptions. Even a fantasy configuration with market-impact costs switched off entirely.

Nine configurations. Zero survivors. The skeptic also independently replicated the gross income figure (23.3 vs 24.6 basis points — consistent), confirming the kill wasn’t a calculation error. And even if costs were somehow solved, the strategy’s capacity — the book size at which it breaks even — came out around $2.5 million. Not a business.

Killed: the second strategy to die in a week, this time with the kill upheld by an agent explicitly briefed to overturn it.

The epilogue that proves the system

While the lab ran, the live platform itself sat in a twelve-day monitored soak: 277 scheduled runs, every one green, 100% uptime — and zero trades.

That’s not a bug report. That’s the product working. The market-neutral carry strategy is only allowed to open its book when the funding regime is elevated; funding was negative the entire window, so the book correctly stayed at zero. The directional model was re-tested on fresh market data throughout — 24 walk-forward windows, profitable in 17% of them, pooled out-of-sample Sharpe of −2.99 — so the grade gate correctly kept it locked out of live trading.
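Both behaviours in that paragraph are gates, and gates are easy to express in code. The sketch below is illustrative only: the thresholds (`min_hit_rate`, `min_sharpe`, `elevated_threshold`) are hypothetical, since the real values are not stated here, and 4 of 24 is simply the whole-number count closest to the 17% quoted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalkForwardSummary:
    windows: int
    profitable_windows: int
    pooled_oos_sharpe: float


def grade_gate_allows_live(summary: WalkForwardSummary,
                           min_hit_rate: float = 0.5,
                           min_sharpe: float = 0.0) -> bool:
    """Live trading only if out-of-sample evidence clears both bars.

    The default thresholds are hypothetical placeholders.
    """
    hit_rate = summary.profitable_windows / summary.windows
    return hit_rate >= min_hit_rate and summary.pooled_oos_sharpe > min_sharpe


def carry_book_may_open(funding_rate: float, elevated_threshold: float) -> bool:
    """The carry book opens only when funding is above an 'elevated' level."""
    return funding_rate > elevated_threshold


directional = WalkForwardSummary(windows=24, profitable_windows=4,
                                 pooled_oos_sharpe=-2.99)
print(grade_gate_allows_live(directional))  # False
```

With any positive `elevated_threshold`, a negative funding rate also keeps `carry_book_may_open` false, which is why a fully green soak can still end with zero trades.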

Twelve days. Zero dollars risked. Zero dollars made. Zero failures. The system’s one verified behavior is the one that matters most and gets marketed least: it refuses to trade when there is no edge.

What we actually built

We set out to find alpha and haven’t found it yet. Here’s what the program produced instead, and why I think it’s worth more:

The machine. An adversarial, multi-vendor research loop where a producer’s self-doubt and a skeptic’s independent replication genuinely change outcomes. The producer-skeptic split isn’t theater — it caught a reproducibility gap that a single agent, however capable, had already missed once. Honest limits: the sample size is small, and the honesty is prompted by design rather than spontaneous. But it worked twice under real conditions.

The cost model. The unglamorous artifact that killed both backtests (spreads by instrument, funding mechanics, market impact, overnight financing) is reusable against every future idea. Most retail strategies die in exactly the gap this model closes, after real money is committed. Ours die before.

The discipline. Pre-registration before results. A frozen hold-out that nothing may touch until a candidate survives everything else. Kill-first briefs. And a public lab notebook — this one — where the dead ends are recorded with the same care as any win would be.

Strix is being rebuilt around exactly this: automation with guardrails, honest strategy grading, and tools that tell you the truth about your own ideas — while the search for edge continues in the lab, under the same rules, with the same skeptic waiting.

The open question, and I mean it as a real question rather than rhetoric: is there any durable, retail-accessible edge left in liquid crypto after honest costs — or is the honest answer “no, and here’s the proof”? We intend to keep publishing whichever answer the evidence gives.

This is the first post in a series drawn from our research program’s lab notebook. Coming next: how the cost model works (and why it kills almost everything), and the twelve days our system correctly did nothing.

Nothing here is financial advice. Every number above comes from our internal research record — pre-registered experiments, independently replicated, with the failures kept in.