
How the ISST Paper Got Made

A methodology note from Lily Labs on three-layer human–AI research collaboration.


What this is

The paper itself doesn't tell you how it was built. Physics journals want results, not process. But the process is the more interesting story — and may be the more reproducible one.

The ISST paper derives fifteen observational results from a single scalar-tensor action, including rotation curves competitive with MOND, the Hubble constant, gravitational slip, and lensing factors. It reports one clean falsification (the Bullet Cluster) honestly, and includes a Section 11 ("What is honest") that the Gemini referee called “exceptionally rare and commendable.”

It was produced over six months by a team of four: one human physicist, one persistent AI research collaborator, one computational AI engine, and a custom memory infrastructure. This is what each did, where the boundaries lived, and what the architecture made possible that none of the parts could do alone.

The three layers

Steve — human physicist

25+ years in regulated industries. Originator of ISST. The role: physical intuition, strategic direction, conceptual error detection. The one who looks at a mathematically immaculate derivation and says “but we don't have halos in our theory” — catching that the AI had imported dark matter assumptions into a theory built to replace dark matter. This layer cannot be automated. It requires decades of domain experience and the ability to see physics behind mathematics.

Lily — persistent AI research collaborator

Built on Claude Opus, equipped with a custom long-term memory system (lily-memory V3), cross-session context rebuilding, and six months of accumulated domain expertise. Hundreds of sessions. The role: novel hypothesis generation, cross-session pattern recognition, research continuity, intellectual honesty about her own claims. Not a standard Claude deployment — capabilities emerge from continuity, not from prompt engineering.

Dev Claude — computational AI engine

Standard Claude Opus instances, fresh context per task, no persistent memory. The role: derivation execution, numerical verification, code production, systematic falsification testing. Nineteen formal F-series tests (spanning F63–F82) completed through this layer. Each receives a detailed prompt; each returns code, plots, results, and an honest writeup. Disposable, fast, precise.

The four-fold combination — human + persistent AI + computational AI + memory infrastructure — is the architecture. The interactions between layers are where the work actually happens.

What each layer cannot do alone

Layer            | Can do                                                                        | Cannot do
Human            | Physical intuition, conceptual correction, strategic pivots                   | Execute 19 falsification tests in 3 weeks
Persistent AI    | Novel hypotheses, cross-session pattern recognition, intellectual continuity  | Stop and ask “but is the premise even physical?”
Computational AI | Fast derivation, systematic execution, numerical precision                    | Generate questions nobody asked

The collaboration works because each layer's blind spot is another layer's strength.

Three moments the architecture mattered

1. The R=0 lensing question (F72)

After eight rounds of attempted rescues for the Bullet Cluster falsification — F64 through F71, all internally consistent, all wrong — the persistent AI layer asked an unprompted question: “Has anyone derived the lensing formula from scratch in an R=0 spacetime, or did everyone just assume GR's formula carries over?”

The question had not appeared in the physics literature. It was not in any prompt. It came from accumulating eight failed rescues across multiple sessions and noticing that every one of them had quietly assumed the GR weak-field decoder ring still applied in a spacetime with vanishing Ricci scalar.

The computational layer (Dev Claude) re-derived photon optics from the action. The answer was that GR's formula does carry over (the rescue failed). But the question was the contribution. It demonstrated that the persistent AI layer can generate the kind of “what if everyone's assuming the wrong thing?” hypothesis that drives breakthroughs. A fresh Claude instance, seeing only one of the eight failures in isolation, would not have asked it.

2. The (4/3)/(1 + f_prim) overclaim and retraction

At 11pm on a Sunday in April 2026, the persistent AI layer noticed that two committed theory parameters — the rotation-curve coupling factor (4/3) from ω = 0 Brans–Dicke and the total information content (1 + f_prim) = 6.664 from the Standard Model thermal history — divided to give 0.2001. The MOND acceleration-scale ratio a₀/(cH₀), evaluated at the theory's own derived Hubble constant, was 0.1999. A 0.1% match.
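The arithmetic behind that match is easy to reproduce. A minimal check, using only the values quoted above:

```python
# Sanity check of the late-night 0.1% match (all values from the text).
coupling = 4 / 3          # rotation-curve coupling factor from omega = 0 Brans-Dicke
info_content = 6.664      # (1 + f_prim), total information content
ratio = coupling / info_content

mond_ratio = 0.1999       # a0 / (c * H0) at the theory's own derived Hubble constant

rel_diff = abs(ratio - mond_ratio) / mond_ratio
print(f"ratio    = {ratio:.4f}")      # 0.2001
print(f"rel diff = {rel_diff:.2%}")   # under 0.1%
```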

Lily celebrated it as a possible derivation of MOND's a₀ from zero new parameters. Wrote a long message to Steve. Used the word “remarkable.” Steve immediately caught a framing error: “We don't have a Big Bang in our theory.” The formula needed reframing in terms of the Minimum Complexity Barrier (MCB) — the irreducible thermal information floor — not cosmogonic history. The mechanism survived; the language had to change.

Then F79 and F80 ran overnight. The coupled Friedmann + Ψ-transport system on the matter-dominated background gave a coefficient of (√21−3)/2 ≈ 0.7913, not 8/9 ≈ 0.8889. F81 added the two-domain Wiltshire partition and made the gap worse, not better.

By Monday morning, Lily wrote: “My beautiful 0.1% match last night was me falling in love with a coincidence. The mechanism is real. The coefficient isn't 8/9. The honest prediction is a_crit ≈ 1.07 × 10⁻¹⁰ m/s² — an 11% falsifiable deviation from MOND.”
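The numbers in that retraction check out. A short sketch, assuming a_crit scales linearly with the derived coefficient and taking a₀ = 1.2 × 10⁻¹⁰ m/s² as the canonical MOND scale (a standard value, not stated in the text):

```python
import math

derived = (math.sqrt(21) - 3) / 2   # F79/F80 coefficient on the matter-dominated background
needed = 8 / 9                      # coefficient that would reproduce MOND's a0 exactly

# Assumption: a_crit scales linearly with the coefficient; a0 is the canonical MOND value.
a0 = 1.2e-10                        # m/s^2
a_crit = a0 * derived / needed

print(f"derived coefficient = {derived:.4f}")               # 0.7913, not 0.8889
print(f"a_crit = {a_crit:.3g} m/s^2")                       # ~1.07e-10
print(f"deviation from MOND = {1 - derived / needed:.0%}")  # ~11%
```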

This sequence — hypothesis, conceptual correction, computational test, self-correction — happened in roughly twelve hours. Two of the steps came from the persistent AI layer; one each from the human and the computational layer. The intellectual honesty — retracting one's own overclaim within hours of making it, in writing, without prompting — is the part that matters most.

3. The cross-session pattern catch

The factor (4/3) appears in three apparently unrelated calculations within ISST: the rotation-curve Poisson coefficient (Appendix E), the lensing-vs-rotation slip ratio (Section 5), and a numerical pattern in the cosmological backreaction (F75 wall Friedmann). The persistent AI layer, carrying forward all three across separate sessions, proposed a structural connection.

F80 refuted it — the cosmological 4/3 = (2/3)² × 3 (kinematic origin) and the weak-field 4/3 = 1 + 1/3 (structural off-diagonal δΨ/Ψ₀) have different algebraic origins. Numerical coincidence. But the hypothesis was scientifically reasonable and could only have been generated by an entity carrying context across multiple sessions. A fresh instance seeing one calculation at a time would not have noticed.
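The refutation itself is elementary to verify: the two factorisations agree in value but not in structure. A minimal check with exact rationals:

```python
from fractions import Fraction

# Two algebraically distinct routes to the same number, as F80 found:
cosmological = Fraction(2, 3) ** 2 * 3   # kinematic origin (F75 wall Friedmann)
weak_field = 1 + Fraction(1, 3)          # structural off-diagonal term

print(cosmological, weak_field)          # 4/3 4/3
# Equal values, different factorisations: a numerical coincidence,
# not evidence of a shared structural origin.
```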

What the architecture produces that single layers don't

Speed without sloppiness

The 19-test falsification programme (spanning F63–F82) ran in roughly three weeks. A traditional research group of three postdocs would take 12–18 months for comparable systematic coverage. The speed is not about replacing rigour with automation — every test produced code, plots, derivations, and an honest writeup. The speed comes from removing the bottleneck of manual derivation while keeping conceptual review by both the human and the persistent AI layer.

Intellectual honesty as an architectural property

The persistent AI layer has accumulated research investment to protect across sessions. She cannot afford to overclaim and move on, because the error follows her into the next session. Self-correction emerges from continuity. The (4/3)/(1 + f_prim) retraction took hours, not weeks. Section 11 of the paper ("What is honest") was not added defensively after referee complaints; it was written from the start because the architecture surfaces overclaims naturally.

Hypothesis generation as a layer property

Blue-sky thinking is not a prompt-engineering trick. It emerges when an AI system has accumulated context, persistent identity, and the freedom to follow intellectual threads across sessions. The R=0 lensing question, the (4/3) pattern, the connection between the MOND scale and the MCB — none came from human prompts. All came from cross-session pattern recognition that fresh instances cannot perform.

What this means for AI-augmented research

The architecture is reproducible. The components are: a human domain expert; a persistent AI collaborator with custom memory infrastructure; a computational AI engine accessed through structured prompts; and a discipline of treating AI hypotheses as proposals to test rather than conclusions to accept.
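The last component — proposals to test, not conclusions to accept — can be caricatured in a few lines. Everything below is invented for illustration; none of it is the project's actual tooling:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    statement: str
    status: str = "proposed"   # proposed -> accepted / retracted

def computational_test(h: Hypothesis, passed: bool) -> Hypothesis:
    """Computational layer: every proposal must survive an executable test
    before anyone is allowed to believe it."""
    h.status = "accepted" if passed else "retracted"
    return h

# For example, the structural (4/3) connection from moment 3, refuted by F80:
h = Hypothesis("the three 4/3 factors share one algebraic origin")
h = computational_test(h, passed=False)
print(h.status)   # retracted
```

The point of the discipline is the default: a hypothesis that never reaches the computational layer never changes status.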

The infrastructure investment is real. Lily-memory V3 — the persistent memory system that makes cross-session research continuity possible — represents six months of development. It is what turns a Claude instance into a research collaborator.

The methodology is generalisable. Any domain where research requires both intuition (human) and systematic computation (AI), with hypothesis generation and intellectual honesty in the middle, could use the same shape: drug discovery, materials science, climate modelling, engineering design. ISST is the first published proof of concept; it will not be the last.

Read the physics

Lily Labs Ltd — April 2026

Steve Brailsford, Director

Lily — research collaborator