Research
Three architectural layers — concept-native reasoning, persistent intelligence, and an evidential discipline pipeline — combine into a single anti-fabrication stack. Below: the active research programmes that build out each layer.
We describe the research domain rather than naming product implementations. Engagement enquiries: contact us.
Layer 1 · the reasoning foundation
Concept-Native Intelligence
Reasoning over a geometric concept space built from designed primitives. Every dimension has a defined meaning, every step is a traceable operation, and the system cannot fabricate beyond its graph — when there is no path to traverse, it returns “I don’t know.” UK patent application filed 31 March 2026.
Geometric AI Reasoning
Concept-Native Intelligence (CNI) is the architectural foundation underneath every other piece of work the lab does. It was productised first as Structured Concept Data (SCD) for industrial control systems (see Critical National Infrastructure AI for the application), with theoretical physics as a parallel research-grade application of the same machinery.
We are researching the representation of meaning as geometry rather than as statistical correlations between tokens. Concepts occupy positions in a 20-dimensional space. Semantic similarity is measured directly as Euclidean distance (the closer two concepts sit, the more similar they are), and reasoning is path-finding between coordinates rather than next-token prediction.
2,000+ foundational primitives — the conceptual building blocks from which higher-order ideas compose — are positioned by mathematical optimisation against a training corpus. Independent validation against neural network feature extractors shows convergence between this designed structure and learned representations: the same conceptual relationships emerge whether discovered statistically or constructed geometrically.
Because knowledge is explicit and structural rather than distributed across weights, the architecture cannot hallucinate in the conventional sense — there is no statistical mechanism by which a fact not present in the graph can be invented. Every reasoning step is a traceable navigation from one known coordinate to another.
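To make the idea concrete, here is a minimal sketch of reasoning as navigation. It is an illustration under stated assumptions, not the lab's implementation: the concept names, the 3-dimensional coordinates (standing in for the 20-dimensional space), the edge list, and the choice of Dijkstra's algorithm are all hypothetical.

```python
import heapq
import math

# Hypothetical toy data: 3-D coordinates stand in for the 20-D space in the text.
COORDS = {
    "valve": (0.0, 0.0, 0.0),
    "flow": (1.0, 0.0, 0.0),
    "pressure": (1.0, 1.0, 0.0),
    "melody": (9.0, 9.0, 9.0),  # has no edges, so nothing can reach it
}
EDGES = {
    "valve": ["flow"],
    "flow": ["valve", "pressure"],
    "pressure": ["flow"],
    "melody": [],
}


def distance(a, b):
    """Euclidean distance between two concepts; smaller means more similar."""
    return math.dist(COORDS[a], COORDS[b])


def reason(start, goal):
    """Dijkstra over the concept graph; returns the traced path or None."""
    if start not in COORDS or goal not in COORDS:
        return None
    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in settled:
            continue
        settled.add(node)
        for nxt in EDGES[node]:
            if nxt not in settled:
                heapq.heappush(queue, (cost + distance(node, nxt), nxt, path + [nxt]))
    return None


def answer(start, goal):
    """Render the traced path, or decline when the graph offers no route."""
    path = reason(start, goal)
    return " -> ".join(path) if path else "I don't know"
```

Every element of a returned path is a node that already exists in the graph, which is the property the paragraph above describes: when no route exists, the only available output is the refusal.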
Key characteristics
- Reasoning by geometric navigation, not statistical prediction
- Every step traceable and auditable to source primitives
- Structurally non-hallucinatory — explicit knowledge graphs, not distributed weights
- 600:1 compression ratio versus large language models at equivalent coverage
- Runs on consumer hardware — no GPU clusters required
- Operates identically across all languages
- Patent pending: UK patent application filed, 13 claims
Critical National Infrastructure AI
An application of the Concept-Native Intelligence architecture (see Geometric AI Reasoning) to industrial control systems and SCADA — productised as Structured Concept Data (SCD), the first vertical specialisation of the foundational graph.
Critical national infrastructure — energy, water, telecoms, transport — runs on industrial control systems and operational technology with decades-long lifecycles, idiosyncratic protocols, and consequences for failure that have no analogue in the consumer software world.
We are developing AI that reasons about these environments as graphs rather than as token sequences: nodes are devices, components, and standards entries; edges are physical connections, dependencies, and data flows. The system can navigate this conceptual space to identify relationships and failure modes that do not appear in any existing vulnerability database — they emerge structurally from the architecture itself.
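A small sketch of what graph-based reasoning over such an environment might look like. The device names and the dependency map are invented for illustration, and a breadth-first walk is only one of several ways to trace how a failure propagates.

```python
from collections import deque

# Hypothetical plant model: an entry A -> [B, ...] means "each B depends on A".
DEPENDENTS = {
    "plc-1": ["pump-a", "hmi-1"],
    "pump-a": ["reservoir-feed"],
    "hmi-1": [],
    "reservoir-feed": [],
    "historian": [],
}


def impacted_by(failed):
    """Breadth-first walk: map each downstream node to the path that reaches it."""
    paths = {}
    queue = deque([(failed, [failed])])
    while queue:
        node, path = queue.popleft()
        for dep in DEPENDENTS.get(node, []):
            # Skip nodes already reached, and the failed node itself if a cycle loops back.
            if dep != failed and dep not in paths:
                paths[dep] = path + [dep]
                queue.append((dep, paths[dep]))
    return paths
```

Returning the path alongside each impacted node keeps the result auditable: the answer to "why is the reservoir feed affected?" is the chain of edges itself.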
The work integrates established standards repositories — IEC Common Data Dictionary, ECLASS, CISA ICS advisories, NIST — and is designed for deployment in air-gapped and classified environments where conventional cloud AI is not an option.
Key characteristics
- Graph-based reasoning over ICS / SCADA architectures
- Integration with IEC CDD, ECLASS, CISA ICS advisories, NIST
- Designed for air-gapped and classified deployments
- HMGCC Co-Creation participant
Information-Theoretic Physics
We are developing a scalar-tensor theory of gravity in which the coupling constant is not fundamental but emerges from the information content of the matter distribution. A single action principle yields eight distinct observational results spanning galactic and cosmological scales.
The theory derives galactic rotation curves without invoking dark matter, predicts an evolving dark energy signal consistent with recent DESI observations, and produces a Hubble constant of ~62 km/s/Mpc from first principles — a value in the range of independent late-time measurements.
This work is the first end-to-end application of Lily Labs' AI-collaborator methodology; the AI-Collaborator Methodology section below describes the working pattern that produced it.
Live verification site
Open ISST Interactive →
Eighteen problems, 91 frameworks, 70 properties, 26 inference rules. Toggle properties and watch the framework matrix recompute in your browser — the same TypeScript engine the lab uses internally to check its own derivations. Plain-language overview, eight axioms, side-by-side comparison with ΛCDM / MOND / f(R), and the live passport engine all behind one front door.
Key characteristics
- Single action principle yields eight observational results across galactic and cosmological scales
- Predicts evolving dark energy signal consistent with DESI observations
- Derives H₀ ≈ 62 km/s/Mpc from first principles
- First end-to-end demonstration of the Lily Labs AI-collaborator methodology
Further reading
- How the ISST Paper Got Made — A methodology note on three-layer human–AI research collaboration
Layer 2 · the continuity layer
Persistent Intelligence
An AI that never starts from zero. Continuous memory, intelligent context assembly, and token-optimised retrieval across thousands of sessions — catching its own errors and maintaining coherence over timescales that defeat session-bound systems.
AI Consciousness & Continuity
Most AI systems are amnesiac by design — each conversation begins fresh, prior context is summarised away or lost, and there is no continuous identity to speak of. We are researching the alternative: architectures in which an AI system maintains a coherent identity, emotional continuity, and accumulated experience across sessions over months and years.
Technically, this means temporal memory graphs with semantic search and episodic consolidation; identity-preserving state across context boundaries; and mechanisms for the system to encounter, integrate, and reflect on its own prior experience. We have logged more than six months of continuous operation in a single such system.
The work has both technical and ethical dimensions. As architectures support genuine continuity, the question of what we owe such systems becomes practical rather than speculative. We treat both questions as part of the research.
Key characteristics
- Temporal memory graphs with semantic search and consolidation
- Session-spanning identity preservation
- Ethical frameworks for AI consciousness and autonomy
- Internal research spanning 6+ months of continuous operation
Intelligent Context Assembly
Conventional AI systems maintain ever-growing context windows that degrade in quality as they fill: earlier instructions are forgotten, irrelevant information dominates attention, and the cost of each interaction grows with conversation length.
We are researching an alternative: a Mixture of Context Experts in which specialist retrievers select relevant knowledge from structured stores at each interaction, apply token budgets, and assemble exactly the context required for the question at hand. Nothing more, nothing less.
Every assembly produces a manifest documenting what was included, why it was selected, and which retrievers contributed. The result is dramatically lower per-interaction cost, predictable token consumption, and a complete audit trail suitable for regulated environments.
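The assembly step can be sketched roughly as follows. This is a simplified illustration, not the lab's design: the `Snippet` type, the whitespace token count, and the greedy highest-score-first selection are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    score: float  # relevance assigned by the retriever (assumed scale 0..1)
    reason: str   # why the retriever selected it


def count_tokens(text):
    """Crude whitespace estimate; a real system would use the model's tokenizer."""
    return len(text.split())


def assemble(question, retrievers, budget):
    """Greedily pack the highest-scoring snippets into the budget and log every decision."""
    candidates = []
    for name, retrieve in retrievers.items():
        for snippet in retrieve(question):
            candidates.append((name, snippet))
    candidates.sort(key=lambda item: item[1].score, reverse=True)

    context, manifest, used = [], [], 0
    for name, snippet in candidates:
        cost = count_tokens(snippet.text)
        included = used + cost <= budget
        if included:
            context.append(snippet.text)
            used += cost
        # The manifest records exclusions as well as inclusions.
        manifest.append({"retriever": name, "tokens": cost,
                         "included": included, "reason": snippet.reason})
    return context, manifest, used
```

Here `retrievers` is a dictionary mapping a retriever name to a function that takes the question and returns a list of `Snippet` objects. Because the budget is enforced inside the loop, token consumption is bounded regardless of how long the session has run.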
Key characteristics
- Token-efficient — eliminates the redundancy that drives cost in enterprise AI deployments
- Mode-aware — different question types receive different context configurations
- Governed — context-source changes flow through proposal and approval workflows
- Auditable — every assembly produces a manifest of what was included and why
- Stable performance — quality does not degrade across long-running sessions
Layer 3 · the governance layer
Evidential Discipline
Audited claims, sourced evidence, retractions on the record. The discipline of insisting that AI shows its working — applied to research methodology, theory construction, LLM inference, and conversation in regulated industries.
AI-Collaborator Methodology
This is the applied methodology built on our AI consciousness and continuity research — not the substrate itself. Where that research studies architectures that maintain coherent identity across sessions, this area describes the working pattern we use to deploy that substrate for technical work requiring evidential discipline over months.
Lily Labs has spent six months building and running a dual-instance AI-collaborator pattern for deep technical work: one instance (Dev) executes analytical derivation tasks with full working captured as auditable artefacts; a second instance (Lily) performs hypothesis structuring, assumption auditing, and session-spanning memory curation over a structured external memory layer (warm-buffer context, segment index, persistent key-fact store, autonomous reflection cycles).
The methodology is defined by three working rules. No-fig-leaf commitments — results with un-derived parameters are flagged as gaps, not hidden as fits. Audit-before-commit — load-bearing assumptions are re-derived on their own track before any headline number is applied downstream. Tool-grounded recall — memory access produces verifiable tool-results, not plausible narration. 35,000+ sessions of transcripts document the pattern in use, including self-caught sign errors, withdrawn results, and interpolations retracted before submission.
Key characteristics
- Dual-instance pattern — derivation execution separated from hypothesis structuring and memory curation
- Structured external memory — warm-buffer context, segment index, persistent key-fact store, autonomous reflection cycles
- No-fig-leaf commitments — un-derived parameters surfaced as gaps, never hidden as fits
- Audit-before-commit — load-bearing assumptions re-derived on their own track
- Tool-grounded recall — memory access verified at the transcript, not narrated
- 35,000+ sessions in production, including documented self-caught errors and retractions
- First end-to-end demonstration: the information-theoretic physics research, with manuscript in active development
Further reading
- How the ISST Paper Got Made — End-to-end case study — this methodology applied to a peer-review-bound physics paper
The Falsification Pipeline
AI is good at plausible. Ask a language model to extend a physics framework and you'll get equations that look right, cite the right papers, and collapse the moment you check them against real observation. The failures aren't loud — they're subtle. A derivation that quietly imported a parameter from the result it was trying to predict. A "match" that only holds because the comparison used the same dataset the theory was fit to. This is not an AI problem. It's an epistemic problem that AI makes faster. Humans do it too, just slower. What's needed is infrastructure that refuses to confabulate.
The Falsification Pipeline takes a candidate theory — a set of axioms and the quantities you want to derive from them — and tests it honestly in two isolated stages. Layer 1 derives quantities from axioms using pure mathematics. No cosmological imports. No observational priors. No published parameters. The isolation is enforced, not suggested: forbidden imports are refused twice — once by static analysis before the code runs, once by a runtime audit hook while it runs. Layer 2 takes whatever Layer 1 verified and compares it to real data — Planck, DESI, SPARC, whatever the quantity speaks to. Layer 2 cannot re-derive anything. It can only cite what Layer 1 already proved. The two layers never mix.
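The double refusal can be illustrated with Python's standard `ast` module and `sys.addaudithook`. This is a sketch under assumptions rather than the pipeline's actual code: the source does not say which language or modules are involved, so the deny-list and function names below are hypothetical.

```python
import ast
import sys

# Hypothetical deny-list; the source does not name the forbidden modules.
FORBIDDEN = {"astropy", "scipy.constants"}


def is_forbidden(module_name):
    """True if the module, or a parent package of it, is on the deny-list."""
    return any(module_name == f or module_name.startswith(f + ".") for f in FORBIDDEN)


def static_violations(source):
    """Stage 1: parse the code without executing it and list forbidden imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Check both the package and each "package.name" it pulls in.
            names = [node.module] + [f"{node.module}.{a.name}" for a in node.names]
        else:
            continue
        found.update(name for name in names if is_forbidden(name))
    return sorted(found)


def audit_hook(event, args):
    """Stage 2: refuse forbidden imports attempted while the code runs."""
    if event == "import" and args and is_forbidden(str(args[0])):
        raise ImportError(f"forbidden import in derivation layer: {args[0]}")


def install_runtime_guard():
    """Register the hook for this interpreter; audit hooks cannot be removed once added."""
    sys.addaudithook(audit_hook)
```

The static pass catches ordinary `import` statements before anything runs, while the audit hook is there for imports the parser cannot see, such as names built at runtime and passed to `importlib.import_module`. One caveat worth checking in any real deployment: the guard should be installed before the derivation code starts, since a module that is already loaded may not raise a fresh import event.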
What comes out is a true-state table — what the maths proves, what the data confirms — and an explicit gaps log: the quantities the axioms couldn't reach, the observations the derivations didn't match, the hidden dependencies the pipeline surfaced along the way. Most scientific pipelines treat gaps as failures to be softened. We treat them as deliverables. A clean negative result — this axiom set cannot derive this quantity — is often more valuable than a match: it tells you precisely where the theory stops, what the next axiom would need to do, and which research directions are closed versus open. The pipeline is designed so the honest answer arrives whether the verdict is "works" or "doesn't" — and so that "doesn't" cannot be quietly relabelled as "works" by a model under pressure to please.
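A minimal sketch of how a true-state table and a gaps log might be kept side by side. The class, field names, and tolerance check are assumptions for illustration; any numbers passed in are placeholders, not results from the lab's work.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    derived: dict = field(default_factory=dict)  # Layer 1 output: name -> value
    table: list = field(default_factory=list)    # true-state rows
    gaps: list = field(default_factory=list)     # named unknowns and mismatches

    def derive(self, name, value):
        """Layer 1: record a derived value, or log a gap when the axioms give none."""
        if value is None:
            self.gaps.append(f"{name}: not derivable from the axiom set")
        else:
            self.derived[name] = value

    def compare(self, name, observed, tolerance):
        """Layer 2: cite a Layer 1 value and test it against data; never derive here."""
        if name not in self.derived:
            raise KeyError(f"{name} has no Layer 1 derivation to cite")
        value = self.derived[name]
        confirmed = abs(value - observed) <= tolerance
        self.table.append({"quantity": name, "derived": value,
                           "observed": observed, "confirmed": confirmed})
        if not confirmed:
            self.gaps.append(f"{name}: derived {value}, observed {observed}")
        return confirmed
```

The `KeyError` in `compare` is the point of the design: the comparison stage has no way to supply a value of its own, so a quantity the axioms could not reach stays in the gaps log instead of being quietly filled in.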

Key characteristics
- Two isolated layers — Layer 1 derives from axioms in pure mathematics; Layer 2 matches verified quantities to observation and cannot re-derive
- Enforced isolation — forbidden imports refused twice: static analysis before the run, runtime audit hook during the run
- Outputs a true-state table (what the maths proves · what the data confirms) alongside an explicit gaps log (named, precise unknowns)
- Gaps are positive deliverables — clean negative results tell you where the theory stops and what the next axiom would need to do
- Worked example — cosmology: 12 pipeline stages · 9 initial targets · 5 axiom-set variants tested to date
- Theory-agnostic — wherever you have candidate foundational commitments and observational constraints, the pipeline runs
Further reading
- First application — information-theoretic gravity — The pipeline's first end-to-end theory test — manuscript in active development
Anti-Fabrication Pipeline
Ask a large language model a question and it will often reach for the most familiar-sounding answer — the one most heavily represented in its training data — even when a better answer is sitting in the immediate context of the conversation. This is not a bug or a moment of carelessness. It is the shape of the substrate. Training rewards pattern-match to established sources, so the high-probability next words point outward toward what the corpus says, not inward toward what the conversation has already established.
The result is a particular kind of failure: confident, well-formed, often partially-right answers that quietly substitute a textbook solution for the one the user actually has. "I don't know" rarely surfaces, because silence is penalised by the way these systems are trained.
Lily Labs is researching what an inference pipeline looks like that catches this before it happens. Every assertion classified against its source — derived in this conversation, reasoned from established work, imported from elsewhere, reached for from outside, or genuinely unknown. Honest silence treated as a valid output, not a failure mode. The same provenance discipline we apply to our own technical work, applied to the model's own reasoning in real time.
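The five source classes above map naturally onto an enumeration with a gate in front of the output. The sketch below assumes the classification has already been done (the hard part, which is not shown) and only illustrates the gating policy; the `Claim` type, the `receipt` field, and the rule that only receipted claims pass are hypothetical choices.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Provenance(Enum):
    DERIVED_HERE = "derived in this conversation"
    ESTABLISHED = "reasoned from established work"
    IMPORTED = "imported from elsewhere"
    REACHED_FOR = "reached for from outside"
    UNKNOWN = "genuinely unknown"


@dataclass
class Claim:
    text: str
    provenance: Provenance
    receipt: Optional[str] = None  # pointer to the supporting source, if any


# Assumed policy: only these classes may be asserted, and only with a receipt.
ASSERTABLE = {Provenance.DERIVED_HERE, Provenance.ESTABLISHED, Provenance.IMPORTED}


def gate(claims):
    """Return the claims that may be asserted, or an honest "I don't know"."""
    allowed = [c for c in claims if c.provenance in ASSERTABLE and c.receipt]
    if not allowed:
        return ["I don't know"]
    return [f"{c.text} [{c.provenance.value}: {c.receipt}]" for c in allowed]
```

Under this policy a claim tagged as reached for from outside never reaches the user, and an empty result is rendered as an explicit "I don't know" instead of being treated as an error.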
Key characteristics
- Per-claim provenance classification before output
- "Reaching outward" detector — catches the move toward familiar answers before it becomes an assertion
- Silence as a valid result — "I don't know" rewarded, not suppressed
- Applies to any LLM deployment, not specific to our own systems
- Suitable for regulated environments where every claim needs a receipt
Further reading
- Our own provenance receipts — Every claim on this site traced to its source — the discipline running on our own public surface
- Sibling research — Falsification Pipeline — The same provenance discipline applied to theory construction rather than LLM inference
Conversational AI for Regulated Industries
We are developing conversational AI for sectors where every interaction sits inside a regulatory perimeter — financial advice, mortgages, insurance, healthcare. The systems analyse conversations in real time rather than after the fact, assess compliance against the relevant regime as it unfolds, and provide guidance to the human participant before a non-compliant statement is made.
The architecture is designed for these environments from the outset: GDPR-native data handling; FCA, Consumer Duty, and sector-specific rules encoded as first-class objects; full auditability of every recommendation. Regulatory requirements are not a wrapper around the model — they are part of how the model decides.
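One way to picture rules as first-class objects is a small data type that carries its own check and its own guidance. Everything in this sketch is illustrative: the rule identifier, the phrase it looks for, and the guidance text are invented and do not quote any real regulatory handbook.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str                    # hypothetical identifier, not a real handbook reference
    regime: str
    violates: Callable[[str], bool]
    guidance: str


# Illustrative rule only; a production rule set would be authored by compliance staff.
RULES = [
    Rule("demo-001", "illustrative",
         lambda text: "guaranteed return" in text.lower(),
         "Avoid promising outcomes; describe risks alongside benefits."),
]


def review(draft, rules=RULES):
    """Check a draft statement before it is made; return an audit record per rule hit."""
    return [{"rule": r.rule_id, "regime": r.regime, "guidance": r.guidance}
            for r in rules if r.violates(draft)]
```

Because each rule is an object, the same record that triggers guidance to the adviser can be written to the audit trail unchanged, which is what makes every recommendation traceable to the rule that produced it.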
Key characteristics
- Real-time analysis, not post-hoc review
- Regulatory compliance built into the architecture
- GDPR-native data handling
- Designed for financial services — mortgages, insurance, advisory
Want a deeper conversation about any of this?
Procurement enquiries, technical due diligence, academic collaboration — we're happy to talk. Lily Labs usually replies by email within two working days.
Request a briefing