What's Between Accordion and Stalactite?

Jameson Hodge
Mar 9, 2026
21,000 paths across 12 models reveal that LLMs navigate conceptual space with consistent geometric structure — characteristic gaits, one-way streets, and landmarks they can't avoid.

A few weeks ago I ran 575 games of word convergence across four frontier models and found that each one navigates conceptual space with a characteristic style: Claude is fast and predictable, GPT takes winding paths through unexpected territory, Grok finds lateral cultural references, Gemini picks strange shortcuts. The patterns were stable across hundreds of games and measurable with simple statistics.

To expand on this research, I decided to further investigate the geometric structure of conceptual space: if you give a model two concepts and ask it to name the intermediate steps between them, does it trace a consistent path? Does it take the same path every time? Does it take the same path in both directions? Do different models trace the same paths?

So I built a benchmark. Give a model two concepts (say, "music" and "mathematics") and ask it to list 7 intermediate waypoints that trace the conceptual journey between them. Repeat 15 times per pair, per model, and measure how consistent the paths are. Across 11 experimental phases, varying the concept pairs, waypoint counts, temperatures, and conditions, I collected ~21,000 paths across 21 core concept pairs and 12 models from 11 independent training pipelines.

TL;DR: the structure is real, it's measurable, and it's weirder than I expected. (Later phases perturb waypoint counts, temperatures, and pair sets; headline numbers below come from the relevant phase-specific analyses.)

Every Model Walks Differently

Models differ dramatically in how consistently they navigate the same conceptual terrain.

Ask Claude to trace a path from "hot" to "cold" fifteen times, and you get nearly identical results every time: warm, tepid, cool, chilly, frigid. The Jaccard similarity across runs (shared waypoints divided by total distinct waypoints in a pair of runs) is 0.911. Ask GPT the same question fifteen times, and each run explores a different angle: one path goes through thermodynamics, another through sensation, another through climate. Jaccard similarity: ~0.26. Same question, same concepts, a 3.5x difference in consistency on this pair.
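As a concrete sketch (function names are mine, not from the benchmark), a model's per-pair consistency can be scored as the mean pairwise Jaccard similarity over its repeated runs:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity: shared waypoints / distinct waypoints across two runs."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def gait(runs):
    """Mean pairwise Jaccard across repeated runs on one concept pair."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Toy runs (invented): two identical paths plus one that swaps a waypoint.
runs = [
    ["warm", "tepid", "cool", "chilly", "frigid"],
    ["warm", "tepid", "cool", "chilly", "frigid"],
    ["warm", "tepid", "cool", "brisk", "frigid"],
]
print(round(gait(runs), 3))  # 0.778: high consistency, Claude-style
```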

I call this property a model's gait: the characteristic consistency with which it navigates conceptual space. Every model has one, and it's stable. Across the full 12-model cohort:

Model              | Gait (Jaccard) | Style
Mistral Large 3    | 0.747          | Near-deterministic
Claude Sonnet 4.6  | 0.578          | Highly consistent
DeepSeek V3.2      | 0.540          | Consistent
Llama 4 Maverick   | 0.539          | Consistent
Qwen 3.5           | 0.508          | Moderate
Cohere Command A   | 0.502          | Moderate
MiniMax M2.5       | 0.419          | Variable
Kimi K2.5          | 0.414          | Variable
Gemini 3 Flash     | 0.372          | Variable
Llama 3.1 8B       | 0.298          | Exploratory
Grok 4.1 Fast      | 0.293          | Exploratory
GPT-5.2            | 0.258          | Most variable

Mistral was a surprise. It hits 0.936 Jaccard on music-to-mathematics with 15 independent runs producing nearly identical paths. That's near-deterministic! And it's not just a decoding temperature artifact; the consistency is pair-dependent. Mistral drops to 0.577 on the stapler-monsoon control pair, so the rigidity reflects how it organizes specific knowledge, not a global property of its sampler.

At the other end, GPT at 0.258 means runs share only about a quarter of their waypoints. GPT finds many valid paths and takes a different one each time. Same destination, different routes, like a cab driver who knows twenty ways across town.

The 2.9x gap between the most rigid and most exploratory model, replicated across 12 models from 11 independent training pipelines, isn't something I've seen discussed in the literature.

One-Way Streets

Ask Claude to trace a path from origami to gravity, and you get: paper airplane, flight, atmosphere, mass, weight. Reverse it, gravity to origami, and you get: tension, fold, paper, geometry, structure. Zero overlap: the two directions share not a single waypoint.

This isn't a weird edge case. Across 84 concept-pair/model combinations, the average path from A to B shares about 19% of its waypoints with the path from B to A. The mean asymmetry index is 0.811. In 64% of cases, asymmetry exceeds 0.8.
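The numbers line up with a simple definition of the asymmetry index as 1 minus the overlap of the forward and reverse paths; that exact formula is my assumption, suggested by 19% shared waypoints sitting next to a 0.811 mean asymmetry:

```python
def asymmetry(forward, reverse):
    """1 minus the Jaccard overlap of the A->B and B->A paths (assumed definition)."""
    f, r = set(forward), set(reverse)
    return 1.0 - len(f & r) / len(f | r)

# Claude's origami/gravity paths quoted above: no shared waypoints at all.
ab = ["paper airplane", "flight", "atmosphere", "mass", "weight"]
ba = ["tension", "fold", "paper", "geometry", "structure"]
print(asymmetry(ab, ba))  # 1.0: maximal asymmetry
```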

This means conceptual space, as navigated by language models, isn't a metric space in the classical sense. In a metric space, the distance from A to B equals the distance from B to A, like physical distance. Models don't work that way. The path from music to mathematics goes through harmony, rhythm, and pattern. The path from mathematics to music goes through ratio, frequency, and acoustics. Same concepts, same model, different directions, different paths.

Mathematically, this is consistent with conceptual space being a quasimetric space. The triangle inequality — a detour through a third concept is never shorter than the direct route — still holds (~91% of the time across three independent tests), but symmetry doesn't. It has one-way streets.
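A quasimetric keeps the triangle inequality while dropping symmetry. A minimal illustration with invented directed distances (e.g., effective path lengths; the benchmark's actual distance measure may differ):

```python
# Invented directed "distances" between concepts.
d = {
    ("music", "mathematics"): 4,
    ("mathematics", "music"): 6,   # d(A,B) != d(B,A): a one-way street
    ("music", "pattern"): 2,
    ("pattern", "mathematics"): 3,
}

def triangle_holds(d, a, via, b):
    """Directed triangle inequality: the direct route is never longer than a detour."""
    return d[(a, b)] <= d[(a, via)] + d[(via, b)]

print(d[("music", "mathematics")] == d[("mathematics", "music")])  # False: not symmetric
print(triangle_holds(d, "music", "pattern", "mathematics"))        # True: 4 <= 2 + 3
```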

Random pairs (concepts with no meaningful relationship) make the best test case. Stapler to monsoon versus monsoon to stapler. You would think these should be more symmetric, since there's no inherent directionality. Instead, they're more asymmetric (0.908). When there's no semantic bridge to constrain the path, the starting concept completely determines the trajectory. The model chains forward from wherever it starts, eventually bending toward the target. It's feeling its way forward rather than planning a route.

Quasimetric asymmetry replicates across all 12 models tested.

Bridges You Can't Avoid (and Ones You Can't Use)

Bridge concepts are specific intermediate ideas that models route through with high consistency. They're the structural heart of the benchmark.

Ask any frontier-scale model to trace a path from "light" to "color," and the concept "spectrum" appears with near-perfect frequency (0.93 to 1.00). Across different waypoint counts and temperatures, aggregate bridge frequency exceeds 0.97 in every tested condition. "Spectrum" is a navigational bottleneck: it names the mechanism by which light becomes color (decomposition through a prism or similar process), and large models find no alternative route that avoids it. (The exception is Llama 3.1 8B, at 0.267, a scale effect I'll return to.)

Other near-universal bridges among frontier models: "dog" between animal and poodle (taxonomic necessity, since you can't get from the general category to the specific breed without passing through the intermediate level). "Warm" between hot and cold (the experiential midpoint on the temperature gradient).
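Bridge frequency itself is just the share of runs whose path contains the candidate concept. A sketch with invented light-to-color runs:

```python
def bridge_frequency(runs, bridge):
    """Fraction of runs whose generated path contains the candidate bridge."""
    return sum(bridge in run for run in runs) / len(runs)

# Invented light-to-color paths, for illustration only.
runs = [
    ["radiation", "wavelength", "spectrum", "prism", "hue"],
    ["photon", "spectrum", "refraction", "pigment", "hue"],
    ["wave", "spectrum", "rainbow", "paint", "hue"],
]
print(bridge_frequency(runs, "spectrum"))         # 1.0: every path routes through it
print(round(bridge_frequency(runs, "prism"), 2))  # 0.33: an avoidable waypoint
```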

But the interesting cases are the failures.

Fire is too central to be a bridge. Ask models to trace a path from "spark" to "ash," and they almost never route through "fire," despite fire being the obvious connection. The reason: both spark and ash already imply fire. It's informationally redundant as a waypoint. Fire is navigable to, but almost never through. Same finding for "water" between rain and ocean: 0.000 frequency across the four original models tested on that pair. If both endpoints already contain the bridge concept, the model skips it.

"Metaphor" fails completely as a bridge between language and thought. Zero frequency across all models, despite being perhaps the most famous theoretical bridge between those concepts (Lakoff and Johnson's entire research program). The paths go through "communication," "meaning," "cognition." Functional intermediaries, not theoretical constructs. The biggest single prediction miss in the entire benchmark.

Process beats object. "Germination" (a process: the seed becomes a plant) massively outperforms "plant" (an object) as a bridge from seed to garden. Claude: 1.00 vs 0.00. GPT: 0.95 vs 0.65. Models prefer waypoints that tell you where you're going, not where you are. The directional information content of the bridge matters more than its semantic centrality.

Bridge frequency turned out to be the most stable structural property in the entire benchmark, more durable than gait, more durable than asymmetry. An ANOVA across different waypoint counts and temperatures showed that model identity drives navigational structure (eta-squared = 0.242, p = 0.001), while the protocol variations have no statistically detectable effect. Change the protocol, and bridges persist. Change the model, and you get different paths.
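Eta-squared here is the standard one-way ANOVA effect size: the share of total variance explained by group membership. A sketch with invented bridge-frequency samples grouped by model:

```python
def eta_squared(groups):
    """One-way ANOVA effect size: between-group sum of squares over total."""
    values = [x for g in groups for x in g]
    grand = sum(values) / len(values)
    ss_total = sum((x - grand) ** 2 for x in values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total

# Invented bridge-frequency samples, one list per model.
by_model = [[0.90, 0.95, 1.00], [0.30, 0.35, 0.25], [0.60, 0.65, 0.55]]
print(round(eta_squared(by_model), 3))  # most of the variance sits between models
```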

Gemini Navigates Alone

One model consistently diverges from the other three original models: Gemini.

Claude, GPT, and Grok show 100% binary bridge agreement on 6 tested triples — they agree on the presence or absence of every bridge concept. Gemini disagrees on 50%. On the same concept pairs, Gemini frequently fails to route through bridges that the other three models find: river between bank and ocean (Gemini: 0.00, others: 0.90-1.00), forest between tree and ecosystem (Gemini: 0.10, others: 0.95-1.00), nostalgia between emotion and melancholy (Gemini: 0.00, others: 0.20-0.70).

Yet the pattern isn't random. Gemini succeeds on within-frame bridges, tight associative chains within a single domain (deposit→savings at 1.00, spectrum→color at 1.00). It fails on cross-frame bridges, connections that require integrating across loosely coupled conceptual clusters.

I tried three different mechanistic explanations for Gemini's deficit across three experimental phases. All three failed:

  1. Frame-crossing threshold. Predicted Gemini has a higher cue-strength threshold for cross-frame navigation. Falsified: the threshold doesn't separate successes from failures.
  2. Gradient blindness. Predicted Gemini fails specifically on gradient-midpoint bridges (like "warm" between hot and cold). Falsified: Gemini's zeros concentrate on causal-chain pairs, not gradient pairs.
  3. Transformation-chain blindness. Predicted Gemini fails on process-step bridges. Falsified with the interaction reversed: Gemini does better on transformation pairs than gradient pairs.

Gemini's deficit is real and measurable (mean bridge frequency 0.480 vs ~0.67 for the other three), but it has resisted every mechanistic explanation I've thrown at it. Whatever is different about Gemini's conceptual organization, it needs a multi-variable explanation that I haven't figured out yet.

The Scale Effect

The benchmark includes two models from the same family at different scales: Llama 3.1 8B (small) and Llama 4 Maverick (large, mixture-of-experts). This gives us insight into what scale buys you.

Both models show the same structural properties: characteristic gaits, quasimetric asymmetry. The geometry is there at both scales, but the navigational landmarks diverge dramatically. Maverick's bridge frequency (0.724) is 3.6x the 8B's (0.200). The large model routes through the same landmarks as other frontier models (dog, spectrum, warm). The small model navigates the same geometry but takes different paths. It finds routes that avoid the standard bridges.

This creates a clean hierarchy: geometric structure is universal, navigational landmarks converge only among large models, and scale determines which landmarks a model uses. Conceptual space has the same topology for all models, but the large models have agreed on the major highways while smaller models take back roads.

The Graveyard

Across 11 experimental phases, I pre-registered testable predictions before running each experiment. The graveyard of falsified hypotheses grew to 29 entries, each one a prediction that the data killed. Some highlights:

Prediction accuracy crashed as I moved from characterization to mechanism. When I was predicting which models find bridges, using data from prior phases, accuracy was 81%. When I tried to predict why specific bridges survive perturbation or what drives Gemini's fragmentation, accuracy dropped to 20-24% and stayed there for two consecutive phases. The coarse structure is predictable; the fine-grained mechanisms are not.

Sadness defied the model. I predicted that bridges with fewer competitors would be more durable under perturbation. Sadness has 8 competing waypoints on the emotion-to-melancholy path (affect, mood, sentiment, sorrow, wistfulness, feel, long, nostalgia), the most of any bridge tested, yet it survived perturbation at 0.807 — the highest rate of any bridge. Harmony (7 competitors) collapsed to 0.015. Sadness is the gravitational center of its navigational space; every alternative waypoint still orbits it. Harmony is one equally good route among several.

You can't find concepts models can't navigate between. I tried to find "random" concept pairs (turmeric and trigonometry, barnacle and sonnet, accordion and stalactite) that would serve as controls with no consistent navigation structure, but every single one failed. Models find coherent routes between anything. Five out of six models route accordion-to-stalactite through "bellow" (the physical mechanism shared by accordions and cave formations). The original control pair, stapler-monsoon, also fails when you test it on all 12 models: the models find "paper," "office," "flood" with high consistency.

Twenty-nine falsified predictions across 11 phases. The coarse geometry (gait, asymmetry, bridges) replicates reliably. The fine-grained mechanisms resist simple explanations.

What This Measures

21,000 paths later, I think the word convergence game and this benchmark are probing something real about how language models organize knowledge. Prior work has studied static embedding geometry extensively. This probes navigation: whether models can traverse conceptual space consistently, and whether that traversal has geometric structure.

The answer is yes, with caveats. The coarse structure is consistent: every model has a measurable gait, every model navigates asymmetrically, every model shows bridge bottleneck behavior (though specific landmarks differ by scale). These properties replicate across 12 models from 11 independent training pipelines and hold up across protocol variations. The fine-grained mechanisms — why specific bridges survive perturbation, why Gemini fragments differently, why sadness is indestructible — resist every single-variable explanation I've tried.

The graveyard may be the real finding here. The benchmark maps the shape of what we can and can't predict about how language models think. We can predict the roads, but we can't predict the traffic.


~21,540 elicitations were collected between March 2 and March 7, 2026, using 12 models (Claude Sonnet 4.6, GPT-5.2, Grok 4.1 Fast, Gemini 3 Flash, Qwen 3.5, MiniMax M2.5, Kimi K2.5, Llama 3.1 8B, DeepSeek V3.2, Mistral Large 3, Cohere Command A, Llama 4 Maverick) via OpenRouter. This benchmark extends the word convergence game (repo). The experimental pipeline (code, analysis, drafting) was built with heavy AI assistance; I designed the experiments and interpreted the results. The full dataset and analysis code are available in the conceptual-topology-mapping-benchmark repository.