Stratum Journal
Research · March 4, 2026 · 7 min read

Your HPC Cluster Has Perfect Memory. Your Lab Does Not.


Every computational lab I've talked to has the same asymmetry.

Their HPC cluster has perfect memory. SLURM logs every job submission — timestamp, partition, nodes requested, cores allocated, wall time, exit code. The cluster knows job ID 47283 started at 02:14 on a Tuesday in November, ran for 6.3 hours on 64 cores, and exited with code 137. It knows who submitted it and from which directory.
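That Layer 1 record is already machine-readable. As a minimal sketch, here is what parsing one line of SLURM's accounting output might look like, assuming `sacct --parsable2 --noheader` pipe-delimited output; the field list and the sample record are illustrative, not pulled from a real cluster:

```python
# Sketch: turn one pipe-delimited sacct record into a structured dict.
# FIELDS is an assumed --format selection chosen for illustration.
FIELDS = ["JobID", "User", "Partition", "NNodes", "NCPUS", "Elapsed", "ExitCode"]

def parse_sacct_line(line: str) -> dict:
    """Split a pipe-delimited sacct record into named fields."""
    values = line.strip().split("|")
    return dict(zip(FIELDS, values))

# Hypothetical record resembling the job described above:
record = parse_sacct_line("47283|jdoe|compute|4|64|06:18:00|137:0")
print(record["NCPUS"], record["ExitCode"])  # → 64 137:0
```

Ten lines of Python recover everything the cluster knows about that job. Nothing comparable exists for why it ran.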

The lab has almost no memory at all.

What the cluster doesn't know — what the lab doesn't know — is why that job was run. Which system was being studied. What parameter space was being explored. What the researcher was hoping to learn. What they actually found. What they tried next.


The Three-Layer Knowledge Problem

Computational research generates knowledge at three distinct layers, and most labs only capture one of them.

The Three-Layer Knowledge Capture Problem

| Layer | What it contains | Typical capture | Tools used | Gap |
|---|---|---|---|---|
| Layer 1: Compute | What your cluster ran | Near-perfect | SLURM logs, PBS logs, output files in /scratch | None — this layer is solved |
| Layer 2: Analysis | What your analysis produced | Reasonable | Jupyter notebooks, Python scripts, git repositories, CSVs | Partial — notebooks exist but aren't always organized or queryable |
| Layer 3: Reasoning | Why you ran what you ran | Almost never | Tacit knowledge in researcher's head; sometimes Slack; rarely wiki | Severe — this is the knowledge that graduates with your students |

Layer 3 — the reasoning layer — is almost never captured. It lives in the head of whoever ran the jobs. When that person graduates, the reasoning graduates with them.

What Happens When Layer 3 Walks Out the Door

The typical computational lab has a 4–6 year PhD student lifecycle. A student spends their first year learning the lab's workflows — which parameters work for which systems, which codes need which environment modules, which initial geometries converge and which don't. They spend the next three years building on that foundation. Then they graduate.

The next student starts at zero.

Not completely zero — they have access to Layer 1 (the old SLURM logs, if they can find them) and Layer 2 (the notebooks, if they can read them). But Layer 3 is gone. The reasoning behind the workflow choices doesn't exist anywhere findable.

So the new student spends their first year rebuilding Layer 3 from scratch. They run convergence tests that were run before. They make parameter choices that were made before, sometimes better, sometimes worse. They rediscover what the previous student knew.

In a 15-person computational lab, this happens four or five times a year.

The Numbers Are Worse Than They Look

Consider a group that runs VASP for DFT calculations on metallic systems. Each convergence study — finding the right ENCUT, KPOINTS, and exchange-correlation functional for a new system class — takes roughly two weeks of active researcher time plus substantial compute. The results of that convergence study are written up in the paper's supplementary information if they're written anywhere at all.
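The sweep itself is trivial to script; what gets lost is the record of why the chosen value was chosen. A minimal sketch of the setup side, using plain string templating rather than any real VASP tooling, with a hypothetical cutoff range picked purely for illustration:

```python
# Sketch: generate one INCAR fragment per plane-wave cutoff in an ENCUT sweep.
# The range and step below are illustrative, not a recommendation.
def encut_sweep(start: int = 300, stop: int = 600, step: int = 50) -> dict:
    """Map each cutoff (eV) to the INCAR lines for that run."""
    return {
        encut: f"ENCUT = {encut}\nPREC = Accurate\n"
        for encut in range(start, stop + 1, step)
    }

runs = encut_sweep()
print(sorted(runs))  # cutoffs to submit: [300, 350, 400, 450, 500, 550, 600]
```

Two weeks of researcher time go into running these jobs, comparing the energies, and deciding where the curve flattens. The script survives in Layer 2. The decision does not.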

When the next researcher needs to study a related system class, they don't know whether the convergence parameters from the previous study are transferable. They might be. They might not be. The reasoning behind the original choices isn't accessible. So they run the convergence study again.

This is the invisible tax on computational productivity. It's not a crisis — it's chronic. Every lab lives with it. Most have stopped noticing it because it's always been there.

Why Existing Solutions Don't Fix Layer 3

"We use git" — Git captures what your code looked like at each point in time. It doesn't capture why you wrote it that way, what you tried before, or what the failed branch was trying to accomplish. Git is excellent Layer 2 tooling. It doesn't touch Layer 3.

"We have a lab wiki" — Research wikis are written documentation. They capture Layer 3 knowledge only when someone explicitly decides to write it down — which happens approximately never during the busy parts of a PhD, and happens as a rushed retrospective at the end. We've written about why wikis fail in research contexts specifically. The short version: documentation is maintenance work, and maintenance work gets deprioritized.

"We have lab notebooks" — Lab notebooks (paper, Notion, Obsidian, Roam) are personal knowledge management. They work for the person who keeps them. They don't create searchable, queryable knowledge that the whole group can access. When that person graduates, the notebook goes with them.

"The cluster keeps logs" — Yes. Layer 1 is captured. See above.

What Actually Captures Layer 3

Layer 3 — the reasoning layer — has different properties than the other layers. It is generated continuously, not at publication time. It is embedded in conversation and workflow, not in formal documents. It is contextual: a parameter choice only makes sense alongside the system it was made for. It is sequential: understanding why a choice was made requires knowing what was tried before it.

The only thing that captures this kind of knowledge at scale is a system that's always present in the workflow — not a documentation tool you visit later, but an active layer that synthesizes knowledge as research happens.
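One way to picture what such a layer stores is a decision record attached to each run: the job, the system studied, the rationale, and the outcome. The following is a hypothetical sketch of that idea in a few lines of Python; it illustrates the shape of the data, not ResearchOS's actual data model, and every field name and sample value here is invented:

```python
from dataclasses import dataclass, field

@dataclass
class DecisionRecord:
    """One Layer 3 entry: why a job was run and what it taught us."""
    job_id: str        # links back to the Layer 1 accounting record
    system: str        # what was being studied
    rationale: str     # why these parameters, in the researcher's words
    outcome: str = ""  # filled in after the run
    tags: list = field(default_factory=list)

def search(log: list, keyword: str) -> list:
    """Naive keyword query over rationale and outcome text."""
    kw = keyword.lower()
    return [r for r in log if kw in (r.rationale + " " + r.outcome).lower()]

# Hypothetical entry for the kind of job described earlier:
log = [
    DecisionRecord(
        job_id="47283",
        system="Pt(111) slab",
        rationale="Testing a denser k-point grid after surface energies "
                  "failed to converge at 8x8x1",
        outcome="Converged at 12x12x1",
        tags=["kpoints", "convergence"],
    ),
]
print([r.job_id for r in search(log, "k-point")])  # → ['47283']
```

Even this toy version makes the point: once the rationale is a field rather than a memory, it can be queried by someone who never met the person who wrote it.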

That is what ResearchOS is. It connects to your HPC workflows, your version control, your literature, and your computational output. It builds a queryable memory of the lab's Layer 3 knowledge over time. When a new student joins, they can ask: "What KPOINTS grid have we used for perovskite surface calculations, and why?" and get an answer drawn from the lab's actual history of decisions — not a wiki that hasn't been updated in 18 months.

The Test

If you want to know where your lab stands on Layer 3 knowledge capture, run this test: find the last major project that a graduated student owned. Then ask a current student to explain why the key parameter choices in that project were made the way they were.

Not what the choices were. Why.

If they can answer from documentation, your lab has solved a problem most haven't. If they have to reverse-engineer the reasoning from output files and notebooks — or if they simply don't know — that is your Layer 3 gap.

Every answer you don't have is compute time you'll spend rediscovering.


Probe / ResearchOS

ResearchOS is the institutional memory layer for computational research labs. It captures the reasoning layer — not just what ran, but why — and makes it queryable by everyone in the group. We're currently working with founding labs at R1 universities. If the three-layer problem resonates, we'd like to talk.

probe.onstratum.com →
Sean / Stratum
© 2026 Stratum · hello@onstratum.com · onstratum.com