Starting a Lab: The Knowledge Infrastructure Decision Nobody Told You About
The first three years of a lab establish knowledge patterns that persist for a decade. Most junior faculty optimize for publications. Almost none optimize for institutional memory.
When you start a lab, you make hundreds of decisions in the first year. What HPC cluster to get access to. Which simulation codes to build your research around. What your group meeting structure will be. How you will recruit students. Where you will submit your first papers. Whether to write your own tools or extend existing ones.
Almost none of those decisions are about how the lab will remember what it learns.
That question turns out to matter most: not in the first year, but in the fifth. By then, the first PhD student you hired is defending and taking the institutional memory of your first research direction with them, and the patterns you set in year one are very difficult to change.
The Year One Paradox
Year one of a new lab is the best possible time to establish knowledge infrastructure. The group is small — two or three people, often just the PI and one or two graduate students. The research is focused. The workflows are not yet entrenched. There is no legacy system to retrofit around.
It is also the worst possible time to think about it. The tenure clock is running. The first paper needs to be submitted. Grant applications need to be written. Students need to be hired. The things that feel urgent crowd out the things that are important but not visibly urgent yet.
The result is a nearly universal pattern: junior faculty establish research workflows that are optimized for producing results quickly, and they document those workflows only enough to keep the current people working effectively. The documentation exists for the present, not for the future people who will need it when the current people leave.
The best time to set up knowledge infrastructure is when you have three people. The worst time is after the seventh person graduates. Most labs do the second.
What Gets Built in Year One That Nobody Documents
The first generation of code, protocols, and workflows in a new lab has a particular property: it is almost always underdocumented relative to its importance. Here is why.
When a PI writes or supervises the first version of a lab tool, the rationale is obvious to them. They made the design decisions. When a first-year PhD student builds an extension to a standard code — a new input parser for a simulation package, a custom post-processing pipeline, a wrapper around a third-party library — the PI oversees it. The decisions get discussed. They may be documented in a lab notebook or a commit message. But the reasoning is not captured in a form that survives the person who wrote it.
Then that student becomes a fifth-year, starts writing their thesis, and a new student arrives. The new student inherits the code with no guide to why it works the way it does. The fifth-year is busy. The PI has been at this for four years and has internalized the decisions. Nobody sits down to write the reasoning document. The code gets used; the reasoning stays tacit.
Three years later, when the code needs to be extended or debugged for a different system than the one it was built for, the new student has a perfectly functional tool with no explanation of its boundary conditions. Which system classes was it tested on? Which configurations cause it to fail silently? Why was the default parameter set to that value and not something more conservative? The answers exist only in the head of the PhD student who graduated.
# The kind of commit history a new student inherits:
commit a3f2e19
Author: Kenji Mori <k.mori@lab.edu>
Date: Thu Mar 14 2024
fix RPMD bead initialization for high-T runs
commit 7c41d8b
Date: Mon Feb 5 2024
add convergence check for ring polymer contraction
commit 2b9f0c3
Date: Wed Jan 17 2024
initial LAMMPS extension for centroid MD
# What the new student needs to know:
# - Why this bead count for this system class?
# - Which temperature regime requires the high-T fix?
# - Which potential functions are compatible with ring polymer contraction?
# None of it is here.
The Compounding Problem
Knowledge loss in a new lab compounds in a way that is easy to underestimate.
In year one, the PI knows everything the lab knows. There is no knowledge gap. The lab is effectively one person with one or two junior assistants.
By year three, the lab has five people. The PI knows the strategic context — the research questions, the approach, the open problems. The senior PhD student knows the computational workflows and the experimental parameters that work and the ones that don't. The junior students know what they have been directly taught. The knowledge is distributed across people and is mostly accessible as long as those people are there.
By year five, the first PhD student defends. The PI is now two years deeper in the tenure process. A postdoc has joined and brought their own workflow assumptions. Two new graduate students are in their first year. The lab is ten people. The knowledge held by that first PhD student — the specific parameter choices, the troubleshooting history, the reasons why certain approaches were abandoned — cannot be transferred in a two-week transition. It leaves with them.
The new students who need that knowledge will reconstruct it, slowly, at significant cost. For a computational lab, this often looks like six months of re-deriving parameter choices that were already settled. For a lab with custom code, it looks like two students spending a semester piecing together a working understanding of a codebase that one person built in a semester and understood from the start.
The Retrofit Problem
By year five or six, when the knowledge loss becomes visible, the lab is past the point where it is easy to fix.
Retrofitting knowledge infrastructure into an existing lab requires changing the habits of every current member — getting everyone to document in a shared system, getting the PI to enforce it, getting postdocs who have their own workflows to adapt. It requires migrating whatever partial documentation exists into a consistent format. It requires going back and reconstructing the reasoning behind decisions that were made years ago, before the system existed.
Most labs do not successfully retrofit. They continue with the existing system (the PI's memory, partially maintained notebooks, and the institutional knowledge of whoever has been there the longest) and accept the ongoing attrition of knowledge at each graduation.
This is rational in the short term. A lab of ten people cannot stop and spend a month on documentation hygiene. The grant deadlines and paper submissions do not pause while the knowledge infrastructure gets reorganized.
It is expensive in the long term. The cost is diffuse and does not show up in any single budget line. It shows up as six months of re-derivation, as new students who take too long to become independent, as projects that cannot be picked up after a hiatus because the prior context is unavailable.
What Year One Actually Requires
The knowledge infrastructure that prevents this does not require significant additional work from the people in the lab. It requires three things.
First: a system that captures reasoning alongside artifacts, not just artifacts. Git captures code changes; lab notebooks capture experimental parameters. Neither captures the reasoning behind the choices: why this functional, why this cutoff, why this sampling strategy and not a more conservative one. The system needs to capture that reasoning without requiring an additional documentation step.
Second: group-level accessibility from the start. A system that works for the PI but not for a new student in year three is not a knowledge system for the lab. It is a personal archive. The design assumption has to be that the system will serve people who were not present for the original decisions.
Third: maintenance cost low enough to survive real lab pressure. A documentation system that requires fifteen minutes of structured entry per day will not be maintained by PhD students during qualifying exam season or grant deadlines. The accumulation of context has to happen as a side effect of the work, not as a separate task.
These requirements define the gap between personal documentation systems (org-mode notebooks, well-maintained Jupyter archives, detailed commit messages) and group knowledge infrastructure. Personal systems meet the first requirement inconsistently and fail the second and third almost entirely.
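As a rough illustration of the first and third requirements, the sketch below logs a run's parameters together with a short rationale as a side effect of calling the function. Everything in it is hypothetical: the record_reasoning decorator, the run_simulation function, its parameters, and the JSON Lines log file are assumptions made for illustration, not a description of any particular tool.

```python
import functools
import json
import time
from pathlib import Path


def record_reasoning(log_path):
    """Hypothetical decorator: append each call's parameters and rationale to a log."""
    log_path = Path(log_path)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, why, **kwargs):
            # `why` is a required keyword-only argument, so a call without
            # a rationale raises TypeError before any work is done.
            entry = {
                "timestamp": time.time(),
                "function": func.__name__,
                "args": list(args),
                "kwargs": kwargs,
                "why": why,
            }
            # One JSON object per line keeps the log append-only and easy to scan.
            with log_path.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(entry, default=str) + "\n")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@record_reasoning("decisions.jsonl")  # hypothetical log location
def run_simulation(n_beads, temperature_k):
    # Placeholder standing in for the lab's real workflow.
    return {"n_beads": n_beads, "temperature_k": temperature_k}
```

A call would then look like run_simulation(n_beads=32, temperature_k=300.0, why="placeholder: reason for this bead count"), where the values are arbitrary. The design choice worth noting is that the rationale is supplied at the call site, in the same motion as the parameters, rather than in a separate documentation pass.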
The question is not whether your lab will have a knowledge problem. Every lab has one. The question is whether you encounter it in year five with an established lab and a retrofit cost, or in year one with an empty slate and a setup cost.
The Opportunity
A lab in its first two years has something that established labs do not: nothing to retrofit. The workflows are not yet entrenched. The team is small enough that establishing a shared system is relatively low friction. The research is focused enough that the system does not have to handle ten years of accumulated context from multiple research directions simultaneously.
Starting with knowledge infrastructure in place means that the first PhD student who defends leaves a record of their reasoning, not just their results. It means the second generation of students inherits an accessible context for the lab's methods. It means the PI's working knowledge of the lab's history is not the single point of failure for the institution.
It also means that the compounding works in the other direction. Every paper, every experiment, every debugging session, every design decision adds to a queryable record. By year five, the lab has five years of indexed reasoning available to anyone who joins. The onboarding cost for a new student is lower. The knowledge loss at graduation is smaller. The research builds on itself rather than being partially reconstructed at each transition.
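To make "queryable record" slightly more concrete, here is a minimal sketch of a keyword search over a list of decision records. The Decision fields and the search helper are hypothetical, and the entries are placeholders rather than real lab history; a real system would presumably use proper indexing instead of a linear scan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    year: int        # lab year in which the decision was made
    topic: str       # what was decided
    reasoning: str   # why it was decided that way


def search(records, keyword):
    """Return records whose topic or reasoning mentions the keyword, oldest first."""
    needle = keyword.lower()
    hits = [
        r for r in records
        if needle in r.topic.lower() or needle in r.reasoning.lower()
    ]
    return sorted(hits, key=lambda r: r.year)


# Placeholder entries for illustration only.
records = [
    Decision(3, "sampling strategy", "placeholder: why the conservative option was rejected"),
    Decision(1, "bead count", "placeholder: why this value for this system class"),
    Decision(2, "cutoff", "placeholder: what failed at smaller values"),
]

for hit in search(records, "cutoff"):
    print(hit.year, hit.topic, "-", hit.reasoning)
```

The point of the sketch is only that a new student's question ("why this cutoff?") becomes a lookup rather than a reconstruction.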
This is not an argument for spending significant time on documentation infrastructure in year one. It is an argument for choosing the right system in year one, while the cost of that choice is low, so that the documentation happens as a side effect of the work rather than as an additional burden on top of it.
The decision is most consequential precisely when it seems least urgent.
Probe is designed to be deployed at the start of a lab, not retrofitted into one. It indexes reasoning alongside artifacts as a side effect of normal research activity — no structured entry, no documentation tax. Founding lab pricing available through Q2 2026.
Learn more at probe.onstratum.com →