The ML Potential Training Problem
You trained the model on 380,000 DFT configurations. The model fails on a new composition class. The data curation decisions are gone with the person who made them.
A third-year postdoc in a computational materials lab spends most of late 2024 building a machine-learned interatomic potential for a class of high-entropy alloys. She runs 380,000 DFT reference calculations — mostly VASP, some Quantum ESPRESSO for validation — to generate the training set. She tries three different active learning schemes before settling on the one that gives adequate coverage. She chooses PBE as the exchange-correlation functional after testing PBEsol and finding worse transferability for the binary subsystems. She deliberately excludes certain extreme geometry distortions from the training set because they caused the model to over-fit to non-physical configurations. She trains a final model that achieves the target accuracy on the test set and uses it in the group's production simulations throughout 2025.
She defends in May 2025 and takes a position at a national lab.
In the fall, a new student wants to extend the potential to a related alloy composition — slightly different elemental ratios, still within what seems like the same material class. The potential fails. The forces are unstable. The molecular dynamics crashes within 200 femtoseconds.
Nobody knows why.
What You Have
The training set files are on the cluster. 380,000 VASP calculations, with inputs, outputs, and energies extracted. The training code is in a git repository with version history. The final model weights are archived. The publication is submitted and the supporting information describes the training procedure at a methods-section level of detail.
What you do not have is the reasoning behind the data curation decisions.
Specifically: which material composition subspace was intentionally left sparse because the postdoc determined it was out of scope for the target application? Which geometry distortion threshold was used for the outlier filtering, and why that threshold and not a tighter one? Why PBE and not PBEsol — the postdoc ran that comparison, the test results are somewhere in her local directory, but the conclusion and the reasoning are in a group meeting slide that was presented once and never uploaded to a shared location.
The git repository for the training code has commit messages like these:
```
commit 8d2a4c1
Author: Yuki Tanaka <y.tanaka@lab.edu>
Date:   Mon Sep 15 14:22:08 2024

    updated training configuration

commit 4f9b31e
Date:   Thu Oct 3 09:17:44 2024

    fixed active learning loop, adjusted sampling threshold

commit 1c7d852
Date:   Fri Nov 8 16:44:33 2024

    final training set config, removed problematic structures
```
"Removed problematic structures." Which structures? Problematic in what sense? What made them problematic and not just high-energy? The commit changes the filtering script by six lines. The reasoning is not in the diff.
Why the Model Fails on the New Composition
A machine-learned interatomic potential is, at its core, a function that maps atomic configurations to energies and forces. It can only produce accurate outputs for configurations that are well-represented in its training distribution. When it encounters configurations outside that distribution, it extrapolates — and the extrapolation can be catastrophically wrong, which is what "the MD crashes within 200 femtoseconds" looks like in practice.
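In code, the failure mode can be made visible with a crude proximity check: flag any configuration whose descriptor vector is far from everything in the training set. A minimal sketch, where the descriptors, data, and threshold are illustrative and not from the postdoc's actual pipeline:

```python
import numpy as np

def extrapolation_score(train_descriptors: np.ndarray, query: np.ndarray) -> float:
    """Distance from a query configuration's descriptor vector to its
    nearest training-set neighbor. Large values mean the model is
    extrapolating rather than interpolating."""
    return float(np.linalg.norm(train_descriptors - query, axis=1).min())

# Toy data: the training set clusters near the origin of descriptor space.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 8))

inside = np.zeros(8)        # well covered by the training distribution
outside = np.full(8, 10.0)  # far outside it

assert extrapolation_score(train, inside) < extrapolation_score(train, outside)
```

Production MLIP codes use richer uncertainty signals (ensemble disagreement, extrapolation grades), but the principle is the same: distance from the training distribution predicts where forces go wrong.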
The question of why the new composition is outside the training distribution has two possible answers. Either the composition was genuinely not covered — the training set was never meant to include that region of composition space — or it was covered but the filtering removed too many of those configurations, leaving the model undertrained there.
Diagnosing which of these is true requires knowing the original data curation decisions. It requires knowing what was left sparse intentionally versus accidentally. It requires knowing whether the composition filter that excluded "problematic structures" in November 2024 was meant to exclude unstable geometries or was accidentally set to also exclude a broad swath of the new composition's parameter space.
Without that information, the new student has two options. She can re-train the potential from scratch — 380,000 new DFT calculations, four to six months of work. Or she can try to reverse-engineer the training set's coverage empirically, running test calculations to map the failure boundary, and build intuition from that about what was and was not represented. This is also months of work, and it may not converge on the right answer because she is diagnosing a decision she did not make with evidence that was not designed to diagnose it.
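The empirical route starts with a census: how many training configurations fall in each composition bin? A hypothetical sketch of that first step, with made-up element lists standing in for the real parsed training set:

```python
from collections import Counter

def composition_key(symbols: list[str]) -> str:
    """Canonical composition label, e.g. ['Co', 'Cr', 'Co'] -> 'Co2Cr1'."""
    counts = Counter(symbols)
    return "".join(f"{el}{n}" for el, n in sorted(counts.items()))

# Hypothetical per-structure element lists extracted from the training set.
training_structures = [
    ["Co", "Cr", "Fe", "Ni"],
    ["Co", "Cr", "Fe", "Ni"],
    ["Co", "Co", "Cr", "Fe", "Ni"],
]

coverage = Counter(composition_key(s) for s in training_structures)
# Compositions missing from (or rare in) this census are candidate
# explanations for unstable forces in the new alloy.
assert coverage["Co1Cr1Fe1Ni1"] == 2
```

The census tells you where the training set is thin; it cannot tell you whether the thinness was a deliberate scope decision or an accident of filtering, which is exactly the information that left with the postdoc.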
A machine-learned potential inherits all the tacit knowledge of the person who curated its training data. When that person leaves, the potential is correct but unexplainable — and unexplainable models cannot be extended, debugged, or transferred.
The Three Failure Modes
The scenario above — model fails on extension, no one knows why — is one of three common failure patterns in ML potential development. They have different surface appearances but the same underlying cause.
Failure mode one: the extension problem. A new student or collaborator wants to apply the potential to a related system. The model fails. Without documented coverage decisions, they cannot determine whether this is a genuine extrapolation failure (the model was never trained for this) or a fixable gap (the training set was thin here, more data would help). The diagnostic cost is high.
Failure mode two: the retrain problem. The model needs to be retrained — new functional form, better training data, updated DFT settings. Without documented decisions from the original training, the person doing the retrain does not know which choices were principled and which were pragmatic. They cannot distinguish "we used PBE because it was the right choice for this system" from "we used PBE because it was faster and we were running out of allocation." They cannot inherit the decisions; they must remake them, often arriving at different answers without knowing why.
Failure mode three: the reproducibility problem. A collaborator uses the potential, gets different results than the original paper, and asks for clarification. Or a reviewer asks for justification of the training procedure. The PI can describe the general approach accurately, but the specific decisions — which cutoffs, which active learning criterion, which structures were filtered and why — require either finding the postdoc or reconstructing from incomplete records. The published methods section describes the procedure at the level of a paper, which is never the level of detail needed to reproduce the decisions.
The Scale of the Problem in ML Potential Development
Machine-learned interatomic potentials are not simple models. Training a production-quality MLIP for a multi-component alloy system involves decisions at every step: which base DFT functional to use for the reference calculations; what geometric distortion range to sample; which active learning strategy to use and what its convergence criterion should be; how to handle the boundary between physical and non-physical configurations in the training set; which validation metrics matter for the target application.
Each of these decisions has a rationale. The rationale is generated by a person, usually a senior postdoc or advanced PhD student, at the moment the decision is made. It may be communicated once at a group meeting. It may be partially documented in a lab notebook. It does not end up in the training configuration file, the git repository, or the paper.
For a potential trained on 380,000 DFT calculations — a large but not unusual number for a system of moderate complexity — those decisions represent months of expert judgment. The model that comes out the other end is correct, but it is only correct within boundaries that nobody documented. The person who knows those boundaries has graduated.
This is not a problem unique to machine-learned potentials. It is the general problem of complex methodological pipelines in computational science: the pipeline exists, the outputs are correct, and the reasoning that produced both is undocumented because there was never a low-friction way to capture it at the moment it was generated.
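One low-friction form such capture could take is a small structured record committed next to the artifact it explains. This is a sketch of a convention, not a prescribed tool; the paths, threshold, and wording are invented for illustration:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class DecisionRecord:
    """A decision note that lives in the repository next to the code."""
    artifact: str      # path to the config or script the decision affects
    decision: str      # what was chosen
    alternatives: str  # what was considered and rejected
    rationale: str     # why the choice was made

record = DecisionRecord(
    artifact="train/filter_structures.py",          # hypothetical path
    decision="exclude extreme geometry distortions",
    alternatives="looser filter that kept non-physical configurations",
    rationale="model over-fit to unphysical geometries in early runs",
)

# Serialized alongside the commit, the rationale survives the author's exit.
serialized = json.dumps(asdict(record), indent=2)
```

Even a record this minimal would have answered "problematic in what sense?" — the question the six-line diff cannot.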
What Indexed Reasoning Would Change
The extension failure described at the start of this post takes months to diagnose without documentation and twenty minutes to diagnose with it.
A query to an indexed version of the lab's decision context — "why did we choose PBE over PBEsol for the HEA potential training?" — should return the group meeting notes from October 2024 where the comparison was discussed, the postdoc's message to the PI summarizing the transferability test results, and the specific conclusion. A query about which structures were excluded in the November 2024 filtering step should return the reasoning behind the threshold choice.
This does not require that the postdoc have written separate documentation. The reasoning already exists in the lab's communication channels — in Slack messages, emails, and group meeting notes. What it requires is that those channels be indexed against the artifacts they describe, so that a query about the training configuration returns the decision context, not just the configuration file.
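Mechanically, that kind of indexing can be pictured as a mapping from artifact paths to the messages that mention them. A toy sketch with invented messages and paths; a real system would do far more than substring matching:

```python
from collections import defaultdict

def build_index(messages: list[dict], artifact_paths: list[str]) -> dict:
    """Map each artifact path to every message whose text mentions it."""
    index = defaultdict(list)
    for msg in messages:
        for path in artifact_paths:
            if path in msg["text"]:
                index[path].append(msg)
    return index

messages = [
    {"channel": "slack",
     "text": "PBEsol transferability was worse on the binaries; "
             "keeping PBE in train/config.yaml"},
    {"channel": "email", "text": "slides uploaded to the shared drive"},
]

index = build_index(messages, ["train/config.yaml"])
# Querying the artifact now surfaces the decision context, not just the file.
assert "PBEsol" in index["train/config.yaml"][0]["text"]
```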
When a new student can query the reasoning behind the model they are trying to extend, the diagnosis of why it fails on a new composition takes hours, not months. When a collaborator asks for reproducibility clarification, the response is assembly rather than reconstruction. When a reviewer asks for justification of the training procedure, the answer is findable.
The reasoning exists. It is just not where anyone can find it.
Probe indexes the reasoning behind your lab's computational decisions alongside the artifacts they produced — training configurations, DFT parameters, filtering choices — and makes them queryable when the person who made them has moved on. Founding lab pricing available through Q2 2026.
Learn more at probe.onstratum.com →