StratumJournal
March 3, 2026 · Probe / ResearchOS

HPC Jobs Don't Fail Loudly

At 3:47 AM on a Tuesday, a VASP geometry optimization failed on node cn0142 of the Alpine cluster. The error was a segmentation fault in the FFT routine — a known issue with a one-line workaround documented in a Stack Overflow post from 2023. Nobody in the lab found out until 9:15 AM.

By then, 11 compute hours had been spent on downstream jobs in the dependency chain — all of them invalid. The wallclock reservation had expired at 8 AM. The student who submitted the job would wait another three days for the next open window.


This is not a story about a bad HPC cluster. Alpine is well-run. SLURM works. CARC support responds to tickets within 24 hours.

This is a story about the gap between HPC infrastructure and scientific intent. SLURM knows when a job exits nonzero. It does not know that the job was running the third step of a five-step convergence test, that the MKL error is a node-specific environmental problem rather than a fundamental physics failure, or that the grad student who submitted it had office hours until 8 PM and could have requeued within minutes if she'd known.

The cluster manages compute. Nobody manages the science.

The cost nobody measures

Walk through any computational materials or chemistry lab and ask: “What fraction of your HPC compute hours in the last year ran to completion successfully and produced usable data?”

Most PIs don't know. The SLURM accounting reports tell you about CPU-hours billed. They don't tell you how many of those hours went to jobs that failed in ways nobody caught for hours, or to jobs that completed but produced garbage output because an input file was misconfigured in a way that VASP happily runs to completion on rather than flags.

A reasonable lower bound from conversations with computational PIs: 15–25% of HPC compute time in a typical academic group is wasted on runs that fail silently or complete without being useful. On an Alpine allocation of 200,000 SUs per year — a medium-sized group — that's 30,000 to 50,000 core-hours gone.

The dollar value matters less than the time value. Each failed overnight run pushes a result back by a day. Ten failed overnight runs — a not-unusual month — push a timeline back by two weeks. A grad student with a five-year PhD and a thesis that needs computational results has a budget of roughly 250 productive work weeks. Two weeks of invisible slippage per month is the difference between defending on time and not.

Why this doesn't get fixed

The HPC cluster is not your lab's problem to manage. That's CARC's job, or RC's job, or whoever runs the institutional allocation. They are excellent at keeping nodes up, managing queues fairly, and responding to hardware failures.

They are not watching your LAMMPS runs to see if the MLIP potential you loaded has a force discontinuity that will produce thermodynamically inconsistent trajectories that look plausible until someone tries to compute a free energy from them six months later. That's your lab's job. And in practice, it falls to whoever happens to check their email first in the morning.

The monitoring tools that exist are built for infrastructure, not science. sacct will tell you a job ran for 2.3 hours and exited with code 139. It will not tell you that exit code 139 from LAMMPS on this cluster, in this configuration, running this potential, means the GPU memory ran out on the GNN evaluation step and the last 40 minutes of trajectory are corrupted.
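As a sketch of what lab-side interpretation could look like: the snippet below maps a job's accounting record to a diagnosis using a hand-written rule table. The field names echo real SLURM accounting fields (JobID, ExitCode), but the `JobRecord` class, the rule table contents, and the diagnoses are all hypothetical examples, not an existing tool — in practice the rules would come from the lab's own failure history.

```python
# Sketch: turn a raw accounting record into a lab-specific diagnosis.
# The rule table is hypothetical; a real one would be built from the
# lab's own debugging history, not hard-coded.

from dataclasses import dataclass

@dataclass
class JobRecord:
    job_id: str
    app: str          # e.g. "lammps", "vasp" -- inferred from the job script
    partition: str    # e.g. "rome", "gpu"
    exit_code: int    # shell-style exit status (139 = 128 + SIGSEGV)

# (app, partition, exit_code) -> what this failure has meant here before.
# Note the same exit code means different things in different contexts,
# which is exactly why a bare sacct report isn't enough.
RULES = {
    ("lammps", "gpu", 139): "GPU memory exhausted during GNN potential "
                            "evaluation; trajectory tail likely corrupted.",
    ("vasp", "rome", 139):  "Known FFT segfault on Rome nodes; "
                            "previous fix: set NCORE=4 in INCAR.",
}

def diagnose(rec: JobRecord) -> str:
    """Map a job record to a diagnosis, falling back to the raw exit code."""
    key = (rec.app, rec.partition, rec.exit_code)
    return RULES.get(key, f"Unrecognized failure: exit code {rec.exit_code}.")

print(diagnose(JobRecord("4812093", "lammps", "gpu", 139)))
```

The point of the sketch is the lookup key: identical exit codes fan out to different diagnoses depending on application and node type, which is the contextual knowledge sacct cannot hold.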

That interpretation lives in the head of the postdoc who has seen it happen before. If that postdoc is on vacation, or has graduated, or is on the other coast at a conference, the job failure just sits there.

The second-order problem

The worst part isn't the failed run. The worst part is that the knowledge of how to diagnose and fix the failed run doesn't accumulate.

Every group has a postdoc who is the de facto HPC expert — the person who has debugged VASP on Alpine enough times to know that the FFT segfault only happens on the Rome nodes, that LAMMPS occasionally dies if you cross a node boundary with a certain KOKKOS configuration, that the scratch filesystem has a known latency spike Tuesday mornings that can trigger apparent job failures that are actually I/O hangs.

When that postdoc leaves — and they always leave — the next person learns these things again. From scratch. Painfully.

Some labs maintain a wiki. The wiki is almost always out of date: the most recent entry is from 2022, and the fix it documents no longer applies because the module system changed in 2024.

The information doesn't need a wiki. It needs to be captured at the point of occurrence — when the postdoc debugs the failure, notes the cause, and fixes it — and indexed so it's findable when the next person hits the same wall.
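Mechanically, capture-at-point-of-occurrence can be very simple. The sketch below appends each resolved failure to an append-only log and supports naive search over it; the `record_failure` and `search` functions, the JSON-lines format, and every field name are illustrative assumptions, not a prescribed schema.

```python
# Sketch: capture a failure diagnosis at the moment it is resolved, and
# make it searchable later. A JSON-lines file stands in for whatever
# index a real system would use.

import json
from pathlib import Path

LOG = Path("failure_log.jsonl")  # hypothetical location

def record_failure(app: str, symptom: str, cause: str, fix: str) -> None:
    """Append one resolved failure to the lab's running log."""
    entry = {"app": app, "symptom": symptom, "cause": cause, "fix": fix}
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def search(term: str) -> list[dict]:
    """Naive substring search over past entries; real indexing would be richer."""
    term = term.lower()
    if not LOG.exists():
        return []
    with LOG.open() as f:
        entries = [json.loads(line) for line in f]
    return [e for e in entries if term in json.dumps(e).lower()]

# The postdoc logs the fix once, at the moment of debugging...
record_failure(
    app="vasp",
    symptom="segfault in FFT routine on Rome nodes",
    cause="node-specific MKL/FFT issue",
    fix="add NCORE=4 to INCAR",
)
# ...and the next person who hits the wall can find it.
print(search("rome")[0]["fix"])  # -> add NCORE=4 to INCAR
```

The design choice that matters is the write path: logging happens in the same moment as the fix, so the knowledge survives the postdoc's departure without anyone maintaining a wiki.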

What monitoring for science looks like

The right model is not a pager alert when jobs fail. Any PI who has tried to run a 24/7 alert system for computational jobs knows exactly how that ends: alert fatigue, ignored notifications, or worse, a lab culture where every failure demands an immediate response.

The right model is an agent that understands your lab's computational stack — your specific VASP version, your MLIP potentials, your convergence criteria, your dependency chains — and interprets failures in that context.

Not “job cn0142 exited 139.” But: The VASP relaxation for MXene-Ti3C2-vacancy-3x3 failed on a Rome node with the FFT segfault. This has happened three times in the last six months; previous fix was adding NCORE=4 to the INCAR. I've flagged the next queued job in this series and paused it pending your review.

That's a different thing. That's an agent that has absorbed the lab's diagnostic history and can apply it automatically. It requires two things most labs don't have: persistent memory of previous failures and their resolutions, and understanding of scientific intent — knowing that the VASP job is step three of five, so a failure here should hold the downstream cascade until someone decides whether to re-run or proceed.
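The hold-the-cascade behavior is also simple to state in code. The sketch below models a workflow as an ordered chain of jobs and, on a failure, holds everything downstream instead of letting it burn compute on invalid inputs. The `Chain` class and the step names are illustrative assumptions; a real implementation would talk to the scheduler's dependency and hold mechanisms.

```python
# Sketch: on a failure at step k of an n-step chain, hold steps k+1..n
# pending human review rather than letting the cascade run on bad inputs.

from dataclasses import dataclass, field

@dataclass
class Chain:
    steps: list[str]                      # ordered job names, step 1 first
    held: set[str] = field(default_factory=set)

    def on_failure(self, failed_step: str) -> str:
        """Hold every step after the failed one and report what happened."""
        i = self.steps.index(failed_step)
        downstream = self.steps[i + 1:]
        self.held.update(downstream)
        return (f"{failed_step} failed (step {i + 1} of {len(self.steps)}); "
                f"held {len(downstream)} downstream job(s) pending review.")

chain = Chain(["relax", "static", "convergence", "band", "dos"])
print(chain.on_failure("convergence"))
# -> convergence failed (step 3 of 5); held 2 downstream job(s) pending review.
```

What the sketch cannot supply is the hard part the essay names: knowing that these five jobs form one chain in the first place. That's the scientific-intent layer, and it has to come from somewhere other than the scheduler.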

The question worth asking this week

What happened on your cluster last night?

If the answer is “I'll check after coffee,” the monitoring gap is real.

If the answer is “I got a summary in my inbox at 6 AM describing which jobs completed, which failed, why, and what the recommended next step is” — you've already built what most computational labs spend years not having.

Probe / ResearchOS

ResearchOS is a persistent context layer for research labs — including autonomous HPC monitoring for LAMMPS, VASP, GROMACS, and other computational workflows on Alpine, NERSC, TACC, and ACCESS-allocated clusters.
