The Overnight Job Problem
A graduate student submits a LAMMPS simulation on Monday evening. Molecular dynamics on a carbon nanotube system — expected runtime fourteen hours, results ready Tuesday morning. She sets up the job, confirms it enters the queue on Alpine, and goes home.
Tuesday at 9am, she opens her laptop. The job is not in the queue. The output directory has a log file. The log file is 47,000 lines long. Somewhere in those 47,000 lines, the simulation ran out of memory at 3:47am and was killed. Six hours of compute: charged. Results: not written. What to do next: unclear, because the memory limit was set based on a previous run of a smaller system, and the person who set that limit graduated last spring.
This is not an unusual week. It is a Tuesday.
The Anatomy of an Overnight Failure
HPC jobs fail in predictable ways. Every computational research lab eventually produces a mental taxonomy of failure modes — walltime exceeded, memory overflow, input errors that only surface six hours in, node failures that kill jobs at random. What is not predictable is when any given failure will occur, and the combination of batch scheduling and overnight submission means that the failure window is precisely the window when no one is watching.
| Failure mode | How it happens | How you find out | What it costs |
|---|---|---|---|
| Walltime exceeded | SLURM kills the job when it hits the requested time limit | Email from scheduler; job listed as TIMEOUT in sacct | Full compute time charged; result not written; must resubmit with longer walltime |
| Memory exceeded | Node runs out of RAM; the OS kills the process | Email from scheduler; log file may show 'Killed' with no explanation | Job must be profiled and resubmitted with higher memory request — often guesswork |
| Input error | Bad parameter in config file; code runs but produces garbage or crashes mid-run | Log file at pickup time; error may be buried 10,000 lines deep | Full compute time lost; parameter history unclear if run wasn't documented |
| Node failure | Hardware fails mid-job; scheduler kills all jobs on affected node | System email; job listed as NODE_FAIL | Random; can happen to any job; only solution is checkpoint/restart capability |
| Dependency not found | Module version mismatch; library not available on allocated nodes | First line of log file; obvious in retrospect | No compute time charged, but setup effort is wasted and the result is delayed |
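The states in the table map directly onto the records SLURM keeps for every job. A minimal sketch of turning one such record into a next-action hint, assuming `sacct --parsable2 --format=JobID,State,Elapsed,ReqMem` field layout; the remediation text is illustrative, not an exhaustive policy:

```python
# Sketch: classify one sacct record and suggest a next action.
# Assumes pipe-delimited fields: JobID|State|Elapsed|ReqMem
# (as produced by `sacct --parsable2`). Hints are illustrative only.

SUGGESTIONS = {
    "TIMEOUT": "resubmit with a longer --time; check for a usable checkpoint",
    "OUT_OF_MEMORY": "resubmit with a higher --mem; profile peak usage first",
    "NODE_FAIL": "resubmit; consider adding checkpoint/restart to the input",
    "FAILED": "inspect the log tail; likely an input or dependency error",
}

def triage(sacct_line: str) -> str:
    """Map one sacct record to a human-readable next-action hint."""
    job_id, state, elapsed, req_mem = sacct_line.split("|")
    state = state.split()[0]  # e.g. "CANCELLED by 1234" -> "CANCELLED"
    hint = SUGGESTIONS.get(state, "no automated suggestion; check the log")
    return f"{job_id}: {state} after {elapsed} (mem req {req_mem}) -> {hint}"

print(triage("8812345|TIMEOUT|14:00:12|16G"))
```

Even a sketch this small replaces the first five minutes of manual inspection; the hard part, as the rest of this section argues, is the context behind the hint.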
The table above describes mechanics. The real cost is not in the table. The real cost is the accumulation: a lab that runs overnight jobs regularly will experience these failures regularly, and each failure requires reconstruction. What parameters were being tested? What ran successfully before? What does the failure tell us about the input that needs to change?
In a well-documented lab, those answers are findable. In most labs, they are findable sometimes — depending on whether the student who ran the analogous job last year left notes, whether the Slack thread from the debugging session is still searchable, whether the VASP INCAR from the successful March run was saved in a way that can be compared to the failed one from this week.
Why the Current Solutions Don't Work
SLURM sends emails when jobs complete or fail. The emails are honest but not helpful. A TIMEOUT notification tells you the job hit the walltime limit. It does not tell you how close it was to finishing, whether a checkpoint exists, whether the same job ran successfully last year with a different parameter set, or what walltime request would have been adequate. The email is a signal. It contains almost none of the context that makes the signal actionable.
Most labs have developed some version of a manual response system: check the email, SSH into the cluster, navigate to the job directory, grep the log file for the failure message, decide what to change, resubmit. This process takes fifteen to forty-five minutes per incident. It requires knowing where the job ran, which log files to check, and what the error message means in the context of that specific code. New students cannot execute it independently for the first several months. Senior students and postdocs do it as a background interrupt to their actual work.
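The grep step of that manual loop is mechanical enough to sketch. Assuming a handful of illustrative failure signatures (real codes and schedulers emit different messages, so the patterns here are placeholders), a log scan might look like:

```python
import re

# Sketch: scan a job log for known failure signatures instead of
# paging through 47,000 lines by hand. Patterns are illustrative;
# each code and cluster emits its own messages.
SIGNATURES = [
    (re.compile(r"slurmstepd: .*TIME LIMIT"), "walltime exceeded"),
    (re.compile(r"Out [Oo]f [Mm]emory|oom-kill|Killed"), "memory exceeded"),
    (re.compile(r"ERROR: Unknown command|Invalid keyword"), "input error"),
    (re.compile(r"error while loading shared libraries"), "dependency not found"),
]

def scan_log(lines):
    """Return (line_number, diagnosis) pairs for lines matching a signature."""
    hits = []
    for n, line in enumerate(lines, start=1):
        for pattern, diagnosis in SIGNATURES:
            if pattern.search(line):
                hits.append((n, diagnosis))
    return hits

log = [
    "Step 840000 completed",
    "slurmstepd: error: *** JOB 8812345 ON c3-12 CANCELLED DUE TO TIME LIMIT ***",
]
print(scan_log(log))  # [(2, 'walltime exceeded')]
```

This automates the fifteen-minute grep, not the forty-five-minute decision. The decision still needs the lab's history, which is the gap the next section describes.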
The alert infrastructure exists. What's missing is the layer that connects the alert to the lab's accumulated context about what that alert means.
What the Lab Actually Needs to Know
When an HPC job fails, the relevant questions are not just technical. They are historical.
Has this system been run before? If so, what parameters worked? The INCAR and KPOINTS files from a similar perovskite calculation that ran successfully in 2024 contain the ENCUT, k-point mesh, and SIGMA that converged cleanly for that system class. A new student trying to run an analogous system should not have to reconstruct those choices from first principles — the lab already made them and learned what worked.
Has this failure mode occurred before? A lab that has been running VASP for five years has seen memory overflow errors on specific supercell sizes, has seen walltime failures on specific calculation types, has collectively learned which PREC settings are safe at which system sizes. That collective knowledge is tacit. It lives in the heads of people who have been in the lab for more than two years. When those people graduate, the knowledge graduates with them.
What is the right next action? Resubmit with more memory? Reduce the system size? Try a different functional? The answer depends on the failure mode, the specific system, and what the lab has already tried. Without that context, the student's options are: ask someone (if anyone is available), search old emails (if they exist), or guess.
The Monitoring Gap Is an Institutional Memory Gap
The problem is usually framed as an alerting problem. Labs get scheduler emails; they want better emails, or Slack notifications, or a dashboard that shows job status. These are real improvements. They are also insufficient.
Better alerting gets the researcher to the failed job faster. It does not tell them what to do with the failure. The researcher who arrives at a failed LAMMPS job at 3am via a Slack notification is still missing the same context as the researcher who arrives at 9am: what ran before, what worked, what this failure indicates about the input parameters.
The layer that's missing is not alerting. It is memory — specifically, the connection between the current job failure and the lab's accumulated history of running similar calculations. That connection requires knowing: what did this lab learn the last time a LAMMPS potential job failed on an MXene system? What was the fix? Was it ever documented?
For most labs, the answer to the last question is no. The fix was applied, the job was resubmitted, the simulation ran, the results went into a paper. The knowledge of the fix lived in the student who applied it, and was communicated to the next student by word of mouth, if at all. The HPC cluster logged the job metadata. Nothing logged the reasoning.
What Persistent Monitoring Infrastructure Looks Like
The right model is not better alerting. It is an agent that watches HPC jobs as a persistent observer — one that knows what the lab has run before, recognizes failure patterns from historical context, and can surface relevant prior experience when a job fails overnight.
When a LAMMPS job fails on Alpine at 3:47am, that agent surfaces the relevant history: this potential file was run last October on a similar carbon nanotube system; that job completed with 32GB of memory; the current job requested 16GB. Here is the input script from the October run for comparison. Here is the postdoc who ran that job — she graduated, but her notes are in the lab's memory system.
That is not a notification. It is institutional memory made actionable.
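What surfacing that history could look like mechanically: a minimal sketch with a hypothetical run record and an in-memory index. A real system would persist these records and match more loosely than exact string equality.

```python
from dataclasses import dataclass

# Sketch of the missing memory layer: index past runs so a new failure
# can be matched against what the lab already tried. All field names
# and example records are hypothetical.
@dataclass
class RunRecord:
    system: str    # e.g. "carbon-nanotube", "mxene"
    code: str      # "lammps", "vasp", ...
    outcome: str   # "ok", "oom", "timeout", ...
    mem_gb: int
    notes: str

HISTORY = [
    RunRecord("carbon-nanotube", "lammps", "oom", 16,
              "16GB insufficient at this system size"),
    RunRecord("carbon-nanotube", "lammps", "ok", 32,
              "converged cleanly; see October input script"),
]

def relevant_history(system: str, code: str):
    """Surface prior runs of the same code on the same system class."""
    return [r for r in HISTORY if r.system == system and r.code == code]

for r in relevant_history("carbon-nanotube", "lammps"):
    print(r.outcome, r.mem_gb, "-", r.notes)
```

The hard part is not the lookup; it is that the `notes` field gets written at all, at the moment the fix is applied rather than never.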
The six hours of compute lost to an overnight failure does not have to be an isolated incident. It can be a data point that improves how the lab runs the next job — if the infrastructure exists to connect the failure to what the lab already knows. Right now, for most research labs, that infrastructure does not exist.
The HPC cluster logs everything except the one thing that matters: why the job was configured the way it was, and what the lab learned from running it.
Every lab eventually learns this the hard way — usually during a time-critical run when the person who knows the relevant history is unavailable. The solution is not to keep that person available. It is to build the infrastructure that makes the history available regardless of who is in the lab.
ResearchOS is currently in design partner trials with computational research labs at R1 universities. It runs persistent HPC monitoring that connects job failures to the lab's accumulated context — what ran before, what worked, what the history suggests. If your lab runs overnight simulations on SLURM clusters and you have felt the overhead described here, we'd like to talk.
probe.onstratum.com →