The Night My Server Rebooted Itself Into a Loop
A story about one innocent systemd timer, one cursed config flag, and one very long Sunday night in Accra.
I just wanted the server to reboot once a week.
That's it. Clean slate every seven days. Flush any memory leaks, clear stale processes, keep things fresh. It's the kind of low-effort maintenance task that feels smart and responsible. Senior engineer behaviour, even.
So I set up a systemd timer, tested it, watched it fire, and thought nothing more of it.
Then my server disappeared.
What Actually Happened
The timer worked. A little too well.
The culprit was one line in my systemd timer config:
Persistent=true
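For context, the flag lived in a timer unit paired with a one-shot service, roughly like this. This is reconstructed from memory: the weekly-reboot unit names are illustrative, and the exact schedule is a guess.

# /etc/systemd/system/weekly-reboot.timer
[Unit]
Description=Reboot the server once a week

[Timer]
OnCalendar=Sun *-*-* 02:00:00
# The cursed flag: replay any run "missed" while the machine was down
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/weekly-reboot.service
[Unit]
Description=Weekly reboot

[Service]
Type=oneshot
ExecStart=/sbin/reboot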
What Persistent=true does is tell systemd: "if this timer was supposed to fire while the system was offline, run it immediately when the system comes back up."
Sounds reasonable, right? Except my timer was configured to reboot the server. So the sequence of events looked something like this:
- Timer fires, server reboots
- Server comes back online
- systemd sees the timer was "missed" during the reboot downtime
- Persistent=true says: run it now
- Server reboots again
- Go to step 2. Forever.
An infinite reboot loop. Triggered by a flag I didn't fully understand. On a production server.
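If you want to audit a box for the same trap, systemctl will show you every timer and its full unit file. The timer name here is the hypothetical one from above:

# Every timer on the system, including inactive ones
systemctl list-timers --all

# Dump a specific timer's config and look for Persistent=true
systemctl cat weekly-reboot.timer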
The Damage
I noticed within minutes. That part sounds like a win. It wasn't.
My app stopped responding. I tried to SSH in. Connection refused. Tried again. Nothing. And again. Nothing. The server was just gone. Not slow, not timing out, just completely unreachable every single time.
I tried everything. SSH, nothing. EC2 Instance Connect, nothing. Even the EC2 serial console, which is supposed to work regardless of the OS state, nothing. The server was cycling too fast for any of it to matter. By the time any connection had a chance to establish, it was already gone again. Completely locked out, with a full view of the problem and zero ability to touch it.
There was nothing to debug. Nothing to fix. The only option left was to delete the instance entirely and start from zero.
What followed was a few hours of rebuilding everything from scratch. Nginx config. PM2 process setup. Environment variables. Reverse proxy rules. SSL certificates. All of it, pieced back together late into the night, from memory and old notes, hoping I hadn't missed anything critical.
Sunday night. That's the one thing that saved me. Traffic was low, barely anyone was using the app, and the downtime mostly went unnoticed. If this had happened on a Tuesday afternoon? Different story entirely. Sometimes the timing of your mistakes is the only thing standing between you and a very bad day at work.
What I Should Have Done
A few things, honestly.
Read the docs before copying config snippets. Persistent=true is a perfectly valid option. It just has absolutely no business being in a timer that reboots a server. If I'd spent two minutes reading what it actually did, I'd never have included it.
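The fix is embarrassingly small. Persistent defaults to false, so for a reboot timer you just leave the flag out, or set it explicitly if you want the intent on record. Same hypothetical unit as above:

[Timer]
OnCalendar=Sun *-*-* 02:00:00
# false (the default): a run missed during downtime is skipped, not replayed
Persistent=false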
Test destructive timers differently. I "tested" the timer by watching it fire once successfully. But I never thought through what happens on restart, which is exactly what the timer triggers. The one thing I needed to simulate, I didn't.
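What I'd do now, as a sketch: first confirm the schedule parses, then point the service at something harmless and watch a full cycle before wiring in the real command. The logger line is a stand-in for illustration, not what I actually ran:

# Confirm the calendar expression is valid and see when it next fires
systemd-analyze calendar "Sun *-*-* 02:00:00"

# In weekly-reboot.service, during testing: log to the journal instead of rebooting
[Service]
Type=oneshot
ExecStart=/usr/bin/logger "weekly-reboot timer fired"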
Use crontab for simple scheduled reboots. systemd timers are powerful, but that power comes with complexity. For something as straightforward as a weekly reboot, a crontab entry is cleaner, more predictable, and has no gotcha flags hiding in the docs:
0 2 * * 0 /sbin/reboot
That's it. Runs every Sunday at 2am. Does exactly one thing. Cannot loop. I use this now.
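One practical note: /sbin/reboot needs root, so the line belongs in root's crontab, not your user's:

# Edit root's crontab and paste in the line above
sudo crontab -e

# Double-check it took
sudo crontab -l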
The Real Lesson
The most painful incidents are rarely caused by complex bugs. They're caused by small misconfigurations you were confident about.
I was confident about that timer. I'd read enough about systemd to feel comfortable. But there's a specific kind of dangerous in tech: knowing just enough to do something, but not quite enough to do it safely. I was in that zone. The server paid for it.
Now, before I add any config flag I haven't used before, I look it up. Not the Stack Overflow answer. The actual documentation. Two minutes of reading has saved me from several potential disasters since that night.
The Silver Lining
Rebuilding from scratch meant I did everything properly this time. Better folder structure. Tighter Nginx config. Environment variables actually documented somewhere instead of living only in my head.
The incident cost me a Sunday night. The rebuild saved me future Sunday nights.
That's the weird thing about breaking production: sometimes the cleanup ends up better than what you had before. Not that I'd recommend it as a strategy.
I write about real engineering experiences. The things that went wrong, what I learned, and how I build now. If you've survived your own production incident, you already know: the best lessons are the expensive ones.