May 9, 2026

The 30ms tax that killed our race condition: row-level locks in production

A few years ago we shipped a smart parking system across multiple lots. The pitch was straightforward: cameras read licence plates at the gate, a Django backend matches them against active reservations, the gate goes up, the driver parks.

Within the first month of the pilot, we had a problem nobody on the team had taken seriously: drivers would book a spot, drive to the lot, and find someone else already in it. Same spot. Same time slot. Two reservations.

This is a textbook race condition. Two requests to reserve the same resource arrive at the same instant. Both check whether the spot is free (yes, it is), both decide they can have it, both write a row. The second write doesn't notice the first because they happened in parallel.
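
In code, the broken version looks innocent. A minimal sketch, using the same model names as the fix later in this post (error handling elided):

def reserve(spot_id, reservation):
    spot = Spot.objects.get(id=spot_id)
    if spot.status == 'free':        # request A and request B can both pass this check
        spot.status = 'reserved'
        spot.reservation = reservation
        spot.save()                  # both writes succeed; the second one silently wins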

We knew it was theoretically possible. We didn't think it would actually fire on the volumes we had. It did, repeatedly, and each time it cost us a tow truck and an angry phone call.

This is the story of the 30ms we added to every reservation request to make the problem disappear.

What didn't work

The first instinct of every backend engineer hitting this is the same: "I'll fix it in the application layer."

Idea 1: lock in Python. Wrap the read-and-write in a global mutex.

This works for one process. The moment you have two web workers — and you always have two — they don't share the mutex. The race comes back the next day, dressed in a thread-pool costume.
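
For the record, this is the shape of the fix that doesn't fix anything. A sketch, assuming a hypothetical _check_and_write() helper doing the read-and-write from above:

import threading

_reserve_lock = threading.Lock()     # one lock object per *process*

def reserve(spot_id, reservation):
    with _reserve_lock:              # serialises threads inside this worker...
        _check_and_write(spot_id, reservation)

# ...but a second worker process gets its own copy of _reserve_lock,
# so requests landing on different workers still race.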

Idea 2: optimistic concurrency. Read the spot's version column, attempt to update only if the version hasn't changed.

This is the classic pattern. It works when conflicts are rare. It does not work when conflicts are common. We were getting bursts of 50+ simultaneous requests for the same lot during morning rush. Optimistic locking turns each of those bursts into one winner and 49 failures. The user who clicked "reserve" sees an error, retries, and now there's another burst.

Optimistic locking optimises for the case where conflicts almost never happen. Our case was conflicts happening continuously. We needed the opposite.
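
Sketched with a hypothetical version column, the optimistic variant hinges on Django's queryset update() returning the number of rows it changed, which is what makes the compare-and-swap atomic:

def reserve_optimistic(spot_id, reservation, seen_version):
    # Compare-and-swap: write only if nobody bumped the version since we read it.
    updated = Spot.objects.filter(
        id=spot_id, status='free', version=seen_version
    ).update(status='reserved', reservation=reservation, version=seen_version + 1)
    if updated == 0:
        raise SpotTaken()            # hypothetical error; caller re-reads and retries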

Idea 3: a queue. Push every reservation request into a queue, process serially, write to the database.

This works. It also adds 200ms of latency, an entire piece of infrastructure to operate, and a failure mode (queue down = nobody can park) that didn't exist before.

We were one architectural decision away from microservices, message buses, and a 10-page incident review for every deployment. For a problem that was, fundamentally, "two writes shouldn't happen at the same time."

What we did instead

Postgres has had row-level locking since the late 1990s. We weren't using it.

from django.db import transaction

with transaction.atomic():
    # Emits SELECT ... FOR UPDATE: Postgres locks this row until the
    # transaction commits or rolls back.
    spot = Spot.objects.select_for_update().get(
        id=spot_id, status='free'
    )
    spot.status = 'reserved'
    spot.reservation = reservation
    spot.save()

select_for_update() tells Postgres: "I'm about to read this row and write it. While I'm doing that, no other transaction can touch this row. They have to wait." Other transactions that try to lock the same row block until this one commits or rolls back.

The cost is 30ms of added latency at the 95th percentile during peak. The benefit is that the race condition cannot physically occur, ever, regardless of how many web workers we run.

Two writes for the same spot at the same instant don't happen. The second one waits until the first commits, then re-reads the row, sees status='reserved', and the get() raises Spot.DoesNotExist because the row no longer matches status='free'. We catch that exception and tell the user "spot just taken, picking another."
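
The caller side is a single except clause in the view. A sketch, assuming the transaction above lives in a hypothetical reserve_spot() helper and error_response() stands in for our actual error handling:

try:
    reserve_spot(spot_id, reservation)
except Spot.DoesNotExist:
    # We lost the race: by the time our lock was granted, the row
    # no longer matched status='free'.
    return error_response("spot just taken, picking another")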

That's it. No queue. No mutex. No retry storm. One database primitive doing exactly what it was designed for.

Why we hadn't reached for it sooner

The honest answer is that locks have a bad reputation. Junior engineers are warned away from them with horror stories: deadlocks, lock escalation, performance cliffs. By the time you're senior, you've internalised "locks are scary" as a heuristic and you reach for application-layer solutions first.

Most of the horror stories are about row-locks done wrong. The two failure modes you actually have to worry about:

  1. Long-held locks. If you start a transaction, lock a row, and then make an HTTP call to a third party inside the transaction, the row stays locked for as long as the HTTP call takes. Don't do that. Locks belong inside transactions that touch only the database, finish in milliseconds, and commit.
  2. Deadlocks from inconsistent lock order. If transaction A locks spot 1 then spot 2, and transaction B locks spot 2 then spot 1, they deadlock. Postgres detects this and kills one of them with an error. The fix is to always acquire locks in a consistent order — sort by ID before locking, as in the sketch after this list.
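
The two rules fit in one function. A sketch for reserving several spots at once, where notify_third_party() is a stand-in for any slow external call:

from django.db import transaction

def reserve_many(spot_ids, reservation):
    with transaction.atomic():
        # Rule 2: lock rows in a consistent (ascending-id) order, so two
        # transactions can never each hold a row the other is waiting for.
        spots = list(
            Spot.objects.select_for_update()
            .filter(id__in=spot_ids, status='free')
            .order_by('id')
        )
        for spot in spots:
            spot.status = 'reserved'
            spot.reservation = reservation
            spot.save()
    # Rule 1: the transaction has committed and the locks are released.
    # Slow I/O belongs out here, never while rows are held.
    notify_third_party(reservation)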

Both are tractable. Both are easier than running a message queue.

What this is really about

I tell my students this is about row-level locks. It isn't. It's about reaching for the database primitive before the framework abstraction.

Web frameworks make every problem look like a problem you should solve in application code. ORMs hide the database from you so well that you start to forget the database knows things your application doesn't — like how to sequence concurrent writes, atomically.

A senior engineer's instinct, the one I'm trying to teach, is the inverse: when you have a concurrency problem, the question is not "how do I solve this in Python." The question is "what does the database already know how to do."

In our case the answer was: lock a row for 30ms.

100,000+ vehicles later, zero double-bookings.

What I'd tell a younger me

The boring solution is almost always a database feature you didn't know you had. Read the manual for the database you're already running before you add anything new to the stack.

And: the cost of "I'll add a queue for this" is never just the queue. It's the queue plus its monitoring, plus its backpressure handling, plus the on-call engineer who has to learn it, plus the deployment story when the queue version changes. A select_for_update() has none of those costs. It's already there. It's already tested by thousands of companies running it in production.

Use it.


Md. Tausif Hossain leads engineering at DevTechGuru, a Bangladesh-based agency shipping HealthTech, PropTech, and enterprise SaaS products to clients in nine countries. He also runs TechnicalBind, an independent software studio, and teaches advanced full-stack engineering at Ostad. Reach him at tausif.bd or @tausif1337.
