# Race Conditions — When Threads Collide
You've probably shipped one without knowing it. I spent three days chasing a bug that only appeared on Tuesday afternoons under load. Turned out to be two lines of perfectly reasonable code that happened to run at the wrong moment. This is that story — and everything I learned from it.
The first time I encountered a race condition in production, it cost us eight hours of engineering time and a very uncomfortable call with a client. The dashboard was showing negative inventory counts. Not wrong counts — negative ones. The kind that make your stomach drop.
The frustrating part? The code looked fine. Both functions were correct in isolation. The bug only appeared when two users clicked "Purchase" at almost exactly the same moment. I've since come to think of race conditions less as bugs and more as timing landmines buried in perfectly reasonable code.
## Section 01: So what actually is a race condition?
At its core, a race condition happens when the correctness of your program depends on the order in which threads execute — and you haven't guaranteed that order. Two threads are racing to access the same shared data, and whoever wins changes the outcome.
The deeper problem is that thread scheduling is handled by the operating system, which doesn't care about your application's logic. It can pause your thread mid-instruction and give CPU time to something else. That pause can happen at the worst possible moment.
"The bug isn't in any single line of code. It lives in the gap between two lines — in the moment when another thread slips through." — Every engineer who's debugged a race condition at 2am
Picture a joint account holding ₹1,000. You're at an ATM withdrawing ₹700 while Alice, the other account holder, is simultaneously doing an online transfer of ₹700. Both systems check the balance, both see ₹1,000, both think "yep, sufficient funds," and both proceed. The bank has now allowed ₹1,400 to leave a ₹1,000 account.
Nobody wrote buggy code. The ATM logic is correct. The transfer logic is correct. The problem is they ran concurrently against shared state without coordinating. That's a race condition.
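The bank story maps directly onto code. Here's a minimal Python sketch of it — a toy, not real banking code, and the `time.sleep` stands in for an unlucky context switch so the race fires on every run instead of once a month:

```python
import threading
import time

balance = 1000  # the shared account

def withdraw(amount):
    global balance
    if balance >= amount:     # check: "yep, sufficient funds"
        time.sleep(0.01)      # an unlucky pause -- the other channel runs now
        balance -= amount     # act: both channels end up here

atm = threading.Thread(target=withdraw, args=(700,))
transfer = threading.Thread(target=withdraw, args=(700,))
atm.start(); transfer.start()
atm.join(); transfer.join()

print(balance)  # -400: the bank let ₹1,400 leave a ₹1,000 account
```

Each `withdraw` is correct on its own; only the interleaving is broken — which is exactly the point.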
## Section 02: The classic example — and why it's deceptive
Let me show you the simplest possible version. Two threads, one counter:
```python
import threading

counter = 0

def increment():
    global counter
    for _ in range(100_000):
        counter += 1  # Looks like one step. It isn't.

t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start(); t2.start()
t1.join(); t2.join()

print(counter)
# What you expect: 200,000
# What you get:    147,382 (different every run)
```
Here's what trips people up: `counter += 1` looks atomic. It's one line. But the CPU doesn't execute it in one step. It expands to three: read the current value from memory, add 1 in a register, then write the result back. The OS can pause your thread between any of those three steps.
When Thread B reads the value before Thread A has written its result back, Thread B is working from stale data. Both threads end up writing the same value. One increment vanishes into thin air.
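You don't have to take that on faith — the standard `dis` module will show you the split. A quick sketch (exact opcode names vary across CPython versions):

```python
import dis

counter = 0

def increment():
    global counter
    counter += 1   # one line of source...

# ...but several interpreter instructions: a load, an add, and a store.
# A thread can be paused between any two of them.
dis.dis(increment)
```

The disassembly ends with a separate `STORE_GLOBAL` — the write-back step that can arrive too late and clobber another thread's work.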
## Section 03: Let's slow it down — a frame-by-frame breakdown
Say counter = 5 at this moment. Here's one way two threads can interleave — and lose an increment:
| Step | Thread A | Thread B |
|---|---|---|
| 1 | Reads `counter` → gets 5 | |
| 2 | | Reads `counter` → gets 5 |
| 3 | Adds 1, holds 6 in a register | |
| 4 | Writes 6 back to `counter` | |
| 5 | | Adds 1 to its stale 5, holds 6 |
| 6 | | Writes 6 → `counter` stays 6 ✗ |

Expected: `counter` = 7. Actual: 6. One increment vanished.
## Section 04: Where I've actually seen this hurt real systems
Textbook examples are fine, but let me give you the situations where this actually bites production systems.
A concert drops and 50,000 people hit the site at once. The code checks seats_available > 0, then books the seat. With 1 seat left and 200 concurrent requests, many of them pass the check before any of them decrements. You've now sold 40 tickets for 1 seat. The refund emails are not fun to write.
This isn't hypothetical. Major ticketing platforms have had this exact failure during high-demand releases. The fix isn't a better check — it's making the check-and-decrement atomic.
Two services read a shared config file, each modify a different key in memory, each write the full file back. Service B's write stomps on Service A's changes. Neither service errors. The config just quietly reverts. You spend an hour wondering why a feature you deployed last week keeps turning itself off.
You click a button 5 times fast. Each click triggers an async `setState` that reads `this.state.count`, adds 1, and writes back. By the time the second click fires, the first `setState` hasn't flushed. Both read the same stale value. You clicked 5 times, count went up by 2. This is why `setState(prev => prev + 1)` exists — it reads from the queued state, not a snapshot that might already be stale.
## Section 05: The check-then-act trap
Most race conditions share a common shape called check-then-act. You check a condition, then act on it — but the condition can change between the check and the act. Here's what that looks like in a Node.js order handler:
```javascript
async function purchaseItem(itemId, userId) {
  // Step 1: check if stock exists
  const item = await db.query(
    'SELECT quantity FROM items WHERE id = ?', [itemId]
  );

  if (item.quantity > 0) {
    // ⚠️ Right here. Another 400 requests just passed this check.
    // They all read quantity = 1. They all think they're fine.
    await db.query(
      'UPDATE items SET quantity = quantity - 1 WHERE id = ?', [itemId]
    );
    await db.query(
      'INSERT INTO orders (item_id, user_id) VALUES (?, ?)',
      [itemId, userId]
    );
  }
}
```
The check and the decrement are two separate database round-trips. In the gap between them, every other request in your queue can also pass the check. The `SELECT` and the `UPDATE` are not atomic. That gap is the bug.
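The same shape is easy to reproduce without a database. Here's a toy in-memory version of the handler — the `time.sleep` widens the gap so every run oversells:

```python
import threading
import time

stock = 1      # one item left
orders = []    # who got one

def purchase(user_id):
    global stock
    if stock > 0:          # the check
        time.sleep(0.01)   # the gap -- everyone else passes the check here
        stock -= 1         # the act
        orders.append(user_id)

threads = [threading.Thread(target=purchase, args=(i,)) for i in range(50)]
for t in threads: t.start()
for t in threads: t.join()

print(len(orders))  # far more than 1 -- the single item was oversold
print(stock)        # negative, just like the dashboard that started this story
```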
## Section 06: How to actually fix this
The good news: once you understand race conditions, the fixes are conceptually simple. You're always trying to achieve one of three things — eliminate sharing, eliminate mutation, or make the critical section atomic.
- **Locks / mutexes.** Only one thread can hold the lock at a time. Everything else waits. Simple, reliable, but can become a bottleneck under high contention.
- **Atomic operations.** Hardware instructions that complete without interruption — compare-and-swap, fetch-and-add. Very low overhead: the CPU itself guarantees atomicity.
- **Database transactions.** ACID guarantees exist exactly for this. Wrap the read and write in one transaction with the right isolation level and let the DB engine coordinate.
- **Optimistic concurrency control.** Read with a version stamp. Update only if the version still matches — retry on conflict. Excellent for low-contention workloads that need high throughput.
- **Queue serialisation.** Route all writes through a single-consumer queue. No two handlers can touch the same record simultaneously — the queue is your serialiser.
- **Immutability.** Don't mutate shared objects — create new ones. If nothing can be changed in place, there's nothing to race over. Functional programming's cleanest win.
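To make the optimistic-concurrency idea concrete, here's an illustrative Python sketch. The class and method names are mine, and a small internal lock stands in for the atomic compare that a real database or a hardware compare-and-swap would provide:

```python
import threading

class VersionedValue:
    """Optimistic concurrency sketch: a write succeeds only if the
    version hasn't changed since we read it; otherwise the caller retries."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self._guard = threading.Lock()   # stands in for the DB's atomic compare

    def read(self):
        with self._guard:
            return self.value, self.version

    def try_update(self, new_value, expected_version):
        with self._guard:
            if self.version != expected_version:
                return False             # someone else wrote first -- conflict
            self.value = new_value
            self.version += 1
            return True

counter = VersionedValue(0)

def increment():
    for _ in range(10_000):
        while True:                      # retry loop: re-read on conflict
            value, version = counter.read()
            if counter.try_update(value + 1, version):
                break

t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start(); t2.start()
t1.join(); t2.join()
print(counter.value)  # 20000 -- no lost updates, no long-held lock
```

Notice the trade-off: no thread ever blocks while holding the value, but under heavy contention the retry loop burns cycles — which is why this shines for low-contention workloads.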
Here's the fixed version of the order handler — making the check and decrement a single atomic SQL operation:
```javascript
async function purchaseItem(itemId, userId) {
  // The WHERE clause and SET happen as one atomic operation inside the DB.
  // 400 concurrent requests → exactly 1 succeeds if quantity is 1.
  const result = await db.query(`
    UPDATE items
    SET quantity = quantity - 1
    WHERE id = ? AND quantity > 0
  `, [itemId]);

  if (result.affectedRows === 0) {
    throw new Error('Out of stock'); // Clean failure, no oversell
  }

  await db.query(
    'INSERT INTO orders (item_id, user_id) VALUES (?, ?)',
    [itemId, userId]
  );
  // ✅ No gap. No race. If affectedRows is 0, we stop.
}
```
The database executes the `WHERE quantity > 0` check and the `SET quantity = quantity - 1` as a single indivisible unit. It doesn't matter how many requests fire this simultaneously — the database serialises the writes for you. Exactly one request gets the last item.
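The same pattern works in any SQL engine. Here's a self-contained sketch using Python's built-in `sqlite3`, with a table layout assumed to mirror the example above:

```python
import sqlite3

# In-memory stand-in for the items table (column names assumed).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, quantity INTEGER)")
db.execute("INSERT INTO items VALUES (1, 1)")   # one item left

def purchase_item(item_id):
    # Check and decrement in one statement: the DB applies it atomically.
    cur = db.execute(
        "UPDATE items SET quantity = quantity - 1 WHERE id = ? AND quantity > 0",
        (item_id,),
    )
    return cur.rowcount == 1   # True only if we actually got an item

print(purchase_item(1))  # True  -- got the last item
print(purchase_item(1))  # False -- quantity is 0, the WHERE matches nothing
```

`cur.rowcount` plays the role of `affectedRows`: zero rows touched means someone else got there first, and you report "out of stock" instead of overselling.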
And the Python fix — wrapping the critical section in a `threading.Lock`:
```python
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100_000):
        with lock:
            counter += 1  # Only one thread in here at a time.

t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start(); t2.start()
t1.join(); t2.join()

print(counter)  # 200,000. Every single time. ✅
```
## Section 07: Tools that find races you can't reproduce manually
The hardest thing about race conditions is that they don't show up in normal testing. They require specific timing — the exact wrong interleaving. These tools force that timing to happen:
| Tool | Language | How it works |
|---|---|---|
| ThreadSanitizer (TSan) | C, C++, Go, Rust | Instruments the binary at compile time; reports concurrent accesses at runtime with near-zero false positives |
| Helgrind / DRD | C, C++ | Valgrind plugins that detect POSIX thread misuse and lock-ordering violations |
| Java PathFinder | Java | Model checker — systematically explores every possible thread interleaving to find races |
| `go run -race` | Go | Built into the standard toolchain. Just add `-race` and run your tests. No setup needed. |
| Stress + slow tests | Python | Run concurrent tests with artificial scheduling delays to surface timing-sensitive bugs |
## Final thoughts: The real lesson from three days of chasing ghosts
Here's what those three days debugging negative inventory actually taught me: race conditions aren't a sign of careless code. They're a structural problem. The code I wrote was correct. It just wasn't concurrent-correct — and those are different things.
The mental model I use now: every time I write code that reads shared state and then acts on it, I ask "what happens if another thread runs this same code between my read and my act?" If the answer is "bad things," I need to make those two steps atomic.
Once you develop that instinct, race conditions stop being mysterious. They're just a class of bugs with a clear cause and a finite set of well-understood fixes. The checklist below is what I keep on my wall:
- Reach for atomic primitives — `std::atomic`, `AtomicInteger`, or `sync/atomic` — before a full mutex.
- Run your concurrent tests under a race detector — `go run -race` takes 2 seconds.

If this saved you from a Tuesday afternoon debugging session, I'll consider it worth writing.
Next up: Deadlocks — when the fix for a race condition introduces a new, equally annoying problem.

