Respect SQLite Before Replacing It

Respect SQLite Before Replacing It

AuthOS looked like it had a SQLite problem. It really had blocking crypto, noisy writes, audit logs on the hot path, and no load-shedding discipline.

Mike Chumba Mike Chumba
7 min read
1341 words

The first load test made SQLite look guilty.

AuthOS, a Rust authentication platform with SQLite, PostgreSQL, and MySQL support, collapsed at roughly 130 requests per second with a 17.69 second P95 on the SQLite backend.

The lazy conclusion would have been obvious: SQLite cannot handle serious auth traffic.

The lazy conclusion was wrong.

The application was asking SQLite, Tokio, and Argon2 to do the wrong work on the wrong path. Blocking crypto sat on the async runtime. Audit logs wrote synchronously. “Last seen” updates spammed the database. Saturation turned into hoarding instead of rejection.

After moving CPU work off the runtime, buffering audit writes, debouncing noisy updates, and failing fast under pressure, the SQLite backend reached a shape that can plausibly serve 40 million daily requests on a single node.

The fix started with respecting SQLite’s actual shape.


The Starting Point Was a Death Spiral

The first version worked. That is the dangerous part.

The SSO API handled real flows: Platform Owners approving organizations, admins managing webhooks, and users logging in through OAuth. It was correct enough to demo and complete enough to trust.

Then load testing stripped away the comfort. The system collapsed at about 130 requests per second.

  • Throughput: ~130 req/s (Saturated)
  • Latency (P95): 17.69 seconds (Unusable)
  • Diagnosis: The system was suffering from Resource Starvation.
    1. CPU Blocking: The heavy lifting of cryptography (specifically the Argon2 password hashing used during user registration and login) was running on the async event loop, choking the runtime.
    2. Write Contention: Every single API request was fighting for a lock. Whether it was the auth_callback handler updating a “last seen” timestamp or the audit logger recording a user.login event, synchronous writes were locking the SQLite database, forcing every read operation to queue up indefinitely.

Abstract Architecture Comparison Fig 1: Evolution from a blocking, synchronous monolith to a non-blocking, event-driven architecture.


The Fix Was Discipline, Not a New Database

SQLite was not the enemy. Our request path was.

The fixes were not exotic. Move blocking CPU work away from the event loop. Stop making every request fight for a write lock. Stop writing useless activity updates. Admit saturation early.

1. Get Argon2 Off The Event Loop

  • The Problem: One login could freeze the server for everyone else. Argon2 verification halted the event loop for about 50ms.
  • The Solution: I moved the heavy cryptographic work to tokio::task::spawn_blocking and guarded it with a Semaphore limited to 2x my CPU cores.
  • The Result: Authentication became isolated and throttled. A login spike could no longer punish users already inside the system.
// BEFORE: Blocking the async runtime
async fn verify_password(password: &str, hash: &str) -> bool {
    // ✗ Runs on the main thread, freezing the server for 50ms
    argon2::verify(password, hash) 
}

// AFTER: Offloaded with concurrency limits
static CPU_LIMIT: Semaphore = Semaphore::new(8); // Max 8 concurrent hashes

async fn verify_password(password: &str, hash: &str) -> Result<bool> {
    // ✓ Acquire permit (wait if too busy)
    let _permit = CPU_LIMIT.acquire().await?; 
    
    // ✓ Run on a background thread, unblocking the API
    tokio::task::spawn_blocking(move || {
        argon2::verify(password, hash) 
    }).await?
}

2. Move Audit Writes Out Of The Request Path

  • The Problem: Strict audit trails were killing latency. Every request triggered a disk write. In SQLite, every write competes for exclusive access.
  • The Solution: I implemented an Asynchronous Buffered Actor.
  • The Trade-off: Audit logs can sit in RAM for about 1 second before being flushed.
  • The Result: User requests return immediately. The actor batch writes and retries in the background, absorbing SQLite BUSY errors without making the user wait.
// BEFORE: Synchronous Writer
async fn log_event(event: Event) {
    // ✗ API request hangs here until disk I/O completes
    db.insert(event).await; 
}

// AFTER: Asynchronous Actor
async fn log_event(event: Event) {
    // ✓ Instant return! Pushes to an in-memory channel
    audit_sender.send(event).await; 
}

// Background Worker (The Actor)
async fn run_audit_actor(mut receiver: Receiver<Event>) {
    let mut buffer = Vec::new();
    while let Some(event) = receiver.recv().await {
        buffer.push(event);
        if buffer.len() >= 100 || timeout_elapsed() {
            // ✓ Efficient batch write in background
            db.insert_batch(buffer).await; 
            buffer.clear();
        }
    }
}

Abstract Audit Actor Pattern Fig 2: The Buffered Actor pattern decouples high-volume writes from the critical latency path.

3. Stop Writing The Same Thing

  • The Problem: “Last Seen” updates were written on every request. That flooded the Write-Ahead Log (WAL) for no useful reason.
  • The Solution: I implemented Debouncing. The system only writes the timestamp if more than 5 minutes have passed.
  • The Result: Write IOPS dropped by ~90%.

Traffic Hygiene Diagram Fig 3: Smart debouncing prevents the database from drowning in identical updates.

// BEFORE: Write on every request
async fn track_activity(user_id: Uuid) {
    // ✗ Floods the WAL with identical updates
    db.update_last_seen(user_id, now()).await; 
}

// AFTER: Smart Debouncing
async fn track_activity(user_id: Uuid) {
    let last_seen = cache.get(user_id);
    
    // ✓ Only write if > 5 minutes have passed
    if last_seen < now() - Duration::from_mins(5) {
        db.update_last_seen(user_id, now()).await;
        cache.set(user_id, now());
    }
}

4. Fail Fast Instead Of Hoarding Work

  • The Problem: At 1000 concurrent users, the queue kept growing until requests timed out after 30 seconds.
  • The Solution: I implemented TryAcquire semantics on the Login Semaphore and set the SQLite Writer timeout to a strict 3 seconds.
  • The Result: The system now knows its limit. It sheds load with 429 Too Many Requests when saturated, so accepted requests are not buried behind work that should have been rejected.

Fail Fast / Load Shedding Diagram Fig 4: The semaphore rejects excess work before accepted requests drown behind it.

// BEFORE: Unbounded Queue
async fn login(creds: Credentials) -> Result<Token> {
    // ✗ Queues forever if system is overloaded (30s timeouts)
    let _permit = CPU_LIMIT.acquire().await?; 
    process_login(creds).await
}

// AFTER: Fail Fast Protection
async fn login(creds: Credentials) -> Result<Token> {
    // ✓ Fails instantly if queue is full
    let _permit = CPU_LIMIT.try_acquire().or_else(|_| {
        Err(Error::TooManyRequests("System saturated, try again later"))
    })?;
    
    process_login(creds).await
}

The Final Scoreboard

The result is a useful defense of a SQLite-first architecture. Not blind faith. Measured faith.

MetricInitial StateOptimized State (Moderate Load)Optimized State (Saturation)
Throughput~135 req/s (Unstable)210 req/s (Stable)467 req/s (Maxed)
P95 Latency17,690 ms4.88 ms2,720 ms
Error Rate0% (but unusable)0.00%0.01%
BottleneckCPU & DB LocksNone (Server Bored)DB Connection Pool Limit

Abstract Performance Metrics Fig 3: Latency dropped by 4 orders of magnitude while throughput tripled.


Key Engineering Takeaways

  1. SQLite can carry serious work: With WAL mode, a dedicated Writer Actor, and IMMEDIATE transactions, SQLite sustained 467 requests/second mixed with heavy writes. That is enough for serious software.
  2. Async $\neq$ Non-blocking: Marking a function async does not make CPU work disappear. Argon2 had to leave the runtime path.
  3. ACID is a Design Choice: Critical user data stays strictly consistent. Audit logs and telemetry can be eventually consistent when the product allows it.
  4. The 5ms Win Matters: A 4.88ms P95 latency on a full authentication flow, including database work, application logic, and crypto, is the kind of result people usually assume requires a more expensive architecture.

Complexity Has To Earn Its Place

This system can handle about 40 million requests per day on a single node using a local file database.

That does not mean every system should use SQLite forever. It means complexity should earn its place. Postgres, Redis, and microservices are not moral upgrades. They are tools for specific pressure.

Most teams do not need to escape SQLite. They need to stop abusing it.


Appendix: The 25M MAU Estimation

The 25 Million Monthly Active Users (MAUs) estimate comes from the measured moderate-load result:

The Math:

  1. Benchmark Throughput: The optimized system sustains 210 requests/second at moderate load.
  2. Daily Capacity: 210 RPS $\times$ 86,400 seconds = ~18.1 Million requests/day.
  3. Requests per User: I assume an average of 4 authentication requests per Daily Active User (DAU).
  4. DAU Capacity: 18.1M / 4 = 4.5 Million DAUs.
  5. DAU/MAU Ratio: Assuming a standard engagement ratio of 20%: $$ 4.5\text{M DAU} / 0.20 = \textbf{22.5 Million MAUs} $$

I rounded this to 25M because the saturation limit was higher at 467 RPS, which suggests more headroom.

References:

  • Estimating MAU from DAU: “The DAU/MAU ratio helps estimate the total active user base from daily traffic. For SaaS, a 13–20% ratio is a standard benchmark.” (Mixpanel: What is DAU/MAU? )