The cheapest architecture is the one you understand well enough to push.
That is the bet behind this experiment. A Rust web service. SQLite in WAL mode. One small server. No database cluster. No queue farm. No managed auth bill that grows faster than the product.
The question was blunt: how far can a well-built monolith go before the physics win?
To find out, I took a correct but naive Single Sign-On service built with Rust, Axum, and SQLite in Write-Ahead Log (WAL) mode, then hit it with 10,000 virtual users.
It failed. Then it taught.
The useful part was the chain of bottlenecks: direct writes, runtime starvation, queue relocation, missing batching, CPU work in the writer, WAL checkpoint stalls, and finally the uncomfortable discovery that a smaller server could be more stable than a larger one.
The Monolith Was Correct, Then Load Found the Lie
Every tuning story starts with a system that works until it is asked to work hard.
This one is a multi-tenant Single Sign-On service for a suite of products.
The core architecture was deliberately small:
- Language and framework: Rust with Axum on the Tokio async runtime.
- Database: One SQLite database file in WAL mode.
- Feature under test: SSO plus the OAuth 2.0 Device Authorization Grant for CLI and desktop applications.
The hot path is the endpoint that starts the Device Flow. A client asks for a code. The server stores generated codes. Simple database write. Obvious implementation.
Here is the initial handler and database interaction:
// in handlers/auth.rs
pub async fn device_code(
    State(state): State<AppState>,
    Json(req): Json<DeviceCodeRequest>,
) -> Result<Json<DeviceCodeResponse>> {
    // 1. Validate the incoming request against the database (a read).
    // ... validation logic ...
    // 2. Call a service to create the codes and write them to the database.
    let device_code = DeviceFlowService::create_device_code(
        &state.pool,
        &req.client_id,
        &req.org,
        &req.service,
    ).await?;
    // 3. Return the generated codes to the client.
    Ok(Json(DeviceCodeResponse {
        device_code: device_code.device_code,
        user_code: device_code.user_code,
        // ... other fields
    }))
}

// in auth/device_flow.rs
pub async fn create_device_code(
    pool: &SqlitePool,
    client_id: &str,
    org_slug: &str,
    service_slug: &str,
) -> Result<DeviceCode> {
    // ... code generation logic ...
    // A single, straightforward database INSERT statement.
    sqlx::query(
        r#"
        INSERT INTO device_codes (id, device_code, user_code, ...)
        VALUES (?, ?, ?, ...)
        "#,
    )
    .bind(...)
    .execute(pool)
    .await?;
    // ... return the created object
}
This code is clean. The load test still broke it.
Load Testing With k6
The load test uses k6.
Load distribution:
- 70% subscription checks: GET /api/subscription
- 20% device flow writes: POST /auth/device/code and POST /auth/token
- 10% batch user reads: GET /api/user
We define success by a few key metrics:
- Throughput: The total number of successful requests per second.
- Median Latency (p50): The response time that 50% of users experience.
- Tail Latency (p99): The response time for the 99th percentile. This is where overloaded systems confess. The target is p(99) under 30 seconds.
- Failure Rate: The percentage of requests that result in an error.
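To make the metric definitions concrete, here is a minimal sketch of how a median, a p99, and a failure rate can be computed from raw samples. It uses the nearest-rank method, which is an assumption for illustration; k6 may compute its percentiles differently. The function names `percentile` and `failure_rate` are hypothetical.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// `p` is a whole percentage from 1 to 100. Returns None for an empty
/// sample set or an out-of-range `p`.
fn percentile(samples: &[f64], p: usize) -> Option<f64> {
    if samples.is_empty() || p == 0 || p > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Smallest rank such that at least p% of samples are at or below it.
    let rank = (p * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

/// Share of failed requests, as a percentage of all requests.
fn failure_rate(failed: u64, total: u64) -> f64 {
    if total == 0 {
        0.0
    } else {
        failed as f64 * 100.0 / total as f64
    }
}
```

For the samples 10, 20, 30, and 1000 ms, the median is 20 ms while the p99 is 1000 ms. A healthy median can sit next to a terrible tail, which is the pattern the results below show.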
First Result: Collapse
After ten minutes, the numbers were bad enough to be useful.
| Metric | Value |
|---|---|
| http_req_duration (p99) | 54,360 ms |
| http_req_duration (med) | 162 ms |
| http_reqs | 963,611 |
| http_req_failed | 14.2% |
A p99 latency of nearly a minute is unacceptable. A 14% failure rate means the service is fundamentally broken under load. The logs from the server tell the story.
The server logs are flooded with a single, ominous warning at around 200 concurrent users:
sso-server | WARN sqlx::query: slow statement: execution time exceeded alert threshold ... summary="INSERT INTO device_codes..." ... elapsed=2.24s
Diagnosis
The first issue is Write Contention. WAL mode lets many readers proceed while a writer is active, but SQLite still serializes writes. Only one write happens at a time.
At 200 concurrent users, requests to INSERT into device_codes arrived faster than SQLite could process them. The queue became the system.
A secondary effect is Async Runtime Thread Starvation. Axum runs on Tokio. Tokio uses a small fixed number of OS threads, 4 in this test. When handlers pile up around slow database work, the runtime spends its energy managing blocked I/O instead of polling new network work.
One contended write turned into a service-wide failure.
In an async system, a single slow, blocking operation on a hot path doesn’t just slow down requests, it can consume the entire runtime and trigger a total service outage.
Move the Queue Out of the Web Handler
The handler should handle HTTP. The writer should write.
The first version made every web handler wait on the database writer directly. That ties network responsiveness to the slowest contended write path.
The threads handling network I/O must stay free to accept new work. Contended writes belong somewhere else.
The fix is a small version of the Actor Model: a dedicated background task whose only job is writing to the database. Web handlers send write commands over an async channel.
This has two immediate benefits:
- Sending a message on a channel is an extremely fast, non-blocking operation. The web handler is freed up to handle the next incoming request.
- The writes are naturally serialized by the single receiving task, which perfectly matches SQLite’s single-writer limitation.
We use a Tokio MPSC (Multi-Producer, Single-Consumer) channel.
First, define the message and spawn the writer in main.rs:
// A message type to send to our writer task.
// It includes a `oneshot` channel for the writer to send the result back.
pub enum DbRequest {
    CreateDeviceCode {
        // ... fields
        responder: oneshot::Sender<Result<DeviceCode>>,
    },
}

// The writer task itself.
async fn db_writer_task(pool: SqlitePool, mut rx: mpsc::Receiver<DbRequest>) {
    while let Some(req) = rx.recv().await {
        match req {
            DbRequest::CreateDeviceCode { responder, ... } => {
                let result = DeviceFlowService::create_device_code(&pool, ...).await;
                // Send the result back to the waiting handler.
                let _ = responder.send(result);
            }
        }
    }
}

// In `main()`
let (tx, rx) = mpsc::channel::<DbRequest>(1024);
tokio::spawn(db_writer_task(pool.clone(), rx));
// The channel sender `tx` is added to the shared application state.
let app_state = AppState { db_tx: tx, ... };
Then send write requests from handlers/auth.rs:
pub async fn device_code(
    State(state): State<AppState>,
    Json(req): Json<DeviceCodeRequest>,
) -> Result<Json<DeviceCodeResponse>> {
    // ... validation logic ...
    // Create a `oneshot` channel to receive the response.
    let (tx, rx) = oneshot::channel();
    let db_request = DbRequest::CreateDeviceCode { responder: tx, ... };
    // Send the message. This returns immediately unless the bounded
    // channel is full, in which case it waits (without blocking the thread) for space.
    state.db_tx.send(db_request).await?;
    // Asynchronously wait for the response from the writer task.
    // While waiting, this task yields and the worker thread is free for other work.
    let device_code = rx.await??;
    Ok(Json(DeviceCodeResponse { ... }))
}
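The request/reply pattern itself does not depend on Tokio. The following is a minimal sketch of the same idea using only the standard library, with an OS thread standing in for the async writer task and a `Vec` standing in for the database. `WriteRequest`, `spawn_writer`, and `submit` are hypothetical names, not part of the service.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical write command: a payload plus a reply channel,
/// mirroring the `responder` field in `DbRequest`.
struct WriteRequest {
    device_code: String,
    responder: mpsc::Sender<Result<usize, String>>,
}

/// Spawns the single writer. It owns the "database" (a Vec here),
/// so writes are serialized without any lock.
fn spawn_writer(
    capacity: usize,
) -> (mpsc::SyncSender<WriteRequest>, thread::JoinHandle<Vec<String>>) {
    let (tx, rx) = mpsc::sync_channel::<WriteRequest>(capacity);
    let handle = thread::spawn(move || {
        let mut table: Vec<String> = Vec::new();
        // The loop ends once every sender has been dropped.
        for req in rx {
            table.push(req.device_code);
            // Ignore a send error: the caller may have stopped waiting.
            let _ = req.responder.send(Ok(table.len()));
        }
        table
    });
    (tx, handle)
}

/// What a handler does: send the command, then wait for the reply.
fn submit(tx: &mpsc::SyncSender<WriteRequest>, device_code: &str) -> Result<usize, String> {
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(WriteRequest {
        device_code: device_code.to_string(),
        responder: reply_tx,
    })
    .map_err(|_| "writer stopped".to_string())?;
    reply_rx
        .recv()
        .map_err(|_| "writer dropped the reply".to_string())?
}
```

The caller still waits for its own write to finish. What changes is where the waiting happens: in a queue in front of one writer rather than in contention on the database itself.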
The Second Result
The next test proves the decoupling worked and also proves decoupling is not enough.
| Metric | Value |
|---|---|
| http_req_duration (p99) | 56,622 ms |
| http_req_duration (med) | 0.701 ms |
| http_reqs | 900,911 |
| http_req_failed | 16.0% |
Median latency dropped from 162ms to 0.7ms. The web handlers were no longer blocked. Good.
The p99 was still terrible. The failure rate was still high. Also good, because now the next bottleneck was visible.
Diagnosis
We solved thread starvation. We did not solve slow writes. We moved the queue from the web server into the MPSC channel.
The writer task was still doing thousands of individual transactions. The handlers became fast enough to overwhelm it more efficiently.
Decoupling a bottleneck doesn’t eliminate it; it moves it. True optimization requires addressing the root cause of the slowness, not just shuffling the queue.
Batching Made the Writer Honest
The writer task was chatty. Every request opened a transaction, inserted one row, and committed.
That is expensive at scale.
One transaction inserting 200 rows is far cheaper than 200 transactions inserting one row each.
So the writer becomes a batch processor. It waits for a message, greedily drains more messages up to a limit, and writes them inside one transaction.
Here is the batching loop and the function that writes each batch:
const BATCH_SIZE: usize = 256;
const BATCH_TIMEOUT: std::time::Duration = std::time::Duration::from_millis(5);

async fn db_writer_task(pool: SqlitePool, mut rx: mpsc::Receiver<DbRequest>) {
    let mut batch = Vec::with_capacity(BATCH_SIZE);
    loop {
        // Wait for a message, but with a timeout.
        let msg = tokio::time::timeout(BATCH_TIMEOUT, rx.recv()).await;
        match msg {
            Ok(Some(req)) => {
                batch.push(req);
                // If the batch is full, process it.
                if batch.len() >= BATCH_SIZE {
                    process_batch(&pool, std::mem::take(&mut batch)).await;
                }
            }
            Err(_) => { // Timeout elapsed
                // If there's anything in the batch, process it.
                if !batch.is_empty() {
                    process_batch(&pool, std::mem::take(&mut batch)).await;
                }
            }
            Ok(None) => break, // Channel closed
        }
    }
    // ... process any remaining items
}
async fn process_batch(pool: &SqlitePool, batch: Vec<DbRequest>) {
    // Start a single transaction.
    let mut transaction = pool.begin().await.unwrap();
    // Dynamically build a single multi-row INSERT statement.
    // e.g., INSERT INTO ... VALUES (...), (...), (...), ...
    // ... logic to build the query and bind all parameters ...
    match large_insert_query.execute(&mut *transaction).await {
        // Commit the single transaction before confirming anything.
        Ok(_) => match transaction.commit().await {
            Ok(()) => {
                // Rows are committed: send success back to all waiting handlers.
            }
            Err(e) => {
                // Commit failed: send the error back to all waiting handlers.
            }
        },
        Err(e) => {
            // The transaction is rolled back when it is dropped.
            // Send the error back to all waiting handlers.
        }
    }
}
This processes a batch either when it is full or when a short timeout (5ms) expires, preserving latency under lighter loads.
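The "full or timed out" rule can be sketched with only the standard library. This is a hedged variation, not the service's code: it blocks for the first message and then measures the wait from that first message, whereas the Tokio loop above restarts its 5ms timer on every message. `collect_batch` is a hypothetical name, and `max_size` is assumed to be at least 1.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Blocks for the first message, then keeps collecting until the batch
/// is full or `max_wait` has elapsed since that first message.
/// Returns None once the channel is closed and empty.
fn collect_batch<T>(rx: &Receiver<T>, max_size: usize, max_wait: Duration) -> Option<Vec<T>> {
    let first = rx.recv().ok()?;
    let deadline = Instant::now() + max_wait;
    let mut batch = Vec::with_capacity(max_size);
    batch.push(first);
    while batch.len() < max_size {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Deadline hit or all senders gone: flush what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

Anchoring the deadline to the first message puts an upper bound on how long any single request waits for its batch to start, at the cost of slightly smaller batches under a slow trickle of traffic.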
The Third Result
The next run finally crosses the target.
| Metric | Value |
|---|---|
| http_req_duration (p99) | 29,597 ms |
| http_req_duration (med) | 1.5 ms |
| http_reqs | 1,094,883 |
| http_req_failed | 14.1% |
P99 latency is now under 30 seconds. Throughput increased by about 20% to more than 1 million requests. Median latency stayed low.
Batching was right. The next bottleneck was waiting.
Diagnosis
The writer was still doing work it should not do. It received a batch of requests, then generated IDs and codes inside the single-threaded writer task.
// Inside the single-threaded writer task...
for req in &batch {
    // This work is happening sequentially on the writer's thread!
    let id = uuid::Uuid::new_v4().to_string();
    let device_code = DeviceFlowService::generate_device_code();
    let user_code = DeviceFlowService::generate_user_code();
    // ...
}
While the writer generated 256 UUIDs and user codes, writes waited. The MPSC channel filled again.
We had accidentally serialized CPU-bound generation onto the single I/O worker.
A dedicated I/O task must be ruthlessly focused on I/O. Offloading CPU-bound work to an I/O task can inadvertently re-serialize your application logic and become a new bottleneck.
Keep CPU Work Away From the I/O Worker
The writer should write. Nothing else.
Code generation is CPU-bound and parallelizable. The web handlers are the right place for it. The writer task is I/O-bound and must stay focused.
The fix is to send fully formed data over the channel. The writer binds values and commits.
The handler now does the generation work:
pub async fn device_code(
    State(state): State<AppState>,
    Json(req): Json<DeviceCodeRequest>,
) -> Result<Json<DeviceCodeResponse>> {
    // ... validation ...
    // Perform CPU-bound work here, in the parallel web handler.
    let id = Uuid::new_v4().to_string();
    let device_code = DeviceFlowService::generate_device_code();
    let user_code = DeviceFlowService::generate_user_code();
    // Create a `oneshot` channel so the writer can confirm the insert.
    let (tx, rx) = oneshot::channel();
    // The message now contains the data, not the request to create it.
    let db_request = DbRequest::CreateDeviceCode {
        id,
        device_code: device_code.clone(),
        user_code: user_code.clone(),
        responder: tx,
        // ...
    };
    // Send and wait for confirmation.
    state.db_tx.send(db_request).await?;
    let _ = rx.await??;
    Ok(Json(DeviceCodeResponse { device_code, user_code, ... }))
}
The writer’s process_batch function no longer needs to generate anything. It simply receives a batch of messages and binds the data they contain. This makes the writer itself much faster and more focused.
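As an illustration of what "binds the data they contain" can look like, here is a minimal sketch that builds one multi-row INSERT and a flat list of values in bind order. `DeviceCodeRow` and `build_batch_insert` are hypothetical, and only the three columns named in the article are included; the real table has more.

```rust
/// Hypothetical fully formed row, generated by the handler before sending.
struct DeviceCodeRow {
    id: String,
    device_code: String,
    user_code: String,
}

/// Builds one multi-row INSERT with `?` placeholders and returns the
/// values in bind order. Returns None for an empty batch.
fn build_batch_insert(rows: &[DeviceCodeRow]) -> Option<(String, Vec<&str>)> {
    if rows.is_empty() {
        return None;
    }
    // One "(?, ?, ?)" group per row, joined by commas.
    let placeholders = vec!["(?, ?, ?)"; rows.len()].join(", ");
    let sql = format!(
        "INSERT INTO device_codes (id, device_code, user_code) VALUES {}",
        placeholders
    );
    // Flatten the rows into a single list, row by row, column by column.
    let binds = rows
        .iter()
        .flat_map(|r| [r.id.as_str(), r.device_code.as_str(), r.user_code.as_str()])
        .collect();
    Some((sql, binds))
}
```

One thing to check when sizing batches: SQLite caps the number of bound parameters per statement (999 by default in older builds, 32,766 since version 3.32.0), so rows per batch times columns per row has to stay under that limit.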
The Fourth Result
The next result was the most useful disappointment.
| Metric | Value |
|---|---|
| p99 Latency | 25,778 ms |
| Median Latency | 1.8 ms |
| Total Requests | 1,075,562 |
| Failure Rate | 12.4% |
P99 improved a little. Failure rate improved a little. No breakthrough.
The design was better, but the bottleneck had moved below the application.
Confronting the Physical Limits
While running the fourth load test, this was logged at around 500 concurrent users:
sso-server | WARN sqlx::query: slow statement: execution time exceeded alert threshold summary="PRAGMA wal_checkpoint(TRUNCATE);" ... elapsed=5.89s
This log is now the only “slow query” warning.
The Final Bottleneck
The Write-Ahead Log is fast because inserts append to the -wal file. But that data eventually has to move into the main .db file. That process is a checkpoint.
Under this load, the -wal file grew fast. The background task running PRAGMA wal_checkpoint(TRUNCATE) every 10 seconds had to do heavy I/O. To do it safely, it needed an exclusive write lock.
For the 5-6 seconds spent on that I/O, the database was locked for writes.
This is the source of our tail latency.
- Our db_writer_task runs, processing batches and writing to the WAL file at lightning speed.
- Every 10 seconds, the checkpoint task kicks in and locks the database for 6 seconds.
- During this 6-second “stall,” our db_writer_task is completely blocked. When it tries to begin a transaction, it simply waits.
- Meanwhile, the MPSC channel, which has a buffer of 16,384 messages, continues to fill with tens of thousands of requests from the hyper-efficient Axum handlers.
- When the checkpoint finishes and releases the lock, the writer task is faced with a colossal backlog. The requests that arrived during the stall are the ones that experience the 25-second latency.
This could be mitigated by switching to wal_checkpoint(PASSIVE), which does not block the writer, but that risks a runaway -wal file size.
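The stall arithmetic can be written down as a small model. This is a rough sketch, not a measurement: the 6-second stall and 10-second interval come from the article, while any arrival and write rates passed in are assumptions the reader must supply. All three function names are hypothetical.

```rust
/// Fraction of wall-clock time the writer is blocked when a checkpoint
/// lasting `stall_secs` runs once every `interval_secs`.
fn blocked_fraction(stall_secs: f64, interval_secs: f64) -> f64 {
    (stall_secs / interval_secs).clamp(0.0, 1.0)
}

/// Write requests that pile up while the writer is stalled.
fn backlog_after_stall(arrivals_per_sec: f64, stall_secs: f64) -> f64 {
    arrivals_per_sec * stall_secs
}

/// Seconds needed to clear `backlog` once writes resume. Returns None
/// when the writer is no faster than the arrival rate, because the
/// queue then never drains.
fn drain_secs(backlog: f64, writes_per_sec: f64, arrivals_per_sec: f64) -> Option<f64> {
    let spare = writes_per_sec - arrivals_per_sec;
    if spare > 0.0 {
        Some(backlog / spare)
    } else {
        None
    }
}
```

With the article's numbers, `blocked_fraction(6.0, 10.0)` is 0.6: the writer is locked out for 60% of the run. As a purely hypothetical example, 2,000 writes per second arriving during a 6-second stall leaves a backlog of 12,000, and a writer that can sustain 5,000 writes per second needs 4 seconds to clear it, which is all the time it has before the next checkpoint.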
The Rust code was no longer the limit. Disk I/O and SQLite’s checkpoint mechanics were.
The goal of performance tuning is to eliminate application-level bottlenecks until your performance is dictated by the known, physical limits of your hardware and platform. Reaching this wall is a form of success.
Finding Balance and the True Limit
The 10,000 VU test found a wall. But the test machine itself was also under strain.
That raised a better question: was 10,000 VUs measuring the server, or was it measuring the whole test environment falling apart?
Two final tests clarified the limit.
Finding the Sustainable Limit at 6,000 Users
First, cap the load at 6,000 VUs.
Test Results at 6,000 Virtual Users (4-vCPU):
| Metric | Value |
|---|---|
| p99 Latency | 11,995 ms (12 seconds) |
| Throughput | 2,304 req/s |
| Failure Rate | 0.02% (Virtually Zero) |
The failure rate vanished. P99 was cut to 12 seconds. Throughput increased by more than 20% over the 10,000 VU runs.
That is overloaded-system behavior. Reduce pressure to a sustainable level and the system becomes more efficient. The cliff was real.
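The throughput comparison can be checked with a few lines of arithmetic. This sketch assumes the fourth 10,000 VU run lasted ten minutes (600 seconds), like the first run; the article does not state the duration of the later runs. The function names are hypothetical.

```rust
/// Average requests per second over a test run.
fn requests_per_sec(total_requests: u64, duration_secs: u64) -> f64 {
    total_requests as f64 / duration_secs as f64
}

/// Relative change from `before` to `after`, in percent.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

Under that assumption, `requests_per_sec(1_075_562, 600)` is about 1,793 req/s, and `percent_change` from there to 2,304 req/s is roughly 28.5%. Fewer virtual users produced more completed work.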
The Paradox
The 4-vCPU server had another problem: it was unbalanced.
Four producer threads could flood one disk-limited writer. So the next test used the most constrained environment: a tiny 1-vCPU server with only 1GB of RAM.
| Metric | Value |
|---|---|
| p99 Latency | 16,086 ms (16 seconds) |
| Throughput | 1,766 req/s |
| Failure Rate | 0.0001% (Effectively Zero) |
On a server with a quarter of the CPU and RAM, the service became perfectly stable. Throughput dropped by about 23% (from 2,304 to 1,766 req/s), but the failure rate disappeared.
The reason is system balance. On the 1-vCPU machine, CPU became the governor. Web handlers competed for CPU time with the writer task. That implicitly rate-limited producers to a pace the I/O subsystem could handle.
The huge channel backlog did not form. The WAL grew less aggressively. Checkpoint stalls mattered less.
This is Theory of Constraints in practice. Over-provision CPU relative to disk and you can create instability. A smaller, balanced server can be slower and more reliable.
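On the 1-vCPU machine the rate limit was an accident of CPU scarcity. One way to make it explicit, which the article did not test, is to refuse work when the bounded queue is full instead of waiting for space. The sketch below uses the standard library's `SyncSender::try_send`; Tokio's bounded `mpsc::Sender` offers a `try_send` as well. `Offer` and `offer` are hypothetical names.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Outcome of offering a write to a bounded queue.
#[derive(Debug, PartialEq)]
enum Offer {
    Accepted,
    /// Queue full: the caller can fail fast, for example with HTTP 503.
    Rejected,
    /// The writer has shut down and will never drain the queue.
    WriterGone,
}

/// Tries to enqueue `item` without waiting for space.
fn offer<T>(tx: &SyncSender<T>, item: T) -> Offer {
    match tx.try_send(item) {
        Ok(()) => Offer::Accepted,
        Err(TrySendError::Full(_)) => Offer::Rejected,
        Err(TrySendError::Disconnected(_)) => Offer::WriterGone,
    }
}
```

Rejecting early trades a quick, visible error for a request that would otherwise sit in the queue for many seconds. Whether that is the better failure mode depends on how clients retry.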
The Economic Point
The experiment started with a simple question: how far can a monolith go?
The answer is farther than the default industry sales pitch suggests.
A clean Rust and SQLite service failed under pressure. Decoupling, batching, and separation of CPU-bound and I/O-bound work moved the bottleneck out of the application and down into disk I/O.
The final stable results, even on a modest 1-vCPU server, changed the business argument:
- Peak Throughput: A sustained ~1,700 requests per second.
- Median Latency (p50): A near-instant 2.3 milliseconds.
- Tail Latency (p99): A bounded 16 seconds under peak stress.
- Reliability: A virtually perfect 99.9999% success rate (a 0.0001% failure rate) under a peak load of 6,000 virtual users.
Model typical user behavior as one request every 10 seconds during peak hours, and this becomes a practical capacity claim:
This monolith can comfortably serve between 500,000 and 2,000,000 Monthly Active Users.
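The step from requests per second to monthly active users needs one more input that the article leaves implicit: what share of monthly users is online at the busiest moment. The sketch below makes that explicit. The peak-online shares of 3.4% and 0.85% are assumptions back-solved from the stated 500,000 to 2,000,000 range, not measured values, and the function names are hypothetical.

```rust
/// Users the service can carry at once if each one sends a request
/// every `secs_between_requests` seconds.
fn concurrent_users(requests_per_sec: f64, secs_between_requests: f64) -> f64 {
    requests_per_sec * secs_between_requests
}

/// Monthly active users implied by a concurrent-user capacity, given the
/// share of MAU assumed to be online at the busiest moment.
fn implied_mau(concurrent: f64, peak_online_share: f64) -> f64 {
    concurrent / peak_online_share
}
```

At ~1,700 req/s and one request per user every 10 seconds, the service carries about 17,000 concurrent users. If 3.4% of monthly users are online at peak, that is 500,000 MAU; if only 0.85% are, it is 2,000,000. The claim is only as good as that ratio, so it is worth measuring for a real product.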
The technical result is interesting. The financial implication is the real punch.
A Practical Cost Analysis
Hosted platforms like Auth0, Clerk, and auth components from providers like Supabase sell convenience. That used to be an easier trade.
Today, AI-assisted development changes the calculation. You do not need to dig through 10 search results and an old forum thread to make progress. You can get useful debugging help in seconds.
The managed service pitch still has value, especially for teams buying speed and risk reduction. But the cost at scale deserves scrutiny.
For 500,000 Monthly Active Users (MAU):
Cost of Our Monolith: The 1-vCPU, 1GB RAM server that delivered our stable results can be provisioned from a cloud provider for at most $5 per month. Including backups and data transfer, a generous, all-in operational cost would be ~$10 per month.
Cost of Hosted Alternatives:
- Supabase/Firebase Auth: These services offer generous free tiers (typically 50,000 MAU). Beyond that, they charge per user. At 500,000 MAU, the cost would be approximately $1,350 per month.
- Clerk.dev: Their “Pro” plan, aimed at scaling applications, is priced per MAU. At 500,000 MAU, the cost would be $10,000 per month.
- Auth0/Okta: These enterprise-grade platforms offer more complex features, and their pricing reflects that. A plan supporting our feature set (multi-tenancy, custom domains, device flow) for 500,000 MAU would almost certainly be well over $15,000 per month.
The financial reality is stark.
| Service | Estimated Monthly Cost at 500,000 MAU |
|---|---|
| Our Rust + SQLite Monolith | ~$10 |
| Supabase / Firebase Auth | ~$1,350 |
| Clerk.dev (Pro Plan) | ~$10,000 |
| Auth0 / Okta (Professional / Custom) | ~$15,000+ |
This is a strategic difference.
At scale, the efficient monolith offers a 100x to 1000x cost advantage over managed counterparts. That can mean more than $100,000 per year staying available for product, marketing, hiring, or runway.
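The multiples follow directly from the table. A minimal sketch of the arithmetic, using the article's estimated monthly figures as inputs (the function names are hypothetical):

```rust
/// How many times more expensive `other_monthly` is than `baseline_monthly`.
fn cost_multiple(baseline_monthly: f64, other_monthly: f64) -> f64 {
    other_monthly / baseline_monthly
}

/// Money kept per year by paying `baseline_monthly` instead of `other_monthly`.
fn annual_savings(baseline_monthly: f64, other_monthly: f64) -> f64 {
    (other_monthly - baseline_monthly) * 12.0
}
```

Against the ~$10 baseline, $1,350 is 135x, $10,000 is 1,000x, and $15,000 is 1,500x. The annual difference is $16,080, $119,880, and $179,880 respectively, so the "more than $100,000 per year" figure applies to the two most expensive rows.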
Managed services are a trade. Familiar logos do not make the trade automatically wise.
This experiment proves another path exists. Rust, SQLite, and disciplined engineering can build systems that are performant and economically serious.
True scalability starts with understanding the limits of the simple system first.
The decision to invest in engineering craftsmanship is not just a technical choice; it’s one of the most significant financial decisions a business can make.




