Tell me about a time you had to make an important technical decision without complete information.

Question

Accepted Answer

Last year we were building a clickstream ingestion pipeline — 50k events/sec at launch, projected 200k within a year, with a hard launch deadline tied to a partner integration three weeks out.

The decision was the queueing layer between the edge collectors and the warehouse loaders. There were two viable options: Kafka, which the rest of the org runs, or a managed service that would save us 2-3 weeks of ops work but was priced on throughput. The unknown was peak behavior: we had no time to load-test either at 200k/sec, and neither vendor had a public benchmark for our exact write pattern.

Rather than guess, I scoped the unknown into three things I could actually answer in a week: (1) does Kafka rebalance cleanly at our partition count when we hit 100k/sec — testable at half-scale in our staging cluster; (2) does the managed service handle our specific write pattern — I emailed the vendor's SE and got a customer reference at 80k/sec with a similar pattern; (3) what's the total cost-of-ownership over 18 months at 200k/sec — back-of-envelope math, but enough to bracket it.

What I learned from those three: Kafka rebalanced fine at half-scale, the managed service's reference customer was very happy but acknowledged occasional throttling at peak, and the 18-month TCO came out roughly equal — managed service slightly cheaper at low scale, Kafka slightly cheaper at high scale.

I made the call to go with Kafka, for reasons that weren't the ones I would've named at the start. The cost wasn't the deciding factor — it was the throttling note from the reference customer combined with our partner SLA. If our pipeline throttled during a launch event, we'd be paying for an incident we couldn't easily debug at the vendor layer. With Kafka, debugging would be painful but possible.

We launched on time and held 50k/sec without issues. At month four we hit 150k/sec and started seeing the rebalancing edge cases we'd predicted from the half-scale test, which we'd already drafted a runbook for. The big thing I learned: I'd been about to pick Kafka for the wrong reason ('we already run it'), and forcing myself to decompose the unknowns made me realize the actual decision driver was debuggability during incidents. I've used that decomposition pattern on every major call since.
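
For readers who want to see what the back-of-envelope TCO step in this answer might look like, here is a minimal sketch. It assumes each option can be approximated by a fixed monthly cost plus a per-volume cost. The `CostModel` class, the helper names, and every dollar figure are hypothetical placeholders chosen only to illustrate the shape of the comparison; they are not the figures behind the answer, which does not state any prices.

```python
from dataclasses import dataclass
from typing import Optional

SECONDS_PER_MONTH = 30 * 24 * 3600  # rough 30-day month, good enough for bracketing
HORIZON_MONTHS = 18                 # the horizon named in the answer


def monthly_events_millions(events_per_sec: float) -> float:
    """Events per month, in millions, at a constant sustained rate."""
    return events_per_sec * SECONDS_PER_MONTH / 1_000_000


@dataclass(frozen=True)
class CostModel:
    """Linear monthly cost: a fixed part plus a per-volume part (both hypothetical)."""
    fixed_usd_per_month: float
    usd_per_million_events: float

    def monthly_cost(self, events_per_sec: float) -> float:
        volume = monthly_events_millions(events_per_sec)
        return self.fixed_usd_per_month + self.usd_per_million_events * volume

    def tco(self, events_per_sec: float, months: int = HORIZON_MONTHS) -> float:
        """Total cost over the horizon at a constant rate."""
        return months * self.monthly_cost(events_per_sec)


def breakeven_events_per_sec(a: CostModel, b: CostModel) -> Optional[float]:
    """Rate at which the two linear models cost the same, or None if they never cross."""
    slope_gap = a.usd_per_million_events - b.usd_per_million_events
    if slope_gap == 0:
        return None
    millions_per_month = (b.fixed_usd_per_month - a.fixed_usd_per_month) / slope_gap
    if millions_per_month <= 0:
        return None
    return millions_per_month * 1_000_000 / SECONDS_PER_MONTH


# Placeholder inputs, NOT real prices: a throughput-priced managed service with
# no fixed cost, and a self-hosted cluster with a fixed ops/broker cost plus a
# smaller per-volume cost.
MANAGED = CostModel(fixed_usd_per_month=0.0, usd_per_million_events=0.10)
SELF_HOSTED = CostModel(fixed_usd_per_month=20_000.0, usd_per_million_events=0.05)

if __name__ == "__main__":
    for rate in (50_000, 100_000, 200_000):
        print(rate, round(MANAGED.tco(rate)), round(SELF_HOSTED.tco(rate)))
    print("breakeven events/sec:", breakeven_events_per_sec(MANAGED, SELF_HOSTED))
```

With inputs of this shape, the throughput-priced option wins at low volume and the fixed-cost option wins at high volume, which is the kind of crossover the answer describes. The useful output of a sketch like this is usually the breakeven rate rather than either total, since it shows how sensitive the decision is to the growth projection.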

Answer

Yeah, this happens a lot actually. One time we had to pick between two databases for a new service.

We didn't have time to do a full benchmark, so I had to make a call based on what I knew.

I went with Postgres because I'd used it before and it seemed safer. The other option was a newer NoSQL thing that the team was excited about but we didn't really know how it would scale.

It turned out fine. Postgres handled our load without issues.

Answer

Last year we were building a new ingestion pipeline for clickstream data — projected at around 50k events per second at launch with growth to maybe 200k within a year. We had about three weeks before a hard launch deadline tied to a partner integration.

The main decision was the queueing layer between our edge collectors and the warehouse loaders. We had two viable options: Kafka, which the rest of the org already ran, or a managed service we'd been evaluating that would shave 2-3 weeks of ops work but charged on throughput.

We didn't have time to load-test either option at the projected 200k/sec. So I had to decide without knowing how either would behave at peak.

I went with Kafka. The reasoning was that the ops cost of debugging an unknown managed service at peak would be higher than the savings during the build phase, and the rest of the org's experience meant we'd have help if something went wrong.

We hit launch and the pipeline handled the load fine. About four months in we crossed 150k/sec and started seeing some rebalancing issues, which we'd anticipated.

