The shiny new thing

I founded a company and became more conservative with my technology choices.

Back in 2018, I started a company with some friends. Coming from a long stint in tech consulting, I was especially looking forward to having complete freedom to choose whatever technology we saw fit.

Our product was a social media platform for professionals, and we stored posts written by users in Postgres. The pipeline for publishing a post was simple at first, but it soon required long-running processing steps such as push notification targeting and image recognition for automatic moderation.

At first, these steps were part of our createPost API request, but they made the API slow to respond. Worse, if the process crashed in the middle of a request, the system could be left in an inconsistent state. At that point, we decided to move the work to a background process.

We discussed implementing a queue using Postgres, knowing it would be a tried-and-true option. However, we envisioned a glorious, high-traffic future ahead of us and went with Kafka instead. While Kafka gave us many guarantees (reliable, at-least-once processing) and a blueprint for scaling up (partitioning), it also came with implications we didn’t fully understand.
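
For comparison, the boring option would not have required much. Below is a minimal sketch of a Postgres-backed queue, assuming the node-postgres (pg) client and an illustrative jobs table; none of the names come from our actual codebase.

```ts
import { Pool } from "pg";

// Assumed schema:
// CREATE TABLE jobs (
//   id           bigserial PRIMARY KEY,
//   kind         text NOT NULL,
//   payload      jsonb NOT NULL,
//   processed_at timestamptz
// );
const pool = new Pool(); // connection settings come from PG* env vars

export async function claimAndProcessOneJob(
  handle: (kind: string, payload: unknown) => Promise<void>
): Promise<boolean> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // Lock one unprocessed job; SKIP LOCKED lets concurrent workers pick
    // different rows instead of blocking on each other.
    const { rows } = await client.query(
      `SELECT id, kind, payload FROM jobs
       WHERE processed_at IS NULL
       ORDER BY id
       FOR UPDATE SKIP LOCKED
       LIMIT 1`
    );
    if (rows.length === 0) {
      await client.query("COMMIT");
      return false; // queue is empty
    }
    await handle(rows[0].kind, rows[0].payload);
    await client.query("UPDATE jobs SET processed_at = now() WHERE id = $1", [
      rows[0].id,
    ]);
    await client.query("COMMIT");
    return true;
  } catch (err) {
    await client.query("ROLLBACK"); // job stays unclaimed and will be retried
    throw err;
  } finally {
    client.release();
  }
}
```

If a worker crashes mid-job, the transaction rolls back and the row becomes claimable again, which is roughly the at-least-once behaviour we were after.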

Queue sera, sera

Operationally, Kafka was expensive. We used a hosted solution instead of building our own cluster, but it was still by far the priciest component in our stack, both in terms of the monthly hosting bill and the workload it placed on our team. The multi-tenant Kafka cluster was flaky: disappearing brokers and stuck consumers kept causing processing delays.

Architecturally, we first needed to ensure our background processing was idempotent. This was quite straightforward: for example, we could assign a unique deduplication id to each push notification to ensure we didn’t send out duplicates. The bigger problem was that our createPost operation was no longer atomic: we wrote to Postgres first and then to Kafka, and the Kafka write would occasionally fail, leaving the system in an inconsistent state.
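
As a sketch of the deduplication idea (the sent_notifications table and the function name are hypothetical, not our actual schema), an INSERT ... ON CONFLICT DO NOTHING keyed on the deduplication id is enough to turn a redelivered message into a no-op:

```ts
import { Pool } from "pg";

// Assumed schema:
// CREATE TABLE sent_notifications (
//   dedup_id text PRIMARY KEY,
//   sent_at  timestamptz NOT NULL DEFAULT now()
// );
const pool = new Pool();

export async function sendPushOnce(
  dedupId: string,
  send: () => Promise<void>
): Promise<void> {
  // Claim the dedup id first; if another delivery of the same message already
  // claimed it, the insert affects zero rows and we skip the send.
  const result = await pool.query(
    "INSERT INTO sent_notifications (dedup_id) VALUES ($1) ON CONFLICT DO NOTHING",
    [dedupId]
  );
  if (result.rowCount === 0) {
    return; // duplicate delivery, already handled
  }
  // Note the trade-off: if send() fails here, the id is already claimed and
  // the push is dropped; claiming after the send would instead risk duplicates.
  await send();
}
```

Idempotent consumers alone, however, did nothing to make the dual write to Postgres and Kafka atomic.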

To fix this, we discussed a few options:

  1. Write to Postgres and then to Kafka within a single Postgres transaction. This isn’t bulletproof: the Kafka write could take a while and then succeed, while the transaction times out and rolls back, leaving the system inconsistent in a different way.
  2. Write to Kafka only and treat the Kafka event as the single source of truth, inserting the post into Postgres from a background Kafka consumer. This could have worked, but our UI and tests relied on reading their own writes rather than waiting for the post to eventually appear in the database.
  3. Use change data capture (CDC): write only to Postgres and have the changes automatically replicated to Kafka. This wasn’t possible with our hosted Postgres solution at the time.
  4. Implement a transactional outbox for the outgoing Kafka messages: write both the post and the Kafka event to Postgres in a single transaction, and have a background worker deliver the outgoing messages to Kafka (a sketch follows this list).
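
Here is roughly what the outbox looked like in spirit. The table, topic, column, and function names are illustrative, the posts table is assumed, and the Kafka side uses the kafkajs client as a stand-in for whatever producer you happen to have:

```ts
import { Pool } from "pg";
import { Kafka } from "kafkajs";

// Assumed schema:
// CREATE TABLE outbox (
//   id           bigserial PRIMARY KEY,
//   topic        text NOT NULL,
//   payload      jsonb NOT NULL,
//   published_at timestamptz
// );
const pool = new Pool();
const kafka = new Kafka({
  clientId: "outbox-relay",
  brokers: ["localhost:9092"], // placeholder broker address
});
const producer = kafka.producer();

// The post and its outgoing event are committed atomically, or not at all.
export async function createPost(authorId: string, body: string): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    const { rows } = await client.query(
      "INSERT INTO posts (author_id, body) VALUES ($1, $2) RETURNING id",
      [authorId, body]
    );
    await client.query("INSERT INTO outbox (topic, payload) VALUES ($1, $2)", [
      "post-created",
      JSON.stringify({ postId: rows[0].id }),
    ]);
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}

// Background relay: drain unpublished rows into Kafka. If it crashes between
// the send and the UPDATE, the row is published again on the next run, so
// downstream consumers still need to be idempotent (at-least-once delivery).
export async function relayOutboxBatch(): Promise<void> {
  await producer.connect();
  const { rows } = await pool.query(
    `SELECT id, topic, payload FROM outbox
     WHERE published_at IS NULL
     ORDER BY id
     LIMIT 100`
  );
  for (const row of rows) {
    await producer.send({
      topic: row.topic,
      messages: [{ value: JSON.stringify(row.payload) }],
    });
    await pool.query("UPDATE outbox SET published_at = now() WHERE id = $1", [row.id]);
  }
}
```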

We chose the last option, so we now had a queue in Postgres feeding Kafka, which raises the question: could we have simply used Postgres all along? I think so. And we wouldn’t have been the only ones to do so.

The balancing act

The excellent essay Choose Boring Technology introduces the concept of innovation tokens: each company gets only a few of them, and choosing technology that is non-boring in your context deducts from your balance.

For our needs, and given our team’s expertise, we spent a token on Kafka. In another context, Kafka might have been the best choice. We also made numerous boring choices along the way, though (Heroku 😴, Postgres 💤).

We must innovate, but we must also balance our technology choices with creating business value.