I’ve spent the last three years building real-time trading systems for a major European energy company. The core challenge: process market data, calculate positions, and execute trades - all in real-time, with zero tolerance for data loss.

Kafka is at the center of it all. Here’s what I’ve learned about using it for trading systems.

Why Kafka for Trading?

Trading systems have specific requirements that make Kafka a natural fit:

Durability. Every trade, every price tick, every position change must be persisted. Kafka’s append-only log with configurable retention means you never lose data - and you can replay history if something goes wrong.

Ordering. Events for a single instrument must be processed in order. Kafka guarantees ordering within a partition. Partition by instrument ID, and you get exactly the semantics you need.

Throughput. Energy markets generate thousands of price updates per second. Kafka handles this easily. Our production cluster processes 50,000+ events per second with single-digit millisecond latency.

Decoupling. Trading systems have many components: market data ingestion, pricing engines, risk calculators, order management, reporting. Kafka lets each component evolve independently.

The Architecture

Here’s the high-level topology we use:

Market Data Feed
      |
      v
[Ingestion Service] --> Kafka: market-data
      |
      v
[Pricing Engine] --> Kafka: prices
      |
      v
[Position Calculator] --> Kafka: positions
      |
      +---> [Risk Service]
      |
      +---> [Trading UI]
      |
      +---> [Reporting Service]

Each arrow is a Kafka topic. Each box is an independent service that consumes from upstream topics and produces to downstream topics.

Key Patterns

1. Partition by Instrument

producer.send(new ProducerRecord<>(
    "market-data",
    instrument.getId(),  // Key determines partition
    priceUpdate
));

All events for a single instrument go to the same partition. This guarantees ordering and means your consumer can maintain local state per instrument without coordination.
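
For completeness, here is a rough sketch of the consumer side of that pattern; the PriceUpdate type, deserializer config, and props are simplified placeholders:

// Sketch: per-instrument state in a plain HashMap. This is safe because all
// events for an instrument land on one partition and arrive in order.
Map<String, PriceUpdate> latestByInstrument = new HashMap<>();

try (KafkaConsumer<String, PriceUpdate> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("market-data"));
    while (true) {
        for (ConsumerRecord<String, PriceUpdate> record : consumer.poll(Duration.ofMillis(100))) {
            // record.key() is the instrument ID we used as the partition key above
            latestByInstrument.put(record.key(), record.value());
        }
    }
}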

2. Idempotent Processing

Network issues happen. Consumers restart. You will process some messages more than once.

Design your processors to be idempotent:

public void processPrice(PriceUpdate update) {
    String key = update.getInstrumentId() + "-" + update.getTimestamp();

    // processedIds: a set of recently seen keys (bounded in practice, e.g. an LRU cache)
    if (processedIds.contains(key)) {
        return; // Already processed
    }

    // Process the update
    updatePosition(update);

    processedIds.add(key);
}

We use a combination of Kafka’s exactly-once semantics and application-level deduplication. Belt and suspenders.

3. State in Kafka, Not Databases

For position calculations, the temptation is to read from Kafka, update a database, and move on. Don’t.

Instead, use Kafka as the source of truth. Store positions as a compacted topic:

// Positions topic with log compaction
// Key: instrumentId
// Value: current position

positions.put(instrumentId, newPosition);
positionProducer.send(new ProducerRecord<>(
    "positions",
    instrumentId,
    newPosition
));

With log compaction, Kafka keeps only the latest value per key. Your topic becomes a distributed key-value store that any service can consume.
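
A nice consequence: on startup, a service can rebuild its local view by replaying the compacted topic from the beginning. Roughly like this, where the Position type and consumer config are placeholders:

// Sketch: replay the compacted "positions" topic into a local map at startup.
List<TopicPartition> partitions = consumer.partitionsFor("positions").stream()
    .map(info -> new TopicPartition(info.topic(), info.partition()))
    .toList();
consumer.assign(partitions);
consumer.seekToBeginning(partitions);

Map<String, Position> positions = new HashMap<>();
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

boolean caughtUp = false;
while (!caughtUp) {
    for (ConsumerRecord<String, Position> record : consumer.poll(Duration.ofMillis(200))) {
        if (record.value() == null) {
            positions.remove(record.key());   // tombstone: the key was deleted
        } else {
            positions.put(record.key(), record.value());
        }
    }
    // Caught up once every partition has reached the end offset captured above
    caughtUp = endOffsets.entrySet().stream()
        .allMatch(e -> consumer.position(e.getKey()) >= e.getValue());
}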

4. Time Windows for Aggregation

Risk calculations often need time-windowed aggregations: “What’s my P&L for the last 5 minutes?”

Kafka Streams makes this simple:

KStream<String, Trade> trades = builder.stream("trades");

KTable<Windowed<String>, Double> pnlByWindow = trades
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .aggregate(
        () -> 0.0,
        (key, trade, agg) -> agg + trade.getPnl(),
        Materialized.as("pnl-store")
    );

The state store is backed by Kafka, so you get fault tolerance for free.
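
If you need to read those windows back out of the running application, Kafka Streams interactive queries work against the same store. A sketch, where the streams instance and the instrument ID are placeholders:

// Sketch: query the windowed P&L store from the running KafkaStreams instance.
ReadOnlyWindowStore<String, Double> store = streams.store(
    StoreQueryParameters.fromNameAndType("pnl-store",
        QueryableStoreTypes.<String, Double>windowStore()));

Instant now = Instant.now();
try (WindowStoreIterator<Double> iter =
         store.fetch("SOME-INSTRUMENT-ID", now.minus(Duration.ofMinutes(5)), now)) {
    while (iter.hasNext()) {
        KeyValue<Long, Double> entry = iter.next();  // window start timestamp -> aggregated P&L
        System.out.printf("window @ %d: pnl = %.2f%n", entry.key, entry.value);
    }
}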

Pitfalls I’ve Hit

Consumer Lag During Spikes

Market opens are brutal. Volume spikes 10x for the first 30 minutes. If your consumers can’t keep up, lag builds, and you’re trading on stale data.

Solutions:

  • Over-provision consumers (we run 3x normal capacity during market hours)
  • Prioritize recent data (skip old updates if you’re too far behind; see the sketch after this list)
  • Alert on lag thresholds (we alert if lag exceeds 100ms)
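
The second point only makes sense for last-value data like price ticks; trades and position deltas must never be skipped. A rough sketch of the idea, keyed off the record timestamp (the threshold and the metric are illustrative):

// Sketch: drop price updates older than a staleness threshold when we've fallen behind.
long maxStalenessMs = 500;  // illustrative, not our production value

for (ConsumerRecord<String, PriceUpdate> record : records) {
    long ageMs = System.currentTimeMillis() - record.timestamp();
    if (ageMs > maxStalenessMs) {
        staleSkippedCounter.increment();  // hypothetical metric: always track what you drop
        continue;
    }
    processPrice(record.value());
}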

Rebalancing Storms

Adding or removing consumers triggers a rebalance. During rebalance, processing stops. If your rebalances take too long, you accumulate lag.

Solutions:

  • Use static group membership (Kafka 2.3+; see the config sketch below)
  • Increase session.timeout.ms to reduce false positives
  • Use incremental cooperative rebalancing

props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    CooperativeStickyAssignor.class.getName());
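
The first two bullets translate into consumer config along these lines; the instance ID and timeout values here are illustrative, not what we run:

// Static membership: a stable group.instance.id per consumer instance avoids a
// full rebalance when that instance restarts within the session timeout.
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "pricing-engine-1");

// A larger session timeout so a GC pause or brief network blip isn't
// mistaken for a dead consumer (illustrative values).
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "45000");
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "15000");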

Schema Evolution

Your price update schema will change. New fields, removed fields, renamed fields. If you’re not careful, you’ll break downstream consumers.

We use Avro with a schema registry:

props.put("schema.registry.url", "http://schema-registry:8081");
props.put("value.serializer", KafkaAvroSerializer.class);

Schema registry enforces compatibility rules. Backward compatible changes are allowed; breaking changes are rejected at publish time.

Exactly-Once Complexity

Kafka’s exactly-once semantics are powerful but add complexity. You need:

  • Idempotent producers (enable.idempotence=true)
  • Transactional producers for atomic multi-topic writes
  • read_committed isolation on consumers

We use exactly-once for critical paths (position updates, trade execution) and accept at-least-once for less critical flows (logging, analytics).
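
For the critical path, the transactional producer pattern looks roughly like this; the topic names match the diagram above, but the offsets map, consumer handle, and error handling are simplified:

// Sketch: atomically publish a trade and its resulting position update, and
// commit the consumed offsets in the same transaction.
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "position-calculator-1");

KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
producer.initTransactions();

try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("trades", instrumentId, tradeBytes));
    producer.send(new ProducerRecord<>("positions", instrumentId, positionBytes));
    producer.sendOffsetsToTransaction(consumedOffsets, consumer.groupMetadata());
    producer.commitTransaction();
} catch (KafkaException e) {
    // Simplified: fatal errors (e.g. ProducerFencedException) require closing
    // the producer rather than aborting.
    producer.abortTransaction();
}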

Monitoring

You can’t operate what you can’t see. Key metrics we watch:

Metric                         Alert Threshold
Consumer lag                   > 100ms
Partition leader elections     > 0 per hour
Under-replicated partitions    > 0
Request latency p99            > 50ms
Producer error rate            > 0.1%

We use Prometheus + Grafana for dashboards and PagerDuty for alerts. Every on-call rotation starts with checking Kafka health.

What I’d Do Differently

Start with fewer topics. We created too many topics and partitions early on, and more of each means more operational overhead. Start with a single topic per domain and split later, once you have data on actual throughput.

Invest in testing infrastructure. Embedded Kafka for unit tests, Testcontainers for integration tests. We spent too long debugging issues that better tests would have caught.
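
A sketch of what the Testcontainers side can look like with testcontainers-junit-jupiter; the class, test, and image tag are illustrative:

// Sketch: spin up a throwaway Kafka broker for an integration test.
@Testcontainers
class MarketDataIngestionIT {

    @Container
    static KafkaContainer kafka =
        new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"));

    @Test
    void publishesPriceUpdates() {
        // Point the producer/consumer under test at the container's broker
        String bootstrapServers = kafka.getBootstrapServers();
        // ... produce a test price update, consume it back, assert on the result
    }
}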

Document your schemas. Not just the Avro files - the business meaning of each field. Six months from now, you won’t remember why adjustedPrice differs from rawPrice.

The Stack

For reference, here’s what we run:

  • Kafka: Confluent Platform 7.x (managed)
  • Consumers: Java 17, Spring Boot, Kafka Streams
  • Schema Registry: Confluent Schema Registry
  • Monitoring: Prometheus, Grafana, Kafka Exporter
  • Serialization: Avro

The managed platform is worth it. Operating Kafka yourself is a full-time job.

Conclusion

Kafka is excellent for trading systems. The combination of durability, ordering guarantees, and high throughput matches the requirements almost perfectly.

The hard parts aren’t Kafka itself - they’re the operational concerns: monitoring, schema evolution, handling spikes, and testing. Invest early in infrastructure and observability. Your future self will thank you.


If you’re building trading systems and want to chat, reach out: [email protected]