Apache Kafka Fundamentals
Kafka is a distributed, partitioned, append-only commit log. It is not a traditional message broker; it does not delete messages upon consumption and does not push messages to consumers. Instead, consumers pull data and track their own state via offsets.
Core Primitive: The Partitioned Log
A **Topic** is a logical stream. Physically, it is divided into **Partitions**.
- **Ordering**: Strict ordering is guaranteed ONLY within a single partition.
- **Parallelism**: The number of partitions is the upper bound on consumer parallelism within a consumer group.
- **Immutability**: Once a record is written to a partition at an **Offset**, it cannot be modified.
Concrete Example: Partition Key Selection
Choosing a key like `user_id` ensures all events for a specific user land in the same partition and are processed in order.
```java
// Producer Record with Key
ProducerRecord<String, String> record = new ProducerRecord<>("orders", "user_123", "{\"order_id\": \"987\"}");
// All user_123 records land in the same partition
```
Durability and Replication
Kafka achieves fault tolerance via the **ISR (In-Sync Replicas)** model.
- **Replication Factor (RF)**: Usually 3. One leader, two followers.
- **min.insync.replicas**: Minimum number of ISRs that must acknowledge a write for it to be considered successful. Recommended: 2 for RF=3.
- **acks=all**: The producer waits for the full ISR set to acknowledge.
Durable Producer Configuration
```properties
bootstrap.servers=kafka-1:9092,kafka-2:9092
acks=all
retries=2147483647
max.in.flight.requests.per.connection=5
enable.idempotence=true
```
Consumer Group Semantics
A **Consumer Group** allows multiple instances to divide the partitions of a topic.
- Each partition is assigned to exactly one consumer in the group.
- If a consumer fails, the **Group Coordinator** triggers a **Rebalance**.
- **Cooperative Sticky Assignor**: Modern (Kafka 2.4+) strategy that minimizes "stop-the-world" time during rebalances by only moving the necessary partitions.
Production Failure Modes
1. **Consumer Lag Spirals**: Processing time exceeds ingestion rate. Result: Disk pressure on brokers or missed SLAs. Fix: Scale partitions and consumers.
2. **Unbalanced Partitions**: High-cardinality keys with skewed distribution (e.g., a "system" user with 1000x activity). Fix: Salt the key or use a different sharding strategy.
3. **Zombies**: A consumer hangs, triggers a rebalance, then returns and tries to commit. Fix: Use `transactional.id` and Kafka transactions for "exactly-once" semantics.
Performance Tuning
- **Linger and Batching**: Set `linger.ms=5` and `batch.size=32768` to increase throughput at the cost of slight latency.
- **Compression**: Use `compression.type=lz4` or `zstd`. This reduces network I/O and storage costs significantly with minimal CPU overhead.
- **Page Cache**: Kafka relies on the OS page cache. Do not allocate massive heaps (keep JVM heap < 8GB); let the kernel use the remaining RAM for caching log segments.
Summary of Technical implementation added
- Added concrete Producer configuration properties for durability.
- Added Java snippet for keyed partitioning.
- Detailed the ISR and `min.insync.replicas` interaction.
- Included specific performance tuning parameters (`linger.ms`, `batch.size`).
- Explicitly defined the "Zombies" failure mode.