Configure Kafka for High Availability

What does it mean to optimize Kafka for High Availability?

This means that Kafka applications should recover from failures as quickly as possible, so that reads from and writes to topic partitions remain available to the respective Kafka clients.

Minimum In-sync Replicas

When the producer configuration acks is set to all, min.insync.replicas specifies the minimum number of in-sync replicas (ISRs) that must acknowledge a write before a produce request (which is a write request) can be considered successful. If this minimum is not met, the produce request fails and the producer throws an exception, which decreases the availability of the Kafka partition for writes. Setting a lower value for this configuration therefore lets the cluster tolerate more replica failures before writes become unavailable. See the following section for the trade-offs to keep in mind while setting its value.
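As a minimal sketch of this behavior, the hypothetical helper below (not part of any Kafka API) simulates the broker-side check for an acks=all produce request:

```python
# Illustrative simulation of the broker-side check for a produce request
# with acks=all. produce_allowed is a hypothetical helper, not a Kafka API.

def produce_allowed(isr_count: int, min_insync_replicas: int) -> bool:
    """With acks=all, a partition accepts a write only while the number of
    in-sync replicas is at least min.insync.replicas."""
    return isr_count >= min_insync_replicas

# With replication.factor=3 and min.insync.replicas=2:
print(produce_allowed(isr_count=3, min_insync_replicas=2))  # True: all replicas in sync
print(produce_allowed(isr_count=2, min_insync_replicas=2))  # True: one replica down/lagging
print(produce_allowed(isr_count=1, min_insync_replicas=2))  # False: producer sees a
                                                            # NotEnoughReplicas-style error
```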


Write Availability vs Durability

A higher min.insync.replicas improves durability, since more replicas must persist each write, but it reduces write availability: the partition stops accepting writes as soon as the number of in-sync replicas drops below the configured minimum. A lower value trades durability for better write availability.

Read Availability vs Durability

No trade-off. min.insync.replicas does not affect reads; consumers can keep reading as long as a leader is available for the partition.
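The write-availability side of this trade-off is simple arithmetic. The helper below is hypothetical and only illustrates the relationship:

```python
# Hypothetical helper illustrating the availability/durability trade-off
# for acks=all writes; not part of any Kafka client API.

def tolerated_replica_failures(replication_factor: int,
                               min_insync_replicas: int) -> int:
    """Number of replica failures a partition can absorb while still
    accepting acks=all writes."""
    return replication_factor - min_insync_replicas

# Higher min.insync.replicas => better durability, fewer tolerated failures:
print(tolerated_replica_failures(3, 1))  # 2 failures tolerated, weakest durability
print(tolerated_replica_failures(3, 2))  # 1 failure tolerated, common middle ground
print(tolerated_replica_failures(3, 3))  # 0 -> any single failure blocks writes
```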

Consumer Failures

Kafka can detect consumer failures and initiate a rebalance so that the partitions the failed consumers were consuming from are redistributed among the remaining consumers in the same consumer group.

Consumer failures can be seen as:

  • Hard failures (e.g., SIGKILL), which imply a certain failure of the consumer process.

  • Soft failures (e.g., a consumer session timeout), where Kafka deems the consumer process to have failed even though the consumer application might still be processing the messages it last polled.

Both of the above failures are detected by the Kafka cluster, after which a rebalance is triggered.

Soft failures caused by network issues, long message-processing times, or misconfiguration in general can be mitigated by tuning the following configurations.


  • session.timeout.ms, heartbeat.interval.ms

    • heartbeat.interval.ms is the interval between two successive heartbeats sent to the consumer group coordinator (which is one of the Kafka brokers) to let it know that the consumer process is alive. These heartbeats are sent by a background thread.

    • heartbeat.interval.ms must be set lower than session.timeout.ms. It is recommended to set heartbeat.interval.ms <= (1/3) * session.timeout.ms [1].

    • If the consumer group coordinator does not receive a heartbeat from a consumer within session.timeout.ms, the consumer is removed from the group and a rebalance is initiated by the consumer group coordinator.

    • Setting lower values for session.timeout.ms and heartbeat.interval.ms helps detect failed consumers and recover faster, but it could also cause undesirable rebalances, e.g., if the consumer usually takes only a few seconds longer than the session timeout to process the messages it polled.

    • Setting higher values for these configurations would reduce the chances of unwanted rebalances, but it also entails a longer time to detect real failures.

    • Note that the session.timeout.ms value must lie within the allowable range configured on the broker by group.min.session.timeout.ms and group.max.session.timeout.ms.

  • max.poll.interval.ms, max.poll.records

    • max.poll.interval.ms sets the maximum interval between two successive calls to poll(). When this timeout expires, the consumer stops sending heartbeats and the background thread sends an explicit LeaveGroup request, eventually resulting in a rebalance. If the consumer is expected to take longer to process a batch of messages, we should increase the value of this configuration.

    • The max poll interval configuration applies to the main thread that processes the messages, in contrast to heartbeat.interval.ms, which applies to the background heartbeat thread. It is possible, for example, for the main thread to be deadlocked while the background thread is still sending heartbeats; max.poll.interval.ms ensures that such cases are detected and acted upon.

    • We can also reduce max.poll.records, which should decrease the total time the consumer takes to process each batch.

    • Even after tuning max.poll.records, it is still difficult to predict the exact interval between two poll() calls. Ideally, max.poll.interval.ms should be set high enough that a healthy consumer application only rarely reaches it, but low enough that a hanging consumer main thread is not left undetected for too long.

    • The fact that session.timeout.ms is configured to be less than or equal to max.poll.interval.ms ensures that an unhealthy consumer is detected and recovered from much earlier.
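The constraints above can be sanity-checked with a small sketch. Both helpers below are hypothetical (not part of any Kafka client API), and the broker-range defaults and the safety factor are assumptions to check against your own cluster:

```python
# Illustrative sanity checks for the consumer configurations discussed above.
# These helpers are hypothetical; they are not part of any Kafka client API.

def validate_session_configs(heartbeat_interval_ms: int,
                             session_timeout_ms: int,
                             group_min_session_timeout_ms: int = 6_000,
                             group_max_session_timeout_ms: int = 1_800_000) -> list:
    """Return a list of violated constraints (empty list means all good).
    The group.*.session.timeout.ms defaults here are assumptions."""
    problems = []
    if heartbeat_interval_ms >= session_timeout_ms:
        problems.append("heartbeat.interval.ms must be lower than session.timeout.ms")
    elif heartbeat_interval_ms > session_timeout_ms // 3:
        problems.append("recommended: heartbeat.interval.ms <= (1/3) * session.timeout.ms")
    if not group_min_session_timeout_ms <= session_timeout_ms <= group_max_session_timeout_ms:
        problems.append("session.timeout.ms outside broker-allowed range")
    return problems

def estimated_poll_interval_ms(max_poll_records: int,
                               per_record_ms: float,
                               safety_factor: float = 2.0) -> int:
    """Rough upper bound on the time between two poll() calls, for comparing
    against max.poll.interval.ms. The safety factor is an assumption."""
    return int(max_poll_records * per_record_ms * safety_factor)

print(validate_session_configs(3_000, 10_000))   # [] -> constraints satisfied
print(validate_session_configs(8_000, 10_000))   # violates the 1/3 recommendation

# 500 records at ~100 ms each, doubled for headroom, vs. a 300 s max.poll.interval.ms:
print(estimated_poll_interval_ms(500, 100.0))             # 100000
print(estimated_poll_interval_ms(500, 100.0) <= 300_000)  # True: interval is sufficient
print(estimated_poll_interval_ms(500, 800.0) <= 300_000)  # False: raise the interval
                                                          # or lower max.poll.records
```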



  2. Kafka - The Definitive Guide

  3. KIP-62: Allow consumer to send heartbeats from a background thread - Apache Kafka - Apache Software Foundation

  4. If references to declarative claims made in this doc are not found in the references above, they may have been taken directly from the Kafka code base.
