
Troubleshooting Complex Apache Kafka Production Issues: A Consultant's Diagnostic Checklist
Apache Kafka, while robust and scalable, is a complex distributed system. When production issues arise—be it plummeting throughput, skyrocketing latency, or mysterious data loss—diagnosing the root cause can feel like searching for a needle in a haystack. Standard troubleshooting steps might not suffice for intricate problems that span across brokers, clients, network, and underlying infrastructure. This is where a systematic, expert-driven diagnostic approach becomes invaluable.
At ActiveWizards, our consultants frequently parachute into complex Kafka environments to resolve critical production fires. This article distills our collective experience into a 15-point diagnostic checklist. It's designed to guide you methodically through troubleshooting complex Kafka issues, helping you ask the right questions, inspect the right metrics, and ultimately, restore your cluster to health. Think of this as your field guide for when Kafka gets complicated.
The Foundational Mindset: Systematic Investigation
Before diving into the checklist, adopt this mindset:
- Define the Problem Clearly: What are the exact symptoms? When did they start? Is it impacting all topics/clients or a subset?
- Gather Evidence: Don't guess. Collect logs, metrics, and configurations.
- Isolate the Scope: Is it a producer, consumer, or broker issue? Is it network, disk, or CPU?
- Correlate Events: Did the issue coincide with a deployment, configuration change, or infrastructure event?
- Reproduce (if possible): Can you reproduce the issue in a non-production environment?
Diagram 1: General Troubleshooting Flow for Kafka Issues.
A Consultant's 15-Point Diagnostic Checklist
This checklist is categorized by common problem areas. Start with the most relevant category based on initial symptoms.
Broker & Cluster Health Diagnostics
- Check for Under-Replicated / Offline Partitions:
- Metric: `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`, `kafka.controller:type=KafkaController,name=OfflinePartitionsCount`.
- Why: Value > 0 indicates reduced fault tolerance, potential data loss risk, or unavailable partitions.
- Action: Examine broker logs on affected leaders/replicas for errors (disk full, network issues, crashes). Check `IsrShrinksPerSec` for ISR instability. Ensure brokers are alive and reachable.
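As a complement to the JMX metrics, you can enumerate the affected partitions directly from cluster metadata. A minimal sketch, assuming the `confluent-kafka` Python client and an illustrative bootstrap address:

```python
from confluent_kafka.admin import AdminClient

# Illustrative bootstrap address -- replace with your cluster's.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

md = admin.list_topics(timeout=10)  # full cluster metadata
for topic in md.topics.values():
    for p in topic.partitions.values():
        if p.leader == -1:
            print(f"OFFLINE: {topic.topic}-{p.id} (no leader)")
        elif len(p.isrs) < len(p.replicas):
            print(f"UNDER-REPLICATED: {topic.topic}-{p.id} "
                  f"ISR={p.isrs} replicas={p.replicas}")
```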
- Verify Active Controller:
- Metric: `kafka.controller:type=KafkaController,name=ActiveControllerCount`.
- Why: Must be exactly 1 across the cluster. A value of 0 means no active controller (partition leader elections cannot happen); >1 indicates a split-brain.
- Action: Check ZooKeeper (if used) or KRaft quorum health. Examine logs on all brokers for controller election messages or errors.
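If you prefer to confirm this from a client rather than via JMX, cluster metadata also reports the controller id. A quick sketch, again assuming the `confluent-kafka` Python client and an illustrative bootstrap address:

```python
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # illustrative
md = admin.list_topics(timeout=10)
print(f"Controller id reported by cluster metadata: {md.controller_id}")
# -1, or an id that keeps changing between calls, warrants a closer look
# at ZooKeeper/KRaft health and the brokers' controller logs.
```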
- Assess Broker Resource Saturation (CPU, Memory, Disk, Network):
- Metrics: OS-level CPU utilization (esp. iowait), free memory (page cache!), disk queue depth/await times, network throughput/errors. Kafka JMX: `NetworkProcessorAvgIdlePercent`, `RequestHandlerAvgIdlePercent`.
- Why: Overloaded brokers cause high latency and instability.
- Action: Identify the bottleneck. If CPU, check compression, SSL, thread pool sizes. If disk, check I/O patterns, disk health. If memory, check JVM heap vs. page cache allocation. If network, check bandwidth limits, MTU.
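For a quick OS-level snapshot on a broker host before reaching for full monitoring, something like the following works; a sketch assuming the third-party `psutil` package is installed on the host:

```python
import psutil

cpu = psutil.cpu_times_percent(interval=1)   # one-second sample
mem = psutil.virtual_memory()
disk = psutil.disk_io_counters()             # may be None on some platforms
net = psutil.net_io_counters()

print(f"CPU: user={cpu.user}% system={cpu.system}% iowait={getattr(cpu, 'iowait', 'n/a')}%")
print(f"Memory: available={mem.available / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB")
print(f"Disk:   read_time={disk.read_time} ms  write_time={disk.write_time} ms")
print(f"Net:    errin={net.errin}  errout={net.errout}  dropin={net.dropin}")
```

Keep in mind that Kafka leans heavily on the page cache, so "low free memory" is often healthy; sustained iowait and network errors are usually the more telling signals.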
- Inspect Broker Logs for Recurring Errors or Warnings:
- Location: `logs/server.log`, `logs/controller.log`, `logs/state-change.log`.
- Why: Logs often contain explicit error messages, stack traces, or warnings about misconfigurations, hardware issues, or connectivity problems.
- Action: Look for keywords like ERROR, WARN, FATAL, Exception, timeout, "failed to". Correlate timestamps with problem occurrence.
- Review ZooKeeper/KRaft Quorum Health:
- Metrics (ZK): `zk_avg_latency`, `zk_outstanding_requests`, `zk_followers`, `zk_synced_followers`. Use `mntr` or `srvr` 4-letter words.
- Metrics (KRaft): `kafka.raft:type=RaftManager,name=current-leader-id`, `...current-epoch`, `...high-watermark`. Examine controller logs.
- Why: Kafka relies on its consensus layer. Issues here (e.g., quorum loss, high latency) cripple the cluster.
- Action: Ensure quorum members are healthy, network connectivity is good between them, and they are not resource-starved.
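For ZooKeeper-based clusters, the four-letter-word commands mentioned above can be issued over a raw socket. A minimal stdlib-only sketch (hostnames are illustrative, and `mntr` must be allowed via `4lw.commands.whitelist` on ZooKeeper 3.5+):

```python
import socket

def four_letter_word(host: str, cmd: bytes = b"mntr", port: int = 2181) -> str:
    """Send a ZooKeeper four-letter-word command and return the raw response."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()

# Illustrative hostnames -- replace with your quorum members.
for zk in ("zk1.example.com", "zk2.example.com", "zk3.example.com"):
    stats = four_letter_word(zk)
    interesting = [line for line in stats.splitlines()
                   if line.startswith(("zk_server_state", "zk_avg_latency",
                                       "zk_outstanding_requests", "zk_synced_followers"))]
    print(zk, interesting)
```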
Producer-Side Diagnostics
- Analyze Producer Error Rates & Latency:
- Metrics: `record-error-rate`, `record-error-total`, `request-latency-avg/max`.
- Why: High error rates or latencies indicate problems sending data.
- Action: Check producer logs for specific exceptions (`TimeoutException`, `RecordTooLargeException`, `NotLeaderForPartitionException`). Verify broker connectivity, `acks` settings, `retries`, `max.request.size`, and idempotence configuration.
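In librdkafka-based clients such as the `confluent-kafka` Python client, per-record errors surface through the delivery callback, which makes it easy to see exactly which failures dominate. A hedged sketch (topic name and settings are illustrative):

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",   # illustrative
    "acks": "all",
    "enable.idempotence": True,
    "retries": 5,
})

def delivery_report(err, msg):
    # Called once per record, from poll()/flush(), with the final outcome.
    if err is not None:
        print(f"DELIVERY FAILED for {msg.topic()}-{msg.partition()}: {err}")
    else:
        print(f"ok: {msg.topic()}-{msg.partition()} offset={msg.offset()}")

producer.produce("orders", value=b"payload", callback=delivery_report)
producer.flush(10)  # drain the queue and fire outstanding callbacks
```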
- Check Producer Buffer Fullness & Request Pipelining:
- Metric & Config: `buffer-available-bytes` (producer metric), `max.in.flight.requests.per.connection` (producer config).
- Why: Full buffers (low `buffer-available-bytes`) or misconfigured pipelining can stall producers.
- Action: Increase `buffer.memory` if chronically full and brokers can handle load. Ensure `max.in.flight.requests.per.connection` is appropriate (1 for strict ordering without idempotence, up to 5 with idempotence).
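Note that `buffer.memory` and `buffer-available-bytes` are Java-client concepts; in librdkafka-based clients (e.g., the `confluent-kafka` Python client) the analogous backpressure appears as a `BufferError` raised by `produce()` when the local queue is full. A sketch of handling it (queue limits shown are illustrative):

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",      # illustrative
    "queue.buffering.max.messages": 100_000,    # local queue limits (librdkafka)
    "queue.buffering.max.kbytes": 1_048_576,
})

def send(topic: str, value: bytes) -> None:
    while True:
        try:
            producer.produce(topic, value=value)
            return
        except BufferError:
            # Local queue is full: give the background thread a chance
            # to drain records to the brokers, then retry.
            producer.poll(0.5)
```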
Consumer-Side Diagnostics
- Investigate High or Growing Consumer Lag:
- Metric: `records-lag-max` per consumer group/topic/partition.
- Why: The most common symptom of consumers not keeping up with producers.
- Action:
- Is processing logic slow? Optimize consumer code. Consider async processing.
- Insufficient consumer instances for the partition count? Scale out consumers (this helps only up to one consumer per partition).
- Is `max.poll.records` too high for `max.poll.interval.ms`? Adjust these.
- Downstream system bottlenecked?
- Check consumer logs for errors or frequent rebalances.
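Beyond `records-lag-max`, it is often useful to compute lag per partition on demand by comparing committed offsets with log end offsets. A sketch assuming the `confluent-kafka` Python client (group and topic names are illustrative):

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # illustrative
    "group.id": "orders-processor",         # the group whose lag you want
    "enable.auto.commit": False,
})

topic = "orders"
metadata = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]

committed = consumer.committed(partitions, timeout=10)
for tp in committed:
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    lag = high - tp.offset if tp.offset >= 0 else high - low  # no commit yet: whole log
    print(f"{tp.topic}-{tp.partition}: committed={tp.offset} end={high} lag={lag}")

consumer.close()
```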
- Look for Frequent Consumer Rebalances:
- Metrics: `join-rate`, `sync-rate`, `last-rebalance-seconds-ago`. Consumer logs showing "Revoking partitions" / "Assigning partitions".
- Why: Rebalances stop consumption. Frequent rebalances indicate instability (consumers crashing, `session.timeout.ms` too low, `max.poll.interval.ms` violations, "fencing" with static membership if instance IDs conflict).
- Action: Stabilize consumers. Increase timeouts if appropriate. Ensure unique `group.instance.id` for static members. Fix underlying consumer crashes.
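Rebalance churn is easy to make visible from the consumer itself by logging assignment changes. A sketch with the `confluent-kafka` Python client (timeouts and ids are illustrative; with static membership, `group.instance.id` must be unique per instance):

```python
import logging
from confluent_kafka import Consumer

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rebalance-audit")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",      # illustrative
    "group.id": "orders-processor",
    "group.instance.id": "orders-processor-1",  # static membership: unique per instance
    "session.timeout.ms": 45000,
    "max.poll.interval.ms": 300000,
})

def on_assign(consumer, partitions):
    log.info("Assigned: %s", [f"{p.topic}-{p.partition}" for p in partitions])

def on_revoke(consumer, partitions):
    log.info("Revoked:  %s", [f"{p.topic}-{p.partition}" for p in partitions])

consumer.subscribe(["orders"], on_assign=on_assign, on_revoke=on_revoke)
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # ...process msg, keeping each iteration well under max.poll.interval.ms
```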
- Examine Consumer Fetch Behavior & Errors:
- Metrics: `fetch-latency-avg/max`, `fetch-throttle-time-avg/max`, `records-consumed-rate`.
- Why: High fetch latency points to broker load or network issues. Throttling indicates broker quotas are being hit.
- Action: Check broker health and network. If throttled, review quotas or consumer fetch rates.
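Client-side round-trip latency and throttling can also be sampled from the statistics callback that librdkafka-based clients expose. A sketch assuming the `confluent-kafka` Python client; the `rtt` (microseconds) and `throttle` (milliseconds) fields follow librdkafka's statistics JSON:

```python
import json
from confluent_kafka import Consumer

def on_stats(stats_json: str) -> None:
    stats = json.loads(stats_json)
    for broker in stats.get("brokers", {}).values():
        rtt = broker.get("rtt", {})
        throttle = broker.get("throttle", {})
        print(f"{broker.get('name')}: rtt_avg={rtt.get('avg')}us "
              f"throttle_avg={throttle.get('avg')}ms")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # illustrative
    "group.id": "orders-processor",
    "statistics.interval.ms": 15000,         # emit stats every 15s
    "stats_cb": on_stats,
})
consumer.subscribe(["orders"])
# Stats callbacks fire from poll(), so keep polling while diagnosing.
while True:
    consumer.poll(1.0)
```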
Network & Configuration Diagnostics
- Verify Network Connectivity & Performance (Client-Broker, Broker-Broker):
- Tools: `ping`, `traceroute`, `iperf`, `netstat`, `ss`. OS network error counters.
- Why: Packet loss, high latency, or bandwidth saturation severely impact Kafka.
- Action: Test connectivity between all relevant nodes. Check for firewall issues, DNS resolution problems, MTU mismatches, switch/router issues.
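Alongside the classic tools, a scripted reachability-and-latency sweep of all brokers from a client host can catch DNS, firewall, and listener problems quickly. A stdlib-only sketch (hostnames are illustrative; this checks TCP reachability only, not advertised listeners):

```python
import socket
import time

BROKERS = [("kafka1.example.com", 9092),
           ("kafka2.example.com", 9092),
           ("kafka3.example.com", 9092)]  # illustrative endpoints

for host, port in BROKERS:
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=3):
            elapsed_ms = (time.monotonic() - start) * 1000
            print(f"{host}:{port} reachable, TCP connect took {elapsed_ms:.1f} ms")
    except OSError as exc:
        print(f"{host}:{port} UNREACHABLE: {exc}")
```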
- Review Critical Configurations for Mismatches or Suboptimal Values:
- Areas: `message.max.bytes` (broker/topic) vs. client `max.request.size`/`max.partition.fetch.bytes`. Replication factors. Security settings (SSL, SASL). `acks`. `linger.ms`, `batch.size`.
- Why: Incorrect or inconsistent configurations are a common source of problems.
- Action: Systematically review configurations on brokers, topics, producers, and consumers. Ensure consistency where needed (e.g., security protocols).
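Topic and broker configurations can be pulled programmatically for a side-by-side comparison with client settings. A sketch assuming the `confluent-kafka` Python AdminClient (topic name and the keys of interest are illustrative):

```python
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # illustrative

resource = ConfigResource(ConfigResource.Type.TOPIC, "orders")
configs = admin.describe_configs([resource])[resource].result(timeout=10)

# Keys worth cross-checking against producer/consumer settings.
for key in ("max.message.bytes", "min.insync.replicas", "retention.ms"):
    entry = configs.get(key)
    print(f"{key} = {entry.value if entry else 'not set'}")
```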
- Check for Silent Data Loss (If Suspected):
- Symptoms: Gaps in data, applications behaving unexpectedly.
- Why: Could be `acks=0` on producer, unhandled producer errors, topic misconfiguration (e.g., `min.insync.replicas` too low with `acks=all`), or bugs in client logic.
- Action: This is complex. Audit producer `acks` and error handling. Verify `min.insync.replicas`. Use tools to compare record counts/checksums if possible. This often requires deep application-level tracing.
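One concrete first step in a data-loss audit is verifying that every produce failure is actually observed rather than silently dropped. A sketch of a strict producer pattern with the `confluent-kafka` Python client (settings are illustrative, and `acks=all` only protects you if the topic's `min.insync.replicas` is sensible):

```python
from confluent_kafka import Producer

failed = []

def strict_delivery(err, msg):
    # Record every failed delivery instead of ignoring it.
    if err is not None:
        failed.append((msg.topic(), msg.partition(), str(err)))

producer = Producer({
    "bootstrap.servers": "localhost:9092",   # illustrative
    "acks": "all",                           # never acks=0 if loss matters
    "enable.idempotence": True,
})

producer.produce("orders", value=b"payload", callback=strict_delivery)
remaining = producer.flush(30)  # returns the number of records still queued

if remaining or failed:
    raise RuntimeError(f"{remaining} records unflushed, {len(failed)} failed: {failed[:5]}")
```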
Environment & External Factor Diagnostics
- Look for "Noisy Neighbor" Issues or Shared Resource Contention:
- Context: Virtualized environments, Kubernetes, shared storage/network.
- Why: Other applications or VMs consuming excessive resources can impact Kafka performance unpredictably.
- Action: Monitor resource usage at the host/hypervisor/node level. Consider resource quotas, dedicated nodes, or QoS if possible.
- Correlate with External Events & Changes:
- Examples: Recent deployments (Kafka, clients, infrastructure), OS patching, hardware changes, network maintenance, dependency service outages.
- Why: Kafka issues are often triggered by external changes.
- Action: Maintain a change log. Correlate issue start times with known events. Review deployment histories.
Correlate Metrics Across Layers
Isolated metrics rarely tell the whole story. The key to diagnosing complex issues is correlating metrics across different layers. For example, high producer latency + low broker CPU + high broker network queue time might point to a saturated broker network card or an upstream network bottleneck, not a CPU issue on the broker itself. Use dashboards that allow overlaying multiple metrics.
When to Call in the Experts
While this checklist provides a strong diagnostic framework, some Kafka production issues are deeply intricate, requiring specialized knowledge of Kafka internals, distributed systems patterns, and advanced performance analysis techniques. If you've run through this checklist and are still struggling, or if the business impact is severe, engaging expert Kafka consultants like ActiveWizards can provide a rapid path to resolution and help implement long-term stability.
Conclusion: From Chaos to Clarity
Troubleshooting Apache Kafka in production can be challenging, but a systematic diagnostic approach transforms it from a guessing game into a methodical investigation. This 15-point checklist, born from real-world consulting engagements, provides a structured path to identify root causes and restore your Kafka cluster's health. Remember to combine this checklist with robust monitoring, clear problem definition, and iterative testing.
Facing Stubborn Kafka Production Issues? ActiveWizards Can Help!
Our expert Kafka consultants specialize in diagnosing and resolving complex production problems, optimizing performance, and ensuring the stability of your critical data pipelines. Don't let Kafka issues derail your business.