Kafka Monitoring with Prometheus, Grafana, and Telegraf

Kafka monitoring with Prometheus and Grafana is most useful when it goes beyond generic host dashboards and focuses on the Kafka metrics that actually drive incidents: replication state, request latency, consumer lag, JVM behavior, and the infrastructure surrounding the brokers.

Prometheus, Grafana, and Telegraf remain a practical stack for Kafka observability because each tool solves a clear part of the problem:

Kafka and its clients expose JVM and application metrics
Prometheus scrapes and stores time-series data
Telegraf collects host and system metrics that Kafka itself does not expose
Grafana turns those metrics into dashboards, alerts, and operational context

This article reworks the older setup-oriented guide into a production-oriented monitoring pattern that still applies in 2026.

The Monitoring Architecture

The cleanest setup is usually:

expose Kafka broker and client metrics through JMX
convert those metrics into a Prometheus-friendly endpoint with a JMX exporter
collect host-level metrics with Telegraf
scrape both sources from Prometheus
visualize and alert from Grafana

That gives you two different layers of signal:

Kafka layer: broker requests, replication state, fetch behavior, producer and consumer health
Infrastructure layer: CPU saturation, disk latency, memory pressure, filesystem usage, network throughput

If you skip the infrastructure layer, you can miss the real cause of Kafka incidents. A broker may look unhealthy because of I/O wait, noisy neighbors, or network pressure rather than a Kafka configuration bug.

What to Monitor on Kafka

The highest-value Kafka metrics usually fall into four groups.

1. Broker Request Path

These metrics tell you whether the broker is keeping up with client traffic:

request rates by API
request queue time
request handler idle ratio
network processor idle ratio
produce and fetch latency
bytes in and bytes out

If these deteriorate, producers and consumers feel it quickly.
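As a rough illustration, the idle signals in this list could back a Prometheus alerting rule like the sketch below. The metric names are assumptions: they depend on how your JMX exporter rules rename Kafka's RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent MBeans. The thresholds and durations are placeholders to tune against your own baseline.

```yaml
# Hypothetical rule file (for example kafka-request-path.rules.yml).
# Assumption: the exporter publishes both idle ratios as 0-1 gauges
# under these names. Check the broker's /metrics output before use.
groups:
  - name: kafka-request-path
    rules:
      - alert: KafkaRequestHandlersBusy
        # Request handler threads are idle less than 20% of the time.
        expr: kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent < 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Request handlers on {{ $labels.instance }} are under 20% idle"
      - alert: KafkaNetworkProcessorsBusy
        # Network threads are idle less than 30% of the time.
        expr: kafka_network_socketserver_networkprocessoravgidlepercent < 0.3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Network processors on {{ $labels.instance }} are under 30% idle"
```

The `for:` window keeps a short burst from firing the alert; only a sustained drop in idle capacity does.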

2. Partition and Replication Health

These metrics tell you whether the cluster is safe, not just fast:

under-replicated partitions
offline partitions
in-sync replica counts
leader elections
replica fetch lag

Fast throughput means little if replicas are falling behind or partitions are going offline during failures.
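A minimal sketch of alerting rules for the two most critical signals in this list follows. The metric names assume a JMX exporter mapping that lowercases the ReplicaManager and KafkaController MBean names, the runbook URLs are placeholders, and the `for:` durations are illustrative rather than recommended values.

```yaml
# Hypothetical rule file (for example kafka-replication.rules.yml).
# Assumption: metric names follow a lowercased MBean naming scheme.
groups:
  - name: kafka-replication
    rules:
      - alert: KafkaUnderReplicatedPartitions
        # Any broker reporting under-replicated partitions for 5 minutes.
        expr: sum by (instance) (kafka_server_replicamanager_underreplicatedpartitions) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has under-replicated partitions"
          runbook_url: "https://example.com/runbooks/kafka-under-replicated"  # placeholder
      - alert: KafkaOfflinePartitions
        # max() because only the active controller is expected to
        # report a non-zero value for this gauge.
        expr: max(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "One or more partitions have no active leader"
          runbook_url: "https://example.com/runbooks/kafka-offline-partitions"  # placeholder
```

Offline partitions get a shorter window and a higher severity because they mean data is unavailable, while under-replication means the cluster has lost redundancy but is still serving traffic.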

3. Consumer Health

Many Kafka incidents are actually consumer incidents. Monitor:

consumer lag
rebalance frequency
poll and fetch behavior
records consumed rate
commit latency and failures

Consumer lag should always be interpreted alongside throughput and partition skew. A single lag number without that context is often misleading. For a metric-focused companion article, see kafka-monitoring-key-metrics-guide.
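One way to avoid alerting on a bare lag number is to require that lag is both large and still growing. The sketch below assumes each consumer application runs its own JMX exporter and publishes the client's `records-lag-max` attribute under the hypothetical name `kafka_consumer_fetch_manager_records_lag_max` with a `client_id` label. The 10000-record threshold and the 15-minute window are placeholders.

```yaml
# Hypothetical rule file (for example kafka-consumers.rules.yml).
# Assumption: the metric name and client_id label come from your own
# exporter mapping for consumer client metrics.
groups:
  - name: kafka-consumers
    rules:
      - alert: KafkaConsumerLagGrowing
        # Fires only when lag is above the threshold AND its trend over
        # the last 15 minutes is upward, so a large but draining backlog
        # does not page anyone.
        expr: |
          max by (client_id) (kafka_consumer_fetch_manager_records_lag_max) > 10000
          and
          max by (client_id) (deriv(kafka_consumer_fetch_manager_records_lag_max[15m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer {{ $labels.client_id }} is falling further behind"
```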

4. JVM and Host Signals

Kafka is still a JVM application running on real machines. Monitor:

heap usage and GC pauses
open file descriptors
disk throughput and disk latency
page cache pressure
network throughput and packet errors
CPU usage and I/O wait

These infrastructure signals are where Telegraf earns its keep.
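For the host side, a sketch of two alerting rules on Telegraf data might look like this. It assumes Telegraf's cpu and disk inputs are enabled and exposed to Prometheus with the usual `measurement_field` naming. The mount point `/var/lib/kafka` and both thresholds are assumptions to replace with your own values.

```yaml
# Hypothetical rule file (for example kafka-hosts.rules.yml).
# Assumption: metrics come from Telegraf's cpu and disk inputs and the
# Kafka log directory lives on the /var/lib/kafka mount.
groups:
  - name: kafka-hosts
    rules:
      - alert: KafkaHostHighIOWait
        # Percentage of CPU time spent waiting on I/O, across all cores.
        expr: cpu_usage_iowait{cpu="cpu-total"} > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High I/O wait on {{ $labels.instance }}"
      - alert: KafkaDataDiskFillingUp
        # Filesystem holding Kafka log segments is more than 80% full.
        expr: disk_used_percent{path="/var/lib/kafka"} > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Kafka data disk on {{ $labels.instance }} is above 80%"
```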

Why Prometheus, Telegraf, and Grafana Still Work Well Together

This stack remains relevant because it is simple and composable.

Prometheus

Prometheus gives you pull-based collection, service discovery, alert rules, and a query model that works well for infrastructure and Kafka metrics. It is especially good at tracking rates, error trends, and lag over time rather than only showing raw point-in-time numbers.

Telegraf

Telegraf fills the host-metrics gap. It is useful for collecting:

CPU, memory, disk, filesystem, and network metrics
container or VM metrics
supporting signals from OS and adjacent services

Telegraf can expose those metrics on a Prometheus-compatible endpoint, which keeps your scraping model consistent.

Grafana

Grafana is where operators correlate the signals:

consumer lag rising
broker request time increasing
disk I/O wait spiking
replication health degrading

That correlation is the difference between a dashboard and an observability workflow.

A Modern Instrumentation Pattern

The older version of this article used hard-coded package versions and ZooKeeper-era examples. That ages badly. A better approach is to keep the setup generic.

Kafka JMX Export

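Kafka exposes its metrics through JMX. The Prometheus JMX exporter, attached to the broker JVM as a Java agent, translates them into an HTTP endpoint that Prometheus can scrape.

# Attach the JMX exporter agent: serve metrics on port 7071 using the rules in /etc/jmx/kafka.yml
KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/etc/jmx/kafka.yml"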

Use the same pattern for JVM-based producers, consumers, Connect workers, or stream-processing applications when they need dedicated metrics endpoints.
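The rules file passed to the agent decides which MBeans become Prometheus metrics and what they are called. The sketch below shows one possible shape for `/etc/jmx/kafka.yml`, limited to a few broker MBeans. The output metric names and the `request` and `stat` labels are choices made for this example, not names Kafka or the exporter mandates.

```yaml
# Sketch of a JMX exporter rules file for a Kafka broker.
# Metric names and labels below are illustrative choices.
lowercaseOutputName: true
rules:
  # kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
  - pattern: kafka.server<type=ReplicaManager, name=(UnderReplicatedPartitions)><>Value
    name: kafka_server_replicamanager_$1
    type: GAUGE
  # kafka.controller:type=KafkaController,name=OfflinePartitionsCount
  - pattern: kafka.controller<type=KafkaController, name=(OfflinePartitionsCount|ActiveControllerCount)><>Value
    name: kafka_controller_kafkacontroller_$1
    type: GAUGE
  # Request handler idle ratio (a meter; the one-minute rate is used here)
  - pattern: kafka.server<type=KafkaRequestHandlerPool, name=(RequestHandlerAvgIdlePercent)><>OneMinuteRate
    name: kafka_server_kafkarequesthandlerpool_$1
    type: GAUGE
  # Network processor idle ratio
  - pattern: kafka.network<type=SocketServer, name=(NetworkProcessorAvgIdlePercent)><>Value
    name: kafka_network_socketserver_$1
    type: GAUGE
  # End-to-end request time for produce and fetch requests
  - pattern: kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(Produce|FetchConsumer|FetchFollower)><>(Mean|99thPercentile)
    name: kafka_network_requestmetrics_totaltimems
    type: GAUGE
    labels:
      request: "$1"
      stat: "$2"
```

Starting from a short allow-list like this and adding rules as dashboards need them tends to keep scrape size and cardinality manageable, compared with exporting every MBean by default.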

If you are running modern Kafka in KRaft mode, the basic observability approach does not change. You still monitor broker request behavior, controller health, replication state, and client-side metrics. Metadata storage and the control plane moved from ZooKeeper to a Raft-based controller quorum; the need for strong telemetry stayed the same.

Prometheus Scrape Jobs

Your Prometheus config should treat Kafka and host metrics as separate targets.

scrape_configs:
  # Kafka broker metrics served by the JMX exporter agent on each broker
  - job_name: kafka-brokers
    static_configs:
      - targets:
          - broker-1.internal:7071
          - broker-2.internal:7071
          - broker-3.internal:7071

  # Host metrics served by Telegraf's prometheus_client output
  - job_name: telegraf-hosts
    static_configs:
      - targets:
          - broker-1.internal:9273
          - broker-2.internal:9273
          - broker-3.internal:9273
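Because the two jobs scrape different ports, their `instance` labels never match, which makes it awkward to line up Kafka and host panels for the same machine. One option, sketched below, is to copy the host part of the scrape address into a shared label. The label name `broker` is an assumption made for this example, and only one target per job is shown for brevity.

```yaml
scrape_configs:
  - job_name: kafka-brokers
    static_configs:
      - targets:
          - broker-1.internal:7071
    relabel_configs:
      # Copy the host part of "host:port" into a shared "broker" label.
      - source_labels: [__address__]
        regex: '([^:]+):\d+'
        target_label: broker
        replacement: '$1'

  - job_name: telegraf-hosts
    static_configs:
      - targets:
          - broker-1.internal:9273
    relabel_configs:
      # Same rule, so both jobs end up with broker="broker-1.internal".
      - source_labels: [__address__]
        regex: '([^:]+):\d+'
        target_label: broker
        replacement: '$1'
```

With a shared label in place, a single Grafana dashboard variable can filter Kafka and host panels at the same time.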

Telegraf Prometheus Client Output

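Expose Telegraf metrics on an HTTP endpoint on each host so Prometheus can scrape them.

# Serve collected metrics in Prometheus format on port 9273
[[outputs.prometheus_client]]
  listen = ":9273"

# CPU usage per core and in total, including I/O wait
[[inputs.cpu]]
  percpu = true
  totalcpu = true

# Memory usage
[[inputs.mem]]

# Filesystem usage per mount point
[[inputs.disk]]

# Per-device disk throughput and I/O time
[[inputs.diskio]]

# Network interface throughput and errors
[[inputs.net]]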

That gives you the minimum infrastructure layer most Kafka teams need.

Dashboards That Actually Help During Incidents

The most useful Grafana dashboards are not the prettiest ones. They are the ones that reduce diagnosis time.

A good operational dashboard usually includes:

cluster throughput and request latency
under-replicated and offline partition counts
broker CPU, disk, and network by node
consumer lag by group and topic
JVM heap and GC panels
alerts or annotations for deploys, rebalances, and broker restarts

Avoid building one giant dashboard with every available metric. Split them by use:

broker health
consumer health
capacity and infrastructure
incident drill-down
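If dashboards are kept as JSON files, Grafana's file-based provisioning can mirror that split as folders. The sketch below assumes a dashboard directory at `/var/lib/grafana/dashboards/kafka` with one subdirectory per use. The provider name, the path, and the subdirectory names are hypothetical.

```yaml
# Hypothetical file: /etc/grafana/provisioning/dashboards/kafka.yml
apiVersion: 1
providers:
  - name: kafka-dashboards
    type: file
    options:
      path: /var/lib/grafana/dashboards/kafka
      # Turn each subdirectory under "path" into a Grafana folder.
      foldersFromFilesStructure: true

# Assumed directory layout under the path above, one folder per use:
#   broker-health/
#   consumer-health/
#   capacity-and-infrastructure/
#   incident-drill-down/
```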

Common Monitoring Mistakes

Several monitoring mistakes show up repeatedly in Kafka estates:

treating consumer lag as the only health metric
ignoring disk latency while focusing on broker CPU
not watching replication health during traffic spikes
scraping only brokers and not producers, consumers, or Connect
building dashboards without alert thresholds or runbook context
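To close the gap of scraping only brokers, client-side JVMs can be added as their own jobs. The sketch below assumes Connect workers and a consumer application each run a JMX exporter agent. The hostnames, ports, and the `app` label are hypothetical.

```yaml
scrape_configs:
  # Kafka Connect workers, each with its own exporter agent
  - job_name: kafka-connect
    static_configs:
      - targets:
          - connect-1.internal:7072
          - connect-2.internal:7072

  # JVM-based producers and consumers that expose client metrics
  - job_name: kafka-clients
    static_configs:
      - targets:
          - orders-consumer.internal:7073
        labels:
          # Extra label so dashboards can group by application.
          app: orders-consumer
```

Keeping clients in separate jobs from brokers makes it easier to tell, at query time, whether a latency or lag change started on the broker side or the client side.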

Another common mistake is benchmarking and monitoring separately. In practice, they should reinforce each other. If you are tuning Kafka performance, also see kafka-benchmarking-methodologies-and-tools-for-performance.

What “Good” Looks Like

You do not need a perfect observability platform on day one. You need a stack that answers these questions quickly:

Are brokers healthy?
Is the cluster safe?
Are consumers keeping up?
Is the bottleneck inside Kafka or outside it?
Which host, topic, partition, or client is responsible?

Prometheus, Telegraf, and Grafana still form a solid answer when they are used with that operational goal in mind.

Conclusion

Kafka monitoring works best when you combine Kafka-native metrics with infrastructure telemetry and then visualize them in a way that supports incident response, capacity planning, and tuning. Prometheus stores the time series, Telegraf fills the host-level gap, and Grafana gives operators the correlation layer.

The stack is not new, but it is still effective. What changed is the standard for using it: modern Kafka teams need production-grade dashboards, alerting, and diagnosis flows rather than a handful of setup commands.

Need Help Building a Production Monitoring Stack for Kafka?

ActiveWizards helps teams design Kafka observability, alerting, and dashboarding systems for reliable production operations.



Igor Bobriakov

Data Engineering
