Kafka – Components and Description
Apache Kafka is a distributed event-streaming platform designed for high-throughput, fault-tolerant data processing. Below are its key components and their descriptions:
1. Kafka Components
1.1. Topics
- Definition: A topic is a category or feed name to which records are published.
- Function: Producers write data to topics, and consumers read data from them.
- Features:
- Topics are partitioned for scalability.
- Messages within a topic are ordered in each partition.
- Use Case: For example, a topic named “OrderEvents” can hold all events related to orders.
1.2. Partitions
- Definition: Each topic is split into partitions for parallelism and scalability.
- Function:
- Each partition is an ordered sequence of records.
- Records in a partition are identified by an offset (a unique sequential ID).
- Features:
- Partitions allow Kafka to scale horizontally by distributing data across multiple brokers.
- Producers can assign data to specific partitions using keys.
- Use Case: Different partitions can be processed by multiple consumers simultaneously for load balancing.
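The key-based routing described above can be sketched as follows. Kafka's default partitioner actually uses a murmur2 hash of the key bytes; `md5` here is only a deterministic stand-in for illustration:

```python
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition. Kafka's default partitioner
    uses a murmur2 hash; md5 here is an illustrative stand-in."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Records with the same key always land in the same partition,
# which is what preserves per-key ordering.
p1 = choose_partition(b"order-42", 6)
p2 = choose_partition(b"order-42", 6)
assert p1 == p2
```

Because the partition is a pure function of the key, all events for one order are appended to the same partition log and are therefore read back in order.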
1.3. Brokers
- Definition: A broker is a Kafka server that stores data and serves client requests.
- Function:
- A Kafka cluster comprises multiple brokers.
- Each broker handles partitions assigned to it.
- Features:
- Brokers replicate data for fault tolerance.
- The cluster is coordinated by a controller, elected from among the brokers.
- Use Case: When a broker fails, leader election promotes replicas on surviving brokers for the affected partitions.
1.4. Producers
- Definition: Producers are clients that publish messages to Kafka topics.
- Function:
- Send messages to specific topics.
- Determine the partition for messages (based on keys or round-robin).
- Features:
- Support asynchronous writes for high throughput.
- Acknowledgment (acks) settings to control delivery guarantees.
- Use Case: A payment service sending transaction events to a topic.
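The producer-to-partition relationship can be pictured with a toy, in-memory stand-in for a partition (not real Kafka client code): the producer appends a record, and the broker assigns it the next sequential offset.

```python
class PartitionLog:
    """Toy stand-in for one Kafka partition: an append-only,
    offset-indexed sequence of records."""
    def __init__(self):
        self.records = []

    def append(self, value) -> int:
        self.records.append(value)
        return len(self.records) - 1  # offset of the new record

log = PartitionLog()
assert log.append("payment-created") == 0   # first record -> offset 0
assert log.append("payment-settled") == 1   # offsets grow sequentially
```

A real producer would call something like a `send()` method on a client library and receive the partition and offset back in an acknowledgment.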
1.5. Consumers
- Definition: Consumers are clients that read messages from Kafka topics.
- Function:
- Subscribe to topics and process messages.
- Track offsets to record their progress and avoid reprocessing messages.
- Features:
- Support for consumer groups to enable load balancing.
- Each message is processed by one consumer in a group.
- Use Case: An analytics service processing real-time order data.
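Offset tracking is the mechanism that lets a consumer resume where it left off. A minimal sketch, using a plain list as the partition log (the `Consumer` class here is illustrative, not a real client API):

```python
class Consumer:
    """Toy consumer that polls a list-based partition log and
    commits its offset so it can resume without reprocessing."""
    def __init__(self, log):
        self.log = log
        self.committed = 0  # next offset to read

    def poll(self, max_records=10):
        batch = self.log[self.committed:self.committed + max_records]
        self.committed += len(batch)  # commit after processing
        return batch

log = ["click-1", "click-2", "click-3"]
c = Consumer(log)
assert c.poll(2) == ["click-1", "click-2"]
assert c.poll(2) == ["click-3"]   # resumes from the committed offset
```

In real Kafka, committed offsets are stored broker-side (in the `__consumer_offsets` topic), so a restarted consumer can pick up from the same position.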
1.6. Consumer Groups
- Definition: A set of consumers working together to process messages from a topic.
- Function:
- Each partition in a topic is consumed by only one consumer in the group.
- Ensures high availability and scalability in message consumption.
- Use Case: Multiple services consuming the same topic but dividing the workload.
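The "one partition per consumer within a group" rule can be sketched with a simple round-robin assignment (Kafka's built-in assignors, such as range and cooperative-sticky, are more sophisticated):

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: each partition goes to exactly one
    consumer in the group. A sketch, not Kafka's actual assignor."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

a = assign_partitions([0, 1, 2, 3], ["svc-a", "svc-b"])
assert a == {"svc-a": [0, 2], "svc-b": [1, 3]}
```

Note the consequence: with more consumers than partitions, the extra consumers sit idle, so the partition count caps a group's parallelism.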
1.7. ZooKeeper (Legacy Component)
(Note: Replaced by KRaft, Kafka's built-in Raft-based metadata quorum, in newer versions)
- Definition: Manages and coordinates the Kafka cluster.
- Function:
- Maintains metadata about the brokers and topics.
- Handles leader election for partitions.
- Use Case: Ensures proper partition-to-broker mapping.
1.8. Kafka Connect
- Definition: A tool for integrating Kafka with external systems (e.g., databases, file systems).
- Function:
- Provides ready-to-use connectors for importing/exporting data.
- Use Case: Syncing a database with a Kafka topic.
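Connectors are configured declaratively rather than coded. Below is a representative JDBC source connector definition; the connector class is Confluent's JDBC source, but the name, hostname, database, and column values are placeholders, and exact property names vary by connector:

```json
{
  "name": "orders-db-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db-host:5432/shop",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "db-"
  }
}
```

Posting this JSON to the Connect REST API starts a connector that polls new rows and publishes them to a Kafka topic, with no custom producer code.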
1.9. Kafka Streams
- Definition: A Java library for building stream-processing applications.
- Function:
- Processes data in real-time directly from Kafka topics.
- Supports operations like filtering, aggregations, joins, etc.
- Use Case: Aggregating website clicks for analytics.
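Kafka Streams itself is a Java DSL, but the core idea behind the click-aggregation use case, a keyed count analogous to `groupByKey().count()`, can be sketched language-neutrally:

```python
from collections import defaultdict

def aggregate_clicks(events):
    """Count clicks per page, analogous to a Kafka Streams
    groupByKey().count() over a click-event topic."""
    counts = defaultdict(int)
    for page, _user in events:
        counts[page] += 1
    return dict(counts)

clicks = [("/home", "u1"), ("/cart", "u2"), ("/home", "u3")]
assert aggregate_clicks(clicks) == {"/home": 2, "/cart": 1}
```

The real library maintains such counts incrementally and fault-tolerantly as records arrive, backed by state stores and changelog topics, rather than over a finished list.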
1.10. Schema Registry
(Optional, part of Confluent Kafka ecosystem)
- Definition: Manages schemas for the data being streamed.
- Function:
- Ensures compatibility between producers and consumers.
- Avoids schema-related errors.
- Use Case: Enforcing Avro/JSON schema validation for event data.
2. Kafka Architecture Workflow
- Producers send messages to a topic.
- Kafka distributes these messages across partitions.
- Brokers store and manage these partitions.
- Consumers read messages from topics (using consumer groups for load balancing).
- Kafka ensures fault tolerance through replication and leader election.
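The workflow above can be condensed into a toy end-to-end simulation: an in-memory stand-in for brokers and partition logs, not real Kafka client code (Kafka actually hashes keys with murmur2; `md5` is a deterministic stand-in):

```python
import hashlib

NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]  # broker-side logs

def produce(key: bytes, value: str):
    """Route by key hash, append, return (partition, offset)."""
    p = int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_PARTITIONS
    partitions[p].append(value)
    return p, len(partitions[p]) - 1

p, off = produce(b"order-1", "created")
produce(b"order-1", "paid")           # same key -> same partition
assert partitions[p] == ["created", "paid"]  # per-key order preserved
```

A consumer group would then divide the three partition logs among its members and track an offset into each, as sketched in the earlier sections.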
Kafka’s design makes it suitable for high-throughput, fault-tolerant, and real-time data processing use cases like log aggregation, event sourcing, and stream processing.
In Short
Topics
A topic is a category or stream of messages where data is stored. Topics are split into partitions; each topic has at least one partition, and each partition holds messages in an immutable, ordered sequence. On disk, a partition is stored as a sequence of segment files.
Partition
Topics can have multiple partitions, allowing Kafka to handle large volumes of data efficiently. Each partition allows parallel processing and scalability.
Partition Offset
Every message in a partition has a unique identifier known as the offset. The offset is a sequential ID for each message, ensuring the order of messages within the partition.
Replicas of Partition
Replicas are backup copies of a partition. Follower replicas don't handle client read or write operations; they exist to ensure data availability and prevent loss if the broker hosting the partition fails.
Brokers
Brokers are the servers responsible for storing and managing the published data. Each broker may host zero or more partitions per topic. If a topic has N partitions and the cluster has N brokers, each broker can host exactly one of them. With more brokers than partitions (N + M), the extra brokers host no partitions for that topic. With fewer brokers than partitions (N – M), each broker hosts several partitions, which can lead to uneven load distribution if the assignments are not balanced.
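The broker/partition counting argument above can be made concrete with a round-robin placement sketch (Kafka's actual placement also spreads replicas across brokers and racks):

```python
def distribute(num_partitions, brokers):
    """Round-robin placement of a topic's partitions across brokers.
    A sketch of the idea, not Kafka's actual placement algorithm."""
    placement = {b: [] for b in brokers}
    for p in range(num_partitions):
        placement[brokers[p % len(brokers)]].append(p)
    return placement

# N partitions, N brokers: one partition each.
assert distribute(3, ["b0", "b1", "b2"]) == {"b0": [0], "b1": [1], "b2": [2]}
# More partitions than brokers: brokers host several partitions each.
assert distribute(5, ["b0", "b1"]) == {"b0": [0, 2, 4], "b1": [1, 3]}
```

With 5 partitions on 2 brokers, one broker carries three partitions and the other two, illustrating the uneven-load case described above.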
Kafka Cluster
A Kafka cluster consists of multiple brokers. Clusters can be expanded without causing downtime. The Kafka cluster is responsible for managing the persistence and replication of messages, ensuring data is safely stored across the system.
Producers
Producers are responsible for publishing messages to one or more Kafka topics. They send data to Kafka brokers, which append the messages to the corresponding partition’s segment file. Producers can target specific partitions for their messages or allow Kafka to determine the partition.
Consumers
Consumers subscribe to one or more topics and retrieve data from Kafka brokers. They pull messages from brokers, consuming the messages that have been published.
Leader
The leader is the broker responsible for handling all read and write operations for a particular partition. Each partition has one leader broker at any given time.
Follower
Followers are brokers that replicate the data from the leader. They do not handle read or write operations but remain synchronized with the leader’s data. If the leader fails, one of the followers automatically takes over as the new leader. Followers behave like normal consumers, pulling messages and maintaining their local data stores.
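The failover behavior described above, promoting a follower when the leader dies, can be sketched as choosing the first surviving in-sync replica (a sketch of the idea, not Kafka's actual controller logic):

```python
def elect_leader(replicas, in_sync, failed_leader):
    """Pick a new leader for a partition: the first surviving
    in-sync replica (ISR). Illustrative only."""
    candidates = [r for r in replicas if r != failed_leader and r in in_sync]
    if not candidates:
        raise RuntimeError("no in-sync replica available")
    return candidates[0]

# b1 (the leader) fails; b2 is in sync, so it takes over.
assert elect_leader(["b1", "b2", "b3"], {"b1", "b2"}, "b1") == "b2"
```

Restricting the choice to in-sync replicas is what prevents data loss: an out-of-sync follower might be missing recently acknowledged messages.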