Kafka Basics and Core Concepts with Examples

In this article, I will discuss Kafka basics and core concepts, with examples, covering each concept in detail.

What is a Messaging System?

A messaging system is responsible for transferring data from one application to another, so the applications can focus on the data without worrying about how to share it. Distributed messaging is based on the concept of reliable message queuing: messages are queued asynchronously between client applications and the messaging system. Two messaging patterns are available: point-to-point and publish-subscribe (pub-sub). Most messaging systems today follow the pub-sub pattern.

Point-to-Point Messaging System

In a point-to-point system, messages are persisted in a queue. One or more consumers can consume the messages in the queue, but a particular message can be consumed by at most one consumer. Once a consumer reads a message from the queue, it disappears from that queue. A typical example is an order processing system: each order is processed by exactly one order processor, but multiple order processors can work at the same time. The following diagram depicts the structure.

Point to Point Messaging System

Publish-Subscribe Messaging System

In the publish-subscribe system, messages are persisted in a topic. Unlike the point-to-point system, consumers can subscribe to one or more topics and consume all the messages in those topics. In the publish-subscribe system, message producers are called publishers, and message consumers are called subscribers.

Publish-Subscribe Messaging System

What is Kafka?

Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data. It enables you to pass messages from one endpoint to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss.

Is Kafka a message queue?

It certainly can act as a message queue, but it is not limited to that. It can also act as a FIFO queue, a pub-sub messaging system, and a real-time streaming platform. Because of Kafka's durable storage capability, it can even be used as a database.

Kafka is commonly used for real-time streaming data pipelines, i.e., to transfer data between systems, build systems that transform continuously flowing data, and build event-driven systems.


Kafka Core Concepts

Message

A message is the atomic unit of data for Kafka. Let’s say that we are building a log monitoring system and pushing each log record into Kafka. Our log message is a JSON with the following structure.

{
    "level": "ERROR",
    "message": "NullPointerException"
}

When we push the above JSON into Kafka, we are actually pushing one message. Kafka saves this JSON as a byte array, and that byte array is a message for Kafka. This is the atomic unit: a JSON with two keys, "level" and "message". But that does not mean you can't push anything else into Kafka. You can push a String, an Integer, a JSON with a different schema, and anything else, but we generally push different types of messages into different topics (we will get to know what a topic is shortly).

Messages may have an associated "key," which is a piece of metadata used to determine the destination partition (we will get to partitions soon as well) for a message.

Topic

Topics, as the name suggests, are the logical categories of messages in Kafka: a stream of the same type of data. Going back to our previous example of the logging system, let's say our system generates application logs, Windows logs, and database logs and pushes them to Kafka for other services to consume. These three types of logs can logically be divided into three topics: appLogs, windowsLogs, and dbLogs. We can create these three topics in Kafka: whenever there's an app log message, we push it to the appLogs topic; Windows logs go to the windowsLogs topic; and database logs go to the dbLogs topic. This way, we have logical segregation between messages, much like having different tables for holding different types of data.
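
For illustration, the three topics could be created with Kafka's command-line tool (a sketch; the broker address is an assumption, and older Kafka versions use a --zookeeper flag instead of --bootstrap-server, as in the cluster example later in this article):

kafka-topics --create --bootstrap-server localhost:9092 --topic appLogs --partitions 1 --replication-factor 1
kafka-topics --create --bootstrap-server localhost:9092 --topic windowsLogs --partitions 1 --replication-factor 1
kafka-topics --create --bootstrap-server localhost:9092 --topic dbLogs --partitions 1 --replication-factor 1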

Partitions

The partition is the core concept behind Kafka's scaling capabilities. Let's say that our system becomes really popular, and there are now millions of log messages per second. The node holding the appLogs topic can no longer hold all the data coming in. We initially solve this by adding more storage to our node, i.e., vertical scaling. But as we all know, vertical scaling has its limits; once that threshold is reached, we need to scale horizontally, which means adding more nodes and splitting the data between them. When we split the data of a topic into multiple streams, we call each of those smaller streams a "partition" of that topic.

[Figure: a single topic split into 4 partitions]

In this figure, a single topic has 4 partitions, and each of them holds a different set of data. The blocks you see are the different messages in each partition. Imagine the topic as an array: due to memory constraints, we have split the single array into 4 smaller arrays. When we write a new message to a topic, the relevant partition is selected, and the message is appended at the end of that array.

An offset for a message is the index of that message in the array. The numbers on the blocks in the picture denote the offset: the first block is at offset 0, and the last block is at offset (n-1). The system's performance also depends on how we set up partitions. (Note that in Kafka, it is not an actual array; the array is just an analogy.)

Offsets

The offset is simple: it is the sequence number of a message in a partition. Offsets are assigned as messages arrive in a partition, and once assigned, they never change; they are immutable. This sequencing means that Kafka stores messages in order of arrival within a partition. The first message gets offset 0, the next offset 1, and so on. But remember, there is no global offset across partitions; offsets are local to each partition. So, to locate a specific message, you need to know three things: the topic name, the partition number, and the offset number. With these three, you can directly locate any message.
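
To make this concrete, here is a minimal sketch using the Kafka Java client that locates a message directly by topic, partition, and offset. The broker address, topic name, partition number, and offset 42 are all illustrative assumptions:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class LocateMessage {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Topic name + partition number + offset uniquely identify a message.
            TopicPartition partition = new TopicPartition("appLogs", 0);
            consumer.assign(Collections.singletonList(partition)); // direct assignment, no group needed
            consumer.seek(partition, 42); // jump straight to offset 42 (illustrative)

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}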

Producer/Publishers in Kafka

A producer is the Kafka client that publishes messages to a Kafka topic. One of the producer's core responsibilities is deciding which partition to send each message to. Depending on various configurations and parameters, the producer decides the destination partition; a minimal producer sketch follows the list below.

  • No key specified => When no key is specified in the message, the producer picks partitions so as to balance the total number of messages across all partitions.
  • Key specified => When a key is specified with the message, the producer uses consistent hashing to map the key to a partition. Don't worry if you don't know what consistent hashing is; in short, it's a hashing mechanism where the same key always produces the same hash, and it minimizes the redistribution of keys when re-hashing occurs (for example, when a node is added to or removed from the cluster). So, let's say that in our logging system we use the source node ID as the key; then the logs for the same node will always go to the same partition. This is very relevant for the ordering guarantees of messages in Kafka, as we will see shortly.
  • Partition specified => You can hardcode the destination partition as well.
  • Custom partitioning logic => We can write rules that decide the destination partition.
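
Here is a minimal producer sketch using the Kafka Java client that illustrates the first three cases above; the broker address, topic name, keys, and partition number are illustrative assumptions:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: the producer spreads messages across partitions.
            producer.send(new ProducerRecord<>("appLogs",
                    "{\"level\":\"INFO\",\"message\":\"started\"}"));

            // Key specified: every message with key "node-1" lands on the same
            // partition, preserving its relative order.
            producer.send(new ProducerRecord<>("appLogs", "node-1",
                    "{\"level\":\"ERROR\",\"message\":\"NullPointerException\"}"));

            // Partition specified explicitly (partition 2 here, purely illustrative).
            producer.send(new ProducerRecord<>("appLogs", 2, "node-1",
                    "{\"level\":\"WARN\",\"message\":\"disk almost full\"}"));
        }
    }
}
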
Consumer/Subscriber in Kafka

So far, we have produced messages; to read those messages, we use a Kafka consumer. A consumer reads messages from partitions in order. So, if messages 1, 2, 3, and 4 were inserted into a topic, the consumer will read them in that same order. Since every message has an offset, every time a consumer reads a message, it stores the offset value in Kafka (or, in older setups, Zookeeper), denoting the last message it read. So, if a consumer node goes down, it can come back and resume from the last read position. Also, if at any point a consumer needs to go back in time and read older messages, it can do so by simply resetting its offset position.
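
Here is a minimal consumer sketch using the Kafka Java client; the broker address, group name, and topic name are illustrative assumptions. It commits offsets explicitly after processing, which is how the "resume from the last read position" behavior described above is achieved:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LogConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "log-processor");           // consumer group (illustrative name)
        props.put("enable.auto.commit", "false");         // we commit offsets ourselves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("appLogs"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync(); // store the last-read position so we can resume after a crash
            }
        }
    }
}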

Consumer Group

A consumer group is a collection of consumers who work together to read messages on a topic. There are some very interesting concepts here. Let’s go through them.

1. Fan-out => Multiple consumer groups can subscribe to a single topic. For example, let's say you are building an OTP service.

[Figure: one OTP topic consumed by both the SMS Service and Email Service consumer groups]

Now, you need to send both text and email OTP. Your OTP service can put the OTP in Kafka, and then the SMS Service consumer group and Email Service consumer group can both receive the message and send the SMS and email out.
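
A quick way to see this fan-out behavior is with Kafka's console consumer. Run these two commands in separate terminals (the broker address and the otp topic name are illustrative assumptions); because the group names differ, each group receives its own copy of every message:

kafka-console-consumer --bootstrap-server localhost:9092 --topic otp --group sms-service
kafka-console-consumer --bootstrap-server localhost:9092 --topic otp --group email-service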

2. Order guarantee => We have already seen that a topic can be partitioned and that multiple consumers can consume from the same topic. So how do you maintain the order of messages on the consumer end? Good question. Within a consumer group, a partition cannot be read by more than one consumer; the consumer group mechanism ensures that only one consumer in the group reads from a given partition. Let me explain.

[Figure: keyed messages distributed across 3 partitions, with one consumer per partition]

So, the producer produces 6 messages. Each message is a key-value pair: for key "A" the value is "1", for "C" the value is "1", for "B" the value is "1", for "C" the value is "2", for "A" the value is "2", and for "B" the value is "2". Our topic has 3 partitions, and because of the consistent hashing mechanism, messages with the same key always go to the same partition: all the messages with key "A" are grouped together, and the same holds for "B" and "C". Now, since each partition has only one consumer, messages are received in order. So the consumer will receive A1 before A2 and B1 before B2, and thus the order is maintained.

Going back to our logging-system example, the keys are the source node IDs. All the logs for node-1 will always go to the same partition, and since messages for a given node always land in the same partition, their order is maintained.
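
To see why the same key always lands on the same partition, here is a deliberately simplified sketch of key-based partition selection (an assumption for illustration; Kafka's real default partitioner hashes the serialized key bytes with murmur2, but the principle is the same):

public class KeyPartitioning {
    // Same key -> same hash -> same partition, every time.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The six keyed messages from the example above, in production order.
        for (String key : new String[] { "A", "C", "B", "C", "A", "B" }) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
    }
}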

This would not be possible if the same partition had multiple consumers from the same group. Consumers in different groups, however, can read the same partition, and each consumer group still receives the messages in order.

So, for 3 partitions, you can have at most 3 active consumers in a group; with 4 consumers, one consumer would sit idle. With 3 partitions and 2 consumers, one consumer reads from one partition and the other reads from two. If one of them goes down, the surviving consumer ends up reading from all three partitions, and when consumers are added back, the partitions are again split between the consumers. This reassignment is called rebalancing.

Note: A single consumer can read messages from one or more partitions, but within a consumer group, more than one consumer cannot consume the same partition.

[Figure: partition assignment within a consumer group]

Broker in Kafka

A broker is a single Kafka server. Brokers receive messages from producers, assign offsets to them, and then commit them to the partition log, which basically means writing the data to disk.

Cluster in Kafka

A Kafka cluster is a group of broker nodes (more than one Kafka server) working together to provide scalability, availability, and fault tolerance. One of the brokers in a cluster acts as the controller, which assigns partitions to brokers, monitors for broker failures, and performs other administrative tasks. Running a single Kafka broker is possible, but it doesn't give you all the benefits that a Kafka cluster can, such as data replication.


Zookeeper manages the brokers in the cluster. There may be multiple Zookeeper nodes in an ensemble; in fact, the recommendation is three to five, an odd number so that there is always a majority, and as low a number as possible to conserve resources.

In a cluster, partitions are replicated across multiple brokers, depending on the topic's replication factor, to provide failover capability. For a topic with replication factor 3, each partition of that topic lives on 3 different brokers. When a partition is replicated across 3 brokers, one of the brokers acts as the leader for that partition, and the other two are followers. Data is always written to the leader broker and then replicated to the followers. This way, we lose neither data nor availability: if the leader goes down, another leader is elected.

Let’s look at a practical example. I am running a 5-node Kafka cluster locally, and I run this command.

kafka-topics --create --zookeeper zookeeper:2181 --topic applog --partitions 5 --replication-factor 3

The above command will:

  • Create a topic named applog
  • Create 5 partitions for that topic
  • Replicate the data of each of those 5 partitions onto 3 brokers
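
You can verify what was created with the describe command (same CLI and flag style as the create command above):

kafka-topics --describe --zookeeper zookeeper:2181 --topic applog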

The below screenshot describes the topic we just created.

[Screenshot: partition, leader, and replica details for the applog topic]

Let's take partition 0. The leader node for this partition is node 2, and the data for this partition is replicated on nodes (brokers) 2, 5, and 1. So, one partition is replicated on 3 brokers, and this behavior is repeated for all 5 partitions. Also, notice that the leader node for each partition is different: to utilize the nodes properly, the Kafka controller distributed the partitions evenly across all nodes. You can also observe that the replicas are evenly distributed and no node is overloaded. The controller broker does all of this with the help of Zookeeper.

Zookeeper

Kafka does not function without Zookeeper (at least for now; the Kafka project is replacing it with the built-in KRaft consensus mechanism). Zookeeper works as the central configuration and consensus management system for Kafka. It tracks the brokers, topics, partition assignments, leader elections, and basically all the metadata about the cluster.

Zookeeper is a top-level Apache project that acts as a centralized service. It is used to maintain naming and configuration data and to provide flexible and robust synchronization within distributed systems. Zookeeper keeps track of the status of the Kafka cluster nodes, as well as Kafka topics, partitions, and so on.

Features

To sum up, Kafka is a very fast, highly reliable, distributed pub-sub messaging system. Today, Kafka is used by many big tech companies like Netflix, Twitter, LinkedIn, and Uber all across the globe. It is popular because of its exceptional speed and performance, and it is really easy to set up and use.


Kafka has very low latency and can handle several million messages per second, which makes it well suited for real-time data processing. It is written in Scala and Java and has very high throughput.

Kafka Vs Traditional Messaging (RabbitMQ)

Now, if we compare Kafka with traditional messaging systems such as RabbitMQ, we can see the following differences:

  • Kafka has a record retention feature: it retains records even after they are consumed, for 7 days by default.
  • RabbitMQ pushes messages to its consumers, while Kafka consumers pull messages from the broker.
  • Kafka supports multiple subscribers: the same message can be consumed by multiple consumers as long as they do not belong to the same consumer group. In RabbitMQ, by contrast, a message can only be consumed once, and once consumed, it is deleted.
  • Kafka can process several million messages per second with very few nodes, whereas RabbitMQ can also process millions of messages per second but requires more than 30 nodes to do so.
  • Kafka provides message ordering within a partition (messages are sent to topics with a key), whereas RabbitMQ does not support this feature.

I hope you enjoyed this article on Kafka basics and core concepts and found the examples useful.
