Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. (Wikipedia).

Brokers: is a Kafka Server. A Kafka cluster is composed of multiple brokers.

  • Each broker is identified with its integer ID.
  • Connected to any broker, you will be connected to the entire cluster.
  • Each broker has some kind of data, not all data.
  • Each broker contains certain topic partitions.
  • A good number of the broker is 3 for a cluster.

Kafka Broker Discovery:

  • Every Kafka broker is also called a “bootstrap server”. It means you only need to connect to one broker and you will be connected to the entire cluster.
  • Each broker knows about all the information about the other.

Topic: is a particular stream of data.

  • It is similar to the table in Database like MySQL.
  • A topic is identified by its name

Partitions: The topics are split into partitions. Partitions have numbers and it starts from 0 and increases to 1,2,3,…

  • Each partition is ordered.
  • Each partition not going to have the same number of messages. It’s independence.
  • Can change the number of partitions in a topic when topic created.
  • Each message within a partition gets an incremental id, called offset.

Offsets: When you push a message to the topic. The first message to partition 0 is going to have the offset 0. The next is going to 1,2,3,… The same with partition 1,2,…

  • The data is kept only for a limited time.
  • Offset only have a meaning for a specific partition.
  • Order is guaranteed only within a partition.
  • The data is written to a partition can’t be changed.
Topic, Partition, Offset (Source: Apache Kafka)

Topic Replication Factor: Topic Replication Factor is the reason makes Kafka distributed. If a machine goes down, Kafka Cluster still receives the message and send messages to the consumer normally. That is the Topic Replication Factor. The replication basically allowed us to ensure that data wouldn’t be lost.

  • Topics should have a replication factor > 1.
  • If a broker is down, another broker can serve the data.
  • One is going to be the leader and other is going to be the follower (ISR -> in-sync replica).
  • The producer sends a message to Kafka Cluster. That message is going to be sent to partition leader and copied to the partition follower.
  • At the same time, only one broker can be a leader. The only leader can receive and serve data.

Example: Like the image below:

  • The replication factor is 2. It means one message from the producer is going to send to Partition0 of Broker0, and then it is going to be copied to Partition0 of Broker1 and Partition0 of Broker2.
  • When the Broker1 is down, Broker0 and Broker2 can still serve the data.
  • When the Broker0 is the leader partition, which is down. Broker1 or Broker2 will become the leader. When Broker0 is back, It will be the leader again.
Topic Replication Factor (Source: dzone.com)

Producer: Producers write data to topics.

  • Broker failures, Producers will auto recover.
  • If you send the data without a key, the data will be sent round robin to each Broker in Cluster. (Load Balancing).

Acknowledgment (ACK): is a synonym for confirmation. When the Broker receives the data, the Broker can send an ACK to Producer to confirm message had received. Kafka has 3 status of Acknowledgment:

  • ACK = 0. The producer just sends the data and not wait for ack. (Most speed, Worst integrity-Data Lost).
  • ACK = 1. The producer will wait for leader acknowledgment (Low speed, Limited data lost). If the data sent to the leader and the leader down before the leader syn data with the follower, the data is going to be lost.
  • ACK = all. If one message sent successfully, It must be received by the leader and the follower partition. (Worst speed, High integrity-No data lost)

Message Keys: Key can be string, number,…

  • key=null: data is sent round robin (Broker0 then Broker1 then Broker2)
  • key!=null: all messages for that key will always go to the same partition.
  • A key is sent if you need message ordering for a specific field.

Consumer: read data from a topic.

  • Consumers know which broker to read from, if one broker failures, consumers know how to recover.
  • Consumers read from multiple partitions. Data is read in order within each partition.
  • There is no guarantee across the order between two partitions. (consumer will read some message in Partition0 and read some message in Partition1,…)

Consumer Groups:

(Source: cloudera.com)
  • If two consumer is set to a group, each consumer within a group reads from exclusive partitions.
  • If you have more consumers than partitions, some consumers will be inactive.

Consumer Offsets: is a bookmarking of Kafka. It stores the offsets at which a consumer group has been reading. If a consumer dies, it will be able to read back from where it left off.

  • It will be stored in a Kafka topic and that Kafka topic is named __consumer_offsets.
  • When a consumer in a group has processed data received from Kafka, it should be committing the offsets.

Delivery Semantics for Consumers: Kafka provides 3 delivery semantics:

  • At most once: Offsets are committed as soon as the message is received. If the processing goes wrong, the message will be lost.
  • At least once (usually): Offsets are committed after the message is processed. If the processing goes wrong, the message will be read again.
  • Exactly once: It can be achieved for Kafka to Kafka workflows using Kafka Streams API. For Kafka to External System workflows use an idempotent consumer.

Zookeeper: is manages brokers.

  • It helps in performing leader election for partitions.
  • It sends the notification to Kafka in case of changes.
  • Zookeeper must be runn before Kafka server start.
  • It by design operates with an odd number of servers.
  • It has a leader(Leader handle the writes from the brokers) the rest of the servers are followers (handle reads).
  • Zookeeper running folder does not store consumer offsets with Kafka.

Kafka Guarantees:

  • Messages are appended to a topic-partition in the order they are sent.
  • Consumers read messages in the order stored in a topic-partition.
  • With a replication factor of N, producers and consumers can tolerate up to N-1 brokers being down.
  • As long as the number of partitions remains constant for a topic, the same key will always go to the same partition.

This is my note after 1 month of learning Kafka on the internet.