Top Apache Kafka Interview Questions


We have compiled the most frequently asked Kafka interview questions to help you crack your next Kafka interview.

1. What is Apache Kafka?

Apache Kafka is an open-source, distributed publish-subscribe messaging system used primarily to build real-time data streaming applications.
Apache Kafka is used by the world's leading organisations because it provides high throughput and low latency at scale.
Kafka was created at LinkedIn and open-sourced in early 2011, and it has since helped organisations make their systems real time and loosely coupled.

2. Explain the key components of Kafka architecture?

The main components of the Apache Kafka architecture are:

  • Broker
  • Partitions
  • Consumer
  • Producer
  • Topic
  • Zookeeper

3. What is the role of the broker and ZooKeeper in Kafka?

The Kafka broker is one of the core components of the Kafka architecture; it is also known as a Kafka server or Kafka node. A broker manages the storage of messages in topics and acts as the mediator between producers and consumers. When more than one broker is available, they form a Kafka cluster. The broker receives the events published by producers, persists them, and serves them to the consumers that subscribe to the topic.

As mentioned above, a Kafka cluster consists of multiple brokers, and the cluster uses ZooKeeper to maintain its state. A single broker can handle thousands of read and write requests per second, and each broker can store terabytes of messages without a noticeable performance impact.

ZooKeeper is a distributed coordination service used by Kafka for managing cluster metadata, leader election, and synchronisation among brokers. Kafka brokers rely on ZooKeeper for coordination and consensus, such as maintaining information about brokers, partitions, and topics (and, in older versions, consumer offsets).

4. How does Kafka ensure fault tolerance?

As you know, Apache Kafka is a distributed streaming platform that is widely used for developing real-time streaming applications. Kafka is designed to be fault tolerant, meaning it can handle failures with minimal downtime and no data loss. One of the key aspects of fault tolerance in Kafka is its ability to handle node failures gracefully.

A Kafka cluster is a group of Kafka brokers. Brokers store messages and act as the mediator between producers and consumers. Kafka's fault tolerance is primarily achieved through replication. Each topic can be configured with a replication factor, which defines how many copies of the messages are kept across different brokers. If we set the replication factor to 3, Kafka keeps 3 copies, each on a different broker. This ensures that if a broker fails, the data can still be served from another broker that holds a replica.

To manage these replicas, Kafka uses a leader-follower model. For each partition of a topic, one broker is the leader and the others are followers. The leader handles all read and write requests for the partition, while the followers replicate the data from the leader. If the leader broker fails, one of the in-sync followers is automatically elected as the new leader; this election is coordinated through ZooKeeper.
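
As a minimal, hedged sketch of configuring replication when creating a topic with the Java Admin API (the topic name, partition count, and broker address are placeholder assumptions):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; point this at your own cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3: each partition is copied to 3 brokers,
            // so the topic tolerates the loss of up to 2 brokers.
            NewTopic topic = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}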

5. How does Kafka handle data retention?

Retention refers to how long messages are stored in Kafka topics before they become eligible for deletion. When a producer publishes a message to a topic, the message is automatically deleted once the retention period expires. The retention policy is very useful because it ensures topics do not grow indefinitely and keeps storage usage under control. The default retention period is 7 days, but it can be customised to suit business needs.

Kafka allows administrators to configure data retention as broker-wide defaults and to override them for individual topics (retention.ms and retention.bytes). The two primary broker-level parameters are:

  1. log.retention.hours: This parameter specifies the maximum time in hours that a message will be retained in a topic’s log. Messages older than this duration are eligible for deletion.
  2. log.retention.bytes: This parameter specifies the maximum size in bytes of a partition’s log. Once the log exceeds this value, Kafka begins to delete older segments to free up space.

6. How can you achieve exactly-once semantics in Kafka?

We can achieve exactly-once semantics in Kafka by using a transactional producer, so that consuming a message, processing it, and publishing the result to another Kafka topic are treated as a single atomic operation.
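
A rough sketch of the client configuration this builds on (the transactional id, group id, and broker address are assumed placeholders); the full read-process-write flow is sketched under question 18:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ExactlyOnceConfig {
    public static Properties producerProps() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer");
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer");
        // Setting a transactional.id enables the transactional producer;
        // idempotence is implied when it is set.
        p.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-processor-1");
        return p;
    }

    public static Properties consumerProps() {
        Properties c = new Properties();
        c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        c.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");
        c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringDeserializer");
        c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringDeserializer");
        // Only read messages from committed transactions.
        c.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        // Offsets are committed as part of the transaction, not automatically.
        c.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        return c;
    }
}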

7. Explain the use of the Kafka consumer group?

When we send messages in a distributed environment using a messaging system, we usually want to support two scenarios:

  • deliver each message to a specific group of consumers (one or more)
  • broadcast each message to all consumers

Kafka allows us to achieve both of these scenarios by using Consumer Groups.

Kafka Consumer groups allow multiple consumers to work together to consume messages from one or more partitions of a topic.

Let’s understand all the benefits; a minimal consumer sketch follows the list.

  • Parallel Processing:
    • Kafka topics are divided into multiple partitions, and we can create any number of partitions depending on business needs or the load on the topic. An important point is that, within a consumer group, each partition is consumed by only one consumer at any given time.
    • Kafka allows for parallel processing of messages when we divide the topic into multiple partitions and create multiple consumers within the same consumer group. Each consumer in the group processes messages from its assigned partitions independently and concurrently.
  • Scalability:
    • We can scale our Kafka consumer applications horizontally by adding consumers to a group (up to the number of partitions), increasing the overall throughput and processing capacity of our consumers.
  • Broadcasting:
    • If we give each consumer a unique group id, then every consumer receives the messages from all partitions of the topic, effectively broadcasting them.
  • Load Balancing:
    • Kafka’s consumer group mechanism ensures load balancing across consumers within the group. Each consumer is assigned a subset of partitions to consume from, and Kafka dynamically adjusts the partition assignments to distribute the workload evenly among the consumers.
  • Fault Tolerance:
    • Consumer groups provide fault tolerance by allowing multiple consumers to consume messages from the same topic. If one consumer within a consumer group fails or becomes unavailable then Kafka redistributes the partitions it was consuming to the remaining consumers in the group.
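
To make this concrete, here is a minimal consumer sketch (the topic name, group id, and broker address are placeholder assumptions). Every instance started with the same group.id splits the topic's partitions among the group; giving each instance a unique group.id instead makes every instance receive every message (the broadcast case).

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Consumers sharing this group.id split the topic's partitions between them.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}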

8. How does Kafka manage Horizontal scaling?

We can scale Kafka horizontally either by increasing the number of partitions (so more consumers can work in parallel) or by adding more brokers to the cluster. Depending on the use case, we can use either approach.

Scale out Partitions:

  • Partitioning allows Kafka to parallelise message processing by distributing the workload across multiple brokers.
  • Kafka producers can write messages to different partitions concurrently, and Kafka consumers can consume messages from different partitions in parallel.
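
As an illustration of scaling out partitions, here is a hedged Admin API sketch (the topic name, new partition count, and broker address are assumptions). Note that the partition count of a topic can only be increased, never decreased.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class ScaleOutPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Grow the "orders" topic to 12 partitions so more consumers in a group
            // can read in parallel. Existing keys may map to different partitions
            // afterwards, which affects per-key ordering for future messages.
            admin.createPartitions(
                    Collections.singletonMap("orders", NewPartitions.increaseTo(12)))
                 .all().get();
        }
    }
}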

Scale out brokers:

  • Scaling out brokers increases the cluster’s capacity to handle more message throughput and storage, as well as improving fault tolerance and resilience to failures.

9. How is load balancing maintained in Kafka?

In Kafka, the producer handles load balancing by spreading the message load across a topic's partitions. Messages without a key are distributed in a round-robin (or, in newer clients, sticky) fashion, while messages with the same key always go to the same partition, which preserves their relative order within that partition.
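
A small illustrative sketch (the topic name, keys, and broker address are assumptions): records sent without a key are spread across partitions for load balancing, while records sharing a key always land in the same partition, which is what preserves their relative order.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LoadBalancedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: the partitioner spreads these records across partitions to balance load.
            producer.send(new ProducerRecord<>("orders", "order-created"));

            // Same key: both records go to the same partition, so their order
            // is preserved for consumers of that partition.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order-created"));
            producer.send(new ProducerRecord<>("orders", "customer-42", "order-shipped"));
        }
    }
}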

10. What are the differences between MQ and Kafka?

MQ: A Message Queue (MQ) is an asynchronous service-to-service communication pattern used in microservices architectures. In an MQ, messages are queued until they are processed and then deleted, and each message is processed by only one consumer.

Kafka: It is a fault-tolerant, high-throughput pub-sub messaging system built for real-time ingestion and processing of streaming data. Unlike a traditional MQ, messages are retained for a configurable period rather than deleted on consumption, so multiple consumer groups can read the same data independently.

11. How does Kafka guarantee message ordering within a partition?

Kafka guarantees message ordering within a partition by using offsets. Each message is published to a Kafka topic, which is persistent storage divided into partitions, and each partition is an ordered, immutable sequence of messages. This means that if messages are sent from the producer in a specific order, the broker writes them to the partition in that order and all consumers read them back in the same order. Naturally, ordering is easier to enforce on a single-partition topic than on its multi-partition siblings.

12. What is the purpose of Kafka Streams?

Kafka Streams leverages the Kafka producer and consumer libraries and Kafka's built-in capabilities to provide a lightweight, scalable, and fault-tolerant framework for stream processing. It treats data as an unbounded, continuous, real-time flow of records, with the following characteristics (a minimal topology sketch follows the list):

  • A single Streams application both consumes and produces
  • Performs complex processing
  • Stateful stream processing
  • Exactly-once processing
  • Supports stateless operations
  • Far fewer lines of code than hand-written Kafka consumers and producers
  • Built-in threading and parallelism
  • Interacts with only a single Kafka cluster
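
A minimal topology sketch (the application id and topic names are assumptions) that consumes one topic, transforms each value, and produces to another:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read from "orders", transform each value, write to "orders-uppercased".
        KStream<String, String> source = builder.stream("orders");
        source.mapValues(value -> value.toUpperCase())
              .to("orders-uppercased");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}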

13. What is the significance of the Kafka schema registry?

The Schema Registry is an external process that runs on a server outside of your Kafka cluster. It is essentially a database for the schemas used in your Kafka ecosystem, and it handles the distribution and synchronisation of schemas to producers and consumers by caching a copy of each schema locally. The Schema Registry provides the following (a producer configuration sketch follows the list):

  • Schema Evolution
  • Backward and Forward Compatibility
  • Data Consistency and Validation
  • Versioning and History
  • Security and Governance
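
As a rough sketch of how a producer is typically wired to a Schema Registry using Confluent's Avro serializer (this assumes Confluent's kafka-avro-serializer dependency is on the classpath; the registry URL and topic are placeholder assumptions):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class AvroProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers/looks up schemas in the registry
        // and embeds the schema id in each message it serialises.
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // Assumed local Schema Registry endpoint.
        props.put("schema.registry.url", "http://localhost:8081");

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            // An Avro GenericRecord (or a generated Avro class) would be sent here;
            // omitted for brevity in this sketch.
        }
    }
}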

14. Mention the APIs provided by Apache Kafka?

Apache Kafka provides five major APIs:

  1. Producer: The Kafka Producer API is used to publish messages to Kafka topics. The producer can be configured to suit our business requirements.
  2. Consumer: The Kafka Consumer API is used to consume or poll messages from a list of Kafka topics. Consumers can be configured to read from specific partitions, consume from specific offsets, or automatically commit offsets.
  3. Streams: A Kafka Streams API is used to build real-time stream processing applications on top of Kafka. It comes with a fault-tolerant cluster architecture that is highly scalable, making it suitable for handling hundreds of thousands of messages every second.
  4. Connector: Kafka Connect is an API used to move data into and out of Kafka and can connect Kafka topics with external systems such as databases, file systems, and message queues.
  5. Admin: A Kafka Admin API is used to manage Kafka clusters, topics, and configurations. It allows for creating, deleting, altering topics, querying metadata about topics and partitions, and modifying cluster configurations.

15. What is the maximum size of a message that can be received by Apache Kafka?

Kafka has a default limit of 1 MB per message in a topic. However, this value can be adjusted based on business requirements. Administrators can modify the message.max.bytes parameter in the Kafka broker configuration file (server.properties) to increase or decrease the maximum message size allowed by the broker. Note that increasing the maximum message size has implications for memory usage and network bandwidth, so it should be adjusted cautiously and with the overall system capacity in mind.
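
The broker-side limit works together with client-side settings; a hedged sketch of the matching client configuration (the 5 MB figure is an arbitrary example):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

public class LargeMessageSettings {
    public static void main(String[] args) {
        int fiveMb = 5 * 1024 * 1024;

        Properties producer = new Properties();
        // The producer refuses to send requests larger than this, so it must be
        // raised alongside the broker's message.max.bytes.
        producer.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, fiveMb);

        Properties consumer = new Properties();
        // The consumer must be able to fetch at least one full message per partition.
        consumer.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, fiveMb);
    }
}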

16. Explain the retention period in an Apache Kafka cluster?

The default retention period of a Kafka topic is one week, meaning each message in the topic is kept for one week; Kafka purges messages that are older than that. Administrators can change the retention period per topic as required, for example with the kafka-configs.sh command-line tool or through the Admin API, as sketched below.
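
A hedged Admin API sketch of changing a topic's retention (the topic name, retention value, and broker address are assumptions); the kafka-configs.sh tool can achieve the same result:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            // retention.ms is the per-topic override; 259200000 ms = 3 days.
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "259200000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(topic, Collections.singleton(setRetention)))
                 .all().get();
        }
    }
}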

17. Explain the roles of leader and follower in Apache Kafka?

Every partition has exactly one partition leader, which handles all the read/write requests for that partition. If the replication factor is greater than 1, the additional partition replicas act as followers.
Kafka guarantees that every replica of a partition resides on a different broker, whether it is the leader or a follower, so the maximum replication factor is the number of brokers in the cluster.

Every partition follower reads messages from the partition leader and does not serve consumers itself; followers simply keep a backup of all the messages handled by the leader.

A partition follower is considered in-sync if it is reading records from the partition leader without lagging behind and without losing its connection to ZooKeeper (the maximum lag defaults to 10 seconds and the ZooKeeper timeout to 6 seconds; both are configurable).

If a partition follower is lagging behind or has lost its connection to ZooKeeper, it is considered out-of-sync.
When a partition leader shuts down for any reason (for example, a broker crash or a network failure), one of its in-sync followers becomes the new leader.
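
To inspect which broker currently leads each partition and which replicas are in sync, a small Admin API sketch (assumes a recent Kafka client; the topic name and broker address are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class ShowLeadersAndIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin
                    .describeTopics(Collections.singleton("orders"))
                    .allTopicNames().get()
                    .get("orders");
            description.partitions().forEach(p ->
                    // leader() is the broker serving reads/writes; isr() lists the
                    // replicas (including the leader) that are fully caught up.
                    System.out.printf("partition=%d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}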

18. What is a transaction in Kafka?

A Kafka transaction allows producers and consumers to perform operations atomically across multiple Kafka topics. It ensures that all the operations within a transaction are either completed successfully, or none of them are committed. This atomicity ensures data consistency and integrity.

Kafka has its own transaction coordinator, which enables atomicity for read, process, and write operations. A Kafka producer can open a transactional session, send messages within that session, and then decide either to commit or to abort the transaction. First, consider what an atomic read-process-write cycle means: a consumer reads events from one Kafka topic, the application processes or modifies them (possibly performing database operations), and the results are published to another Kafka topic. In this scenario, if a broker crash or a network failure happens while processing events, the transaction is either completed in full or rolled back entirely. This gives us exactly-once processing, where each consumed event is processed only once.
Kafka transactions play an important role in scenarios where data integrity is critical, such as financial transactions and logging systems.

A transactional producer allows us to send messages to multiple partitions and guarantees that all these writes are either committed or discarded together. This is done by grouping multiple calls into a single transaction: once a transaction is started, you call commitTransaction() or abortTransaction() to complete it.

Note: Kafka Consumers configured with isolation.level=read_committed will not consume messages from aborted transactions.
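
A condensed, hedged sketch of the read-process-write cycle described above (topic names, group id, transactional id, and broker address are assumptions):

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProcessor {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");
        cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        cProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-processor-1");
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(Collections.singleton("orders"));
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> record : records) {
                        // "Process" the message and write the result to the output topic.
                        producer.send(new ProducerRecord<>("orders-processed",
                                record.key(), record.value().toUpperCase()));
                        offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                    }
                    // Commit consumed offsets inside the same transaction, so output
                    // records and offset commits succeed or fail together.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (KafkaException e) {
                    // Abort: nothing written in this transaction becomes visible to
                    // read_committed consumers, and offsets are not committed. For
                    // simplicity this aborts on any error; fatal errors such as
                    // ProducerFencedException would require closing the producer instead.
                    producer.abortTransaction();
                }
            }
        }
    }
}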

Problems that occur when using a vanilla Kafka producer and consumer:
Vanilla Kafka producers and consumers are configured for at-least-once delivery semantics, which can result in duplicate message processing after a failure in the following ways:

  1. The producer.send() call could result in duplicate writes of a message due to its internal retries. This is addressed by the idempotent producer.
  2. We may reprocess an input message, resulting in duplicate messages being written to the output topic and violating exactly-once processing semantics. Reprocessing can happen if the stream processing application crashes after writing the output but before marking the input events as consumed. When it resumes, it consumes the same event again and writes it to the output topic again, causing a duplicate.
  3. In distributed environments, applications can crash or temporarily lose connectivity with the rest of the system. Typically, new instances are automatically started to replace the ones deemed lost. Through this process, we may end up with multiple instances processing the same input topics and writing to the same output topics, causing duplicate outputs and violating exactly-once processing semantics. We call this the problem of “zombie instances”.

19. What is the role of the transactional ID in Kafka transactions?

A transactional ID is a unique identifier assigned to a Kafka producer instance. It is used to associate producer operations with a specific transaction and lets Kafka fence off zombie producers, ensuring that all messages produced within a transaction are either committed together or aborted together.
Note: We should always give each Kafka producer instance its own unique transactional ID, even if the application has multiple instances.

20. Are Kafka transactions supported in all Kafka client libraries?

Kafka transactions are supported by most Kafka client libraries, including the Java and Python clients, among others.

The questions above are the top 20 Kafka interview questions commonly asked in interviews. If you need more Kafka interview questions and answers, please let us know in the comments.
