#Training 10: Streaming Data Pipeline with Apache Kafka

Ashila Ghassani
6 min read · Oct 22, 2023

Apache Kafka

Apache Kafka is an open-source distributed streaming platform that can be used to build real-time streaming data pipelines and applications.

Kafka Components

  1. Kafka Broker

Producer: Produces messages, sending data from the source to the broker.
Broker: Stores messages sent by the producer.
Consumer: Consumes messages stored in the broker.
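For illustration, here is a minimal producer sketch in Java, assuming a local broker on localhost:9092 and a hypothetical "student" topic:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumes a broker running locally on the default port.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Send one message to the (hypothetical) "student" topic.
            producer.send(new ProducerRecord<>("student", "id-1", "enrolled"));
        }
    }
}
```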

2. Kafka Topic

Inside the broker, there are topics, and every message in the broker has a destination. Topic: Classifies messages in the broker according to their respective purposes. For instance, we can classify which messages are for students and which are for teachers: messages related to students go into the Student topic, and messages related to teachers go into the Teacher topic.

We cannot directly delete individual messages that are in the broker. Instead, data in the broker is removed through the retention policy. For example, if we have configured from the beginning that data will be stored for 7 days in the broker, then the data will be deleted after it has been stored for 7 days.
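As a hedged sketch, a retention policy can be set when creating a topic with the Java AdminClient; the broker address, topic name, and partition/replication counts below are assumptions:

```java
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        try (AdminClient admin = AdminClient.create(
                Map.<String, Object>of("bootstrap.servers", "localhost:9092"))) {
            // Hypothetical "student" topic: 2 partitions, replication factor 1,
            // and messages retained for 7 days (604,800,000 ms).
            NewTopic topic = new NewTopic("student", 2, (short) 1)
                    .configs(Map.of("retention.ms", "604800000"));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```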

3. Kafka Partition

Within one topic, we can create several partitions; the number of partitions determines the maximum number of consumers in a group that can read the topic in parallel.

4. Kafka Consumer Group

A consumer group is used when we want each consumer to consume messages only from its own assigned partitions.

For example, when we don’t create a consumer group, each consumer consumes all partitions: consumer 1 consumes partitions 1 and 2, and consumer 2 also consumes partitions 1 and 2 (the concept is similar to pub-sub).

If we want each consumer to consume messages only from the partitions assigned to it in the broker, we can create a consumer group so that the partitions are divided among the consumers, as in the sketch below.
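A minimal consumer sketch, assuming the same local broker and hypothetical "student" topic; the group.id is also an assumption. Running two copies of this program makes Kafka split the topic’s partitions between them:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers sharing the same group.id divide the topic's
        // partitions among themselves.
        props.put("group.id", "student-processors");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("student"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d value=%s%n",
                            record.partition(), record.value());
                }
            }
        }
    }
}
```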

Note: It is better to distribute messages evenly across partitions, because when consumers run concurrently, the total processing time is shorter if each consumer has a similar amount of work.
For example, if partition 1 has only 1 message and partition 2 has 3 messages, it takes longer for the consumers to process all the messages than if partition 1 and partition 2 each hold 2 messages.

5. Kafka Cluster

In Kafka, you can run multiple brokers within one cluster, so that a single broker going down does not stop the system or cause data loss.

6. Kafka Message

A message is the unit of data sent through Kafka; each message stored in the broker consists of a key, a value, and a timestamp.

7. Connect API

Kafka Connect can act as both a producer (a source connector) and a consumer (a sink connector). Essentially, we can use existing connectors in the Kafka ecosystem, instead of writing our own producer and consumer scripts, to transfer data from the data source to the destination (e.g., a data warehouse).
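As an illustrative sketch, a connector is typically registered through Kafka Connect’s REST API (port 8083 by default); the connector name, file path, and topic below are assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Hypothetical file-source connector configuration.
        String body = """
            {
              "name": "file-source",
              "config": {
                "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                "file": "/tmp/input.txt",
                "topic": "student"
              }
            }
            """;
        HttpRequest request = HttpRequest.newBuilder()
                // Kafka Connect's REST API listens on port 8083 by default.
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```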

Hands-On Kafka:

You can check this post on how to run basic Apache Kafka:

Kafka Streams

Kafka Streams is a client library for building applications and microservices (acting as both producer and consumer), where the input and output data are stored in Kafka clusters; it is used with Java and Scala.

Example: Word Count

Each input sentence is mapped to the number of words it contains:

All is well → 3

Hello world → 2
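A minimal Kafka Streams sketch of this per-sentence word count; the "sentences" input topic and "word-counts" output topic are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class WordCountPerSentence {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "word-count-per-sentence");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Map each sentence to "<sentence> → <word count>" and write it out.
        KStream<String, String> sentences = builder.stream("sentences");
        sentences
            .mapValues(line -> line + " → " + line.trim().split("\\s+").length)
            .to("word-counts");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```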

Kafka Streams Processing Topology

  1. Kafka Streams DSL
    Common data transformation operations such as map, filter, joins, and
    aggregations.
  2. Processor API
    Lower-level; allows developers to define and connect custom processors.

Kafka Streams DSL

  • Built-in abstractions for streams and tables in the form of KStream, KTable, and GlobalKTable.
  • Declarative, functional programming style with stateless transformations (e.g. map and filter, which have no dependencies between records) and
    stateful transformations such as aggregations (e.g. count and reduce), joins (e.g. leftJoin), and windowing (e.g. session windows), as sketched below.
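A compact sketch contrasting a stateless filter with a stateful count; the "orders" topic and the "PAID" marker are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class DslSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dsl-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");

        // Stateless: filter looks at each record in isolation.
        KStream<String, String> paid = orders.filter((key, value) -> value.contains("PAID"));

        // Stateful: count maintains a running total per key in a KTable.
        KTable<String, Long> perKey = paid.groupByKey().count();
        perKey.toStream().foreach((key, count) -> System.out.println(key + " → " + count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```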

Windowing

  • Tumbling Window
    Fits cases like statistical analysis (e.g. daily/monthly reporting); see the sketch after this list.
  • Hopping Window
    Fits cases like road-traffic analysis and moving averages in trading.
  • Sliding Window
    Fits cases like system analytics.
  • Session Window
    Fits cases like user tracking (notifications/alerts).
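For example, a daily report can be built with a one-day tumbling window. A hedged Kafka Streams sketch, where the "events" topic is an assumption:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.TimeWindows;

public class DailyTumblingCount {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "daily-report");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Count events per key in fixed, non-overlapping one-day buckets:
        // a tumbling window, as used for daily reporting.
        builder.<String, String>stream("events")
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofDays(1)))
               .count()
               .toStream()
               .foreach((windowedKey, count) ->
                       System.out.println(windowedKey + " → " + count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```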

ksqlDB

ksqlDB allows you to build stream processing applications on top of Apache Kafka with the ease of building traditional applications on a relational database. Using SQL, it makes it easy to build Kafka-native applications for processing streams of real-time data.

Example :

1. Create the logical plan of the SQL query we want to run.

2. Once our logical plan is created, ksqlDB produces a script from each step of the logical plan we have made (it acts like a wrapper), as follows:
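A hedged ksqlDB sketch of such a plan, written in ksqlDB’s own SQL dialect; the stream, topic, and column names are assumptions:

```sql
-- Declare a stream over an existing Kafka topic.
CREATE STREAM student_events (id VARCHAR, action VARCHAR)
  WITH (KAFKA_TOPIC = 'student', VALUE_FORMAT = 'JSON');

-- Build a continuously updated count per action type.
CREATE TABLE actions_per_type AS
  SELECT action, COUNT(*) AS total
  FROM student_events
  GROUP BY action
  EMIT CHANGES;
```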

So ksqlDB or Kafka Streams?

Use ksqlDB when:
- New to streaming and Kafka
- Prefer SQL to writing code in Java or Scala
- Want to quicken and broaden the adoption and value of Kafka in your organization
- Prefer an interactive experience with a UI and CLI
- Use cases include enriching data; joining data sources; filtering, transforming, and masking data; identifying anomalous events
- Use case is naturally expressible by using SQL, with optional help from UDFs
- Want the power of Kafka Streams but you aren't on the JVM:
use the ksqlDB REST API from Python, Go, C#, JavaScript, and shell

Use Kafka Streams when:
- Prefer writing and deploying JVM applications in Java or Scala; for example, due to people's skills or the tech environment
- Use case is not naturally expressible through SQL, for example, finite state machines
- Building microservices
- Must integrate with external services, or use 3rd-party libraries (but ksqlDB user-defined functions (UDFs) may help)
- Want to customize or fine-tune a use case, for example, with the Kafka Streams Processor API: custom join variants, or probabilistic counting at very large scale with Count-Min Sketch
- Need queryable state, which ksqlDB doesn't support

So that’s all I can share in this post. I hope it can help or provide new insights to anyone reading this.

Thank you!! :))
