# Training 9: Introduction to Streaming Data Pipelines

Ashila Ghassani
Oct 22, 2023

Batch vs Streaming Pipelines

Batch: Data is first collected into a group. Then the parts we want are extracted and sent to their destination. This approach is easier to set up.

Streaming: Data flows non-stop, like water through a pipe, continuously moving toward its destination. This approach is harder to maintain because the data never stops flowing.

Streaming Data Pipelines

A streaming data pipeline is a set of processes and technologies designed to efficiently and continuously collect, process, and deliver data in real time or near real time.

Why use a streaming data pipeline?

Streaming data pipelines are essential when data must be processed and analyzed as it arrives, so that informed decisions can be made faster.

Streaming Data Pipeline Use Case

  1. Financial Services
    - Algorithmic trading (processing real-time market data for automated trading decisions)
    - Fraud detection (identifying fraudulent transactions as they occur)
  2. Healthcare
    - Patient monitoring (continuously tracking patient vital signs and alerting healthcare providers when anomalies appear)
  3. E-Commerce
    - Real-time recommendations (providing personalized product recommendations based on user behavior)

Disadvantages of Streaming Pipelines:
Streaming is expensive because the system has to run 24/7, since turning it off could result in data loss. If there is a surge in data, for example during a 9.9 promotional event, you also need to scale up the messaging system to handle the load.

Messaging System

What is a data pipeline for?

A data pipeline is a structured design that makes the flow of data explicit. Essentially, it is about moving data from the source to the target, i.e., the desired location.

How do we move data in batch?

For batch processing, we use the ETL concept: we create a script that retrieves data from the source (Extract), processes it according to our needs (Transform), and then sends it to the destination (Load).
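To make this concrete, here is a minimal batch ETL sketch in Python. The file names, the `amount` field, and the filter rule are all hypothetical, just to illustrate the Extract → Transform → Load steps:

```python
import csv
import json

def extract(path):
    """Extract: read raw records from the source (here, a CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: keep only the fields and rows we need."""
    return [
        {"order_id": r["order_id"], "amount": float(r["amount"])}
        for r in rows
        if float(r["amount"]) > 0  # drop invalid orders
    ]

def load(rows, path):
    """Load: write the cleaned records to the destination (here, a JSON file)."""
    with open(path, "w") as f:
        json.dump(rows, f)

if __name__ == "__main__":
    # The whole ETL runs as one job, e.g. scheduled once per day.
    load(transform(extract("orders.csv")), "orders_clean.json")
```

A scheduler such as cron or Airflow would then run this script on a fixed interval.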

How do we move data in streaming?

Unlike batch processing, where the script we build handles the entire ETL process and pulls data directly from the source, with streaming we provide a container or pipeline: the source sends its data into this pipeline, which then delivers the data to the destination.
This is called a messaging system.

What is a messaging system?

A messaging system (a.k.a. message broker) is software that enables applications, systems, and services to communicate with each other and exchange information asynchronously.
- It involves a system of sending and receiving messages
- There are two messaging models: point-to-point and pub-sub (publish-subscribe)

Point-to-point message model

A given message can be consumed by only one consumer, and it disappears from the queue as soon as it is read.
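As an illustration, here is a minimal point-to-point sketch in Python using the `pika` client for RabbitMQ (mentioned in the tools list below). The queue name and message body are assumptions, and a RabbitMQ broker is assumed to be running locally:

```python
import pika

# Connect to a RabbitMQ broker assumed to be running on localhost.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="orders")  # hypothetical queue name

# Producer side: put one message on the queue.
channel.basic_publish(exchange="", routing_key="orders", body=b"order #123")

# Consumer side: the first consumer to read and acknowledge the message
# causes the broker to remove it from the queue (point-to-point semantics).
def handle(ch, method, properties, body):
    print("received:", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="orders", on_message_callback=handle)
channel.start_consuming()
```

In practice the producer and consumer would be separate processes; they are combined here only to keep the sketch short.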

Pub-sub message model

A message published to a topic can be consumed by many consumers: every subscriber to the topic receives its own copy of the message.

The purpose of the message broker described above is to let systems communicate asynchronously. For instance, if receiver 1 is down, the data still passes through the broker first; the broker acts as temporary storage until a receiver can accept the data, ensuring nothing is lost.

Additional use case:
You want to move data from OLTP to OLAP; which message model should you use? Additional requirement: there are two OLAPs, one as the primary and the second as a backup.
Answer: Use pub-sub. With point-to-point you would have to build the system twice, whereas we only need the data delivered to two places because one of them is just a backup. With pub-sub, the sender only needs to publish one message, and it can be received by two receivers (in this case, the two OLAPs).
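To make the answer concrete, here is a toy in-memory pub-sub sketch in plain Python (not a real broker); the topic and the two OLAP subscribers are hypothetical:

```python
class Topic:
    """A toy pub-sub topic: every subscriber gets its own copy of each message."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        # One publish fans out to every subscriber.
        for callback in self.subscribers:
            callback(message)

orders = Topic()
orders.subscribe(lambda m: print("primary OLAP stores:", m))
orders.subscribe(lambda m: print("backup OLAP stores:", m))

# The OLTP side publishes once; both OLAPs receive their own copy.
orders.publish({"order_id": 123, "amount": 50.0})
```

With point-to-point semantics, the same publish would reach only one of the two OLAPs.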

Messaging System Tools

  • RabbitMQ : point-to-point
  • Google Cloud Pub/Sub : pub-sub
  • Apache Kafka : Both

Streaming Processing Frameworks

In a streaming system, the data source sends its data to the messaging system, then to the processing engine, then back to the messaging system, and finally to the target location where we want to store the data. Data transformation in the Streaming Pipeline is done at the processing engine stage.
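Here is a sketch of that consume-transform-produce flow using the `kafka-python` client; the topic names and the transformation are assumptions, and a Kafka broker is assumed at localhost:9092:

```python
from kafka import KafkaConsumer, KafkaProducer

# Read raw events from the input topic of the messaging system.
consumer = KafkaConsumer("raw-events", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for record in consumer:
    # Processing-engine stage: transform each event as it arrives.
    cleaned = record.value.decode("utf-8").strip().lower()
    # Write the transformed event back to the messaging system,
    # from where it flows on to the final target.
    producer.send("clean-events", cleaned.encode("utf-8"))
```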

1. Apache Kafka: Kafka Streams (doesn't require a separate cluster, because Kafka Streams is essentially a library embedded in your application, so there is no dedicated cluster for its processing engine): low latency, exactly-once, Kafka-centric, lightweight, suitable for Kafka-based applications.

2. Apache Flink (needs a cluster for its processing engine): low latency, exactly-once, versatile, steeper learning curve, suitable for mission-critical real-time processing.

3. Apache Spark Streaming (needs a cluster for its processing engine): moderate latency, at-least-once by default (exactly-once with effort), integrates with the Spark ecosystem, suitable if you already use Spark for batch processing. See the sketch below for an example.
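As an example of the third option, here is the classic word-count sketch using PySpark's Structured Streaming API, reading text lines from a local socket (the host, port, and word-count logic are purely illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Source: a stream of text lines from a local socket (e.g. `nc -lk 9999`).
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Transform: split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Sink: print the updated counts to the console after every micro-batch.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```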

So that’s all I can share in this post. I hope it can help or provide new insights to anyone reading this.

Thank you!! :))
