#Training 5:
Introduction to ETL, Data Lake, Data Warehouse, and Setup Environment.

Ashila Ghassani
Oct 15, 2023

Have you ever heard of ETL? ELT? Data lakes? How about data warehouses? And how do you set all of this up?

In this post, I’ll try to give some simple explanations of:
1. ETL vs ELT
2. Data Lake vs Data Warehouse
3. Environment Setup

ETL vs ELT

As we know, ETL stands for extract, transform, load.
But you’ve probably also heard of ELT. What’s the difference between ETL and ELT? Which one is better?
Let’s discuss it:

ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are data integration methods. However, each has unique characteristics and is suitable for different data needs.
The most important difference is the order of the steps:
ETL transforms the data before loading it into the target storage, while ELT loads the raw data into storage (for example, a data lake) first and transforms it afterward.
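
To make the difference in ordering concrete, here is a minimal sketch using only Python's standard library. The sample records, the table names (orders_raw, orders_clean), and the helper functions are all made up for illustration, and an in-memory SQLite database stands in for the real storage, so treat it as a toy under those assumptions rather than a production pipeline.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
raw_orders = [
    {"id": 1, "customer": " alice ", "amount": "120.50"},
    {"id": 2, "customer": "BOB", "amount": "80"},
]


def clean(row):
    """Transform step: normalize the name and cast the amount to a number."""
    return (row["id"], row["customer"].strip().title(), float(row["amount"]))


def run_etl(conn, rows):
    """ETL: transform in Python first, then load only the cleaned rows."""
    conn.execute("CREATE TABLE orders_clean (id INTEGER, customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders_clean VALUES (?, ?, ?)", [clean(r) for r in rows]
    )


def run_elt(conn, rows):
    """ELT: load the raw rows as-is, then transform inside the storage engine."""
    conn.execute("CREATE TABLE orders_raw (id INTEGER, customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (:id, :customer, :amount)", rows)
    # The transformation runs later, as SQL, on data that is already loaded.
    conn.execute(
        """
        CREATE TABLE orders_clean AS
        SELECT id,
               UPPER(SUBSTR(TRIM(customer), 1, 1))
                   || LOWER(SUBSTR(TRIM(customer), 2)) AS customer,
               CAST(amount AS REAL) AS amount
        FROM orders_raw
        """
    )


etl_conn = sqlite3.connect(":memory:")
elt_conn = sqlite3.connect(":memory:")
run_etl(etl_conn, raw_orders)
run_elt(elt_conn, raw_orders)

query = "SELECT id, customer, amount FROM orders_clean ORDER BY id"
etl_result = etl_conn.execute(query).fetchall()
elt_result = elt_conn.execute(query).fetchall()
```

Both paths end with the same cleaned table. The difference is that the ELT database still holds orders_raw, so the raw data can be transformed again in a different way later, while the ETL database only ever sees the cleaned rows.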

So, ETL vs ELT: which is better?

The decision between ETL and ELT depends on several factors, and it matters because it shapes how you store, analyze, and process your data.
You might consider the following things:

With ETL, data is easier to manage and keep compliant, so it’s a good fit if you prioritize data security.
With ELT, on the other hand, data handling is more flexible and the initial cost is lower, which gives analysts more freedom.

So, it’s up to you to decide whether ETL or ELT is more suitable :)

Data Lake vs Data Warehouse

Have you ever wondered where data is typically stored? Perhaps you’ve heard the term ‘data lake’? Or ‘data warehouse’?
So, what’s the difference between those two? Are both places where data is stored?
Let’s discuss:

Data lakes and data warehouses are both storage systems for big data, used by data engineers, data scientists, and data analysts.

So, what’s the difference?

Data Lake
A data lake is a storage repository that holds raw, unstructured, semi-structured, and structured data with minimal processing. This allows it to collect and store vast volumes of diverse data from various sources. The data lake’s scalability, flexibility, and cost-efficiency enable quick analysis for any purpose, making it ideal for machine learning.

Data Warehouse
A data warehouse is a storage repository that gathers and manages data from various sources to provide valuable business insights. The centralization and integration features of a data warehouse make it ideal for storing historical data, running analytics and reporting, and supporting decision-making.
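
Here is a small sketch of the two ideas side by side, using only Python's standard library. A temporary folder of JSON files plays the role of the data lake and an in-memory SQLite table plays the role of the data warehouse. The events, field names, and table name are hypothetical, and real systems work at a very different scale, so this only illustrates the "store raw first" versus "store structured and curated" contrast.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical events of mixed shape, as different sources might produce them.
events = [
    {"type": "purchase", "user_id": "u1", "amount": 30.0},
    {"type": "click", "user_id": "u2", "page": "/home"},
    {"type": "purchase", "user_id": "u2", "amount": 12.5},
]

# "Data lake": store every record as-is, one JSON file per event, no schema enforced.
lake_dir = Path(tempfile.mkdtemp(prefix="lake_"))
for i, event in enumerate(events):
    (lake_dir / f"event_{i}.json").write_text(json.dumps(event))

# Structure is applied only when reading, by whoever needs the data.
lake_records = [json.loads(p.read_text()) for p in sorted(lake_dir.glob("*.json"))]

# "Data warehouse": a fixed schema that holds only the curated purchase data.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE purchases (user_id TEXT NOT NULL, amount REAL NOT NULL)"
)
warehouse.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(r["user_id"], r["amount"]) for r in lake_records if r["type"] == "purchase"],
)

# A reporting-style query: total spend per user.
report = warehouse.execute(
    "SELECT user_id, SUM(amount) FROM purchases GROUP BY user_id ORDER BY user_id"
).fetchall()
```

The lake keeps the click event even though nothing uses it yet, while the warehouse table only accepts rows that match its schema and is ready for reporting queries.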

Which is better: data lake or data warehouse?

Most companies benefit from using both. Data lakes are designed to store large amounts of raw data, which makes them a good fit for machine learning. Data warehouses, on the other hand, are essential for more specific business analytics and reports.

Environment Setup

This is what you typically need to prepare first if you want to start in data engineering:

Setting Up the Environment (Hardware):
● Servers and clusters (physical or virtual machines)
● Storages (HDDs, SSDs, cloud-based)
● Networking infrastructure

Setting Up the Environment (Software):
● ETL Tools (e.g. Spark, NiFi, Talend)
● Data Integration and Workflow Orchestration (e.g. Airflow, NiFi)
● Database Systems (relational db, NoSQL, columnar)
● Big Data Technologies (e.g. Hadoop, Spark, Kafka)
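
If "workflow orchestration" sounds abstract, the core idea is running tasks in dependency order. The sketch below shows that idea with Python's standard graphlib module. It is not Airflow's or NiFi's API, and the task names are hypothetical; real orchestrators add scheduling, retries, and monitoring on top of this.

```python
from graphlib import TopologicalSorter

run_log = []


def extract():
    run_log.append("extract")


def transform():
    run_log.append("transform")


def load():
    run_log.append("load")


def report():
    run_log.append("report")


tasks = {"extract": extract, "transform": transform, "load": load, "report": report}

# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# static_order() yields task names so that every dependency comes first.
for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()
```

After the loop, run_log holds the tasks in the order they ran: extract, transform, load, report.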

Setting Up the Environment (Software, continued):
● Data Serialization Formats (e.g. JSON, Parquet, Avro, CSV)
● Version Control (Git)
● Scripting and Programming Languages (e.g. Python, Java)
● Data Quality and Governance (e.g. Talend, Trifacta)
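
To get a feel for why the serialization format matters, here is a small sketch that writes the same records as JSON and as CSV with Python's standard library and reads them back. The records are made up for illustration. Parquet and Avro are left out because they need third-party libraries.

```python
import csv
import io
import json

# Hypothetical records with mixed types: integer, string, boolean.
records = [
    {"id": 1, "name": "Alice", "active": True},
    {"id": 2, "name": "Bob", "active": False},
]

# JSON keeps numbers and booleans as numbers and booleans.
json_text = json.dumps(records)
json_back = json.loads(json_text)

# CSV is plain text: every value is written as a string.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["id", "name", "active"])
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# Reading CSV back gives strings only, so types must be restored by hand.
csv_back = list(csv.DictReader(io.StringIO(csv_text)))
```

Here json_back is equal to the original records, while the first row of csv_back has "1" and "True" as strings. That kind of type loss is one reason pipelines often prefer formats that carry a schema.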

Setting Up the Environment (DBMS):
● Relational DBMS
PostgreSQL, MySQL, Microsoft SQL Server, Oracle
● NoSQL DBMS
MongoDB, Cassandra, Redis
● Columnar DBMS
BigQuery, Redshift, Snowflake
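
The word "columnar" refers to how the data is laid out. As a rough illustration in plain Python (a toy model with made-up orders, not how any of the products above work internally), the same data can be kept row by row or column by column:

```python
# Hypothetical orders stored row by row, as a row-oriented database would.
rows = [
    {"order_id": 1, "country": "ID", "amount": 10.0},
    {"order_id": 2, "country": "SG", "amount": 25.0},
    {"order_id": 3, "country": "ID", "amount": 5.0},
]

# Row layout suits lookups that need one whole record.
order_2 = next(r for r in rows if r["order_id"] == 2)

# The same data stored column by column: one list per field.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Column layout suits analytics: an aggregate reads only the column it needs.
total_amount = sum(columns["amount"])
```

Fetching order_2 touches one full row, while total_amount only reads the amount column, which is why columnar systems are popular for analytical queries over many rows.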

Setting Up the Environment (Containerization)
● Docker
● Kubernetes

Cloud-based solutions
● GCP (Google Cloud Platform)
● AWS
● Azure

So that’s all I can share in this post. I hope it can help or provide new insights to anyone reading this.

Thank you!! :))
