Data Engineering

Build the pipelines that make data useful at scale

160h total · 9 courses · 3 stages

What you'll be able to do

  • Build batch and streaming data pipelines
  • Model and warehouse data for analytics
  • Orchestrate workflows with tools like Airflow
  • Operate reliable, monitored data systems

Before you start

  • Python fundamentals
  • Basic SQL
  • Comfort with the command line

Level 1 · Programming & SQL Mastery

Python for Data Engineering

beginner · 18h

Python beyond basics: file I/O, subprocess, requests, and writing production scripts.

  • Parse CSV & JSON from disk and APIs
  • Context managers & error handling
  • Write a file processing pipeline script
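
As a rough illustration of the kind of script this course builds toward, here is a minimal sketch that reads a CSV file, skips malformed rows, and writes JSON Lines. The function name `csv_to_json_lines` and the `amount` column are hypothetical, chosen only for this example.

```python
import csv
import json
from pathlib import Path


def csv_to_json_lines(src: Path, dest: Path) -> int:
    """Copy valid rows from a CSV file to a JSON Lines file; return rows written."""
    written = 0
    # Context managers close both files even if a row raises.
    with src.open(newline="", encoding="utf-8") as fin, \
            dest.open("w", encoding="utf-8") as fout:
        for row in csv.DictReader(fin):
            try:
                # "amount" is a hypothetical numeric column for this sketch.
                row["amount"] = float(row["amount"])
            except (KeyError, ValueError, TypeError):
                # Skip malformed rows instead of aborting the whole run.
                continue
            fout.write(json.dumps(row) + "\n")
            written += 1
    return written
```

Skipping bad rows silently is a design choice made here for brevity; a production script would more likely log or quarantine them.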

SQL: Advanced Querying & Data Modelling

beginner · 20h

CTEs, window functions, dimensional modelling, and query optimisation.

  • Window functions: ROW_NUMBER, LAG, LEAD
  • CTEs & recursive queries
  • Star schema design for a sales dataset
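
To give a feel for the window functions and CTEs listed above, the sketch below runs `ROW_NUMBER` and `LAG` inside a CTE against an in-memory SQLite database. The `sales` table, its columns, and the `rank_and_diff` helper are assumptions made for this example, and it relies on a SQLite build of 3.25 or newer for window function support.

```python
import sqlite3


def rank_and_diff(rows):
    """Return (region, day, amount, row_number, previous_amount) tuples."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table used only for this illustration.
    conn.execute("CREATE TABLE sales (region TEXT, day TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    query = """
        WITH ordered AS (
            SELECT region, day, amount,
                   ROW_NUMBER() OVER (PARTITION BY region ORDER BY day) AS rn,
                   LAG(amount)  OVER (PARTITION BY region ORDER BY day) AS prev_amount
            FROM sales
        )
        SELECT region, day, amount, rn, prev_amount
        FROM ordered
        ORDER BY region, day
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

Each region gets its own numbering, and `LAG` returns NULL (Python `None`) for the first row in a partition because there is no earlier row to look back at.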

dbt: Data Build Tool

intermediate · 14h

Transform data in your warehouse with version-controlled, tested SQL models.

  • Staging → intermediate → mart model layers
  • dbt tests: unique, not_null, relationships
  • Generate dbt docs site

Level 2 · Data Pipeline Tools

Apache Airflow: Workflow Orchestration

intermediate · 18h

DAGs, operators, sensors, XComs, and managing complex pipeline dependencies.

  • ETL DAG: extract from API → load to Postgres
  • Sensor that waits for a file to land
  • TaskGroup & dynamic task mapping

Kafka: Event Streaming

intermediate · 16h

Producers, consumers, topics, partitions, and stream processing with Kafka Streams.

  • Producer & consumer in Python
  • Topic partitioning & consumer groups
  • Stream a real-time clickstream
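
The idea behind topic partitioning and consumer groups can be sketched without a broker. The simulation below is not Kafka client code: `partition_for` and `assign_partitions` are hypothetical helpers, and `zlib.crc32` stands in for the murmur2 hash that Kafka's Java client uses for keyed messages. It only illustrates two properties: records with the same key map to the same partition, and each partition is owned by exactly one consumer in a group.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition (crc32 is a stand-in hash, not Kafka's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions across a consumer group in round-robin order."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for p in range(num_partitions):
        # Every partition goes to exactly one consumer in the group.
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment
```

Round-robin is only one possible assignment strategy; it is used here because it is the simplest to reason about.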

Apache Spark: Distributed Processing

advanced · 18h

PySpark DataFrames, SQL, UDFs, and processing large datasets at scale.

  • Load & transform 1M-row CSV with PySpark
  • Spark SQL join & aggregation
  • Write partitioned Parquet to S3

Level 3 · Cloud Data Platforms & Capstone

BigQuery & Snowflake Data Warehousing

advanced · 16h

Cloud DWH architecture, cost management, partitioning, clustering, and BI integration.

  • Load data from GCS to BigQuery
  • Partition by date, cluster by user_id
  • Connect Looker Studio for visualisation

Data Quality & Great Expectations

advanced · 10h

Validate, document, and profile your data with Great Expectations.

  • Expectation suite for a pipeline output
  • Integrate GX into Airflow DAG

Capstone: End-to-End Data Platform

advanced · 30h

Ingest → transform → model → visualise: a complete modern data stack.

  • Kafka → Spark → BigQuery pipeline
  • dbt models on BigQuery
  • Airflow orchestrating the full flow
  • Dashboard in Looker Studio or Metabase

Frequently asked

Is the Data Engineering roadmap free?

Yes. The entire Data Engineering roadmap and every curated resource are free to follow on Commit. You can track your progress, keep a daily streak, and earn a shareable certificate at no cost; there is no paywall.

How long does the Data Engineering roadmap take to complete?

About 160 hours of focused study across 9 courses and 3 stages. At roughly one hour a day that is 160 days, or a little over five months; studying more each day shortens that proportionally.

Do I get a certificate for finishing the Data Engineering roadmap?

Yes. When you complete the roadmap on Commit you receive a verifiable certificate of completion that you can add to LinkedIn and your public Commit profile as proof of what you finished.

Make it stick

Copy this roadmap into Commit and turn it into a tracked program with a streak graph, study logging, and a shareable certificate when you finish. Free forever.
