Build DE Project with Me | Data Engineering Daily

Session	Date	Day	Focus	Layer	Time
Day 1	20 June	Saturday	Azure setup — Databricks workspace, ADLS Gen2, Key Vault, service principals, RBAC	Setup	8:00 PM - 11:00 PM
Day 2	21 June	Sunday	Data source mapping — Postgres schema, API contracts, file formats, volume estimation	Design	8:00 PM - 11:00 PM
Day 3	24 June	Wednesday	Bronze ingestion — batch ELT from Postgres + partner file drops to ADLS Delta	Bronze	8:00 PM - 11:00 PM
Day 4	27 June	Saturday	Streaming ingestion — Azure Event Stream → Bronze Delta (20M EV IoT events/day)	Bronze	8:00 PM - 11:00 PM
Day 5	28 June	Sunday	Multi-format ingestion — unified reader for Parquet, CSV, JSON, XML, ORC	Bronze	8:00 PM - 11:00 PM
Day 6	1 July	Wednesday	CDC implementation — watermark-based change capture from PostgreSQL fleet DB	Bronze	8:00 PM - 11:00 PM
Day 7	4 July	Saturday	Incremental load + idempotent pipeline — high-watermark pattern + MERGE on Delta	Silver	8:00 PM - 11:00 PM
Day 8	5 July	Sunday	Silver transforms — dedup, type cast, schema enforcement, NULL handling with PySpark	Silver	8:00 PM - 11:00 PM
Day 9	8 July	Wednesday	ADF parameterised pipelines — ForEach, dynamic content, meta-driven ingestion	Silver	8:00 PM - 11:00 PM
Day 10	11 July	Saturday	SCD Type 2, late-arriving data handling, schema drift alerting	Silver	8:00 PM - 11:00 PM
Day 11	12 July	Sunday	Gold layer — fact + dimension models, Z-ordering, partition strategy, DW load	Gold	8:00 PM - 11:00 PM
Day 12	15 July	Wednesday	Pipeline audit table + DQ checks + Azure Monitor alerts + cost tracking	Monitoring	8:00 PM - 11:00 PM
Day 13	18 July	Saturday	Failure simulation — corrupt file, schema drift, duplicate events — fix them live	Failures	8:00 PM - 11:00 PM
Day 14	19 July	Sunday	End-to-end run — full pipeline from raw events to Gold warehouse tables, validate	E2E	8:00 PM - 11:00 PM
Day 15	22 July	Wednesday	Interview walk-through — system design, trade-off questions, resume bullets, mock Q&A	Interview	8:00 PM - 11:00 PM
Day 16	25 July	Saturday	Spark performance tuning — AQE, shuffle optimization, skew handling, join strategy, caching, cluster tuning	Performance	8:00 PM - 11:00 PM
Day 17	26 July	Sunday	Metadata-driven pipeline — control table design, config ingestion, reusable ingestion framework	Framework	8:00 PM - 11:00 PM
Day 18	29 July	Wednesday	CI/CD implementation — Azure DevOps pipelines and Git workflow for data engineering deployments	CI/CD	8:00 PM - 11:00 PM

🎯 The Real Business Problem

What a Senior DE is actually solving in production

📊

Fragmented Data Sources

EV telemetry arrives from IoT sensors, fleet APIs, PostgreSQL operational DB, partner CSVs, and third-party XML feeds — all in different formats and at different frequencies.

📉

No Single Source of Truth

Business teams can't trust vehicle usage reports because raw data has duplicates, late arrivals, schema drift, and no lineage tracking.

⏱️

Reporting Latency

Analysts wait 24+ hours for daily reports. Charging station utilisation data is stale before decisions can be made on fleet routing and energy pricing.

🔒

Credential Sprawl

DB passwords and API keys are hardcoded across notebooks and scripts. Every new joiner gets credentials in Slack — a security and audit nightmare.

What we build: A unified, secure, scalable Azure DE platform that ingests 50M+ records daily from all these sources, applies Medallion architecture, and delivers clean analytics-ready gold tables — with full audit trail, monitoring, and zero hardcoded secrets.

🏗️ Architecture Overview — Medallion on Azure

Sources

🗄️ PostgreSQL
Fleet & ops DB

🌊 Event Stream
IoT telemetry

📂 File Dumps
CSV/JSON/XML/ORC/Parquet

🔌 REST APIs
3rd-party EV data

→

Bronze Layer

ADLS Gen2

Raw ingestion · Delta format · No transformations
~50M records / day

→

Silver Layer

Databricks + PySpark

Dedup · Type cast · Schema enforce · CDC apply · Validate

→

Gold Layer

Data Warehouse

Business aggregates · KPIs · Reporting-ready tables

🔶 Bronze — Raw as-is · Append-only · Full history

⬜ Silver — Cleansed · Validated · Deduplicated

🔆 Gold — Aggregated · Partitioned · BI-Ready

📦 Data Scale & Volume

Data Source	Format	Volume	Frequency	Load Type
EV Telemetry (IoT)	JSON (streaming)	~20M events/day	Real-time	Streaming
Fleet & Charging DB (Postgres)	Relational tables	~5M rows/day delta	Every 2 hours	CDC / Incremental
Energy Pricing Feed (API)	JSON / XML	~500K records/day	Hourly	Full + Delta
Partner Fleet Exports	CSV / ORC	~10M rows/batch	Daily	Batch
Maintenance & Alerts	Parquet	~2M records/day	Every 6 hours	Incremental
Government Registration Data	XML	~500K records/month	Monthly	Full Load

Total Bronze ingestion: ~50M+ records per day · Retention: 3 years rolling · ~15 TB estimated annual volume

🗂️ Core Data Model (Gold Layer)

These are the final business tables you will build and populate — what analysts and BI tools consume:

🚗

dim_vehicle

Vehicle ID, make, model, battery capacity, registration, active flag — SCD Type 2 for historical tracking.

⚡

dim_charging_station

Station ID, location, charger type (AC/DC), max power, operator — geo-partitioned.

📅

dim_date / dim_time

Full date spine and time grain — used for all joins in fact tables.

📊

fact_charging_session

Session-level fact — vehicle, station, energy consumed, duration, cost, status. ~15M rows/month.

📡

fact_telemetry_hourly

Aggregated hourly vehicle telemetry — speed, battery %, odometer, temperature. Pre-aggregated from raw 20M events/day.

💰

fact_energy_cost

Station-level cost per kWh by time window — joined with sessions for profitability reporting.

☁️ Azure Services We Will Use

🔷

Azure Databricks

PySpark jobs, Delta Live Tables concepts, Unity Catalog basics, job clusters, and workflow scheduling.

🌊

Azure Event Stream

Real-time ingestion of EV IoT telemetry at 20M+ events/day into Bronze Delta table.

📦

ADLS Gen2 + Blob Storage

Tiered landing zones — hot tier for active ingestion, cool tier for historical Bronze files.

🔄

Azure Data Factory

Parameterised pipelines, ForEach loops, copy activities, triggers, integration runtimes.

🏬

Azure Synapse / Warehouse

Gold layer serving — dedicated SQL pool, external tables, PolyBase, partitioning strategy.

🔑

Azure Key Vault

All secrets — DB passwords, API keys, SAS tokens — fetched at runtime. No plaintext credentials anywhere.

🐘

PostgreSQL (Source)

Operational fleet + charging DB — we implement CDC and incremental extraction patterns on it.

🪙

Delta Lake

ACID + schema enforcement + time travel + MERGE (upsert) — foundation of all three Medallion layers.

📋

Azure Monitor + Log Analytics

Pipeline run logs, failure alerts, SLA dashboards, and cost tracking — real production observability.

🔐

Azure Active Directory / Entra ID

Service principals, RBAC for storage and Databricks, managed identities — zero-credential architecture.

⚙️ Production Engineering Patterns We'll Implement

📥

ELT Pipeline

Load raw → transform in-platform. Separation of concerns between ingestion and transformation layers.

🔁

CDC — Change Data Capture

Watermark-based and log-based CDC from PostgreSQL. Only process changed rows — not full dumps.

📈

Incremental Load with High Watermark

Track last-processed timestamp per source table. Re-runnable without reprocessing old data.
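A minimal sketch of the high-watermark idea, in plain Python rather than PySpark so it runs anywhere. The `updated_at` column, the `fleet.vehicle` table name, and the sample rows are illustrative assumptions, not the project's actual schema.

```python
from datetime import datetime

def incremental_extract(rows, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark."""
    changed = [r for r in rows if r["updated_at"] > last_watermark]
    # If nothing changed, keep the old watermark so the next run starts from it.
    new_watermark = max((r["updated_at"] for r in changed), default=last_watermark)
    return changed, new_watermark

# Hypothetical source rows and a per-table watermark store.
source_rows = [
    {"vehicle_id": 1, "updated_at": datetime(2024, 1, 1, 10)},
    {"vehicle_id": 2, "updated_at": datetime(2024, 1, 1, 12)},
    {"vehicle_id": 3, "updated_at": datetime(2024, 1, 1, 14)},
]
watermarks = {"fleet.vehicle": datetime(2024, 1, 1, 11)}

batch, watermarks["fleet.vehicle"] = incremental_extract(
    source_rows, watermarks["fleet.vehicle"]
)
print([r["vehicle_id"] for r in batch])
```

Running the extract again with the stored watermark returns an empty batch, which is what makes the load re-runnable.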

✅

Idempotent Pipeline Design

MERGE-based upserts on Delta tables. Re-running a pipeline on the same input leaves the target tables in the same state, however many times it runs.
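To illustrate why a keyed upsert is idempotent, here is a small sketch that imitates MERGE semantics (update when matched, insert when not matched) on a Python dict. It is not the Delta Lake API; `session_id` and `kwh` are hypothetical column names.

```python
def merge_upsert(target, batch, key):
    """Upsert batch rows into target, a dict keyed by the business key."""
    for row in batch:
        # Matched: overwrite existing columns. Not matched: insert the row.
        target[row[key]] = {**target.get(row[key], {}), **row}
    return target

target = {}
batch = [
    {"session_id": "s1", "kwh": 12.5},
    {"session_id": "s2", "kwh": 7.0},
]

merge_upsert(target, batch, "session_id")
snapshot = {k: dict(v) for k, v in target.items()}

# Replaying the same batch must not change the target.
merge_upsert(target, batch, "session_id")
print(target == snapshot)
```

An append-only write would have doubled the rows on the second run; keying the write on the business key is what removes that risk.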

🔂

Backfill Strategy

Re-process historical date ranges without breaking current data. Date-partition override approach.
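One way to picture the date-partition override: recompute only the partitions in the requested range and overwrite each one whole, leaving every other partition untouched. The sketch below models a partitioned table as a dict keyed by date; the `recompute` callback is a hypothetical stand-in for the real transformation.

```python
from datetime import date, timedelta

def date_range(start, end):
    """Yield each date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)

def backfill(table, recompute, start, end):
    """Overwrite only the partitions in [start, end] with recomputed rows."""
    for day in date_range(start, end):
        table[day] = recompute(day)  # replace the whole partition
    return table

table = {
    date(2024, 1, 1): ["old"],
    date(2024, 1, 2): ["old"],
    date(2024, 1, 3): ["old"],
}
backfill(table, lambda d: [f"recomputed-{d.isoformat()}"],
         date(2024, 1, 1), date(2024, 1, 2))
print(table[date(2024, 1, 3)])
```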

🧬

SCD Type 2 (Slowly Changing Dimensions)

Track vehicle and station attribute changes over time — effective_from / effective_to pattern.
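The effective_from / effective_to pattern can be sketched in plain Python as follows. This assumes an open version is marked with `effective_to = None`; the `battery_kwh` attribute and the dates are made up for illustration.

```python
from datetime import date

def apply_scd2(history, incoming, key, tracked, as_of):
    """Close changed current versions and append new ones (mutates history)."""
    # Index the currently open version of each entity.
    current = {r[key]: r for r in history if r["effective_to"] is None}
    for row in incoming:
        cur = current.get(row[key])
        if cur is not None and all(cur[c] == row[c] for c in tracked):
            continue  # tracked attributes unchanged: keep the open version
        if cur is not None:
            cur["effective_to"] = as_of  # close the old version
        new = {**row, "effective_from": as_of, "effective_to": None}
        history.append(new)
        current[row[key]] = new
    return history

dim_vehicle = []
apply_scd2(dim_vehicle, [{"vehicle_id": "V1", "battery_kwh": 60}],
           "vehicle_id", ["battery_kwh"], date(2024, 1, 1))
apply_scd2(dim_vehicle, [{"vehicle_id": "V1", "battery_kwh": 75}],
           "vehicle_id", ["battery_kwh"], date(2024, 6, 1))
print(len(dim_vehicle))
```

After the second call the table holds two versions of V1: the first closed on the change date, the second still open.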

📂

Multi-Format Ingestion

Single parameterised reader handles Parquet, CSV, JSON, XML, ORC — format detected from metadata config.
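A rough sketch of a config-driven reader, limited to the formats the Python standard library can parse (CSV, JSON lines, XML). Parquet and ORC would need Spark or a library such as pyarrow and are left out here. The `format` config key and the XML layout (one child element per record) are assumptions.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def read_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

def read_json_lines(text):
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def read_xml(text):
    # Assumes <root><record><field>value</field>...</record>...</root>.
    return [{field.tag: field.text for field in rec} for rec in ET.fromstring(text)]

READERS = {"csv": read_csv, "json": read_json_lines, "xml": read_xml}

def read_source(config, text):
    """Pick the reader from the metadata config instead of hardcoding it."""
    fmt = config["format"].lower()
    if fmt not in READERS:
        raise ValueError(f"unsupported format: {fmt}")
    return READERS[fmt](text)

print(read_source({"format": "csv"}, "id,kwh\n1,5\n"))
```

Adding a format then means registering one more function in `READERS`, with no change to the calling pipeline.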

🧱

Schema Evolution Handling

Delta Lake mergeSchema option plus explicit schema checks. Alerting on unexpected column additions from upstream.
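An explicit schema check can be as simple as comparing column sets before the write. The sketch below is independent of Delta Lake; the column names are hypothetical.

```python
def detect_drift(expected_columns, observed_columns):
    """Report columns that appeared or disappeared relative to the contract."""
    expected, observed = set(expected_columns), set(observed_columns)
    return {
        "added": sorted(observed - expected),
        "missing": sorted(expected - observed),
    }

drift = detect_drift(
    ["vehicle_id", "speed"],
    ["vehicle_id", "speed", "firmware"],
)
if drift["added"] or drift["missing"]:
    print(f"schema drift detected: {drift}")  # stand-in for a real alert
```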

🔀

Late-Arriving Data

Event-time vs processing-time reconciliation. Reprocess affected partitions from Bronze when late events arrive.
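As a sketch of the reconciliation step: group newly arrived events by event date and flag any date whose partition was already built downstream. Field names and dates are illustrative assumptions.

```python
from datetime import date, datetime

def partitions_to_reprocess(new_events, processed_dates):
    """Event dates in the new batch whose partition was already processed."""
    event_dates = {e["event_time"].date() for e in new_events}
    return sorted(event_dates & set(processed_dates))

processed = [date(2024, 3, 1), date(2024, 3, 2)]
arrivals = [
    {"vehicle_id": "V1", "event_time": datetime(2024, 3, 1, 23, 58)},  # late
    {"vehicle_id": "V2", "event_time": datetime(2024, 3, 3, 0, 5)},    # on time
]
stale = partitions_to_reprocess(arrivals, processed)
print(stale)
```

Only the 1 March partition is flagged, so the rebuild from Bronze stays limited to the partitions the late events touch.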

🗺️

System Design Decisions

Partitioning strategy, Z-ordering, file compaction, cluster sizing, cost vs latency trade-offs — all documented.

🚨 Real Pipeline Failures We'll Handle

Production pipelines break — we'll deliberately introduce and fix these:

💥

Corrupt Source File

Malformed JSON / truncated Parquet from partner drops — quarantine to dead-letter folder, alert, continue pipeline.
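A minimal sketch of the quarantine step for JSON files, assuming a local landing folder and a dead-letter folder (both paths are hypothetical; on ADLS the move would go through the storage API instead of `shutil`).

```python
import json
import shutil
from pathlib import Path

def ingest_json_files(landing, dead_letter):
    """Load every parseable file; move unparseable ones to dead_letter."""
    loaded, quarantined = [], []
    dead_letter.mkdir(parents=True, exist_ok=True)
    # sorted() materialises the listing before any file is moved.
    for path in sorted(landing.glob("*.json")):
        try:
            loaded.append(json.loads(path.read_text()))
        except json.JSONDecodeError:
            shutil.move(str(path), str(dead_letter / path.name))
            quarantined.append(path.name)  # would also trigger an alert
    return loaded, quarantined
```

The loop keeps going after a bad file, so one corrupt partner drop does not block the rest of the batch.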

🔌

Source DB Connection Drop

PostgreSQL connection timeout mid-batch — retry logic with exponential backoff, checkpoint recovery.
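Retry with exponential backoff can be sketched like this. The attempt count and delays are arbitrary choices, and the `sleep` parameter exists only so the waiting can be stubbed out in tests.

```python
import time

def retry_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying on ConnectionError with delays of 1x, 2x, 4x, ..."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

Checkpoint recovery is a separate concern: the retried function should resume from the last committed offset or watermark rather than restart the whole batch.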

📉

Schema Drift from Source

Upstream team adds a column without notice — pipeline detects schema change, raises alert, continues with known schema.

🔁

Duplicate Records

Event Stream delivers duplicate IoT events — Silver layer dedup on composite key before MERGE to Gold.
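Dedup on a composite key, keeping the most recently ingested copy, might look like the sketch below. The key columns (`vehicle_id`, `event_ts`) and the `ingested_at` ordering column are assumptions for illustration.

```python
def dedup_latest(events, key_cols, order_col):
    """Keep one event per composite key: the one with the highest order_col."""
    latest = {}
    for event in events:
        key = tuple(event[c] for c in key_cols)
        if key not in latest or event[order_col] > latest[key][order_col]:
            latest[key] = event
    return list(latest.values())

events = [
    {"vehicle_id": "V1", "event_ts": 100, "ingested_at": 1, "speed": 40},
    {"vehicle_id": "V1", "event_ts": 100, "ingested_at": 2, "speed": 40},  # duplicate
    {"vehicle_id": "V1", "event_ts": 160, "ingested_at": 3, "speed": 42},
]
unique = dedup_latest(events, ["vehicle_id", "event_ts"], "ingested_at")
print(len(unique))
```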

🕐

Pipeline SLA Breach

Bronze load takes 4h instead of 1h — Azure Monitor alert fires, on-call notification via webhook.

📊

Data Quality Failure

NULL rate exceeds threshold in critical column — pipeline publishes DQ report, halts Gold load, creates incident.
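A sketch of a NULL-rate gate: compute the rate per critical column, compare it to a threshold, and decide whether the Gold load may proceed. The 10% threshold and the column name are placeholders.

```python
def null_rate(rows, column):
    """Fraction of rows where column is missing or None."""
    if not rows:
        return 0.0
    return sum(r.get(column) is None for r in rows) / len(rows)

def dq_gate(rows, thresholds):
    """Build a DQ report and decide whether the Gold load may proceed."""
    rates = {col: null_rate(rows, col) for col in thresholds}
    failed = [col for col, rate in rates.items() if rate > thresholds[col]]
    return {"null_rates": rates, "failed": failed, "proceed_to_gold": not failed}

rows = [
    {"vehicle_id": "V1"},
    {"vehicle_id": None},
    {"vehicle_id": "V3"},
    {"vehicle_id": "V4"},
]
result = dq_gate(rows, {"vehicle_id": 0.10})
print(result["proceed_to_gold"])
```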

📡 Observability & Monitoring Setup

📋

Pipeline Run Metadata Table

Custom pipeline_audit table — records every run: source, rows read, rows written, duration, status, error message.
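To make the audit table concrete, here is a sketch that uses an in-memory SQLite database as a stand-in for the real table. The column list follows the description above; the `audited_run` wrapper and the convention that a job returns `(rows_read, rows_written)` are assumptions.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE pipeline_audit (
           run_id INTEGER PRIMARY KEY,
           source TEXT, rows_read INTEGER, rows_written INTEGER,
           duration_s REAL, status TEXT, error_message TEXT)"""
)

def audited_run(conn, source, job):
    """Run job() and record one audit row whether it succeeds or fails."""
    start = time.monotonic()
    rows_read = rows_written = 0
    status, error = "SUCCESS", None
    try:
        rows_read, rows_written = job()
    except Exception as exc:
        status, error = "FAILED", str(exc)
    conn.execute(
        "INSERT INTO pipeline_audit "
        "(source, rows_read, rows_written, duration_s, status, error_message) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (source, rows_read, rows_written, time.monotonic() - start, status, error),
    )
    conn.commit()
    return status
```

Writing the row in both the success and failure paths is the point: a failed run that leaves no trace is the one that is hardest to debug later.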

🔔

Azure Monitor Alerts

ADF pipeline failure → email + webhook alert. Databricks job timeout → PagerDuty-style notification.

📊

Data Quality Dashboard

Automated DQ checks after each Silver run — NULL %, duplicate %, out-of-range values logged to a QC table.

🏷️

Data Lineage Tracking

Source-to-Gold column lineage documented in pipeline metadata — know exactly which Bronze file a Gold record came from.

💰

Cost Tracking

Databricks DBU usage per job, ADF activity runs — weekly cost report to prevent budget overruns.

🕐

SLA Report

Daily report: did each pipeline meet its SLA window? A 30-day trend makes late-delivery patterns visible.

📅 18-Day Project Schedule (3 hours / day)

Day	Focus	Layer
Day 1	Azure setup — Databricks workspace, ADLS Gen2, Key Vault, service principals, RBAC	Setup
Day 2	Data source mapping — Postgres schema, API contracts, file formats, volume estimation	Design
Day 3	Bronze ingestion — batch ELT from Postgres + partner file drops to ADLS Delta	Bronze
Day 4	Streaming ingestion — Azure Event Stream → Bronze Delta (20M EV IoT events/day)	Bronze
Day 5	Multi-format ingestion — unified reader for Parquet, CSV, JSON, XML, ORC	Bronze
Day 6	CDC implementation — watermark-based change capture from PostgreSQL fleet DB	Bronze
Day 7	Incremental load + idempotent pipeline — high-watermark pattern + MERGE on Delta	Silver
Day 8	Silver transforms — dedup, type cast, schema enforcement, NULL handling with PySpark	Silver
Day 9	ADF parameterised pipelines — ForEach, dynamic content, meta-driven ingestion	Silver
Day 10	SCD Type 2, late-arriving data handling, schema drift alerting	Silver
Day 11	Gold layer — fact + dimension models, Z-ordering, partition strategy, DW load	Gold
Day 12	Pipeline audit table + DQ checks + Azure Monitor alerts + cost tracking	Monitoring
Day 13	Failure simulation — corrupt file, schema drift, duplicate events — fix them live	Failures
Day 14	End-to-end run — full pipeline from raw events to Gold warehouse tables, validate	E2E
Day 15	Interview walk-through — system design, trade-off questions, resume bullets, mock Q&A	Interview
Day 16	Spark performance tuning — AQE, shuffle optimization, skew handling, join strategy, caching, cluster tuning	Performance
Day 17	Metadata-driven pipeline — control table design, config ingestion, reusable ingestion framework	Framework
Day 18	CI/CD implementation — Azure DevOps pipelines and Git workflow for data engineering deployments	CI/CD

👤 Who This Is For

✅

Has 2+ years in DE / Data roles

You understand ETL basics but haven't built a production-grade multi-source pipeline at scale before.

✅

Wants an Azure-based project on resume

Most DE interviews now require Azure/cloud project experience — this gives you a real, explainable one.

✅

Struggling with system design rounds

No project = no story. This gives you architecture decisions to defend with actual trade-off reasoning.

✅

Switching to Data Engineering

Coming from Software, Analytics, or BI — this is the fastest way to build and articulate a production DE project.

❓

Frequently Asked Questions

Everything you need to know before enrolling

Will the sessions be live?

Yes — all 18 sessions are conducted live on Zoom at 8:00 PM – 11:00 PM IST on Wednesdays, Saturdays, and Sundays. You get direct access to ask questions during the session.

Will I get the recordings?

Yes — lifetime access to all recorded sessions is included. If you miss a live session, the recording will be shared with you so you can catch up at your own pace.

Is this for beginners or experienced engineers?

This project is designed for people with at least 1–2 years in a data or software role. You should be comfortable reading Python and have a basic understanding of SQL. Total beginners will find it challenging — it is production-grade work, not an intro course.

Is this suitable for people from a different domain switching to Data Engineering?

Yes — if you are coming from Software Engineering, Analytics, BI, or any tech-adjacent background and want to switch into Data Engineering, this project is ideal. It gives you a real, explainable production project to put on your resume and walk through in interviews.

Is this for someone already in the data domain but looking for growth?

Absolutely. If you are already a Data Engineer but have not built a multi-source production pipeline at scale on Azure, this project fills that gap. It covers advanced patterns like CDC, SCD Type 2, idempotent pipelines, Spark performance tuning, and CI/CD — the exact skills needed to move from 8–12 LPA to 20+ LPA roles.

Do I need an Azure account? Will it cost me anything?

Azure provides a free trial with ₹13,500 of credits for new accounts — which is more than enough to complete this project. We will guide you through setting up everything on Day 1. No additional Azure cost is required for most participants.

Will I get the full code and architecture diagram?

Yes — all code is pushed to a shared GitHub repository after each session. You will also receive the architecture diagram, documentation, and resume bullet templates you can customise and add to your profile immediately.

📦 What You Get

✅ Full Azure DE project — documented code + architecture diagram
✅ 50M+ record pipeline — real data volume, real scale decisions
✅ Medallion architecture implemented end-to-end
✅ CDC, Incremental Load, SCD Type 2, Idempotent patterns
✅ Multi-format ingestion (Parquet, CSV, JSON, XML, ORC)
✅ Azure Key Vault + service principal auth — production security

✅ Pipeline audit table + DQ checks + Azure Monitor alerts
✅ Live failure simulation + debugging session
✅ Gold data model: facts + dimensions + SCD Type 2
✅ System design walk-through — every architecture decision explained
✅ Resume bullet templates + interview storytelling framework
✅ Lifetime access to recordings + all code on GitHub

Build it once. Explain it in every interview.

₹999 One-time · Lifetime access to recordings + code

EV Intelligence Platform
End-to-End Azure Data Engineering
