🛠️ Production-Grade Azure DE Project

EV Intelligence Platform
End-to-End Azure Data Engineering

Build a real production-grade Data Engineering system processing 50M+ records from the Electric Vehicles domain — across streaming, batch, CDC, and multi-format sources — all on Microsoft Azure with Medallion Architecture.

50M+ Records
18 Days
8+ Azure Services
2h / Day
EV Domain
Live on Zoom
Language: English

₹999 One-time · Live classes with lifetime access to recorded sessions

🗓️ Session Details

📅
Days

18 Days

Duration

30+ Hours

🕗
Timing

8:00 PM to 11:00 PM

📍
Days

Wednesday, Saturday, Sunday

Calendar (Starting 20th June)
Session | Date | Day | Focus | Layer | Time
Day 1 | 20 June | Saturday | Azure setup: Databricks workspace, ADLS Gen2, Key Vault, service principals, RBAC | Setup | 8:00 PM - 11:00 PM
Day 2 | 21 June | Sunday | Data source mapping: Postgres schema, API contracts, file formats, volume estimation | Design | 8:00 PM - 11:00 PM
Day 3 | 24 June | Wednesday | Bronze ingestion: batch ELT from Postgres + partner file drops to ADLS Delta | Bronze | 8:00 PM - 11:00 PM
Day 4 | 27 June | Saturday | Streaming ingestion: Azure Event Stream → Bronze Delta (20M EV IoT events/day) | Bronze | 8:00 PM - 11:00 PM
Day 5 | 28 June | Sunday | Multi-format ingestion: unified reader for Parquet, CSV, JSON, XML, ORC | Bronze | 8:00 PM - 11:00 PM
Day 6 | 1 July | Wednesday | CDC implementation: watermark-based change capture from PostgreSQL fleet DB | Bronze | 8:00 PM - 11:00 PM
Day 7 | 4 July | Saturday | Incremental load + idempotent pipeline: high-watermark pattern + MERGE on Delta | Silver | 8:00 PM - 11:00 PM
Day 8 | 5 July | Sunday | Silver transforms: dedup, type cast, schema enforcement, NULL handling with PySpark | Silver | 8:00 PM - 11:00 PM
Day 9 | 8 July | Wednesday | ADF parameterised pipelines: ForEach, dynamic content, meta-driven ingestion | Silver | 8:00 PM - 11:00 PM
Day 10 | 11 July | Saturday | SCD Type 2, late-arriving data handling, schema drift alerting | Silver | 8:00 PM - 11:00 PM
Day 11 | 12 July | Sunday | Gold layer: fact + dimension models, Z-ordering, partition strategy, DW load | Gold | 8:00 PM - 11:00 PM
Day 12 | 15 July | Wednesday | Pipeline audit table + DQ checks + Azure Monitor alerts + cost tracking | Monitoring | 8:00 PM - 11:00 PM
Day 13 | 18 July | Saturday | Failure simulation: corrupt file, schema drift, duplicate events (fix them live) | Failures | 8:00 PM - 11:00 PM
Day 14 | 19 July | Sunday | End-to-end run: full pipeline from raw events to Gold warehouse tables, validate | E2E | 8:00 PM - 11:00 PM
Day 15 | 22 July | Wednesday | Interview walk-through: system design, trade-off questions, resume bullets, mock Q&A | Interview | 8:00 PM - 11:00 PM
Day 16 | 25 July | Saturday | Spark performance tuning: AQE, shuffle optimization, skew handling, join strategy, caching, cluster tuning | Performance | 8:00 PM - 11:00 PM
Day 17 | 26 July | Sunday | Metadata-driven pipeline: control table design, config ingestion, reusable ingestion framework | Framework | 8:00 PM - 11:00 PM
Day 18 | 29 July | Wednesday | CI/CD implementation: Azure DevOps pipelines and Git workflow for data engineering deployments | CI/CD | 8:00 PM - 11:00 PM

🎯 The Real Business Problem

What a Senior DE is actually solving in production

📊
Fragmented Data Sources

EV telemetry arrives from IoT sensors, fleet APIs, PostgreSQL operational DB, partner CSVs, and third-party XML feeds — all in different formats and at different frequencies.

📉
No Single Source of Truth

Business teams can't trust vehicle usage reports because raw data has duplicates, late arrivals, schema drift, and no lineage tracking.

⏱️
Reporting Latency

Analysts wait 24+ hours for daily reports. Charging station utilisation data is stale before decisions can be made on fleet routing and energy pricing.

🔒
Credential Sprawl

DB passwords and API keys are hardcoded across notebooks and scripts. Every new joiner gets credentials in Slack — a security and audit nightmare.

What we build: A unified, secure, scalable Azure DE platform that ingests 50M+ records daily from all these sources, applies Medallion architecture, and delivers clean analytics-ready gold tables — with full audit trail, monitoring, and zero hardcoded secrets.

🏗️ Architecture Overview — Medallion on Azure

Sources
🗄️ PostgreSQL
Fleet & ops DB
🌊 Event Stream
IoT telemetry
📂 File Dumps
CSV/JSON/XML/ORC/Parquet
🔌 REST APIs
3rd-party EV data
Bronze Layer

ADLS Gen2

Raw ingestion · Delta format · No transformations
~50M records / day
Silver Layer

Databricks + PySpark

Dedup · Type cast · Schema enforce · CDC apply · Validate
Gold Layer

Data Warehouse

Business aggregates · KPIs · Reporting-ready tables
🔶 Bronze — Raw as-is · Append-only · Full history
⬜ Silver — Cleansed · Validated · Deduplicated
🔆 Gold — Aggregated · Partitioned · BI-Ready

📦 Data Scale & Volume

Data Source | Format | Volume | Frequency | Load Type
EV Telemetry (IoT) | JSON (streaming) | ~20M events/day | Real-time | Streaming
Fleet & Charging DB (Postgres) | Relational tables | ~5M rows/day delta | Every 2 hours | CDC / Incremental
Energy Pricing Feed (API) | JSON / XML | ~500K records/day | Hourly | Full + Delta
Partner Fleet Exports | CSV / ORC | ~10M rows/batch | Daily | Batch
Maintenance & Alerts | Parquet | ~2M records/day | Every 6 hours | Incremental
Government Registration Data | XML | ~500K records/month | Monthly | Full Load

Total Bronze ingestion: ~50M+ records per day · Retention: 3 years rolling · ~15 TB estimated annual volume

🗂️ Core Data Model (Gold Layer)

These are the final business tables you will build and populate — what analysts and BI tools consume:

🚗
dim_vehicle

Vehicle ID, make, model, battery capacity, registration, active flag — SCD Type 2 for historical tracking.

dim_charging_station

Station ID, location, charger type (AC/DC), max power, operator — geo-partitioned.

📅
dim_date / dim_time

Full date spine and time grain — used for all joins in fact tables.

📊
fact_charging_session

Session-level fact — vehicle, station, energy consumed, duration, cost, status. ~15M rows/month.

📡
fact_telemetry_hourly

Aggregated hourly vehicle telemetry — speed, battery %, odometer, temperature. Pre-aggregated from raw 20M events/day.

💰
fact_energy_cost

Station-level cost per kWh by time window — joined with sessions for profitability reporting.

☁️ Azure Services We Will Use

🔷
Azure Databricks

PySpark jobs, Delta Live Tables concepts, Unity Catalog basics, job clusters, and workflow scheduling.

🌊
Azure Event Stream

Real-time ingestion of EV IoT telemetry at 20M+ events/day into Bronze Delta table.

📦
ADLS Gen2 + Blob Storage

Tiered landing zones — hot tier for active ingestion, cool tier for historical Bronze files.

🔄
Azure Data Factory

Parameterised pipelines, ForEach loops, copy activities, triggers, integration runtimes.

🏬
Azure Synapse / Warehouse

Gold layer serving — dedicated SQL pool, external tables, PolyBase, partitioning strategy.

🔑
Azure Key Vault

All secrets — DB passwords, API keys, SAS tokens — fetched at runtime. No plaintext credentials anywhere.

🐘
PostgreSQL (Source)

Operational fleet + charging DB — we implement CDC and incremental extraction patterns on it.

🪙
Delta Lake

ACID + schema enforcement + time travel + MERGE (upsert) — foundation of all three Medallion layers.

📋
Azure Monitor + Log Analytics

Pipeline run logs, failure alerts, SLA dashboards, and cost tracking — real production observability.

🔐
Azure Active Directory / Entra ID

Service principals, RBAC for storage and Databricks, managed identities — zero-credential architecture.

⚙️ Production Engineering Patterns We'll Implement

📥
ELT Pipeline

Load raw → transform in-platform. Separation of concerns between ingestion and transformation layers.

🔁
CDC — Change Data Capture

Watermark-based and log-based CDC from PostgreSQL. Only process changed rows — not full dumps.

📈
Incremental Load with High Watermark

Track last-processed timestamp per source table. Re-runnable without reprocessing old data.
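A minimal plain-Python sketch of the high-watermark idea described above; it is not the course's PySpark code. The `updated_at` column, the `fleet.vehicles` table name, and the in-memory `watermarks` dict (standing in for a control table) are assumptions made for illustration.

```python
from datetime import datetime

def incremental_extract(rows, watermarks, table):
    """Return rows changed since the stored watermark and advance it."""
    last_seen = watermarks.get(table, datetime.min)
    # Only rows strictly newer than the watermark are picked up.
    changed = [r for r in rows if r["updated_at"] > last_seen]
    if changed:
        # Advance the watermark to the newest timestamp just extracted.
        watermarks[table] = max(r["updated_at"] for r in changed)
    return changed

source_rows = [
    {"vehicle_id": 1, "updated_at": datetime(2025, 1, 1, 10, 0)},
    {"vehicle_id": 2, "updated_at": datetime(2025, 1, 1, 12, 0)},
]
watermarks = {}
first_run = incremental_extract(source_rows, watermarks, "fleet.vehicles")
second_run = incremental_extract(source_rows, watermarks, "fleet.vehicles")
```

The second run returns nothing because no row is newer than the stored watermark. A strict `>` comparison can miss rows that share the watermark timestamp but commit later, so a real pipeline would likely add a small overlap window and rely on MERGE to absorb the re-read rows.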

Idempotent Pipeline Design

MERGE-based upserts on Delta tables. Every pipeline run produces identical output regardless of how many times it runs.
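To illustrate why upserts make a pipeline safe to re-run, here is a small plain-Python sketch of MERGE semantics, with a dict standing in for the Delta table. The `session_id` key and the sample rows are hypothetical.

```python
def merge_upsert(target, batch, key):
    """Upsert batch rows into target (a dict keyed by the business key)."""
    for row in batch:
        # Matched keys are overwritten, unmatched keys are inserted.
        target[row[key]] = dict(row)
    return target

target = {}
batch = [
    {"session_id": "s1", "energy_kwh": 12.5},
    {"session_id": "s2", "energy_kwh": 30.0},
]
merge_upsert(target, batch, "session_id")
state_after_first_run = {k: dict(v) for k, v in target.items()}
merge_upsert(target, batch, "session_id")  # replay the same batch
```

After the replay the target still holds two rows and matches the state from the first run; a plain append would have produced four.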

🔂
Backfill Strategy

Re-process historical date ranges without breaking current data. Date-partition override approach.

🧬
SCD Type 2 (Slowly Changing Dimensions)

Track vehicle and station attribute changes over time — effective_from / effective_to pattern.
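The effective_from / effective_to pattern can be sketched in plain Python as follows. This is an illustration under assumed names (`vehicle_id`, `battery_kwh`, a list acting as the dimension table), not the Delta MERGE the sessions would use.

```python
from datetime import date

def apply_scd2(history, incoming, key, tracked, as_of):
    """Close the current version and open a new one when tracked attributes change."""
    current = next(
        (r for r in history if r[key] == incoming[key] and r["effective_to"] is None),
        None,
    )
    if current and all(current[c] == incoming[c] for c in tracked):
        return history  # nothing changed, keep the current version open
    if current:
        current["effective_to"] = as_of  # close out the old version
    history.append({**incoming, "effective_from": as_of, "effective_to": None})
    return history

history = []
apply_scd2(history, {"vehicle_id": 7, "battery_kwh": 60}, "vehicle_id", ["battery_kwh"], date(2025, 1, 1))
apply_scd2(history, {"vehicle_id": 7, "battery_kwh": 75}, "vehicle_id", ["battery_kwh"], date(2025, 6, 1))
apply_scd2(history, {"vehicle_id": 7, "battery_kwh": 75}, "vehicle_id", ["battery_kwh"], date(2025, 7, 1))
```

The third call changes nothing because the tracked attribute is unchanged, so the history ends with one closed row and one open row.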

📂
Multi-Format Ingestion

Single parameterised reader handles Parquet, CSV, JSON, XML, ORC — format detected from metadata config.
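A rough sketch of a config-driven reader, using only the Python standard library. It covers CSV, JSON, and XML; Parquet and ORC are left out because they need third-party libraries. The config shape (`{"format": ...}`) and the flat XML layout are assumptions for this example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def read_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

def read_json(text):
    return json.loads(text)

def read_xml(text):
    # Assumes a flat layout: <rows><row><col>value</col>...</row></rows>
    return [{child.tag: child.text for child in row} for row in ET.fromstring(text)]

READERS = {"csv": read_csv, "json": read_json, "xml": read_xml}

def read_source(config, text):
    """Pick the parser from the source's metadata config."""
    fmt = config["format"].lower()
    if fmt not in READERS:
        raise ValueError(f"Unsupported format: {fmt}")
    return READERS[fmt](text)

csv_rows = read_source({"format": "csv"}, "station_id,power_kw\nST1,50\n")
json_rows = read_source({"format": "JSON"}, '[{"station_id": "ST1", "power_kw": "50"}]')
xml_rows = read_source(
    {"format": "xml"},
    "<rows><row><station_id>ST1</station_id><power_kw>50</power_kw></row></rows>",
)
```

All three calls return the same list of row dicts, which is the point of a unified reader: downstream code does not care which format the source used. In PySpark the same dispatch would typically reduce to passing the configured format string to `spark.read.format(...)`.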

🧱
Schema Evolution Handling

Delta Lake's mergeSchema option plus explicit schema checks. Alerting on unexpected column additions from upstream.

🔀
Late-Arriving Data

Event-time vs processing-time reconciliation. Reprocess affected partitions from Bronze when late events arrive.
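One way to find which partitions need reprocessing is to compare each event's own date with the date the batch is processed. The sketch below assumes daily partitions keyed by event date and a hypothetical `event_time` field.

```python
from datetime import datetime

def partitions_to_reprocess(events, processing_date):
    """Return event-date partitions touched by events that arrived late."""
    late = set()
    for e in events:
        event_date = e["event_time"].date()
        if event_date < processing_date:
            late.add(event_date.isoformat())
    return sorted(late)

batch = [
    {"vehicle_id": 1, "event_time": datetime(2025, 3, 10, 23, 58)},
    {"vehicle_id": 2, "event_time": datetime(2025, 3, 12, 0, 4)},
    {"vehicle_id": 3, "event_time": datetime(2025, 3, 11, 18, 30)},
]
late_partitions = partitions_to_reprocess(batch, datetime(2025, 3, 12).date())
```

Here the batch processed on 12 March touches the 10 March and 11 March partitions, so only those two would be rebuilt from Bronze.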

🗺️
System Design Decisions

Partitioning strategy, Z-ordering, file compaction, cluster sizing, cost vs latency trade-offs — all documented.

🚨 Real Pipeline Failures We'll Handle

Production pipelines break — we'll deliberately introduce and fix these:

💥
Corrupt Source File

Malformed JSON / truncated Parquet from partner drops — quarantine to dead-letter folder, alert, continue pipeline.
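The quarantine-and-continue behaviour can be sketched for JSON lines in plain Python. The record fields and the in-memory `dead_letter` list (standing in for a dead-letter folder) are illustrative assumptions.

```python
import json

def parse_with_quarantine(lines):
    """Split JSON lines into parsed records and quarantined raw lines."""
    good, dead_letter = [], []
    for line_no, line in enumerate(lines, start=1):
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Keep the raw payload and the reason so the file can be replayed later.
            dead_letter.append({"line_no": line_no, "raw": line, "error": str(exc)})
    return good, dead_letter

partner_drop = [
    '{"vehicle_id": 1, "odometer_km": 1200}',
    '{"vehicle_id": 2, "odometer_km": ',  # truncated record
    '{"vehicle_id": 3, "odometer_km": 880}',
]
good, dead_letter = parse_with_quarantine(partner_drop)
```

The two valid records flow on while the truncated one is kept aside with its line number and error, which is enough to raise an alert and investigate without stopping the load.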

🔌
Source DB Connection Drop

PostgreSQL connection timeout mid-batch — retry logic with exponential backoff, checkpoint recovery.
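A minimal sketch of retry with exponential backoff. The `flaky_extract` function is a hypothetical stand-in for a database read, and the `sleep` parameter is injectable so the example runs without actually waiting. Checkpoint recovery is not shown.

```python
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call operation, doubling the wait after each ConnectionError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # retries exhausted, surface the failure
            sleep(base_delay * 2 ** (attempt - 1))

calls = {"count": 0}
waits = []

def flaky_extract():
    # Hypothetical extract that fails twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection timed out")
    return "batch loaded"

result = retry_with_backoff(flaky_extract, sleep=waits.append)
```

The recorded waits are 1 second and then 2 seconds before the third attempt succeeds. Production retry logic usually adds random jitter to the delay so that many workers do not retry at the same instant.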

📉
Schema Drift from Source

Upstream team adds a column without notice — pipeline detects schema change, raises alert, continues with known schema.
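A plain-Python sketch of "detect, alert, continue with the known schema". The expected column list, the sample row, and the `alerts` list (standing in for a real alert channel) are all hypothetical.

```python
EXPECTED_COLUMNS = ["vehicle_id", "speed_kmh", "battery_pct"]  # hypothetical schema

def split_known_schema(row, expected=EXPECTED_COLUMNS):
    """Return the row projected to the known schema plus any drifted column names."""
    unexpected = sorted(set(row) - set(expected))
    missing = sorted(set(expected) - set(row))
    projected = {col: row.get(col) for col in expected}
    return projected, unexpected, missing

alerts = []
incoming = {"vehicle_id": 7, "speed_kmh": 62, "battery_pct": 81, "tyre_psi": 34}
projected, unexpected, missing = split_known_schema(incoming)
if unexpected or missing:
    # Stand-in for a real alert channel such as a webhook.
    alerts.append(f"schema drift: unexpected={unexpected} missing={missing}")
```

The new `tyre_psi` column triggers an alert, while the projected row keeps only the known columns so downstream steps are unaffected.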

🔁
Duplicate Records

Event Stream delivers duplicate IoT events — Silver layer dedup on composite key before MERGE to Gold.
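Deduplication on a composite key can be sketched as keeping one row per key and preferring the most recently ingested copy. The key columns (`vehicle_id`, `event_ts`) and the `ingested_at` ordering column are assumptions for this example.

```python
def dedup_latest(events, key_cols, order_col):
    """Keep one event per composite key, preferring the latest order_col value."""
    latest = {}
    for event in events:
        key = tuple(event[c] for c in key_cols)
        if key not in latest or event[order_col] > latest[key][order_col]:
            latest[key] = event
    return list(latest.values())

events = [
    {"vehicle_id": 1, "event_ts": 100, "ingested_at": 1, "battery_pct": 80},
    {"vehicle_id": 1, "event_ts": 100, "ingested_at": 2, "battery_pct": 80},  # redelivery
    {"vehicle_id": 1, "event_ts": 160, "ingested_at": 3, "battery_pct": 79},
]
deduped = dedup_latest(events, ["vehicle_id", "event_ts"], "ingested_at")
```

In PySpark the same result is commonly obtained with `dropDuplicates` on the key columns, or with a `row_number()` window when a specific copy must win.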

🕐
Pipeline SLA Breach

Bronze load takes 4h instead of 1h — Azure Monitor alert fires, on-call notification via webhook.

📊
Data Quality Failure

NULL rate exceeds threshold in critical column — pipeline publishes DQ report, halts Gold load, creates incident.
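A small sketch of a NULL-rate gate in front of the Gold load. The column names and thresholds are made up for illustration; incident creation and report publishing are left out.

```python
def null_rate(rows, column):
    """Fraction of rows where column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def gold_load_allowed(rows, thresholds):
    """Return (allowed, report); any breached threshold blocks the Gold load."""
    report = {col: null_rate(rows, col) for col in thresholds}
    breaches = {col: rate for col, rate in report.items() if rate > thresholds[col]}
    return not breaches, report

silver_rows = [
    {"session_id": "s1", "energy_kwh": 12.5},
    {"session_id": "s2", "energy_kwh": None},
    {"session_id": "s3", "energy_kwh": 8.0},
    {"session_id": "s4", "energy_kwh": None},
]
allowed, report = gold_load_allowed(silver_rows, {"session_id": 0.0, "energy_kwh": 0.25})
```

Half of the `energy_kwh` values are NULL, which exceeds the 25% threshold, so the Gold load is blocked and the report records the measured rates.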

📡 Observability & Monitoring Setup

📋
Pipeline Run Metadata Table

Custom pipeline_audit table — records every run: source, rows read, rows written, duration, status, error message.
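The audit columns listed above map naturally onto a small record type. This sketch uses a dataclass and an in-memory list in place of the pipeline_audit table; the wrapper function and source names are hypothetical.

```python
import time
from dataclasses import dataclass, asdict

@dataclass
class AuditRecord:
    source: str
    rows_read: int = 0
    rows_written: int = 0
    duration_s: float = 0.0
    status: str = "RUNNING"
    error_message: str = ""

def run_with_audit(source, step, audit_log):
    """Run step() -> (rows_read, rows_written) and append one audit row."""
    record = AuditRecord(source=source)
    started = time.monotonic()
    try:
        record.rows_read, record.rows_written = step()
        record.status = "SUCCEEDED"
    except Exception as exc:  # record the failure instead of losing it
        record.status = "FAILED"
        record.error_message = str(exc)
    finally:
        record.duration_s = time.monotonic() - started
        audit_log.append(asdict(record))
    return record

def failing_step():
    raise ValueError("truncated Parquet footer")

audit_log = []
run_with_audit("partner_fleet_exports", lambda: (1000, 990), audit_log)
run_with_audit("maintenance_alerts", failing_step, audit_log)
```

Both runs leave a row in the log, including the failed one with its error message. This sketch swallows the exception for brevity; a real pipeline would probably re-raise after writing the audit row so the orchestrator still sees the failure.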

🔔
Azure Monitor Alerts

ADF pipeline failure → email + webhook alert. Databricks job timeout → PagerDuty-style notification.

📊
Data Quality Dashboard

Automated DQ checks after each Silver run — NULL %, duplicate %, out-of-range values logged to a QC table.

🏷️
Data Lineage Tracking

Source-to-Gold column lineage documented in pipeline metadata — know exactly which Bronze file a Gold record came from.

💰
Cost Tracking

Databricks DBU usage per job, ADF activity runs — weekly cost report to prevent budget overruns.

🕐
SLA Report

Daily report: did each pipeline meet its SLA window? A 30-day trend makes late-delivery patterns visible.

📅 18-Day Project Schedule (2 hours / day)

Day | Focus | Layer
Day 1 | Azure setup: Databricks workspace, ADLS Gen2, Key Vault, service principals, RBAC | Setup
Day 2 | Data source mapping: Postgres schema, API contracts, file formats, volume estimation | Design
Day 3 | Bronze ingestion: batch ELT from Postgres + partner file drops to ADLS Delta | Bronze
Day 4 | Streaming ingestion: Azure Event Stream → Bronze Delta (20M EV IoT events/day) | Bronze
Day 5 | Multi-format ingestion: unified reader for Parquet, CSV, JSON, XML, ORC | Bronze
Day 6 | CDC implementation: watermark-based change capture from PostgreSQL fleet DB | Bronze
Day 7 | Incremental load + idempotent pipeline: high-watermark pattern + MERGE on Delta | Silver
Day 8 | Silver transforms: dedup, type cast, schema enforcement, NULL handling with PySpark | Silver
Day 9 | ADF parameterised pipelines: ForEach, dynamic content, meta-driven ingestion | Silver
Day 10 | SCD Type 2, late-arriving data handling, schema drift alerting | Silver
Day 11 | Gold layer: fact + dimension models, Z-ordering, partition strategy, DW load | Gold
Day 12 | Pipeline audit table + DQ checks + Azure Monitor alerts + cost tracking | Monitoring
Day 13 | Failure simulation: corrupt file, schema drift, duplicate events (fix them live) | Failures
Day 14 | End-to-end run: full pipeline from raw events to Gold warehouse tables, validate | E2E
Day 15 | Interview walk-through: system design, trade-off questions, resume bullets, mock Q&A | Interview
Day 16 | Spark performance tuning: AQE, shuffle optimization, skew handling, join strategy, caching, cluster tuning | Performance
Day 17 | Metadata-driven pipeline: control table design, config ingestion, reusable ingestion framework | Framework
Day 18 | CI/CD implementation: Azure DevOps pipelines and Git workflow for data engineering deployments | CI/CD

👤 Who This Is For

Has 2+ years in DE / Data roles

You understand ETL basics but haven't built a production-grade multi-source pipeline at scale before.

Wants an Azure-based project on resume

Most DE interviews now require Azure/cloud project experience — this gives you a real, explainable one.

Struggling with system design rounds

No project = no story. This gives you architecture decisions to defend with actual trade-off reasoning.

Switching to Data Engineering

Coming from Software, Analytics, or BI — this is the fastest way to build and articulate a production DE project.

Frequently Asked Questions

Everything you need to know before enrolling

All 18 sessions are conducted live on Zoom from 8:00 PM to 11:00 PM IST on Wednesdays, Saturdays, and Sundays. You can ask questions directly during the session.
Lifetime access to all recorded sessions is included. If you miss a live session, the recording is shared with you so you can catch up at your own pace.
This project is designed for people with at least 1–2 years in a data or software role. You should be comfortable reading Python and have a basic understanding of SQL. Total beginners will find it challenging: it is production-grade work, not an intro course.
If you are coming from Software Engineering, Analytics, BI, or any tech-adjacent background and want to switch into Data Engineering, this project is a good fit. It gives you a real, explainable production project to put on your resume and walk through in interviews.
If you are already a Data Engineer but have not built a multi-source production pipeline at scale on Azure, this project fills that gap. It covers advanced patterns like CDC, SCD Type 2, idempotent pipelines, Spark performance tuning, and CI/CD, the skills typically expected when moving from 8–12 LPA to 20+ LPA roles.
Azure provides a free trial with ₹13,500 of credits for new accounts, which is more than enough to complete this project. We will guide you through setting up everything on Day 1. No additional Azure cost is required for most participants.
All code is pushed to a shared GitHub repository after each session. You will also receive the architecture diagram, documentation, and resume bullet templates that you can customise and add to your profile immediately.

📦 What You Get

  • ✅ Full Azure DE project — documented code + architecture diagram
  • ✅ 50M+ record pipeline — real data volume, real scale decisions
  • ✅ Medallion architecture implemented end-to-end
  • ✅ CDC, Incremental Load, SCD Type 2, Idempotent patterns
  • ✅ Multi-format ingestion (Parquet, CSV, JSON, XML, ORC)
  • ✅ Azure Key Vault + service principal auth — production security
  • ✅ Pipeline audit table + DQ checks + Azure Monitor alerts
  • ✅ Live failure simulation + debugging session
  • ✅ Gold data model: facts + dimensions + SCD Type 2
  • ✅ System design walk-through — every architecture decision explained
  • ✅ Resume bullet templates + interview storytelling framework
  • ✅ Lifetime access to recordings + all code on GitHub

Build it once. Explain it in every interview.

₹999 One-time · Lifetime access to recordings + code
