Build a real production-grade Data Engineering system processing 50M+ records from the Electric Vehicles domain — across streaming, batch, CDC, and multi-format sources — all on Microsoft Azure with Medallion Architecture.
₹999 One-time · Live classes with lifetime access to recorded sessions
18 Days
30+ Hours
8:00 PM to 11:00 PM
Wednesday, Saturday, Sunday
| Session | Date | Day | Focus | Layer | Time |
|---|---|---|---|---|---|
| Day 1 | 20 June | Saturday | Azure setup — Databricks workspace, ADLS Gen2, Key Vault, service principals, RBAC | Setup | 8:00 PM - 11:00 PM |
| Day 2 | 21 June | Sunday | Data source mapping — Postgres schema, API contracts, file formats, volume estimation | Design | 8:00 PM - 11:00 PM |
| Day 3 | 24 June | Wednesday | Bronze ingestion — batch ELT from Postgres + partner file drops to ADLS Delta | Bronze | 8:00 PM - 11:00 PM |
| Day 4 | 27 June | Saturday | Streaming ingestion — Azure Event Stream → Bronze Delta (20M EV IoT events/day) | Bronze | 8:00 PM - 11:00 PM |
| Day 5 | 28 June | Sunday | Multi-format ingestion — unified reader for Parquet, CSV, JSON, XML, ORC | Bronze | 8:00 PM - 11:00 PM |
| Day 6 | 1 July | Wednesday | CDC implementation — watermark-based change capture from PostgreSQL fleet DB | Bronze | 8:00 PM - 11:00 PM |
| Day 7 | 4 July | Saturday | Incremental load + idempotent pipeline — high-watermark pattern + MERGE on Delta | Silver | 8:00 PM - 11:00 PM |
| Day 8 | 5 July | Sunday | Silver transforms — dedup, type cast, schema enforcement, NULL handling with PySpark | Silver | 8:00 PM - 11:00 PM |
| Day 9 | 8 July | Wednesday | ADF parameterised pipelines — ForEach, dynamic content, metadata-driven ingestion | Silver | 8:00 PM - 11:00 PM |
| Day 10 | 11 July | Saturday | SCD Type 2, late-arriving data handling, schema drift alerting | Silver | 8:00 PM - 11:00 PM |
| Day 11 | 12 July | Sunday | Gold layer — fact + dimension models, Z-ordering, partition strategy, DW load | Gold | 8:00 PM - 11:00 PM |
| Day 12 | 15 July | Wednesday | Pipeline audit table + DQ checks + Azure Monitor alerts + cost tracking | Monitoring | 8:00 PM - 11:00 PM |
| Day 13 | 18 July | Saturday | Failure simulation — corrupt file, schema drift, duplicate events — fix them live | Failures | 8:00 PM - 11:00 PM |
| Day 14 | 19 July | Sunday | End-to-end run — full pipeline from raw events to Gold warehouse tables, validate | E2E | 8:00 PM - 11:00 PM |
| Day 15 | 22 July | Wednesday | Interview walk-through — system design, trade-off questions, resume bullets, mock Q&A | Interview | 8:00 PM - 11:00 PM |
| Day 16 | 25 July | Saturday | Spark performance tuning — AQE, shuffle optimization, skew handling, join strategy, caching, cluster tuning | Performance | 8:00 PM - 11:00 PM |
| Day 17 | 26 July | Sunday | Metadata-driven pipeline — control table design, config ingestion, reusable ingestion framework | Framework | 8:00 PM - 11:00 PM |
| Day 18 | 29 July | Wednesday | CI/CD implementation — Azure DevOps pipelines and Git workflow for data engineering deployments | CI/CD | 8:00 PM - 11:00 PM |
What a Senior DE is actually solving in production
EV telemetry arrives from IoT sensors, fleet APIs, PostgreSQL operational DB, partner CSVs, and third-party XML feeds — all in different formats and at different frequencies.
Business teams can't trust vehicle usage reports because raw data has duplicates, late arrivals, schema drift, and no lineage tracking.
Analysts wait 24+ hours for daily reports. Charging station utilisation data is stale before decisions can be made on fleet routing and energy pricing.
DB passwords and API keys are hardcoded across notebooks and scripts. Every new joiner gets credentials in Slack — a security and audit nightmare.
What we build: A unified, secure, scalable Azure DE platform that ingests 50M+ records daily from all these sources, applies Medallion architecture, and delivers clean analytics-ready gold tables — with full audit trail, monitoring, and zero hardcoded secrets.
ADLS Gen2
Raw ingestion · Delta format · No transformations
Databricks + PySpark
Dedup · Type cast · Schema enforce · CDC apply · Validate
Data Warehouse
Business aggregates · KPIs · Reporting-ready tables
| Data Source | Format | Volume | Frequency | Load Type |
|---|---|---|---|---|
| EV Telemetry (IoT) | JSON (streaming) | ~20M events/day | Real-time | Streaming |
| Fleet & Charging DB (Postgres) | Relational tables | ~5M rows/day delta | Every 2 hours | CDC / Incremental |
| Energy Pricing Feed (API) | JSON / XML | ~500K records/day | Hourly | Full + Delta |
| Partner Fleet Exports | CSV / ORC | ~10M rows/batch | Daily | Batch |
| Maintenance & Alerts | Parquet | ~2M records/day | Every 6 hours | Incremental |
| Government Registration Data | XML | ~500K records/month | Monthly | Full Load |
Total Bronze ingestion: ~50M+ records per day · Retention: 3 years rolling · ~15 TB estimated annual volume
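The source inventory above maps naturally onto the kind of config a metadata-driven ingestion framework reads. The following is a minimal sketch, not the course's actual control table: `SourceConfig`, `SOURCES`, and the `name` slugs are hypothetical, while the formats, frequencies, and load types are copied from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceConfig:
    name: str        # hypothetical slug, not taken from the course material
    formats: tuple   # formats listed for this source in the table
    frequency: str
    load_type: str

# One entry per row of the data source table.
SOURCES = [
    SourceConfig("ev_telemetry", ("json",), "real-time", "streaming"),
    SourceConfig("fleet_charging_db", ("relational",), "every 2 hours", "cdc/incremental"),
    SourceConfig("energy_pricing_api", ("json", "xml"), "hourly", "full+delta"),
    SourceConfig("partner_fleet_exports", ("csv", "orc"), "daily", "batch"),
    SourceConfig("maintenance_alerts", ("parquet",), "every 6 hours", "incremental"),
    SourceConfig("gov_registration", ("xml",), "monthly", "full"),
]

def sources_with_format(fmt: str) -> list:
    """Names of all sources that deliver data in the given format."""
    return [s.name for s in SOURCES if fmt in s.formats]

def sources_by_load_type(load_type: str) -> list:
    """Names of all sources whose load type matches exactly."""
    return [s.name for s in SOURCES if s.load_type == load_type]
```

A driver loop could iterate over `SOURCES` and dispatch each entry to the matching reader, which is roughly what a ForEach over a control table does in ADF.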
These are the final business tables you will build and populate — what analysts and BI tools consume:
Vehicle ID, make, model, battery capacity, registration, active flag — SCD Type 2 for historical tracking.
Station ID, location, charger type (AC/DC), max power, operator — geo-partitioned.
Full date spine and time grain — used for all joins in fact tables.
Session-level fact — vehicle, station, energy consumed, duration, cost, status. ~15M rows/month.
Aggregated hourly vehicle telemetry — speed, battery %, odometer, temperature. Pre-aggregated from raw 20M events/day.
Station-level cost per kWh by time window — joined with sessions for profitability reporting.
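The vehicle dimension above is described as SCD Type 2. As a rough illustration of that pattern in plain Python (in the course this would presumably be a Delta MERGE in PySpark), the sketch below closes the current version of a row and appends a new one when a tracked attribute changes. The column names in `TRACKED`, the function name `apply_scd2`, and the sample values are all assumptions.

```python
from datetime import date

# Hypothetical tracked attributes; the real dimension's columns may differ.
TRACKED = ("make", "model", "battery_kwh")

def apply_scd2(dim_rows, incoming, as_of):
    """Return a new dimension list after applying one incoming vehicle row.

    A row with effective_to of None is the current version of that vehicle.
    """
    rows = [dict(r) for r in dim_rows]  # copy so the input is not mutated
    current = next(
        (r for r in rows
         if r["vehicle_id"] == incoming["vehicle_id"] and r["effective_to"] is None),
        None,
    )
    if current is not None:
        if all(current[c] == incoming[c] for c in TRACKED):
            return rows  # nothing changed, keep the current version open
        current["effective_to"] = as_of  # close the old version
    rows.append({
        "vehicle_id": incoming["vehicle_id"],
        **{c: incoming[c] for c in TRACKED},
        "effective_from": as_of,
        "effective_to": None,
    })
    return rows

# Usage with made-up values: the battery capacity changes on 1 March.
v1 = {"vehicle_id": "V1", "make": "ExampleMake", "model": "ExampleModel", "battery_kwh": 30.0}
dim = apply_scd2([], v1, date(2024, 1, 1))
dim = apply_scd2(dim, {**v1, "battery_kwh": 40.0}, date(2024, 3, 1))
```

After the second call, `dim` holds two versions of `V1`: a closed one and an open one, so a fact row can be joined to whichever version was valid at its event date.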
PySpark jobs, Delta Live Tables concept, Unity Catalog basics, job clusters, and workflow scheduling.
Real-time ingestion of EV IoT telemetry at 20M+ events/day into Bronze Delta table.
Tiered landing zones — hot tier for active ingestion, cool tier for historical Bronze files.
Parameterised pipelines, ForEach loops, copy activities, triggers, integration runtimes.
Gold layer serving — dedicated SQL pool, external tables, PolyBase, partitioning strategy.
All secrets — DB passwords, API keys, SAS tokens — fetched at runtime. No plaintext credentials anywhere.
Operational fleet + charging DB — we implement CDC and incremental extraction patterns on it.
ACID + schema enforcement + time travel + MERGE (upsert) — foundation of all three Medallion layers.
Pipeline run logs, failure alerts, SLA dashboards, and cost tracking — real production observability.
Service principals, RBAC for storage and Databricks, managed identities — zero-credential architecture.
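To make the "fetched at runtime" idea concrete, here is a small sketch of building PostgreSQL JDBC reader options with every credential resolved at call time. The secret scope name, the secret key names, and the database name `fleet` are hypothetical. The secret lookup is passed in as a function so the code carries no plaintext and can be exercised without a cluster.

```python
from typing import Callable, Dict

def build_postgres_options(get_secret: Callable[[str, str], str],
                           scope: str = "ev-platform-kv") -> Dict[str, str]:
    """Assemble Spark JDBC options, pulling each credential when called.

    get_secret(scope, key) is any callable that returns the secret value.
    Scope and key names below are illustrative assumptions.
    """
    host = get_secret(scope, "pg-host")
    user = get_secret(scope, "pg-user")
    password = get_secret(scope, "pg-password")
    return {
        "url": f"jdbc:postgresql://{host}:5432/fleet",  # "fleet" is a made-up DB name
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
```

On Databricks with a Key Vault-backed secret scope, `dbutils.secrets.get` can be passed as `get_secret`, and the returned dict can be handed to `spark.read.format("jdbc").options(**opts)`.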
Load raw → transform in-platform. Separation of concerns between ingestion and transformation layers.
Watermark-based and log-based CDC from PostgreSQL. Only process changed rows — not full dumps.
Track last-processed timestamp per source table. Re-runnable without reprocessing old data.
MERGE-based upserts on Delta tables. Every pipeline run produces identical output regardless of how many times it runs.
Re-process historical date ranges without breaking current data. Date-partition override approach.
Track vehicle and station attribute changes over time — effective_from / effective_to pattern.
Single parameterised reader handles Parquet, CSV, JSON, XML, ORC — format detected from metadata config.
Delta Lake `mergeSchema` option + explicit schema checks. Alerting on unexpected column additions from upstream.
Event-time vs processing-time reconciliation. Reprocess affected partitions from Bronze when late events arrive.
Partitioning strategy, Z-ordering, file compaction, cluster sizing, cost vs latency trade-offs — all documented.
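The watermark and idempotent-upsert patterns listed above can be sketched together in plain Python. This is an illustration of the logic only, with a dict standing in for a Delta table and integer timestamps standing in for `updated_at`; the function and column names are assumptions, and the real pipeline would presumably use a Delta MERGE.

```python
def incremental_merge(source_rows, target, watermark, key="id", ts="updated_at"):
    """Upsert rows changed after `watermark` into `target`; return the new watermark.

    target: dict keyed by business key, standing in for a Delta table.
    """
    # Watermark filter: only rows modified since the last successful run.
    changed = [r for r in source_rows if r[ts] > watermark]
    for row in changed:
        existing = target.get(row[key])
        # Upsert: insert when missing, overwrite only with an equal or newer version.
        if existing is None or row[ts] >= existing[ts]:
            target[row[key]] = dict(row)
    # Advance the watermark to the newest change seen, or keep it if nothing changed.
    return max((r[ts] for r in changed), default=watermark)
```

Because the write is keyed and version-aware, re-running the same batch with the same starting watermark leaves `target` unchanged, which is the property that makes retries and backfills safe.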
Production pipelines break — we'll deliberately introduce and fix these:
Malformed JSON / truncated Parquet from partner drops — quarantine to dead-letter folder, alert, continue pipeline.
PostgreSQL connection timeout mid-batch — retry logic with exponential backoff, checkpoint recovery.
Upstream team adds a column without notice — pipeline detects schema change, raises alert, continues with known schema.
Event Stream delivers duplicate IoT events — Silver layer dedup on composite key before MERGE to Gold.
Bronze load takes 4h instead of 1h — Azure Monitor alert fires, on-call notification via webhook.
NULL rate exceeds threshold in critical column — pipeline publishes DQ report, halts Gold load, creates incident.
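For the connection-timeout scenario above, retry with exponential backoff could look roughly like the sketch below. The helper name, the default delays, and the choice of retryable exceptions are assumptions; a real PostgreSQL driver raises its own exception types, which would go into `retry_on`.

```python
import time

def retry_with_backoff(fn, max_attempts=4, base_delay=1.0, factor=2.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(), retrying on transient errors with exponentially growing waits.

    Waits base_delay, base_delay*factor, base_delay*factor**2, ... between attempts.
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the orchestrator
            sleep(base_delay * factor ** (attempt - 1))
```

The `sleep` parameter is injectable so the behaviour can be tested without waiting. Checkpoint recovery is a separate concern: the wrapped `fn` would need to resume from the last committed batch rather than restart from scratch.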
Custom pipeline_audit table — records every run: source, rows read, rows written, duration, status, error message.
ADF pipeline failure → email + webhook alert. Databricks job timeout → PagerDuty-style notification.
Automated DQ checks after each Silver run — NULL %, duplicate %, out-of-range values logged to a QC table.
Source-to-Gold column lineage documented in pipeline metadata — know exactly which Bronze file a Gold record came from.
Databricks DBU usage per job, ADF activity runs — weekly cost report to prevent budget overruns.
Daily report: did each pipeline meet its SLA window? A 30-day trend makes late-delivery patterns visible.
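The automated DQ checks described above (NULL % and duplicate %) reduce to a few counts per batch. The sketch below shows one possible shape in plain Python; the function name, the report keys, and the 5% default threshold are assumptions, and a production version would compute the same figures with PySpark aggregations and write them to the QC table.

```python
def dq_report(rows, critical_col, key_cols, null_threshold_pct=5.0):
    """Compute NULL % for one critical column and duplicate % on a composite key.

    halt_gold is True when the NULL rate exceeds the (assumed) threshold.
    """
    total = len(rows)
    if total == 0:
        return {"rows": 0, "null_pct": 0.0, "duplicate_pct": 0.0, "halt_gold": False}
    nulls = sum(1 for r in rows if r.get(critical_col) is None)
    # Rows beyond the first occurrence of each composite key count as duplicates.
    distinct = len({tuple(r.get(c) for c in key_cols) for r in rows})
    null_pct = 100.0 * nulls / total
    duplicate_pct = 100.0 * (total - distinct) / total
    return {
        "rows": total,
        "null_pct": null_pct,
        "duplicate_pct": duplicate_pct,
        "halt_gold": null_pct > null_threshold_pct,
    }
```

The returned dict can be appended to the QC table as-is, and `halt_gold` gives the orchestrator a single flag to decide whether the Gold load should proceed.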
You understand ETL basics but haven't built a production-grade multi-source pipeline at scale before.
Most DE interviews now require Azure/cloud project experience — this gives you a real, explainable one.
No project = no story. This gives you architecture decisions to defend with actual trade-off reasoning.
Coming from Software, Analytics, or BI — this is the fastest way to build and articulate a production DE project.
Build it once. Explain it in every interview.
₹999 One-time · Lifetime access to recordings + code