Azure Data Factory — Data Integration at Scale

Build enterprise data pipelines with Azure Data Factory — ingest, transform, and orchestrate data from any source.

Slide 1 / 8

Azure Data Factory

Enterprise Data Integration & ETL Pipeline Service
Azure Data & Analytics — Episode 29

Speaker Script

“Welcome back. Today we cover Azure Data Factory — Azure's cloud-native data integration service. Every enterprise has data scattered across dozens of systems: SQL databases, Salesforce, SAP, SharePoint, flat files, REST APIs. Data Factory is the orchestration layer that moves, transforms, and schedules data flows between all these systems. It's the backbone of modern enterprise data engineering in Azure.”

Slide 2 / 8

ADF Core Concepts

Pipelines — logical grouping of activities
Activities — individual steps (Copy, Data Flow, Lookup, ForEach)
Linked Services — connection definitions for data sources
Datasets — reference to specific data in a linked service
Triggers — when and how pipelines run
Integration Runtime — compute that executes activities

Speaker Script

“Data Factory has a clear conceptual hierarchy. Pipelines contain activities — the individual steps of your data workflow. Linked Services define connections to data sources — your Azure SQL credentials, your Salesforce OAuth connection, your S3 access keys. Datasets describe the specific data within a linked service — the table name, the file path, the container name. Triggers define when pipelines run — on a schedule, when a file arrives, or manually. The Integration Runtime is the compute that executes the activities.”
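The hierarchy described above can be sketched as plain Python dictionaries shaped like the JSON definitions ADF keeps for each resource. This is a minimal sketch, not a deployable factory: every resource name (`AzureSqlLS`, `OrdersTable`, `OrdersParquet`, `CopyOrdersPipeline`), the `dbo.Orders` table, and the placeholder connection string are hypothetical, and the property layout should be verified against the current ADF documentation before use.

```python
import json

# All names below are hypothetical; the connection string is a placeholder.
linked_service = {
    "name": "AzureSqlLS",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {"connectionString": "<placeholder>"},
    },
}

# A dataset points at specific data inside a linked service.
dataset = {
    "name": "OrdersTable",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": linked_service["name"],
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"schema": "dbo", "table": "Orders"},
    },
}

# A pipeline groups activities; each activity references datasets by name.
pipeline = {
    "name": "CopyOrdersPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyOrders",
                "type": "Copy",
                "inputs": [{"referenceName": dataset["name"], "type": "DatasetReference"}],
                "outputs": [{"referenceName": "OrdersParquet", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "AzureSqlSource"},
                    "sink": {"type": "ParquetSink"},
                },
            }
        ]
    },
}


def referenced_datasets(pipeline_def):
    """Return the dataset names used by activity inputs and outputs, in order."""
    names = []
    for activity in pipeline_def["properties"]["activities"]:
        for ref in activity.get("inputs", []) + activity.get("outputs", []):
            names.append(ref["referenceName"])
    return names


print(json.dumps(pipeline, indent=2))
```

The helper `referenced_datasets` is only there to make the chain visible: pipeline, then activity, then dataset, then linked service.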

Slide 3 / 8

Copy Activity

High-throughput data movement between 100+ sources and sinks
Sources: Azure SQL, Cosmos DB, Salesforce, REST, S3, SAP, Oracle...
Sinks: Azure Data Lake, Synapse, Blob, SQL, Cosmos DB...
Parallel copy — multi-threaded for maximum throughput
Schema mapping — map source to destination columns
Fault tolerance — skip bad rows, log to error file

Speaker Script

“The Copy Activity is ADF's workhorse: it moves data between a wide range of sources and destinations with high throughput. It supports over 100 connectors covering most databases, file systems, SaaS applications, and cloud storage services, although some connectors are available only as a source and not as a sink. Parallel copy uses multiple threads to maximize throughput; the actual speed depends on the source, the sink, the network, and the compute you allocate. Schema mapping handles column name differences between source and destination. Fault tolerance can skip malformed rows and log them for review rather than failing the entire pipeline.”
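To make schema mapping and fault tolerance concrete, here is a small local simulation in plain Python. It does not call ADF; the function `copy_rows`, the column names, and the sample rows are all made up for illustration. It only shows the behaviour described above: rename columns on the way through, and set bad rows aside instead of aborting.

```python
def copy_rows(source_rows, column_mapping, converters):
    """Map source columns to sink columns; skip rows that cannot be converted.

    Returns (copied, skipped). Each skipped entry keeps the original row and
    the error, similar in spirit to logging bad rows to an error file.
    """
    copied, skipped = [], []
    for row in source_rows:
        try:
            sink_row = {
                sink_col: converters[sink_col](row[src_col])
                for src_col, sink_col in column_mapping.items()
            }
        except (KeyError, ValueError, TypeError) as exc:
            skipped.append({"row": row, "error": repr(exc)})
            continue
        copied.append(sink_row)
    return copied, skipped


# Hypothetical sample data: one good row, one bad value, one missing column.
rows = [
    {"cust_id": "1", "amt": "19.99"},
    {"cust_id": "2", "amt": "not-a-number"},
    {"cust_id": "3"},
]
mapping = {"cust_id": "CustomerId", "amt": "Amount"}
converters = {"CustomerId": int, "Amount": float}

copied, skipped = copy_rows(rows, mapping, converters)
print(copied)
print(len(skipped), "rows skipped")
```

In this sketch the pipeline still succeeds with one copied row, and the two rejected rows remain available for review.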

Slide 4 / 8

Data Flows — Visual ETL

Visual, code-free data transformation designer
Transformations: filter, join, aggregate, pivot, sort, lookup
Runs on managed Apache Spark clusters
Debug mode — preview data at each transformation step
Handles complex multi-source transformations visually
Generated Spark code — transparent and auditable

Speaker Script

“Data Flows provide a visual drag-and-drop designer for data transformations. You connect transformation nodes (source, filter, join, aggregate, sort, sink) on a canvas without writing code. Debug mode is invaluable: click any node and see a sample of the data flowing through at that point. Under the hood, Data Flows are translated into Apache Spark jobs that run on ADF's managed Spark clusters, so they scale out to large datasets without you managing the cluster yourself. For teams without deep coding skills, Data Flows make complex transformations accessible.”
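For readers who think in code, the logic of a simple filter-then-aggregate flow can be written out in plain Python. This is only an illustration of what such a graph computes, not how Data Flows are authored or executed; the `orders` records and the helper names are hypothetical.

```python
from collections import defaultdict


def filter_rows(rows, predicate):
    """Filter node: keep only the rows for which the predicate is true."""
    return [row for row in rows if predicate(row)]


def aggregate_sum(rows, group_key, value_key):
    """Aggregate node: sum value_key for each distinct group_key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


# Hypothetical source data.
orders = [
    {"region": "EU", "status": "shipped", "amount": 100.0},
    {"region": "EU", "status": "cancelled", "amount": 40.0},
    {"region": "US", "status": "shipped", "amount": 250.0},
    {"region": "EU", "status": "shipped", "amount": 50.0},
]

shipped = filter_rows(orders, lambda row: row["status"] == "shipped")
totals = aggregate_sum(shipped, "region", "amount")
print(totals)
```

Inspecting `shipped` before aggregating plays the same role as previewing a node in debug mode.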

Slide 5 / 8

Incremental Loading Patterns

Full load — copy everything every run (simple, expensive)
Incremental — copy only new/changed records
Watermark pattern — track last loaded timestamp
Change Data Capture — capture row-level changes from source
Delta Lake merge — upsert changes into data lake

Speaker Script

“Loading data is easy; loading it efficiently is an art. Full loads copy everything every time, which is simple but creates unnecessary load and cost. Incremental loading copies only records that changed since the last run. The watermark pattern stores the maximum timestamp from the last run and queries only records with a later timestamp. Change Data Capture reads the database transaction log to capture inserts, updates, and deletes with minimal impact on the source system. On the lake side, a Delta Lake merge upserts those changes into the existing tables. For large production data pipelines, incremental loading is essential.”
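The watermark pattern can be sketched locally with Python's built-in `sqlite3` module. This is a simulation under stated assumptions, not ADF itself: the table names (`source_orders`, `target_orders`, `watermark`), the `modified_at` column, and the function `incremental_load` are hypothetical, and timestamps are ISO-8601 strings so that string comparison matches time order.

```python
import sqlite3


def incremental_load(conn):
    """Copy rows modified after the stored watermark, then advance the watermark."""
    (watermark,) = conn.execute("SELECT last_modified FROM watermark").fetchone()
    changed = conn.execute(
        "SELECT id, name, modified_at FROM source_orders "
        "WHERE modified_at > ? ORDER BY modified_at",
        (watermark,),
    ).fetchall()
    if changed:
        # Upsert into the target so that updated rows replace older copies.
        conn.executemany(
            "INSERT OR REPLACE INTO target_orders (id, name, modified_at) "
            "VALUES (?, ?, ?)",
            changed,
        )
        # The new watermark is the latest timestamp just loaded.
        conn.execute("UPDATE watermark SET last_modified = ?", (changed[-1][2],))
        conn.commit()
    return len(changed)


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_orders (id INTEGER PRIMARY KEY, name TEXT, modified_at TEXT);
    CREATE TABLE target_orders (id INTEGER PRIMARY KEY, name TEXT, modified_at TEXT);
    CREATE TABLE watermark (last_modified TEXT);
    INSERT INTO watermark VALUES ('1900-01-01T00:00:00');
    INSERT INTO source_orders VALUES
        (1, 'a', '2024-01-01T10:00:00'),
        (2, 'b', '2024-01-02T10:00:00');
    """
)

first = incremental_load(conn)   # initial run picks up both rows
conn.execute(
    "UPDATE source_orders SET name = 'b2', modified_at = '2024-01-03T09:00:00' "
    "WHERE id = 2"
)
second = incremental_load(conn)  # only the changed row
third = incremental_load(conn)   # nothing new
print(first, second, third)
```

Note that a timestamp watermark does not see deleted rows, which is one reason to reach for Change Data Capture when deletes matter.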

Slide 6 / 8

Triggers & Orchestration

Schedule trigger — wall-clock recurrence (frequency and interval), daily/hourly/weekly
Tumbling window — fixed-size, non-overlapping windows with historical backfill
Event trigger — blob created/deleted in storage
Manual trigger — on-demand pipeline execution
Pipeline dependencies — run B after A succeeds
Error handling — retry policies, failure paths

Speaker Script

“ADF offers several trigger types. Schedule triggers run pipelines on a wall-clock recurrence, such as every hour or daily at a set time. Tumbling window triggers process data in contiguous, non-overlapping time windows, which suits hourly or daily batch processing; they can backfill past windows and retry a window that fails. Event triggers respond to storage events: a new file appearing in a container automatically triggers processing. Manual triggers run a pipeline on demand. Within pipelines, activities can branch on success or failure: retry a failed activity, send an alert email, or run a cleanup activity on failure.”
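The idea behind tumbling windows, and why they make backfill straightforward, can be shown with a few lines of Python. This sketch only computes the window boundaries; the function name `tumbling_windows` and the dates are hypothetical, and it does not reproduce how the ADF trigger schedules or retries runs.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, end, size):
    """Yield contiguous, non-overlapping (window_start, window_end) pairs.

    Only complete windows that fit inside [start, end) are produced.
    """
    if size <= timedelta(0):
        raise ValueError("size must be positive")
    window_start = start
    while window_start + size <= end:
        yield window_start, window_start + size
        window_start += size


# Backfill six hours of history in two-hour windows.
windows = list(
    tumbling_windows(
        datetime(2024, 1, 1, 0, 0),
        datetime(2024, 1, 1, 6, 0),
        timedelta(hours=2),
    )
)
for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Because each window is independent and has fixed boundaries, a failed window can be rerun on its own without touching its neighbours.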

Slide 7 / 8

Live Azure Demo

Create ADF and linked services
Build copy pipeline: SQL → Parquet in Data Lake
Add Data Flow: filter and aggregate
Set daily schedule trigger
Monitor pipeline run and view metrics

Speaker Script

“Let me walk through building a real data pipeline. I'll create a Data Factory instance, configure linked services for Azure SQL and Blob Storage, build a pipeline that copies SQL table data to Parquet format in a Data Lake, add a Data Flow to filter and aggregate the data, set a daily trigger, and then run the pipeline manually to watch the execution. ADF's monitoring interface shows you exactly how long each activity took and how many rows were processed.”

Slide 8 / 8

Summary & What's Next

✅ ADF — enterprise ETL orchestration, 100+ connectors
✅ Copy Activity — high-throughput data movement
✅ Data Flows — visual ETL on managed Spark
✅ Incremental loading — efficient, scalable data updates
✅ Schedule + event triggers — automated pipeline execution
Next: Azure Databricks →

Speaker Script

“Azure Data Factory is the glue that holds enterprise data architectures together. Whether you're building a simple nightly batch job or coordinating dozens of interdependent pipelines, ADF handles the orchestration. Next we cover Azure Databricks, the premium Apache Spark platform for data engineering and machine learning, combining the power of Databricks with the enterprise capabilities of Azure.”

🖥️ Azure Demo Steps

1. Create Azure Data Factory
2. Create a linked service to Azure SQL Database
3. Create a linked service to Azure Blob Storage
4. Build a copy pipeline: SQL → Blob as Parquet
5. Add a Data Flow transformation (filter + aggregate)
6. Configure trigger: daily at 2am
7. Monitor pipeline run and view activity logs
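As a rough guide to the trigger step (daily at 2am), the definition might look like the following, written as a Python dictionary shaped like ADF's trigger JSON. Treat it as a sketch: the trigger name `DailyAt2am`, the pipeline name `CopyOrdersPipeline`, the start date, and the UTC time zone are assumptions, and the exact property layout should be confirmed against the ADF schedule trigger documentation.

```python
import json

# Hypothetical names, start time, and time zone; not taken from the demo.
daily_trigger = {
    "name": "DailyAt2am",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",  # repeat unit
                "interval": 1,       # every 1 day
                "startTime": "2024-01-01T00:00:00Z",
                "timeZone": "UTC",
                # Fire at 02:00 on each scheduled day.
                "schedule": {"hours": [2], "minutes": [0]},
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopyOrdersPipeline",
                    "type": "PipelineReference",
                }
            }
        ],
    },
}

print(json.dumps(daily_trigger, indent=2))
```

The recurrence is expressed as a frequency, an interval, and an optional list of hours and minutes, so "daily at 2am" becomes `Day`, `1`, and `hours: [2]`.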