DataBricks Lakehouse Services

One Platform for Data, Analytics & AI. Built on the Lakehouse Architecture.

DataBricks is the unified data, analytics, and AI platform built on the Lakehouse architecture — the paradigm that combines the low-cost, open-format storage of data lakes with the transactional reliability, performance, and governance of data warehouses. By storing all data in open Delta Lake format on cloud object storage (S3, ADLS, GCS) and processing it through Apache Spark, SQL, and AI-optimised compute, DataBricks eliminates the architectural split between data lakes and data warehouses that has forced organisations to maintain duplicate data copies, inconsistent governance, and separate engineering teams for ETL and BI workloads. DataBricks' Unity Catalog provides unified governance across all data and AI assets, Delta Live Tables brings declarative pipeline engineering to ETL, and the platform's AI runtime (powered by Mosaic AI) enables organisations to train, tune, and deploy LLMs and ML models on the same data infrastructure their analysts query. SourceMash delivers DataBricks engagements covering workspace architecture, cloud migration, Delta Lake modelling, Delta Live Tables pipeline engineering, Delta Sharing, MLflow ML operations, Unity Catalog governance, and FinOps DBU optimisation for enterprise clients.


Core DataBricks Service Areas

AWS | Azure | GCP | Multi-Cloud DataBricks

DLT: Delta Live Tables | Delta Sharing Certified

DataBricks Certified Architects & Engineers

35% Avg. DBU Cost Reduction via FinOps


DataBricks Platform

Lakehouse Architecture Unifying Data, Analytics & AI. Open, Governed, and Scalable.

DataBricks solves the fundamental architectural split that has plagued enterprise data platforms for a decade: the separation between data lakes (cheap, open-format storage for raw data, ML, and streaming) and data warehouses (reliable, high-performance SQL analytics with governance). Maintaining both means duplicate storage costs, inconsistent data quality, separate governance models, and engineering teams that cannot collaborate because they work on different copies of the same data. The Lakehouse architecture stores all data in Delta Lake — an open, transactional storage layer built on Parquet files with ACID guarantees, time travel, and schema enforcement — sitting on standard cloud object storage. DataBricks SQL warehouses provide high-performance BI querying on the same Delta tables that DataBricks ML runtime uses for feature engineering and model training, and that Delta Live Tables pipelines write to in streaming and batch ETL jobs.

SourceMash's DataBricks practice covers the full platform: workspace and Unity Catalog architecture, migration from Snowflake, Redshift, Synapse, BigQuery, Hadoop, and on-premise data warehouses, Delta Lake data modelling and medallion architecture, Delta Live Tables pipeline engineering, Delta Sharing and the DataBricks Marketplace, MLflow ML operations and Mosaic AI for generative AI, Unity Catalog governance with dynamic data masking and lineage, and the FinOps programme that aligns DBU consumption to business value.

Workspace Architecture | Cloud Migration | Delta Lake Modelling | DLT Pipelines | Delta Sharing | ML & Mosaic AI | Unity Catalog Governance | FinOps & DBUs | BI Integration | Auto Loader & Streaming

DataBricks Compute Selection Guide

⚡

All-Purpose Compute

Interactive development, notebooks, ad-hoc analysis, data exploration, and ML experimentation with auto-scaling and spot instance support

📊

SQL Warehouses

Serverless and classic SQL endpoints optimised for BI concurrency, with Photon engine acceleration, predictive IO, and connection pooling for Power BI / Tableau

🔧

Jobs Compute

Cost-optimised clusters for Delta Live Tables, scheduled notebooks, and workflow jobs with automatic termination, spot instances, and job-level DBU pricing

🧠

ML / GPU Compute

Machine learning and GPU-enabled clusters for model training, LLM fine-tuning, and deep learning with pre-configured ML runtime and Mosaic AI integration

DataBricks Certifications

DataBricks Certified Data Engineer | DataBricks Certified ML Associate | DataBricks Certified Lakehouse Architect | DataBricks Certified SQL Analyst | Delta Lake Certified Developer | DataBricks Partner Network

Service 01

DataBricks Workspace Architecture & Unity Catalog Design

DataBricks workspace architecture — the decisions that determine how data is organised in Unity Catalog, how compute is configured for each workload type, how costs are tracked and controlled at the workspace and account level, and how security is enforced across cloud accounts — is the foundation that determines whether the Lakehouse delivers unified analytics and AI or becomes a collection of disconnected notebooks with ungoverned data copies. DataBricks' flexibility in workspace design, metastore topology, and cloud networking requires deliberate architecture: a single workspace with default catalog permissions and no cluster policies produces the same governance gaps and budget surprises that the Lakehouse architecture is designed to eliminate.

The foundational architecture decisions — single workspace vs. multi-workspace topology (separate workspaces for production, development, and sandbox experimentation), Unity Catalog metastore design with catalog and schema hierarchy reflecting data domains and access boundaries, cluster and SQL warehouse configuration for each workload type (all-purpose clusters for development, jobs compute for ETL, SQL warehouses for BI, ML compute for training), and cluster policies that enforce auto-termination, instance types, and cost tags — must be made before any production workload lands in the platform.


Architecture — Design Scope

SourceMash DataBricks architecture practice

Workspace Topology: Single vs. multi-workspace design

Compute Types: All-purpose / Jobs / SQL / ML GPU

Unity Catalog: Metastore, catalog, schema hierarchy

Cloud / Region: AWS, Azure, GCP; data residency

Cluster Policies: Auto-termination, tags, spot instances

Network Policy: Private Link / VPC peering / IP access

Cluster & SQL Warehouse Design

Compute configuration for each workload class — the independently-scalable clusters that are the primary lever for both performance and cost management in DataBricks. All-purpose clusters: interactive development and ad-hoc analysis (auto-scaling enabled, auto-termination after 15–30 minutes of inactivity, spot instance mixing for cost reduction, single-user or shared access modes depending on Unity Catalog isolation requirements). SQL Warehouses: serverless or classic endpoints for BI tools (Power BI, Tableau, Looker) with Photon engine enabled for vectorized query execution, predictive IO for cloud storage caching, and scaling settings optimised for concurrency (max 10–20 queries per cluster with auto-scaling). Jobs compute: cost-optimised clusters for Delta Live Tables and scheduled workflows (automatic termination on job completion, spot instances, smallest viable instance type for the workload, job-level DBU pricing that is significantly cheaper than all-purpose compute).


Unity Catalog & Schema Hierarchy

Unity Catalog hierarchy design that reflects both the data's logical organisation and the governance model. Metastore: the top-level container that manages all data assets across workspaces in a region (one metastore per region, shared across production and development workspaces with access control providing isolation). Catalogs: the logical boundary for data domains (RAW for landed source data, TRANSFORMED for cleansed and enriched data, ANALYTICS for dimensional models and aggregated tables, ML_FEATURES for feature store tables, SANDBOX for experimental data). Schemas: subject-area or source-system boundaries within each catalog (RAW.SALESFORCE, RAW.SAP, ANALYTICS.CUSTOMER_360). Volumes: non-tabular data storage (raw files, images, model artefacts) governed by the same Unity Catalog permissions as tables. Three-layer namespace (catalog.schema.table) that replaces the legacy Hive metastore two-layer namespace and enables cross-catalog data sharing without data movement.
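The catalog and schema hierarchy described above can be sketched as a small generator of DDL statements. This is a minimal illustration in plain Python, assuming the example catalog and schema names from the text; the table name `ACCOUNT` and the helper names are hypothetical, and the statements would still need to be run against a Unity Catalog enabled workspace.

```python
# Hypothetical domain layout built from the catalog and schema names above.
LAYOUT = {
    "RAW": ["SALESFORCE", "SAP"],
    "ANALYTICS": ["CUSTOMER_360"],
}


def ddl_for_layout(layout):
    """Return CREATE CATALOG / CREATE SCHEMA statements for a layout dict."""
    statements = []
    for catalog, schemas in layout.items():
        statements.append(f"CREATE CATALOG IF NOT EXISTS {catalog}")
        for schema in schemas:
            statements.append(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
    return statements


def full_name(catalog, schema, table):
    """Build the three-level catalog.schema.table name."""
    return f"{catalog}.{schema}.{table}"


for statement in ddl_for_layout(LAYOUT):
    print(statement)
print(full_name("RAW", "SALESFORCE", "ACCOUNT"))
```

Keeping the layout in a reviewable data structure makes the hierarchy a versioned artefact rather than a series of ad-hoc statements typed into a notebook.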


Cluster Policies & Cost Guards

Cluster policy design — the DataBricks-native mechanism for preventing runaway cost before it appears on the monthly bill. Cluster policies enforce: auto-termination timeout (critical for all-purpose clusters where developers leave notebooks running overnight), instance type restrictions (preventing provisioning of expensive GPU or high-memory instances for standard ETL workloads), maximum node count limits (preventing auto-scaling beyond a defined boundary), and predefined cost tags that allocate every DBU to a cost centre. Workspace-level budget alerts: configuring billing alerts at the account level when DBU consumption exceeds defined thresholds. Instance pool pre-warming: keeping a pool of idle instances ready for fast cluster startup without paying DBUs for the idle time (only cloud VM costs, not DataBricks DBUs) — ideal for development environments where fast startup is more important than minimising every rupee.
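A cluster policy of the kind described above might look like the following sketch. The attribute paths and rule types follow the JSON shape that cluster policy definitions use; the concrete limits, instance types, and tag value are illustrative assumptions, and the `violations` helper is a hypothetical local pre-check, not the platform's own enforcement.

```python
import json

# Policy definition: attribute path -> rule. Values are illustrative only.
POLICY = {
    "autotermination_minutes": {"type": "range", "maxValue": 30, "defaultValue": 20},
    "node_type_id": {"type": "allowlist", "values": ["m5.xlarge", "m5.2xlarge"]},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "custom_tags.cost_centre": {"type": "fixed", "value": "analytics"},
}


def violations(policy, requested):
    """Return the attribute names in `requested` that break a policy rule."""
    broken = []
    for attribute, rule in policy.items():
        value = requested.get(attribute)
        if rule["type"] == "fixed" and value != rule["value"]:
            broken.append(attribute)
        elif rule["type"] == "allowlist" and value not in rule["values"]:
            broken.append(attribute)
        elif rule["type"] == "range" and (value is None or value > rule["maxValue"]):
            broken.append(attribute)
    return broken


# The JSON string is what would be stored as the policy definition.
print(json.dumps(POLICY, indent=2))
```

A request for a GPU instance with 50 workers would fail the `node_type_id` and `autoscale.max_workers` rules here, which is the point at which the cost is prevented rather than discovered on the bill.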


Network Policy & Private Connectivity

DataBricks network security configuration: controlling how workspaces connect to cloud infrastructure and how clients connect to DataBricks. VPC / VNet injection: deploying the DataBricks compute (data) plane within the customer's own cloud network rather than the default DataBricks-managed network, while the control plane stays in the DataBricks account; often required for financial services and healthcare compliance. AWS PrivateLink, Azure Private Link, and GCP Private Service Connect for private connectivity between customer networks and DataBricks workspaces without traversing the public internet. IP access lists: restricting workspace access to corporate VPN egress ranges and office IPs. Token and credential management: personal access tokens with scoped permissions, service principals for CI/CD and automation, and OAuth and SSO integration with the organisation's identity provider (for example Azure AD). Storage access: legacy credential passthrough lets clusters access cloud storage using the identity of the user running the notebook rather than a shared service account; in Unity Catalog workspaces the equivalent fine-grained control is expressed through storage credentials, external locations, and table-level grants.
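The IP access list idea can be illustrated with the standard library alone. This is a sketch of the allow-list check itself, not of the DataBricks API that stores the list; the CIDR ranges are reserved documentation ranges standing in for real VPN egress addresses.

```python
import ipaddress

# Illustrative ranges only (reserved documentation blocks).
ALLOWED_CIDRS = ["203.0.113.0/24", "198.51.100.17/32"]


def is_allowed(client_ip, allowed_cidrs):
    """Return True when client_ip falls inside any allowed CIDR block."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


print(is_allowed("203.0.113.45", ALLOWED_CIDRS))
print(is_allowed("192.0.2.1", ALLOWED_CIDRS))
```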


Delta Time Travel & Restore

Delta Lake Time Travel for data recovery and audit: one of the most valuable operational features, requiring no backup infrastructure and imposing minimal storage overhead. Time Travel (configurable retention: by default the transaction log is kept for 30 days, while data files removed from the table are only protected from VACUUM for 7 days, so both settings must be raised to guarantee a 30-day window): querying any Delta table as it existed at any point within the retention window using VERSION AS OF or TIMESTAMP AS OF syntax, enabling recovery from accidental DELETE or MERGE operations, comparison of current data to historical snapshots, and reproducible ML training by training on the exact data state that produced a specific model version. CLONE: creating a deep or shallow clone of a Delta table at a specific version. A deep clone copies data files (useful for environment isolation); a shallow clone copies only the transaction log pointer (an instant, zero-copy clone for testing or development that shares underlying data files until modified). VACUUM: removing old data files beyond the retention period to control storage cost while preserving Time Travel history within the window.
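The statements involved in time travel, restore, and cloning are short enough to sketch as string builders. The helper names are hypothetical and the table names in the usage lines are placeholders; the SQL itself follows the VERSION AS OF, TIMESTAMP AS OF, RESTORE, and SHALLOW CLONE forms named above, and would be executed through a SQL warehouse or `spark.sql` on a cluster.

```python
def as_of_version(table, version):
    """Query a table as it existed at a given version number."""
    return f"SELECT * FROM {table} VERSION AS OF {int(version)}"


def as_of_timestamp(table, timestamp):
    """Query a table as it existed at a given timestamp string."""
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


def restore_to_version(table, version):
    """Roll the table back to an earlier version (for example after a bad MERGE)."""
    return f"RESTORE TABLE {table} TO VERSION AS OF {int(version)}"


def shallow_clone(source, target, version):
    """Create a zero-copy clone of a specific version for testing."""
    return f"CREATE TABLE {target} SHALLOW CLONE {source} VERSION AS OF {int(version)}"


print(as_of_version("analytics.sales.orders", 42))
print(restore_to_version("analytics.sales.orders", 41))
```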


Multi-Cloud & Multi-Region

DataBricks multi-cloud and multi-region deployment for organisations with data residency requirements, cloud provider diversification strategies, or the need to bring analytics and AI compute close to data consumers in different geographies. Data replication via Delta Sharing: sharing live Delta tables between DataBricks workspaces in different regions or clouds without copying data — the recipient queries against live shared data using their own compute. Cross-region metastore federation: Unity Catalog can manage data assets across regions while respecting data residency boundaries. Cloud-agnostic architecture: because DataBricks runs on all three major clouds with an identical API and SQL interface, organisations can maintain a single platform skill set while deploying workloads to the cloud provider most appropriate for each geography or data source. Disaster recovery: cloning critical Delta tables to a secondary region and maintaining a warm standby workspace for business continuity.


Service 02

Lakehouse Migration — Snowflake, Redshift, Synapse, BigQuery, Hadoop & On-Premise

Migration to DataBricks from a legacy data platform — Snowflake, Amazon Redshift, Azure Synapse Analytics, Google BigQuery, on-premise Hadoop (HDFS, Hive, Impala), Teradata, or SQL Server data warehouses — is a data platform transformation project, not a lift-and-shift migration. Each source platform has its own SQL dialect, its own storage format, its own approach to transactions and concurrency, and its own governance model — none of which translate directly to DataBricks' Lakehouse architecture. A Snowflake table with micro-partitioning has no direct equivalent in Delta Lake (where Z-ORDER and liquid clustering handle data organisation); a Hadoop Parquet file without transaction support must be converted to Delta Lake format to gain ACID guarantees; and a Teradata stored procedure with procedural logic must be evaluated for whether it becomes a Delta Live Tables pipeline, a Python wheel on DataBricks, or a SQL UDF in Unity Catalog.


Migration — Source Platforms

SourceMash Databricks migration practice

Snowflake: Micro-partition to Delta Lake translation

Amazon Redshift: DISTKEY / SORTKEY to Z-ORDER

Azure Synapse: Dedicated SQL Pool + Spark pool unify

Google BigQuery: Partitioning + clustering translation

Hadoop / Hive: HDFS to cloud storage + Delta format

Teradata / Oracle DW: BTEQ / PL-SQL to DLT + Python

Migration Assessment & SQL Inventory

Migration assessment covering the full inventory of objects in the source platform: tables (count, size, partitioning strategy, compression, column data types, primary keys), views (complexity classification — simple pass-through vs. complex multi-join with window functions), stored procedures and UDFs (procedural complexity, Spark SQL compatibility, Python migration candidates), ETL pipelines and data loading procedures (current orchestration tool and DataBricks equivalent approach), and the BI reports that must be validated against the migrated data. Automated compatibility analysis using DataBricks' SQL translation capabilities and Spark SQL syntax comparison to identify objects that translate automatically vs. objects requiring manual rewrite — producing the remediation effort estimate that drives the migration project timeline and cost. Data format assessment: evaluating whether source data is in formats natively readable by DataBricks (Parquet, Delta, CSV, JSON) or requires conversion (proprietary columnar formats, mainframe extracts, XML).


Snowflake to DataBricks Migration

Snowflake to DataBricks migration, driven by organisations seeking to unify their data warehouse and data lake on a single platform, reduce proprietary storage costs, or leverage DataBricks' AI and ML capabilities on the same data their analysts query. Schema translation: Snowflake's micro-partitioning and clustering keys have no direct equivalent in Delta Lake; they are replaced by Delta Lake's Z-ORDER optimisation (co-locating column values in the same set of files to improve query pruning) and liquid clustering (maintaining data layout without manual Z-ORDER commands). SQL dialect differences: Snowflake's QUALIFY clause → Spark SQL window functions with ROW_NUMBER() OVER(), Snowflake's VARIANT → Spark STRUCT or MAP types, Snowflake's semi-structured flattening → Spark SQL's explode() and from_json(). Data transfer by unloading from Snowflake (COPY INTO a cloud storage location) in Parquet format, followed by a one-time Delta Lake conversion (CONVERT TO DELTA on the exported Parquet directory, or CREATE TABLE ... USING DELTA AS SELECT from it) with OPTIMIZE and Z-ORDER, or via a continuous sync using Fivetran, Airbyte, or Spark JDBC until cutover.


Hadoop & On-Premise Data Lake Migration

Hadoop (HDFS, Hive, Impala, Spark on-premise) to DataBricks migration — driven by organisations retiring on-premise Hadoop clusters due to unsupported operational complexity, hardware refresh costs, or the need to unify batch and streaming on a modern platform. HDFS to cloud storage: migrating petabyte-scale data from on-premise HDFS to S3, ADLS, or GCS using DistCp, WANdisco, or cloud-native transfer appliances — followed by format conversion from Hive-managed tables (often stored as text, SequenceFile, or ORC) to Delta Lake Parquet for ACID reliability and performance. Hive metastore translation: converting Hive database and table definitions to Unity Catalog catalogs and schemas, mapping Hive partition columns to Delta Lake partitioning, and translating Hive SerDe configurations to Spark read options. Spark job migration: on-premise Spark jobs (PySpark, Scala) typically require minimal code changes to run on DataBricks — primarily configuration changes (replacing HDFS paths with cloud storage URIs, updating metastore connections to Unity Catalog) and dependency management (migrating from cluster-installed libraries to DataBricks cluster-scoped libraries or Unity Catalog volumes).


Redshift, Synapse & Cloud DW Migration

Amazon Redshift and Azure Synapse to DataBricks migration — driven by the desire to move from proprietary cloud data warehouses to an open Lakehouse architecture where data is stored in open Delta Lake format rather than a vendor-locked columnar format. Redshift schema translation: Redshift DISTKEY, SORTKEY, and INTERLEAVED SORTKEY table attributes have no equivalent in Delta Lake — replaced by Z-ORDER on frequently filtered columns and liquid clustering for automatic optimisation. Synapse HASH DISTRIBUTION and REPLICATED table strategies are irrelevant in Delta Lake's cloud storage model; the performance equivalent is file layout optimisation through partitioning and Z-ORDER. T-SQL to Spark SQL translation: proprietary T-SQL functions (STRING_SPLIT, OPENJSON, FOR JSON, HASHBYTES) require rewriting in Spark SQL or Python UDFs. Data export via Redshift UNLOAD or Synapse CETAS to cloud storage Parquet, then Delta conversion and OPTIMIZE.


Data Validation & Reconciliation

Post-migration data validation — the most important phase of any platform migration and the one most often compressed when project timelines are under pressure. Automated row count reconciliation across every migrated table (source count = target count for every date partition and dimension key value), aggregate validation (source SUM, MIN, MAX, COUNT DISTINCT for key business metrics compared to DataBricks values within a defined tolerance), and sample-level row-by-row comparison for a representative subset of each table (verifying data type casting, decimal precision, date/time handling, and NULL semantics between the source and DataBricks). Business metric reconciliation: running the equivalent of the business's critical financial or operational reports against both the source and DataBricks and comparing output — because raw table data can match while aggregated business metrics diverge due to join behaviour differences or filter predicate translation errors. Data quality profiling: using DataBricks' built-in data profiling and Great Expectations to validate that migrated data meets the same quality standards as the source.
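The row count and aggregate checks described above can be sketched in plain Python. This is a minimal illustration over in-memory rows, assuming each side has already been reduced to a list of dicts; in practice the same profile would be computed with SQL on the source and on DataBricks, and the function names and the tolerance default are hypothetical.

```python
from math import isclose


def profile(rows, column):
    """Summarise one column: counts, NULLs, and basic aggregates."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "count": len(rows),
        "nulls": len(rows) - len(values),
        "sum": sum(values),
        "min": min(values, default=None),
        "max": max(values, default=None),
        "distinct": len(set(values)),
    }


def reconcile(source, target, rel_tol=1e-9):
    """Return the metric names that differ between two profiles.

    Counts must match exactly; sum, min, and max may differ within rel_tol
    to absorb decimal and floating point casting differences.
    """
    mismatches = []
    for metric, expected in source.items():
        actual = target[metric]
        numeric = metric in ("sum", "min", "max")
        if numeric and expected is not None and actual is not None:
            same = isclose(expected, actual, rel_tol=rel_tol)
        else:
            same = expected == actual
        if not same:
            mismatches.append(metric)
    return mismatches
```

Running the same comparison per date partition, rather than once per table, is what localises a discrepancy to the load that caused it.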


Cutover Strategy & Parallel Run

Migration cutover strategy: the plan for transitioning the production analytics workload from the source platform to DataBricks with minimum disruption. Parallel run: running both the source and DataBricks environments simultaneously for a validation period (typically 2–4 weeks), with the source as the authoritative system but DataBricks producing the same reports for comparison. Parallel run enables business stakeholders to compare DataBricks report outputs to the existing reports they trust, building confidence before the cutover. Big-bang cutover: on the agreed date, the source is retired, all BI tools are reconnected to DataBricks SQL warehouses, and ETL pipelines are switched to write to Delta Lake, with a defined rollback path (reconnect BI tools to source) if a critical issue is found in the first 24 hours. Phased migration: moving workloads sequentially (analytics first, operational reporting second, ML and AI last) to reduce the risk of any single cutover event.


Service 03

Delta Lake Data Modelling — Medallion Architecture on DataBricks

Delta Lake is the open-source storage layer that brings ACID transactions, schema enforcement, time travel, and scalable metadata handling to data lakes built on cloud object storage. Before Delta Lake, data lakes were fundamentally unreliable for analytics: a Spark job writing Parquet files could fail halfway through leaving partial, unreadable data; concurrent reads and writes produced inconsistent results; and there was no mechanism to enforce schema or prevent data type corruption. Delta Lake solves these problems with a transaction log (the _delta_log) that records every operation atomically, optimistic concurrency control that allows multiple writers to append to the same table simultaneously, and schema enforcement that rejects writes violating the defined schema (with schema evolution available when deliberate changes are needed). DataBricks' implementation of Delta Lake adds proprietary optimisations — Predictive IO, Liquid Clustering, and Predictive Optimisation — that automate the maintenance operations (OPTIMIZE, VACUUM, ANALYZE) that keep Delta tables performing at warehouse-grade speed on lake-scale data volumes.


Delta Lake — Platform Coverage

SourceMash data modelling practice

Storage Format: Open Parquet + Transaction Log

ACID: Atomic commits, optimistic concurrency

Time Travel: VERSION AS OF / TIMESTAMP AS OF

Schema: Enforcement + evolution modes

Optimization: Z-ORDER, Liquid Clustering, OPTIMIZE

Streaming: ReadStream / WriteStream Delta source

Medallion Architecture in Delta Lake

Medallion architecture implementation in DataBricks: Bronze layer (raw data landed from source systems without transformation: ingested via Auto Loader, Kafka, or JDBC, stored as Delta tables with schema inference or enforcement, partitioned by ingestion date for efficient time-based queries), Silver layer (cleaned, deduplicated, normalised, and enriched data: merged into slowly changing dimension tables, deduplicated via MERGE INTO with match conditions, joined across source systems to create unified entity views, and quality-tested with Great Expectations or DLT expectations), and Gold layer (business-aggregated, dimensionally modelled data ready for BI and ML consumption: star schema fact and dimension tables, summary aggregates, feature engineering outputs, and the curated datasets that power executive dashboards). Unity Catalog manages access control at each layer: analysts have SELECT on Gold only, data engineers have write access (the MODIFY privilege) on Bronze and Silver, and ML engineers have SELECT on Silver and Gold plus MODIFY on ML_FEATURES.
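The per-layer access model above can be written down as a matrix and expanded into GRANT statements. This is a sketch under stated assumptions: the catalog name `lakehouse`, the schema names, and the group names are hypothetical, write access is expressed with the Unity Catalog MODIFY privilege, and the matrix mirrors the text literally (a real deployment would likely also grant SELECT to engineers on the layers they write).

```python
# Group -> schema -> privileges, mirroring the access model in the text.
ACCESS = {
    "analysts": {"gold": ["SELECT"]},
    "data_engineers": {"bronze": ["MODIFY"], "silver": ["MODIFY"]},
    "ml_engineers": {
        "silver": ["SELECT"],
        "gold": ["SELECT"],
        "ml_features": ["MODIFY"],
    },
}


def grant_statements(catalog, access):
    """Expand the access matrix into Unity Catalog GRANT statements."""
    statements = []
    for group, schemas in access.items():
        # USE CATALOG / USE SCHEMA are needed before object privileges apply.
        statements.append(f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`")
        for schema, privileges in schemas.items():
            target = f"{catalog}.{schema}"
            statements.append(f"GRANT USE SCHEMA ON SCHEMA {target} TO `{group}`")
            for privilege in privileges:
                statements.append(f"GRANT {privilege} ON SCHEMA {target} TO `{group}`")
    return statements


for statement in grant_statements("lakehouse", ACCESS):
    print(statement)
```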


Incremental Processing & MERGE Strategy

Incremental data processing for large tables where full table recreation on every pipeline run is prohibitively slow or expensive — Delta Lake's MERGE INTO operation enables upserts that insert new rows and update changed rows in a single atomic transaction. Incremental patterns in DataBricks: append-only (new rows only, no updates — simplest, fastest, uses INSERT INTO for append-only sources like clickstream), MERGE INTO upsert (match on business key, update changed attributes, insert new rows — correct for dimension tables and fact tables where late-arriving updates occur), and SCD Type 2 (slowly changing dimensions tracking historical values by adding effective_date and is_current columns, implemented via MERGE INTO that expires the old row and inserts the new row). Change Data Capture (CDC) from source databases: reading Debezium or native CDC feeds into Delta Lake via Spark Structured Streaming and applying changes incrementally with MERGE INTO rather than full table reloads.
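The SCD Type 2 behaviour described above (expire the old row, insert the new one) can be simulated on in-memory rows to make the semantics concrete. This is a plain Python model of what the MERGE INTO would achieve, not Delta Lake code; the `end_date` column and the function name are assumptions added for illustration alongside the `effective_date` and `is_current` columns named in the text.

```python
from datetime import date


def scd2_merge(dimension, updates, key, tracked, today):
    """Apply updates to a dimension with SCD Type 2 semantics.

    Unchanged rows are skipped, changed rows expire the current version and
    add a new one, and unseen keys are inserted. The input list is not mutated.
    """
    result = [dict(row) for row in dimension]
    current = {row[key]: row for row in result if row["is_current"]}
    for update in updates:
        existing = current.get(update[key])
        if existing and all(existing[c] == update[c] for c in tracked):
            continue  # no attribute changed, keep the current version
        if existing:
            existing["is_current"] = False
            existing["end_date"] = today
        new_row = {key: update[key]}
        new_row.update({c: update[c] for c in tracked})
        new_row.update({"effective_date": today, "end_date": None, "is_current": True})
        result.append(new_row)
        current[update[key]] = new_row
    return result
```

For example, a customer moving from one city to another leaves two rows behind: the expired original and a new current row dated with the load day, so historical reports still join to the address that was valid at the time.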


Data Quality & Delta Expectations

Data quality framework for Delta Lake tables: Delta Expectations (built into Delta Live Tables) for declarative data quality rules defined as Python decorators on pipeline nodes (@dlt.expect('valid_order_amount', 'order_amount > 0')), Great Expectations integration for comprehensive data validation suites, and Unity Catalog data quality monitors that track freshness, volume, and schema drift over time. Quality rule types: completeness (not_null constraints on critical columns), validity (range checks, regex patterns for email/phone formats, referential integrity against dimension tables), uniqueness (primary key uniqueness verified before MERGE INTO operations), and timeliness (lag detection ensuring source data arrival within SLA). Quarantine pattern: routing rows failing quality checks to a separate Delta table (Bronze_Quarantine) for manual review rather than failing the entire pipeline — enabling incremental pipeline progress while isolating bad data for remediation.
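The quarantine pattern can be sketched without any platform dependency: each rule is a named predicate, and rows failing any rule are routed to a second collection together with the names of the rules they failed. The rule names other than `valid_order_amount`, the column names, and the simplified email pattern are illustrative assumptions.

```python
import re

# Deliberately simple email shape check, for illustration only.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

RULES = {
    "order_id_not_null": lambda r: r.get("order_id") is not None,
    "valid_order_amount": lambda r: r.get("order_amount") is not None
    and r["order_amount"] > 0,
    "valid_email": lambda r: bool(EMAIL.match(r.get("email") or "")),
}


def split_rows(rows, rules):
    """Return (clean, quarantined); quarantined rows carry their failed rules."""
    clean, quarantined = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            quarantined.append({**row, "failed_rules": failed})
        else:
            clean.append(row)
    return clean, quarantined
```

Recording which rules failed on each quarantined row is what makes the later manual review tractable, since the reviewer does not have to re-derive why a row was rejected.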


Table Optimisation & File Layout

Delta Lake table optimisation — the maintenance operations that keep query performance consistent as tables grow to terabyte and petabyte scale. OPTIMIZE: coalescing small files (produced by frequent streaming micro-batches or small JDBC extracts) into larger files (target 128MB-1GB per file) that match the read throughput of cloud object storage and Spark's parallel read model. Without OPTIMIZE, tables accumulate millions of tiny files that cause query planning to dominate execution time. Z-ORDER: multi-dimensional clustering that co-locates related column values in the same files — enabling data skipping (reading only files whose min/max statistics overlap the query filter) rather than full table scans. Liquid Clustering (DataBricks proprietary, 2024): automatic maintenance of clustering without manual Z-ORDER commands — the system continuously reorganises data as it is written. VACUUM: removing old Parquet files that are no longer referenced by the current Delta table state (after Time Travel retention expires) to control storage cost.
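Data skipping is easiest to see with a toy model of per-file min/max statistics. The sketch below is not Delta Lake code; it assumes each file records the minimum and maximum of one column, and compares a clustered layout with an unclustered one for the same range filter. File names and value ranges are made up.

```python
def files_to_read(files, low, high):
    """Return paths of files whose [min, max] range overlaps [low, high]."""
    return [f["path"] for f in files if f["max"] >= low and f["min"] <= high]


# After clustering on the column, each file covers a narrow value range.
CLUSTERED = [
    {"path": "part-0", "min": 1, "max": 100},
    {"path": "part-1", "min": 101, "max": 200},
    {"path": "part-2", "min": 201, "max": 300},
]

# Without clustering, every file spans almost the whole value range.
UNCLUSTERED = [
    {"path": "part-0", "min": 1, "max": 298},
    {"path": "part-1", "min": 3, "max": 300},
    {"path": "part-2", "min": 2, "max": 299},
]

print(files_to_read(CLUSTERED, 150, 160))
print(files_to_read(UNCLUSTERED, 150, 160))
```

The same filter touches one file in the clustered layout and all three in the unclustered one, which is the effect Z-ORDER and liquid clustering aim for at table scale.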


CI/CD for Delta Lake on DataBricks

CI/CD pipeline for Delta Lake schema and pipeline deployment — applying software development release practices to data platform changes. Development workflow: DataBricks Repos (Git integration within the DataBricks workspace) or external IDEs (VS Code with DataBricks extension) for notebook and wheel development, Git-based branching (feature branch per change, pull request for peer review), and DataBricks Asset Bundles (YAML-defined resource configurations for jobs, pipelines, and clusters that are deployed via CI). DataBricks CLI and API integration: linting SQL with SQLFluff, running unit tests on Python UDFs and transformation logic, and deploying to production via CI/CD pipelines on merge to main. Blue-green deployment for large schema changes: creating a parallel Gold schema (Gold_v2), backfilling with historical data, validating BI tool connectivity, and swapping the schema reference in Unity Catalog rather than running a potentially disruptive in-place migration.


Streaming Tables & Materialized Views

DataBricks Streaming Tables and Materialized Views (powered by Delta Live Tables) for declarative data pipeline definition — the shift from imperative Spark code ("read this, filter that, join these, write there") to declarative SQL or Python statements that define what the output should look like, with DataBricks handling the execution, refresh, and optimisation automatically. Streaming Table: a Delta table that is continuously updated from a streaming source (Kafka, Kinesis, Event Hubs, Auto Loader on cloud storage) — defined with CREATE STREAMING TABLE and refreshed automatically as new data arrives. Materialized View: a pre-computed aggregation or join result that refreshes incrementally when source tables change — defined with CREATE MATERIALIZED VIEW and automatically kept in sync by DataBricks without manual pipeline scheduling. Both integrate with Unity Catalog for governance and can be queried identically to standard tables by BI tools and analysts.


Service 04

Delta Live Tables & Pipeline Engineering — Auto Loader, Streaming & Orchestration

Delta Live Tables (DLT) is DataBricks' declarative pipeline framework that brings software engineering best practices (unit testing, data quality expectations, automatic error handling, and pipeline observability) to data pipeline development on DataBricks. Before DLT, data pipelines on DataBricks were typically built as imperative Spark notebooks: cells of PySpark or Scala code that read sources, applied transformations, and wrote to Delta tables, scheduled via DataBricks Jobs. These notebooks were difficult to test, prone to failure on data quality issues, and required manual maintenance of dependencies between pipeline steps. DLT replaces imperative notebook pipelines with declarative SQL or Python definitions: the developer defines the target table and the transformation logic, and DLT handles the execution graph construction, incremental processing, data quality enforcement, and failure recovery automatically. DLT pipelines integrate with Unity Catalog for lineage tracking, enforce data quality expectations that can flag or drop bad records without failing the pipeline, and provide automatic scaling and optimised execution through the DLT execution engine.


DLT — Pipeline Coverage

SourceMash pipeline engineering practice

Delta Live Tables: Declarative SQL + Python pipelines

Auto Loader: Cloud storage ingestion, incremental

Structured Streaming: Kafka, Kinesis, Event Hubs

Workflows: Job orchestration + task dependencies

dbt + DataBricks: dbt Core / Cloud on DataBricks SQL

Change Data Capture: Debezium + Kafka + Delta Lake MERGE

Delta Live Tables Implementation

DLT deployment and pipeline configuration for the declarative ETL approach, where the developer defines target tables and transformations, and DLT manages the execution graph, incremental processing, and data quality automatically. DLT pipeline modes: triggered (batch execution on a schedule, appropriate for traditional ETL workloads that run once per hour or once per day) and continuous (streaming execution that processes new data as it arrives, appropriate for near-real-time use cases with latency requirements of minutes). DLT SQL and Python APIs: @dlt.table() decorators in Python or CREATE LIVE TABLE statements in SQL that define each pipeline node, with automatic dependency resolution from the graph of references between tables. Data quality expectations: constraints evaluated on every pipeline run, with three behaviours on violation: @dlt.expect() keeps the row and records the violation in pipeline metrics, @dlt.expect_or_drop() drops the row from the target, and @dlt.expect_or_fail() stops the update. Quarantine tables are not created automatically; they are defined as a separate table that selects the rows failing the same rules. Unity Catalog integration: DLT tables are automatically registered in Unity Catalog with full lineage tracking from source to target.
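The automatic dependency resolution mentioned above amounts to a topological sort over the graph of table references. The sketch below illustrates that idea with Python's standard `graphlib` module; it is a conceptual model, not how DLT is implemented, and the table names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each table maps to the set of tables it reads from.
READS_FROM = {
    "bronze_orders": set(),
    "bronze_customers": set(),
    "silver_orders": {"bronze_orders"},
    "silver_customers": {"bronze_customers"},
    "gold_revenue_by_customer": {"silver_orders", "silver_customers"},
}


def execution_order(reads_from):
    """Return one valid order in which every table follows its inputs.

    Raises graphlib.CycleError if the references form a cycle.
    """
    return list(TopologicalSorter(reads_from).static_order())


print(execution_order(READS_FROM))
```

Because the order is derived from the references, adding a new table only requires declaring what it reads from; no separate scheduling step needs to be edited.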


Auto Loader & Cloud Storage Ingestion

Auto Loader for incremental ingestion of files from cloud storage (S3, ADLS, GCS) into Delta Lake: the DataBricks-native mechanism for efficiently loading new files as they arrive without repeatedly listing the entire bucket. Auto Loader detects new files either by incremental directory listing (the default) or, in file notification mode, through cloud notification services (SNS and SQS on AWS, Event Grid with Queue Storage on Azure, Pub/Sub on GCP), and ingests them incrementally with exactly-once processing guarantees. Schema inference and evolution: Auto Loader can infer schema from the first batch of files and evolve the schema as new columns appear in subsequent files, reducing pipeline failures from upstream schema changes. Supported formats: JSON, CSV, Parquet, Avro, ORC, XML, text, and binary files. Rescued data: rows or columns that do not conform to the expected schema are routed to a _rescued_data column rather than failing the ingestion, enabling pipeline continuity while capturing non-conforming data for investigation. CloudFiles source: the Spark readStream source that powers Auto Loader, configured with .format("cloudFiles") and options for schema location and file notification mode, with the checkpoint location set on the writing stream.
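A minimal sketch of the cloudFiles configuration follows. The option keys are Auto Loader option names; the paths, table name, and helper function names are hypothetical. The `start_ingest` function is only defined here, not called, because it needs the `spark` session that a DataBricks cluster provides.

```python
def autoloader_options(file_format, schema_location, use_notifications=False):
    """Build the option map for the cloudFiles source."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # False keeps the default directory listing mode.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def start_ingest(spark, source_path, target_table, checkpoint_location, options):
    """Start an incremental load into a Delta table (requires a cluster)."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
        .writeStream.option("checkpointLocation", checkpoint_location)
        .trigger(availableNow=True)  # process what has arrived, then stop
        .toTable(target_table)
    )


options = autoloader_options("json", "/Volumes/raw/landing/_schemas/orders")
print(options)
```

Using an available-now trigger makes the same stream definition usable as a scheduled batch job, which is often cheaper than keeping a cluster running continuously.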

Auto Loader
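
A minimal sketch of the cloudFiles configuration described above. The helper auto_loader_options and all paths and table names are hypothetical; the option keys are the documented cloudFiles ones. The stream itself is shown only as a comment because it needs a DataBricks cluster.

```python
from typing import Dict


def auto_loader_options(
    file_format: str,
    schema_location: str,
    use_notifications: bool = False,
    schema_evolution_mode: str = "addNewColumns",
) -> Dict[str, str]:
    """Build the option map for a cloudFiles (Auto Loader) stream."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # "true" switches from directory listing to file notification mode
        "cloudFiles.useNotifications": str(use_notifications).lower(),
        "cloudFiles.schemaEvolutionMode": schema_evolution_mode,
    }


options = auto_loader_options(
    file_format="json",
    schema_location="s3://example-bucket/_schemas/orders",  # hypothetical path
    use_notifications=True,
)

# On a DataBricks cluster, where `spark` is predefined, the options would be
# applied roughly like this (paths and table name are hypothetical):
#
# (spark.readStream.format("cloudFiles")
#       .options(**options)
#       .load("s3://example-bucket/landing/orders/")
#       .writeStream
#       .option("checkpointLocation", "s3://example-bucket/_checkpoints/orders")
#       .trigger(availableNow=True)
#       .toTable("bronze.orders"))
```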

Spark Structured Streaming

Spark Structured Streaming for real-time and micro-batch data processing on DataBricks — the continuous query engine that processes unbounded data streams with the same DataFrame API used for batch processing. Streaming sources: Kafka (readStream.format("kafka")), Kinesis (readStream.format("kinesis")), Event Hubs, cloud storage via Auto Loader, and Delta Lake (readStream.format("delta") for reading change feeds from other Delta tables). Streaming sinks: Delta Lake (writeStream.format("delta") with mergeSchema option), Kafka, and foreachBatch for custom logic. Stateful operations: aggregations over time windows (tumbling, sliding, session windows), stream-stream joins (joining two streaming sources with watermarking to handle late data), and deduplication (dropDuplicates within a watermark boundary). Checkpointing: maintaining state in cloud storage to enable exactly-once processing and fault tolerance across cluster restarts. Trigger intervals: processingTime triggers for micro-batch execution (every 30 seconds, every 5 minutes) or continuous triggers for millisecond-latency processing.

Streaming
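
To make the windowing and watermark ideas concrete, here is a plain-Python sketch of a tumbling-window count that ignores events arriving behind the watermark. It is a simplification: Spark advances the watermark per micro-batch and evicts state by window end, whereas this sketch checks each event against the latest event time seen so far. The function name and sample events are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def tumbling_counts(
    events: Iterable[Tuple[int, str]],
    window_seconds: int,
    watermark_delay_seconds: int,
) -> Dict[Tuple[int, str], int]:
    """Count events per (window_start, key), skipping events behind the watermark.

    `events` are (event_time_seconds, key) pairs in arrival order.
    """
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    max_event_time: Optional[int] = None
    for event_time, key in events:
        if max_event_time is not None:
            watermark = max_event_time - watermark_delay_seconds
            if event_time < watermark:
                continue  # too late: simplified stand-in for watermark handling
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
    return dict(counts)


# The event at t=58 arrives late but within the 30-second allowance;
# the event at t=20 arrives after the watermark has passed it.
arrivals = [(5, "a"), (65, "a"), (58, "a"), (130, "b"), (20, "a")]
result = tumbling_counts(arrivals, window_seconds=60, watermark_delay_seconds=30)
```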

DataBricks Workflows & Orchestration

DataBricks Workflows for pipeline orchestration — the native job scheduler that chains notebooks, Delta Live Tables pipelines, Python wheels, SQL queries, and external tasks into multi-step workflows with dependency management. Workflow tasks: notebook tasks (running a specific notebook with parameters), DLT pipeline tasks (triggering a DLT pipeline and waiting for completion), Python wheel tasks (running a packaged Python application), SQL tasks (executing SQL statements against a SQL warehouse), and conditional tasks (IF/ELSE branching based on task output). Task dependencies: defining the DAG of task execution with upstream/downstream relationships, retry policies for transient failures, and timeout configurations. External orchestration integration: Apache Airflow (via the DataBricks Airflow provider), Azure Data Factory, AWS Step Functions, and Prefect for cross-system orchestration where DataBricks is one component in a wider pipeline. Workflow monitoring: the DataBricks Jobs UI provides execution history, run duration, output logs, and alerting via email or webhooks on failure.

Workflows
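
The task DAG described above can be sketched as a job definition in the shape accepted by the DataBricks Jobs API, plus a small helper that derives a valid run order from the depends_on edges. The job name, notebook path, package name, and email address are hypothetical, and the pipeline ID is left as a placeholder.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

job_spec = {
    "name": "nightly_sales_pipeline",  # hypothetical job name
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {
                "notebook_path": "/Repos/data/ingest_orders",  # hypothetical
                "base_parameters": {"run_date": "2024-01-01"},
            },
            "max_retries": 2,          # retry transient failures
            "timeout_seconds": 3600,
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "pipeline_task": {"pipeline_id": "<dlt-pipeline-id>"},
        },
        {
            "task_key": "publish",
            "depends_on": [{"task_key": "transform"}],
            "python_wheel_task": {
                "package_name": "sales_reports",  # hypothetical
                "entry_point": "main",
            },
        },
    ],
    "email_notifications": {"on_failure": ["data-team@example.com"]},
}


def execution_order(spec: Dict) -> List[str]:
    """Return one valid run order implied by the depends_on edges."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in spec["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(job_spec)
```

A cycle in depends_on would raise graphlib.CycleError, which makes this a cheap pre-deployment check in CI.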

Real-Time & Change Data Capture

Real-time data ingestion into Delta Lake for operational analytics use cases where data freshness of minutes rather than hours is required. Debezium CDC pipeline: Debezium captures row-level changes from the source database transaction log (PostgreSQL WAL, MySQL binlog, SQL Server CDC, Oracle LogMiner) and publishes them to Apache Kafka topics, from which Spark Structured Streaming reads and applies changes to Delta Lake via MERGE INTO. Change Data Feed (CDF): a Delta Lake feature that captures row-level changes (INSERT, UPDATE, DELETE) on a Delta table and exposes them as a queryable change stream — enabling downstream consumers to process only changed rows without requiring a CDC tool at the source database. Confluent Cloud + DataBricks Kafka Connector for managed Kafka ingestion. DataBricks Streaming Tables for continuous ingestion from Kafka or Auto Loader with automatic schema evolution and exactly-once guarantees.

CDC / Streaming
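
The apply step of the Debezium pipeline can be sketched in plain Python. The event shape (op, before, after) follows the Debezium change event envelope; the in-memory dictionary stands in for the Delta table, and the function name and sample rows are hypothetical. On DataBricks the same logic would typically be a MERGE INTO inside foreachBatch.

```python
from typing import Dict, Iterable, Optional

Row = Dict[str, object]
ChangeEvent = Dict[str, object]


def apply_changes(
    target: Dict[object, Row],
    events: Iterable[ChangeEvent],
    key: str = "id",
) -> Dict[object, Row]:
    """Apply Debezium-style change events to an in-memory table keyed by `key`."""
    for event in events:
        op = event["op"]
        if op in ("c", "r", "u"):      # create, snapshot read, update: upsert
            after: Optional[Row] = event["after"]
            target[after[key]] = after
        elif op == "d":                # delete: remove using the old row image
            before: Optional[Row] = event["before"]
            target.pop(before[key], None)
    return target


table = {1: {"id": 1, "status": "new"}}
changes = [
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
    {"op": "c", "before": None, "after": {"id": 3, "status": "new"}},
]
final = apply_changes(table, changes)
```

Events must be applied in log order per key; with Kafka that usually means keying topics by primary key and deduplicating to the latest event per key within each micro-batch before merging.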

Pipeline Monitoring & Observability

Data pipeline observability — the monitoring and alerting that ensures pipeline failures are detected and resolved before they affect business reports and dashboards. DataBricks Job and Workflow monitoring: execution times, DBU consumption, error messages, and task success/failure status from the DataBricks Jobs UI and REST API. DataBricks SQL dashboards for pipeline monitoring: custom dashboards built in DataBricks SQL that query the DataBricks system tables (system.information_schema, system.billing, system.compute) to display pipeline health, data freshness, and cost trends. Data quality monitoring: Unity Catalog data quality monitors that track schema drift, volume anomalies, and freshness SLA breaches across Delta tables automatically. Integration with Monte Carlo, Acceldata, or open-source observability tools for full data observability across the Lakehouse. DataBricks Lakehouse Monitoring (2024): the native observability feature that automatically monitors data quality, model performance, and pipeline health across the Lakehouse without requiring external tools.

Observability

Service 05

Delta Sharing & DataBricks Marketplace

Delta Sharing is an open protocol for secure data sharing that enables organisations to share live data in Delta Lake format with any recipient — whether they use DataBricks, Snowflake, Apache Spark, Pandas, Power BI, or any other platform that implements the Delta Sharing protocol. Unlike traditional data exchange that requires extracting data to files, transferring via SFTP or API, and loading into the recipient's system, Delta Sharing allows the recipient to query the provider's live data directly from cloud storage using their own compute resources. The provider defines a share (a collection of tables and volumes), grants access to the recipient's identity, and the recipient mounts the share in their own catalog. The data remains in the provider's cloud storage account; the recipient reads only the files they query, paying only for their own compute. This open approach breaks down the platform lock-in that has historically made data sharing expensive and slow to set up.

Delta Sharing — Patterns

SourceMash Delta Sharing practice

Delta Sharing Open protocol — any recipient platform

DataBricks Sharing Account-to-account — live Delta tables

Marketplace Listing Free or paid data products

Recipient Platforms Snowflake, Spark, Pandas, Power BI

Secure Views Column / row masking on shares

Clean Rooms Privacy-preserving data collaboration

Delta Sharing — Provider & Consumer Implementation

Delta Sharing implementation — provider side: creating a Share object in Unity Catalog, registering Delta tables and volumes to the share, and defining recipient access (named recipient with authentication token, or open sharing via token URL). Recipient configuration: the recipient installs the Delta Sharing connector (available for DataBricks, Apache Spark, Pandas, Snowflake via Delta Sharing Connector, and Power BI via the Delta Sharing connector) and accesses shared data using their preferred query engine. Cross-platform sharing: a DataBricks provider can share Delta tables with a Snowflake consumer (via the Snowflake Delta Sharing connector), a pandas user in a Jupyter notebook, or a Power BI analyst — all querying the same live data without copies. Partition and column filtering: providers can define partition filters on shares (share only the last 90 days of data) and column filters (exclude sensitive columns) to limit what the recipient can access without creating a separate physical copy.

Delta Sharing
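
A sketch of the provider-side steps as the SQL statements a provider might run in order. The helper share_setup_sql and every object name are hypothetical; the statements follow Unity Catalog's CREATE SHARE, ALTER SHARE, CREATE RECIPIENT, and GRANT syntax as we understand it, so check them against the current SQL reference before use.

```python
from typing import List, Optional


def share_setup_sql(
    share: str,
    table: str,
    recipient: str,
    partition_filter: Optional[str] = None,
) -> List[str]:
    """Return the provider-side statements, in execution order."""
    add_table = f"ALTER SHARE {share} ADD TABLE {table}"
    if partition_filter:
        # Restrict the share to matching partitions, no physical copy needed
        add_table += f" PARTITION ({partition_filter})"
    return [
        f"CREATE SHARE IF NOT EXISTS {share}",
        add_table,
        f"CREATE RECIPIENT IF NOT EXISTS {recipient}",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]


statements = share_setup_sql(
    share="sales_share",
    table="sales.gold.daily_orders",
    recipient="acme_analytics",
    partition_filter="region = 'EMEA'",
)
```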

DataBricks Marketplace — Listings & Data Products

DataBricks Marketplace listing implementation — for organisations that want to monetise or freely distribute data products to other DataBricks customers globally. Marketplace listing creation: describing the data product (title, description, sample data, data dictionary, refresh frequency), configuring the Delta Sharing share that backs the listing (which tables and volumes are included), setting the listing as free (for open data sharing, brand building, or commercial lead generation) or paid (requesting access and managing commercial agreements through DataBricks' Marketplace framework). Consumer acquisition: organisations can discover your Marketplace listing through the DataBricks Marketplace portal, request access, and mount the data in their Unity Catalog within minutes — without any data movement, pipeline setup, or API integration. Complementary Marketplace consumption: integrating third-party data (financial market data, weather data, geolocation reference data, demographic enrichment) directly from DataBricks Marketplace listings into Delta Lake queries without extracting data from external APIs.

Marketplace

Data Clean Rooms & Privacy-Preserving Collaboration

DataBricks Clean Rooms for privacy-preserving data collaboration — the mechanism that enables two or more organisations to compute queries over a join of their respective datasets without any party seeing the other's raw data. Clean Room implementation: each party contributes data to a shared Clean Room environment where only approved SQL or Python queries can run, and only aggregated or differentially-private results are returned. Common Clean Room use case: a retailer and an FMCG brand want to understand which product promotions drive the most incremental purchase across the brand's customer base — possible in a Clean Room without the retailer exposing individual customer purchase records to the brand. Output restrictions: the Clean Room provider defines which query outputs are permitted (aggregated counts, statistical summaries) and which are blocked (row-level results, unique identifier joins). Unity Catalog enforces access control within the Clean Room, ensuring each party can only access their own data and the approved computed results.

Clean Rooms

Cross-Platform Data Sharing

Delta Sharing's open protocol enables sharing beyond the DataBricks ecosystem: to Snowflake (via the Snowflake Delta Sharing connector), to Apache Spark clusters (via the Delta Sharing Spark connector), to Python data scientists (via the delta-sharing PyPI package that returns Pandas DataFrames), and to BI tools (via the Power BI Delta Sharing connector or JDBC/ODBC bridges). Provider considerations: because Delta Sharing serves Parquet files directly from the provider's cloud storage, the provider must ensure that egress costs and request charges are accounted for in their pricing model when sharing with high-volume consumers. Recipient authentication: recipients on DataBricks with Unity Catalog are identified by their metastore sharing identifier and need no token (DataBricks-to-DataBricks sharing); recipients on other platforms (open sharing) download a credential file containing a bearer token, whose lifetime the provider configures and which can be rotated. Audit and monitoring: Unity Catalog captures all access to shared data, enabling providers to monitor which recipients are querying which tables and how frequently, providing the usage data needed for commercial data product pricing.

Cross-Platform
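
On the recipient side, the delta-sharing Python client addresses a table as a profile file path followed by #share.schema.table. The sketch below only builds that address so it runs without the package installed; the profile file name and the share, schema, and table names are hypothetical.

```python
def sharing_table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' address used by the client."""
    return f"{profile_path}#{share}.{schema}.{table}"


url = sharing_table_url("config.share", "sales_share", "gold", "daily_orders")

# With the delta-sharing package installed (pip install delta-sharing) and a
# valid profile file from the provider, the table could then be loaded with:
#
#   import delta_sharing
#   df = delta_sharing.load_as_pandas(url)
```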

Service 06

Mosaic AI & ML Operations — MLflow, Feature Store & Model Serving

DataBricks' Mosaic AI (formerly DataBricks ML) is the integrated machine learning and generative AI platform that enables organisations to build, train, tune, and deploy ML models and LLMs on the same Lakehouse infrastructure that powers their analytics. Before Mosaic AI, machine learning on enterprise data required extracting data from the warehouse to a separate ML platform, training models in an isolated environment, and deploying them separately from the data pipeline — creating security boundaries, data drift, and operational complexity. Mosaic AI eliminates this separation: Data Scientists query Delta tables directly via Spark DataFrames or Pandas-on-Spark, engineer features using the same SQL transformations the BI team uses, register models in MLflow (the open-source ML lifecycle platform, hosted natively in DataBricks), and deploy models as real-time REST endpoints or batch inference pipelines within the same Unity Catalog governance boundary. For generative AI, Mosaic AI provides pre-trained LLMs, fine-tuning infrastructure, vector search, and model serving for RAG (Retrieval-Augmented Generation) applications — all within the DataBricks platform.

Mosaic AI — Runtime Coverage

SourceMash ML & AI practice

MLflow Tracking, Registry, Model Serving

Feature Store Offline + online features — Unity Catalog

AutoML Classification, regression, forecasting

Mosaic AI LLMs, fine-tuning, RAG, vector search

Model Serving Real-time REST + batch inference

GPU Compute A10, A100, H100 clusters for training

MLflow Tracking & Model Registry

MLflow on DataBricks for the full machine learning lifecycle: experiment tracking, model versioning, and production deployment within the Lakehouse governance boundary. MLflow Tracking: logging parameters, metrics, artefacts (feature importance plots, confusion matrices, model files), and training data references for every experiment run, enabling reproducibility and comparison across hundreds of model iterations. MLflow Model Registry: promoting model versions through development, staging, and production stages with approval gates, version tagging, and automated transition webhooks that trigger CI/CD pipelines. Unity Catalog integration: MLflow models registered in Unity Catalog as first-class data assets with lineage tracking (which Delta tables and features were used to train each model version), access control (who can transition a model to production), and audit history. Model flavours: scikit-learn, XGBoost, LightGBM, TensorFlow, PyTorch, Spark MLlib, and LangChain, all supported natively with automatic environment packaging.

MLflow

Feature Store & Feature Engineering

DataBricks Feature Store for managing, discovering, and serving ML features — the engineered columns that ML models use as inputs — as governed Unity Catalog assets that are shared between data engineering and data science teams. Offline Feature Store: features computed in batch via Spark SQL or Delta Live Tables and stored in Delta tables (feature tables) with point-in-time correctness for training (no feature leakage from the future). Online Feature Store: low-latency feature serving via DataBricks' online store (backed by DynamoDB, Cosmos DB, or Redis) for real-time model inference where features must be retrieved in milliseconds. Feature discovery: data scientists search Unity Catalog for features created by the data engineering team, viewing descriptions, statistics, and lineage rather than recreating features independently. Feature tables are versioned and time-travelled like any Delta table, enabling training on historical feature values and detection of feature drift over time. Feature monitoring: automatic tracking of feature distributions and drift alerts when production feature statistics deviate from training distributions.

Feature Store
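
Point-in-time correctness means each training label is joined to the latest feature value known at or before the label's timestamp, never a later one. The plain-Python sketch below shows that lookup on toy data; the names and values are hypothetical and it is not the Feature Store API.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Feature history per entity: (timestamp, value) pairs sorted by timestamp
FeatureHistory = Dict[str, List[Tuple[int, float]]]


def feature_as_of(history: FeatureHistory, entity: str, ts: int) -> Optional[float]:
    """Return the latest feature value observed at or before `ts`."""
    rows = history.get(entity, [])
    timestamps = [t for t, _ in rows]
    idx = bisect_right(timestamps, ts)
    return rows[idx - 1][1] if idx > 0 else None


history = {"cust_1": [(10, 100.0), (20, 150.0), (30, 90.0)]}
labels = [("cust_1", 5), ("cust_1", 20), ("cust_1", 25), ("cust_1", 99)]
training_rows = [(entity, ts, feature_as_of(history, entity, ts)) for entity, ts in labels]
```

The label at timestamp 25 receives the value written at 20, not the one written at 30, which is exactly the leakage a naive latest-value join would introduce.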

Mosaic AI & Generative AI

Mosaic AI for generative AI and large language model operations on DataBricks — enabling organisations to build RAG applications, fine-tune open-source LLMs, and deploy AI applications without sending data to third-party APIs. Foundation Model APIs: calling pre-trained LLMs (Llama, Mistral, BGE embeddings) via DataBricks-hosted endpoints without managing infrastructure — billed per token with data remaining within the customer's cloud account. Fine-tuning: adapting open-source LLMs to proprietary data using DataBricks' LLM fine-tuning runtime (QLoRA, full fine-tuning) on GPU clusters, with training data sourced directly from Delta tables. Vector Search: creating vector indexes on Delta tables for semantic search and RAG retrieval, with automatic index maintenance as source data changes. Model Serving: deploying fine-tuned LLMs and embedding models as auto-scaling REST endpoints behind Unity Catalog governance. AI Playground: the interactive UI for testing prompts, comparing model outputs, and prototyping RAG applications before production deployment.

Mosaic AI
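
The retrieval step of a RAG application reduces to ranking stored embeddings by similarity to the query embedding. The sketch below does this by brute force with cosine similarity on toy three-dimensional vectors; a managed vector index would use approximate search over real embeddings instead. All names and vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k document IDs most similar to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


index = {
    "refund_policy": [1.0, 0.0, 0.0],
    "shipping_times": [0.0, 1.0, 0.0],
    "returns_faq": [0.9, 0.1, 0.0],
}
hits = top_k([1.0, 0.0, 0.0], index, k=2)
```

The retrieved documents would then be placed in the LLM prompt as context.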

Model Training & AutoML

DataBricks model training infrastructure for classical machine learning and deep learning: AutoML for rapid baseline model generation (classification, regression, forecasting — DataBricks automatically tries multiple algorithms, hyperparameter combinations, and preprocessing pipelines, presenting the best models in a leaderboard with explainability), distributed training for large datasets (Spark MLlib for distributed ML on terabyte-scale data, Horovod for distributed deep learning across GPU clusters), and hyperparameter tuning (Hyperopt with distributed execution across cluster workers for efficient hyperparameter search). Training data access: direct query of Delta tables via Spark DataFrames, Pandas-on-Spark (distributed Pandas operations on Spark clusters), or Spark SQL — eliminating data extraction and ensuring training data is as fresh as the analytics data. Experiment management: all training runs logged to MLflow with automatic artefact capture, enabling reproducibility and model lineage tracking back to the exact data version used for training.

Training / AutoML

Model Serving & Inference

DataBricks Model Serving for deploying ML models and LLMs as production REST endpoints or batch inference pipelines within the Lakehouse security boundary. Real-time serving: deploying registered MLflow models as auto-scaling REST endpoints (CPU-based for classical ML, GPU-based for LLMs) with Unity Catalog authentication, request logging, and A/B testing support. Batch inference: applying trained models to large Delta tables via Spark UDFs or the DataBricks model scoring API for offline predictions that write results back to Delta Lake. Model serving endpoints scale from zero (no cost when not receiving requests) to thousands of requests per second with automatic load balancing. Endpoint monitoring: tracking request latency, throughput, error rates, and model prediction distributions to detect drift and performance degradation. Integration with DataBricks Apps and external applications: REST endpoints can be called from Power BI, custom web applications, or mobile apps via standard HTTP with token authentication.

Model Serving
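
Calling a real-time endpoint from an external application is a plain HTTPS POST with a bearer token. The sketch below prepares such a request with the standard library without sending it; the workspace URL, endpoint name, and feature names are hypothetical, and the dataframe_records body is one of the input formats the serving endpoints accept.

```python
import json
import urllib.request
from typing import Dict, List


def build_scoring_request(
    workspace_url: str,
    endpoint_name: str,
    token: str,
    records: List[Dict[str, object]],
) -> urllib.request.Request:
    """Prepare (but do not send) a POST to a model serving endpoint."""
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url=f"{workspace_url}/serving-endpoints/{endpoint_name}/invocations",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_scoring_request(
    "https://example.cloud.databricks.com",  # hypothetical workspace URL
    "churn-model",                            # hypothetical endpoint name
    "<personal-access-token>",
    [{"tenure_months": 14, "monthly_spend": 42.5}],
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```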

DataBricks Notebooks & Collaborative Development

DataBricks Notebooks for collaborative data science, ML engineering, and analytics development — the interactive environment that supports Python, SQL, Scala, and R in the same notebook with mixed-language execution. Notebook use cases: exploratory data analysis on Delta tables (plotting with Matplotlib, Seaborn, or Plotly inline), ML model development and testing with MLflow experiment tracking, feature engineering pipeline prototyping, and the data quality investigation workflows where SQL and Python are interleaved. Collaborative features: real-time co-editing (multiple users editing the same notebook simultaneously), comments and annotations, revision history with automatic versioning, and integration with Git repositories (DataBricks Repos) for branch-based development. Notebooks run on DataBricks compute (all-purpose clusters, serverless, or jobs compute), keeping all data within the Lakehouse security boundary. Parameterised notebooks: defining input parameters that enable the same notebook to be reused across different datasets or time periods via DataBricks Jobs or Workflows.

Notebooks

Service 07

Unity Catalog Governance — Masking, Lineage, Access Control & Compliance

Unity Catalog is DataBricks' unified governance solution for the Lakehouse — providing centralized access control, auditing, lineage, and data discovery across all data assets (Delta tables, volumes, models, notebooks, dashboards) in a DataBricks account. Before Unity Catalog, DataBricks workspaces used the Hive metastore, which provided only database/table-level access control with no column or row-level security, no data lineage, no audit logging, and no cross-workspace sharing. Unity Catalog replaces this with a three-layer namespace (catalog.schema.table), attribute-based access control (ABAC) via tags, dynamic data masking and row filters, column-level lineage, and comprehensive audit logging — all managed through SQL statements and integrated with enterprise identity providers (Azure AD, AWS IAM Identity Center, Okta). For regulated industries (BFSI, healthcare, insurance, government), Unity Catalog enables implementation of granular data access controls that operate at the column and row level, enforce data masking for sensitive attributes based on the querying user's role, and produce the data lineage and access audit records that regulatory compliance programmes require.

Governance — Framework

SourceMash Unity Catalog practice

Dynamic Data Masking Column-level — role-aware masking

Row Filters Row-level filtering by identity

Object Tagging PII / sensitivity classification

Role Hierarchy RBAC + ABAC via account roles

Audit system.access.audit — column-level

Compliance GDPR, HIPAA, PCI DSS, DPDP

Dynamic Data Masking

Unity Catalog Dynamic Data Masking for column-level data protection that applies masking rules at query time based on the querying user's role — without modifying the underlying data or requiring multiple copies of the table. Masking function creation: a SQL UDF that returns the column value for authorised roles and a masked or null value for all other roles (CASE WHEN is_account_group_member('ANALYST_PII') THEN credit_card_number ELSE '****-****-****-' || RIGHT(credit_card_number, 4) END). Policy assignment: the masking policy is assigned to a column in CREATE TABLE or via ALTER COLUMN — from that point, every query against that column is masked for unauthorised users transparently. Masking policy types: full masking (NULL), partial masking (first 4 / last 4 of credit card), hash masking (deterministic but irreversible for join-compatible pseudonymisation), and conditional masking (different masking for different roles). Unity Catalog applies masking consistently across SQL warehouses, all-purpose clusters, and DLT pipelines.

DDM
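
A sketch of the credit card mask described above: the SQL an administrator might run, held as strings, alongside a Python mirror of the same rule that can be unit-tested outside the workspace. The catalog, schema, table, and function names are hypothetical; the SET MASK syntax should be checked against the current SQL reference.

```python
from typing import Set

# SQL a Unity Catalog administrator might run; object names are hypothetical.
MASK_FUNCTION_SQL = """
CREATE OR REPLACE FUNCTION governance.masks.cc_mask(credit_card_number STRING)
RETURN CASE
  WHEN is_account_group_member('ANALYST_PII') THEN credit_card_number
  ELSE concat('****-****-****-', right(credit_card_number, 4))
END
"""

APPLY_MASK_SQL = (
    "ALTER TABLE sales.gold.payments "
    "ALTER COLUMN credit_card_number SET MASK governance.masks.cc_mask"
)


def cc_mask(credit_card_number: str, user_groups: Set[str]) -> str:
    """Python mirror of the SQL mask, useful for testing the rule itself."""
    if "ANALYST_PII" in user_groups:
        return credit_card_number
    return "****-****-****-" + credit_card_number[-4:]


full = cc_mask("4111-1111-1111-1234", {"ANALYST_PII"})
masked = cc_mask("4111-1111-1111-1234", {"ANALYST_BASIC"})
```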

Row Filters & Row-Level Security

Unity Catalog Row Filters for row-level data filtering that restricts which rows a user can see in a table based on their identity or role membership — the DataBricks equivalent of row-level security in SQL Server or Oracle Virtual Private Database. Row filter design: a SQL UDF that returns TRUE for rows the current user is authorised to see and FALSE for rows that should be filtered out. Implementation patterns: region-based access (each regional manager sees only their region's rows — a lookup table mapping username to authorised region codes, referenced in the filter function), customer-level access (a B2B portal scenario where each customer account sees only their own transaction rows — the filter function joins to a customer-user mapping table), and classification-level access (users with CONFIDENTIAL role see all rows; users without see only rows tagged as PUBLIC). A single row filter can be applied to multiple tables simultaneously — enabling consistent access control across the data model without per-table repetition. Row filters combine with column masks for comprehensive cell-level security.

Row Filters
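
The region-based pattern can be sketched as a filter function backed by a user-to-region mapping table. The SQL strings show one way it might be declared in Unity Catalog, and the Python function mirrors the same predicate on in-memory rows. All object names, users, and rows are hypothetical.

```python
from typing import Dict, List, Set

ROW_FILTER_SQL = [
    # Returns TRUE only for rows whose region is mapped to the calling user
    """CREATE OR REPLACE FUNCTION governance.filters.region_filter(region STRING)
RETURN EXISTS (
  SELECT 1 FROM governance.security.user_regions m
  WHERE m.username = current_user() AND m.region_code = region
)""",
    "ALTER TABLE sales.gold.orders "
    "SET ROW FILTER governance.filters.region_filter ON (region)",
]


def visible_rows(
    rows: List[Dict[str, object]],
    user: str,
    user_regions: Dict[str, Set[str]],
) -> List[Dict[str, object]]:
    """Python mirror of the filter: keep rows in the user's authorised regions."""
    allowed = user_regions.get(user, set())
    return [row for row in rows if row["region"] in allowed]


orders = [
    {"order_id": 1, "region": "NORTH"},
    {"order_id": 2, "region": "SOUTH"},
    {"order_id": 3, "region": "NORTH"},
]
user_regions = {"asha@example.com": {"NORTH"}}
north_only = visible_rows(orders, "asha@example.com", user_regions)
nothing = visible_rows(orders, "unknown@example.com", user_regions)
```

A user missing from the mapping table sees no rows, which is the safe default for this pattern.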

Object Tagging & Data Classification

Unity Catalog Object Tags for attaching metadata to data assets (catalogs, schemas, tables, columns, volumes): the foundation for automated governance policy application based on data sensitivity classification. Tag creation and assignment: SENSITIVITY_LEVEL (PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED), DATA_DOMAIN (FINANCIAL / HEALTH / PII / OPERATIONAL), RETENTION_PERIOD, GDPR_APPLICABLE. Tag-based masking policies: instead of assigning a masking policy to each column individually, tag all PII columns with DATA_DOMAIN = PII and attach the masking policy to that tag value, so the policy automatically applies to every column tagged as PII in future without a manual ALTER COLUMN for each. Unity Catalog data classification: the automated PII detection that scans table contents and recommends sensitivity tags based on column name patterns and data patterns (credit card numbers, email addresses, phone numbers, Aadhaar formats), accelerating the classification of large legacy schemas. Tags are inherited down the hierarchy (catalog → schema → table → column) and can be queried via INFORMATION_SCHEMA for data catalogue browsing.

RBAC & Account Role Design

Unity Catalog Role-Based Access Control (RBAC) architecture: the account role hierarchy that determines which users can access which objects, perform which operations, and consume which compute resources. Standard role design pattern: ACCOUNT ADMIN (manages account-level settings and billing), METASTORE ADMIN (manages Unity Catalog metastore and governance), CATALOG ADMIN (manages a specific catalog), and custom functional roles (ANALYST_FINANCE, DATA_ENGINEER_BRONZE, ML_SCIENTIST) with object-level BROWSE, SELECT, MODIFY, READ VOLUME, and CREATE TABLE grants. Privilege inheritance: Unity Catalog privileges are inherited down the object hierarchy, so SELECT granted on a catalog or schema applies to every current and future table beneath it. To query a table, a principal needs USE CATALOG on the parent catalog, USE SCHEMA on the parent schema, and SELECT on the table (granted directly or inherited); for least privilege, grant SELECT at the narrowest level that fits the role. Service principals for automation: dedicated service principals for CI/CD pipelines, ETL jobs, and BI tool connections with minimal required privileges rather than using personal access tokens. Identity federation: synchronising Azure AD, AWS IAM Identity Center, or Okta groups to DataBricks account groups for SSO and automated role assignment.

RBAC
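
A small sketch of the minimum grants a functional role needs to query a single table, generated as SQL strings. Unity Catalog names the catalog-level and schema-level access privileges USE CATALOG and USE SCHEMA. The helper read_grants and the catalog, schema, and table names are hypothetical.

```python
from typing import List


def read_grants(catalog: str, schema: str, table: str, group: str) -> List[str]:
    """Minimal grants for a group to query one table, in dependency order."""
    principal = f"`{group}`"  # backticks keep group names with special characters valid
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {principal}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {principal}",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO {principal}",
    ]


grants = read_grants("finance", "gold", "ledger", "ANALYST_FINANCE")
```

Keeping grants in code like this makes them reviewable and repeatable across environments, for example when run by a service principal in a CI/CD pipeline.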

Audit Logging & Access Monitoring

Unity Catalog audit logging via system.access.audit — the system table that records every query, data access, privilege grant, and governance policy change in the DataBricks account. Compliance audit queries: "which users accessed the PII columns in the CUSTOMER table in the last 90 days?", "which queries accessed CONFIDENTIAL-tagged tables outside business hours?", "which service principals modified row filter policies?" — all answerable from system.access.audit without requiring a separate audit log system. system.billing.usage for operational audit: DBU consumption by workspace, cluster, job, and user — enabling cost allocation and anomaly detection (queries consuming abnormally high DBUs, clusters running at unexpected hours). system.compute.clusters and system.compute.node_types for infrastructure audit: tracking cluster configurations, policy compliance, and auto-termination effectiveness. system.information_schema for data governance audit: querying table schemas, column tags, masking policies, and row filter assignments programmatically.

Audit

Data Lineage & Discovery

Unity Catalog data lineage — the automatic tracking of data flow from source tables through transformations to downstream tables, dashboards, and ML models. Lineage capture: Unity Catalog records lineage for SQL queries, Delta Live Tables pipelines, Spark DataFrame operations, and notebooks — showing which tables were read, which were written, which columns were transformed, and which notebooks or jobs executed the transformation. Impact analysis: answering "which downstream tables, dashboards, and ML models would be affected if this source table column changed?" by traversing the lineage graph in Unity Catalog. Data discovery: the DataBricks Data Explorer and Unity Catalog search enable analysts and data scientists to find tables by name, column name, tag, or description — browsing the three-layer namespace and viewing table schemas, sample data, statistics, and lineage without writing SQL. Integration with external data catalogues: Alation, Collibra, and Microsoft Purview connect to Unity Catalog via APIs to pull metadata, lineage, and usage statistics into enterprise catalogues.

Lineage

Service 08

DataBricks FinOps — DBU Optimisation & Cost Control

DataBricks' DBU (DataBricks Unit) pricing model charges for compute consumption based on the instance type, cluster size, and runtime used — with different DBU rates for all-purpose compute, jobs compute, SQL warehouses, and GPU-enabled ML compute. While this model provides transparency and flexibility, it also creates opportunities for runaway cost: all-purpose clusters left running without auto-termination accumulate DBUs continuously even when no notebooks are executing; SQL warehouses provisioned for peak concurrency but running at 10% utilisation waste DBUs during off-peak hours; and Spark jobs with excessive shuffling or skewed partitions consume 5–10x more DBUs than necessary due to inefficient execution rather than genuine data volume. DataBricks FinOps is the continuous practice of identifying and closing the gap between what the organisation is paying for DataBricks and what the organisation needs to pay for the business value it is extracting.

FinOps — Optimisation Areas

SourceMash DataBricks FinOps practice

Auto-Termination 15-min dev | 5-min ad-hoc | job completion

Avg. DBU Savings 25–40% within 90 days

Query Optimisation Photon, Z-ORDER, partition pruning

Storage VACUUM, Time Travel retention tuning

Monitoring system.billing.usage dashboards

Commitment Plans DBU commitment discount 20–40%

Auto-Termination & Cluster Policies

Auto-termination configuration audit: the most impactful single change in most DataBricks cost optimisation exercises. An all-purpose cluster with 4 workers plus a driver runs 5 nodes; left on continuously for a 30-day month it accumulates 5 × 24 × 30 = 3,600 node-hours, each billed at the instance type's DBU rate on top of the cloud provider's VM charge, or potentially ₹8–12 lakh per month for a standard instance type, for zero analytical value if nobody is running notebooks. Optimal auto-termination settings by cluster type: development all-purpose clusters (15–30 minutes: developers need fast restart for iterative work but should not leave clusters idle overnight), ad-hoc analysis clusters (5–10 minutes: analysts generate intermittent load), and ETL job clusters (terminate immediately on job completion, never leaving a jobs cluster running after the pipeline finishes). Cluster policies enforce these settings organisation-wide, preventing individual users from overriding auto-termination for convenience.
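
The idle-cluster arithmetic can be made explicit with a short sketch, counting the driver alongside the four workers. The DBU rate and price per DBU below are placeholder values, not DataBricks list prices, and the result covers DBU charges only. The policy fragment uses the cluster policy definition format to cap auto-termination.

```python
def monthly_dbu_cost(
    workers: int,
    dbu_per_node_hour: float,
    price_per_dbu: float,
    hours_per_day: float = 24.0,
    days: int = 30,
) -> float:
    """DBU cost of a cluster that stays up for the given hours each day."""
    nodes = workers + 1  # workers plus the driver
    node_hours = nodes * hours_per_day * days
    return node_hours * dbu_per_node_hour * price_per_dbu


# Placeholder rates: 1.0 DBU per node-hour at 0.5 currency units per DBU
always_on = monthly_dbu_cost(workers=4, dbu_per_node_hour=1.0, price_per_dbu=0.5)
working_hours = monthly_dbu_cost(
    workers=4, dbu_per_node_hour=1.0, price_per_dbu=0.5, hours_per_day=8.0, days=22
)

# Cluster policy fragment: users may pick 5 to 30 minutes, default 15
AUTO_TERMINATION_POLICY = {
    "autotermination_minutes": {
        "type": "range",
        "minValue": 5,
        "maxValue": 30,
        "defaultValue": 15,
    }
}
```

With these placeholder rates the always-on cluster costs roughly four times the working-hours one (3,600 versus 880 node-hours), which is the gap auto-termination closes.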

SQL Warehouse Rightsizing

SQL warehouse size and scaling audit — matching warehouse configuration to actual BI concurrency and query complexity rather than defaulting to Large or X-Large for all workloads. system.billing.usage analysis: identifying SQL warehouses with high DBU consumption but low query concurrency (indicating over-provisioning), warehouses where query queuing is negligible (indicating that a smaller size would suffice), and warehouses running 24/7 for intermittent BI load (serverless SQL warehouses auto-scale to zero, eliminating idle cost). Serverless vs. classic: serverless SQL warehouses start instantly, scale automatically, and pause when idle — ideal for variable BI load; classic warehouses require manual sizing and start-up time but offer slightly lower per-query DBU rates for predictable, high-volume workloads. Connection pooling: configuring max concurrent connections to match actual BI tool demand rather than the default.

Query & Spark Performance

Spark job optimisation for DBU reduction — the most technically complex FinOps work but the highest-ROI for organisations with expensive ETL pipelines. Spark UI analysis identifying the top 20 jobs by total DBUs consumed (the product of DBU rate × execution time). Common DBU-heavy patterns: excessive shuffling (joins on skewed keys causing one partition to process most of the data), small file problems (Delta tables with millions of tiny files causing query planning to dominate execution), lack of partition pruning (queries filtering on a column that is not the partition key or Z-ORDER column, causing full table scans), and non-Photon execution (queries running on the legacy Spark SQL engine rather than the vectorized Photon engine). Photon enablement: turning on Photon for SQL warehouses and all-purpose clusters provides 2–5x query speedup on analytical queries, directly reducing DBU consumption per query.

Delta Lake File Layout Optimisation

Delta table optimisation for query cost reduction on large tables where full file scans are the primary source of DBU consumption. OPTIMIZE (file compaction): coalescing small files produced by streaming micro-batches into larger files (target 128MB–1GB) that match Spark's parallel read throughput. Without OPTIMIZE, tables accumulate millions of tiny files that cause query planning and task scheduling overhead to dominate execution time. Z-ORDER and Liquid Clustering: co-locating related column values in the same files to enable data skipping (reading only files whose min/max statistics overlap the query filter) rather than full table scans. Automatic vs. manual: enabling Predictive Optimisation (DataBricks proprietary, 2024) automates OPTIMIZE, VACUUM, and ANALYZE based on query patterns rather than requiring manual scheduling. Partition strategy review: ensuring partition columns have low cardinality (date, region) rather than high cardinality (timestamp, user_id) which creates the small file problem.
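
Data skipping can be illustrated with per-file min/max statistics: a file is read only if its value range overlaps the query filter. The plain-Python sketch below compares a clustered layout with an unclustered one; the file names and date ranges are hypothetical and the logic is a simplification of what the Delta reader does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    path: str
    min_value: str  # ISO dates compare correctly as strings
    max_value: str


def files_to_scan(files: List[FileStats], lower: str, upper: str) -> List[str]:
    """Keep only files whose [min, max] range overlaps the filter range."""
    return [f.path for f in files if f.max_value >= lower and f.min_value <= upper]


# Clustered on order_date: each file covers a narrow date range
clustered = [
    FileStats("part-0", "2024-01-01", "2024-01-31"),
    FileStats("part-1", "2024-02-01", "2024-02-29"),
    FileStats("part-2", "2024-03-01", "2024-03-31"),
]
# Unclustered: every file contains dates from the whole quarter
unclustered = [
    FileStats("part-0", "2024-01-01", "2024-03-31"),
    FileStats("part-1", "2024-01-01", "2024-03-31"),
    FileStats("part-2", "2024-01-01", "2024-03-31"),
]

scan_clustered = files_to_scan(clustered, "2024-02-10", "2024-02-12")
scan_unclustered = files_to_scan(unclustered, "2024-02-10", "2024-02-12")
```

The same three-day query touches one file in the clustered layout and all three in the unclustered one, which is the effect Z-ORDER and Liquid Clustering aim for at scale.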

Storage Cost Optimisation

Cloud storage cost management for Delta Lake: DataBricks DBUs cover compute only; cloud storage (S3, ADLS, GCS) is billed separately by the cloud provider and can accumulate significant cost at petabyte scale. Time Travel retention tuning: by default Delta Lake keeps transaction log history for 30 days (delta.logRetentionDuration) and keeps removed data files for 7 days before VACUUM may delete them (delta.deletedFileRetentionDuration); for high-churn tables (staging tables refreshed daily), holding data-file retention at 7 days rather than extending it to 30 removes old Parquet files from storage significantly faster. VACUUM scheduling: regularly removing unreferenced old files that are no longer needed for Time Travel. Table clones: shallow clones reference the source table's data files and only add storage for data written after cloning, whereas deep clones copy the data in full; dropping stale development environment clones that have diverged significantly from production eliminates their independent storage costs. Cloud storage tiering: configuring lifecycle policies on the underlying object storage to move old Delta files to infrequent access or archive tiers after a defined period, reducing storage cost for historical data that is queried rarely but must be retained for compliance.
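As a rough sketch of the retention and VACUUM steps, the helper below builds the two SQL statements involved. The function name and the table name staging.daily_orders are hypothetical; the statements use the standard Delta table property delta.deletedFileRetentionDuration and the VACUUM ... RETAIN syntax, and in a notebook each string would typically be passed to spark.sql.

```python
def retention_statements(table: str, retain_days: int) -> list[str]:
    """Build the SQL to set data-file retention and vacuum a Delta table."""
    if retain_days < 7:
        # Guard chosen for this sketch: going below the 7-day default should be
        # a deliberate decision, because readers of older versions would break.
        raise ValueError("retain_days below 7 must be reviewed explicitly")
    return [
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.deletedFileRetentionDuration' = 'interval {retain_days} days')",
        f"VACUUM {table} RETAIN {retain_days * 24} HOURS",
    ]


# Hypothetical high-churn staging table.
for statement in retention_statements("staging.daily_orders", 7):
    print(statement)
```

Keeping the property and the VACUUM window in one place avoids the common mismatch where a table is vacuumed on a shorter window than its retention setting implies.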

DBU Monitoring & Budgeting

DBU consumption monitoring using DataBricks system tables, the gold standard for DataBricks cost analysis. Custom DBU dashboard built on system.billing.usage (DBU consumption by workspace, cluster, job, user, and time), system.billing.list_prices (DBU list prices for each compute type), system.compute.clusters (cluster configurations and runtime versions), and system.access.audit (audit log events correlated with the user or service principal that triggered them). Budget alerts: configuring billing alerts at the account level when monthly DBU consumption exceeds defined thresholds, with notifications to finance and engineering leadership. DBU commitment planning: DataBricks offers significant discounts (20–40%) for pre-purchased DBU commitments over 1–3 years; accurate consumption forecasting from our monitoring programme enables right-sizing commitments to avoid over-purchase (unused committed DBUs) or under-purchase (paying premium on-demand rates for excess consumption).
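The over-purchase versus under-purchase trade-off can be made concrete with a short calculation. The model below is a deliberate simplification and an assumption of this sketch: the commitment is paid in full at the discounted rate, and any usage beyond it is billed at on-demand list price. Actual contract terms vary, and the figures are illustrative.

```python
def annual_cost(forecast_usage: float, commitment: float, discount: float) -> float:
    """Annual cost under the simplified commitment model of this sketch.

    forecast_usage and commitment are annual amounts valued at on-demand list
    price. The commitment is paid in full at the discounted rate; usage beyond
    it is billed at the on-demand rate.
    """
    committed_spend = commitment * (1.0 - discount)
    overage = max(0.0, forecast_usage - commitment)
    return committed_spend + overage


def unused_commitment(forecast_usage: float, commitment: float) -> float:
    """Committed usage (at list price) that the forecast would never draw down."""
    return max(0.0, commitment - forecast_usage)


forecast = 100.0  # hypothetical annual forecast, in any currency unit
discount = 0.30   # an assumed rate inside the 20-40% range
for commitment in (0.0, 80.0, 100.0, 120.0):
    cost = annual_cost(forecast, commitment, discount)
    idle = unused_commitment(forecast, commitment)
    print(f"commit {commitment:6.1f} -> cost {cost:6.1f}, unused {idle:5.1f}")
```

Under these assumptions the cost is lowest when the commitment matches the forecast, and it rises on both sides, which is why forecast accuracy matters more than the headline discount.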

35%

Avg. DataBricks DBU cost reduction within 90 days of FinOps programme

40%

Maximum discount on pre-purchased DBU capacity vs. on-demand rates

95%

DBU reduction typical for clusters with auto-termination 60+ min → 15 min

Query speedup from Photon engine + Z-ORDER on analytical workloads

INDUSTRY DATABRICKS USE CASES

DataBricks Lakehouse by Industry.

DataBricks’ combination of open storage, unified governance, real-time streaming, and integrated AI makes it the platform of choice for data-intensive and AI-forward industries.

BFSI

Financial Data & AI Platform

Unified Lakehouse for retail banking, markets, and insurance data with Unity Catalog governance for RBI, SEBI, and PCI DSS compliance
Real-time fraud detection via Spark Structured Streaming on transaction events with MLflow-managed anomaly detection models
Risk analytics Lakehouse consolidating market data, counterparty exposure, and collateral positions across trading books
Regulatory reporting (Basel III, IFRS 9, RBI filings) via Delta Lake tables with row filters by legal entity and dynamic data masking
Delta Sharing with credit bureaus and payment networks — live data without extract/load pipeline

RETAIL & E-COMMERCE

Customer Analytics & Personalisation

360° customer data platform unifying transactional, loyalty, and behavioural data across online and offline channels in Delta Lake
Inventory analytics with demand forecasting (AutoML + Prophet) across 50,000+ SKUs and 300+ locations
Marketing attribution and media mix modelling across Google, Meta, and affiliate channels with Delta Sharing Clean Rooms
Real-time personalisation feature store: customer segment and affinity features served via DataBricks Feature Store to recommendation API
Supplier analytics with Delta Sharing — sharing sell-out data with key suppliers without monthly extract files

MANUFACTURING

Industrial Data Intelligence

IoT sensor data landing via Auto Loader + Kafka — billions of events per day stored and queryable in Delta Lake
Predictive maintenance model (XGBoost via MLflow) trained on sensor time-series and maintenance history
Supply chain analytics integrating SAP, supplier EDI data, and logistics systems with Delta Live Tables pipelines
OEE analytics dashboard powered by DataBricks SQL — production line performance with automated alerting
Delta Sharing with Tier-1 OEM customers — sharing quality analytics on supplied components without monthly reports

HEALTHCARE & LIFE SCIENCES

Health Data & AI Platform

HIPAA-compliant patient data platform with Unity Catalog row filters, dynamic data masking, and private cloud connectivity
Clinical trial data analytics with de-identification via Unity Catalog masking policies — analysts query anonymised data
Revenue cycle analytics: claims, AR, denial rate, and reimbursement trend from billing system to Delta Lake via Fivetran
Genomic data analysis with Spark distributed compute and Mosaic AI for population health research at scale
Health data exchange between hospital networks via Delta Sharing — FHIR-formatted records shared live

SAAS & TECHNOLOGY

Product & Customer Analytics

Product analytics Lakehouse: event data from Segment/Mixpanel/custom tracking landed via Fivetran into Delta Lake
Customer success analytics: health scoring, churn prediction (MLflow + Feature Store), and expansion scenarios
Multi-tenant data isolation with Unity Catalog row filters — each customer account sees only data they are entitled to
Embedded analytics via Delta Sharing — customers query their product usage data in their own DataBricks or Snowflake account
dbt-modelled Delta Lake warehouse as the source of truth for all product, financial, and operational metrics

MEDIA & ADVERTISING

Audience & Campaign Analytics

Audience data platform: first-party identity data enriched with third-party demographic data from DataBricks Marketplace
Campaign performance analytics across walled gardens and open web inventory with unified attribution in Delta Lake
Publisher audience sharing with advertisers via clean rooms — DataBricks Clean Room for privacy-preserving audience overlap
Real-time bidding analytics: impression, click, and conversion event data streamed via Kafka + Structured Streaming at 100M+ events/day
Reach and frequency analytics with approximate distinct counts via Spark SQL’s native HLL functions

Integration Ecosystem

Tools That Connect to DataBricks in Our Practice.

DataBricks' open Lakehouse architecture integrates with every major data ingestion, transformation, BI, orchestration, ML, and observability tool. Key systems we integrate regularly:

📥 Ingestion & ELT

Fivetran: Managed ELT
Airbyte: Open-source ELT
Auto Loader: Cloud storage ingest
Kafka / Kinesis: Real-time stream
dbt Core / Cloud: SQL transformation
AWS Glue: Managed Spark ETL

📊 BI & Analytics

Power BI: DirectQuery / Spark
Tableau: Spark SQL connector
Looker: JDBC / LookML
Sigma Computing: Spreadsheet BI
DataBricks SQL: Native dashboards
Hex: Notebook analytics

🛠️ Orchestration, ML & Observability

Apache Airflow: Orchestration
Prefect / Dagster: Dataflow orchestration
MLflow: ML lifecycle
Delta Live Tables: Declarative pipelines
Alation / Collibra: Data catalogue
Microsoft Purview: Governance

Ready to Build, Migrate, or Optimise Your DataBricks Lakehouse?

Whether you are designing a DataBricks workspace architecture from scratch, migrating from Hadoop, Snowflake, Redshift, Synapse, or BigQuery, implementing Delta Lake and Delta Live Tables for a modern pipeline layer, setting up Delta Sharing for partners or the Marketplace, developing ML models with MLflow and Mosaic AI, implementing Unity Catalog governance with dynamic data masking, or running a FinOps audit to control DBU costs — our certified DataBricks team will respond within 24 hours with an honest assessment and a practical path forward.

Start Your DataBricks Engagement Add Power BI Analytics on DataBricks

CLIENT TESTIMONIALS

What Our DataBricks Clients Say

We had been on an on-premise Hadoop cluster for 7 years. The operational overhead had become unsustainable — our small platform team was spending 60% of their time on cluster maintenance, patching, and hardware failures rather than building data products. Analysts waited 2–3 days for data engineering tickets to extract data from Hive into Excel because they could not query across databases themselves. SourceMash’s DataBricks migration took 22 weeks: they migrated 800+ Hive tables to Delta Lake via Spark, rebuilt 45 Pig and Hive ETL jobs as Delta Live Tables pipelines, and implemented Unity Catalog with row filters so that each business unit sees only their authorised data. The ETL time reduction is 72% — jobs that took 6 hours on Hadoop finish in 90 minutes on DataBricks. But the bigger transformation is self-service: analysts now query Gold-layer Delta tables directly from Power BI and DataBricks SQL without submitting tickets. Unity Catalog’s dynamic data masking on all PII columns gives us the governance posture our auditors require. And the total cost including DataBricks DBUs and cloud storage is 28% below what we were spending on Hadoop hardware maintenance and data centre costs alone.

Sanjay Kumar

Head of Data Engineering, Meridian Bank

We operate 340 retail stores across four formats with three different ERP systems and two different POS systems — which meant our data landscape was five separate databases that nobody could query across simultaneously without a manual extract-and-join in Excel. SourceMash built a DataBricks Lakehouse that consolidated all five sources via Fivetran into Bronze Delta tables, applied standard transformations using Delta Live Tables (consistent product hierarchies, consistent customer identifiers, consistent date definitions across all five source systems), and produced a single Gold-layer analytics warehouse that the whole organisation queries from the same schema. The inventory forecasting model they built using DataBricks AutoML improved our stock availability by 19 percentage points on promoted lines — we were consistently running out of promotional stock before because our forecast was based on a subset of sales data, not the full cross-format picture. The Delta Sharing implementation for our top 10 suppliers took 2 days per supplier compared to the 6-week data extract and SFTP setup process we had been running for the previous generation of supplier data sharing.

Priya Rao

Chief Data Officer, IndiaRetail Group

We were spending ₹1.55 crore per year on DataBricks DBUs and did not have a clear picture of where the cost was going. Our engineering team had grown the platform organically over 3 years and nobody had audited the cluster configuration or job efficiency in that time. SourceMash’s FinOps audit took 3 weeks. The findings: 6 of our 8 all-purpose clusters had auto-termination disabled or set to 60 minutes — fixing this alone was ₹18 lakh of annual savings. Our three most expensive Spark jobs were scanning full tables on our 600GB event Delta table because the filter column was not in the Z-ORDER or partition key — adding Z-ORDER on the event_date column and enabling Liquid Clustering made the same jobs 8–25x faster and reduced their DBU consumption by 88%. Our SQL warehouses were running 24/7 on Classic mode for BI tools that were only used during business hours — switching to Serverless with auto-scaling reduced warehouse DBUs by 62%. Total annual saving from the FinOps programme: ₹52 lakh — 34% of our total DBU spend. The programme paid for itself in less than 8 weeks.

Vikram Khanna

VP Engineering, DataStack SaaS

Frequently Asked Questions