DataBricks is the unified data, analytics, and AI platform built on the Lakehouse architecture — the paradigm that combines the low-cost, open-format storage of data lakes with the transactional reliability, performance, and governance of data warehouses. By storing all data in open Delta Lake format on cloud object storage (S3, ADLS, GCS) and processing it through Apache Spark, SQL, and AI-optimised compute, DataBricks eliminates the architectural split between data lakes and data warehouses that has forced organisations to maintain duplicate data copies, inconsistent governance, and separate engineering teams for ETL and BI workloads. DataBricks' Unity Catalog provides unified governance across all data and AI assets, Delta Live Tables brings declarative pipeline engineering to ETL, and the platform's AI runtime (powered by Mosaic AI) enables organisations to train, tune, and deploy LLMs and ML models on the same data infrastructure their analysts query. SourceMash delivers DataBricks engagements covering workspace architecture, cloud migration, Delta Lake modelling, Delta Live Tables pipeline engineering, Delta Sharing, MLflow ML operations, Unity Catalog governance, and FinOps DBU optimisation for enterprise clients.
DataBricks solves the fundamental architectural split that has plagued enterprise data platforms for a decade: the separation between data lakes (cheap, open-format storage for raw data, ML, and streaming) and data warehouses (reliable, high-performance SQL analytics with governance). Maintaining both means duplicate storage costs, inconsistent data quality, separate governance models, and engineering teams that cannot collaborate because they work on different copies of the same data. The Lakehouse architecture stores all data in Delta Lake — an open, transactional storage layer built on Parquet files with ACID guarantees, time travel, and schema enforcement — sitting on standard cloud object storage. DataBricks SQL warehouses provide high-performance BI querying on the same Delta tables that DataBricks ML runtime uses for feature engineering and model training, and that Delta Live Tables pipelines write to in streaming and batch ETL jobs.
SourceMash's DataBricks practice covers the full platform: workspace and Unity Catalog architecture, migration from Snowflake, Redshift, Synapse, BigQuery, Hadoop, and on-premise data warehouses, Delta Lake data modelling and medallion architecture, Delta Live Tables pipeline engineering, Delta Sharing and the DataBricks Marketplace, MLflow ML operations and Mosaic AI for generative AI, Unity Catalog governance with dynamic data masking and lineage, and the FinOps programme that aligns DBU consumption to business value.
DataBricks workspace architecture — the decisions that determine how data is organised in Unity Catalog, how compute is configured for each workload type, how costs are tracked and controlled at the workspace and account level, and how security is enforced across cloud accounts — is the foundation that determines whether the Lakehouse delivers unified analytics and AI or becomes a collection of disconnected notebooks with ungoverned data copies. DataBricks' flexibility in workspace design, metastore topology, and cloud networking requires deliberate architecture: a single workspace with default catalog permissions and no cluster policies produces the same governance gaps and budget surprises that the Lakehouse architecture is designed to eliminate.
The foundational architecture decisions — single workspace vs. multi-workspace topology (separate workspaces for production, development, and sandbox experimentation), Unity Catalog metastore design with catalog and schema hierarchy reflecting data domains and access boundaries, cluster and SQL warehouse configuration for each workload type (all-purpose clusters for development, jobs compute for ETL, SQL warehouses for BI, ML compute for training), and cluster policies that enforce auto-termination, instance types, and cost tags — must be made before any production workload lands in the platform.
Compute configuration for each workload class — the independently-scalable clusters that are the primary lever for both performance and cost management in DataBricks. All-purpose clusters: interactive development and ad-hoc analysis (auto-scaling enabled, auto-termination after 15–30 minutes of inactivity, spot instance mixing for cost reduction, single-user or shared access modes depending on Unity Catalog isolation requirements). SQL Warehouses: serverless or classic endpoints for BI tools (Power BI, Tableau, Looker) with Photon engine enabled for vectorized query execution, predictive IO for cloud storage caching, and scaling settings optimised for concurrency (max 10–20 queries per cluster with auto-scaling). Jobs compute: cost-optimised clusters for Delta Live Tables and scheduled workflows (automatic termination on job completion, spot instances, smallest viable instance type for the workload, job-level DBU pricing that is significantly cheaper than all-purpose compute).
Unity Catalog hierarchy design that reflects both the data's logical organisation and the governance model. Metastore: the top-level container that manages all data assets across workspaces in a region (one metastore per region, shared across production and development workspaces with access control providing isolation). Catalogs: the logical boundary for data domains (RAW for landed source data, TRANSFORMED for cleansed and enriched data, ANALYTICS for dimensional models and aggregated tables, ML_FEATURES for feature store tables, SANDBOX for experimental data). Schemas: subject-area or source-system boundaries within each catalog (RAW.SALESFORCE, RAW.SAP, ANALYTICS.CUSTOMER_360). Volumes: non-tabular data storage (raw files, images, model artefacts) governed by the same Unity Catalog permissions as tables. Three-layer namespace (catalog.schema.table) that replaces the legacy Hive metastore two-layer namespace and enables cross-catalog data sharing without data movement.
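One way to keep a hierarchy like this reviewable is to generate its DDL from a small layout definition. The sketch below is illustrative only: the layout dictionary, the function names, and the `analysts` group are assumptions, and the generated statements would still need to be run by a principal allowed to create catalogs.

```python
# Hypothetical layout: catalog -> schemas, mirroring the examples in the text.
LAYOUT = {
    "raw": ["salesforce", "sap"],
    "analytics": ["customer_360"],
}


def hierarchy_ddl(layout):
    """Return CREATE CATALOG / CREATE SCHEMA statements for a catalog layout."""
    statements = []
    for catalog, schemas in layout.items():
        statements.append(f"CREATE CATALOG IF NOT EXISTS {catalog}")
        for schema in schemas:
            statements.append(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
    return statements


def read_only_grants(catalog, principal):
    """Grants that let a group browse and query every table in one catalog.

    Privileges granted on a catalog are inherited by its schemas and tables.
    """
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON CATALOG {catalog} TO `{principal}`",
        f"GRANT SELECT ON CATALOG {catalog} TO `{principal}`",
    ]


if __name__ == "__main__":
    for statement in hierarchy_ddl(LAYOUT) + read_only_grants("analytics", "analysts"):
        print(statement + ";")
```

Keeping the layout in version control means a new data domain is added through a pull request rather than an ad-hoc notebook cell.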
Cluster policy design — the DataBricks-native mechanism for preventing runaway cost before it appears on the monthly bill. Cluster policies enforce: auto-termination timeout (critical for all-purpose clusters where developers leave notebooks running overnight), instance type restrictions (preventing provisioning of expensive GPU or high-memory instances for standard ETL workloads), maximum node count limits (preventing auto-scaling beyond a defined boundary), and predefined cost tags that allocate every DBU to a cost centre. Workspace-level budget alerts: configuring billing alerts at the account level when DBU consumption exceeds defined thresholds. Instance pool pre-warming: keeping a pool of idle instances ready for fast cluster startup without paying DBUs for the idle time (only cloud VM costs, not DataBricks DBUs) — ideal for development environments where fast startup is more important than minimising every rupee.
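The controls listed above map onto a cluster policy definition, which is a JSON document of attribute rules. The sketch below builds one in Python as a rough illustration. The instance types are AWS examples chosen for the sketch, and the function name, limits, and `cost_center` tag key are assumptions rather than recommended values.

```python
import json


def all_purpose_policy(cost_center, max_workers=8, max_idle_minutes=30):
    """Build a policy definition for interactive (all-purpose) clusters."""
    return {
        # Force auto-termination and cap how long a cluster may sit idle.
        "autotermination_minutes": {
            "type": "range",
            "minValue": 10,
            "maxValue": max_idle_minutes,
            "defaultValue": max_idle_minutes,
        },
        # Only allow a short list of general-purpose instance types
        # (example AWS types; substitute the types approved for your cloud).
        "node_type_id": {
            "type": "allowlist",
            "values": ["m5.xlarge", "m5.2xlarge"],
            "defaultValue": "m5.xlarge",
        },
        # Bound auto-scaling so a cluster cannot grow past the agreed size.
        "autoscale.max_workers": {"type": "range", "maxValue": max_workers},
        # Stamp every cluster with a cost-centre tag the user cannot change.
        "custom_tags.cost_center": {"type": "fixed", "value": cost_center},
    }


if __name__ == "__main__":
    print(json.dumps(all_purpose_policy("analytics-dev"), indent=2))
```

The printed JSON is the kind of document that can be pasted into the policy editor or submitted through the cluster policies API, with one policy per workload class.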
DataBricks network security configuration: controlling how workspaces connect to cloud infrastructure and how clients connect to DataBricks. VPC / VNet injection: deploying the DataBricks compute plane (the clusters that process data) within the customer's own cloud network rather than a DataBricks-managed network, while the control plane remains in the DataBricks account; this is commonly required for financial services and healthcare compliance. AWS PrivateLink, Azure Private Link, and GCP Private Service Connect for private connectivity between customer networks and DataBricks workspaces without traversing the public internet. IP access lists: restricting workspace access to corporate VPN egress ranges and office IPs. Token and credential management: personal access tokens with defined lifetimes, service principals for CI/CD and automation, and single sign-on through the organisation's identity provider (for example Microsoft Entra ID, formerly Azure AD). Storage access: Unity Catalog storage credentials and external locations, which let clusters reach cloud storage under permissions granted to the individual user or group rather than through a shared service account. Legacy credential passthrough served a similar purpose on workspaces that predate Unity Catalog, but it is not compatible with Unity Catalog governance.
Delta Lake Time Travel for data recovery and audit: one of the most valuable operational features, requiring no separate backup infrastructure and imposing modest storage overhead. Time Travel (configurable retention; by default the transaction log is kept for 30 days, but data files removed from the table are kept for only 7 days, so both settings must be raised to guarantee a 30-day window): querying any Delta table as it existed at any point within the retention window using VERSION AS OF or TIMESTAMP AS OF syntax, which enables recovery from accidental DELETE or MERGE operations, comparison of current data to historical snapshots, and reproducible ML training on the exact data state that produced a specific model version. CLONE: creating a deep or shallow clone of a Delta table at a specific version. A deep clone copies data files (useful for environment isolation); a shallow clone copies only the table metadata and references the source data files (a near-instant, zero-copy clone for testing or development that writes new files only for modified data). VACUUM: removing data files that are no longer referenced and are older than the retention period, which controls storage cost but also removes the ability to time travel to versions that depended on those files.
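As a rough illustration of a recovery runbook, the helper below assembles the SQL statements involved in inspecting, restoring, and cloning a table at a known-good version. The function name and the table names are hypothetical; the version number would normally come from the output of DESCRIBE HISTORY.

```python
def time_travel_statements(table, good_version, dev_table):
    """Return the SQL used to inspect, restore, and clone a table version."""
    return {
        # List versions, timestamps, and operations to find the last good one.
        "history": f"DESCRIBE HISTORY {table}",
        # Read the table exactly as it was at that version.
        "inspect": f"SELECT * FROM {table} VERSION AS OF {good_version}",
        # Roll the live table back after an accidental DELETE or MERGE.
        "restore": f"RESTORE TABLE {table} TO VERSION AS OF {good_version}",
        # Zero-copy snapshot of that version for testing or development.
        "clone": (
            f"CREATE OR REPLACE TABLE {dev_table} "
            f"SHALLOW CLONE {table} VERSION AS OF {good_version}"
        ),
    }


if __name__ == "__main__":
    statements = time_travel_statements("analytics.sales.orders", 41, "sandbox.dev.orders")
    for name, sql in statements.items():
        print(f"-- {name}\n{sql};")
```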
DataBricks multi-cloud and multi-region deployment for organisations with data residency requirements, cloud provider diversification strategies, or the need to bring analytics and AI compute close to data consumers in different geographies. Data replication via Delta Sharing: sharing live Delta tables between DataBricks workspaces in different regions or clouds without copying data — the recipient queries against live shared data using their own compute. Cross-region metastore federation: Unity Catalog can manage data assets across regions while respecting data residency boundaries. Cloud-agnostic architecture: because DataBricks runs on all three major clouds with an identical API and SQL interface, organisations can maintain a single platform skill set while deploying workloads to the cloud provider most appropriate for each geography or data source. Disaster recovery: cloning critical Delta tables to a secondary region and maintaining a warm standby workspace for business continuity.
Migration to DataBricks from a legacy data platform — Snowflake, Amazon Redshift, Azure Synapse Analytics, Google BigQuery, on-premise Hadoop (HDFS, Hive, Impala), Teradata, or SQL Server data warehouses — is a data platform transformation project, not a lift-and-shift migration. Each source platform has its own SQL dialect, its own storage format, its own approach to transactions and concurrency, and its own governance model — none of which translate directly to DataBricks' Lakehouse architecture. A Snowflake table with micro-partitioning has no direct equivalent in Delta Lake (where Z-ORDER and liquid clustering handle data organisation); a Hadoop Parquet file without transaction support must be converted to Delta Lake format to gain ACID guarantees; and a Teradata stored procedure with procedural logic must be evaluated for whether it becomes a Delta Live Tables pipeline, a Python wheel on DataBricks, or a SQL UDF in Unity Catalog.
Migration assessment covering the full inventory of objects in the source platform: tables (count, size, partitioning strategy, compression, column data types, primary keys), views (complexity classification — simple pass-through vs. complex multi-join with window functions), stored procedures and UDFs (procedural complexity, Spark SQL compatibility, Python migration candidates), ETL pipelines and data loading procedures (current orchestration tool and DataBricks equivalent approach), and the BI reports that must be validated against the migrated data. Automated compatibility analysis using DataBricks' SQL translation capabilities and Spark SQL syntax comparison to identify objects that translate automatically vs. objects requiring manual rewrite — producing the remediation effort estimate that drives the migration project timeline and cost. Data format assessment: evaluating whether source data is in formats natively readable by DataBricks (Parquet, Delta, CSV, JSON) or requires conversion (proprietary columnar formats, mainframe extracts, XML).
Snowflake to DataBricks migration, driven by organisations seeking to unify their data warehouse and data lake on a single platform, reduce proprietary storage costs, or leverage DataBricks' AI and ML capabilities on the same data their analysts query. Schema translation: Snowflake's micro-partitioning and clustering keys have no direct equivalent in Delta Lake; they are replaced by Z-ORDER optimisation (co-locating column values in the same set of files to improve query pruning) or liquid clustering (maintaining data layout without manual Z-ORDER commands). SQL dialect differences: Snowflake's QUALIFY clause is supported natively in recent DataBricks SQL versions and can otherwise be rewritten as a subquery that filters on ROW_NUMBER() OVER (...); Snowflake's VARIANT maps to STRUCT or MAP types, to JSON strings parsed with from_json(), or to the VARIANT type on runtimes that support it; and Snowflake's FLATTEN maps to Spark SQL's explode(). Data transfer via Snowflake COPY INTO <location>, which unloads tables to cloud storage in Parquet format, followed by a one-time Delta Lake conversion (CONVERT TO DELTA on the Parquet directory, or CREATE TABLE ... USING DELTA AS SELECT) with OPTIMIZE and Z-ORDER, or via a continuous sync using Fivetran, Airbyte, or Spark JDBC until cutover.
Hadoop (HDFS, Hive, Impala, Spark on-premise) to DataBricks migration — driven by organisations retiring on-premise Hadoop clusters due to unsupported operational complexity, hardware refresh costs, or the need to unify batch and streaming on a modern platform. HDFS to cloud storage: migrating petabyte-scale data from on-premise HDFS to S3, ADLS, or GCS using DistCp, WANdisco, or cloud-native transfer appliances — followed by format conversion from Hive-managed tables (often stored as text, SequenceFile, or ORC) to Delta Lake Parquet for ACID reliability and performance. Hive metastore translation: converting Hive database and table definitions to Unity Catalog catalogs and schemas, mapping Hive partition columns to Delta Lake partitioning, and translating Hive SerDe configurations to Spark read options. Spark job migration: on-premise Spark jobs (PySpark, Scala) typically require minimal code changes to run on DataBricks — primarily configuration changes (replacing HDFS paths with cloud storage URIs, updating metastore connections to Unity Catalog) and dependency management (migrating from cluster-installed libraries to DataBricks cluster-scoped libraries or Unity Catalog volumes).
Amazon Redshift and Azure Synapse to DataBricks migration — driven by the desire to move from proprietary cloud data warehouses to an open Lakehouse architecture where data is stored in open Delta Lake format rather than a vendor-locked columnar format. Redshift schema translation: Redshift DISTKEY, SORTKEY, and INTERLEAVED SORTKEY table attributes have no equivalent in Delta Lake — replaced by Z-ORDER on frequently filtered columns and liquid clustering for automatic optimisation. Synapse HASH DISTRIBUTION and REPLICATED table strategies are irrelevant in Delta Lake's cloud storage model; the performance equivalent is file layout optimisation through partitioning and Z-ORDER. T-SQL to Spark SQL translation: proprietary T-SQL functions (STRING_SPLIT, OPENJSON, FOR JSON, HASHBYTES) require rewriting in Spark SQL or Python UDFs. Data export via Redshift UNLOAD or Synapse CETAS to cloud storage Parquet, then Delta conversion and OPTIMIZE.
Post-migration data validation — the most important phase of any platform migration and the one most often compressed when project timelines are under pressure. Automated row count reconciliation across every migrated table (source count = target count for every date partition and dimension key value), aggregate validation (source SUM, MIN, MAX, COUNT DISTINCT for key business metrics compared to DataBricks values within a defined tolerance), and sample-level row-by-row comparison for a representative subset of each table (verifying data type casting, decimal precision, date/time handling, and NULL semantics between the source and DataBricks). Business metric reconciliation: running the equivalent of the business's critical financial or operational reports against both the source and DataBricks and comparing output — because raw table data can match while aggregated business metrics diverge due to join behaviour differences or filter predicate translation errors. Data quality profiling: using DataBricks' built-in data profiling and Great Expectations to validate that migrated data meets the same quality standards as the source.
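The reconciliation step can be reduced to comparing two dictionaries of metrics per table, one computed on the source and one on DataBricks. The sketch below assumes those metrics have already been collected; the function name, metric names, and tolerance are illustrative choices, not part of any DataBricks API.

```python
import math


def reconcile(source, target, rel_tol=1e-9):
    """Compare {metric_name: value} dicts and return the mismatches.

    Floats are compared within a relative tolerance; everything else must be
    exactly equal. A metric missing (or None) on either side is a mismatch.
    """
    mismatches = []
    for name in sorted(set(source) | set(target)):
        s, t = source.get(name), target.get(name)
        if s is None or t is None:
            mismatches.append((name, s, t))
        elif isinstance(s, float) or isinstance(t, float):
            if not math.isclose(s, t, rel_tol=rel_tol):
                mismatches.append((name, s, t))
        elif s != t:
            mismatches.append((name, s, t))
    return mismatches


if __name__ == "__main__":
    source = {"row_count": 1000, "sum_amount": 52340.10, "max_order_date": "2024-03-31"}
    target = {"row_count": 999, "sum_amount": 52340.1000001, "max_order_date": "2024-03-31"}
    for name, s, t in reconcile(source, target, rel_tol=1e-6):
        print(f"MISMATCH {name}: source={s} target={t}")
```

Running the same comparison per date partition, rather than once per table, narrows a mismatch down to the load that introduced it.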
Migration cutover strategy: the plan for transitioning the production analytics workload from the source platform to DataBricks with minimum disruption. Parallel run: running both the source and DataBricks environments simultaneously for a validation period (typically 2–4 weeks), with the source as the authoritative system but DataBricks producing the same reports for comparison. This lets business stakeholders compare DataBricks report outputs to the existing reports they trust, building confidence before the cutover. Big-bang cutover: on the agreed date, the source stops serving production workloads, all BI tools are reconnected to DataBricks SQL warehouses, and ETL pipelines are switched to write to Delta Lake, with a defined rollback path (reconnect BI tools to source) if a critical issue is found in the first 24 hours. Phased migration: moving workloads sequentially (analytics first, operational reporting second, ML and AI last) to reduce the risk of any single cutover event.
Delta Lake is the open-source storage layer that brings ACID transactions, schema enforcement, time travel, and scalable metadata handling to data lakes built on cloud object storage. Before Delta Lake, data lakes were fundamentally unreliable for analytics: a Spark job writing Parquet files could fail halfway through leaving partial, unreadable data; concurrent reads and writes produced inconsistent results; and there was no mechanism to enforce schema or prevent data type corruption. Delta Lake solves these problems with a transaction log (the _delta_log) that records every operation atomically, optimistic concurrency control that allows multiple writers to append to the same table simultaneously, and schema enforcement that rejects writes violating the defined schema (with schema evolution available when deliberate changes are needed). DataBricks' implementation of Delta Lake adds proprietary optimisations — Predictive IO, Liquid Clustering, and Predictive Optimisation — that automate the maintenance operations (OPTIMIZE, VACUUM, ANALYZE) that keep Delta tables performing at warehouse-grade speed on lake-scale data volumes.
Medallion architecture implementation in DataBricks: Bronze layer (raw data landed from source systems without transformation, ingested via Auto Loader, Kafka, or JDBC, stored as Delta tables with schema inference or enforcement, and partitioned by ingestion date for efficient time-based queries), Silver layer (cleaned, normalised, and enriched data, deduplicated via MERGE INTO with match conditions, merged into slowly changing dimension tables, joined across source systems to create unified entity views, and quality-tested with Great Expectations or Delta Live Tables expectations), and Gold layer (business-aggregated, dimensionally modelled data ready for BI and ML consumption, including star schema fact and dimension tables, summary aggregates, feature engineering outputs, and the curated datasets that power executive dashboards). Unity Catalog manages access control at each layer: analysts have SELECT on Gold only, data engineers have MODIFY on Bronze and Silver, and ML engineers have SELECT on Silver and Gold plus MODIFY on ML_FEATURES.
Incremental data processing for large tables where full table recreation on every pipeline run is prohibitively slow or expensive — Delta Lake's MERGE INTO operation enables upserts that insert new rows and update changed rows in a single atomic transaction. Incremental patterns in DataBricks: append-only (new rows only, no updates — simplest, fastest, uses INSERT INTO for append-only sources like clickstream), MERGE INTO upsert (match on business key, update changed attributes, insert new rows — correct for dimension tables and fact tables where late-arriving updates occur), and SCD Type 2 (slowly changing dimensions tracking historical values by adding effective_date and is_current columns, implemented via MERGE INTO that expires the old row and inserts the new row). Change Data Capture (CDC) from source databases: reading Debezium or native CDC feeds into Delta Lake via Spark Structured Streaming and applying changes incrementally with MERGE INTO rather than full table reloads.
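To make the SCD Type 2 rule concrete, the sketch below applies it to plain Python rows rather than a Delta table. It is a simulation of the semantics only, not the MERGE INTO statement itself; the function name and the `end_date` column are assumptions added for the illustration, alongside the `effective_date` and `is_current` columns described above.

```python
from datetime import date


def apply_scd2(dimension, updates, key, tracked, effective):
    """Return a new dimension after applying updates with SCD Type 2 rules.

    A changed business key expires its current row and gains a new current
    row; an unseen key is inserted; an unchanged key is left alone.
    """
    result = [dict(row) for row in dimension]  # copy so the input is untouched
    current = {row[key]: row for row in result if row["is_current"]}
    for update in updates:
        existing = current.get(update[key])
        if existing and all(existing[col] == update[col] for col in tracked):
            continue  # no tracked attribute changed
        if existing:
            existing["is_current"] = False
            existing["end_date"] = effective
        new_row = {
            key: update[key],
            **{col: update[col] for col in tracked},
            "effective_date": effective,
            "end_date": None,
            "is_current": True,
        }
        result.append(new_row)
        current[update[key]] = new_row
    return result


if __name__ == "__main__":
    dim = [{"customer_id": 1, "city": "Pune", "effective_date": date(2023, 1, 1),
            "end_date": None, "is_current": True}]
    changes = [{"customer_id": 1, "city": "Delhi"}, {"customer_id": 2, "city": "Goa"}]
    for row in apply_scd2(dim, changes, "customer_id", ["city"], date(2024, 1, 1)):
        print(row)
```

In a MERGE INTO implementation the same two actions (expire, then insert) are usually achieved by staging each changed key twice, once to match and close the old row and once to insert the new one.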
Data quality framework for Delta Lake tables: Delta Expectations (built into Delta Live Tables) for declarative data quality rules defined as Python decorators on pipeline nodes (@dlt.expect('valid_order_amount', 'order_amount > 0')), Great Expectations integration for comprehensive data validation suites, and Unity Catalog data quality monitors that track freshness, volume, and schema drift over time. Quality rule types: completeness (not_null constraints on critical columns), validity (range checks, regex patterns for email/phone formats, referential integrity against dimension tables), uniqueness (primary key uniqueness verified before MERGE INTO operations), and timeliness (lag detection ensuring source data arrival within SLA). Quarantine pattern: routing rows failing quality checks to a separate Delta table (Bronze_Quarantine) for manual review rather than failing the entire pipeline — enabling incremental pipeline progress while isolating bad data for remediation.
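The quarantine pattern amounts to evaluating every rule against every row and splitting the batch in two. The sketch below shows that logic on plain Python dictionaries; in a pipeline the same split would be expressed as two filtered tables. The rule names, the `failed_rules` column, and the function name are illustrative assumptions.

```python
# Each rule maps a name to a predicate that returns True for a valid row.
RULES = {
    "valid_order_amount": lambda row: row.get("order_amount") is not None
    and row["order_amount"] > 0,
    "order_id_not_null": lambda row: row.get("order_id") is not None,
}


def split_rows(rows, rules):
    """Return (passed, quarantined); quarantined rows list the rules they failed."""
    passed, quarantined = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            quarantined.append({**row, "failed_rules": failed})
        else:
            passed.append(row)
    return passed, quarantined


if __name__ == "__main__":
    batch = [
        {"order_id": 1, "order_amount": 50.0},
        {"order_id": 2, "order_amount": -5.0},
        {"order_id": None, "order_amount": 10.0},
    ]
    good, bad = split_rows(batch, RULES)
    print(len(good), "passed;", len(bad), "quarantined")
```

Recording which rules failed alongside each quarantined row makes the manual review step much faster than re-deriving the reason later.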
Delta Lake table optimisation — the maintenance operations that keep query performance consistent as tables grow to terabyte and petabyte scale. OPTIMIZE: coalescing small files (produced by frequent streaming micro-batches or small JDBC extracts) into larger files (target 128MB-1GB per file) that match the read throughput of cloud object storage and Spark's parallel read model. Without OPTIMIZE, tables accumulate millions of tiny files that cause query planning to dominate execution time. Z-ORDER: multi-dimensional clustering that co-locates related column values in the same files — enabling data skipping (reading only files whose min/max statistics overlap the query filter) rather than full table scans. Liquid Clustering (DataBricks proprietary, 2024): automatic maintenance of clustering without manual Z-ORDER commands — the system continuously reorganises data as it is written. VACUUM: removing old Parquet files that are no longer referenced by the current Delta table state (after Time Travel retention expires) to control storage cost.
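The benefit of clustering is easiest to see by simulating data skipping on per-file min/max statistics. The sketch below is a simplified model of the idea, not Delta Lake's implementation; the statistics layout and the function name are assumptions made for the illustration.

```python
def files_to_read(file_stats, column, low, high):
    """Keep only files whose [min, max] range for `column` overlaps [low, high]."""
    selected = []
    for stats in file_stats:
        col_min, col_max = stats["min"][column], stats["max"][column]
        if col_max >= low and col_min <= high:
            selected.append(stats["path"])
    return selected


# After clustering on customer_id, each file covers a narrow, disjoint range.
CLUSTERED = [
    {"path": "part-0", "min": {"customer_id": 1}, "max": {"customer_id": 100}},
    {"path": "part-1", "min": {"customer_id": 101}, "max": {"customer_id": 200}},
    {"path": "part-2", "min": {"customer_id": 201}, "max": {"customer_id": 300}},
]

# Without clustering, every file spans almost the whole range of values.
UNCLUSTERED = [
    {"path": "part-0", "min": {"customer_id": 3}, "max": {"customer_id": 298}},
    {"path": "part-1", "min": {"customer_id": 1}, "max": {"customer_id": 300}},
    {"path": "part-2", "min": {"customer_id": 2}, "max": {"customer_id": 295}},
]

if __name__ == "__main__":
    print(files_to_read(CLUSTERED, "customer_id", 150, 150))
    print(files_to_read(UNCLUSTERED, "customer_id", 150, 150))
```

The same filter touches one file in the clustered layout and all three in the unclustered one, which is the difference Z-ORDER and liquid clustering aim to produce at scale.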
CI/CD pipeline for Delta Lake schema and pipeline deployment — applying software development release practices to data platform changes. Development workflow: DataBricks Repos (Git integration within the DataBricks workspace) or external IDEs (VS Code with DataBricks extension) for notebook and wheel development, Git-based branching (feature branch per change, pull request for peer review), and DataBricks Asset Bundles (YAML-defined resource configurations for jobs, pipelines, and clusters that are deployed via CI). DataBricks CLI and API integration: linting SQL with SQLFluff, running unit tests on Python UDFs and transformation logic, and deploying to production via CI/CD pipelines on merge to main. Blue-green deployment for large schema changes: creating a parallel Gold schema (Gold_v2), backfilling with historical data, validating BI tool connectivity, and swapping the schema reference in Unity Catalog rather than running a potentially disruptive in-place migration.
DataBricks Streaming Tables and Materialized Views (powered by Delta Live Tables) for declarative data pipeline definition — the shift from imperative Spark code ("read this, filter that, join these, write there") to declarative SQL or Python statements that define what the output should look like, with DataBricks handling the execution, refresh, and optimisation automatically. Streaming Table: a Delta table that is continuously updated from a streaming source (Kafka, Kinesis, Event Hubs, Auto Loader on cloud storage) — defined with CREATE STREAMING TABLE and refreshed automatically as new data arrives. Materialized View: a pre-computed aggregation or join result that refreshes incrementally when source tables change — defined with CREATE MATERIALIZED VIEW and automatically kept in sync by DataBricks without manual pipeline scheduling. Both integrate with Unity Catalog for governance and can be queried identically to standard tables by BI tools and analysts.
Delta Live Tables (DLT) is DataBricks' declarative pipeline framework that brings software engineering best practices — unit testing, data quality expectations, automatic error handling, and pipeline observability — to data pipeline development on DataBricks. Before DLT, data pipelines on DataBricks were typically built as imperative Spark notebooks: cells of PySpark or Scala code that read sources, applied transformations, and wrote to Delta tables, scheduled via DataBricks Jobs. These notebooks were difficult to test, prone to failure on data quality issues, and required manual maintenance of dependencies between pipeline steps. DLT replaces imperative notebook pipelines with declarative SQL or Python definitions: the developer defines the target table and the transformation logic, and DLT handles the execution graph construction, incremental processing, data quality enforcement, and failure recovery automatically. DLT pipelines integrate with Unity Catalog for lineage tracking, enforce data quality expectations that quarantine bad records without failing the pipeline, and provide automatic scaling and optimised execution through the DLT execution engine.
DLT deployment and pipeline configuration for the declarative ETL approach, where the developer defines target tables and transformations, and DLT manages the execution graph, incremental processing, and data quality automatically. DLT pipeline modes: triggered (batch execution on a schedule, appropriate for traditional ETL workloads that run once per hour or once per day) and continuous (streaming execution that processes new data as it arrives, appropriate for near-real-time use cases with latency requirements of minutes). DLT SQL and Python APIs: @dlt.table decorators in Python or CREATE OR REFRESH statements in SQL (CREATE LIVE TABLE in older syntax) that define each pipeline node, with automatic dependency resolution from the graph of references between tables. Data quality expectations are evaluated on every pipeline run, and each decorator takes a different action on violation: @dlt.expect() keeps the row and records the violation in pipeline metrics, @dlt.expect_or_drop() drops the row before it reaches the target table, and @dlt.expect_or_fail() stops the update. Routing failed rows to a quarantine table is a pattern built on top of these, typically a second table defined with the inverse of the rules. Unity Catalog integration: when a pipeline targets Unity Catalog, its tables are registered there with lineage tracking from source to target.
Auto Loader for incremental ingestion of files from cloud storage (S3, ADLS, GCS) into Delta Lake: the DataBricks-native mechanism for efficiently loading new files as they arrive without repeatedly listing the entire bucket. Auto Loader detects new files in one of two modes: directory listing (the default, which needs no extra cloud permissions) or file notification, which subscribes to cloud storage events (SNS and SQS on AWS, Event Grid and Queue Storage on Azure, Pub/Sub on GCP) and scales better for very large directories. In both modes, processed files are tracked in the stream checkpoint so that each file is ingested exactly once. Schema inference and evolution: Auto Loader can infer the schema from a sample of the initial files and record it at the schema location; in the default evolution mode the stream stops when new columns appear and picks up the widened schema when it restarts, so jobs should be configured to retry automatically. Supported formats include JSON, CSV, Parquet, Avro, ORC, XML, and text files. Rescued data: values that do not conform to the expected schema are captured in a _rescued_data column rather than failing the ingestion, which keeps the pipeline running while preserving non-conforming data for investigation. CloudFiles source: the Spark readStream source that powers Auto Loader, configured with .format("cloudFiles") and options for source format, schema location, and file notification mode, with the checkpoint location set on the write side of the stream.
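A minimal sketch of an Auto Loader ingestion, assuming a DataBricks runtime where the `cloudFiles` source is available. The option keys are Auto Loader options; the function names, paths, and table name are hypothetical. The options helper runs anywhere, while `start_bronze_stream` only works when passed an active SparkSession on DataBricks.

```python
def auto_loader_options(file_format, schema_location, use_notifications=False):
    """Build the reader options for a cloudFiles stream."""
    options = {
        "cloudFiles.format": file_format,
        # Where the inferred schema and its later versions are stored.
        "cloudFiles.schemaLocation": schema_location,
        # Add newly seen columns to the schema; the stream stops when this
        # happens and uses the widened schema on restart.
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }
    if use_notifications:
        # Switch from directory listing to file notification mode.
        options["cloudFiles.useNotifications"] = "true"
    return options


def start_bronze_stream(spark, source_path, target_table, checkpoint_path, options):
    """Start an incremental load; requires a SparkSession on DataBricks."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process all pending files, then stop
        .toTable(target_table)
    )


if __name__ == "__main__":
    print(auto_loader_options("json", "/Volumes/raw/landing/_schemas/orders"))
```

Using the `availableNow` trigger turns the stream into an incremental batch job, which suits scheduled workflows; omitting it leaves the stream running continuously.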
Spark Structured Streaming for real-time and micro-batch data processing on DataBricks — the continuous query engine that processes unbounded data streams with the same DataFrame API used for batch processing. Streaming sources: Kafka (readStream.format("kafka")), Kinesis (readStream.format("kinesis")), Event Hubs, cloud storage via Auto Loader, and Delta Lake (readStream.format("delta") for reading change feeds from other Delta tables). Streaming sinks: Delta Lake (writeStream.format("delta") with mergeSchema option), Kafka, and foreachBatch for custom logic. Stateful operations: aggregations over time windows (tumbling, sliding, session windows), stream-stream joins (joining two streaming sources with watermarking to handle late data), and deduplication (dropDuplicates within a watermark boundary). Checkpointing: maintaining state in cloud storage to enable exactly-once processing and fault tolerance across cluster restarts. Trigger intervals: processingTime triggers for micro-batch execution (every 30 seconds, every 5 minutes) or continuous triggers for millisecond-latency processing.
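The interaction between tumbling windows and watermarks can be hard to picture, so the sketch below simulates it in plain Python. It is a simplified model: it advances the watermark after every event, whereas Structured Streaming advances it between micro-batches, and the function name and sample events are made up for the illustration.

```python
from collections import defaultdict


def tumbling_counts(events, window_seconds, allowed_lateness):
    """Count events per (window_start, key), dropping events that arrive too late.

    `events` is an iterable of (event_time_seconds, key) in arrival order.
    The watermark trails the largest event time seen by `allowed_lateness`;
    an event is dropped once its window has closed behind the watermark.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = None
    for event_time, key in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_lateness
        window_start = event_time - (event_time % window_seconds)
        window_end = window_start + window_seconds
        if window_end <= watermark:
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1
    return dict(counts), dropped


if __name__ == "__main__":
    arrivals = [(5, "a"), (12, "a"), (61, "a"), (8, "a"), (130, "a"), (20, "a")]
    totals, late = tumbling_counts(arrivals, window_seconds=60, allowed_lateness=30)
    print(totals)
    print(late)
```

In this run the event at time 8 is late but still counted, because its window has not yet fallen behind the watermark, while the event at time 20 arrives after the watermark has passed its window and is dropped.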
DataBricks Workflows for pipeline orchestration — the native job scheduler that chains notebooks, Delta Live Tables pipelines, Python wheels, SQL queries, and external tasks into multi-step workflows with dependency management. Workflow tasks: notebook tasks (running a specific notebook with parameters), DLT pipeline tasks (triggering a DLT pipeline and waiting for completion), Python wheel tasks (running a packaged Python application), SQL tasks (executing SQL statements against a SQL warehouse), and conditional tasks (IF/ELSE branching based on task output). Task dependencies: defining the DAG of task execution with upstream/downstream relationships, retry policies for transient failures, and timeout configurations. External orchestration integration: Apache Airflow (via the DataBricks Airflow provider), Azure Data Factory, AWS Step Functions, and Prefect for cross-system orchestration where DataBricks is one component in a wider pipeline. Workflow monitoring: the DataBricks Jobs UI provides execution history, run duration, output logs, and alerting via email or webhooks on failure.
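Task dependencies in a workflow form a directed acyclic graph, and the scheduler runs tasks in "waves" once their upstream tasks have finished. The sketch below reproduces that ordering with Python's standard-library `graphlib`; the task names are hypothetical and the code illustrates the scheduling idea rather than the DataBricks Jobs API.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
TASKS = {
    "ingest_bronze": set(),
    "build_silver": {"ingest_bronze"},
    "build_gold": {"build_silver"},
    "train_model": {"build_silver"},
    "refresh_dashboard": {"build_gold"},
}


def execution_waves(tasks):
    """Group tasks into waves; tasks within one wave can run in parallel."""
    sorter = TopologicalSorter(tasks)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(TASKS), start=1):
        print(number, wave)
```

Validating the dependency graph this way in CI catches accidental cycles before a job definition is deployed.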
Real-time data ingestion into Delta Lake for operational analytics use cases where data freshness of minutes rather than hours is required. Debezium CDC pipeline: Debezium captures row-level changes from the source database transaction log (PostgreSQL WAL, MySQL binlog, SQL Server CDC, Oracle LogMiner) and publishes them to Apache Kafka topics, from which Spark Structured Streaming reads and applies changes to Delta Lake via MERGE INTO. Change Data Feed (CDF): a Delta Lake feature that captures row-level changes (INSERT, UPDATE, DELETE) on a Delta table and exposes them as a queryable change stream — enabling downstream consumers to process only changed rows without requiring a CDC tool at the source database. Confluent Cloud + DataBricks Kafka Connector for managed Kafka ingestion. DataBricks Streaming Tables for continuous ingestion from Kafka or Auto Loader with automatic schema evolution and exactly-once guarantees.
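The effect of applying a CDC feed can be sketched without Spark by replaying Debezium-style change events against an in-memory table. The `op`, `before`, and `after` fields follow the Debezium event envelope; the function name and sample rows are assumptions, and a production pipeline would express the same logic as a MERGE INTO inside foreachBatch.

```python
def apply_changes(table, events, key="id"):
    """Replay change events onto a {primary_key: row} table and return the result.

    Debezium op codes: "c" create, "r" snapshot read, "u" update, "d" delete.
    Events must be applied in log order for the final state to be correct.
    """
    state = dict(table)  # shallow copy so the input mapping is not modified
    for event in events:
        op = event["op"]
        if op in ("c", "r", "u"):
            row = event["after"]
            state[row[key]] = row  # insert or overwrite by primary key
        elif op == "d":
            state.pop(event["before"][key], None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return state


if __name__ == "__main__":
    current = {1: {"id": 1, "status": "new"}}
    feed = [
        {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
        {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
        {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
    ]
    print(apply_changes(current, feed))
```

Because several events for the same key can arrive in one micro-batch, a real MERGE usually first reduces the batch to the latest event per key before applying it.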
Data pipeline observability — the monitoring and alerting that ensures pipeline failures are detected and resolved before they affect business reports and dashboards. DataBricks Job and Workflow monitoring: execution times, DBU consumption, error messages, and task success/failure status from the DataBricks Jobs UI and REST API. DataBricks SQL dashboards for pipeline monitoring: custom dashboards built in DataBricks SQL that query the DataBricks system tables (system.information_schema, system.billing, system.compute) to display pipeline health, data freshness, and cost trends. Data quality monitoring: Unity Catalog data quality monitors that track schema drift, volume anomalies, and freshness SLA breaches across Delta tables automatically. Integration with Monte Carlo, Acceldata, or open-source observability tools for full data observability across the Lakehouse. DataBricks Lakehouse Monitoring (2024): the native observability feature that automatically monitors data quality, model performance, and pipeline health across the Lakehouse without requiring external tools.
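To make the freshness and volume checks concrete, here is a minimal sketch of the two tests a monitor might run against a table. The z-score rule and its threshold are illustrative assumptions chosen for simplicity; they are not a description of how DataBricks' built-in monitors detect anomalies.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def freshness_breached(last_update, now, sla):
    """True when the table has not been updated within the agreed SLA."""
    return now - last_update > sla


def volume_anomaly(history, today, z_threshold=3.0):
    """True when today's row count is far from the recent daily average."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold


recent_row_counts = [1000, 1020, 980, 1010, 990]  # hypothetical daily loads
print(freshness_breached(datetime(2024, 1, 1, 0, 0),
                         datetime(2024, 1, 1, 3, 0),
                         timedelta(hours=2)))
print(volume_anomaly(recent_row_counts, 400))
```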
DataBricks' Mosaic AI (formerly DataBricks ML) is the integrated machine learning and generative AI platform that enables organisations to build, train, tune, and deploy ML models and LLMs on the same Lakehouse infrastructure that powers their analytics. Before Mosaic AI, machine learning on enterprise data required extracting data from the warehouse to a separate ML platform, training models in an isolated environment, and deploying them separately from the data pipeline — creating security boundaries, data drift, and operational complexity. Mosaic AI eliminates this separation: Data Scientists query Delta tables directly via Spark DataFrames or Pandas-on-Spark, engineer features using the same SQL transformations the BI team uses, register models in MLflow (the open-source ML lifecycle platform, hosted natively in DataBricks), and deploy models as real-time REST endpoints or batch inference pipelines within the same Unity Catalog governance boundary. For generative AI, Mosaic AI provides pre-trained LLMs, fine-tuning infrastructure, vector search, and model serving for RAG (Retrieval-Augmented Generation) applications — all within the DataBricks platform.
MLflow on DataBricks for the full machine learning lifecycle: experiment tracking, model versioning, and production deployment within the Lakehouse governance boundary. MLflow Tracking: logging parameters, metrics, artefacts (feature importance plots, confusion matrices, model files), and training data references for every experiment run, enabling reproducibility and comparison across hundreds of model iterations. MLflow Model Registry: promoting model versions through development, staging, and production with approval gates, version tagging, and automated transition webhooks that trigger CI/CD pipelines. Unity Catalog integration: MLflow models registered in Unity Catalog as first-class data assets with lineage tracking (which Delta tables and features were used to train each model version), access control (who can transition a model to production), and audit history. Model flavours: scikit-learn, XGBoost, LightGBM, TensorFlow, PyTorch, Spark MLlib, and LangChain, all supported natively with automatic environment packaging.
DataBricks Feature Store for managing, discovering, and serving ML features — the engineered columns that ML models use as inputs — as governed Unity Catalog assets that are shared between data engineering and data science teams. Offline Feature Store: features computed in batch via Spark SQL or Delta Live Tables and stored in Delta tables (feature tables) with point-in-time correctness for training (no feature leakage from the future). Online Feature Store: low-latency feature serving via DataBricks' online store (backed by DynamoDB, Cosmos DB, or Redis) for real-time model inference where features must be retrieved in milliseconds. Feature discovery: data scientists search Unity Catalog for features created by the data engineering team, viewing descriptions, statistics, and lineage rather than recreating features independently. Feature tables are versioned and time-travelled like any Delta table, enabling training on historical feature values and detection of feature drift over time. Feature monitoring: automatic tracking of feature distributions and drift alerts when production feature statistics deviate from training distributions.
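Point-in-time correctness is easiest to see with a small example. The sketch below assumes a feature's history is a list of `(timestamp, value)` pairs sorted by timestamp, and returns the value that was known at the moment a training label was observed. The data and function name are hypothetical; in DataBricks the equivalent lookup is handled by the feature store when training sets are built.

```python
from bisect import bisect_right


def point_in_time_value(feature_history, as_of):
    """Return the latest feature value with timestamp <= as_of, or None.

    feature_history must be sorted by timestamp; ISO date strings sort correctly.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)
    return feature_history[idx - 1][1] if idx else None


# Hypothetical feature: number of orders in the trailing 30 days.
orders_30d = [("2024-01-01", 3), ("2024-01-10", 5), ("2024-01-20", 9)]

# A label observed on 15 January must only see the value known on that date.
print(point_in_time_value(orders_30d, "2024-01-15"))
```

Using the value from 20 January for a label dated 15 January would leak information from the future into training, which is exactly what the point-in-time lookup prevents.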
Mosaic AI for generative AI and large language model operations on DataBricks — enabling organisations to build RAG applications, fine-tune open-source LLMs, and deploy AI applications without sending data to third-party APIs. Foundation Model APIs: calling pre-trained LLMs (Llama, Mistral, BGE embeddings) via DataBricks-hosted endpoints without managing infrastructure — billed per token with data remaining within the customer's cloud account. Fine-tuning: adapting open-source LLMs to proprietary data using DataBricks' LLM fine-tuning runtime (QLoRA, full fine-tuning) on GPU clusters, with training data sourced directly from Delta tables. Vector Search: creating vector indexes on Delta tables for semantic search and RAG retrieval, with automatic index maintenance as source data changes. Model Serving: deploying fine-tuned LLMs and embedding models as auto-scaling REST endpoints behind Unity Catalog governance. AI Playground: the interactive UI for testing prompts, comparing model outputs, and prototyping RAG applications before production deployment.
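The retrieval half of a RAG application comes down to ranking documents by similarity between embedding vectors. The toy sketch below uses hand-written three-dimensional vectors and exact cosine similarity; both are simplifying assumptions, since a production vector index stores model-generated embeddings with far more dimensions and typically uses approximate nearest-neighbour search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(index.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical document embeddings (real ones come from an embedding model).
index = {
    "returns_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.1],
    "store_hours": [0.0, 0.2, 0.9],
}
print(top_k([0.8, 0.3, 0.0], index))
```

The documents returned by the retrieval step are then placed into the LLM prompt as context, which is what grounds the generated answer in the organisation's own data.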
DataBricks model training infrastructure for classical machine learning and deep learning: AutoML for rapid baseline model generation (classification, regression, forecasting — DataBricks automatically tries multiple algorithms, hyperparameter combinations, and preprocessing pipelines, presenting the best models in a leaderboard with explainability), distributed training for large datasets (Spark MLlib for distributed ML on terabyte-scale data, Horovod for distributed deep learning across GPU clusters), and hyperparameter tuning (Hyperopt with distributed execution across cluster workers for efficient hyperparameter search). Training data access: direct query of Delta tables via Spark DataFrames, Pandas-on-Spark (distributed Pandas operations on Spark clusters), or Spark SQL — eliminating data extraction and ensuring training data is as fresh as the analytics data. Experiment management: all training runs logged to MLflow with automatic artefact capture, enabling reproducibility and model lineage tracking back to the exact data version used for training.
DataBricks Model Serving for deploying ML models and LLMs as production REST endpoints or batch inference pipelines within the Lakehouse security boundary. Real-time serving: deploying registered MLflow models as auto-scaling REST endpoints (CPU-based for classical ML, GPU-based for LLMs) with Unity Catalog authentication, request logging, and A/B testing support. Batch inference: applying trained models to large Delta tables via Spark UDFs or the DataBricks model scoring API for offline predictions that write results back to Delta Lake. Model serving endpoints scale from zero (no cost when not receiving requests) to thousands of requests per second with automatic load balancing. Endpoint monitoring: tracking request latency, throughput, error rates, and model prediction distributions to detect drift and performance degradation. Integration with DataBricks Apps and external applications: REST endpoints can be called from Power BI, custom web applications, or mobile apps via standard HTTP with token authentication.
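For a sense of what calling a serving endpoint over standard HTTP looks like, the sketch below builds (but does not send) a scoring request using only the Python standard library. The workspace URL, endpoint name, token, and feature names are placeholders, and the `dataframe_records` body is one of the input formats accepted for tabular models; check the endpoint's documented schema before relying on it.

```python
import json
import urllib.request


def build_scoring_request(workspace_url, endpoint_name, token, records):
    """Build a POST request for a model serving endpoint without sending it."""
    url = f"{workspace_url.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",   # token authentication
            "Content-Type": "application/json",
        },
    )


# All values below are hypothetical placeholders.
request = build_scoring_request(
    "https://example.cloud.databricks.com/",
    "churn-model",
    "<access-token>",
    [{"tenure_months": 14, "monthly_spend": 42.5}],
)
print(request.full_url)
# To send it: urllib.request.urlopen(request) and parse the JSON response.
```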
DataBricks Notebooks for collaborative data science, ML engineering, and analytics development: the interactive environment that supports Python, SQL, Scala, and R in the same notebook with mixed-language execution. Notebook use cases: exploratory data analysis on Delta tables (plotting with Matplotlib, Seaborn, or Plotly inline), ML model development and testing with MLflow experiment tracking, feature engineering pipeline prototyping, and data quality investigations where SQL and Python are interleaved. Collaborative features: real-time co-editing (multiple users editing the same notebook simultaneously), comments and annotations, revision history with automatic versioning, and integration with Git repositories (DataBricks Repos) for branch-based development. Notebooks run on DataBricks compute (all-purpose clusters, serverless, or jobs compute), keeping all data within the Lakehouse security boundary. Parameterised notebooks: defining input parameters that enable the same notebook to be reused across different datasets or time periods via DataBricks Jobs or Workflows.
Unity Catalog is DataBricks' unified governance solution for the Lakehouse, providing centralised access control, auditing, lineage, and data discovery across all data assets (Delta tables, volumes, models, notebooks, dashboards) in a DataBricks account. Before Unity Catalog, DataBricks workspaces used the Hive metastore, which provided only database/table-level access control with no column or row-level security, no data lineage, no audit logging, and no cross-workspace sharing. Unity Catalog replaces this with a three-level namespace (catalog.schema.table), attribute-based access control (ABAC) via tags, dynamic data masking and row filters, column-level lineage, and comprehensive audit logging, all managed through SQL statements and integrated with enterprise identity providers (Azure AD, AWS IAM Identity Center, Okta). For regulated industries (BFSI, healthcare, insurance, government), Unity Catalog enables implementation of granular data access controls that operate at the column and row level, enforce data masking for sensitive attributes based on the querying user's role, and produce the data lineage and access audit records that regulatory compliance programmes require.
Unity Catalog Dynamic Data Masking for column-level data protection that applies masking rules at query time based on the querying user's role — without modifying the underlying data or requiring multiple copies of the table. Masking function creation: a SQL UDF that returns the column value for authorised roles and a masked or null value for all other roles (CASE WHEN is_account_group_member('ANALYST_PII') THEN credit_card_number ELSE '****-****-****-' || RIGHT(credit_card_number, 4) END). Policy assignment: the masking policy is assigned to a column in CREATE TABLE or via ALTER COLUMN — from that point, every query against that column is masked for unauthorised users transparently. Masking policy types: full masking (NULL), partial masking (first 4 / last 4 of credit card), hash masking (deterministic but irreversible for join-compatible pseudonymisation), and conditional masking (different masking for different roles). Unity Catalog applies masking consistently across SQL warehouses, all-purpose clusters, and DLT pipelines.
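The masking logic is written as a SQL UDF in Unity Catalog, but its behaviour can be mirrored in a few lines of Python for illustration. The first function reproduces the partial-masking CASE expression quoted above, with group membership passed in as a boolean; the second sketches hash masking with SHA-256 and a salt. The salt handling and function names are assumptions for the example, not a DataBricks API.

```python
import hashlib


def mask_credit_card(card_number, is_authorised):
    """Return the full value for authorised users, last four digits otherwise."""
    if is_authorised:
        return card_number
    return "****-****-****-" + card_number[-4:]


def hash_mask(value, salt):
    """Deterministic, irreversible pseudonym: equal inputs give equal outputs,
    so masked columns can still be joined on."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


print(mask_credit_card("4111-1111-1111-1234", is_authorised=False))
print(hash_mask("a@example.com", "s1") == hash_mask("a@example.com", "s1"))
```

Because the hash is deterministic, two tables masked with the same salt can be joined on the pseudonymised column without exposing the underlying value.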
Unity Catalog Row Filters for row-level data filtering that restricts which rows a user can see in a table based on their identity or role membership — the DataBricks equivalent of row-level security in SQL Server or Oracle Virtual Private Database. Row filter design: a SQL UDF that returns TRUE for rows the current user is authorised to see and FALSE for rows that should be filtered out. Implementation patterns: region-based access (each regional manager sees only their region's rows — a lookup table mapping username to authorised region codes, referenced in the filter function), customer-level access (a B2B portal scenario where each customer account sees only their own transaction rows — the filter function joins to a customer-user mapping table), and classification-level access (users with CONFIDENTIAL role see all rows; users without see only rows tagged as PUBLIC). A single row filter can be applied to multiple tables simultaneously — enabling consistent access control across the data model without per-table repetition. Row filters combine with column masks for comprehensive cell-level security.
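The region-based pattern can be sketched in plain Python to show the shape of the logic: a lookup from user to authorised regions, and a predicate that returns True only for rows the user may see. The user names, regions, and rows are hypothetical, and in Unity Catalog the same predicate would be a SQL UDF attached to the table rather than Python code.

```python
# Hypothetical mapping table: user -> region codes they may see.
USER_REGIONS = {
    "asha@example.com": {"NORTH"},
    "ravi@example.com": {"SOUTH", "WEST"},
}


def region_filter(current_user, row_region, user_regions=USER_REGIONS):
    """Row filter predicate: True when the user is authorised for the row's region."""
    return row_region in user_regions.get(current_user, set())


def visible_rows(current_user, rows):
    """Rows a given user would see once the filter is applied."""
    return [row for row in rows if region_filter(current_user, row["region"])]


sales = [
    {"order_id": 1, "region": "NORTH"},
    {"order_id": 2, "region": "SOUTH"},
    {"order_id": 3, "region": "WEST"},
]
print(visible_rows("ravi@example.com", sales))
```

A user who is missing from the mapping sees no rows at all, which is the safer default for a row filter.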
Unity Catalog Object Tags for attaching metadata to data assets (catalogs, schemas, tables, columns, volumes): the foundation for automated governance policy application based on data sensitivity classification. Tag creation and assignment: SENSITIVITY_LEVEL (PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED), DATA_DOMAIN (FINANCIAL / HEALTH / PII / OPERATIONAL), RETENTION_PERIOD, GDPR_APPLICABLE. Tag-based masking policies: instead of assigning a masking policy to each column individually, tag all PII columns with DATA_DOMAIN = PII and attach the masking policy to that tag value, so that the policy automatically applies to every column tagged as PII in future without a manual ALTER COLUMN for each. Unity Catalog data classification: the automated PII detection that scans table contents and recommends sensitivity tags based on column name patterns and data patterns (credit card numbers, email addresses, phone numbers, Aadhaar formats), accelerating the classification of large legacy schemas. Tags are inherited down the hierarchy (catalog → schema → table → column) and can be queried via INFORMATION_SCHEMA for data catalogue browsing.
Unity Catalog Role-Based Access Control (RBAC) architecture: the account role hierarchy that determines which users can access which objects, perform which operations, and consume which compute resources. Standard role design pattern: ACCOUNT ADMIN (manages account-level settings and billing), METASTORE ADMIN (manages Unity Catalog metastore and governance), CATALOG ADMIN (manages a specific catalog), and custom functional roles (ANALYST_FINANCE, DATA_ENGINEER_BRONZE, ML_SCIENTIST) with object-level BROWSE, READ, SELECT, MODIFY, and CREATE grants. Privilege inheritance: Unity Catalog privileges are inherited down the object hierarchy, so SELECT granted on a catalog or schema applies to every current and future table beneath it. USE CATALOG on the catalog and USE SCHEMA on the schema are still required before a user can reach any table, and granting SELECT at table level rather than higher up keeps access narrowly scoped. Service principals for automation: dedicated service principals for CI/CD pipelines, ETL jobs, and BI tool connections with minimal required privileges rather than using personal access tokens. Identity federation: synchronising Azure AD, AWS IAM Identity Center, or Okta groups to DataBricks account groups for SSO and automated role assignment.
Unity Catalog audit logging via system.access.audit — the system table that records every query, data access, privilege grant, and governance policy change in the DataBricks account. Compliance audit queries: "which users accessed the PII columns in the CUSTOMER table in the last 90 days?", "which queries accessed CONFIDENTIAL-tagged tables outside business hours?", "which service principals modified row filter policies?" — all answerable from system.access.audit without requiring a separate audit log system. system.billing.usage for operational audit: DBU consumption by workspace, cluster, job, and user — enabling cost allocation and anomaly detection (queries consuming abnormally high DBUs, clusters running at unexpected hours). system.compute.clusters and system.compute.node_types for infrastructure audit: tracking cluster configurations, policy compliance, and auto-termination effectiveness. system.information_schema for data governance audit: querying table schemas, column tags, masking policies, and row filter assignments programmatically.
Unity Catalog data lineage — the automatic tracking of data flow from source tables through transformations to downstream tables, dashboards, and ML models. Lineage capture: Unity Catalog records lineage for SQL queries, Delta Live Tables pipelines, Spark DataFrame operations, and notebooks — showing which tables were read, which were written, which columns were transformed, and which notebooks or jobs executed the transformation. Impact analysis: answering "which downstream tables, dashboards, and ML models would be affected if this source table column changed?" by traversing the lineage graph in Unity Catalog. Data discovery: the DataBricks Data Explorer and Unity Catalog search enable analysts and data scientists to find tables by name, column name, tag, or description — browsing the three-layer namespace and viewing table schemas, sample data, statistics, and lineage without writing SQL. Integration with external data catalogues: Alation, Collibra, and Microsoft Purview connect to Unity Catalog via APIs to pull metadata, lineage, and usage statistics into enterprise catalogues.
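Impact analysis is a graph traversal: start at the asset that is about to change and follow lineage edges to everything downstream. The sketch below runs a breadth-first search over a small, hypothetical lineage graph held in a dictionary; the asset names are invented, and in practice the edges would come from Unity Catalog's lineage metadata.

```python
from collections import deque

# Hypothetical lineage edges: asset -> assets that read from it.
LINEAGE = {
    "bronze.orders": ["silver.orders_clean"],
    "silver.orders_clean": ["gold.daily_sales", "ml.churn_features"],
    "gold.daily_sales": ["dashboard.sales_overview"],
    "ml.churn_features": ["model.churn_v3"],
}


def downstream(asset, edges):
    """Return every asset reachable from `asset`, sorted by name."""
    seen = set()
    queue = deque([asset])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


print(downstream("silver.orders_clean", LINEAGE))
```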
DataBricks' DBU (DataBricks Unit) pricing model charges for compute consumption based on the instance type, cluster size, and runtime used — with different DBU rates for all-purpose compute, jobs compute, SQL warehouses, and GPU-enabled ML compute. While this model provides transparency and flexibility, it also creates opportunities for runaway cost: all-purpose clusters left running without auto-termination accumulate DBUs continuously even when no notebooks are executing; SQL warehouses provisioned for peak concurrency but running at 10% utilisation waste DBUs during off-peak hours; and Spark jobs with excessive shuffling or skewed partitions consume 5–10x more DBUs than necessary due to inefficient execution rather than genuine data volume. DataBricks FinOps is the continuous practice of identifying and closing the gap between what the organisation is paying for DataBricks and what the organisation needs to pay for the business value it is extracting.
DataBricks’ combination of open storage, unified governance, real-time streaming, and integrated AI makes it the platform of choice for data-intensive and AI-forward industries.
DataBricks' open Lakehouse architecture integrates with every major data ingestion, transformation, BI, orchestration, ML, and observability tool. Key systems we integrate regularly:
We had been on an on-premise Hadoop cluster for 7 years. The operational overhead had become unsustainable — our small platform team was spending 60% of their time on cluster maintenance, patching, and hardware failures rather than building data products. Analysts waited 2–3 days for data engineering tickets to extract data from Hive into Excel because they could not query across databases themselves. SourceMash’s DataBricks migration took 22 weeks: they migrated 800+ Hive tables to Delta Lake via Spark, rebuilt 45 Pig and Hive ETL jobs as Delta Live Tables pipelines, and implemented Unity Catalog with row filters so that each business unit sees only their authorised data. The ETL time reduction is 72% — jobs that took 6 hours on Hadoop finish in 90 minutes on DataBricks. But the bigger transformation is self-service: analysts now query Gold-layer Delta tables directly from Power BI and DataBricks SQL without submitting tickets. Unity Catalog’s dynamic data masking on all PII columns gives us the governance posture our auditors require. And the total cost including DataBricks DBUs and cloud storage is 28% below what we were spending on Hadoop hardware maintenance and data centre costs alone.
We operate 340 retail stores across four formats with three different ERP systems and two different POS systems — which meant our data landscape was five separate databases that nobody could query across simultaneously without a manual extract-and-join in Excel. SourceMash built a DataBricks Lakehouse that consolidated all five sources via Fivetran into Bronze Delta tables, applied standard transformations using Delta Live Tables (consistent product hierarchies, consistent customer identifiers, consistent date definitions across all five source systems), and produced a single Gold-layer analytics warehouse that the whole organisation queries from the same schema. The inventory forecasting model they built using DataBricks AutoML improved our stock availability by 19 percentage points on promoted lines — we were consistently running out of promotional stock before because our forecast was based on a subset of sales data, not the full cross-format picture. The Delta Sharing implementation for our top 10 suppliers took 2 days per supplier compared to the 6-week data extract and SFTP setup process we had been running for the previous generation of supplier data sharing.
We were spending ₹1.55 crore per year on DataBricks DBUs and did not have a clear picture of where the cost was going. Our engineering team had grown the platform organically over 3 years and nobody had audited the cluster configuration or job efficiency in that time. SourceMash’s FinOps audit took 3 weeks. The findings: 6 of our 8 all-purpose clusters had auto-termination disabled or set to 60 minutes — fixing this alone was ₹18 lakh of annual savings. Our three most expensive Spark jobs were scanning full tables on our 600GB event Delta table because the filter column was not in the Z-ORDER or partition key — adding Z-ORDER on the event_date column and enabling Liquid Clustering made the same jobs 8–25x faster and reduced their DBU consumption by 88%. Our SQL warehouses were running 24/7 on Classic mode for BI tools that were only used during business hours — switching to Serverless with auto-scaling reduced warehouse DBUs by 62%. Total annual saving from the FinOps programme: ₹52 lakh — 34% of our total DBU spend. The programme paid for itself in less than 8 weeks.
Everything you need to know before reaching out to us.
How does DataBricks' pricing model work and why can it surprise organisations?
DataBricks uses a DBU (DataBricks Unit) pricing model combined with cloud infrastructure costs. DBUs are consumed based on the instance type, cluster size, and compute mode: all-purpose clusters (highest DBU rate, for interactive development), jobs compute (lower DBU rate, for scheduled ETL), SQL warehouses (rate depends on serverless vs. classic and size), and ML/GPU compute (specialised rates for GPU instances). The cloud provider bills the underlying VMs separately; DataBricks bills the DBUs. Surprises occur in three common patterns: first, all-purpose clusters left running without auto-termination accumulate DBUs continuously even when no notebooks are executing — a 4-worker cluster on standard instances left running for a month can cost ₹8–12 lakh for zero productive work. Second, SQL warehouses provisioned for peak BI concurrency but running at low utilisation waste DBUs during off-peak hours — serverless warehouses solve this by scaling to zero but classic warehouses do not. Third, Spark jobs with data skew, excessive shuffling, or small file problems consume 5–10x more DBUs than necessary due to inefficient execution rather than data volume. The solution is the combination of cluster policies (enforcing auto-termination and instance restrictions), system.billing.usage monitoring dashboards, and Spark performance optimisation — all components of our FinOps programme.
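The two-part bill described above (DBUs to DataBricks, VMs to the cloud provider) can be estimated with simple arithmetic. Every number in this sketch is a hypothetical input chosen to keep the sums easy to follow: actual DBU rates, DBU prices, and VM prices vary by cloud, region, instance type, and contract, so substitute your own figures.

```python
def monthly_idle_cost(nodes, dbu_per_node_hour, dbu_price, vm_price_per_hour, idle_hours):
    """Estimate what an idle cluster costs, split into DBU and VM charges.

    Returns (dbu_cost, vm_cost, total) in whatever currency the prices use.
    """
    dbu_cost = nodes * dbu_per_node_hour * dbu_price * idle_hours
    vm_cost = nodes * vm_price_per_hour * idle_hours
    return dbu_cost, vm_cost, dbu_cost + vm_cost


# Hypothetical inputs: 4 workers plus 1 driver, idle for 600 hours in a month.
dbu_cost, vm_cost, total = monthly_idle_cost(
    nodes=5,
    dbu_per_node_hour=1.0,   # assumed DBU consumption per node per hour
    dbu_price=0.5,           # assumed price per DBU
    vm_price_per_hour=0.4,   # assumed cloud VM price per node per hour
    idle_hours=600,
)
print(dbu_cost, vm_cost, total)
```

Running the same calculation with `idle_hours` set to what an auto-termination policy would allow gives a quick estimate of the saving from enforcing that policy.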
How does DataBricks compare to Snowflake, Azure Synapse, and Amazon Redshift?
Each platform has genuine strengths and the right choice depends on the organisation's workload mix, existing skills, and strategic priorities. DataBricks' primary advantages are: the Lakehouse architecture that unifies data warehousing, data engineering, streaming, and AI/ML on a single platform with open Delta Lake format (avoiding vendor lock-in), the integrated ML and generative AI capabilities (MLflow, Mosaic AI, Feature Store) that run on the same data as analytics, the flexibility of Apache Spark for complex data engineering and streaming workloads, and the open Delta Sharing protocol for cross-platform data exchange. Snowflake is the strongest alternative for organisations primarily focused on SQL analytics and BI, with superior BI tool connectivity, a simpler SQL-only operational model, and the most mature data sharing marketplace ecosystem. Snowflake is weaker at ML/AI integration (requires separate platforms), streaming ETL, and open-format portability. Azure Synapse is appropriate for organisations fully committed to the Azure ecosystem (Microsoft 365, Azure Data Factory, Power BI Premium Gen2) but lacks DataBricks' multi-cloud portability and integrated AI capabilities. Amazon Redshift is appropriate for organisations heavily invested in AWS native services but is at a competitive disadvantage for ML, streaming, and open-format workloads. DataBricks is the best default choice for organisations that: need to unify analytics and AI/ML on one platform, have complex data engineering or streaming requirements, want open-format storage to avoid vendor lock-in, or are building data products that will be shared via Delta Sharing to recipients on other platforms.
What is Delta Lake and do we need it if we already have a data warehouse?
Delta Lake is an open-source storage layer that brings ACID transactions, schema enforcement, time travel, and scalable metadata handling to data lakes built on cloud object storage. It is not a replacement for a data warehouse query engine — rather, it is the storage format that enables DataBricks (and other engines like Spark, Presto, Trino, and even Snowflake via Delta Sharing) to query data lake storage with the reliability and performance previously associated only with proprietary data warehouse formats. You need Delta Lake if: your organisation wants to store data in an open format (Parquet) that is not locked to a single vendor, you need ACID transactions and schema enforcement on data lake storage (previously impossible with plain Parquet or CSV), you want time travel and rollback capabilities for data recovery, or you need to support both SQL analytics and ML/Spark processing on the same data without copying it between systems. You might not need Delta Lake if: your data volumes are small enough that proprietary warehouse storage costs are negligible, your entire workload is SQL BI with no ML or streaming requirements, or your organisation is fully satisfied with a single proprietary platform and has no multi-engine or data sharing requirements. For most organisations building a serious modern data platform, Delta Lake is the right storage foundation — and DataBricks is the most mature execution environment for Delta Lake.
How long does a DataBricks migration from Hadoop or a cloud warehouse take?
Migration timelines depend primarily on the volume and complexity of the data objects and ETL pipelines in the source platform, not the raw data volume — because data migration itself (copying files from HDFS to cloud storage, or exporting warehouse tables to Parquet) is typically faster than the SQL translation and pipeline rebuild work. A typical Hadoop to DataBricks migration with 500–1000 Hive tables, 50–100 Pig/Spark ETL jobs, and 10–20 source systems takes 16–26 weeks. The phases: Assessment (3–4 weeks — Hive metastore inventory, Spark job analysis, format conversion requirements, Unity Catalog architecture design), Schema and Data Migration (4–6 weeks — HDFS to cloud storage transfer, Hive table conversion to Delta Lake, partition strategy redesign), ETL Pipeline Rebuild (4–8 weeks — converting Pig/Hive scripts to Delta Live Tables or Spark SQL, unit testing each pipeline), Data Validation and Reconciliation (2–4 weeks — row count, aggregate, and business metric validation), and Parallel Run and Cutover (2–4 weeks — running both platforms simultaneously, validating BI outputs, switching traffic). Cloud warehouse migrations (Snowflake, Redshift, Synapse, BigQuery) to DataBricks are typically faster (12–20 weeks) because the SQL dialect gap is smaller than the Hadoop-to-Spark gap, though stored procedures and proprietary functions still require manual translation.