Databricks Cost Optimization:
12 Proven Strategies to Cut Your Bill
Most Databricks deployments waste 30-50% of their spend on misconfigured workloads, idle clusters, and suboptimal instance choices. These 12 strategies are ordered by impact, starting with the changes that deliver the largest savings with the least effort. No product recommendations, no sales pitch, just engineering-focused optimization guidance.
The Biggest Cost Levers
Not all optimization strategies are equal. The priority order matters because the top two changes alone can reduce most Databricks bills by 40-60%. Focus here first before fine-tuning smaller optimizations.
- 1. Workload type selection (Jobs vs All-Purpose) delivers 60-75% savings on the Databricks platform portion. This is the single highest-impact change for most teams.
- 2. Spot / preemptible instances deliver 60-80% savings on the cloud infrastructure portion. Combined with workload type optimization, these two changes address both halves of the bill.
- 3. Auto-termination and right-sizing address waste and overprovisioning, typically saving 15-40%.
- 4. Everything else (Photon, storage optimization, serverless, policies, monitoring) is valuable but secondary to the first three.
Fixing your workload type classification alone can cut costs by 60-75% on the Databricks platform bill.
The 12 Strategies
Jobs Compute instead of All-Purpose
Savings: 60-75%. Effort: Low. Switch production ETL from interactive clusters to Jobs Compute.
All-Purpose Compute ($0.55/DBU on AWS) is designed for interactive notebooks where you need quick iteration. Jobs Compute ($0.15/DBU) is designed for production pipelines that run on a schedule. Many teams develop in All-Purpose notebooks and then schedule those same notebooks without switching to Jobs Compute. The fix is straightforward: in your Databricks job configuration, select a Jobs Compute cluster instead of an interactive cluster. The code runs identically.
On a workload consuming 10,000 DBUs/month: All-Purpose costs $5,500/month in platform fees, while Jobs Compute costs $1,500/month. That is $4,000/month saved with a 5-minute configuration change.
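The arithmetic above is just DBU volume times the per-DBU rate; a minimal sketch using the AWS list rates quoted earlier:

```python
# Back-of-the-envelope platform-fee comparison for the same workload run on
# All-Purpose vs Jobs Compute, at the AWS per-DBU rates quoted above.
ALL_PURPOSE_RATE = 0.55  # USD per DBU
JOBS_RATE = 0.15         # USD per DBU

def monthly_platform_fee(dbus_per_month: float, rate: float) -> float:
    """Platform fee only: cloud infrastructure is billed separately."""
    return dbus_per_month * rate

dbus = 10_000
all_purpose = monthly_platform_fee(dbus, ALL_PURPOSE_RATE)  # ~5500
jobs = monthly_platform_fee(dbus, JOBS_RATE)                # ~1500
print(all_purpose - jobs)  # monthly savings, ~4000
```

The same calculation scales linearly: every 10,000 DBUs/month moved from All-Purpose to Jobs Compute returns roughly $4,000/month.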
Auto-termination (15-30 min)
Savings: 20-40%. Effort: Low. Prevent overnight cluster burn with aggressive idle timeouts.
Idle clusters are the most common source of Databricks waste. The default auto-termination timeout is often set to 120 minutes (2 hours), meaning a forgotten notebook session costs hours of unnecessary compute. Setting auto-termination to 15 minutes for development clusters and 10 minutes for production clusters eliminates most idle waste.
For a 4-node cluster at $1.50/hour total: reducing idle time from 2 hours to 15 minutes saves 1.75 hours, or about $2.60, per session. Across 5 sessions per working day, that is roughly $13/day, or about $270/month per cluster (assuming ~21 working days). Across 10 clusters, that is $2,700/month.
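The savings estimate above can be sketched directly (the session count and working-day assumptions are illustrative):

```python
# Idle-time savings for the 4-node, $1.50/hour cluster example above.
HOURLY_COST = 1.50     # total cluster cost per hour (DBUs + infrastructure)
OLD_IDLE_HOURS = 2.0   # the common 120-minute auto-termination default
NEW_IDLE_HOURS = 0.25  # 15-minute auto-termination
SESSIONS_PER_DAY = 5   # assumed forgotten/idle sessions per working day
WORKING_DAYS = 21      # assumed working days per month

saved_per_session = (OLD_IDLE_HOURS - NEW_IDLE_HOURS) * HOURLY_COST  # ~2.63
saved_per_day = saved_per_session * SESSIONS_PER_DAY                 # ~13.13
saved_per_month = saved_per_day * WORKING_DAYS                       # ~275.63
print(round(saved_per_month))  # per cluster; multiply by cluster count
```
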
Spot/preemptible instances
Savings: 60-80%. Effort: Low. Use spot instances for fault-tolerant batch and training workloads.
Spot instances save on the cloud infrastructure portion of your bill (not the DBU portion). For batch ETL jobs that run through Databricks Jobs, spot instances are ideal because the Jobs scheduler automatically retries tasks if spot capacity is reclaimed. ML training workloads with checkpointing also benefit significantly. Avoid spot for streaming workloads or interactive notebooks where interruptions disrupt work.
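A minimal cluster spec showing the spot settings, using field names from the Databricks Clusters REST API on AWS (the Spark version, instance type, and bid percentage here are illustrative placeholders, not recommendations):

```json
{
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4,
  "aws_attributes": {
    "first_on_demand": 1,
    "availability": "SPOT_WITH_FALLBACK",
    "spot_bid_price_percent": 100
  }
}
```

Keeping the driver on-demand (`first_on_demand: 1`) and using `SPOT_WITH_FALLBACK` rather than `SPOT` means the job falls back to on-demand capacity instead of failing when spot instances are reclaimed.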
Right-size clusters
Savings: 15-30%. Effort: Medium. Match instance families to workload profiles.
Adopt Photon engine
Savings: 50-70%. Effort: Medium. 3-8x faster queries mean proportionally fewer DBUs consumed.
OPTIMIZE and Z-ORDER
Savings: 10-25%. Effort: Low. Reduce scan time with file compaction and data skipping.
SQL Serverless for bursty BI
Savings: 30-60%. Effort: Low. Eliminate forced Classic SQL warehouse uptime.
Compute policies
Savings: 15-25%. Effort: Medium. Cap max cluster size and restrict expensive instance types.
Cost tagging and attribution
Savings: 10-20%. Effort: Medium. Visibility drives accountability and reduction.
Committed-use discounts
Savings: 20-40%. Effort: Low. Volume commitments for predictable workloads.
System table monitoring
Savings: 5-15%. Effort: Medium. Set budget alerts before overruns, not after.
Serverless for bursty workloads
Savings: 30-50%. Effort: Low. Higher DBU rate but zero idle cost for intermittent jobs.
Cost Monitoring Setup
Visibility is the foundation of cost control. Databricks provides system tables and budget alerts that give you real-time insight into spend patterns. Setting these up correctly is a prerequisite for all other optimization work.
Unity Catalog System Tables
Query system.billing.usage for detailed DBU consumption by workspace, cluster, user, and custom tags. This is the most granular cost attribution data available and enables per-team chargeback models.
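A starting-point query for the table, aggregating DBU consumption per SKU and cluster over the last 30 days (column names follow the Unity Catalog billing schema as documented at the time of writing; verify them against `DESCRIBE system.billing.usage` in your workspace):

```sql
-- Daily DBU consumption by SKU and cluster, last 30 days
SELECT
  usage_date,
  sku_name,
  usage_metadata.cluster_id AS cluster_id,
  SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= DATE_SUB(CURRENT_DATE(), 30)
GROUP BY usage_date, sku_name, usage_metadata.cluster_id
ORDER BY dbus DESC;
```

Adding `custom_tags` to the grouping keys turns the same query into a per-team attribution report.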
Budget Alerts
Configure budget alerts in the Databricks account console to notify workspace admins and team leads when spend approaches defined thresholds. Set alerts at 50%, 75%, and 90% of monthly budget to give teams time to adjust before overruns.
Chargeback Model
Implement cost tagging with cluster tags that map to business units, projects, and cost centres. Join billing data with tag metadata to produce per-team cost reports. Teams that see their own costs consistently reduce spend by 10-20% through self-policing.
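The join-usage-with-tags step is a simple aggregation. A toy sketch of the chargeback logic, with hypothetical records standing in for rows from `system.billing.usage` and your cluster tag metadata (the rate and all IDs here are illustrative):

```python
# Toy chargeback aggregation: attribute DBU spend to teams via cluster tags.
from collections import defaultdict

RATE_PER_DBU = 0.15  # assumed Jobs Compute rate (USD, AWS)

# Stand-ins for rows from system.billing.usage and cluster tag metadata.
usage = [
    {"cluster_id": "c1", "dbus": 1200.0},
    {"cluster_id": "c2", "dbus": 800.0},
    {"cluster_id": "c3", "dbus": 500.0},
]
cluster_tags = {"c1": "data-eng", "c2": "data-eng", "c3": "analytics"}

def chargeback(usage, tags, rate):
    """Sum DBU cost per team tag; unmapped clusters land in 'untagged'."""
    totals = defaultdict(float)
    for row in usage:
        team = tags.get(row["cluster_id"], "untagged")
        totals[team] += row["dbus"] * rate
    return dict(totals)

print(chargeback(usage, cluster_tags, RATE_PER_DBU))
# data-eng: (1200 + 800) * 0.15 ~= 300; analytics: 500 * 0.15 ~= 75
```

The "untagged" bucket is worth surfacing in reports: a large untagged share is itself a signal that tagging policies are not being enforced.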
Optimization Impact Summary
All 12 strategies at a glance, listed in the priority order used above.
| # | Strategy | Savings | Effort |
|---|---|---|---|
| 1 | Jobs Compute instead of All-Purpose | 60-75% | Low |
| 2 | Auto-termination (15-30 min) | 20-40% | Low |
| 3 | Spot/preemptible instances | 60-80% | Low |
| 4 | Right-size clusters | 15-30% | Medium |
| 5 | Adopt Photon engine | 50-70% | Medium |
| 6 | OPTIMIZE and Z-ORDER | 10-25% | Low |
| 7 | SQL Serverless for bursty BI | 30-60% | Low |
| 8 | Compute policies | 15-25% | Medium |
| 9 | Cost tagging and attribution | 10-20% | Medium |
| 10 | Committed-use discounts | 20-40% | Low |
| 11 | System table monitoring | 5-15% | Medium |
| 12 | Serverless for bursty workloads | 30-50% | Low |
Frequently Asked Questions
What is the single biggest cost optimization for Databricks?
Switching production ETL workloads from All-Purpose Compute ($0.55/DBU) to Jobs Compute ($0.15/DBU) on AWS. This single change can reduce the Databricks platform portion of your bill by 60-75% for those workloads. Many teams start with All-Purpose clusters for development and forget to migrate production pipelines to Jobs Compute, which is specifically designed for scheduled, non-interactive workloads.
How much can spot instances save on Databricks?
Spot instances can save 60-80% on the cloud infrastructure portion of your Databricks bill (not the DBU portion). For a typical deployment where infrastructure is 40-60% of the total bill, this translates to roughly 24-48% total cost reduction. Spot instances are recommended for batch ETL, ML training, and any workload that can handle occasional interruptions through checkpointing.
Is Photon worth the higher DBU rate?
Usually yes. Photon-enabled Jobs Compute costs $0.20/DBU vs $0.15/DBU for standard Jobs Compute (a 33% premium), but Photon typically delivers 3-8x query performance improvement. This means the same query consumes 67-88% fewer DBUs. For SQL and ETL workloads with large data scans, Photon almost always reduces total cost despite the higher per-DBU rate.
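The break-even condition is simple: Photon wins whenever its speedup exceeds the DBU-rate premium. A sketch using the AWS Jobs Compute rates quoted above:

```python
# Photon break-even: cost ratio = (rate premium) / (speedup).
STANDARD_RATE = 0.15  # USD per DBU, standard Jobs Compute (AWS)
PHOTON_RATE = 0.20    # USD per DBU, Photon-enabled Jobs Compute (AWS)

def relative_cost(speedup: float) -> float:
    """Photon cost as a fraction of standard cost for the same query."""
    return (PHOTON_RATE / STANDARD_RATE) / speedup

break_even_speedup = PHOTON_RATE / STANDARD_RATE
print(round(break_even_speedup, 2))  # 1.33 -> Photon wins above ~1.33x
print(round(relative_cost(3), 3))    # 0.444 -> ~56% cheaper at 3x
print(round(relative_cost(8), 3))    # 0.167 -> ~83% cheaper at 8x
```

So even at the low end of the typical 3-8x range, Photon roughly halves total platform cost; it only loses when the speedup falls below about 1.33x.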
How do I track Databricks costs by team or project?
Use Unity Catalog system tables for cost attribution. Configure cluster tags to associate compute usage with teams, projects, or cost centres. Set up budget alerts in the Databricks account console to notify stakeholders before they exceed allocated budgets. For chargeback models, query the system.billing.usage table which records DBU consumption per workspace, cluster, and tag.
Should I use serverless or classic compute to save money?
It depends on your utilisation patterns. Serverless has higher per-DBU rates but zero idle cost and includes infrastructure. If your workloads run less than 40-50% of the time, serverless is likely cheaper. For steady-state 24/7 workloads, classic compute with spot instances will be more cost-effective.
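The utilisation threshold falls out of a one-line comparison. A sketch with illustrative placeholder rates (not published Databricks prices), assuming the classic cluster stays up continuously:

```python
# Serverless vs classic break-even on utilisation (fraction of time running).
CLASSIC_HOURLY = 2.00     # assumed classic cost/hour, paid for every hour up
SERVERLESS_HOURLY = 4.00  # assumed serverless cost/hour, paid only when running

def cheaper_option(utilisation: float) -> str:
    """Compare cost over one billing hour of wall-clock time."""
    classic = CLASSIC_HOURLY * 1.0                # always-on cluster
    serverless = SERVERLESS_HOURLY * utilisation  # billed only while active
    return "serverless" if serverless < classic else "classic"

break_even = CLASSIC_HOURLY / SERVERLESS_HOURLY
print(break_even)           # 0.5 -> 50% utilisation at a 2x rate premium
print(cheaper_option(0.3))  # serverless
print(cheaper_option(0.8))  # classic
```

At a 2x serverless rate premium the break-even is 50% utilisation, which is where the 40-50% rule of thumb above comes from; plug in your own rates to find your threshold.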
How do I know if my clusters are right-sized?
Monitor cluster utilisation through the Databricks Spark UI metrics and system tables. Look for clusters with consistently low CPU utilisation (under 30%), excessive memory headroom, or frequent driver memory pressure. A well-sized cluster should run at 50-80% average CPU utilisation during active workloads. Right-sizing can save 15-30% on both DBU consumption and infrastructure costs.