Query Optimization
Comprehensive guide to writing fast, efficient SQL queries
Query optimization is the art and science of writing SQL that executes efficiently. Understanding how databases process queries enables you to write performant code from the start.
How Query Processing Works
Parsing
The database parses your SQL into an abstract syntax tree (AST), checking for syntax errors.
Rewriting
Query is transformed: views expanded, rules applied, subqueries flattened.
Optimization
The optimizer generates multiple execution plans and chooses the most efficient one based on statistics.
Execution
The chosen plan is executed, accessing data through the most efficient path.
Quick Navigation
EXPLAIN Basics
Reading and understanding execution plans
Join Optimization
Optimizing JOIN performance
Subquery Optimization
Efficient subqueries and CTEs
Common Anti-patterns
Mistakes to avoid
The Optimizer's Decision Process
Quick Reference: Optimization Checklist
Planning Phase
- Understand the data model and relationships
- Know the data volumes (rows, sizes)
- Identify which indexes exist
- Understand the query's purpose and frequency
- Consider if real-time is required or if caching is acceptable
Writing Phase
- Select only needed columns (avoid
SELECT *) - Use appropriate JOIN types
- Add WHERE clauses to limit rows early
- Put most restrictive conditions first
- Use EXISTS instead of COUNT for existence checks
- Consider LIMIT if you need only a subset
Verification Phase
- Run EXPLAIN ANALYZE to see actual execution
- Check if indexes are being used
- Compare estimated vs actual row counts
- Look for sequential scans on large tables
- Verify no unnecessary sorting or hashing
- Test with production-like data volumes
Essential Optimization Techniques
1. Use Indexes Effectively
-- ✅ Index can be used
SELECT * FROM orders WHERE customer_id = 123;
-- ❌ Index cannot be used (function on indexed column)
SELECT * FROM orders WHERE YEAR(created_at) = 2024;
-- ✅ Rewritten to use index
SELECT * FROM orders
WHERE created_at >= '2024-01-01'
AND created_at < '2025-01-01';2. Limit Data Early
-- ❌ Filters after joining all data
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2024-01-01';
-- ✅ Filter before join (if optimizer doesn't do it)
SELECT o.*, c.name
FROM (
SELECT * FROM orders WHERE created_at > '2024-01-01'
) o
JOIN customers c ON o.customer_id = c.id;3. Choose the Right Join
-- Use EXISTS for existence checks (can short-circuit)
SELECT * FROM customers c
WHERE EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id);
-- Use JOIN when you need data from both tables
SELECT c.*, o.total
FROM customers c
JOIN orders o ON c.id = o.customer_id;4. Avoid N+1 Queries
-- ❌ N+1 Pattern (in application code)
customers = db.query("SELECT * FROM customers")
for customer in customers:
orders = db.query(f"SELECT * FROM orders WHERE customer_id = {customer.id}")
-- ✅ Single query with JOIN
SELECT c.*, o.*
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id;
-- ✅ Or batch fetch
SELECT * FROM orders WHERE customer_id IN (1, 2, 3, 4, 5);Understanding Costs
What is Cost?
Cost is an arbitrary unit representing the estimated resources (CPU, I/O, memory) needed to execute a query. Lower is better, but absolute values are meaningless—only compare costs for the same query.
| Scan Type | Relative Cost | When Used |
|---|---|---|
| Index-Only Scan | Lowest | All columns in index |
| Index Scan | Low | Few rows, random access OK |
| Bitmap Scan | Medium | Many rows, combines indexes |
| Sequential Scan | High | Most/all rows needed |
Statistics Matter
The optimizer relies on table statistics to make decisions:
-- PostgreSQL: Update statistics
ANALYZE table_name;
-- MySQL: Update statistics
ANALYZE TABLE table_name;
-- SQL Server: Update statistics
UPDATE STATISTICS table_name;
-- View statistics
-- PostgreSQL
SELECT * FROM pg_stats WHERE tablename = 'orders';
-- Check when stats were last updated
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables;Stale Statistics
Stale statistics lead to poor query plans. The optimizer might choose a sequential scan when an index scan would be faster, or vice versa. Run ANALYZE after large data changes.
Common Performance Problems
Database-Specific Optimization
-- Enable timing
\timing on
-- Detailed EXPLAIN
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT ...;
-- See actual I/O
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
-- Force specific plan (for testing)
SET enable_seqscan = off;
SET enable_indexscan = off;
SET enable_hashjoin = off;
-- Auto-explain for slow queries
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '1s';-- Analyze query
EXPLAIN SELECT ...;
EXPLAIN ANALYZE SELECT ...; -- MySQL 8.0.18+
-- Show query profile
SET profiling = 1;
SELECT ...;
SHOW PROFILE FOR QUERY 1;
-- Optimizer hints
SELECT /*+ INDEX(orders idx_customer) */ *
FROM orders WHERE customer_id = 123;
-- Force index
SELECT * FROM orders FORCE INDEX (idx_customer)
WHERE customer_id = 123;-- Show execution plan
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Estimated plan
SET SHOWPLAN_ALL ON;
SELECT ...;
SET SHOWPLAN_ALL OFF;
-- Actual plan (graphical in SSMS)
SET STATISTICS PROFILE ON;
SELECT ...;
-- Query hints
SELECT * FROM orders WITH (INDEX(idx_customer))
WHERE customer_id = 123;
OPTION (RECOMPILE);
OPTION (OPTIMIZE FOR (@param = 'value'));Performance Tuning Workflow
Identify Slow Queries
Enable slow query logging or use monitoring tools.
-- PostgreSQL: pg_stat_statements
SELECT query, calls, mean_time, total_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;Analyze with EXPLAIN
Get the execution plan for the slow query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;Identify Bottlenecks
Look for:
- Sequential scans on large tables
- High loop counts
- Large sort operations
- Estimate vs actual mismatches
Apply Optimizations
- Add missing indexes
- Rewrite query logic
- Update statistics
- Adjust database parameters
Verify Improvement
Re-run EXPLAIN ANALYZE to confirm the fix.