A Playbook to Reduce OLAP Costs: Compression, Compaction, and Query Patterns

2026-02-22

A hands-on playbook to cut OLAP TCO via compression, compaction, and query patterns across ClickHouse and cloud warehouses.

Hook: The cost leak you can fix this quarter

Cloud OLAP costs are quietly draining budgets. You see it as unexpectedly high monthly warehouse bills, slow ad-hoc queries that scan terabytes, and unpredictable spikes when analysts run broad joins. The good news: most of those costs are structural — not fundamental. With targeted compression, smarter compaction, and modest query pattern changes, teams routinely cut OLAP TCO by 30–70% within weeks.

Executive summary: A practical playbook

This playbook gives an actionable, prioritized plan to reduce OLAP costs across ClickHouse and modern cloud warehouses (Snowflake, BigQuery, Redshift, and data-lake architectures using Parquet/Delta/Iceberg). Follow the audit -> quick wins -> structural changes sequence below, and you’ll control both storage and compute spend.

What you get

  • How to select compression codecs and per-column settings for ClickHouse and Parquet
  • How to tune compaction and merge behavior in ClickHouse and data lakes
  • Query-pattern edits to reduce scanned bytes and compute time
  • A 30/60/90 day action plan and KPIs to measure success

1. Measure first: baseline the true TCO

Before changing anything, collect metrics so you can prove savings. Measure across storage and compute and break costs down by workload.

Essential metrics

  • Bytes scanned per query and per-user cohort
  • Storage bytes by table, partition, and hot vs cold tier
  • Query latency and CPU per query class
  • Merge/compaction I/O and background job costs
  • Number and size of files/parts (small-file problem)

Action: export the last 30–90 days of query logs and storage metadata. Build a Pareto: which 20% of tables and queries cause 80% of spend.
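The Pareto step can be sketched in a few lines. This is a hypothetical illustration, not a vendor tool: it assumes query logs have been exported as `(table, bytes_scanned)` records, and the field names and sample data are made up.

```python
# Hypothetical sketch: build a cost Pareto from exported query logs.
# Input is a list of (table, bytes_scanned) records; names are illustrative.
from collections import defaultdict

def pareto(records, threshold=0.8):
    """Return the smallest set of tables responsible for `threshold` of total bytes scanned."""
    totals = defaultdict(int)
    for table, bytes_scanned in records:
        totals[table] += bytes_scanned
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    running, culprits = 0, []
    for table, b in ranked:
        culprits.append((table, b))
        running += b
        if running >= threshold * grand_total:
            break
    return culprits

logs = [("events", 900), ("users", 50), ("events", 800),
        ("sessions", 150), ("orders", 100)]
print(pareto(logs))  # -> [('events', 1700)] — one table dominates scan volume
```

Run the same ranking over queries (grouped by fingerprint) to find the handful of dashboards or jobs worth optimizing first.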

Rule: You can’t optimize what you can’t measure. Baseline first, then change one lever at a time.

2. Compression strategies (biggest immediate leverage)

Compression reduces bytes on disk and I/O during scans — that directly lowers storage fees and compute. Apply compression at three layers: physical column codecs (ClickHouse), file formats (Parquet/ORC), and warehouse-level settings (Snowflake micro-partitions or BigQuery column compression).

ClickHouse: per-column codecs and engine choices

  • Use per-column CODEC() hints for types: integers and timestamps often benefit from Delta / DoubleDelta, strings from ZSTD (combine with the LowCardinality type for columns dominated by repeated values), and floats from specialized codecs such as Gorilla where available.
  • Set aggressive ZSTD levels for cold columns (e.g., CODEC(ZSTD(3))) and LZ4 for hot columns where CPU matters.
  • Prefer AggregatingMergeTree or materialized pre-aggregations for high-cardinality rollups to store smaller, pre-computed blocks.
  • Example create table (safe, widely used pattern):
CREATE TABLE events (
  -- DoubleDelta suits monotonically increasing timestamps
  event_time DateTime CODEC(DoubleDelta, ZSTD(1)),
  user_id UInt64,
  -- Delta shrinks the values before the general-purpose ZSTD pass
  page_id UInt32 CODEC(Delta, ZSTD(3)),
  -- level 1 keeps decompression cheap for a hot, frequently read column
  url String CODEC(ZSTD(1))
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time)
SETTINGS index_granularity = 8192;

Action: run a small experiment on a representative table: test a ZSTD(3) rewrite vs current codec, measure scan I/O and query latency.
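A codec bake-off is easy to prototype before touching production. The sketch below uses the Python stdlib, so zlib and lzma stand in for LZ4 and ZSTD (zstd is not in the stdlib); the sample data is invented. The point is the methodology: always measure ratio and (de)compression time together, because the "best" codec differs for hot vs cold columns.

```python
# Sketch of a codec bake-off on a representative column sample.
# zlib/lzma are stand-ins for LZ4/ZSTD; methodology is what matters here.
import time, zlib, lzma

sample = b"/products/item-42?ref=email\n" * 10_000  # repetitive URL-like column data

def bench(name, compress, decompress):
    t0 = time.perf_counter()
    blob = compress(sample)
    t1 = time.perf_counter()
    decompress(blob)
    t2 = time.perf_counter()
    return name, len(sample) / len(blob), t1 - t0, t2 - t1

for row in (bench("zlib-1", lambda d: zlib.compress(d, 1), zlib.decompress),
            bench("lzma",  lzma.compress, lzma.decompress)):
    print("%s  ratio=%.1fx  comp=%.4fs  decomp=%.4fs" % row)
```

Swap in real column samples exported from your warehouse and the actual codecs you are evaluating before drawing conclusions.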

Cloud warehouses and file formats

  • For data lakes and warehouses backed by object storage, use columnar formats: Parquet or ORC with ZSTD or Snappy. ZSTD typically gives better compression ratios for analytic workloads; Snappy trades compression ratio for lower CPU cost and faster decompression.
  • Set Parquet row-group size and Parquet dictionary thresholds to favor larger, more compressible blocks (but avoid excessively large files).
  • Snowflake and BigQuery manage compression automatically, but you can influence it by writing optimized Parquet/ORC files and using clustering/partitioning to reduce scanned bytes.

Action: run a rewrite job to convert hot tables to Parquet+ZSTD and compare storage/scan costs. Many teams see 2x–4x storage reduction vs raw JSON/CSV.

3. Compaction tuning: reduce overhead from many small files/parts

Compaction eliminates small files and optimizes read paths. Small files create metadata overhead, increase per-request object operations (S3 GETs), and raise scan amplification.

ClickHouse MergeTree compaction

  • ClickHouse merges small parts into larger ones automatically — tune your partitioning and insert patterns to avoid an explosion of tiny parts (e.g., avoid one-row inserts without buffering).
  • Batch inserts where possible. Use Buffer tables to absorb high-frequency insert bursts.
  • Use TTLs and ALTER TABLE ... DELETE to remove cold data and allow merges to coalesce remaining parts.
  • Monitor parts count per table. A high parts count correlates with higher read amplification and slower merges.

Data lakes: compaction + file sizing

  • Target Parquet/ORC file sizes in the 128MB–512MB compressed range for optimal read throughput. Too small -> high S3 request rates; too large -> slow parallel reads for small queries.
  • Use Delta Lake / Iceberg / Hudi compaction jobs (or Spark jobs) to compact many small files into properly-sized files and update table metadata atomically.
  • Example Spark approach: read recent partitions, repartition/coalesce into N files, and overwrite the partition to reduce file count.
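The compaction planning step can be illustrated with a simple batching pass. This is a hypothetical planner, not a Delta/Iceberg API: file names and sizes are invented, and a real job would list them from object storage and rewrite each batch as one output file.

```python
# Illustrative compaction planner: group small files into batches near a
# target compressed size, so each batch is rewritten as one output file.
TARGET = 256 * 1024 * 1024  # 256 MB, inside the 128-512 MB sweet spot

def plan_compaction(files, target=TARGET):
    """files: list of (name, size_bytes). Returns batches of files to merge."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(files, key=lambda f: f[1]):
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 20 small 32 MB files from a streaming writer (hypothetical names)
small_files = [("part-%03d.parquet" % i, 32 * 1024 * 1024) for i in range(20)]
print([len(b) for b in plan_compaction(small_files)])  # -> [8, 8, 4]
```

Each batch of eight 32 MB files lands at the 256 MB target; the trailing partial batch is still an improvement over four loose files.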

Action: schedule regular compaction for streaming/nearline ETL paths. Run a pilot on the busiest partitions and validate both query performance and cost reductions.

4. Query pattern changes: reduce scanned bytes and CPU

Many cost-savvy changes require only SQL edits or lightweight engineering work. Focus on patterns that reduce scanned data and avoid expensive joins/aggregations.

Top query optimizations

  • Selective projections: avoid SELECT *. Read only the columns you need — columnar systems pay per column scanned.
  • Predicate pushdown & partition pruning: partition by date and cluster on the columns analysts filter by most (user_id, account_id) to minimize scanned micro-partitions.
  • Materialized views and pre-aggregations: pre-compute heavy rollups for dashboards and common slices. In ClickHouse, AggregatingMergeTree and materialized views reduce query CPU dramatically.
  • Approximate algorithms: use HyperLogLog, reservoir sampling, or APPROX_COUNT_DISTINCT where exact answers are not required.
  • Join strategy: prefer broadcast joins for small dimension tables; push filters before joins; pre-shard or denormalize high-cardinality joins into lookup tables.
  • Limit scope for ad-hoc queries: set query controls like mandatory time filters or result-size limits for interactive users to avoid full-table scans.

Action: implement a query linting policy in the BI tool and add a pre-query validator that rejects or warns on queries scanning > X TB.
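A pre-query validator can start as a few regex rules. The sketch below is a minimal illustration, not a full SQL parser: the rules, column-name patterns, and example queries are all assumptions to adapt to your schema.

```python
# Minimal pre-query validator sketch: flag obvious full-scan patterns before
# submission. Rules and regexes are illustrative, not a complete SQL parser.
import re

def lint_query(sql):
    issues = []
    normalized = " ".join(sql.lower().split())
    if re.search(r"select\s+\*", normalized):
        issues.append("SELECT *: project only the columns you need")
    if "where" not in normalized:
        issues.append("no WHERE clause: add a mandatory time filter")
    elif not re.search(r"(event_time|event_date|timestamp)", normalized):
        issues.append("no time predicate: interactive queries must be time-bounded")
    return issues

print(lint_query("SELECT * FROM events"))           # two violations
print(lint_query("SELECT user_id FROM events "
                 "WHERE event_time >= '2026-01-01'"))  # clean
```

Wire the same checks into the BI tool's pre-submit hook, and pair them with a hard bytes-scanned cap enforced server-side.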

5. Cold vs hot: lifecycle and tiering

Not all data needs the same SLAs. Classify data as hot (frequent queries, sub-second needs) vs cold (infrequent, historical). Tier accordingly.

  • Hot: keep in ClickHouse or warm cloud warehouse compute. Use LZ4 or low-latency codecs and smaller parts for fast reads.
  • Warm: store in cloud warehouse with partitioning and clustering; rely on result caches and materialized views for frequent slices.
  • Cold: move to object storage (S3 Glacier/Deep Archive or equivalent) or convert to heavily compressed Parquet and store only as external tables with explicit costs for retrieval.

Action: implement automated lifecycle policies. Use object storage lifecycle rules to move older partitions to cheaper storage and maintain a cataloged pointer for on-demand restores.
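The tiering decision itself is just an age policy. This sketch is hypothetical: the 90/365-day thresholds and monthly partition granularity are illustrative and should be tuned to your SLAs and storage pricing.

```python
# Hypothetical lifecycle policy: classify monthly partitions by age and
# report the tier. Thresholds (90/365 days) are illustrative, not prescriptive.
from datetime import date

def tier(partition_month, today, hot_days=90, warm_days=365):
    """partition_month: first day of the partition's month."""
    age = (today - partition_month).days
    if age <= hot_days:
        return "hot"    # fast storage: ClickHouse or warm warehouse compute
    if age <= warm_days:
        return "warm"   # partitioned + clustered warehouse storage
    return "cold"       # heavily compressed Parquet on archive-class storage

today = date(2026, 2, 22)
for month in (date(2026, 1, 1), date(2025, 6, 1), date(2024, 3, 1)):
    print(month.isoformat(), "->", tier(month, today))
```

A nightly job can walk the catalog with a function like this and emit the corresponding lifecycle-rule or partition-move actions.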

6. Operational controls and guardrails

Control spend with limits and visibility.

  • Set per-role query limits and result-size caps in BI tools and warehouses.
  • Implement alerting on bytes-scanned growth and parts count anomalies.
  • Use sampling-based cost estimation before running heavy queries in ad-hoc environments (preview the estimated bytes scanned).
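The sampling-based estimate is simple scaling. In this hypothetical sketch, the sampled figure would come from a real dry run (e.g., ClickHouse's SAMPLE clause or a limited probe query); here it is hard-coded for illustration.

```python
# Sampling-based pre-flight estimate: scan a small sample fraction, then
# scale up. The sampled bytes figure here is a hard-coded illustration.
def estimate_full_scan_bytes(sampled_bytes, sample_fraction):
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    return int(sampled_bytes / sample_fraction)

# A 1% sample scanned 3.2 GB, so warn before the ~320 GB full scan runs.
est = estimate_full_scan_bytes(sampled_bytes=3.2 * 1024**3, sample_fraction=0.01)
print("estimated full scan: %.0f GB" % (est / 1024**3))
```

Surface the estimate in the query UI and require explicit confirmation above a threshold, rather than silently running the full scan.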

7. Real-world examples and outcomes (2024–2026 patterns)

Example A: ClickHouse analytics for a SaaS product

  • Problem: rising SSD and storage costs after 2024–25, heavy small writes and many tiny parts.
  • Actions: switched high-cardinality strings to dictionary + ZSTD, consolidated parts via partition-aware compaction, created AggregatingMergeTree rollups for dashboards.
  • Result: 55% reduction in storage, 40% lower query CPU, and 60% lower monthly TCO for analytics in 90 days.

Example B: Data-lake on S3 feeding a Snowflake warehouse

  • Problem: small Parquet files from streaming ETL and many full-table scans by BI users.
  • Actions: scheduled micro-batch compaction to 256MB target files, rewrote hot partitions with ZSTD Parquet, introduced materialized views for common dashboards, enforced partition filters for ad-hoc queries.
  • Result: scanned bytes per dashboard dropped 4x and Snowflake compute credits for analytics dropped 45% in three months.

Trend note (late 2025–early 2026): As ClickHouse continued rapid growth and enterprise adoption in late 2025 — fueled in part by significant funding rounds — the ecosystem matured faster around operational best practices for compression and compaction. At the same time, hardware trends such as innovations in SSD manufacturing (e.g., PLC advancements) promise falling storage prices over the next 12–24 months; but expect compute and request costs to remain the dominant bill drivers.

8. Monitoring and KPIs: prove the impact

Track these KPIs weekly:

  • Bytes scanned per 1,000 queries
  • Storage bytes and effective compression ratio (raw -> on-disk)
  • Average query CPU sec and latency for top 50 queries
  • Parts/file count per table and compaction throughput
  • Monthly OLAP TCO and cost per dashboard or query class

9. 30/60/90 day action plan

Day 0–30: Audit and quick wins

  • Collect logs and build baseline dashboards (scan bytes, storage, latency).
  • Implement query guardrails (mandatory date filter, max scanned bytes).
  • Rewrite top-5 largest tables with conservative compression and run A/B queries.

Day 31–60: Compaction and query changes

  • Schedule compaction jobs for streaming partitions and rewrite file layout.
  • Introduce materialized views/rollups for heavy dashboards.
  • Train analysts on cost-aware SQL patterns and add linting rules.

Day 61–90: Architecture and governance

  • Implement lifecycle tiering for cold data and automate moves to cheaper storage.
  • Set up recurring cost reviews and KPIs in finance + data teams.
  • Invest in observability (query profiler, merge/compaction telemetry).

10. Advanced strategies & future-proofing (2026 and beyond)

Look beyond single-lever wins:

  • Compute-storage separation: optimize capacity classes — use very small compute for ad-hoc analysis and larger clusters for scheduled reports.
  • Hybrid architectures: keep hot indices in ClickHouse for sub-second lookups, and use warehouse or serverless lake queries for deep historical analysis.
  • Model-driven cost allocation: tag queries and tables by product/feature to allocate TCO to engineering teams and incentivize cost-aware design.
  • Observability & automated tuning: invest in tools that detect inefficient scans and suggest compression/partitioning changes automatically.

Checklist: quick reference

  • Baseline: export 30–90 days of query and storage logs
  • Compression: test ZSTD for cold, LZ4 for hot; use per-column codecs in ClickHouse
  • Compaction: compact to 128–512MB file targets; schedule compaction jobs
  • Queries: enforce projections, partition pruning, pre-aggregations
  • Cold vs hot: implement lifecycle and automated moves to cheaper tiers
  • Governance: add query guards, cost alerts, and KPI dashboards

Final notes and predictions

In 2026, practitioners who combine low-level storage techniques (compression and compaction) with high-level query pattern discipline will control the cost curve even as data volumes and ML workloads grow. Hardware improvements — including lower-cost SSDs emerging from PLC innovations — will help, but the biggest, fastest wins remain in engineering: schema design, codecs, and query hygiene.

Call to action

Start with one table: run a compression experiment and track scanned bytes and cost before and after. If you want a ready-to-run plan, download the 30/60/90 playbook template and the query lint ruleset we use to protect production warehouses. Want help running the audit and implementation? Contact our team for a tailored operational review and a cost-saving roadmap.
