Designing Query Systems for Liquid‑Cooled AI Racks: Practical Patterns for Developers
Practical patterns for placing query engines, budgeting I/O, and scheduling GPU workloads in DLC and RDHx high-density racks.
As AI models scale, the physical layer of computing — power delivery, cooling, and rack density — becomes an active constraint for query system design. Direct-to-chip (DLC) liquid cooling and rear-door heat exchangers (RDHx) make ultra-high rack densities viable, but they also shift trade-offs for where to place query engines, how to budget I/O, and how to schedule GPU/accelerator workloads. This article translates those physical realities into practical patterns developers and IT teams can implement today.
Why liquid cooling changes the architecture conversation
Traditional air-cooled racks impose soft limits on sustained power; moving to liquid cooling (DLC or RDHx) can raise sustained power per rack by multiple kilowatts. That unlocks denser GPU configurations, faster training and inference, and new cost profiles, but it also concentrates heat, power draw, and I/O demand into a smaller physical footprint.
Key differences developers and platform architects must internalize:
- Higher sustained TDP per rack: Packing more GPUs into a rack increases aggregate I/O (PCIe, NVMe, Ethernet/RDMA) and amplifies power spikes during synchronous workloads.
- Different failure modes: Liquid cooling largely removes thermal throttling as the dominant concern, but it elevates pump and coolant-loop failures, coolant path topology, and leak detection as the risks to plan for.
- Cooling topologies matter: DLC (direct-to-chip) provides localized delta-T control and supports higher density than RDHx, which cools the rear door and is generally less aggressive on per-chip temps.
High-level design principles
Translate physical constraints into software guarantees. Aim for these principles when designing query systems that target liquid-cooled racks:
- Locality-aware placement: Put the data-plane (GPU-accelerated query kernels, NN inference engines) as close as possible to the accelerators and storage it consumes to minimize PCIe/NVLink latency and host-to-host network hops.
- Control-plane separation: Keep orchestration, metadata services, and user-facing control-plane components off dense racks to reduce thermal spikes caused by management-plane activity.
- Explicit I/O and power budgets: Treat I/O (PCIe, NVMe, NIC bandwidth) and power draw as first-class resources in admission control and scheduling.
- Thermal-aware scheduling: Integrate temperature, coolant flow, and rack-level thermal headroom into scheduling decisions.
Pattern 1 — Query engine placement: hybrid control-plane, local data-plane
One pragmatic pattern is to split the query engine into two roles: a lightweight control-plane component that manages query planning and metadata, and a lean data-plane component that executes operators on GPUs inside liquid-cooled racks.
Why this works
Keeping the control-plane off dense racks reduces operational noise on those racks and lets the control-plane scale independently in a more conventional colo/cloud environment (useful for edge vs. colocation trade-offs). The data-plane benefits from local high-bandwidth connectivity (NVLink, PCIe) and fewer network hops for NVMe access.
Implementation checklist
- Expose a minimal RPC surface for the control-plane to dispatch operators to data-plane nodes.
- Deploy data-plane microservices as co-located containers/agents per server with GPU bindings and local NVMe mounts.
- Use shared memory or GPUDirect RDMA where possible to reduce host CPU overhead.
- Keep stateful metadata in the control-plane; cache ephemeral tensor/feature stores in the rack-local NVMe pool.
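A minimal sketch of that dispatch boundary, assuming an HTTP RPC surface; the `/execute` endpoint, the `OperatorTask` fields, and the agent URL are illustrative placeholders, not a specific engine's API:

```python
# Control-plane -> data-plane dispatch sketch (hypothetical names throughout).
# The control-plane plans the query; each rack-local agent executes one operator
# against its bound GPU and rack-local NVMe mounts.
from dataclasses import dataclass, asdict
import requests

@dataclass
class OperatorTask:
    query_id: str
    operator: str        # e.g. "scan_filter" or "vector_topk"
    gpu_index: int       # GPU binding on the target node
    nvme_paths: list     # rack-local NVMe files holding the partition

def dispatch(agent_url: str, task: OperatorTask, timeout_s: float = 5.0) -> dict:
    """Send one operator to a data-plane agent and return its result handle."""
    resp = requests.post(f"{agent_url}/execute", json=asdict(task), timeout=timeout_s)
    resp.raise_for_status()
    return resp.json()   # e.g. {"result_uri": "nvme://...", "rows": 131072}

# Usage: the control-plane keeps all metadata; the agent only sees its operator.
# dispatch("http://rack7-node3:8080",
#          OperatorTask("q-42", "vector_topk", gpu_index=0,
#                       nvme_paths=["/mnt/nvme0/features.parquet"]))
```

Keeping the RPC surface this small also makes it easy to swap the transport later (for example, gRPC or an RDMA-aware channel) without touching the planner.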
Pattern 2 — I/O budgeting and QoS at the platform level
In high-density racks, aggregate I/O can oversubscribe switches and aggregate power can exceed PDU capacity if left unconstrained. Make I/O budgeting explicit:
Practical knobs
- Reserve NIC and PCIe bandwidth per query class (e.g., heavy model inference vs. lightweight analytics).
- Enforce NVMe QoS via cgroups/blkio or vendor NVMe QoS features to prevent starvation of small, latency-sensitive queries.
- Implement token-bucket admission for RDMA/NVMe operations to smooth bursts across a rack.
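A token-bucket admission sketch for that last knob; the credit unit (here, megabytes of NVMe reads or RDMA payload) and the rates are assumptions you would size per rack and per query class:

```python
# Token-bucket admission for rack-level I/O bursts (sketch, hypothetical units).
import time
import threading

class IoTokenBucket:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s      # sustained credits per second (e.g. MB/s)
        self.burst = burst          # max credits accumulated (burst allowance)
        self.tokens = burst
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def try_admit(self, cost: float) -> bool:
        """Admit a request costing `cost` credits, or reject it to smooth the burst."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

# One bucket per rack and per query class; rejected requests are queued or retried.
rack_nvme_bucket = IoTokenBucket(rate_per_s=4_000, burst=16_000)   # MB/s, MB
if not rack_nvme_bucket.try_admit(cost=512):
    pass  # defer the 512 MB read until tokens refill
```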
Example rules
- Max 60% sustained NIC bandwidth per rack for model checkpoints to leave room for serving traffic.
- Limit per-GPU NVMe throughput to a conservative ceiling during synchronous distributed training to avoid switch oversubscription.
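One way to enforce such a ceiling on Linux is the cgroup v2 `io.max` interface. The sketch below assumes cgroup v2 is mounted at /sys/fs/cgroup and that the cgroup already exists; the cgroup name, device major:minor, and limits are placeholders to adapt to your nodes:

```python
# Apply a per-cgroup NVMe bandwidth ceiling via cgroup v2 io.max (sketch).
from pathlib import Path

def set_nvme_ceiling(cgroup: str, dev_maj_min: str, rbps: int, wbps: int) -> None:
    """Cap read/write bytes-per-second for one block device in one cgroup."""
    io_max = Path("/sys/fs/cgroup") / cgroup / "io.max"
    # cgroup v2 format: "<major>:<minor> rbps=<bytes> wbps=<bytes>"
    io_max.write_text(f"{dev_maj_min} rbps={rbps} wbps={wbps}\n")

# Hypothetical values: cap a distributed-training cgroup at 2 GB/s read, 1 GB/s
# write on nvme0n1 (verify the major:minor with `lsblk`); requires root.
# set_nvme_ceiling("training.slice", "259:0", rbps=2_000_000_000, wbps=1_000_000_000)
```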
Pattern 3 — Thermal-aware orchestration and scheduling
Tightly integrate telemetry from hardware into the scheduler. Modern software can consume chassis and GPU temperatures, coolant inlet/outlet temps, pump speeds, and BMC health via Redfish, DCGM, or Prometheus exporters. Use that data to make scheduling decisions.
Scheduler policies (practical options)
- Headroom-based admission: Only schedule additional GPU work if rack-level coolant delta and GPU temps indicate X% headroom.
- Thermal spread minimization: Prefer spreading hot jobs across racks with cool headroom rather than concentrating into a single saturated rack.
- Graceful throttling: When thermal thresholds are crossed, prefer throttling batch sizes, reducing concurrency, or temporarily migrating non-critical jobs off the rack.
Actionable integrations
- Collect telemetry with exporters (NVIDIA DCGM, Redfish BMC exporters) and store in Prometheus. Add alert rules for coolant flow, inlet/outlet delta, and per-GPU temps.
- Attach thermal labels to nodes in your scheduler (Kubernetes node labels or custom cluster manager tags). Use admission controllers that read these labels at scheduling time.
- Build simple policies in the control-plane like: if rack_inlet_delta > 10°C OR avg_gpu_temp > 75°C, deny new high-power jobs for 10 minutes.
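A minimal sketch of that last policy, assuming rack telemetry has already been scraped into a dict; the metric names (`rack_inlet_delta_c`, `avg_gpu_temp_c`) stand in for whatever your exporters actually emit:

```python
# Thermal-aware admission sketch: deny new high-power jobs when headroom is gone.
import time

DENY_WINDOW_S = 600                  # 10 minutes, per the policy above
_deny_until: dict[str, float] = {}   # rack id -> unix time until which we deny

def admit_high_power_job(rack_id: str, telemetry: dict) -> bool:
    """telemetry example: {"rack_inlet_delta_c": 8.2, "avg_gpu_temp_c": 71.0}"""
    now = time.time()
    if now < _deny_until.get(rack_id, 0):
        return False
    if telemetry["rack_inlet_delta_c"] > 10 or telemetry["avg_gpu_temp_c"] > 75:
        _deny_until[rack_id] = now + DENY_WINDOW_S
        return False
    return True
```

The same check could back an admission webhook or a custom scheduler plugin keyed on the thermal node labels mentioned above.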
Operational patterns: diagnostics, safety, and runbooks
Physical hardware adds operational complexity. Standardize diagnostics and runbooks for quick reactions.
Telemetry to collect
- GPU temperatures, power draw (per-GPU), fan and pump RPMs
- Rack inlet/outlet coolant temps and differential
- PDU current per phase and per-rack power
- BMC/Redfish health and leak detection sensors
Runbook essentials
- Automatic graceful quiesce: mark affected nodes unschedulable and start live migration of stateful operators where possible (see the sketch after this list).
- Fallback plan: switch to a secondary rack or cloud fallback if coolant path fails.
- SLA communication: expose a degradation state to users (e.g., reduced throughput, degraded latency) with clear remediation ETA.
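A sketch of the automatic quiesce step using the Kubernetes Python client; the `rack=<label>` node label is an assumption about your inventory scheme, and draining or migrating stateful operators is left to your engine:

```python
# Graceful quiesce sketch: cordon every node in an affected rack (Kubernetes).
from kubernetes import client, config

def quiesce_rack(rack_label: str) -> None:
    """Mark all nodes carrying `rack=<rack_label>` as unschedulable."""
    config.load_incluster_config()           # or load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    nodes = v1.list_node(label_selector=f"rack={rack_label}")
    for node in nodes.items:
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
        # Draining / live migration of stateful operators happens after cordoning.

# quiesce_rack("dlc-07")   # hypothetical rack label from your node inventory
```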
Edge vs. colocation: choosing the right environment
When deciding whether to push GPU-intensive query workloads to edge sites or colocation facilities, consider infrastructure readiness.
- Edge: Typically limited by space, chilled-water connections, and PDU capacity. Best for latency-sensitive, lower-density inference deployments where liquid cooling is impractical.
- Colocation: Colos increasingly offer DLC and RDHx racks with power already provisioned. They are preferable for large model training, high-throughput inference, and consolidated NVMe pools.
Map your query classes to locations: place stateful, heavy training and large-batch inference into colo racks with DLC; keep small, latency-critical routing and preprocessing at the edge to reduce egress latencies.
Observability and feedback loops
Design feedback loops that close the gap between physical metrics and software behavior. Observability is a core enabler of every pattern described above.
- Expose physical metrics in the same observability stack used by your query engine.
- Example: correlate query latency spikes with inlet temperature deltas or NVMe queue depth increases (see the sketch after this list).
- Automate policy updates using short-term ML models that predict thermal headroom based on workload patterns.
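A sketch of that correlation example against the Prometheus HTTP API; the metric names and PromQL expressions are placeholders for your own series, and it assumes Python 3.10+ for statistics.correlation:

```python
# Correlate p99 query latency with rack coolant inlet/outlet delta (sketch).
import requests
from statistics import correlation   # Pearson correlation, Python 3.10+

PROM_URL = "http://prometheus:9090"  # assumed Prometheus endpoint

def range_query(expr: str, start: int, end: int, step: str = "30s") -> list[float]:
    """Fetch one time series over a window and return its values as floats."""
    resp = requests.get(f"{PROM_URL}/api/v1/query_range",
                        params={"query": expr, "start": start, "end": end, "step": step})
    resp.raise_for_status()
    series = resp.json()["data"]["result"][0]["values"]   # [[ts, "value"], ...]
    return [float(v) for _, v in series]

if __name__ == "__main__":
    # Placeholder metric names; align the window with the incident you are studying.
    latency = range_query('histogram_quantile(0.99, rate(query_latency_seconds_bucket[5m]))',
                          start=1_700_000_000, end=1_700_003_600)
    inlet_delta = range_query('rack_coolant_outlet_c - rack_coolant_inlet_c',
                              start=1_700_000_000, end=1_700_003_600)
    print("latency vs inlet-delta correlation:", correlation(latency, inlet_delta))
```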
Concrete implementation steps (getting started checklist)
- Inventory hardware capabilities: catalog which racks support DLC vs. RDHx and capture node-level telemetry endpoints (Redfish, DCGM).
- Separate the query engine into control and data planes; deploy control-plane to conventional nodes and data-plane to liquid-cooled racks.
- Implement I/O and power budgets per workload class; enforce with cgroups, NVMe QoS, and network QoS for RDMA/Ethernet.
- Instrument telemetry into Prometheus and add scheduler admission rules that use thermal labels and headroom thresholds.
- Create runbooks for coolant failure, leak detection, and power oversubscription events. Test via drills in non-production racks.
Further reading
To understand how these infrastructure changes fit into wider platform strategy, see our pieces on Navigating the Future of AI Hardware and tools for monitoring query performance in dense environments in Observability Tools for Cloud Query Performance. For design patterns that connect static platform assumptions to dynamic workloads, see From Static to Dynamic: The Role of AI in Query System Design.
Closing thoughts
Liquid cooling and ultra-high rack density change more than data center economics — they change how developers and platform teams must think about system behavior. Treat cooling and power as first-class signals in your design, and you can unlock much higher density and sustained performance while avoiding the pitfalls of thermal and I/O contention. Start with hybrid placement, explicit I/O/power budgets, and thermal-aware scheduling: these patterns convert physical constraints into predictable, manageable software behaviors.