Deploying ClickHouse on Kubernetes: From Anti-Patterns to Production-Ready Solutions
Introduction
For years, conventional wisdom suggested that stateful databases and Kubernetes didn't mix. The early narratives warned of unpredictable pod scheduling, ephemeral storage, and operational complexity. Yet here we are: distributed databases like ClickHouse are not just running on Kubernetes—they're thriving there, powering real-time analytics at organizations like Vimeo, eBay, Twilio, Lyft, PostHog, and Pulumi.
What changed? The answer lies not in Kubernetes itself, but in how we deploy databases on it. The journey from "StatefulSet + manual configuration" to "production-grade operator" represents a fundamental shift in abstraction level—from managing infrastructure primitives to encoding operational knowledge as software.
This document explains the deployment methods available for running ClickHouse on Kubernetes, evaluates them against production requirements, and details why the Altinity ClickHouse Operator is the industry standard for self-hosted ClickHouse.
The Deployment Maturity Spectrum
Running a distributed, stateful system like ClickHouse on Kubernetes presents a spectrum of approaches, each representing a different level of operational maturity.
Level 0: The Anti-Pattern (Manual StatefulSets)
What it is: Deploying ClickHouse using raw Kubernetes StatefulSets, manually configuring replication, sharding, and cluster membership.
The appeal: StatefulSets are Kubernetes' native primitive for stateful workloads, providing stable pod identities and persistent storage. New users are naturally drawn to this "pure Kubernetes" approach.
The fatal flaw: StatefulSets are infrastructure-level primitives. They understand Pods and PersistentVolumeClaims, but have zero domain knowledge of ClickHouse's application-level logic—replication, sharding, schema propagation, distributed DDL, or user management.
The evidence: The ClickHouse Cloud team learned this lesson the hard way. They initially "leveraged StatefulSets to control server pods," which worked fine for basic operations. However, their decision to manage all replicas with a single StatefulSet created a critical bottleneck: vertically scaling the cluster (changing instance sizes) required a slow, one-at-a-time rolling restart of the entire StatefulSet. This architecture proved "operationally crippling at scale."
A mature operator avoids this pitfall by composing multiple StatefulSet resources (one per shard), enabling granular and parallelized operations.
Verdict: Using only StatefulSets forces human operators to manually implement complex distributed database logic. This approach is provably insufficient for "Day 2" operations like safe upgrades, dynamic scaling, schema propagation, and user management.
Level 1: The Day 0 Stop-Gap (Helm Charts)
What it is: Using general-purpose Helm charts (like the Bitnami ClickHouse chart) to bootstrap a ClickHouse cluster.
What it solves: Helm charts excel at "Day 0" tasks—templating manifests, packaging dependencies, and providing a one-command installation experience. They're effective for getting started quickly.
What it doesn't solve: The "Day 2" operational challenge. A database is not a fire-and-forget deployment; it's a persistent, evolving system requiring ongoing lifecycle management.
As community discussions note, Helm "can't handle the dynamic management required to run databases properly." Helm is not a reconciliation loop. It cannot:
- Manage high-availability failover automatically
- Declaratively update user configurations on live clusters
- Safely orchestrate version upgrades with zero downtime
- Propagate schema changes to newly scaled nodes
- React to drift between desired and actual state
The tacit admission: Even the Bitnami chart's own documentation discourages using its resourcesPreset values for production, stating they "may not fully adapt to your specific needs." This effectively abandons users at the most critical juncture—configuring the database for production reliability.
Verdict: Helm charts are useful bootstrapping tools, but they're fundamentally Day 0 solutions. For a stateful, production database requiring continuous operational intelligence, they're necessary but insufficient.
Level 2: The Production Standard (The Kubernetes Operator Pattern)
What it is: A purpose-built Kubernetes Operator that encodes operational expertise as software, providing a high-level Custom Resource API and running a continuous reconciliation loop.
What it solves: The Operator Pattern bridges the gap between infrastructure primitives and application semantics. An operator:
- Watches for high-level Custom Resources (like
ClickHouseInstallation) - Compares the desired state in those resources against actual cluster state
- Takes application-specific actions to reconcile the difference
- Does this continuously, automatically handling drift and failures
For ClickHouse, this automated operational logic includes:
- Complex Topology Management: Automatically configuring sharding and replication
- Schema Propagation: Applying
ON CLUSTERDDL statements to all nodes, including newly scaled ones - Declarative Management: Managing users, profiles, quotas, and configuration files as code
- Zero-Downtime Lifecycle: Orchestrating upgrades and scaling operations aware of ClickHouse's internal state
- Intelligent Restarts: Understanding which configuration changes require restarts versus hot-reloads
The paradigm shift: An operator transforms database operations from imperative commands ("run this script to add a replica") to declarative intent ("this cluster should have 3 replicas"). Kubernetes takes care of making reality match the declaration.
Verdict: For production ClickHouse on Kubernetes, the operator pattern is not just best practice—it's the only viable approach for teams that want to sleep at night.
The Altinity ClickHouse Operator: Industry Standard
Among operator implementations for ClickHouse, the Altinity ClickHouse Operator stands out as the definitive, production-grade standard.
Production Pedigree
Maturity: The operator celebrated its 5th birthday in 2024 and has been production-ready for years. It's estimated to manage "tens of thousands of ClickHouse servers worldwide."
Battle-Tested at Scale: Publicly disclosed production users include major technology organizations: Vimeo, eBay, Twilio, Lyft, PostHog, and Pulumi. These aren't pilot projects—these are high-stakes, mission-critical analytics workloads.
Community Consensus: Within the ClickHouse engineering community, the Altinity operator is consistently recommended as "the easiest way" and the default "go-to" solution for running ClickHouse on Kubernetes.
100% Open Source: The Anti-Lock-In Guarantee
The Altinity operator's licensing model is a critical strategic advantage for any organization concerned about vendor lock-in or project abandonment.
The License: The operator's entire Go codebase is licensed under the permissive Apache 2.0 license. This includes the source code, container images, and all features—there are no "enterprise-only" capabilities hidden behind paywalls.
Image Freedom: The operator can manage any ClickHouse container image:
- Official
clickhouse/clickhouse-serverimages (Apache 2.0) - Altinity's own "Stable Builds" (
altinity/clickhouse-server, also Apache 2.0, offering LTS) - Third-party images like Chainguard's hardened builds
There is zero lock-in to Altinity's container images or build pipeline.
The Virtuous Cycle: Altinity's business model is not "open core." They don't withhold features to upsell enterprise licenses. Instead, their revenue comes from:
- Altinity.Cloud: A managed SaaS/BYOC platform that runs the 100% open-source operator on behalf of customers
- Enterprise Support: Commercial support contracts for organizations running the operator themselves
This is "dogfooding" at its purest. Altinity's commercial success is directly dependent on the open-source operator being excellent. Every feature they build for their managed service must be open-sourced because the service literally runs on the operator. This alignment creates a powerful guarantee: the operator will remain actively maintained, feature-rich, and production-grade for as long as Altinity is in business.
This stands in sharp contrast to vendors who develop critical features as cloud-only exclusives or move components into closed-source "enterprise editions."
Core Architecture: CHI and CHK
The operator's power comes from its two key Custom Resource Definitions (CRDs).
ClickHouseInstallation (CHI)
The ClickHouseInstallation (CHI) is the primary CRD. A single CHI manifest declaratively defines an entire ClickHouse cluster—its topology, storage, configuration, users, and more.
The operator's reconciliation loop watches CHI resources and translates them into lower-level Kubernetes primitives:
- StatefulSet(s): To manage ClickHouse server pods (typically one per shard)
- PersistentVolumeClaim(s): For persistent storage, created from
volumeClaimTemplatesin the CHI - Service(s): For stable network endpoints
- ConfigMap(s): To dynamically generate and inject ClickHouse configurations (
config.d,users.d)
ClickHouseKeeperInstallation (CHK)
The ClickHouseKeeperInstallation (CHK) CRD represents a massive architectural evolution and is one of the operator's most significant recent improvements.
The Old Problem: ClickHouse requires a coordination store (like ZooKeeper) for replication. Historically, this was a major operational burden. Users had to separately deploy, secure, and maintain a ZooKeeper ensemble, then configure their CHI to connect to it. This added significant complexity and multiple failure domains.
The New Solution: ClickHouse now offers ClickHouse Keeper—a built-in, more efficient, ZooKeeper-compatible coordination store. As of operator version 0.24.0, Altinity introduced the CHK CRD.
Now, users can:
- Deploy a CHK resource to get a fully-managed Keeper ensemble
- Deploy a CHI resource that simply points to the CHK's Kubernetes service name
The operator manages the entire stack, end-to-end. This eliminates the ZooKeeper operational burden and dramatically simplifies the deployment topology.
Important: Any production-grade IaC tool must support deploying both CHI and CHK resources to be considered feature-complete.
Key Features: Encoded Operational Intelligence
The operator's deep domain knowledge of ClickHouse is evident in its advanced lifecycle management capabilities:
Intelligent Configuration Management: The operator doesn't just apply configuration changes—it's intelligent about the process. It understands which ClickHouse settings support hot-reload versus which require a pod restart, thereby minimizing disruption.
Zero-Downtime Upgrades: Upgrading a cluster is as simple as changing the image: tag in the CHI manifest. The operator orchestrates a safe, rolling upgrade across all nodes, respecting replication and shard topology.
Smart Scaling: The operator manages both horizontal scaling (adding shards/replicas) and includes automatic schema propagation. When new nodes join, the operator ensures they receive the cluster's current schema via ON CLUSTER DDL statements.
Non-Disruptive Storage Resizing: For storage types that support it, the operator can resize PersistentVolumeClaims without requiring pod restarts (as of version 0.20+). This is a significant improvement over manual StatefulSet management.
Declarative User Management: ClickHouse users, profiles, and quotas are defined directly in the CHI manifest. Passwords are securely sourced from Kubernetes Secrets. The operator keeps user state synchronized automatically.
Limitations and Operational Sharp Edges
The operator is powerful, but power comes with responsibility. There are several operational pitfalls users must understand:
Storage Reclaim Policy: The volumeClaimTemplates section in a CHI manifest includes a reclaimPolicy field. If not explicitly set to Retain, it defaults to the StorageClass setting—often Delete. This means a kubectl delete chi my-cluster command will also delete all PersistentVolumeClaims and data. Always set reclaimPolicy: Retain for production clusters.
Deletion Order (Finalizer Issue): CHI resources use Kubernetes finalizers to ensure clean shutdown. The operator removes these finalizers during deletion. If you delete the operator's Deployment before deleting CHIs, those CHI resources (and their namespace) will get stuck in Terminating state forever. Correct deletion order: (1) Delete CHI/CHK resources, (2) Wait for complete deletion, (3) Uninstall operator.
Self-Hosted Responsibility: The operator automates lifecycle management, but you still own the full operational stack: backup strategies, monitoring, alerting, security hardening, and underlying Kubernetes infrastructure.
Alternative: KubeBlocks (The Generalist)
The primary alternative to the Altinity operator is KubeBlocks—a "multi-database operator" that provides a unified API for managing 35+ database types including ClickHouse, PostgreSQL, MySQL, and Kafka.
The Trade-off:
- Altinity (Specialist): Purpose-built for only ClickHouse. Deep, nuanced features like native CHK support and intelligent configuration reloads.
- KubeBlocks (Generalist): Unified API across many databases. Attractive for enterprises managing diverse database fleets from a single control plane.
The Signal: Platforms like Cozystack bundle the Altinity Operator as their "Managed App" for ClickHouse rather than using a generalist tool. This is a strong third-party endorsement of Altinity as the best-in-breed component for production ClickHouse.
Verdict: For users focused on providing a production-grade ClickHouse service (rather than managing a zoo of different databases), the specialist Altinity operator delivers deeper domain expertise and better operational outcomes.
Self-Hosted vs. Managed Services (DBaaS)
The Altinity operator represents the "Build" side of the classic "Build vs. Buy" decision.
Managed Services (ClickHouse Cloud, Aiven, Altinity.Cloud):
- Pros: Full operational abstraction. Vendor manages HA, backups, upgrades, security, observability. Lower Total Cost of Ownership when engineering time is factored in.
- Cons: Higher infrastructure costs. Potential vendor lock-in. Less control over data placement, network architecture, and feature timelines.
Self-Hosted (Altinity Operator):
- Pros: Full control over deployment, data sovereignty, and infrastructure. No vendor lock-in—complete data portability. Open-source foundation aligns with organizational policies.
- Cons: Higher operational responsibility. You own backup strategies, monitoring, security hardening, and infrastructure management.
The Strategic Choice: For organizations with:
- Specific security, compliance, or data sovereignty requirements
- Existing Kubernetes expertise and operations
- Strong preference for open-source, vendor-neutral infrastructure
- Need for complete data portability
...the Altinity operator is the only solution that delivers production-grade ClickHouse while preserving those values.
Installation and Configuration
The Two-Step Installation Process (Critical)
The operator's Helm chart intentionally does not include CRD manifests. This prevents helm uninstall from accidentally deleting CRDs, which would orphan all running ClickHouse clusters.
Correct Installation/Upgrade Procedure:
- Step 1: Manually
kubectl applythe CRD definitions from the new version's GitHub release - Step 2: Run
helm upgrade --installto deploy the operator Deployment
# Step 1: Apply CRDs
kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/0.25.5/deploy/operator/clickhouse-operator-install-crds.yaml
# Step 2: Install/Upgrade Operator
helm upgrade --install clickhouse-operator \
altinity/altinity-clickhouse-operator \
--version 0.25.5 \
--namespace clickhouse \
--create-namespace \
-f values.yaml
IaC Requirement: Any automation must implement this two-step process. A simple
helm installwrapper is insufficient and will fail on future upgrades.
Multi-Tenancy and Security
The most critical security decision is controlled by a single Helm value: rbac.namespaceScoped.
Default (false): The operator installs with cluster-wide permissions (ClusterRole/ClusterRoleBinding). It watches for and manages CHI resources in all namespaces.
Secure Multi-Tenant (true): Set rbac.namespaceScoped: true to create a namespaced Role/RoleBinding. The operator can only watch and manage CHI resources in its own namespace.
For secure, multi-tenant environments, always set rbac.namespaceScoped: true.
Minimal Production Configuration
Here's a minimal, production-grade configuration:
# values.yaml
rbac:
namespaceScoped: false # Set to 'true' for multi-tenant environments
serviceAccount:
create: true
operator:
image:
tag: "0.25.5" # Pin the version
resources: # Essential for production stability
requests:
cpu: 500m
memory: 256Mi
limits:
cpu: 1
memory: 512Mi
GitOps Integration
The operator is explicitly designed for GitOps workflows. The standard pattern is a two-level "App of Apps":
- Application 1 (Operator): An ArgoCD/Flux Application targeting the Altinity Helm repository
- Application 2 (Clusters): A second Application targeting a Git repo containing your CHI/CHK manifests as YAML
This allows the entire stack—both the operator and the clusters it manages—to be declared in Git as the source of truth.
Ecosystem Integration
Monitoring (Prometheus/Grafana)
The operator provides "batteries-included" monitoring. With metrics.enabled: true (the default), the operator automatically deploys with a built-in metrics-exporter sidecar that queries ClickHouse system tables and exposes Prometheus-compatible metrics.
For Prometheus Operator:
serviceMonitor:
enabled: true
For Grafana:
dashboards:
enabled: true
The operator will create ServiceMonitor CRs and ConfigMaps with pre-built dashboard JSON.
Backup and Restore
The operator facilitates backups but doesn't perform them directly. The community-standard tool is altinity/clickhouse-backup (also Apache 2.0).
Production Pattern:
- Sidecar: Add the
altinity/clickhouse-backupcontainer to your CHI'spodTemplates, mounting the same data volume as ClickHouse - CronJob: Deploy a separate Kubernetes CronJob that executes backup commands via
kubectl exectargeting the sidecar
This pattern enables scheduled, automated backups to object storage (S3, GCS, etc.).
Multi-Region Deployment
Cross-region replication is supported but requires careful architecture. The coordination store (ClickHouse Keeper) is extremely sensitive to network latency and should not be stretched across regions.
Recommended Architecture:
- Deploy CHK in a single region (e.g.,
us-east-1) - Deploy primary CHI in
us-east-1, pointing to the local Keeper - Deploy a read-replica CHI in
us-west-2, configured to connect back to theus-east-1Keeper
This provides low-latency reads in the secondary region but creates a single point of failure for writes if the primary region fails.
Project Planton's Choice
For Project Planton, the Altinity ClickHouse Operator is the only viable choice for providing ClickHouse on Kubernetes. This decision is based on:
1. Production Maturity: Five years of battle-testing at scale by organizations like eBay, Twilio, and Lyft.
2. Open-Source Alignment: 100% Apache 2.0 licensed, with a business model ("dogfooding") that guarantees long-term project health without vendor lock-in.
3. Complete Lifecycle Management: Native support for both ClickHouse (CHI) and ClickHouse Keeper (CHK), eliminating the ZooKeeper operational burden.
4. Operational Intelligence: Features like smart restarts, schema propagation, and non-disruptive storage resizing demonstrate deep, encoded domain expertise that manual approaches simply cannot match.
5. Community Standard: When other platforms (like Cozystack) choose to bundle the Altinity operator rather than build their own or use a generalist tool, that's a powerful market signal.
The Abstraction Layer
Project Planton's API abstracts the operator's Helm chart complexity into approximately six essential fields:
| Field | Purpose |
|---|---|
namespace | Where to install the operator |
version | Operator version (maps to image.tag) |
namespace_scoped | Multi-tenancy control (maps to rbac.namespaceScoped) |
resources | CPU/Memory for operator pod |
service_account | ServiceAccount configuration |
install_crds | [Value-Add] Automates the two-step CRD installation |
The install_crds field is particularly valuable—it automates the complex "CRD-first, then-Helm" installation process that trips up many users.
Conclusion: From Infrastructure to Intelligence
The evolution from manual StatefulSets to the Altinity Operator represents more than a technical upgrade—it's a fundamental shift in how we think about database operations.
StatefulSets give us infrastructure. The operator gives us intelligence.
Where StatefulSets say "here's a pod with stable storage," the operator says "here's a sharded, replicated ClickHouse cluster that automatically heals, scales, and upgrades itself while preserving data consistency and zero downtime."
That's not automation—that's operational expertise encoded as software. And for production ClickHouse on Kubernetes, that's exactly what you need.
With five years of production validation, tens of thousands of managed servers worldwide, and a virtuous-cycle open-source business model, the Altinity ClickHouse Operator isn't just the best choice for self-hosted ClickHouse on Kubernetes.
It's the only choice for teams that take production seriously.
Next article