Deploying Apache Solr on Kubernetes: From Anti-Patterns to Production
Introduction: The Journey to Production-Ready Solr
For years, the conventional wisdom around deploying stateful applications like Apache Solr on Kubernetes was simple: "Don't." The infrastructure-as-code world treated Kubernetes as a platform designed for ephemeral, stateless workloads—the kind that could be terminated and replaced without consequence. SolrCloud, with its requirements for stable network identities, persistent storage, and careful coordination through Zookeeper, seemed fundamentally at odds with Kubernetes's cattle-not-pets philosophy.
That wisdom is now obsolete.
The maturation of the Kubernetes Operator pattern has transformed how we deploy and manage stateful systems. For Solr specifically, the Apache Solr Operator represents a production-ready, application-aware control plane that not only makes Solr deployments possible on Kubernetes but actually superior to traditional approaches. The Operator understands Solr's coordination semantics, manages safe rolling updates based on shard replica health, and provides declarative APIs for backups and monitoring—capabilities that manual StatefulSet management could never achieve safely.
This document explains the evolution of Solr deployment methods on Kubernetes, from critical anti-patterns to production-ready operator-based solutions. It maps the landscape of deployment approaches, compares the major tools (Apache Solr Operator vs. Bitnami Helm charts), and explains why Project Planton builds upon the Operator pattern to provide both Day 1 simplicity and Day 2 operational safety.
The Deployment Maturity Spectrum
Understanding where different deployment methods sit on the maturity spectrum is essential for making informed architectural decisions.
Level 0: The Anti-Pattern—Deployments and Simple Pods
Using a standard Kubernetes Deployment object for Apache Solr is not just suboptimal—it's a guaranteed path to cluster instability and data loss.
Apache Solr is an inherently stateful application. It stores persistent, indexed data on disk. A Kubernetes Deployment is designed for stateless applications that can be terminated and replaced with a new, identical pod with a new name and no persistent state. This fundamental mismatch creates two critical failure modes:
- Lack of Stable Network Identity: SolrCloud nodes must have stable, predictable network identities (e.g., `solr-0`, `solr-1`) to register with the Zookeeper ensemble, participate in leader election, and manage shard replicas. A Deployment-managed pod receives a new, random hostname on every restart. This breaks cluster coordination—Zookeeper loses track of nodes, leading to inconsistent cluster state and eventual service-wide outages.
- Lack of Stable, Unique Storage: Each Solr node must have its own unique persistent volume (PV) to store its index data. A Deployment typically uses ephemeral storage or a shared `ReadWriteMany` (RWX) volume. The correct pattern requires a unique PersistentVolumeClaim (PVC) for each pod ordinal (e.g., `data-solr-0`, `data-solr-1`).
Verdict: Never use Deployments for Solr. This approach is a critical anti-pattern that guarantees cluster instability upon pod restarts or node failures.
Level 1: The Foundation—Manual StatefulSet Configuration
The StatefulSet is the correct Kubernetes primitive for stateful applications. It provides the two guarantees that Deployments lack:
- Stable Network Identity: Pods are named with a stable ordinal (e.g., `solr-0`, `solr-1`).
- Stable, Unique Storage: Each pod gets its own unique PersistentVolumeClaim based on a template.
A StatefulSet is always paired with a Headless Service, which provides the stable DNS domain (e.g., solr-0.solr-headless.default.svc.cluster.local) that pods use for discovery.
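To make the pairing concrete, here is a minimal sketch of the two primitives together. The names (`solr`, `solr-headless`), image tag, and sizes are illustrative placeholders, not a recommended production configuration:

```yaml
# A minimal sketch of the StatefulSet + headless Service pairing.
apiVersion: v1
kind: Service
metadata:
  name: solr-headless
spec:
  clusterIP: None            # headless: gives each pod a stable DNS record
  selector:
    app: solr
  ports:
    - name: http
      port: 8983
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: solr
spec:
  serviceName: solr-headless # ties pod DNS (solr-0.solr-headless...) to the Service
  replicas: 3
  selector:
    matchLabels:
      app: solr
  template:
    metadata:
      labels:
        app: solr
    spec:
      containers:
        - name: solr
          image: solr:9.4
          ports:
            - containerPort: 8983
          volumeMounts:
            - name: data
              mountPath: /var/solr
  volumeClaimTemplates:      # one unique PVC per pod ordinal: data-solr-0, data-solr-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
```

The `volumeClaimTemplates` stanza is what delivers the stable, unique storage guarantee: Kubernetes reattaches `data-solr-0` to `solr-0` no matter where that pod is rescheduled.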
While this is the correct low-level approach, manually managing StatefulSet YAMLs for a complex, multi-component system like SolrCloud introduces severe operational pitfalls:
- Zookeeper Bootstrap Complexity: A Solr StatefulSet must not start its pods before the Zookeeper ensemble is fully ready. This requires complex, cross-StatefulSet coordination, typically managed with initContainers that poll the Zookeeper service.
- Pod Configuration Pain: Manually managing the Zookeeper StatefulSet itself is challenging. Each ZK pod must have a unique `ZOO_MY_ID` environment variable, which is non-trivial to inject correctly based on the pod's ordinal.
- Application-Unaware Updates: This is the most significant pitfall. A StatefulSet's built-in update strategies (`OnDelete` or `RollingUpdate`) are Kubernetes-aware, not Solr-aware. The default `RollingUpdate` strategy terminates pods from the highest ordinal to the lowest (e.g., `solr-N` down to `solr-0`). This "dumb" update can be catastrophic—it might kill a shard leader before its replica is healthy, leading to brief but critical client-side write errors.
Verdict: StatefulSets are necessary but insufficient. They solve the fundamental stability problem but introduce complex operational challenges that are unsafe for production without sophisticated orchestration logic.
Level 2: Packaged Deployments—Helm Charts
Helm charts, such as the popular Bitnami chart, represent the next evolutionary step. They "package" the complex set of StatefulSet, Service, ConfigMap, and PodDisruptionBudget YAMLs into a single, configurable helm install command.
This approach solves the initial creation complexity (Day 1) but does not solve the operational complexity (Day 2+). The chart stamps out the resources, but the cluster's lifecycle is still managed by Kubernetes's default StatefulSet controller. It still lacks the application-aware logic to handle updates, failures, and backups safely.
The Bitnami Solr Chart: Standalone Simplicity
The Bitnami chart is a traditional, self-contained Helm chart. It directly creates StatefulSet, Service, and other Kubernetes primitives. It can optionally bundle the bitnami/zookeeper Helm chart as a dependency, allowing a single helm install to deploy a complete cluster including Zookeeper.
Strengths:
- Simple Day 1 experience
- Single `values.yaml` file for all configuration
- Works "out of the box" for initial deployments
Weaknesses:
- No application-aware updates (uses standard StatefulSet rolling updates)
- Backup and restore are entirely manual API calls, not first-class features
- Scaling operations lack Solr-specific safety checks
Verdict: Good for getting started quickly, but lacks the operational safeguards required for production Day 2 operations.
Level 3: The Production Solution—Apache Solr Operator
The Operator pattern is the cloud-native best practice for stateful applications. An operator is a custom controller that runs in the cluster and extends the Kubernetes API with domain-specific knowledge of the application.
The Apache Solr Operator is the official, production-ready implementation of this pattern. It was started by Bloomberg and is now an official Apache Solr project. It is explicitly documented as "Production Ready" and has been "successfully used to manage production SolrClouds for some of the largest users of Solr."
How the Operator Changes Everything
Instead of managing a StatefulSet directly, users create a SolrCloud Custom Resource (CR). The Operator's controller watches for this CR and, in response, generates the necessary StatefulSet, Services, ConfigMaps, and other resources.
Crucially, the Operator's control loop continues to watch the cluster after creation. If a user changes the spec.solrImage.tag in the SolrCloud CR to trigger a version upgrade, the Operator begins a rolling update. However, instead of using the default StatefulSet logic, it intelligently queries the SolrCloud Collections API. It only takes down a pod when it can verify that it is safe for that pod's shards and replicas, preventing data unavailability.
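As a concrete illustration, a minimal SolrCloud CR might look like the following sketch. The field names follow the SolrCloud CRD as referenced in this document; the resource name and version tag are placeholders:

```yaml
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: search            # placeholder name
spec:
  replicas: 3
  solrImage:
    repository: solr
    tag: "9.4.0"          # bumping this tag triggers the Operator's managed rolling update
  zookeeperRef:
    provided:
      replicas: 3         # the Operator provisions a ZK ensemble via the Zookeeper Operator
```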
Key Operator Features
The Operator's power comes from its Custom Resource Definitions (CRDs), which provide declarative APIs for the entire Solr ecosystem:
- SolrCloud CRD: Manages the Solr cluster, including image, replicas, resources, storage, and Zookeeper connection configuration.
- SolrBackup CRD: Declaratively manages backups of collections. Supports Volume, S3, and GCS repositories, as well as scheduled, recurring backups. This transforms what would be manual `curl` commands into auditable, version-controlled YAML resources.
- SolrPrometheusExporter CRD: Declaratively manages the monitoring exporter for each cluster, enabling seamless integration with Prometheus and Grafana dashboards.
Zookeeper Integration: The Decoupled Architecture
The Solr Operator follows a best-practice, decoupled architecture for Zookeeper management. Its Helm chart recommends and depends on the Zookeeper Operator. The user's SolrCloud CR creates a ZookeeperCluster CR, delegating ZK management to a dedicated, specialized controller. This separation of concerns means Zookeeper's lifecycle can be managed independently, which is essential for large, multi-tenant environments where multiple applications (Solr, Kafka, etc.) share a single ZK ensemble.
Verdict: This is the production-ready standard. The Operator provides application-aware rolling updates, declarative Day 2 operations (backup, monitoring), official Apache support, and a robust CRD-based API. It is the recommended approach for any production deployment.
Comparative Analysis: Apache Solr Operator vs. Bitnami Helm Chart
The choice between deployment tools boils down to a "Managed" (Operator) versus "Packaged" (Helm) philosophy.
| Feature | Apache Solr Operator (Recommended) | Bitnami Helm Chart (Alternative) |
|---|---|---|
| Primary Method | Operator (Control Loop): Deploys a controller that manages Solr-aware logic | Helm Chart (Templating): Deploys StatefulSet YAMLs directly |
| Official Status | Official Apache project | Community chart by Bitnami (VMware/Broadcom) |
| Zookeeper | Decoupled: Manages ZK via the Zookeeper Operator CRD | Bundled: Manages ZK as a Helm sub-chart dependency |
| Scaling & Updates | Application-Aware: Safely drains and manages shard replicas | Kubernetes-Aware: Standard StatefulSet rolling update |
| Backup | Declarative: First-class SolrBackup CRD | Manual: Requires manual curl calls to Solr's API |
| Monitoring | Declarative: First-class SolrPrometheusExporter CRD | Enabled via Flag: metrics.enabled=true in values.yaml |
| Production Ready | Yes, explicitly stated and battle-tested | Yes, but lacks operational safeguards |
| Licensing | Apache License 2.0 | Apache License 2.0 |
The Day 1 vs. Day 2 Trade-off
Real-world experience highlights a critical distinction: the Bitnami chart is easier to get started with (Day 1), while the Operator is safer and more powerful to operate long-term (Day 2+).
Users optimizing solely for Day 1 simplicity might choose Bitnami. A platform engineering team building production infrastructure must optimize for Day 2 operations—stability, safety, and automated lifecycle management. This is why Project Planton builds upon the Operator model.
Production Zookeeper Integration
Zookeeper is the coordination service for SolrCloud. Its stability directly determines the stability of the entire cluster. Production deployments have strict, non-negotiable requirements.
Embedded vs. External Zookeeper
Apache Solr ships with an embedded Zookeeper server. This feature is provided only for development, testing, or getting started.
Using embedded Zookeeper in production is a critical failure point. The embedded ZK instance runs inside a single Solr JVM. If that Solr pod or node fails, the entire cluster's coordination service fails with it, resulting in a total service outage.
Production Requirement: All production SolrCloud deployments must use an external Zookeeper ensemble—a separate set of pods, managed by a separate StatefulSet, running only Zookeeper.
Ensemble Sizing: 3 Nodes vs. 5 Nodes
Zookeeper's high availability is based on a "quorum" mechanism. For the ZK service to be active, a majority of its nodes must be running and able to communicate. The guiding principle is to deploy $2F+1$ nodes to tolerate $F$ failures.
This "majority" rule is why an odd number of nodes is essential:
- 2 Nodes: Quorum is 2. If 1 node fails, 1 remains. 1 is not a majority of 2. Fault tolerance: 0.
- 3 Nodes: Quorum is 2. If 1 node fails, 2 remain. 2 is a majority of 3. Fault tolerance: 1.
- 4 Nodes: Quorum is 3. If 1 node fails, 3 remain. 3 is a majority of 4. If 2 nodes fail, 2 remain. 2 is not a majority of 4. Fault tolerance: 1.
- 5 Nodes: Quorum is 3. If 2 nodes fail, 3 remain. 3 is a majority of 5. Fault tolerance: 2.
A 4-node cluster provides no more fault tolerance than a 3-node cluster but adds coordination overhead.
Recommendations:
- 3 Nodes: Standard for most production clusters. Tolerates a single node failure (e.g., a pod crash or node drain).
- 5 Nodes: Recommended for very large or mission-critical clusters. Provides higher fault tolerance, allowing the cluster to survive one planned maintenance event (node drain) and simultaneously tolerate one unplanned failure (pod crash).
Adding more ZK nodes decreases write performance (as more nodes must coordinate) and only slightly increases read performance. Since ZK handles low-volume, high-importance coordination (not high-throughput data), the choice is driven by high-availability requirements rather than performance.
Deployment Mechanics
There are two primary methods to deploy the external ZK ensemble:
- Zookeeper Operator (Solr Operator's Preferred Method):
  - The `apache-solr/solr-operator` Helm chart lists the Zookeeper Operator as a dependency.
  - When deploying a SolrCloud CR, users configure the `spec.zookeeperRef.provided` block (sketched after this list).
  - The Solr Operator creates a `ZookeeperCluster` CR, which the Zookeeper Operator then reconciles into the necessary StatefulSet, Services, and ConfigMaps.
  - Benefit: Fully declarative, ecosystem-native approach where ZK lifecycle is managed as part of the Solr deployment.
- Standalone Helm (Independent/Shared Method):
  - Deploy Zookeeper independently using a standalone Helm chart (e.g., `helm install my-zk bitnami/zookeeper --set replicaCount=3`).
  - Configure the SolrCloud CR to point to the existing ensemble using `spec.zookeeperRef.connectionInfo.externalConnectionString` (sketched after this list).
  - Benefit: Correct pattern when ZK's lifecycle must be independent of any single Solr cluster—common for large, multi-tenant ZK ensembles shared by many applications (Solr, Kafka, etc.).
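Hedged sketches of both wiring patterns follow; the resource names and the ZK service address are hypothetical placeholders:

```yaml
# Method 1: Operator-provided ensemble (assumes the Zookeeper Operator is installed)
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: search
spec:
  replicas: 3
  zookeeperRef:
    provided:
      replicas: 3                # Solr Operator creates a ZookeeperCluster CR
---
# Method 2: point at an existing, independently managed ensemble
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: search-shared-zk
spec:
  replicas: 3
  zookeeperRef:
    connectionInfo:
      externalConnectionString: "my-zk-headless:2181"  # hypothetical ZK service address
      chroot: "/solr"            # isolates this cluster's znodes in a shared ensemble
```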
The Project Planton Approach
Project Planton's SolrKubernetes API is designed to provide the Day 1 simplicity of a Helm chart with the Day 2 power and safety of the Apache Solr Operator.
The Operator-as-a-Service Model
The SolrKubernetes resource is a high-level abstraction that generates and manages the underlying SolrCloud and ZookeeperCluster Custom Resources. This approach:
- Hides Complexity: Users interact with a simplified API surface (10-12 core fields) rather than the hundreds of knobs exposed by the underlying CRDs.
- Preserves Power: All the Operator's application-aware logic—safe rolling updates, declarative backups, monitoring—remains intact.
- Ensures Safety: The API prevents users from misconfiguring critical settings (e.g., disabling the Operator's managed update strategy) that could compromise cluster stability.
The 80/20 Configuration Philosophy
Just as Project Planton's APIs focus on the 20% of configuration that 80% of users need, the SolrKubernetes API exposes only the essential fields:
Essential Solr Configuration:
- `replicas`: The number of Solr nodes (maps to `spec.replicas` in the SolrCloud CR)
- `image`: Container image repository and tag
- `resources`: Standard Kubernetes CPU/memory requests and limits
- `disk_size`: Persistent storage size (production requires persistent storage)
- `java_mem`: Simplified JVM heap configuration (e.g., "8g")
Essential Zookeeper Configuration:
- `replicas`: Number of ZK nodes (default: 3, the production minimum for HA)
- `resources`: CPU/memory allocations
- `disk_size`: Persistent storage for ZK data
Ingress Configuration:
- `ingress`: Standard Kubernetes ingress specification for external access
What's Not Exposed (By Design)
The v1 API intentionally excludes "expert-only" fields that could lead to misconfigurations:
- Custom Solr modules and additional libraries (plugin loading)
- Update strategy overrides (the Operator's default "Managed" strategy is optimal)
- Advanced GC tuning parameters (basic heap sizing covers most use cases)
- Low-level Kubernetes customization escape hatches
These omissions are deliberate. The goal is to make it nearly impossible for users to create unsafe configurations while still supporting the vast majority of production use cases.
Reference Configurations
These examples demonstrate how the simplified API supports different deployment scales:
Dev Cluster (Minimal Cost HA):
```yaml
solr_container:
  replicas: 1
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
  disk_size: "20Gi"
  image:
    repo: "apache/solr"
    tag: "9.4.0"
  config:
    java_mem: "-Xms2g -Xmx2g"
zookeeper_container:
  replicas: 3 # ZK HA is still required
  resources:
    requests:
      cpu: "100m"
      memory: "512Mi"
  disk_size: "2Gi"
```
Staging Cluster (Scaled-Down Prod):
```yaml
solr_container:
  replicas: 3
  resources:
    requests:
      cpu: "2"
      memory: "10Gi"
    limits:
      memory: "10Gi"
  disk_size: "100Gi"
  image:
    repo: "apache/solr"
    tag: "9.4.0"
  config:
    java_mem: "-Xms8g -Xmx8g"
zookeeper_container:
  replicas: 3
  resources:
    requests:
      cpu: "250m"
      memory: "1Gi"
  disk_size: "10Gi"
```
Production Cluster (HA & Resilient):
```yaml
solr_container:
  replicas: 5
  resources:
    requests:
      cpu: "4"
      memory: "24Gi" # Higher than heap for OS file system cache
    limits:
      memory: "24Gi"
  disk_size: "500Gi"
  image:
    repo: "apache/solr"
    tag: "9.4.0"
  config:
    java_mem: "-Xms16g -Xmx16g"
zookeeper_container:
  replicas: 5 # Higher HA, tolerating F=2 failures
  resources:
    requests:
      cpu: "500m"
      memory: "2Gi"
  disk_size: "20Gi"
```
Note: In the production configuration, total pod memory is set higher than the JVM heap to leave room for the OS file system cache, which Solr relies on heavily for performance.
Day 2 Operations
A production-ready platform must account for ongoing operational needs beyond initial deployment.
High Availability and Scaling
High availability in this architecture is achieved through three layers:
- Coordination HA: A 3-node or 5-node Zookeeper ensemble ensures the cluster's "brain" is fault-tolerant.
- Node Resilience: StatefulSet + PVCs ensure that if a Solr pod fails, Kubernetes reschedules it and reconnects it to its existing persistent volume.
- Data Replication: Within SolrCloud, collections must be configured with `replicationFactor=2` or higher, ensuring that if one Solr node (holding a replica) dies, another node still holds a copy of that shard.
The Solr Operator provides intelligent, application-aware scaling. When you increase replicas, a new node is added. When you decrease replicas, the Operator first tries to safely move shards off that node before decommissioning it. This Solr-specific logic is far superior to a blind kubectl scale command, which could lead to data loss or unavailability.
Disaster Recovery: Backups and Restore
Backup (Declarative): The Operator provides the SolrBackup CRD, making backups a declarative operation. Users define a repository (Volume, GCS, or S3) in the SolrCloud CR's spec.backupRepositories, then create a SolrBackup resource. The CRD also supports scheduled, recurring backups via the spec.recurrence field.
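For illustration, a recurring backup might be declared as follows. The field names follow the SolrBackup CRD described above; the cloud, collection, and repository names are hypothetical, and the repository is assumed to have been defined under `spec.backupRepositories` in the SolrCloud CR:

```yaml
apiVersion: solr.apache.org/v1beta1
kind: SolrBackup
metadata:
  name: nightly-backup
spec:
  solrCloud: search            # hypothetical SolrCloud resource name
  collections:
    - products                 # hypothetical collection
  repositoryName: s3-backups   # must match a repository in spec.backupRepositories
  recurrence:
    schedule: "0 2 * * *"      # cron schedule: daily at 02:00
    maxSaved: 7                # retain the last 7 backups
```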
Restore (Manual API Call): There is currently no SolrRestore CRD. The Operator manages backups but not restores. The restore operation is a standard Solr Collections API call (action=RESTORE) that must be triggered manually:
```bash
curl 'http://<solr-service>/solr/admin/collections?action=RESTORE&name=<backup-name>&collection=<new-collection-name>&repository=<repo-name>'
```
This is a known limitation of the current Operator ecosystem, and users must be aware of the manual restore process.
Monitoring and Alerting
The Operator provides a declarative SolrPrometheusExporter CRD. Users create this resource pointing to their SolrCloud instance, and the Operator deploys the exporter. The cluster's Prometheus instance scrapes the /metrics endpoint, and users can import the official Solr Grafana dashboard for instant visibility.
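A minimal exporter resource might look like this sketch; the target name is a placeholder for the SolrCloud resource being monitored:

```yaml
apiVersion: solr.apache.org/v1beta1
kind: SolrPrometheusExporter
metadata:
  name: search-metrics
spec:
  solrReference:
    cloud:
      name: search    # the SolrCloud resource to scrape
  numThreads: 4       # parallelism for metrics collection
```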
Key Metrics to Alert On:
- `solr_ping`: Cluster liveness
- Query Latency (p95, p99): The most important user-facing metric
- Cache Hit Ratio (e.g., `filterCache`, `queryResultCache`): Low hit ratios indicate performance tuning issues
- JVM Heap Usage & GC Pauses: High heap usage or long GC pauses are precursors to OutOfMemoryErrors
Security Hardening
Production clusters must not be open to the public. The Operator has first-class support for:
- TLS (Encryption): Via `SolrCloud.spec.solrTLS`, with support for auto-generated self-signed certificates, cert-manager integration, or user-provided Kubernetes Secrets.
- Authentication: Basic Authentication can be bootstrapped via `SolrCloud.spec.solrSecurity.authenticationType: Basic`, storing users and hashed passwords in a `security.json` file (see the sketch after this list).
- Authorization: Rule-based authorization restricts access to collections or APIs.
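As a sketch, the relevant fragment of a SolrCloud spec might combine these settings as follows. The Secret names are hypothetical and assume a PKCS12 keystore provisioned separately (for example, by cert-manager):

```yaml
# Fragment of a SolrCloud spec (not a complete resource)
spec:
  solrSecurity:
    authenticationType: Basic      # Operator bootstraps security.json with Basic auth users
  solrTLS:
    pkcs12Secret:
      name: solr-tls-keystore      # hypothetical Secret containing the keystore
      key: keystore.p12
    keyStorePasswordSecret:
      name: solr-tls-keystore-password
      key: password-key
```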
Networking: Ingress and External Access
Solr clients (e.g., web applications) often run outside the Kubernetes cluster. The standard cloud-native way to expose HTTP services is through an Ingress resource—a "smart" L7 reverse proxy that allows multiple services to share a single load balancer using hostnames or URL paths.
The Solr Operator has native support for this via spec.solrAddressability.external.method: Ingress. This integrates cleanly with ingress controllers (e.g., ingress-nginx) and cert-manager for automated TLS certificate provisioning.
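An illustrative addressability fragment is shown below, with a placeholder domain and a hypothetical cert-manager issuer annotation passed through to the managed Ingress:

```yaml
# Fragment of a SolrCloud spec (not a complete resource)
spec:
  solrAddressability:
    podPort: 8983
    external:
      method: Ingress
      domainName: search.example.com     # placeholder domain
      useExternalAddress: true
  customSolrKubeOptions:
    ingressOptions:
      annotations:
        cert-manager.io/cluster-issuer: letsencrypt-prod   # hypothetical issuer
```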
TLS Termination Patterns:
- End-to-End Encryption: Solr pods have TLS enabled, and the Ingress passes through encrypted traffic (most secure, more complex).
- TLS Termination at Ingress: The Ingress handles the public TLS certificate, decrypts traffic, and forwards it to Solr pods as unencrypted HTTP (common, practical pattern).
For automated TLS, the Operator supports passing cert-manager annotations through to the managed Ingress resource, enabling fully automated certificate provisioning.
Conclusion: Production-Ready Solr on Kubernetes
The journey from anti-patterns to production-ready Solr deployments on Kubernetes reflects the broader maturation of the cloud-native ecosystem. What was once considered risky or infeasible is now not only possible but represents best-in-class infrastructure.
The Apache Solr Operator, as an official Apache project backed by production deployments at some of the largest Solr users, embodies this maturation. It provides the application-aware intelligence necessary to safely manage stateful systems at scale. By combining stable network identities, persistent storage, and sophisticated rolling update logic that understands Solr's coordination semantics, the Operator transforms Kubernetes from a challenging platform for Solr into an ideal one.
Project Planton's SolrKubernetes API builds upon this foundation, providing a simplified abstraction that makes the Operator's power accessible without exposing its complexity. This "Operator-as-a-Service" model delivers both the Day 1 simplicity that developers want and the Day 2 operational safety that production environments demand.
The result is a deployment approach that is simultaneously simple, powerful, and—most importantly—production-safe.