GKE Control Plane: From Pet Clusters to Cattle Infrastructure
Introduction
For the first decade of cloud-native infrastructure, Kubernetes clusters were treated like pets—carefully hand-crafted, lovingly maintained, and nearly impossible to reproduce. Engineers spent hours clicking through cloud consoles, tweaking CIDR blocks, configuring VPC peering, and managing IAM policies. The result? Snowflake clusters that drifted from their documentation, required tribal knowledge to debug, and made disaster recovery a nightmare.
Google Kubernetes Engine (GKE) changed the conversation by moving the operational burden of Kubernetes control plane management to Google. But GKE alone doesn't solve the infrastructure-as-code problem. The cluster still needs to be provisioned, configured, and integrated with networking, security, and identity systems. Without automation, you're back to click-ops and configuration drift.
This is where declarative infrastructure comes in. GKE clusters should be cattle, not pets—defined in code, version-controlled, reproducible, and ephemeral. If a cluster fails or drifts from spec, you destroy and recreate it from the same source of truth. No tribal knowledge required.
The GKE Control Plane Problem Space
A production-ready GKE cluster is not just "create cluster and go." It involves:
- Network Architecture: VPC-native networking with IP address management for pods and services (secondary ranges), private vs public endpoints, master authorized networks
- Security Posture: Private clusters (nodes without public IPs), Workload Identity (IAM for pods), service account configuration, Binary Authorization
- Connectivity: Cloud NAT for outbound internet access, VPC peering for private services, firewall rules
- Operational Concerns: Release channels (auto-upgrades), maintenance windows, node management policies, logging and monitoring integration
- Capacity Planning: Regional vs zonal clusters, master CIDR allocation, scaling considerations
The reality: getting all of this right is hard. Miss a secondary IP range configuration? Your pods can't schedule. Forget Cloud NAT? Your private nodes can't pull Docker images. Misconfigure Workload Identity? Your pods can't access GCP services.
Project Planton's GcpGkeCluster API abstracts this complexity into a focused, production-ready specification. This document explains why GKE is designed this way, how deployment methods evolved, and what the 80/20 scoping decisions mean for your infrastructure.
The Evolution: From Manual to Declarative
Level 0: Manual Console Provisioning (The Anti-Pattern)
What it is: The Google Cloud Console's "Create Cluster" wizard. You click through 12+ screens, selecting region, network, IP ranges, add-ons, security options, and finally hit "Create."
When it's used: During initial GKE exploration, learning, or proof-of-concept work.
Why it fails in production:
- Not repeatable: Can you recreate this cluster next week with identical settings? Only if you screenshot every screen.
- Not auditable: No paper trail of what changed or why.
- Not versionable: Can't use Git, can't do code review, can't roll back.
- Drift prone: Manual changes accumulate, documentation becomes stale, "production" becomes a mystery box.
Real-world consequence: The "we need to rebuild prod but don't remember how it was configured" incident. This happens more often than engineering leaders admit.
Verdict: Acceptable for learning, completely unacceptable for production.
Level 1: Scripted CLI (Imperative Automation)
What it is: Using gcloud container clusters create with a long list of flags:
```bash
gcloud container clusters create prod-cluster \
  --region=us-central1 \
  --network=my-vpc \
  --subnetwork=my-subnet \
  --cluster-secondary-range-name=pod-range \
  --services-secondary-range-name=svc-range \
  --enable-private-nodes \
  --no-enable-private-endpoint \
  --master-ipv4-cidr=172.16.0.0/28 \
  --enable-ip-alias \
  --workload-pool=my-project.svc.id.goog \
  --enable-network-policy \
  --release-channel=regular \
  --enable-autorepair \
  --enable-autoupgrade \
  --no-enable-basic-auth \
  --no-issue-client-certificate
```
The advancement: It's scriptable. You can version-control the bash script, run it in CI/CD, and theoretically recreate clusters.
The limitations:
- Imperative, not declarative: The command says "do this action" but doesn't maintain state. If someone manually changes a cluster setting via the console, your script doesn't know.
- No drift detection: You'd need to write custom logic to compare actual cluster config against desired state.
- Updates are manual: Changing cluster settings requires separate `gcloud container clusters update` commands. You must calculate the delta between current and desired state yourself.
- Idempotency is DIY: You have to check "does this cluster already exist?" before running create (see the sketch below).
- Flag explosion: That wall of flags only covers the basics. Production clusters need 30+ flags.
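To make the script safe to re-run, teams end up wrapping the create call in an existence check. A minimal sketch of that DIY idempotency (cluster name and region are illustrative):

```bash
# Only create the cluster if it doesn't already exist.
if gcloud container clusters describe prod-cluster --region=us-central1 >/dev/null 2>&1; then
  echo "cluster exists; computing what changed is left as an exercise for your script"
else
  gcloud container clusters create prod-cluster --region=us-central1 # ...plus 30+ more flags
fi
```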
Verdict: Better than clicking, but still far from production-grade infrastructure management.
Level 2: Infrastructure as Code (The Production Standard)
This is where real infrastructure automation begins. IaC tools are:
- Declarative: You define the desired end state, not the steps to get there
- Stateful: They track what currently exists in the cloud
- Lifecycle-aware: They compute diffs and execute only necessary API calls
- Drift-correcting: They detect when reality diverges from code and can reconcile
Terraform (Industry Standard)
Terraform is the dominant IaC tool for multi-cloud infrastructure. You define resources in HashiCorp Configuration Language (HCL):
resource "google_container_cluster" "primary" {
name = "prod-cluster"
location = "us-central1"
network = google_compute_network.vpc.id
subnetwork = google_compute_subnetwork.primary.id
# Remove default node pool immediately (best practice)
remove_default_node_pool = true
initial_node_count = 1
# Private cluster configuration
private_cluster_config {
enable_private_nodes = true
enable_private_endpoint = false
master_ipv4_cidr_block = "172.16.0.0/28"
}
# VPC-native networking (required for modern GKE)
ip_allocation_policy {
cluster_secondary_range_name = "pod-range"
services_secondary_range_name = "svc-range"
}
# Workload Identity (IAM for Pods)
workload_identity_config {
workload_pool = "${var.project_id}.svc.id.goog"
}
# Network policy enforcement (Calico)
addons_config {
network_policy_config {
disabled = false
}
}
# Auto-upgrade strategy
release_channel {
channel = "REGULAR"
}
}
On terraform apply, Terraform:
- Reads its state file (the last-known cloud state)
- Queries GCP APIs (the current actual state)
- Compares desired state (your HCL) against actual state
- Generates an execution plan showing what will change
- Executes only the necessary API calls (create/update/delete)
The critical advantage: If you change master_ipv4_cidr_block, Terraform knows this forces cluster recreation (GKE API limitation). If you change release_channel, Terraform knows this is an in-place update. You don't have to calculate the delta—Terraform does.
Production pattern: Terraform state is stored remotely (GCS bucket), enabling team collaboration. State locking prevents concurrent modifications. Modules enable reusability across environments (dev/staging/prod).
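A minimal sketch of that remote-state setup (the bucket name and prefix are placeholders you would replace):

```hcl
terraform {
  backend "gcs" {
    bucket = "my-org-terraform-state" # pre-created GCS bucket for state
    prefix = "gke/prod-cluster"       # one state path per cluster/environment
  }
}
```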
OpenTofu (Open-Source Terraform Fork)
Identical to Terraform in functionality and HCL syntax. Born from the community response to HashiCorp's 2023 license change (BSL). For organizations requiring open-source governance, OpenTofu is a drop-in Terraform replacement.
Pulumi (Modern IaC with General-Purpose Languages)
Pulumi brings infrastructure-as-code into the world of real programming languages (Python, TypeScript, Go, C#):
```python
import pulumi_gcp as gcp

cluster = gcp.container.Cluster(
    "prod-cluster",
    name="prod-cluster",
    location="us-central1",
    network=vpc.id,
    subnetwork=subnet.id,
    remove_default_node_pool=True,
    initial_node_count=1,
    private_cluster_config=gcp.container.ClusterPrivateClusterConfigArgs(
        enable_private_nodes=True,
        enable_private_endpoint=False,
        master_ipv4_cidr_block="172.16.0.0/28",
    ),
    ip_allocation_policy=gcp.container.ClusterIpAllocationPolicyArgs(
        cluster_secondary_range_name="pod-range",
        services_secondary_range_name="svc-range",
    ),
    workload_identity_config=gcp.container.ClusterWorkloadIdentityConfigArgs(
        workload_pool=f"{project_id}.svc.id.goog",
    ),
    addons_config=gcp.container.ClusterAddonsConfigArgs(
        network_policy_config=gcp.container.ClusterAddonsConfigNetworkPolicyConfigArgs(
            disabled=False,
        ),
    ),
    release_channel=gcp.container.ClusterReleaseChannelArgs(
        channel="REGULAR",
    ),
)
```
Why it's powerful: You get the full expressiveness of a programming language—loops, conditionals, functions, types, IDE autocomplete. Complex logic that requires 200 lines of HCL templating can be 20 lines of Python.
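As an illustration, a single loop can stamp out similarly shaped clusters per environment. The environment map below is illustrative, not prescribed by any provider:

```python
import pulumi_gcp as gcp

# Illustrative per-environment settings (locations, CIDRs, channels are placeholders).
environments = {
    "dev":  {"location": "us-central1-a", "master_cidr": "172.16.0.16/28", "channel": "RAPID"},
    "prod": {"location": "us-central1",   "master_cidr": "172.16.0.0/28",  "channel": "REGULAR"},
}

clusters = {
    env: gcp.container.Cluster(
        f"{env}-cluster",
        location=cfg["location"],
        remove_default_node_pool=True,
        initial_node_count=1,
        private_cluster_config=gcp.container.ClusterPrivateClusterConfigArgs(
            enable_private_nodes=True,
            enable_private_endpoint=False,
            master_ipv4_cidr_block=cfg["master_cidr"],
        ),
        release_channel=gcp.container.ClusterReleaseChannelArgs(channel=cfg["channel"]),
    )
    for env, cfg in environments.items()
}
```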
Trade-off: Requires developers comfortable with the chosen language. HCL is intentionally limited; Pulumi lets you shoot yourself in the foot with complex abstractions.
Verdict: Terraform/OpenTofu and Pulumi are the production standard. They are declarative, stateful, drift-aware, and enable true infrastructure-as-code workflows.
Level 3: Kubernetes-Native GitOps (The Unified Control Plane)
The next evolution: managing infrastructure using the Kubernetes API itself.
Config Connector (Google's First-Party Kubernetes Operator)
Config Connector is a Kubernetes operator that adds Google Cloud resource types as Custom Resource Definitions (CRDs). You define GCP infrastructure as Kubernetes manifests:
```yaml
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: prod-cluster
spec:
  location: us-central1
  networkRef:
    name: my-vpc
  subnetworkRef:
    name: my-subnet
  privateClusterConfig:
    enablePrivateNodes: true
    enablePrivateEndpoint: false
    masterIpv4CidrBlock: 172.16.0.0/28
  ipAllocationPolicy:
    clusterSecondaryRangeName: pod-range
    servicesSecondaryRangeName: svc-range
  workloadIdentityConfig:
    workloadPool: my-project.svc.id.goog
  addonsConfig:
    networkPolicyConfig:
      disabled: false
  releaseChannel:
    channel: REGULAR
  removeDefaultNodePool: true
  initialNodeCount: 1
```
You kubectl apply this manifest. The Config Connector operator, running inside your Kubernetes cluster, watches for these custom resources and makes the corresponding GCP API calls.
The paradigm shift: Infrastructure and applications share a unified control plane. The same GitOps tools (ArgoCD, Flux) that deploy your microservices also provision your GKE clusters. The same RBAC, audit logs, and observability tools work for both.
The self-provisioning cluster: A GKE cluster can provision its own node pools, databases, DNS zones, and load balancers. The cluster becomes "self-describing" infrastructure.
The bootstrapping problem: You still need something to create the first cluster that runs Config Connector. This is usually Terraform or Pulumi. Once bootstrapped, Config Connector can manage everything else.
Crossplane (Vendor-Neutral, Composable Multi-Cloud)
Crossplane extends the Kubernetes-native infrastructure pattern with:
- Multi-cloud: Manage GCP, AWS, Azure resources with the same workflow
- Composite resources: Platform teams define high-level abstractions (e.g., `XCluster`) that compose lower-level resources (VPC + Subnets + GKE + Node Pools + Databases)
- Developer self-service: Developers request an `XCluster`, and Crossplane provisions all underlying infrastructure
The strategic advantage: Platform engineering teams build internal developer platforms that abstract cloud complexity. Developers don't need to understand VPC secondary ranges or Workload Identity—they just request a "cluster" and get production-ready infrastructure.
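To make that concrete, a developer-facing request might look like the following, assuming the platform team has published an XCluster composite resource definition. The API group and fields here are hypothetical examples, not a built-in Crossplane schema:

```yaml
# Hypothetical claim against a platform-defined XCluster abstraction.
apiVersion: platform.example.org/v1alpha1
kind: XCluster
metadata:
  name: team-a-prod
spec:
  region: us-central1
  environment: production
  size: small # the composition maps this to node shapes, CIDRs, NAT, etc.
```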
The complexity cost: Crossplane is operationally complex. The operator must be highly available. Composite resource definitions require deep Kubernetes and cloud expertise.
Verdict: Cutting-edge. Ideal for platform teams building self-service infrastructure, but requires investment in operator reliability and sophisticated abstractions.
Project Planton's Approach: Guard Rails, Not Handcuffs
The GcpGkeCluster API resource is designed around a principle: make the right thing easy and the wrong thing hard.
Core Philosophy
Production GKE clusters follow a predictable pattern:
- Private nodes (no public IPs) with Cloud NAT for outbound internet
- VPC-native networking (IP aliasing for pods and services)
- Workload Identity (secure IAM for pods without node-level service account keys)
- Network Policy enforcement (Calico for microsegmentation)
- Release channels (auto-upgrades with Regular channel as default)
This isn't opinion—it's the consolidated wisdom of Google's own best practices documentation, enterprise architecture patterns, and production war stories.
What the API Includes (The 80%)
The GcpGkeClusterSpec focuses on control plane configuration—the essential, immutable decisions that define a cluster's networking, security, and upgrade strategy:
Core Networking (Required)
- `project_id`: GCP project (foreign key to `GcpProject` resource)
- `location`: Region (e.g., `us-central1`) or zone (e.g., `us-central1-a`)
  - Regional clusters: Multi-zonal high availability (control plane and nodes across 3+ zones)
  - Zonal clusters: Single-zone (cheaper, less resilient)
- `subnetwork_self_link`: VPC subnet (foreign key to `GcpSubnetwork`)
- `cluster_secondary_range_name`: Secondary IP range for Pod IPs
- `services_secondary_range_name`: Secondary IP range for Service IPs
Why VPC-native (IP aliasing) is mandatory: Legacy "routes-based" clusters are deprecated and lack modern features (Workload Identity, Private Service Connect, some GKE add-ons). All new clusters should be VPC-native.
Private Cluster Configuration (Required)
- `master_ipv4_cidr_block`: RFC1918 /28 CIDR for the Kubernetes API server (control plane) private endpoint
  - Must be /28 (exactly 16 IPs)
  - Must not overlap with VPC, pod, or service ranges
  - Example: `172.16.0.0/28`
Why private by default: Public control plane endpoints are a security liability. Private endpoints limit API access to VPC networks (or authorized networks via VPN/Interconnect).
- `enable_public_nodes`: Boolean to allow nodes with public IPs (default: `false`)
  - If `false`: Nodes are private (RFC1918 only), requires Cloud NAT for internet egress
  - If `true`: Nodes get external IPs (legacy pattern, higher attack surface)
Connectivity (Required)
- `router_nat_name`: Reference to Cloud NAT configuration (foreign key to `GcpRouterNat`)
  - Required for private nodes to pull images, reach APIs, install packages
  - Cloud NAT provides managed, scalable outbound internet without exposing inbound attack surface
Operational Configuration (Optional with Defaults)
- `release_channel`: Auto-upgrade strategy (default: `REGULAR`)
  - RAPID: Bleeding-edge (gets new Kubernetes versions first, ahead of Regular)
  - REGULAR: Production-recommended balance (new versions ~2-3 months after Rapid)
  - STABLE: Conservative (new versions ~2-3 months after Regular)
  - NONE: Manual upgrades (not recommended—you manage security patching)

Best practice: Use REGULAR for production. It balances access to new features with stability.
- `disable_network_policy`: Boolean to turn off Calico network policy enforcement (default: `false`)
  - If `false`: Network policies enabled (recommended for production security)
  - If `true`: All pods can reach all pods (simpler, less secure)
- `disable_workload_identity`: Boolean to turn off Workload Identity (default: `false`)
  - If `false`: Workload Identity enabled (pods use KSA→GSA binding for IAM)
  - If `true`: Legacy metadata server (pods inherit node service account—overly permissive)
Why Workload Identity is critical: It follows the principle of least privilege. Each pod gets exactly the IAM permissions it needs via Kubernetes Service Account annotations. No shared node-level service account keys.
What the API Excludes (The 20%)
These are deliberately excluded to maintain focus on control plane configuration:
Node Pools (Separate Resource)
Node pools are an independent resource (GcpGkeNodePool). This aligns with GKE's API design and mirrors Terraform/Pulumi best practices. Reasons:
- Lifecycle independence: Change machine types, scale pools, add GPU pools—without touching the control plane
- Avoid unintended side effects: Inline node pool configs in cluster definitions can trigger unwanted cluster-level operations
- Production pattern: Validated by Terraform docs, Pulumi patterns, and real-world usage
Add-ons (Opinionated Defaults)
GKE add-ons (HTTP Load Balancing, Horizontal Pod Autoscaling, Cloud Logging, Cloud Monitoring) are enabled by default. Disabling them is rare. The API doesn't expose every add-on toggle to reduce surface area.
If you need granular add-on control, use the underlying Pulumi/Terraform providers directly.
Binary Authorization, Advanced Security Features
Features like Binary Authorization (image attestation), Shielded GKE Nodes (Secure Boot), and GKE Sandbox (gVisor isolation) are advanced security patterns. Including them would clutter the API for the 80% use case.
Extension path: These can be added as optional fields in future API versions without breaking existing resources.
Master Authorized Networks (Advanced)
Master authorized networks restrict API server access to specific CIDR ranges (e.g., corporate VPN IPs). This is important for highly regulated environments but not a universal requirement.
Workaround: Use VPC firewall rules and private endpoints for access control.
The 80/20 Scoping Decisions
Project Planton's GcpGkeCluster is opinionated but extensible. The goal: 80% of users get production-ready defaults; 20% with advanced needs can drop to Terraform/Pulumi.
| Feature | Included? | Why? |
|---|---|---|
| VPC-native networking | ✅ Required | Routes-based is deprecated; IP aliasing is mandatory for modern features |
| Private cluster config | ✅ Required | Security best practice; public clusters are legacy anti-pattern |
| Cloud NAT reference | ✅ Required | Private nodes can't function without NAT for egress |
| Workload Identity | ✅ Enabled by default | IAM for pods; principle of least privilege |
| Network Policy (Calico) | ✅ Enabled by default | Microsegmentation; production security baseline |
| Release channel | ✅ Configurable (default: REGULAR) | Auto-upgrades are critical for security; channel choice matters |
| Node pools | ❌ Separate resource | Lifecycle independence; aligns with Terraform/Pulumi patterns |
| Master authorized networks | ❌ Excluded | Advanced use case; achievable via VPC firewall rules |
| Binary Authorization | ❌ Excluded | Advanced security; not universal requirement |
| Add-on granular control | ❌ Opinionated defaults | Defaults are production-ready; reduces API surface |
Production Best Practices
Private Clusters: The Modern Default
Anti-pattern: Public clusters with nodes that have external IPs.
Why it fails: Every node is exposed to the internet. Even with firewall rules, you've expanded the attack surface. If a node is compromised, lateral movement and data exfiltration are easier.
Best practice: Private clusters with private nodes:
- Nodes have only RFC1918 (private) IPs
- Control plane endpoint is private (accessible only from VPC or authorized networks)
- Outbound internet access via Cloud NAT (managed, stateless, no inbound ports)
The trade-off: SSH access to nodes requires a bastion host or IAP (Identity-Aware Proxy). This is intentional—nodes should be immutable and auto-replaceable, not SSH destinations.
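When you do need a shell on a private node for debugging, Identity-Aware Proxy can tunnel SSH without any public IP. A sketch (the node name and zone are illustrative):

```bash
# SSH to a private GKE node through IAP's TCP tunnel (no external IP required).
gcloud compute ssh gke-prod-cluster-default-pool-abc123 \
  --zone=us-central1-a \
  --tunnel-through-iap
```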
VPC-Native Networking: Non-Negotiable
Legacy approach: Routes-based clusters use Google Cloud routes for pod IP routing. This is deprecated.
Modern approach: VPC-native clusters use alias IP ranges:
- Pods get IPs from a dedicated secondary range on the subnet
- Services get IPs from another secondary range
- GCP's VPC natively understands these ranges (no custom routes)
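For reference, the subnet backing a VPC-native cluster carries both secondary ranges. A Terraform sketch (resource names and CIDRs are illustrative):

```hcl
resource "google_compute_subnetwork" "primary" {
  name          = "prod-subnet"
  region        = "us-central1"
  network       = google_compute_network.vpc.id
  ip_cidr_range = "10.10.0.0/20" # node IPs

  secondary_ip_range {
    range_name    = "pod-range"
    ip_cidr_range = "10.20.0.0/16" # pod IPs
  }

  secondary_ip_range {
    range_name    = "svc-range"
    ip_cidr_range = "10.30.0.0/20" # service IPs
  }
}
```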
Why it matters:
- Workload Identity requires VPC-native
- Private Service Connect (private access to Google APIs) requires VPC-native
- GKE Dataplane V2 (eBPF-based networking) requires VPC-native
- Shared VPC patterns work better with VPC-native
Verdict: Always use VPC-native. Project Planton enforces this by requiring secondary range names in the spec.
Workload Identity: IAM for Pods Without Keys
Anti-pattern: Nodes use the Compute Engine default service account (overly permissive). Pods inherit node-level permissions. If one pod is compromised, all pods share the same IAM blast radius.
Modern approach: Workload Identity:
- Each Kubernetes Service Account (KSA) is annotated with a Google Service Account (GSA)
- GKE's metadata server intercepts credential requests from pods
- Pods receive short-lived GSA tokens scoped to the KSA's bound GSA
- No service account keys, no node-level shared permissions
Configuration:
```yaml
# Kubernetes Service Account with Workload Identity binding
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  namespace: production
  annotations:
    iam.gke.io/gcp-service-account: app@my-project.iam.gserviceaccount.com
```
```bash
# Bind KSA to GSA
gcloud iam service-accounts add-iam-policy-binding \
  app@my-project.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:my-project.svc.id.goog[production/app-sa]"
```
Result: The app-sa service account in the production namespace gets exactly the IAM permissions of app@my-project.iam.gserviceaccount.com. No other pods can impersonate this GSA.
Verdict: Enable Workload Identity by default. Project Planton does this via disable_workload_identity: false.
Release Channels: Auto-Upgrades Are Your Friend
Anti-pattern: Disabling auto-upgrades (release_channel: NONE) out of fear of disruption.
Why it backfires: Your cluster falls behind on security patches. When you do upgrade, the version gap is large and risky. You accumulate technical debt.
Best practice: Use release channels to control when upgrades happen, not whether they happen:
- RAPID: For dev/test environments that want bleeding-edge features
- REGULAR: For production (recommended)—new versions ~2-3 months after Rapid
- STABLE: For highly risk-averse production—new versions ~2-3 months after Regular
Combine with maintenance windows (define recurring time windows when upgrades are permitted, e.g., "Sunday 02:00-04:00 AM").
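In Terraform terms, that pairing is a maintenance_policy block inside the google_container_cluster resource; a sketch with illustrative window times:

```hcl
# Weekly maintenance window: Sundays 02:00-04:00 UTC.
maintenance_policy {
  recurring_window {
    start_time = "2024-01-07T02:00:00Z"
    end_time   = "2024-01-07T04:00:00Z"
    recurrence = "FREQ=WEEKLY;BYDAY=SU"
  }
}
```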
Verdict: Use REGULAR for production. Auto-upgrades minimize security risk.
Network Policies: Microsegmentation for Kubernetes
Anti-pattern: Default Kubernetes allows any pod to reach any pod on any port (flat network).
Why it's risky: If one pod is compromised, attackers can pivot laterally to databases, internal APIs, or other services.
Best practice: Enable Network Policy enforcement (Calico in GKE):
```yaml
# Example: Only allow frontend pods to reach backend on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```
Result: Backend pods reject traffic from anything except frontend pods on port 8080.
Verdict: Enable network policies by default. Project Planton does this via disable_network_policy: false.
Cloud NAT: Outbound Internet for Private Nodes
Private nodes have no public IPs. How do they:
- Pull Docker images from Docker Hub, Artifact Registry?
- Reach external APIs (Stripe, Twilio, GitHub)?
- Install packages via `apt` or `yum`?
Answer: Cloud NAT (Network Address Translation)—a managed, regional NAT gateway:
- Provides outbound internet access
- No inbound ports exposed (stateless NAT)
- Auto-scales with your cluster (no VM-based NAT bottlenecks)
- Logs all outbound connections for audit trails
Configuration: Create a Cloud Router and Cloud NAT in the same region as your cluster. Project Planton's router_nat_name field ensures this is validated at spec definition time.
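A sketch of that setup with gcloud (router and NAT names are illustrative):

```bash
# Cloud Router + Cloud NAT in the same region as the cluster.
gcloud compute routers create prod-router \
  --network=my-vpc \
  --region=us-central1

gcloud compute routers nats create prod-nat \
  --router=prod-router \
  --region=us-central1 \
  --nat-all-subnet-ip-ranges \
  --auto-allocate-nat-external-ips
```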
Verdict: Required for private clusters. Project Planton enforces this by making router_nat_name a required field.
Regional vs Zonal Clusters: Availability vs Cost
Zonal Clusters
- Control plane: Single zone (e.g., `us-central1-a`)
- Nodes: Can span multiple zones, but control plane is in one zone
- SLA: Lower (control plane failure = API downtime)
- Cost: Lower (one control plane replica)
Use case: Dev/test environments, cost-sensitive non-critical workloads
Regional Clusters
- Control plane: Multi-zonal (replicas in 3+ zones)
- Nodes: Automatically spread across multiple zones
- SLA: Higher (control plane survives zonal outages)
- Cost: Higher (3+ control plane replicas)
Use case: Production workloads requiring high availability
Verdict: Use regional clusters for production. The control plane cost is marginal compared to node costs, and the resilience is worth it.
Control Plane and Node Separation: The Architecture That Matters
One of GKE's core design principles: control plane and nodes are separate. The control plane (Kubernetes API server, etcd, controllers) is managed by Google. You provision the cluster (control plane config), then separately provision node pools.
This separation is intentional:
- Control plane upgrades don't touch nodes
- Node pool changes (machine types, autoscaling, labels) don't risk control plane disruptions
- Node pool lifecycles are independent (create/update/delete pools without recreating the cluster)
Project Planton's API reflects this:
- `GcpGkeCluster`: Control plane configuration (networking, security, upgrade strategy)
- `GcpGkeNodePool`: Node configuration (machine types, autoscaling, taints, labels)
This mirrors Terraform and Pulumi best practices: google_container_cluster (control plane) and google_container_node_pool (nodes) are separate resources.
Anti-pattern: Inline node pools within cluster definitions. Terraform docs explicitly warn this can cause IaC tools to "struggle with complex changes" and risk unintended cluster recreation.
Verdict: Keep them separate. Project Planton enforces this architectural boundary.
Master CIDR Planning: The /28 You Can't Change
Every GKE private cluster requires a master CIDR block—a /28 range (16 IPs) for the Kubernetes control plane's private endpoint.
Critical constraints:
- Must be /28 (exactly 16 IPs)—no larger, no smaller
- Must be RFC1918 (private): `10.0.0.0/8`, `172.16.0.0/12`, or `192.168.0.0/16`
- Must not overlap with:
  - VPC primary range
  - Pod secondary range
  - Service secondary range
  - Any peered VPC ranges
  - Any VPN/Interconnect on-premises ranges
- Cannot be changed after cluster creation (immutable)
Planning strategy: Reserve a dedicated /16 block for GKE master CIDRs and carve it into /28 slices. Example:
- `172.16.0.0/28` → Cluster 1 masters
- `172.16.0.16/28` → Cluster 2 masters
- `172.16.0.32/28` → Cluster 3 masters
- ...up to 4096 clusters in `172.16.0.0/16`
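A quick way to enumerate and check these blocks with nothing but the Python standard library (the allocated list is illustrative):

```python
import ipaddress

# Carve /28 master CIDRs out of a reserved /16, skipping blocks already in use.
reserved = ipaddress.ip_network("172.16.0.0/16")
allocated = [ipaddress.ip_network(c) for c in ("172.16.0.0/28", "172.16.0.16/28")]

def next_master_cidr() -> ipaddress.IPv4Network:
    for block in reserved.subnets(new_prefix=28):
        if not any(block.overlaps(existing) for existing in allocated):
            return block
    raise RuntimeError("reserved /16 is exhausted")

print(next_master_cidr())  # 172.16.0.32/28 -> the next free block (Cluster 3 above)
```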
Verdict: Plan your master CIDR allocation strategy before creating clusters. Project Planton's master_ipv4_cidr_block field enforces /28 via regex validation.
Shared VPC: Enterprise Multi-Project Networking
In enterprise GCP organizations, networking is centralized:
- Host project: Owns the VPC, subnets, firewall rules
- Service projects: Consume subnets from the host project
Shared VPC for GKE:
- GKE cluster is in a service project
- Cluster references subnets from the host project
- Nodes and pods use IP ranges from the shared VPC
- Firewall rules, Cloud NAT, and VPN are managed centrally
Why it matters: Separates network administration (platform team) from application deployment (product teams). Platform team controls IP allocation and security policies; product teams deploy clusters within those guard rails.
Project Planton support: The subnetwork_self_link field can reference subnets from any project (host or same project). For Shared VPC, use:
- `project_id`: Service project (where the cluster is created)
- `subnetwork_self_link`: Subnet from the host project
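Following the spec format used in the examples later in this document, a Shared VPC configuration might look like this fragment (project and subnet names are placeholders):

```yaml
spec:
  project_id:
    value: service-project-123 # service project: where the cluster lives
  subnetwork_self_link:
    value: https://www.googleapis.com/compute/v1/projects/host-project/regions/us-central1/subnetworks/shared-gke-subnet
```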
The GKE Add-Ons Landscape
GKE enables several add-ons by default:
- HTTP Load Balancing: Provisions GCP Load Balancers for `Ingress` resources
- Horizontal Pod Autoscaler: Scales pods based on CPU/memory metrics
- Cloud Logging: Exports cluster logs to Cloud Logging
- Cloud Monitoring: Exports cluster metrics to Cloud Monitoring
Most production clusters keep these enabled. Project Planton does not expose granular add-on toggles to reduce API surface area.
Advanced use cases: If you need custom Ingress controllers (e.g., NGINX, Traefik) or third-party observability (e.g., Datadog, Prometheus), you can disable GCP's HTTP Load Balancing. This is rare. For most users, the defaults are production-ready.
The Node Pool Lifecycle: Why Separation Matters
Node pools have independent lifecycles:
- Create: Add a new pool (GPU nodes, high-memory nodes) without touching the cluster
- Update: Change machine types, autoscaling ranges, labels—forces node recreation but cluster control plane is unaffected
- Delete: Remove a pool (drain nodes, terminate VMs)—no impact on other pools or cluster
Critical pattern: GKE's best practice is to remove the default node pool and create custom pools:
resource "google_container_cluster" "primary" {
# ...
remove_default_node_pool = true
initial_node_count = 1 # Temporary; deleted after cluster creation
}
resource "google_container_node_pool" "custom" {
cluster = google_container_cluster.primary.name
# ... custom configuration
}
Why? The default node pool has generic settings. Production workloads need customized pools (machine types, autoscaling, labels, taints).
Project Planton's Pulumi implementation follows this pattern: remove_default_node_pool = true, then separately provision GcpGkeNodePool resources.
Example Configurations
Minimal Production Cluster (Private, Regional, Auto-Upgrades)
```yaml
apiVersion: gcp.project-planton.org/v1
kind: GcpGkeCluster
metadata:
  name: prod-cluster
spec:
  project_id:
    value_ref:
      kind: GcpProject
      name: my-project
      path: status.outputs.project_id
  location: us-central1 # Regional for HA
  subnetwork_self_link:
    value_ref:
      kind: GcpSubnetwork
      name: prod-subnet
      path: status.outputs.self_link
  cluster_secondary_range_name:
    value_ref:
      kind: GcpSubnetwork
      name: prod-subnet
      path: status.outputs.pods_secondary_range_name
  services_secondary_range_name:
    value_ref:
      kind: GcpSubnetwork
      name: prod-subnet
      path: status.outputs.services_secondary_range_name
  master_ipv4_cidr_block: 172.16.0.0/28
  enable_public_nodes: false # Private nodes
  release_channel: REGULAR # Auto-upgrades
  disable_network_policy: false # Enable Calico
  disable_workload_identity: false # Enable Workload Identity
  router_nat_name:
    value_ref:
      kind: GcpRouterNat
      name: prod-nat
      path: metadata.name
```
Result: Regional private cluster with VPC-native networking, Workload Identity, network policies, and auto-upgrades on the Regular channel.
Dev Cluster (Zonal, Cost-Optimized)
```yaml
apiVersion: gcp.project-planton.org/v1
kind: GcpGkeCluster
metadata:
  name: dev-cluster
spec:
  project_id:
    value: dev-project-12345
  location: us-central1-a # Zonal (cheaper)
  subnetwork_self_link:
    value: https://www.googleapis.com/compute/v1/projects/dev-project/regions/us-central1/subnetworks/dev-subnet
  cluster_secondary_range_name:
    value: dev-pods
  services_secondary_range_name:
    value: dev-services
  master_ipv4_cidr_block: 172.16.0.16/28
  enable_public_nodes: false
  release_channel: RAPID # Bleeding-edge for dev
  disable_network_policy: true # Simplify dev debugging
  disable_workload_identity: true # Not needed for dev
  router_nat_name:
    value: dev-nat
```
Result: Zonal dev cluster with relaxed security (no network policies, no Workload Identity) for faster iteration.
High-Security Prod Cluster (Stable Channel, All Security Features)
```yaml
apiVersion: gcp.project-planton.org/v1
kind: GcpGkeCluster
metadata:
  name: secure-prod
spec:
  project_id:
    value: security-project-67890
  location: us-central1
  subnetwork_self_link:
    value: https://www.googleapis.com/compute/v1/projects/security-project/regions/us-central1/subnetworks/secure-subnet
  cluster_secondary_range_name:
    value: secure-pods
  services_secondary_range_name:
    value: secure-services
  master_ipv4_cidr_block: 172.16.0.32/28
  enable_public_nodes: false
  release_channel: STABLE # Conservative upgrades
  disable_network_policy: false # Maximum security
  disable_workload_identity: false # IAM per pod
  router_nat_name:
    value: secure-nat
```
Result: Regional cluster on Stable channel with all security features enabled for risk-averse production.
Conclusion: Control Plane First, Nodes Second
GKE clusters are not monolithic. They're composed of:
- Control plane (API server, etcd, controllers)—configuration captured in `GcpGkeCluster`
- Node pools (VMs running kubelet)—configuration captured in `GcpGkeNodePool`
This separation is architectural and intentional. It mirrors GKE's API design, Terraform best practices, Pulumi patterns, and real-world production stability requirements.
Project Planton's GcpGkeCluster API encodes the lessons learned from thousands of production GKE deployments:
- Private clusters with VPC-native networking by default
- Workload Identity for IAM-per-pod without shared secrets
- Network policies for microsegmentation
- Release channels for security-conscious auto-upgrades
- Cloud NAT for private node internet egress
- Master CIDR validation to prevent configuration mistakes
The goal: production-ready GKE clusters in 10 lines of YAML, not 100 lines of HCL.
The 80/20 philosophy ensures that simple use cases remain simple, while the remaining 20% (advanced security, add-on customization, specialized networking) can be addressed by extending the API or dropping to Terraform/Pulumi.
From pet clusters maintained by tribal knowledge to cattle infrastructure defined in Git—this is the evolution Project Planton enables.