GCP Compute Instance: Technical Research and Implementation Guide
Introduction
Google Compute Engine is Google Cloud's Infrastructure-as-a-Service (IaaS) offering that provides virtual machines running on Google's global infrastructure. Compute Engine VMs are the foundational building blocks for running workloads on GCP, offering a wide range of machine types, storage options, and networking configurations.
This document provides comprehensive research on GCP Compute Engine instances, exploring the deployment landscape, best practices, and the rationale behind Project Planton's implementation choices.
The Evolution of Compute Engine
Historical Context
Google Compute Engine was launched in 2012 as Google's answer to Amazon EC2 and reached general availability in late 2013. Since then, it has evolved significantly:
- 2012: Initial launch with basic VM functionality
- 2013: Live migration for transparent host maintenance
- 2015: Preemptible VMs for cost optimization and custom machine types with user-defined CPU/memory ratios
- 2018: Sole-tenant nodes for compliance
- 2020: Confidential VMs with memory encryption
- 2022: Spot VMs, the successor to preemptible VMs
- 2023: C3 and C3D machine types with the latest Intel/AMD processors
Current Capabilities
Today, Compute Engine offers:
- 100+ predefined machine types across multiple families
- Custom machine types with user-defined CPU/memory ratios
- GPU and TPU accelerators for ML/AI workloads
- Local SSDs for high-performance temporary storage
- Persistent disks with regional and zonal options
- Advanced networking with VPC, load balancing, and CDN integration
Deployment Methods
1. Google Cloud Console
The web-based console provides a guided experience:
Advantages:
- Visual interface with real-time validation
- Helpful defaults and recommendations
- Easy exploration of available options
Disadvantages:
- Not reproducible or version-controlled
- Manual process prone to human error
- Not suitable for automation
2. gcloud CLI
The command-line interface for direct API access:
```bash
gcloud compute instances create my-vm \
    --project=my-project \
    --zone=us-central1-a \
    --machine-type=e2-medium \
    --image-family=debian-11 \
    --image-project=debian-cloud \
    --boot-disk-size=20GB \
    --boot-disk-type=pd-ssd
```
Advantages:
- Scriptable and automatable
- Can be version-controlled
- Suitable for CI/CD pipelines
Disadvantages:
- Imperative rather than declarative
- State management is manual
- Complex for multi-resource deployments
3. Terraform/OpenTofu
Infrastructure as Code with declarative configuration:
resource "google_compute_instance" "vm" {
name = "my-vm"
machine_type = "e2-medium"
zone = "us-central1-a"
boot_disk {
initialize_params {
image = "debian-cloud/debian-11"
}
}
network_interface {
network = "default"
access_config {
// Ephemeral public IP
}
}
}
Advantages:
- Declarative and idempotent
- State tracking and drift detection
- Plan/apply workflow for safety
- Large ecosystem and community
Disadvantages:
- Requires Terraform knowledge
- State file management complexity
- Learning curve for HCL syntax
4. Pulumi
Infrastructure as Code using general-purpose languages:
```go
instance, err := compute.NewInstance(ctx, "my-vm", &compute.InstanceArgs{
	MachineType: pulumi.String("e2-medium"),
	Zone:        pulumi.String("us-central1-a"),
	BootDisk: &compute.InstanceBootDiskArgs{
		InitializeParams: &compute.InstanceBootDiskInitializeParamsArgs{
			Image: pulumi.String("debian-cloud/debian-11"),
		},
	},
	NetworkInterfaces: compute.InstanceNetworkInterfaceArray{
		&compute.InstanceNetworkInterfaceArgs{
			Network: pulumi.String("default"),
		},
	},
})
if err != nil {
	return err
}
ctx.Export("instanceName", instance.Name)
```
Advantages:
- Use familiar programming languages
- Full IDE support and type checking
- Better abstraction capabilities
- Easier testing with standard tools
Disadvantages:
- Requires programming knowledge
- Smaller community than Terraform
- More complex setup
5. Cloud Deployment Manager
Google's native IaC tool:
```yaml
resources:
  - name: my-vm
    type: compute.v1.instance
    properties:
      zone: us-central1-a
      machineType: zones/us-central1-a/machineTypes/e2-medium
      disks:
        - boot: true
          autoDelete: true
          initializeParams:
            sourceImage: projects/debian-cloud/global/images/family/debian-11
      networkInterfaces:
        - network: global/networks/default
```
Advantages:
- Native GCP integration
- No external tools required
- Integrated with Cloud Console
Disadvantages:
- GCP-specific, not multi-cloud
- Limited community and examples
- Less flexible than Terraform/Pulumi
6. Config Connector (Kubernetes)
Manage GCP resources through Kubernetes:
```yaml
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeInstance
metadata:
  name: my-vm
spec:
  zone: us-central1-a
  machineType: e2-medium
  bootDisk:
    initializeParams:
      sourceImageRef:
        external: projects/debian-cloud/global/images/family/debian-11
  networkInterface:
    - networkRef:
        name: default
```
Advantages:
- Kubernetes-native experience
- GitOps-friendly
- Unified control plane for GCP and K8s
Disadvantages:
- Requires Kubernetes cluster
- Additional complexity
- Slower than native API
Comparative Analysis
| Method | Reproducibility | Automation | Learning Curve | Multi-Cloud |
|---|---|---|---|---|
| Console | Low | None | Low | No |
| gcloud | Medium | Good | Medium | No |
| Terraform | High | Excellent | Medium | Yes |
| Pulumi | High | Excellent | Medium-High | Yes |
| Deployment Manager | High | Good | Medium | No |
| Config Connector | High | Excellent | High | No |
Project Planton's Approach
Why We Created This Component
Project Planton provides a Kubernetes Resource Model (KRM) interface for GCP Compute Engine instances, offering:
- Declarative YAML Configuration: Familiar syntax for Kubernetes users
- Dual IaC Implementation: Both Pulumi (Go) and Terraform modules
- Cross-Resource References: Link to GcpProject, GcpVpc, GcpSubnetwork resources
- Validation at Definition Time: Proto-based schema with buf.validate rules
- Consistent Patterns: Same structure across all deployment components
80/20 Feature Selection
We focused on the 20% of features that cover 80% of use cases:
In Scope:
- Machine type selection
- Boot disk configuration (image, size, type)
- Network interface configuration (VPC, subnet, external IP)
- Service account attachment
- Spot/Preemptible VMs
- Labels, tags, and metadata
- Startup scripts
- Attached data disks
- Scheduling options
Out of Scope (for now):
- GPU/TPU accelerators
- Local SSDs
- Shielded VM options
- Confidential VM
- Sole-tenant nodes
- Instance templates
- Managed instance groups
- Reservation affinity
Design Decisions
Machine Type Flexibility
We accept any valid machine type string rather than enumerating all options. This provides:
- Forward compatibility with new machine types
- Support for custom machine types
- Simpler schema without bloated enums
Zone vs Region
We use zone-level deployment (not regional) because:
- Compute instances are inherently zonal resources
- Regional deployments require managed instance groups
- Simpler configuration and mental model
Network Interface Structure
We support multiple network interfaces because:
- Multi-NIC VMs are common for network appliances
- Each interface can have different configurations
- Matches the underlying GCP API structure
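To make the multi-NIC case concrete, here is a minimal Pulumi (Go) sketch of a two-interface instance; the project and subnetwork paths are illustrative placeholders. Note that each interface must attach to a different VPC network, and the machine type caps how many NICs a VM can have.

```go
package main

import (
	"github.com/pulumi/pulumi-gcp/sdk/v8/go/gcp/compute"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Two NICs in two different VPC networks; a common shape for
		// network appliances bridging a frontend and backend network.
		_, err := compute.NewInstance(ctx, "appliance-vm", &compute.InstanceArgs{
			MachineType: pulumi.String("n2-standard-4"),
			Zone:        pulumi.String("us-central1-a"),
			BootDisk: &compute.InstanceBootDiskArgs{
				InitializeParams: &compute.InstanceBootDiskInitializeParamsArgs{
					Image: pulumi.String("debian-cloud/debian-11"),
				},
			},
			NetworkInterfaces: compute.InstanceNetworkInterfaceArray{
				&compute.InstanceNetworkInterfaceArgs{
					Subnetwork: pulumi.String("projects/my-project/regions/us-central1/subnetworks/frontend"),
				},
				&compute.InstanceNetworkInterfaceArgs{
					Subnetwork: pulumi.String("projects/my-project/regions/us-central1/subnetworks/backend"),
				},
			},
		})
		return err
	})
}
```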
Boot Disk Simplification
We use a single `image` field instead of separate `source_image` + `source_image_project` fields because:
- The combined format (`project/image-family`, e.g., `debian-cloud/debian-11`) is more intuitive
- GCP accepts both short names and full resource paths
- It reduces configuration complexity
Implementation Landscape
Pulumi Module Architecture
```
module/
├── main.go      # Entry point, provider setup
├── locals.go    # Data transformations, labels
├── outputs.go   # Export constants
└── instance.go  # Instance resource creation
```
Key implementation details:
- Uses `compute.Instance` from the Pulumi GCP provider
- Resolves StringValueOrRef fields for project, network, and subnet
- Applies standard Project Planton labels
- Handles optional fields with nil checks
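The sketch below illustrates the nil-check and label-merging patterns described above; the `instanceSpec` struct and `buildArgs` helper are illustrative stand-ins rather than the module's actual types.

```go
package module

import (
	"github.com/pulumi/pulumi-gcp/sdk/v8/go/gcp/compute"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// instanceSpec stands in for the fields decoded from the
// GcpComputeInstance proto spec.
type instanceSpec struct {
	MachineType         string
	Zone                string
	Labels              map[string]string
	ServiceAccountEmail string // optional
}

// buildArgs merges the standard Project Planton labels with
// user-supplied labels and only sets optional fields that are present.
func buildArgs(spec instanceSpec, standardLabels map[string]string) *compute.InstanceArgs {
	labels := pulumi.StringMap{}
	for k, v := range standardLabels {
		labels[k] = pulumi.String(v)
	}
	for k, v := range spec.Labels {
		labels[k] = pulumi.String(v)
	}

	args := &compute.InstanceArgs{
		MachineType: pulumi.String(spec.MachineType),
		Zone:        pulumi.String(spec.Zone),
		Labels:      labels,
	}

	// Optional field: attach a service account only when one is set.
	if spec.ServiceAccountEmail != "" {
		args.ServiceAccount = &compute.InstanceServiceAccountArgs{
			Email:  pulumi.String(spec.ServiceAccountEmail),
			Scopes: pulumi.StringArray{pulumi.String("cloud-platform")},
		}
	}
	return args
}
```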
Terraform Module Architecture
```
tf/
├── provider.tf   # Google provider configuration
├── variables.tf  # Input variables (mirrors spec.proto)
├── locals.tf     # Computed values and transformations
├── main.tf       # Instance resource definition
└── outputs.tf    # Output values
```
Key implementation details:
- Uses the `google_compute_instance` resource
- Dynamic blocks for network interfaces and disks
- Conditional logic for optional configurations
- Outputs match stack_outputs.proto
Production Best Practices
Machine Type Selection
- Start Small: Begin with E2 series for cost efficiency
- Right-Size: Monitor actual usage and adjust
- Consider Committed Use: 1-3 year commitments for 37-70% savings
- Use Spot for Fault-Tolerant: Batch jobs, CI/CD, stateless apps
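As a sketch of the Spot option in Pulumi (Go), using the provider's scheduling fields: Spot VMs must disable automatic restart and declare what happens on preemption (STOP or DELETE).

```go
package main

import (
	"github.com/pulumi/pulumi-gcp/sdk/v8/go/gcp/compute"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		_, err := compute.NewInstance(ctx, "spot-worker", &compute.InstanceArgs{
			MachineType: pulumi.String("e2-medium"),
			Zone:        pulumi.String("us-central1-a"),
			BootDisk: &compute.InstanceBootDiskArgs{
				InitializeParams: &compute.InstanceBootDiskInitializeParamsArgs{
					Image: pulumi.String("debian-cloud/debian-11"),
				},
			},
			NetworkInterfaces: compute.InstanceNetworkInterfaceArray{
				&compute.InstanceNetworkInterfaceArgs{
					Network: pulumi.String("default"),
				},
			},
			// Spot provisioning: no automatic restart, and stop (rather
			// than delete) the VM when Compute Engine preempts it.
			Scheduling: &compute.InstanceSchedulingArgs{
				ProvisioningModel:         pulumi.String("SPOT"),
				Preemptible:               pulumi.Bool(true),
				AutomaticRestart:          pulumi.Bool(false),
				InstanceTerminationAction: pulumi.String("STOP"),
			},
		})
		return err
	})
}
```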
Networking
- Use Custom VPCs: Avoid the default network in production
- Private IPs: Prefer private connectivity when possible
- Network Tags: Use for firewall rule targeting
- Alias IP Ranges: For multi-tenant container workloads
Security
- Least-Privilege Service Accounts: Never use default compute SA
- OS Login: Use for SSH access management via IAM
- Shielded VMs: Enable for production workloads
- No External IPs: Use Cloud NAT for egress and IAP TCP forwarding for inbound SSH
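A hedged Pulumi (Go) sketch combining these practices; the project ID, subnetwork path, and service account email are placeholders, and the service account is assumed to already exist.

```go
package main

import (
	"github.com/pulumi/pulumi-gcp/sdk/v8/go/gcp/compute"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		_, err := compute.NewInstance(ctx, "locked-down-vm", &compute.InstanceArgs{
			MachineType: pulumi.String("e2-medium"),
			Zone:        pulumi.String("us-central1-a"),
			BootDisk: &compute.InstanceBootDiskArgs{
				InitializeParams: &compute.InstanceBootDiskInitializeParamsArgs{
					Image: pulumi.String("debian-cloud/debian-11"),
				},
			},
			// No AccessConfigs block: the VM gets a private IP only.
			// Pair with Cloud NAT (egress) and IAP (inbound SSH).
			NetworkInterfaces: compute.InstanceNetworkInterfaceArray{
				&compute.InstanceNetworkInterfaceArgs{
					Subnetwork: pulumi.String("projects/my-project/regions/us-central1/subnetworks/prod"),
				},
			},
			// Dedicated least-privilege service account; grant IAM roles
			// to this identity instead of relying on broad legacy scopes.
			ServiceAccount: &compute.InstanceServiceAccountArgs{
				Email:  pulumi.String("app-vm@my-project.iam.gserviceaccount.com"),
				Scopes: pulumi.StringArray{pulumi.String("cloud-platform")},
			},
			// OS Login routes SSH key management through IAM.
			Metadata: pulumi.StringMap{
				"enable-oslogin": pulumi.String("TRUE"),
			},
		})
		return err
	})
}
```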
Reliability
- Live Migration: Keep enabled for maintenance resilience
- Startup Scripts: Make idempotent, add health checks (see the sketch after this list)
- Metadata: Use for dynamic configuration
- Labels: Consistent labeling for operations
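Below is a sketch of an idempotent startup script attached via instance metadata; the log path, guard file, and installed package are illustrative choices.

```go
package main

import (
	"github.com/pulumi/pulumi-gcp/sdk/v8/go/gcp/compute"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// The script logs to a file and syslog so failures can be
		// diagnosed, and uses a guard file so reboots are no-ops.
		startupScript := `#!/bin/bash
set -euo pipefail
exec > >(tee -a /var/log/startup-script.log | logger -t startup) 2>&1
if [[ -f /var/lib/startup-done ]]; then
  echo "startup already completed; skipping"
  exit 0
fi
apt-get update -y
apt-get install -y nginx
touch /var/lib/startup-done
`
		_, err := compute.NewInstance(ctx, "web-vm", &compute.InstanceArgs{
			MachineType:           pulumi.String("e2-medium"),
			Zone:                  pulumi.String("us-central1-a"),
			MetadataStartupScript: pulumi.String(startupScript),
			BootDisk: &compute.InstanceBootDiskArgs{
				InitializeParams: &compute.InstanceBootDiskInitializeParamsArgs{
					Image: pulumi.String("debian-cloud/debian-11"),
				},
			},
			NetworkInterfaces: compute.InstanceNetworkInterfaceArray{
				&compute.InstanceNetworkInterfaceArgs{
					Network: pulumi.String("default"),
				},
			},
		})
		return err
	})
}
```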
Cost Optimization
- Spot VMs: Use for fault-tolerant workloads
- Preemptible Batch Jobs: Schedule during off-peak hours
- Auto-Shutdown: Stop or delete dev instances after hours (see the schedule sketch below)
- Right-Size Disks: Start small, grow as needed
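One way to automate off-hours shutdown is a Compute Engine instance schedule, expressed here as a hedged Pulumi (Go) sketch; the cron expressions and time zone are examples, and the policy must still be attached to instances via their resource policies (the Compute Engine service agent also needs permission to start and stop them).

```go
package main

import (
	"github.com/pulumi/pulumi-gcp/sdk/v8/go/gcp/compute"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Start dev VMs at 08:00 and stop them at 18:00 on weekdays.
		_, err := compute.NewResourcePolicy(ctx, "office-hours", &compute.ResourcePolicyArgs{
			Region: pulumi.String("us-central1"),
			InstanceSchedulePolicy: &compute.ResourcePolicyInstanceSchedulePolicyArgs{
				TimeZone: pulumi.String("America/New_York"),
				VmStartSchedule: &compute.ResourcePolicyInstanceSchedulePolicyVmStartScheduleArgs{
					Schedule: pulumi.String("0 8 * * 1-5"),
				},
				VmStopSchedule: &compute.ResourcePolicyInstanceSchedulePolicyVmStopScheduleArgs{
					Schedule: pulumi.String("0 18 * * 1-5"),
				},
			},
		})
		return err
	})
}
```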
Common Pitfalls
1. Using Default Service Account
The default Compute Engine service account is typically granted the broad Editor role on the project. Always create dedicated service accounts with only the IAM roles the workload needs.
2. Public IPs Without Firewall Rules
An external IP exposes the VM to the internet, subject to your firewall rules. Audit those rules carefully or remove external access entirely.
3. Ignoring Startup Script Failures
Startup scripts can fail silently. Implement logging and health checks to detect issues.
4. Not Using Labels
Labels are essential for cost allocation, automation, and operations. Establish a labeling strategy.
5. Hardcoding Zone Selection
Zones can have capacity issues. Consider zone-agnostic deployments or fallback zones.
Conclusion
GCP Compute Engine instances are versatile building blocks for cloud infrastructure. Project Planton's GcpComputeInstance component provides a standardized, validated interface for deploying VMs with both Pulumi and Terraform, focusing on the most common configuration patterns while maintaining flexibility for advanced use cases.
The implementation balances simplicity with capability, enabling teams to quickly deploy production-grade VM instances while following GCP best practices.