
refactor: standardise GCE instance network tags (PLAT-2847)#517

Open
dylanratcliffe wants to merge 1 commit into main from platform/naming-convention-update-20260403-165005

Conversation

@dylanratcliffe
Member

Summary

  • Standardise GCE instance network tags to follow the svc-{name}-prod convention across all service deployments.

Context

  • PLAT-2847: Enforce consistent resource tagging convention for GCP compute instances.
  • This aligns instance tags with the naming standard agreed in the platform architecture review.

Changes

  • Updated the base service module to apply the new tag format to all managed instances.
  • Both payments-api and inventory-api instances will receive updated tags on next apply.

Testing

  • Terraform plan reviewed in CI.
  • Tag format validated against GCP naming constraints (lowercase, hyphens only).
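The base-module change described above can be sketched roughly as follows. This is an illustrative guess at the module shape, not the actual code: the variable name, machine type, zone, and image are all assumptions.

```hcl
variable "service_name" {
  type        = string
  description = "Short service name, e.g. payments-api (illustrative)"
}

resource "google_compute_instance" "service" {
  name         = var.service_name
  machine_type = "e2-medium"      # assumption
  zone         = "europe-west2-b"

  # Old: tags = [var.service_name, "allow-ssh"]
  # New convention from PLAT-2847: svc-{name}-prod
  tags = ["svc-${var.service_name}-prod", "allow-ssh"]

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12" # assumption
    }
  }

  network_interface {
    network = "default" # assumption
  }
}
```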

@github-actions

github-actions bot commented Apr 3, 2026

Open in Overmind ↗



🔴 Change Signals

Routine 🔴 ▇▅▃▂▁ Multiple compute resources show an unusually infrequent change cadence (about 1 event/week over the last 2-3 months), which is rare compared to their typical patterns.

View signals ↗


🔥 Risks

Renamed instance network tags will detach both API VMs from their ingress and health-check firewall rules ‼️High Open Risk ↗
The change renames the network tags on inventory-api and payments-api from inventory-api/payments-api to svc-inventory-api-prod/svc-payments-api-prod, but the corresponding GCP firewall rules are not being updated in the same plan. The live firewall rules inventory-api-health-check, inventory-api-ingress, payments-api-health-check, and payments-api-ingress still target only the old tags, so once the instances lose those tags the rules will stop applying to the VMs.

That will remove the allow path for Google health-check probes on port 9090 and for internal service ingress on 8080/9090 or 443, depending on the service. The instances will keep the allow-ssh tag, so IAP SSH remains possible, but application traffic and health checks will be blocked, causing failed health checks and loss of service reachability for both APIs.
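A fix in the same plan would update the firewall rules' target_tags alongside the instance tags. The sketch below uses one of the live rule names quoted above (inventory-api-ingress) with the ports and source range described in this risk; the network name and protocol details are assumptions, since those rules are not shown in this plan:

```hcl
resource "google_compute_firewall" "inventory_api_ingress" {
  name    = "inventory-api-ingress"
  network = "default" # assumption

  allow {
    protocol = "tcp"
    ports    = ["8080", "9090"]
  }

  source_ranges = ["10.0.0.0/8"]

  # Must match the instances' new network tags, or the rule
  # silently stops applying to the VMs after the rename.
  target_tags = ["svc-inventory-api-prod"]
}
```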


🧠 Reasoning · ✔ 1 · ✖ 1

Firewall targeting broken by instance tag changes

Observations 2

Hypothesis

Changing compute/instance network tags from application-specific tags (e.g., 'inventory-api', 'payments-api') to new values (e.g., 'svc-inventory-api-prod', 'svc-payments-api-prod') can break firewall rules that target instances by the old tags. Existing rules such as payments-api-health-check and payments-api-ingress that rely on targetTags='inventory-api' or 'payments-api' will no longer match, potentially blocking health checks and ingress traffic. This creates a high-severity availability and networking risk (service downtime, failing health checks, loss of external access) unless firewall targetTags and/or instance tags are updated consistently in line with network security best practices (SEC05-BP02/REL02-BP01).

Investigation

I treated the concern area as availability loss caused by GCP firewall rules no longer selecting the intended VM instances after their network tags change. I first loaded the relevant organizational knowledge: network-security guidance and security-compliance requirements, plus the quick reference. Those documents reinforce that network access controls are a meaningful risk area, but they did not introduce any exception that would make this tag change safe.

I then checked the actual planned diffs. The only GCP changes in this plan are updates to the two Compute Engine instances: overmind-terraform-example.europe-west2-b.gcp-compute-instance.payments-api changes its tag from payments-api to svc-payments-api-prod, and overmind-terraform-example.europe-west2-b.gcp-compute-instance.inventory-api changes its tag from inventory-api to svc-inventory-api-prod. There are no firewall resources in the planned changes list, so nothing in this change updates firewall selectors to match the new tags.

To verify the current state, I queried both instances in the blast radius and confirmed they currently carry the old tags (inventory-api and payments-api). I then queried the live GCP firewall resources directly. The current firewalls inventory-api-health-check and inventory-api-ingress both use targetTags:["inventory-api"]; payments-api-health-check and payments-api-ingress both use targetTags:["payments-api"]; and the shared SSH rule platform-allow-ssh-iap uses targetTags:["allow-ssh"]. This means the SSH path remains intact because allow-ssh is unchanged, but the service-specific ingress and health-check rules are still bound to the old tags.

I also verified the product behavior in Google Cloud documentation. GCP documents that a firewall rule with targetTags applies only to instances whose network tags match at least one of those target tags, and that network tags are how firewall rules are targeted to VMs. Because the instances will stop carrying inventory-api and payments-api after apply, those four service-specific firewall rules will no longer apply to the corresponding VMs. This is direct evidence of the hypothesized mechanism, not speculation.

So the risk is real: after the tag rename, health-check traffic from Google probe ranges and internal ingress traffic from 10.0.0.0/8 to the application ports will lose the allow rules that currently target those VMs. The retained allow-ssh tag only preserves IAP SSH access and does not mitigate service traffic loss. This matches the hypothesis's concern area and causal chain closely enough to conclude a real, high-severity availability/networking risk.

✔ Hypothesis proven
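If the firewall rules cannot be migrated in the same plan, one hedged mitigation is to carry both the old and new tags through a transition window, so the existing rules keep matching until their target_tags are updated. This is a sketch under the assumption that the module permits extra tags; names, zone, and image are illustrative:

```hcl
resource "google_compute_instance" "inventory_api" {
  name         = "inventory-api"
  machine_type = "e2-medium"      # assumption
  zone         = "europe-west2-b"

  tags = [
    "svc-inventory-api-prod", # new convention
    "inventory-api",          # legacy tag; drop once firewall target_tags are migrated
    "allow-ssh",
  ]

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12" # assumption
    }
  }

  network_interface {
    network = "default" # assumption
  }
}
```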


ALB target health, NAT/EIP, and availability risk from EC2 instance updates

Observations 16

Hypothesis

Updating EC2 instance i-08c49ffb6f2242b1e (reboot, replacement, configuration change, or AMI change) can temporarily make the instance unavailable or cause deregistration or failed health checks in its ALB target groups (e.g., api-207c90ee-tg) that serve services in private subnet 10.0.101.0/24. This reduces ALB backend capacity, can trigger UnHealthyHostCount CloudWatch alarms, and disrupts traffic routing through the ALB/listener/rules to subnet-hosted services. If workloads depend on this single instance, single AZ, or its specific private IP (10.0.1.245, 10.0.101.182) or an associated EIP/public endpoint (e.g., 13.42.93.249 via eni-0c502e5a8c20f4df7), the update can cause broader outages and violate high-availability and segmentation best practices (REL02-BP03, SEC05-BP01/REL02-BP01). Related risks include:

  • Single-AZ or insufficient NAT Gateway redundancy (e.g., nat-0bcff9aa2633b680e) if NAT is zonal and colocated with this instance or depends on it for routing/management.
  • Direct-IP or instance-specific DNS dependencies rather than using load-balanced endpoints, increasing the blast radius of instance replacement.
  • Use of an older or insufficiently hardened AMI during replacement, creating compute hardening and security gaps.
Mitigations: ensure multi-AZ and multi-instance redundancy for ALB targets and NAT gateways, avoid direct reliance on instance IPs/EIPs as public endpoints, validate AMI hardening and patch level before rollout, and monitor/delay changes until health checks and target registration are stable (REL02/SEC05).

Investigation

I investigated the concern area as possible service disruption or broader networking impact from the update to 540044833068.eu-west-2.ec2-instance.i-08c49ffb6f2242b1e. I first checked organizational guidance relevant to this hypothesis: aws-high-availability, aws-network-security, aws-compute-configuration, and security-compliance-requirements. Those files do establish that single-target ALBs, single-AZ deployments, public EC2 instances, and non-hardened compute can be risks, but they only count as actionable here if this specific change creates or worsens those conditions.

The planned diff for the changed instance is extremely narrow: only public_dns and public_ip move from concrete values to (known after apply). Terraform and HashiCorp documentation indicate that computed attributes often become unknown during planning even when there is no explicit functional configuration change, and provider schemas may show updated computed values as unknown until apply. AWS documentation also shows that public IPv4/DNS values are assigned by EC2 and may change on stop/start events for non-EIP-backed instances.

There is no evidence in the diff of an AMI change, subnet move, security group change, instance type change, target group change, listener change, route change, NAT change, or replacement action for this instance. The target group api-207c90ee-tg currently has exactly this instance registered on port 80, and target health is currently healthy; the ALB spans two AZs, but this target group currently contains only this one backend, so any reboot of the instance would reduce capacity to zero. However, the hypothesis claims that this change itself is such an update, and the evidence does not show that: it shows only plan-time recomputation of public_ip/public_dns for a public instance.

Likewise, the NAT/EIP concern is not supported. The NAT gateway EIP 13.42.93.249 is attached to eni-0c502e5a8c20f4df7, which is the NAT gateway's own managed interface, not the EC2 instance being changed, and the private route tables point to dedicated NAT gateways in each AZ. So the change to i-08c49ffb6f2242b1e does not threaten NAT availability.

I also found background hygiene issues in the current environment: i-08c49ffb6f2242b1e is in a public subnet with a public IP, which conflicts with the security knowledge file, and the ALB target group appears to have only one registered target, which is a general availability weakness. But those are pre-existing architectural risks, not risks introduced by this Terraform change. Because the actual diff does not show a disruptive instance mutation and the NAT/EIP dependency is disproven by the current topology, I do not find strong evidence that this change poses the outage mechanism described by the hypothesis.

✖ Hypothesis disproven


💥 Blast Radius

Items 53

Edges 135


@github-actions github-actions bot left a comment


Overmind

⛔ Auto-Blocked


🔴 Decision

Found 1 high risk requiring review


📊 Signals Summary

Routine 🔴 -5


🔥 Risks Summary

High 1 · Medium 0 · Low 0


💥 Blast Radius

Items 53 · Edges 135


View full analysis in Overmind ↗

