
refactor: standardise GCE instance network tags (PLAT-2847) #516

Closed
dylanratcliffe wants to merge 1 commit into main from
platform/naming-convention-update-20260401-145435

Conversation

@dylanratcliffe
Member

Summary

  • Standardise GCE instance network tags to follow the svc-{name}-prod convention across all service deployments.

Context

  • PLAT-2847: Enforce consistent resource tagging convention for GCP compute instances.
  • This aligns instance tags with the naming standard agreed in the platform architecture review.

Changes

  • Updated the base service module to apply the new tag format to all managed instances.
  • Both payments-api and inventory-api instances will receive updated tags on the next apply.
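
In the base service module, the change amounts to switching the tag list to the new convention. A minimal sketch of what that module change might look like; the module layout, variable names, machine type, and image are assumptions, not taken from this repository:

```hcl
# Hypothetical base service module, e.g. modules/service/main.tf
variable "service_name" {
  type = string
}

resource "google_compute_instance" "this" {
  name         = var.service_name
  machine_type = "e2-small" # assumed
  zone         = "europe-west2-b"

  # Before: tags = [var.service_name]
  # After: the standardised svc-{name}-prod convention
  tags = ["svc-${var.service_name}-prod"]

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12" # assumed
    }
  }

  network_interface {
    network = "default"
  }
}
```

With this shape, every service consuming the module picks up the new tag format on its next apply.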

Testing

  • Terraform plan reviewed in CI.
  • Tag format validated against GCP naming constraints (lowercase, hyphens only).

@github-actions

github-actions bot commented Apr 1, 2026

Open in Overmind ↗



🔴 Change Signals

Routine 🔴 ▇▅▃▂▁ Multiple compute resources show unusually infrequent routine changes: only 1 event/week over the last 2-3 months, compared to typical patterns.

View signals ↗


🔥 Risks

Renamed GCE network tags will detach both services from their existing firewall allow rules ‼️ High · Open Risk ↗
The change renames the network tags on the payments-api and inventory-api Compute Engine instances from payments-api and inventory-api to svc-payments-api-prod and svc-inventory-api-prod, but the live firewall rules still target only the old tags. payments-api-ingress and payments-api-health-check both select payments-api, and inventory-api-ingress and inventory-api-health-check both select inventory-api.

When the new tags are applied, those firewall rules will stop matching the instances. That will remove the explicit allow rules for service traffic and Google health-check probes, so ingress falls back to the VPC's default deny behavior. Both services can become unreachable and fail health checks even though the instances themselves remain running.
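
One low-risk way to sequence the rename is to first widen the firewall rules to match both the old and new tags, then rename the instance tags, and only then drop the old tags from the rules. A sketch for one of the four affected rules, assuming it is (or becomes) Terraform-managed; the network, port, and source ranges are assumptions:

```hcl
resource "google_compute_firewall" "payments_api_ingress" {
  name    = "payments-api-ingress"
  network = "default" # assumed

  allow {
    protocol = "tcp"
    ports    = ["80"] # assumed service port
  }

  source_ranges = ["10.0.0.0/8"] # assumed internal range

  # Transitional state: match both tags so connectivity survives the rename.
  # Once both instances carry the new tags, remove "payments-api".
  target_tags = ["payments-api", "svc-payments-api-prod"]
}
```

The same two-step applies to payments-api-health-check, inventory-api-ingress, and inventory-api-health-check.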


🧠 Reasoning · ✔ 1 · ✖ 2

Firewall and network policy risk from standardized instance tag changes

Observations 3

Hypothesis

Changes to compute instance tags used for firewall targeting continue to pose a cross-service connectivity and isolation risk. GCE instances for the payments and inventory services have had their network tags changed from service-specific values ('payments-api', 'inventory-api') to a standardized svc-*-prod scheme. Existing firewall rules and health-check policies (e.g., payments-api-ingress, payments-api-health-check, inventory-api-ingress, inventory-api-health-check) that still select on the old tags will fail to match, potentially blocking ingress and health checks and causing outages. Standardizing tags across services can also couple multiple workloads to the same broader shared-prod firewall policies if rules match only on the new common pattern, reducing network isolation. Because tag-based policy is used pervasively for firewall rules, routes, and automation, a single stale or mis-scoped rule could simultaneously make both services unreachable or overexposed.

Recommended:

  1. Inventory all firewall rules, routing policies, and scripts that select on payments-api, inventory-api, or the new svc-*-prod tags.
  2. Update targetTags and policies to preserve intended per-service ingress/SSH and isolation.
  3. Verify that tag standardization does not unintentionally bind multiple services to overly broad shared-prod rules.

Investigation

I treated the concern area as service reachability and network isolation for the two GCE instances whose network tags are being renamed. I first checked relevant organizational knowledge, then inspected current state for the affected instances and live firewall resources. The instance diffs show the only planned change is replacing payments-api with svc-payments-api-prod on overmind-terraform-example.europe-west2-b.gcp-compute-instance.payments-api and replacing inventory-api with svc-inventory-api-prod on overmind-terraform-example.europe-west2-b.gcp-compute-instance.inventory-api.

Current live infrastructure confirms the risk mechanism. The four service-specific firewall rules still target the old tags only: payments-api-ingress and payments-api-health-check both have targetTags:["payments-api"], while inventory-api-ingress and inventory-api-health-check both have targetTags:["inventory-api"]. The instances currently carry those old tags, so they match today. After the tag rename, neither instance will match its service ingress or health-check rules unless separate firewall updates are included elsewhere in the plan, and there are no firewall resources in the planned-change list. Google Cloud documentation states that firewall rules apply to instances with matching targetTags, and without matching target tags the intended allow rule no longer applies; default ingress is otherwise denied. That makes this a concrete connectivity failure, not speculation. I did not find evidence for the second part of the hypothesis about new broad shared-prod rules causing overexposure, because no such firewall rules are present in the queried live state. But the primary outage mechanism is strongly supported: the tag change detaches both instances from their existing service-specific ingress and health-check allow rules.

✔ Hypothesis proven


Overly permissive and shared EC2 security group exposing instances and impacting segmentation

Observations 6

Hypothesis

EC2 instances remain exposed to the internet via overly permissive security group ingress rules, and updates to instances or shared security groups can change network behavior and exposure. Instance i-0464c4413cb0c54aa is in a private subnet (subnet-09605cfe202ef69e7, route table rtb-0fd627aea94dee6ea) with ENI attachments using security group sg-0437857de45b640ce, which allows SSH (22) and HTTP (80) from 0.0.0.0/0. Although recent instance updates showed empty diffs, changes to AMI, ENIs, or security group attachments could modify whether the private IP 10.0.1.245 becomes publicly reachable, or alter security group enforcement, potentially violating segmentation and exposure best practices. Because sg-0437857de45b640ce is shared with other instances (e.g., i-060c5af731ee54cc9) and associated storage (e.g., vol-07b1dfd1747a0c716), changes to rules or attachments on one instance can inadvertently impact network connectivity and access paths for other instances and their volumes, risking service disruption and data unavailability.

Recommended:

  1. Restrict sg-0437857de45b640ce to least-privilege CIDR ranges.
  2. Avoid sharing broadly permissive SGs across unrelated instances.
  3. Validate that instance updates do not alter subnet placement, routing, or SG associations in ways that expose private IPs or break segmentation.
  4. Ensure backups/snapshots exist for volumes like vol-07b1dfd1747a0c716 in case connectivity is lost.
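
Recommendation (1) above amounts to replacing the 0.0.0.0/0 ingress with scoped sources. A sketch only; the internal CIDR is an assumption that would need to come from the security team, and it presumes the rule is (or becomes) Terraform-managed:

```hcl
# Scoped replacement for the open SSH ingress on sg-0437857de45b640ce.
resource "aws_security_group_rule" "ssh_from_internal" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["10.0.0.0/16"] # assumed internal range, not 0.0.0.0/0
  security_group_id = "sg-0437857de45b640ce"
}
```

An equivalent scoped rule would replace the open port 80 ingress, ideally sourced from the ALB's security group rather than a CIDR.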

Investigation

I treated the concern area as unintended internet exposure and segmentation breakage for the EC2 instances using shared security group sg-0437857de45b640ce, especially 540044833068.eu-west-2.ec2-instance.i-0464c4413cb0c54aa and 540044833068.eu-west-2.ec2-instance.i-060c5af731ee54cc9. Per the required process, I first checked relevant organizational knowledge. That guidance confirms the current environment already violates policy in two ways: the shared security group allows 22 and 80 from 0.0.0.0/0, and one instance currently has both a public IP and that permissive shared SG, which the security team calls critical. The quick reference also notes shared SGs are intentionally high-fanout and changes to them affect many instances.

However, the question is whether this specific change creates a real additional risk in that concern area. The actual planned diffs for both modified EC2 instances only change public_ip and public_dns from concrete current values to Terraform-computed (known after apply). There is no planned change to subnet, route table, network interface attachment, security group association, or any security group rule. Blast-radius data shows i-0464c4413cb0c54aa is already in public subnet subnet-07b5b1fb2ba02f964 with public IP 18.175.147.19, while i-060c5af731ee54cc9 is in private subnet subnet-09605cfe202ef69e7 with no public IP. The hypothesis incorrectly states that i-0464c4413cb0c54aa is the private 10.0.1.245 instance; that private address actually belongs to i-060c5af731ee54cc9. I also verified AWS documentation: security-group rule changes propagate automatically to all attached instances, but this plan does not modify those rules; and the Terraform aws_instance public_ip field is provider-computed, so a diff to (known after apply) by itself does not show that public reachability is being added or removed.

So there is a real pre-existing security finding in the current infrastructure, but I found no evidence that this change alters network exposure, breaks segmentation, or disrupts connectivity. The hypothesis is based on a valid concern area but not on an actual risky change in this plan. Therefore is_risk_real is false for this change, even though the underlying environment should still be remediated separately.

✖ Hypothesis disproven


Compute identity and instance updates impacting ALB health, public IP consumers, and observability

Observations 13

Hypothesis

Planned or recent compute identity and network changes introduce several availability, exposure, and observability risks across EC2 and GCE workloads. For AWS, EC2 instance i-09d6479fb9b97d123 remains a target in ALB api-207c90ee-alb / target group api-207c90ee-tg; updates that cause downtime, replacement, or IP changes can trigger target health-check failures, deregistration, and UnHealthyHostCount increases, impacting public endpoint availability (including DNS for api-207c90ee-alb and public IP 18.171.86.44). If the instance uses direct public IPs, rotations or associations during the update can also break IP-bound consumers, and direct public IP exposure is an anti-pattern relative to placing workloads strictly behind the ALB. Observation of both AWS EC2 hosts changing public DNS/IP identities at roughly the same time compounds this: any external allowlists, monitoring probes, or emergency access workflows that key on the old addresses will fail for both hosts, creating a coordinated outage window if traffic is not fronted by a stable DNS name or load balancer. For GCP and AWS combined, concurrent changes to network-facing identifiers (public IP/DNS on EC2, network tags on GCE) can blind monitoring, CMDBs, compliance scanners, and automation keyed on those identifiers, reducing visibility and incident response precisely while connectivity and firewall behavior are in flux.

Recommended:

  1. Stagger public IP/DNS and target updates so at least one stable endpoint remains.
  2. Ensure all consumers use stable DNS/load-balancer front doors rather than hardcoded IPs.
  3. Confirm ALB target registration/deregistration procedures and multi-AZ or multi-target redundancy.
  4. Audit monitoring, CMDB, and access-control systems for dependencies on old IPs, DNS names, and tags, and update them in lockstep with the changes.
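
Recommendation (2), fronting consumers with a stable name rather than instance IPs, can be expressed as a Route 53 alias record pointing at the ALB, so public IP churn on the instances never reaches clients. A sketch; the hosted zone, record name, and the assumption that the ALB is managed as aws_lb.api are all hypothetical:

```hcl
resource "aws_route53_record" "api" {
  zone_id = var.zone_id       # assumed hosted zone ID
  name    = "api.example.com" # assumed record name
  type    = "A"

  # Alias to the ALB: clients resolve the ALB's current addresses,
  # never an instance's ephemeral public IP.
  alias {
    name                   = aws_lb.api.dns_name
    zone_id                = aws_lb.api.zone_id
    evaluate_target_health = true
  }
}
```
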

Investigation

I treated the concern area as potential availability loss from EC2 public identity changes and GCE tag updates, with secondary concern about observability blind spots. I first checked relevant organizational knowledge. That knowledge does confirm two background issues in the current environment: production EC2 instances should not have public IPs, and direct public exposure is an anti-pattern; it also emphasizes multi-AZ redundancy and monitoring. However, the task here is to determine whether this specific change creates a new concrete failure.

The diffs show only four in-place updates: both AWS EC2 instances have public_ip and public_dns changing to (known after apply), and the two GCE instances only change network tags from inventory-api/payments-api to svc-inventory-api-prod/svc-payments-api-prod. I queried the current blast radius state to verify the actual topology. The ALB api-207c90ee-alb is internet-facing in two AZs and serves target group api-207c90ee-tg, but that target group currently contains only one registered target: i-09d6479fb9b97d123, which is healthy. The public IPs called out in the hypothesis, 18.171.86.44 and 35.179.190.100, are not attached to either EC2 instance at all; they are ALB-managed EIPs on load balancer ENIs. So the hypothesis’s specific public-endpoint mechanism is wrong: this change does not rotate the ALB’s public addresses or its DNS name.

For the AWS instances, I checked provider/cloud behavior with documentation. AWS documents that auto-assigned public IPv4 addresses are released on stop/hibernate and a new one is assigned on start, and that persistent public addressing requires an Elastic IP. That means the plan output is consistent with possible public IP churn if Terraform performs a stop/start. But there is no evidence here that Terraform is replacing the instance, deregistering it from the target group, changing its private IP, changing security groups, or changing ALB/target group configuration. The only planned diff visible is on computed public identity fields. Because the target group is target_type = instance, ALB health and registration depend on the instance ID and private reachability on port 80, not the instance’s ephemeral public DNS/IP. Since those are unchanged in the plan, I found no concrete mechanism by which the ALB would lose the target solely because the instance’s public IP/DNS changes.

For GCE, I checked Google documentation: instance properties can be updated in place, and updates only cause disruption when the changed property requires restart. The plan already classifies these as updated, not replaced, and nothing in the evidence shows these tag edits require recreation or restart. I found no firewall resources or monitoring resources in the change itself that still reference the old tags, so claiming firewall breakage or observability loss would be speculative.

There is a real underlying architecture smell in the current state: both EC2 instances have public IPs in a public subnet, one has SSH open to 0.0.0.0/0, and the ALB target group has only a single registered target despite the ALB spanning two AZs. Those are genuine pre-existing concerns, and some violate organizational standards. But they are not introduced by this change. After active investigation, I found no strong evidence that the specific planned updates will themselves cause ALB unhealthiness, public endpoint breakage, or monitoring blindness. Therefore the hypothesis does not represent a real change-induced risk.

✖ Hypothesis disproven


💥 Blast Radius

Items 52

Edges 122

Copy link
Copy Markdown

@github-actions bot left a comment


Overmind

⛔ Auto-Blocked


🔴 Decision

Found 1 high risk requiring review


📊 Signals Summary

Routine 🔴 -5


🔥 Risks Summary

High 1 · Medium 0 · Low 0


💥 Blast Radius

Items 52 · Edges 122


View full analysis in Overmind ↗

@dylanratcliffe dylanratcliffe deleted the platform/naming-convention-update-20260401-145435 branch April 3, 2026 16:41
