model|risks_v6 ✨Encryption Key State Risk✨KMS Key Creation
🔴 Change Signals
Routine 🔴 ▇▅▃▂▁ Multiple compute resources showing unusual infrequent changes at 1 event/week for the last 2-3 months, which is rare compared to typical patterns.
Renamed instance network tags will detach both API VMs from their ingress and health-check firewall rules ‼️ High
The change renames the network tags on inventory-api and payments-api from inventory-api/payments-api to svc-inventory-api-prod/svc-payments-api-prod, but the corresponding GCP firewall rules are not being updated in the same plan. The live firewall rules inventory-api-health-check, inventory-api-ingress, payments-api-health-check, and payments-api-ingress still target only the old tags, so once the instances lose those tags the rules will stop applying to the VMs.
That will remove the allow path for Google health-check probes on port 9090 and for internal service ingress on 8080/9090 or 443, depending on the service. The instances will keep the allow-ssh tag, so IAP SSH remains possible, but application traffic and health checks will be blocked, causing failed health checks and loss of service reachability for both APIs.
🧠 Reasoning · ✔ 1 · ✖ 1
Firewall targeting broken by instance tag changes
Observations 2
Hypothesis
Changing Compute Engine instance network tags from application-specific values (e.g., 'inventory-api', 'payments-api') to new values (e.g., 'svc-inventory-api-prod', 'svc-payments-api-prod') can break firewall rules that target instances by the old tags. Existing rules such as payments-api-health-check and payments-api-ingress, which rely on targetTags='payments-api' (and their inventory-api counterparts, which rely on targetTags='inventory-api'), will no longer match, potentially blocking health checks and ingress traffic. This creates a high-severity availability and networking risk (service downtime, failing health checks, loss of service reachability) unless firewall targetTags and instance tags are updated together, in line with network security best practices (SEC05-BP02/REL02-BP01).
Investigation
I treated the concern area as availability loss caused by GCP firewall rules no longer selecting the intended VM instances after their network tags change. I first loaded the relevant organizational knowledge: network-security guidance and security-compliance requirements, plus the quick reference. Those documents reinforce that network access controls are a meaningful risk area, but they did not introduce any exception that would make this tag change safe.
I then checked the actual planned diffs. The only GCP changes in this plan are updates to the two Compute Engine instances: overmind-terraform-example.europe-west2-b.gcp-compute-instance.payments-api changes its tag from payments-api to svc-payments-api-prod, and overmind-terraform-example.europe-west2-b.gcp-compute-instance.inventory-api changes its tag from inventory-api to svc-inventory-api-prod. There are no firewall resources in the planned changes list, so nothing in this change updates firewall selectors to match the new tags.
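The check described above can be approximated against the JSON form of a plan (the output of `terraform show -json`). The following is a minimal sketch, assuming the standard `resource_changes` layout and the Google provider attribute name `tags` on `google_compute_instance`; the resource addresses and the sample plan are hypothetical, not taken from this change.

```python
# Sketch only: flags instance tags that a plan removes and reports whether the
# same plan touches any firewall resource. The sample data is hypothetical.

def removed_instance_tags(plan):
    """Map each changed instance address to the tags it would lose."""
    removed = {}
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "google_compute_instance":
            continue
        change = rc.get("change", {})
        before = set((change.get("before") or {}).get("tags") or [])
        after = set((change.get("after") or {}).get("tags") or [])
        gone = before - after
        if gone:
            removed[rc["address"]] = sorted(gone)
    return removed


def firewall_changes(plan):
    """List firewall resources the plan actually modifies."""
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if rc.get("type") == "google_compute_firewall"
        and rc.get("change", {}).get("actions") != ["no-op"]
    ]


# Hypothetical plan fragment mirroring the shape of the change described above.
sample_plan = {
    "resource_changes": [
        {
            "address": "google_compute_instance.payments_api",
            "type": "google_compute_instance",
            "change": {
                "actions": ["update"],
                "before": {"tags": ["allow-ssh", "payments-api"]},
                "after": {"tags": ["allow-ssh", "svc-payments-api-prod"]},
            },
        }
    ]
}

lost = removed_instance_tags(sample_plan)
touched = firewall_changes(sample_plan)
print(lost)
print(touched)
```

A plan that removes tags while `firewall_changes` returns an empty list is only a prompt to inspect the live firewall rules; it does not by itself show that any rule depends on the removed tags.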
To verify the current state, I queried both instances in the blast radius and confirmed they currently carry the old tags (inventory-api and payments-api). I then queried the live GCP firewall resources directly. The current firewalls inventory-api-health-check and inventory-api-ingress both use targetTags:["inventory-api"]; payments-api-health-check and payments-api-ingress both use targetTags:["payments-api"]; and the shared SSH rule platform-allow-ssh-iap uses targetTags:["allow-ssh"]. This means the SSH path remains intact because allow-ssh is unchanged, but the service-specific ingress and health-check rules are still bound to the old tags.
I also verified the product behavior in Google Cloud documentation. GCP documents that a firewall rule with targetTags applies only to instances whose network tags match at least one of those target tags, and that network tags are how firewall rules are targeted to VMs. Because the instances will stop carrying inventory-api and payments-api after apply, those four service-specific firewall rules will no longer apply to the corresponding VMs. This is direct evidence of the hypothesized mechanism, not speculation.
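The matching behavior described above can be illustrated with a small model: a rule that specifies target tags applies to an instance only when the instance carries at least one of them. This is a sketch of that logic, not GCP's implementation; the `FirewallRule` class and helper names are hypothetical, and the instance tag lists are assumed from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirewallRule:
    name: str
    target_tags: frozenset  # models only rules that specify target tags


def rule_applies(rule, instance_tags):
    """True when the instance carries at least one of the rule's target tags."""
    return bool(rule.target_tags & set(instance_tags))


def applicable(rules, instance_tags):
    return [r.name for r in rules if rule_applies(r, instance_tags)]


# Rule names and target tags as reported in the investigation.
RULES = [
    FirewallRule("inventory-api-health-check", frozenset({"inventory-api"})),
    FirewallRule("inventory-api-ingress", frozenset({"inventory-api"})),
    FirewallRule("platform-allow-ssh-iap", frozenset({"allow-ssh"})),
]

# Assumed tag sets for the inventory-api VM before and after the rename.
before = applicable(RULES, ["inventory-api", "allow-ssh"])
after = applicable(RULES, ["svc-inventory-api-prod", "allow-ssh"])

print(before)
print(after)
```

Under these assumptions, only the SSH rule still selects the VM after the rename, which matches the conclusion that the service-specific rules stop applying.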
So the risk is real: after the tag rename, health-check traffic from Google probe ranges and internal ingress traffic from 10.0.0.0/8 to the application ports will lose the allow rules that currently target those VMs. The retained allow-ssh tag only preserves IAP SSH access and does not mitigate service traffic loss. This matches the hypothesis's concern area and causal chain closely enough to conclude a real, high-severity availability/networking risk.
✔ Hypothesis proven
ALB target health, NAT/EIP, and availability risk from EC2 instance updates
Observations 16
Hypothesis
Updating EC2 instance i-08c49ffb6f2242b1e (reboot, replacement, configuration change, or AMI change) can temporarily make the instance unavailable or cause deregistration or failed health checks in its ALB target groups (e.g., api-207c90ee-tg) that serve services in private subnet 10.0.101.0/24. This reduces ALB backend capacity, can trigger UnHealthyHostCount CloudWatch alarms, and disrupts traffic routing through the ALB/listener/rules to subnet-hosted services. If workloads depend on this single instance, single AZ, or its specific private IP (10.0.1.245, 10.0.101.182) or an associated EIP/public endpoint (e.g., 13.42.93.249 via eni-0c502e5a8c20f4df7), the update can cause broader outages and violate high-availability and segmentation best practices (REL02-BP03, SEC05-BP01/REL02-BP01). Related risks include:
Single-AZ or insufficient NAT Gateway redundancy (e.g., nat-0bcff9aa2633b680e) if NAT is zonal and colocated with this instance or depends on it for routing/management.
Direct-IP or instance-specific DNS dependencies rather than using load-balanced endpoints, increasing the blast radius of instance replacement.
Use of an older or insufficiently hardened AMI during replacement, creating compute hardening and security gaps.
Mitigations: ensure multi-AZ and multi-instance redundancy for ALB targets and NAT gateways, avoid direct reliance on instance IPs/EIPs as public endpoints, validate AMI hardening and patch level before rollout, and monitor/delay changes until health checks and target registration are stable (REL02/SEC05).
Investigation
I investigated the concern area as possible service disruption or broader networking impact from the update to 540044833068.eu-west-2.ec2-instance.i-08c49ffb6f2242b1e. I first checked organizational guidance relevant to this hypothesis: aws-high-availability, aws-network-security, aws-compute-configuration, and security-compliance-requirements. Those files do establish that single-target ALBs, single-AZ deployments, public EC2 instances, and non-hardened compute can be risks, but they only count as actionable here if this specific change creates or worsens those conditions.
The planned diff for the changed instance is extremely narrow: only public_dns and public_ip move from concrete values to (known after apply). Terraform and HashiCorp documentation indicate that computed attributes often become unknown during planning even when there is no explicit functional configuration change, and provider schemas may show updated computed values as unknown until apply. AWS documentation also shows that public IPv4/DNS values are assigned by EC2 and may change on stop/start events for non-EIP-backed instances. There is no evidence in the diff of an AMI change, subnet move, security group change, instance type change, target group change, listener change, route change, NAT change, or replacement action for this instance.
The target group api-207c90ee-tg currently has exactly this instance registered on port 80, and target health is currently healthy; the ALB spans two AZs, but this target group currently contains only this one backend, so any reboot of the instance would reduce capacity to zero. However, the hypothesis claims that this change itself is such an update. The evidence does not show that. It shows only plan-time recomputation of public_ip/public_dns for a public instance.
Likewise, the NAT/EIP concern is not supported: the NAT gateway EIP 13.42.93.249 is attached to eni-0c502e5a8c20f4df7, which is the NAT gateway’s own managed interface, not the EC2 instance being changed, and the private route tables point to dedicated NAT gateways in each AZ. So the change to i-08c49ffb6f2242b1e does not threaten NAT availability.
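The "only computed attributes changed" observation can be triaged mechanically from the plan JSON. The sketch below assumes the documented `before`, `after`, and `after_unknown` fields of a `resource_changes` entry, where a value that is unknown until apply is omitted from `after` and marked `true` in `after_unknown`. It inspects top-level attributes only, and the sample values are hypothetical (including the documentation-range IP address).

```python
def changed_keys(change):
    """Top-level attributes that differ or become unknown in a planned change."""
    before = change.get("before") or {}
    after = change.get("after") or {}
    unknown = change.get("after_unknown") or {}
    differing = set()
    for key in set(before) | set(after) | set(unknown):
        if unknown.get(key) is True:
            differing.add(key)  # value is only known after apply
        elif before.get(key) != after.get(key):
            differing.add(key)  # concrete value changes in the plan
    return differing


def only_computed_changes(change):
    """True when every differing attribute is merely unknown until apply."""
    unknown = change.get("after_unknown") or {}
    diff = changed_keys(change)
    return bool(diff) and all(unknown.get(k) is True for k in diff)


# Hypothetical change resembling the diff described above.
recompute_only = {
    "actions": ["update"],
    "before": {
        "ami": "ami-0example",
        "instance_type": "t3.micro",
        "public_ip": "203.0.113.10",
        "public_dns": "ec2-203-0-113-10.example.invalid",
    },
    "after": {"ami": "ami-0example", "instance_type": "t3.micro"},
    "after_unknown": {"public_ip": True, "public_dns": True},
}

# Hypothetical contrast case: the AMI also changes, so the diff is functional.
with_ami_change = {
    "actions": ["update"],
    "before": {"ami": "ami-0example", "public_ip": "203.0.113.10"},
    "after": {"ami": "ami-0other"},
    "after_unknown": {"public_ip": True},
}

print(sorted(changed_keys(recompute_only)))
print(only_computed_changes(recompute_only))
print(only_computed_changes(with_ami_change))
```

A `True` result narrows what needs review but does not prove the apply is non-disruptive; whether the provider stops or restarts the instance still has to be confirmed separately.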
I also found background hygiene issues in the current environment: i-08c49ffb6f2242b1e is in a public subnet with a public IP, which conflicts with the security knowledge file, and the ALB target group appears to have only one registered target, which is a general availability weakness. But those are pre-existing architectural risks, not risks introduced by this Terraform change. Because the actual diff does not show a disruptive instance mutation and the NAT/EIP dependency is disproven by the current topology, I do not find strong evidence that this change poses the outage mechanism described by the hypothesis.
Summary
svc-{name}-prod convention across all service deployments.
Context
Changes
Testing