ci(railway): add Railway OSS deployment framework and preview environment CI #3787
…ment CI

Add complete Railway OSS deployment infrastructure:
- Bootstrap, configure, deploy, and smoke test scripts
- Nginx gateway with Railway IPv6 DNS resolver and dynamic proxy_pass
- Wrapper Dockerfiles for all 11 services (api, web, services, workers, cron, alembic, etc.)
- Preview lifecycle scripts (create/update, destroy, stale cleanup)
- Three GitHub Actions workflows for automated PR preview environments:
  - 06: build and push PR-tagged images to GHCR
  - 07: deploy preview environment and post URL as PR comment
  - 08: destroy on PR close + daily stale cleanup cron
- Design docs covering architecture, caveats, and phased rollout plan
The deploy job calls a reusable workflow that posts PR comments. The caller's permissions block must include pull-requests:write for the called workflow to use it via secrets:inherit.
Railway Preview Environment
Updated at 2026-02-19T20:11:09.961Z
ENV AGENTA_AUTH_KEY=0000000000000000000000000000000000000000000000000000000000000000
ENV AGENTA_CRYPT_KEY=1111111111111111111111111111111111111111111111111111111111111111
🚩 Hardcoded default auth/crypt keys in deploy-from-images.sh wrappers
The render_api_like_wrapper, render_api_wrapper, render_services_wrapper, render_web_wrapper, and render_alembic_wrapper functions in deploy-from-images.sh all hardcode AGENTA_AUTH_KEY=000... and AGENTA_CRYPT_KEY=111... as ENV defaults in the generated Dockerfiles. These are the same defaults used in configure.sh:9-10.
For preview environments this is acceptable since they're ephemeral. However, configure.sh also uses these as defaults for production deployments. If an operator runs configure.sh without setting AGENTA_AUTH_KEY/AGENTA_CRYPT_KEY, the deployment will use these well-known placeholder keys, which could be a security concern for non-preview deployments. The README and deployment notes don't explicitly warn about this.
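One way to act on this for non-preview deployments is a pre-flight guard that refuses to continue while the keys are still placeholders. The sketch below is hypothetical and not part of this PR: the function names `is_placeholder_key` and `require_real_keys` and the `PREVIEW` opt-out variable are assumptions made for illustration. It treats a key as a placeholder when it is empty or consists only of `0`s or only of `1`s, which matches the defaults quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical guard, not taken from configure.sh.

# Succeeds (returns 0) when the value is empty, all zeros, or all ones.
is_placeholder_key() {
  case "$1" in
    "") return 0 ;;      # unset or empty
    *[!0]*) ;;           # contains a character other than 0: keep checking
    *) return 0 ;;       # only zeros
  esac
  case "$1" in
    *[!1]*) return 1 ;;  # contains a character other than 1: a real value
    *) return 0 ;;       # only ones
  esac
}

# Fails when either key is a placeholder, unless PREVIEW=1 (assumed opt-out
# for ephemeral preview environments).
require_real_keys() {
  local status=0
  if [ "${PREVIEW:-0}" = "1" ]; then
    return 0
  fi
  if is_placeholder_key "${AGENTA_AUTH_KEY:-}"; then
    echo "error: AGENTA_AUTH_KEY is unset or a placeholder" >&2
    status=1
  fi
  if is_placeholder_key "${AGENTA_CRYPT_KEY:-}"; then
    echo "error: AGENTA_CRYPT_KEY is unset or a placeholder" >&2
    status=1
  fi
  return "$status"
}
```

A deployment script could call `require_real_keys || exit 1` before setting any variables, while the preview scripts would export `PREVIEW=1` to keep the current behavior.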
The Railway CLI uses --version flag, not a version subcommand.
render_api_like_wrapper worker-tracing '["python", "-m", "entrypoints.worker_tracing"]'
render_api_like_wrapper worker-evaluations '["python", "-m", "entrypoints.worker_evaluations"]'
render_api_like_wrapper cron '["cron", "-f"]'
📝 Info: Worker Dockerfiles use bare python but this is safe due to PATH in base image
The static Dockerfiles at hosting/railway/oss/worker-tracing/Dockerfile:14 and hosting/railway/oss/worker-evaluations/Dockerfile:14, as well as the dynamic wrappers generated by hosting/railway/oss/scripts/deploy-from-images.sh:163-164, all use bare python in their CMD. The deployment notes document Bug 3 where alembic failed because bare python resolved to the system python without packages.
However, this is not a bug for workers. The base image api/oss/docker/Dockerfile.gh sets PATH="/opt/venv/bin:${PATH}" in the runner stage, so bare python resolves to /opt/venv/bin/python with all packages. The alembic bug was specifically caused by using sh -lc (login shell), which sources /etc/profile and can reset PATH. Workers use exec-form CMD (["python", "-m", ...]) which doesn't invoke a shell, so PATH from the image ENV is preserved.
Railway CLI uses two different env vars:
- RAILWAY_TOKEN: project-scoped actions only
- RAILWAY_API_TOKEN: account/workspace-level actions (create/list/delete projects)

Our preview scripts need account-level access. Updated all scripts to accept either variable, and CI workflows to set RAILWAY_API_TOKEN.
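A minimal sketch of what "accept either variable" could look like in a script. The helper name `resolve_railway_token` and the precedence (account-level token first, project token as fallback) are assumptions; the commit message does not say which variable wins when both are set.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first available Railway token.
# Assumed precedence: RAILWAY_API_TOKEN, then RAILWAY_TOKEN.
resolve_railway_token() {
  if [ -n "${RAILWAY_API_TOKEN:-}" ]; then
    printf '%s\n' "$RAILWAY_API_TOKEN"
  elif [ -n "${RAILWAY_TOKEN:-}" ]; then
    printf '%s\n' "$RAILWAY_TOKEN"
  else
    echo "error: set RAILWAY_API_TOKEN (or RAILWAY_TOKEN)" >&2
    return 1
  fi
}
```

A caller might run `token="$(resolve_railway_token)" || exit 1` at the top of a script so a missing token fails early rather than partway through a deploy.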
Passes through COMPOSIO_API_KEY to the api service if set. Skipped silently if not provided.
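The "set if present, skip silently otherwise" behavior can be sketched as follows. Both function names are hypothetical: `set_service_var` stands in for however the script actually writes a variable to a Railway service, and here it only records the service and variable name so the example runs without the Railway CLI.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real "write a variable to a service" step.
# It records "service:NAME" and deliberately never logs the secret value.
APPLIED=""
set_service_var() {
  APPLIED="${APPLIED}$1:$2 "
}

# Forward an optional variable to a service; do nothing when it is empty.
set_optional_var() {
  local service="$1" name="$2" value="$3"
  [ -n "$value" ] || return 0   # not provided: skip silently
  set_service_var "$service" "$name" "$value"
}

set_optional_var api COMPOSIO_API_KEY "${COMPOSIO_API_KEY:-}"
```

Returning 0 on the skip path matters when the calling script runs under `set -e`, since a non-zero status there would abort the deploy for a variable that is meant to be optional.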
- preview-cleanup-stale.sh: use process substitution instead of pipe-to-while so DELETED/SKIPPED counters are not lost in subshell
- smoke.sh: propagate check_endpoint exit code after repair instead of unconditional return 0
- 06-railway-preview-build.yml: add path filters so docs-only PRs don't trigger full image builds and Railway deploys
- README.md: add security note about placeholder auth/crypt keys
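The first item is a common bash pitfall, illustrated below with made-up data. The project names and the `stale`/`fresh` labels are invented for the example, and the loop bodies are simplified stand-ins for the real cleanup logic.

```shell
#!/usr/bin/env bash
# Invented sample input: "<project name> <state>" per line.
projects=$'preview-pr-1 stale\npreview-pr-2 fresh\npreview-pr-3 stale'

# Pipe-to-while: bash runs the loop in a subshell by default, so the
# increments never reach the parent shell.
PIPE_DELETED=0
printf '%s\n' "$projects" | while read -r name state; do
  if [ "$state" = "stale" ]; then
    PIPE_DELETED=$((PIPE_DELETED + 1))
  fi
done

# Process substitution: the loop runs in the current shell, so the
# counters keep their values after the loop ends.
DELETED=0
SKIPPED=0
while read -r name state; do
  if [ "$state" = "stale" ]; then
    DELETED=$((DELETED + 1))
  else
    SKIPPED=$((SKIPPED + 1))
  fi
done < <(printf '%s\n' "$projects")

echo "pipe: deleted=$PIPE_DELETED"
echo "process substitution: deleted=$DELETED skipped=$SKIPPED"
```

Under bash's default settings this prints `pipe: deleted=0` followed by `process substitution: deleted=2 skipped=1`. Process substitution is a bash feature, so the script needs a bash shebang rather than plain `sh`.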
created_at="$(printf "%s" "$project" | jq -r '.createdAt')"

# Parse ISO 8601 timestamp to epoch seconds.
created_epoch="$(date -d "$created_at" +%s 2>/dev/null || date -j -f "%Y-%m-%dT%H:%M:%S" "${created_at%%.*}" +%s 2>/dev/null || echo 0)"
🟡 Stale preview cleanup uses project creation time instead of last-update time, deleting active previews
The preview-cleanup-stale.sh script determines staleness by comparing a project's createdAt timestamp against the max age threshold. Since preview-create-or-update.sh updates preview environments in-place (it calls bootstrap.sh which links to the existing project rather than recreating it), the createdAt field never changes even when a preview receives new deploys.
Root Cause and Impact
At hosting/railway/oss/scripts/preview-cleanup-stale.sh:37, the script extracts createdAt:
created_at="$(printf "%s" "$project" | jq -r '.createdAt')"
And at lines 65-66 the jq filter passes through only createdAt:
'.[] | select(.name | startswith($prefix)) | {name: .name, createdAt: .createdAt}'
This means: if a PR is opened and its preview environment is created, then the developer actively pushes new commits over the next 2 days, the daily cron (running at 06:00 UTC with default 24h TTL) will delete the preview environment because createdAt is >24h old — even though the preview was just updated minutes ago.
The preview will be recreated on the next push (the build workflow chains to deploy), but there's a window where the preview URL returns nothing, causing confusion for reviewers who click the link.
The fix should use updatedAt instead of createdAt to reflect actual activity on the project.
created_at="$(printf "%s" "$project" | jq -r '.createdAt')"
# Parse ISO 8601 timestamp to epoch seconds.
created_epoch="$(date -d "$created_at" +%s 2>/dev/null || date -j -f "%Y-%m-%dT%H:%M:%S" "${created_at%%.*}" +%s 2>/dev/null || echo 0)"
name="$(printf "%s" "$project" | jq -r '.name')"
updated_at="$(printf "%s" "$project" | jq -r '.updatedAt')"
# Parse ISO 8601 timestamp to epoch seconds.
created_epoch="$(date -d "$updated_at" +%s 2>/dev/null || date -j -f "%Y-%m-%dT%H:%M:%S" "${updated_at%%.*}" +%s 2>/dev/null || echo 0)"
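The staleness check itself can be isolated into two small functions, sketched below under stated assumptions: the project JSON exposes `updatedAt` as the review suggests, timestamps are UTC in the `2026-02-19T20:11:09.961Z` form, and the names `iso_to_epoch` and `is_stale` are hypothetical. One deliberate difference from the snippet above: an unparseable timestamp is treated as "not stale" here, whereas the `echo 0` fallback would make such a project look maximally old and therefore eligible for deletion.

```shell
#!/usr/bin/env bash
# Convert a UTC ISO 8601 timestamp to epoch seconds; prints 0 on failure.
iso_to_epoch() {
  local ts="$1"
  if [ -z "$ts" ]; then
    echo 0
    return 0
  fi
  ts="${ts%%.*}"   # drop fractional seconds (and the Z that follows them)
  ts="${ts%Z}"     # drop a bare trailing Z
  # GNU date first, then BSD/macOS date, then 0 if neither can parse it.
  date -u -d "$ts" +%s 2>/dev/null \
    || date -u -j -f "%Y-%m-%dT%H:%M:%S" "$ts" +%s 2>/dev/null \
    || echo 0
}

# Succeeds when the project was last updated more than max_age_seconds ago.
is_stale() {
  local updated_at="$1" now_epoch="$2" max_age_seconds="$3"
  local updated_epoch
  updated_epoch="$(iso_to_epoch "$updated_at")"
  # Assumption: when the timestamp cannot be parsed, keep the project.
  if [ "$updated_epoch" -le 0 ]; then
    return 1
  fi
  [ $((now_epoch - updated_epoch)) -gt "$max_age_seconds" ]
}
```

With a 24h TTL, a caller would run something like `is_stale "$updated_at" "$(date +%s)" 86400 && delete_project "$name"`, where `delete_project` is again a placeholder for the script's real deletion step.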
Summary
hosting/railway/oss/

What's included

Deployment scripts (hosting/railway/oss/scripts/)
- bootstrap.sh -- create Railway project, services, volumes (idempotent)
- configure.sh -- set all environment variables per service
- deploy-from-images.sh -- full deploy flow from pre-built GHCR images
- smoke.sh -- health check validation for /w, /api/health, /services/health
- preview-create-or-update.sh -- create/update PR preview project
- preview-destroy.sh -- delete PR preview project
- preview-cleanup-stale.sh -- delete previews older than configurable TTL
- build-and-push-images.sh, deploy-gateway.sh, deploy-services.sh, init-databases.sh, upgrade.sh

Gateway (hosting/railway/oss/gateway/)
- Railway IPv6 DNS resolver ([fd12::10])
- proxy_pass for dynamic DNS re-resolution

CI Workflows (.github/workflows/)
- 06-railway-preview-build.yml -- build and push PR-tagged images to GHCR (Docker Buildx + GHA cache)
- 07-railway-preview-deploy.yml -- deploy preview and post URL as PR comment
- 08-railway-preview-cleanup.yml -- destroy on PR close + daily stale cleanup cron

Design docs (docs/design/railway-preview-environments/)

Testing
This PR itself tests the CI workflows. The build workflow should trigger on this PR, build the 3 images, then deploy a preview environment and post the URL as a comment.
Requires RAILWAY_TOKEN GitHub Actions secret (already configured).