A reliability analytics tool for CI and infrastructure failures — classifies logs, fingerprints recurring incident signatures, tracks historical recurrence, detects anomaly patterns, and generates release-risk summaries through an API, CLI, CI workflow, and dashboard.
AutoOps-Insight takes raw failure logs and turns them into structured, actionable reliability intelligence. Rather than simply labeling a log as "timeout", it produces a structured incident artifact with severity, likely cause, remediation steps, ownership, and a stable fingerprint for tracking recurrence over time.
The system answers questions like:
- Has this failure happened before, and how often?
- Is this build environment risky enough to block a release?
- What failure patterns are dominating recent CI runs?
- Which recurring signatures should the team prioritize?
```
Log Input
   │
   ├── Rule-Based Detection (deterministic patterns)
   └── ML-Assisted Classification (TF-IDF + Logistic Regression)
   │
   ▼
Structured Incident Analysis
(severity, signature, cause, owner, release-blocking flag)
   │
   ▼
SQLite Persistence
   │
   ┌──────┴──────┐
   │             │
History API    Reports
Recurrence     (JSON + Markdown)
Detection      Release-Risk Score
```
Each log upload produces a full incident record — not just a label:
| Field | Description |
|---|---|
| `predicted_issue` | Failure type (e.g. `timeout`, `oom`, `flaky_test_signature`) |
| `confidence` | ML classification confidence |
| `failure_family` | Normalized operational category |
| `severity` | low / medium / high / critical |
| `signature` | Stable fingerprint for recurrence tracking |
| `summary` | Human-readable incident summary |
| `likely_cause` | Taxonomy-based likely cause hint |
| `first_remediation_step` | What to check first |
| `next_debugging_action` | Suggested follow-up |
| `probable_owner` | Probable service/team ownership hint |
| `release_blocking` | Whether this should gate a release |
| `evidence` | Supporting log lines |
| `recurrence` | How many times this signature has appeared |
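The incident record can be mirrored as a plain dataclass sketch. The real schema is a Pydantic model in `schemas/incident.py`; the field types here are assumptions based on the descriptions above.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    """Illustrative mirror of the incident fields; not the project's actual schema."""
    predicted_issue: str        # e.g. "timeout", "oom", "flaky_test_signature"
    confidence: float           # ML classification confidence
    failure_family: str         # normalized operational category
    severity: str               # low / medium / high / critical
    signature: str              # stable fingerprint, e.g. "timeout:733da8a4e20740af"
    summary: str
    likely_cause: str
    first_remediation_step: str
    next_debugging_action: str
    probable_owner: str
    release_blocking: bool
    evidence: list              # supporting log lines
    recurrence: int = 1         # occurrences of this signature so far
```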
Each incident gets a stable, normalized signature like `timeout:733da8a4e20740af`. This enables cross-run recurrence tracking — the system knows when two failures are the same underlying issue regardless of log noise.
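A minimal sketch of how such a fingerprint can be derived — strip volatile tokens (IDs, durations, addresses), then hash the normalized text. The project's actual normalization in `analysis/signatures.py` likely differs in detail; this just illustrates the idea.

```python
import hashlib
import re

def fingerprint(family: str, log_line: str) -> str:
    """Produce a stable '<family>:<hash>' signature (illustrative normalization)."""
    norm = log_line.lower()
    norm = re.sub(r"0x[0-9a-f]+", "<addr>", norm)   # hex addresses
    norm = re.sub(r"\d+", "<num>", norm)            # PIDs, ports, durations
    norm = re.sub(r"\s+", " ", norm).strip()        # collapse whitespace
    digest = hashlib.sha256(norm.encode()).hexdigest()[:16]
    return f"{family}:{digest}"
```

Because volatile tokens are replaced before hashing, two logs that differ only in noise hash to the same signature.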
Results are persisted in SQLite. The system tracks:
- Total occurrence count per signature
- First and last seen timestamps
- Whether a signature qualifies as recurring
- Failure family distribution over time
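The per-signature bookkeeping above can be sketched as a single SQLite upsert. Table and column names here are illustrative assumptions, not the project's actual schema in `storage/history.py`.

```python
import sqlite3
import time

def record_occurrence(conn: sqlite3.Connection, signature: str, family: str) -> None:
    """Upsert one occurrence: bump the count and refresh last_seen (illustrative schema)."""
    now = time.time()
    conn.execute("""
        CREATE TABLE IF NOT EXISTS signatures (
            signature  TEXT PRIMARY KEY,
            family     TEXT,
            count      INTEGER,
            first_seen REAL,
            last_seen  REAL
        )""")
    conn.execute("""
        INSERT INTO signatures (signature, family, count, first_seen, last_seen)
        VALUES (?, ?, 1, ?, ?)
        ON CONFLICT(signature) DO UPDATE SET
            count = count + 1,
            last_seen = excluded.last_seen
    """, (signature, family, now, now))
    conn.commit()
```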
The report engine aggregates stored history into a release-risk summary (low / medium / high / critical) based on:
- Presence of release-blocking incidents
- Recurring signature concentration
- Anomaly flags (e.g. one signature accounts for 80% of recent failures)
- Window comparison vs. baseline blocker rate
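The four signals above can be combined into a coarse risk level roughly like the sketch below. The weights and thresholds are assumptions for illustration, not the project's actual scoring in `reports/renderer.py`.

```python
def release_risk(has_blockers: bool, top_signature_share: float,
                 recurring_count: int, blocker_rate: float,
                 baseline_rate: float) -> str:
    """Fold the heuristics into low/medium/high/critical (illustrative thresholds)."""
    score = 0
    if has_blockers:                        # release-blocking incidents present
        score += 3
    if top_signature_share >= 0.8:          # one signature dominates recent failures
        score += 2
    if recurring_count >= 3:                # recurring signature concentration
        score += 1
    if blocker_rate > 2 * baseline_rate:    # window vs. baseline blocker rate
        score += 1
    if score >= 6:
        return "critical"
    if score >= 3:
        return "high"
    if score >= 1:
        return "medium"
    return "low"
```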
Heuristic-based flags that surface meaningful signals without pretending to statistical sophistication the data can't support:
- Signature concentration spike
- High-count recurring failures
- Family-level spikes
- Release blocker saturation
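The first two flags can be sketched as simple counting over a recent window. The 80% concentration threshold and the recurrence cutoff are illustrative assumptions.

```python
from collections import Counter

def anomaly_flags(recent_signatures: list, spike_share: float = 0.8,
                  high_count: int = 5) -> list:
    """Flag concentration spikes and high-count recurring signatures (illustrative)."""
    flags = []
    total = len(recent_signatures)
    if total == 0:
        return flags
    counts = Counter(recent_signatures)
    top_sig, top_n = counts.most_common(1)[0]
    if top_n / total >= spike_share:         # e.g. one signature is 80%+ of failures
        flags.append(f"signature_concentration_spike:{top_sig}")
    for sig, n in counts.items():
        if n >= high_count:
            flags.append(f"high_count_recurring:{sig}")
    return flags
```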
Full FastAPI backend with endpoints for:

- `POST /analyze` — analyze a log, persist the result
- `GET /history/recent` — recent incident list
- `GET /history/recurring` — top recurring signatures
- `GET /history/signature/{signature}` — recurrence detail for one signature
- `GET /history/analysis/{analysis_id}` — stored incident detail
- `GET /reports/summary` — structured release-risk summary (JSON)
- `GET /reports/markdown` — human-readable markdown report
- `POST /reports/generate` — write report artifacts to disk
- `GET /metrics` — Prometheus counters
- `GET /healthz` — health check
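A stdlib-only sketch of calling `POST /analyze`. The request-body field name (`log`) is an assumption — check the live schema at the FastAPI `/docs` page before relying on it.

```python
import json
import urllib.request

def build_analyze_request(base_url: str, log_text: str) -> urllib.request.Request:
    """Build a POST /analyze request (the 'log' payload key is an assumption)."""
    payload = json.dumps({"log": log_text}).encode()
    return urllib.request.Request(
        f"{base_url}/analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage against a locally running server (not executed here):
# req = build_analyze_request("http://localhost:8000", open("sample.log").read())
# with urllib.request.urlopen(req) as resp:
#     incident = json.load(resp)
```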
Headless operation for CI and automation:
```bash
# Health check
python cli.py health

# Analyze a log file and persist it
python cli.py analyze sample.log

# Compact operator-style output (no JSON)
python cli.py analyze sample.log --no-print-json

# Generate release-risk report artifacts
python cli.py report
```

GitHub Actions workflow that:
- Runs CLI health check
- Analyzes sample logs automatically
- Generates markdown and JSON report artifacts
- Uploads report artifacts and SQLite DB for inspection
React frontend showing:
- Release risk score, total analyses, blocker count, recurring signatures
- Log upload with full incident breakdown
- Anomaly panel
- Recurring signatures table
- Recent analyses list
- Failure family distribution
- Markdown report preview
The classifier uses two layers:
Rule-based detection checks for deterministic patterns:
timeout · dns_failure · connection_refused · tls_failure · retry_exhausted · oom · flaky_test_signature · dependency_unavailable · crash_loop · latency_spike
ML fallback uses:
- TF-IDF vectorization
- Logistic Regression trained on labeled log data (`ml_model/log_train.csv`)
Each analysis record indicates whether rule-based detection or ML prediction was used.
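The two-layer flow can be sketched as: try deterministic patterns first, fall back to the model only when nothing matches, and record which layer answered. The patterns below are illustrative stand-ins for `classifiers/rules.py`.

```python
import re

# Illustrative subset of rule patterns; the real set lives in classifiers/rules.py
RULES = {
    "timeout": r"\btimed? ?out\b",
    "oom": r"out of memory|oom[- ]?kill",
    "connection_refused": r"connection refused",
}

def classify(log_text: str, ml_model=None):
    """Return (failure_family, source), where source records which layer decided."""
    for family, pattern in RULES.items():
        if re.search(pattern, log_text, re.IGNORECASE):
            return family, "rule"
    if ml_model is not None:                     # e.g. the TF-IDF + LogReg pipeline
        return ml_model.predict([log_text])[0], "ml"
    return "unknown", "none"
```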
Each failure family maps to reliability metadata:
| Family | Severity | Release Blocking |
|---|---|---|
| `timeout` | high | yes |
| `oom` | critical | yes |
| `connection_refused` | high | yes |
| `dns_failure` | high | yes |
| `flaky_test_signature` | medium | no / context-dependent |
| `retry_exhausted` | medium | yes |
| `crash_loop` | critical | yes |
| `dependency_error` | high | yes |
| `dependency_unavailable` | high | yes |
```
AutoOps-Insight/
├── main.py              # FastAPI application and API routes
├── cli.py               # Headless CLI for analysis and reporting
├── ml_predictor.py      # Structured incident analysis + ML-backed prediction
├── classifiers/
│   ├── rules.py         # Deterministic failure-family detection
│   └── taxonomy.py      # Severity, ownership, remediation metadata
├── analysis/
│   ├── formatter.py     # Incident summary formatting
│   ├── signatures.py    # Signature normalization and fingerprinting
│   ├── trends.py        # Trend/distribution/window analysis
│   └── anomalies.py     # Heuristic anomaly detection
├── storage/
│   └── history.py       # SQLite persistence and historical queries
├── reports/
│   ├── renderer.py      # Markdown/JSON report generation
│   └── generated/       # Generated report artifacts
├── schemas/
│   └── incident.py      # Pydantic incident schema
├── ml_model/
│   ├── log_train.csv    # Training data
│   ├── train_model.py   # Training script
│   └── log_model.pkl    # Trained model + vectorizer
├── autoops-ui/          # React/Vite dashboard
├── tests/               # Unit and API integration tests
└── .github/workflows/   # CI workflow
```
Install dependencies:

```bash
python -m pip install -r requirements.txt
```

Train or retrain the model:

```bash
cd ml_model
python train_model.py
cd ..
```

Start the API server:

```bash
uvicorn main:app --reload
```

Run the CLI:

```bash
python cli.py analyze sample.log
python cli.py report
```

Start the frontend:

```bash
cd autoops-ui
npm install
npm run dev
```

Run the tests:

```bash
python -m pytest -q
```

Current suite: 14 passing tests
Coverage includes:
- Deterministic rule detection
- Signature stability and normalization
- Trend and anomaly heuristics
- Markdown report rendering
- API integration for `/analyze`, `/history/recent`, `/history/recurring`, and `/reports/summary`
AutoOps-Insight supports four usage modes:
- API mode — upload logs and query history/report endpoints through FastAPI
- CLI mode — analyze logs and generate reports headlessly for CI or local workflows
- Dashboard mode — inspect release risk, recurring signatures, anomalies, and reports in the React UI
- CI mode — run sample analyses and upload report artifacts through GitHub Actions
Prometheus counters exposed at `/metrics`:

- `logs_processed_total`
- `predict_requests_total`
- `analyze_requests_total`
- `summarize_requests_total`
- `report_requests_total`
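For reference, these counters are served in the Prometheus text exposition format, which can be rendered with a few lines. The backend presumably uses a Prometheus client library; this sketch only shows the wire format the names above appear in.

```python
def render_metrics(counters: dict) -> str:
    """Render counters in the Prometheus text exposition format (minimal sketch)."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```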
- Multi-source ingestion from system logs, containers, or metrics agents
- Time-series anomaly detection with robust statistical baselines
- Deep root-cause inference
- Multi-tenant incident correlation
- Production-scale storage or querying
- Real release gating inside a deployment pipeline
- Learned summarization or recommendation models
SRE · Production Engineering · Release Engineering · Internal Tooling · Platform / Infrastructure