An end-to-end machine learning trading system: an ensemble of Transformer models (LuTransformer, Informer), a hybrid RNN (LSTM-GRU), and LightGBM, with temperature calibration and Kelly Criterion position sizing, backed by a FastAPI live trading bot. Validated on SPY 5-minute bars.
- Four-model ensemble — two Transformer-based architectures (LuTransformer, Informer), a hybrid recurrent network (LSTM-GRU), and gradient boosting (LightGBM) — weighted per prediction horizon using squared AUC-PR scores
- Post-hoc temperature calibration (LBFGS) converts overconfident neural network logits into reliable probabilities — required for correct Kelly position sizing
- Cost-sensitive loss with an asymmetric cost matrix handles extreme class imbalance without oversampling
- Walk-forward cross-validation with expanding windows prevents temporal data leakage that standard k-fold CV introduces in time series
- Kelly Criterion ties probability calibration directly to position sizing — miscalibrated models automatically receive smaller allocations via a Brier score penalty term
- Automated feature selection pipeline combines mutual information, F-statistic, and pairwise redundancy filtering to reduce 223 raw features to the top 80 and auto-updates the training config
- Fault-tolerant training saves per-fold checkpoints so a multi-hour training run can resume from the last completed fold after any interruption
Holdout period: 2026-01-14 to 2026-02-19 (1,901 × 5-minute bars). Initial cash: $5,000. Kelly fraction: 1.0, conservatism: 1.0.
| Metric | Ensemble | Buy & Hold |
|---|---|---|
| Total Return | +0.36% | −0.37% |
| Outperformance | +0.73% | — |
| Sharpe Ratio | 0.61 | — |
| Max Drawdown | 1.44% | — |
| AUC-PR (H0) | 0.385 | — |
| Brier Score (H0) | 0.1134 | — |
| Accuracy | 67.8% | — |
| Total Trades | 163 | — |
Raw OHLCV (5-minute SPY)
│
▼
Feature Engineering ── 223 features
├── 158 QLib Alpha158 quantitative factors
├── ICT market structure (BOS, CHoCH, Order Blocks, FVG, Liquidity)
├── RSI, Momentum, Volatility, VWAP deviation
└── Time & volume features
│
▼
Walk-Forward Cross-Validation (3 folds, expanding window)
│
├── LSTM-GRU Hybrid (recurrent, captures sequential patterns)
├── LuTransformer (Transformer — custom dual-tower SRU encoder)
├── Informer (Transformer — ProbSparse attention, O(L log L))
└── LightGBM (gradient boosting on 16K flattened features)
│
▼
Temperature Calibration ── per-horizon T fitted via LBFGS
│
▼
Weighted Voting Meta-Learner ── squared AUC-PR weights per horizon
│
▼
Kelly Criterion Position Sizing
│
▼
Backtest / Live Trading API
Four models with complementary inductive biases are trained independently and combined into a weighted voting ensemble.
A custom hybrid architecture that stacks GRU, LSTM, and GRU layers sequentially. GRUs are efficient at capturing short-term recurrence while LSTMs excel at retaining long-range memory — combining both layers allows the model to exploit patterns at multiple time scales within the 5-minute bar sequence.
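The stacking idea can be sketched in PyTorch. This is a minimal illustration, not the repo's `LSTM_GRU.py`: the class name, layer sizes, and single-layer recurrences are assumptions.

```python
import torch
import torch.nn as nn

class GruLstmGru(nn.Module):
    """Illustrative GRU -> LSTM -> GRU stack (sizes are hypothetical)."""
    def __init__(self, n_features: int, hidden: int = 128, n_classes: int = 3):
        super().__init__()
        self.gru1 = nn.GRU(n_features, hidden, batch_first=True)   # short-term recurrence
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)      # long-range memory
        self.gru2 = nn.GRU(hidden, hidden, batch_first=True)       # re-summarize
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, seq_len, n_features)
        h, _ = self.gru1(x)
        h, _ = self.lstm(h)
        h, _ = self.gru2(h)
        return self.head(h[:, -1])         # classify from the last timestep

logits = GruLstmGru(n_features=80)(torch.randn(4, 205, 80))
print(logits.shape)  # torch.Size([4, 3])
```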
A custom adaptation of Hidformer (Liu et al., 2024) retargeted for short-term intraday classification. The original Hidformer uses a hierarchical dual-tower architecture with a segment-and-merge token mixer designed for long-term forecasting. This implementation replaces the frequency-domain tower with a Segmented Recurrent Unit (SRU) encoder that tokenizes the input sequence into fixed-length segments and pools them hierarchically, producing a compact token representation before attention. This reduces the quadratic cost of standard self-attention while preserving local structure — important for intraday market data where adjacent bars are highly correlated.
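The token-count reduction can be sketched with plain segment pooling. This is a simplification: the actual SRU encoder runs a recurrent unit over each segment and pools hierarchically, while this sketch only shows how 205 bars shrink to a handful of tokens before attention (segment length of 16 is illustrative).

```python
import numpy as np

def segment_tokens(x: np.ndarray, segment_length: int = 16) -> np.ndarray:
    """Pad the sequence to a multiple of segment_length, then mean-pool
    each segment into one token (token count = ceil(seq_len / segment_length))."""
    seq_len, n_feat = x.shape
    pad = (-seq_len) % segment_length              # 205 -> pad 3 -> 208
    x = np.pad(x, ((0, pad), (0, 0)))
    return x.reshape(-1, segment_length, n_feat).mean(axis=1)

tokens = segment_tokens(np.zeros((205, 80)))       # 205 bars -> 13 tokens
print(tokens.shape)  # (13, 80)
```

Attention over 13 tokens instead of 205 bars is what keeps the cost low while each token still summarizes a contiguous, locally correlated span of bars.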
A Transformer using ProbSparse self-attention, which approximates the full attention matrix in O(L log L) by selecting the queries with the highest contribution. The encoder also applies convolutional distillation between layers to reduce sequence length progressively. This makes it practical for long input windows (seq_len = 205 bars) without the memory cost of full attention.
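The query-selection heuristic behind ProbSparse attention can be sketched in NumPy. This is a simplification: the real Informer also subsamples keys when estimating the sparsity measure, and `c = 5.0` is an illustrative sampling factor.

```python
import numpy as np

def probsparse_top_queries(Q: np.ndarray, K: np.ndarray, c: float = 5.0) -> np.ndarray:
    """Score each query by max - mean of its scaled dot products with the keys,
    then keep only the top u = c * ln(L) dominant queries."""
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                  # (L, L) attention scores
    sparsity = scores.max(axis=1) - scores.mean(axis=1)
    u = min(L, int(np.ceil(c * np.log(L))))        # L=205 -> u=27
    return np.argsort(sparsity)[::-1][:u]

rng = np.random.default_rng(0)
idx = probsparse_top_queries(rng.normal(size=(205, 64)), rng.normal(size=(205, 64)))
print(len(idx))  # 27
```

Queries with a flat score distribution (max close to mean) contribute little beyond uniform attention, so dropping them approximates the full attention matrix at O(L log L) cost.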
Gradient boosted decision trees operating on the full flattened feature matrix (205 timesteps × 80 features = 16,400 inputs per sample). LightGBM provides a non-sequential baseline with built-in feature importance, regularization (L1 + L2), and column subsampling to prevent overfitting on high-dimensional inputs. Feature importance from LightGBM also drives the feature selection step that reduces 223 raw features to the top 80.
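The flattening step is a single reshape; shapes follow the text, and the LightGBM fit itself is omitted to keep the sketch dependency-free (the flat matrix would feed `lightgbm.LGBMClassifier`).

```python
import numpy as np

# Hypothetical batch: 32 samples, each a 205-timestep x 80-feature window
X_seq = np.random.default_rng(0).normal(size=(32, 205, 80))

# One row per sample: 205 * 80 = 16,400 columns, ready for gradient boosting
X_flat = X_seq.reshape(len(X_seq), -1)
print(X_flat.shape)  # (32, 16400)
```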
Each base model receives a per-horizon weight based on its validation AUC-PR score, squared to amplify differences between strong and weak models:
w(i, h) = score(i, h)² / Σⱼ score(j, h)²
Squaring preserves relative ranking without completely discarding weaker models, and allows a different model to dominate at each prediction horizon.
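In code, the squared-weight rule is a few lines; the AUC-PR scores below are hypothetical values for one horizon.

```python
def horizon_weights(scores):
    """Squared AUC-PR weights: w_i = s_i**2 / sum_j s_j**2."""
    sq = [s * s for s in scores]
    total = sum(sq)
    return [s / total for s in sq]

# Hypothetical validation AUC-PR scores for four base models at one horizon
w = horizon_weights([0.38, 0.30, 0.25, 0.20])
```

Compared with linear weighting, the strongest model's share grows (0.43 vs 0.34 here) while the weakest still keeps a nonzero vote.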
Neural networks are systematically overconfident — their raw softmax outputs do not reflect true event probabilities. After cross-validation, each model's logits are rescaled by a learned scalar temperature T fitted on pooled held-out validation data using LBFGS. A separate T is fitted per prediction horizon (H0–H5).
Accurate calibration is critical for Kelly Criterion: over-confident probabilities lead to overbetting and large drawdowns. Temperature is clamped to [0.1, 8.0] to prevent LBFGS from diverging to extreme values (T = 10,038 was observed without clamping, which collapses all predictions to uniform).
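A minimal sketch of temperature fitting, using SciPy's L-BFGS-B in place of the repo's PyTorch LBFGS; the optimizer bounds play the role of the [0.1, 8.0] clamp. The synthetic data and function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import log_softmax, softmax

T_MIN, T_MAX = 0.1, 8.0   # clamp bounds

def fit_temperature(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fit a scalar temperature T by minimizing NLL on held-out logits,
    bounded so the optimizer cannot diverge to extreme values."""
    def nll(t):
        logp = log_softmax(logits / t[0], axis=1)
        return -logp[np.arange(len(labels)), labels].mean()
    res = minimize(nll, x0=[1.0], method="L-BFGS-B", bounds=[(T_MIN, T_MAX)])
    return float(res.x[0])

# Synthetic demo: a model whose logits are 4x too sharp (overconfident)
rng = np.random.default_rng(0)
true_logits = rng.normal(size=(2000, 3))
labels = np.array([rng.choice(3, p=p) for p in softmax(true_logits, axis=1)])
T = fit_temperature(4.0 * true_logits, labels)   # fitted T lands near 4
```

Dividing the observed logits by the fitted T recovers probabilities close to the ones the labels were actually drawn from.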
The label distribution is extremely imbalanced: ~98% Slope, ~1% Peak, ~1% Bottom. Rather than reweighting classes, a cost matrix penalizes misclassifications asymmetrically — missing a Peak or Bottom costs far more than a false alarm:
| | Actual Peak | Actual Slope | Actual Bottom |
|---|---|---|---|
| Predicted Peak | 0 | 500 | 5000 |
| Predicted Slope | 3000 | 0 | 3000 |
| Predicted Bottom | 5000 | 500 | 0 |
This forces the model to recall rare regimes at the cost of some precision, which is the correct trade-off for directional trading.
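The expected-cost loss implied by this matrix can be sketched as follows (rows are predicted class, columns actual, matching the table; the function name is illustrative).

```python
import numpy as np

# Rows: predicted class, columns: actual class (Peak, Slope, Bottom)
COST = np.array([[   0,  500, 5000],
                 [3000,    0, 3000],
                 [5000,  500,    0]], dtype=float)

def expected_cost_loss(probs: np.ndarray, labels: np.ndarray) -> float:
    """Mean expected misclassification cost: sum_k p_k * COST[k, y].
    Differentiable in probs, so usable as a training objective."""
    per_sample = (probs * COST[:, labels].T).sum(axis=1)
    return float(per_sample.mean())

# Always predicting Slope on an actual Peak pays the full 3000 miss cost
loss = expected_cost_loss(np.array([[0.0, 1.0, 0.0]]), np.array([0]))
```

Because missing a Peak (3000) dwarfs a false alarm (500), gradient descent is pushed toward recalling the rare classes.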
Position sizes are computed from calibrated probabilities using a conservative Kelly formula that automatically reduces exposure when model calibration is poor:
f* = 2p − 1 # raw Kelly fraction
α = max(0, 1 − c · RMSE / |f*|) # calibration penalty
f  = α · f*                       # safe Kelly fraction

Low-confidence signals receive near-zero allocation without any hard probability threshold. The conservatism coefficient c can be tuned between 1.0 and 2.0.
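A direct transcription of these formulas (function and argument names are illustrative):

```python
def kelly_fraction(p: float, rmse: float, c: float = 1.0) -> float:
    """Calibration-penalized Kelly sizing.
    p: calibrated probability of the up move; rmse: calibration RMSE
    (the square root of the Brier score); c: conservatism coefficient."""
    f_star = 2.0 * p - 1.0                             # raw Kelly fraction
    if f_star == 0.0:
        return 0.0
    alpha = max(0.0, 1.0 - c * rmse / abs(f_star))     # calibration penalty
    return alpha * f_star                              # safe Kelly fraction

kelly_fraction(0.75, rmse=0.1)   # strong, well-calibrated signal -> 0.4 long
kelly_fraction(0.52, rmse=0.1)   # weak edge: penalty drives allocation to 0
```

Note how the penalty is relative to edge size: an RMSE of 0.1 barely dents a 0.5 edge but entirely wipes out a 0.04 edge, which is exactly the soft thresholding described above.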
Training with 6 prediction horizons instead of 1 increases the proportion of Peak and Bottom labels in the training set from ~0.83% to ~8% — a 10× increase that provides substantially more minority-class gradient signal per epoch. Only H0 is used in production; H1–H5 collapse to near-100% Slope predictions but are essential for training data efficiency.
Standard k-fold cross-validation leaks future market data into the training set. A temporal splitter with 3 expanding windows ensures each validation fold only sees data from after the training window, giving realistic out-of-sample performance estimates.
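A minimal expanding-window splitter illustrating the idea; fold boundaries here are evenly spaced, and the repo's TimeSeriesSplitter may differ in details.

```python
def walk_forward_splits(n_samples: int, n_folds: int = 3):
    """Expanding-window splits: each fold trains on everything before its
    validation block, so no future data leaks into training."""
    edges = [n_samples * (i + 1) // (n_folds + 1) for i in range(n_folds + 1)]
    for i in range(n_folds):
        train = list(range(0, edges[i]))           # grows every fold
        val = list(range(edges[i], edges[i + 1]))  # strictly after train
        yield train, val

splits = list(walk_forward_splits(100))
# fold 0: train [0..24],  val [25..49]
# fold 1: train [0..49],  val [50..74]
# fold 2: train [0..74],  val [75..99]
```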
├── Core/
│ ├── config.py # All hyperparameters and model configuration
│ ├── data_sources.py # Yahoo Finance / Alpaca / Polygon / Schwab
│ └── path_manager.py # Versioned model checkpoint resolution
│
├── Training/
│ ├── train_ensemble.py # Entry point: full training pipeline
│ ├── TimeSeriesSimpleEnsemble.py # Orchestrates CV + deployment training
│ ├── WeightedVotingMetaLearner.py # Squared-weight ensemble combiner
│ ├── TemperatureCalibrationManager.py # Post-hoc probability calibration (LBFGS)
│ ├── DeepLearningBaseModelAdapter.py # PyTorch Lightning wrapper for DL models
│ ├── LightGBMBaseModelAdapter.py # LightGBM wrapper with feature flattening
│ ├── StockFeaturesCreator.py # 223-feature engineering pipeline
│ ├── StockFeatureSelector.py # LightGBM importance-based feature selection
│ └── TimeSeriesSplitter.py # Walk-forward CV splitter
│
├── TimeSeriesLib/models/
│ ├── LSTM_GRU.py # Hybrid GRU → LSTM → GRU architecture
│ ├── LuTransformer.py # Dual-tower SRU Transformer
│ ├── LuTransformerEncoder.py # SRU segmentation + hierarchical pooling
│ └── Informer.py # ProbSparse attention Transformer
│
├── features/
│ ├── QLibFeaturesCreator.py # 158 QLib Alpha158 quantitative factors
│ ├── leak_free_ict_indicators.py # ICT: BOS, CHoCH, Order Blocks, FVG, Liquidity
│ └── ... # RSI, Momentum, Volatility, VWAP
│
├── utilities/
│ ├── cost_sensitive_loss.py # Asymmetric cost matrix loss function
│ ├── metrics.py # AUC-PR, Brier score
│ └── ldam_loss.py # LDAM+DRW (meta-learner training)
│
├── predict_ensemble.py # Prediction, backtesting, charting
└── TradeBot/ # FastAPI live trading integration
git clone <repo-url>
cd MarketRegimeNet
pip install -r requirements.txt

Requires Python 3.10+ and a CUDA-capable GPU for deep learning model training.
# Full training with walk-forward CV (3 folds)
python train_ensemble.py --ticker SPY
# Auto-update the holdout date to 30 days before today, then train
python train_ensemble.py --ticker SPY --auto-update-holdout
# Delete existing checkpoints and train from scratch
python train_ensemble.py --ticker SPY --from-scratch
# Run feature selection then train from scratch
python train_ensemble.py --ticker SPY --select-features
# Skip base model retraining — re-fit calibration and meta-learner only
python train_ensemble.py --ticker SPY --skip-base-models
# Train or resume a specific model version (a new version is created by default)
python train_ensemble.py --ticker SPY --version v2.0_2026-03-01
# Train base models on all samples (including the validation set), skipping validation, before deploying the model
python train_ensemble.py --ticker SPY --deployment --version v1.0_2026-03-01

Training resumes automatically from the last completed fold if interrupted; no extra flag needed.
# Backtest on holdout data with Kelly position sizing
python predict_ensemble.py --ticker SPY
# Download fresh data and predict on latest market state
python predict_ensemble.py --ticker SPY --use-fresh-data
# Half-Kelly (more conservative allocation)
python predict_ensemble.py --ticker SPY --kelly-fraction 0.5
# Raise conservatism coefficient (penalizes calibration error more aggressively)
python predict_ensemble.py --ticker SPY --kelly-conservatism 2.0
# Use a specific data file
python predict_ensemble.py --ticker SPY --data-file data/market/SPY_max_5m_data.csv

# Analyze all features and write the top 80 to config.py
python Training/StockFeatureSelector.py --ticker SPY --update-config

Feature selection uses three methods in combination (mutual information, F-statistic, and target correlation), normalized and averaged into a single importance score. Features below the importance threshold or above the redundancy threshold are removed first, then pairwise highly correlated features are pruned, keeping the higher-scoring one. The result is written back to Core/config.py automatically.
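The scoring-and-pruning logic can be sketched with scikit-learn's selectors. Function name, thresholds, and the demo data are illustrative, not the repo's StockFeatureSelector.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

def select_features(X, y, top_k=80, corr_threshold=0.95):
    """Normalize and average three scores, then greedily keep top features,
    pruning any candidate highly correlated with an already-kept one."""
    mi = mutual_info_classif(X, y, random_state=0)
    f_stat, _ = f_classif(X, y)
    tgt_corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

    def norm(v):
        v = np.nan_to_num(v)
        return (v - v.min()) / (v.max() - v.min() + 1e-12)

    score = (norm(mi) + norm(f_stat) + norm(tgt_corr)) / 3.0
    selected = []
    for j in np.argsort(score)[::-1]:          # best first
        if all(abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) < corr_threshold
               for s in selected):
            selected.append(j)
        if len(selected) == top_k:
            break
    return selected

# Demo: column 0 drives the label; column 1 is a near-duplicate of column 0
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)
y = (X[:, 0] > 0).astype(int)
sel = select_features(X, y, top_k=3)   # keeps one of {0, 1}, prunes the other
```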
Active base models and all hyperparameters are in Core/config.py. Toggle models by commenting or uncommenting their blocks:
'base_models': {
'lstm_gru': { 'type': 'lstm_gru', 'gru_hidden': 128, 'lstm_hidden': 128, 'dropout': 0.2 },
'lutransformer': { 'type': 'lutransformer', 'num_tokens': 32, 'segment_length': 16 },
'informer': { 'type': 'informer' },
'lightgbm': { 'type': 'lightgbm', 'n_estimators': 1000, ... },
}

Key global hyperparameters:
| Parameter | Value | Description |
|---|---|---|
| `seq_len` | 205 | Input window (~17 hours of 5-minute bars) |
| `pred_len` | 6 | Prediction horizons H0–H5 |
| `d_model` | 128 | Transformer embedding dimension |
| `cv_folds` | 3 | Walk-forward cross-validation folds |
| `T_MIN` / `T_MAX` | 0.1 / 8.0 | Temperature calibration clamp bounds |
Trained models are saved under data/models/SPY_5m_classification/<version>/ with:
- Per-fold checkpoints and validation logits
- Per-model, per-horizon temperature parameters
- AUC-PR scores used for ensemble weighting
- `current_version.txt` for active version tracking and rollback
- Hidformer — LuTransformer is adapted from: Z. Liu, Y. Cao, H. Xu, Y. Huang, Q. He, X. Chen, X. Tang, X. Liu. Hidformer: Hierarchical dual-tower transformer using multi-scale mergence for long-term time series forecasting. Expert Systems With Applications, 239 (2024), 122412. https://doi.org/10.1016/j.eswa.2023.122412
- LSTM-GRU — architecture inspired by: I. Akouaouch, A. Bouayad. A new deep learning approach for predicting high-frequency short-term cryptocurrency price. Bulletin of Electrical Engineering and Informatics, 14(1) (2025). https://doi.org/10.11591/eei.v14i1.7377
- Time-Series-Library — base Informer implementation
- Qlib Alpha158 — quantitative alpha factor library