# EdgeLLM

Fine-tune once, deploy everywhere — from cloud to $15 edge devices.

High-performance LLM inference engine with 2.5x faster GPU attention and 15.5x lower latency jitter than Ollama. Built with Mojo for deterministic real-time performance.
## Quick Start

```bash
# List available models
edgellm models

# Download a model
edgellm pull smollm-135m

# Interactive chat
edgellm run smollm-135m

# Start API server
edgellm serve smollm-135m --port 8080
```
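Once `edgellm serve` is running, you can call it over HTTP. The sketch below is a minimal stdlib-only client assuming an OpenAI-style `/v1/completions` route and response shape; the endpoint path, payload fields, and response keys are assumptions for illustration, not documented API.

```python
import json
import urllib.request

# Hypothetical endpoint; adjust to match the routes edgellm actually serves.
EDGELLM_URL = "http://localhost:8080/v1/completions"

def build_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build a completion request payload (field names are assumptions)."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(model: str, prompt: str) -> str:
    """POST the request to a running edgellm server and return the text."""
    data = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        EDGELLM_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response shape.
    return body["choices"][0]["text"]
```

With the server running: `print(complete("smollm-135m", "Hello"))`.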
## Features

### Inference Engine

- BitNet 1.58-bit quantization (4.8x compression)
- T-MAC lookup table inference (no multiplication)
- INT8 `__dp4a` GPU kernels (Turing+)
- Zero-copy KV cache management
- Deterministic latency (no GC pauses)
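To make the first two bullets concrete: BitNet b1.58 constrains each weight to {-1, 0, +1} with a per-tensor absmean scale, which turns the inner product into pure additions and subtractions (the property T-MAC exploits with lookup tables). A minimal CPU sketch of the idea, not the engine's actual quantizer or the LUT machinery:

```python
def absmean_quantize(weights):
    """Quantize weights to ternary {-1, 0, +1} with an absmean scale,
    in the style of BitNet b1.58. Returns (ternary values, scale)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

def ternary_dot(ternary, scale, x):
    """Multiplication-free dot product: with ternary weights the inner
    product reduces to adds and subtracts (packing and the T-MAC lookup
    tables are omitted here)."""
    acc = 0.0
    for t, xi in zip(ternary, x):
        if t == 1:
            acc += xi
        elif t == -1:
            acc -= xi
    return acc * scale
```

The ~1.58 bits per weight (log2 of 3 states) versus 8- or 16-bit formats is where the quoted compression ratio comes from.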
### API Gateway

- Multi-tenant API key management
- Role-based access control (admin/user)
- Rate limiting per API key
- Usage analytics and monitoring
- PostgreSQL persistent storage
### Frontend Dashboard

- Modern React + Vite UI
- Dark mode support
- Interactive playground
- Real-time statistics
## CUDA Kernel Status

Complete GPU inference pipeline for LLaMA-style transformers.

| Kernel | Status | Features |
| --- | --- | --- |
| Attention (INT8) | ✅ Complete | `__dp4a` Flash Attention, 2.5x faster than Ollama |
| RMSNorm | ✅ Complete | Warp reductions, vectorized, fused residual |
| FFN/MLP | ✅ Complete | SwiGLU activation, tiled, INT8 quantized |
| Embeddings | ✅ Complete | Token lookup, RoPE positional encoding |
| Sampling | ✅ Complete | Temperature, Top-K, Top-P, greedy |
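As a reference for what the fused RMSNorm row computes: the kernel folds the residual add into the normalization (warp reductions and vectorized loads are GPU details not shown here). A plain-Python sketch of the math, useful for checking kernel output on small inputs:

```python
import math

def rmsnorm_fused_residual(x, residual, weight, eps=1e-6):
    """CPU reference for a fused residual + RMSNorm step:
    h = x + residual; out = h / rms(h) * weight."""
    h = [xi + ri for xi, ri in zip(x, residual)]
    rms = math.sqrt(sum(v * v for v in h) / len(h) + eps)
    return [v / rms * w for v, w in zip(h, weight)]
```

Fusing the residual add saves one full read/write pass over the hidden state, which is why it is a standard optimization in transformer kernels.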
## Build CUDA Kernels

```bash
cd mojo-gateway/src/kernels/cuda

# Build unified inference library (recommended)
make inference-unified

# Or build individual kernels
make rmsnorm ffn embeddings sampling int8

# Platform-specific builds
make t4          # Tesla T4 (Kaggle/Colab)
make jetson-nano # Jetson Nano
make rtx         # RTX 30/40 series
```