From laptop agent to production platform
Most agent prototypes run in a single process. Production systems need reproducibility, observability, and safe rollout paths. The deployment stack below gives you deterministic builds and scalable runtime behavior.
Container boundary with Docker
- Build separate images for API gateway, planner, workers, and embedding service.
- Pin model/runtime dependencies for deterministic behavior across environments.
- Use multi-stage builds to reduce image size and cold-start times.
- Run health checks and fail fast when models or tool endpoints are unavailable.
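The fail-fast health check in the last bullet can be sketched in Python. This is a minimal illustration, not a prescribed implementation: the function names and the idea of passing probes as callables are assumptions made for the example, and a real probe would load the model or call the tool endpoint.

```python
from typing import Callable, Dict, List


def failed_dependencies(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Return the names of dependencies whose probe returns False or raises."""
    failed = []
    for name, probe in probes.items():
        try:
            healthy = probe()
        except Exception:
            # A probe that raises (timeout, refused connection) counts as unhealthy.
            healthy = False
        if not healthy:
            failed.append(name)
    return failed


def startup_exit_code(probes: Dict[str, Callable[[], bool]]) -> int:
    """Exit code for a container entrypoint: 0 if every probe passes, 1 otherwise."""
    return 1 if failed_dependencies(probes) else 0
```

A container entrypoint could pass the result to `sys.exit` so that the orchestrator sees a non-zero exit and restarts or reschedules the container instead of routing traffic to it.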
Kubernetes scheduling strategy
Workload separation
Use dedicated node pools: CPU nodes for API/planner, memory-heavy nodes for retrieval, and GPU pools for model inference.
Autoscaling signals
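Scale workers on queue depth and inference latency, not just CPU. CPU can stay low while workers wait on inference, so a Horizontal Pod Autoscaler (HPA) fed with custom metrics catches a growing backlog that CPU-based scaling would miss.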
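To make the queue-depth signal concrete, the sketch below applies the replica formula documented for the Kubernetes HPA, `ceil(currentReplicas * currentMetric / targetMetric)`, to an average-queued-jobs-per-worker metric. The function name, the replica bounds, and the example numbers are assumptions for illustration; the 10% tolerance mirrors the documented default but should be checked against your cluster's configuration.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,   # e.g. average queued jobs per worker (assumed metric)
    target_metric: float,    # the per-worker value you want to hold
    min_replicas: int = 1,   # assumed bounds, set per workload
    max_replicas: int = 50,
    tolerance: float = 0.1,
) -> int:
    """Replica count from the HPA ratio formula, clamped to [min, max]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within tolerance of the target, keep the current size to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With 4 workers averaging 30 queued jobs each against a target of 10, the ratio is 3 and the function asks for 12 workers, regardless of how idle their CPUs look.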
Failure domains
Spread replicas across zones with anti-affinity rules. Keep stateful components in managed services where possible.
Nginx at the edge
Nginx handles TLS termination, request buffering, and traffic shaping before traffic reaches the gateway. For agent APIs, configure strict timeouts, request-size limits, and streaming-friendly settings for token-by-token responses.
- Path-based routing for /chat, /tools, and /webhooks.
- Rate limits per API key and tenant.
- Canary upstream pools for safe model version rollout.
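The per-key, per-tenant rate limit can be illustrated with a small token bucket. At the edge this would normally be enforced by Nginx itself (for example through its `limit_req` module), so treat the Python below as a sketch of the keying and refill logic only. The class name, parameters, and injected clock are assumptions for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Token-bucket rate limiter with one bucket per (tenant, api_key) pair."""

    def __init__(
        self,
        rate_per_sec: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        # key -> (tokens remaining, timestamp of last update)
        self._buckets: Dict[Tuple[str, str], Tuple[float, float]] = {}

    def allow(self, tenant: str, api_key: str) -> bool:
        """Consume one token if available; return whether the request may pass."""
        now = self.clock()
        key = (tenant, api_key)
        tokens, last = self._buckets.get(key, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[key] = (tokens, now)
        return allowed
```

Keying on the pair rather than on the API key alone keeps one noisy key from exhausting the allowance of other keys in the same tenant; a tenant-wide cap would need a second limiter keyed on the tenant only.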
Where Ngrok fits
Ngrok is ideal for development and integration tests: exposing local webhooks, testing external callback flows, and validating OAuth tool connectors. Keep it out of long-term production paths, but use it heavily to speed iteration before formal deployment.
Deployment playbook
- Build and scan Docker images.
- Deploy to staging namespace with synthetic load tests.
- Shadow production traffic to new model and planner versions.
- Canary to 5%, 25%, 50%, and 100% with rollback gates.
- Record latency/cost deltas before final promotion.
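The canary step of the playbook can be sketched as a loop over the traffic percentages with a gate check at each stage. The `gate_passes` callback is hypothetical: in practice it would compare error rate, latency, or cost for the canary pool against the baseline, and the traffic shift itself would be done by your router or deployment tooling.

```python
from typing import Callable, List, Sequence, Tuple

CANARY_STEPS = (5, 25, 50, 100)  # percent of traffic, taken from the playbook


def run_canary(
    gate_passes: Callable[[int], bool],
    steps: Sequence[int] = CANARY_STEPS,
) -> Tuple[str, List[int]]:
    """Walk the canary steps in order and stop at the first failed gate.

    Returns ("promoted" or "rolled_back", list of percentages attempted).
    """
    attempted: List[int] = []
    for percent in steps:
        attempted.append(percent)
        # Hypothetical gate: True means metrics at this stage are acceptable.
        if not gate_passes(percent):
            return "rolled_back", attempted
    return "promoted", attempted
```

Returning the attempted steps along with the outcome makes it easy to log how far a rollout got before it was rolled back.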
Runbook metrics
- Token latency (time to first token and time to last token).
- Tool call failure rate by connector.
- Queue age and dropped requests.
- Cost per successful workflow.
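Two of these metrics are easy to pin down in code. The sketch below assumes a hypothetical `WorkflowRun` record with a request timestamp, per-token timestamps, a cost, and a success flag. It also assumes that cost per successful workflow means total spend, including failed runs, divided by the number of successes; adjust that if your accounting differs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkflowRun:
    request_ts: float        # when the request arrived (seconds)
    token_ts: List[float]    # arrival time of each streamed token (seconds)
    cost_usd: float          # total spend attributed to this run
    succeeded: bool


def time_to_first_token(run: WorkflowRun) -> Optional[float]:
    """Seconds from request to first token, or None if nothing was streamed."""
    return run.token_ts[0] - run.request_ts if run.token_ts else None


def time_to_last_token(run: WorkflowRun) -> Optional[float]:
    """Seconds from request to final token, or None if nothing was streamed."""
    return run.token_ts[-1] - run.request_ts if run.token_ts else None


def cost_per_successful_workflow(runs: List[WorkflowRun]) -> Optional[float]:
    """Total cost of all runs divided by the number of successful runs."""
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        return None  # undefined when nothing succeeded
    return sum(r.cost_usd for r in runs) / successes
```

Counting the cost of failed runs in the numerator means that retries and tool failures show up as a higher cost per success instead of disappearing from the metric.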
"Good deployment design lets your agents fail small, recover fast, and ship often."
- Victor Hale