Observability
Sprintsail's observability stack is built on open standards: OpenTelemetry for instrumentation, Prometheus for metrics, Loki for logs, Tempo for traces, and Grafana for visualization.
Stack overview
Your App → OpenTelemetry Collector → Prometheus (metrics)
                                   → Loki (logs)
                                   → Tempo (traces)
                                          ↓
                                       Grafana
                                 (dashboards, alerts)
All components run within the management cluster. Observability data is namespace-scoped: each organization sees only its own metrics, logs, and traces.
Prometheus
Prometheus scrapes and stores time-series metrics. Sprintsail collects platform metrics automatically and supports custom application metrics.
Platform metrics
Collected for every app without any configuration:
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total requests by method, path, status code |
| http_request_duration_seconds | Histogram | Request latency (p50, p95, p99) |
| container_cpu_usage_seconds_total | Counter | Cumulative CPU time consumed |
| container_memory_working_set_bytes | Gauge | Current memory usage |
| container_network_receive_bytes_total | Counter | Inbound network bytes |
| container_network_transmit_bytes_total | Counter | Outbound network bytes |
| container_restarts_total | Counter | Container restart count |
| kube_pod_status_phase | Gauge | Pod lifecycle phase |
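The latency percentiles listed for http_request_duration_seconds are not stored directly; a histogram stores cumulative bucket counts, and percentiles are estimated from them. The sketch below illustrates that estimation with linear interpolation inside a bucket, in the spirit of PromQL's histogram_quantile(). The bucket boundaries and counts are made-up values for illustration, not Sprintsail's actual bucket layout.

```javascript
// Estimate a quantile from cumulative histogram buckets.
// `buckets` must be sorted by upper bound `le` and end with an
// le = Infinity bucket whose count is the total number of observations.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank.
  const i = buckets.findIndex((b) => b.count >= rank);
  // Observations above the last finite bound cannot be interpolated,
  // so fall back to that bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - countBelow;
  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - countBelow) / inBucket);
}

// Hypothetical cumulative counts (seconds as upper bounds).
const latencyBuckets = [
  { le: 0.25, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];

console.log('p50:', histogramQuantile(0.5, latencyBuckets));
console.log('p95:', histogramQuantile(0.95, latencyBuckets));
```

With these counts the p50 lands on the first bucket's upper bound (0.25 s) and the p95 about halfway through the 0.5 to 1 s bucket. The estimate is only as precise as the bucket boundaries allow.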
Custom metrics
If your app exposes a Prometheus-compatible /metrics endpoint, Sprintsail can scrape it once you enable scraping for the app. Tell Sprintsail which path and port to scrape:
ss app update my-app --metrics-path /metrics --metrics-port 9090
Example using the Prometheus client library (Node.js):
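// Assumes an Express app; adapt the route to your HTTP framework.
const express = require('express');
const client = require('prom-client');
const app = express();
// Collect default Node.js metrics (event loop lag, heap usage, GC, ...)
client.collectDefaultMetrics();
// Custom counter, labeled by order outcome
const ordersProcessed = new client.Counter({
  name: 'orders_processed_total',
  help: 'Total orders processed',
  labelNames: ['status'],
});
// Increment wherever an order completes, for example:
// ordersProcessed.inc({ status: 'success' });
// Expose all registered metrics in the Prometheus text format
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
// The port must match --metrics-port
app.listen(9090);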
Custom metrics are available in Grafana alongside platform metrics.
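To make it clearer what the scraper reads from /metrics, here is a dependency-free sketch that renders a labeled counter in the Prometheus text exposition format. TextCounter is a hypothetical helper written for this illustration; it is not part of prom-client, and it skips details a real client library handles (label escaping, name validation, timestamps).

```javascript
// Minimal labeled counter that renders itself in the Prometheus text format.
class TextCounter {
  constructor(name, help) {
    this.name = name;
    this.help = help;
    this.values = new Map(); // rendered label set -> current count
  }

  inc(labels = {}, amount = 1) {
    const key = Object.entries(labels)
      .map(([k, v]) => `${k}="${v}"`)
      .join(',');
    this.values.set(key, (this.values.get(key) || 0) + amount);
  }

  render() {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    for (const [key, value] of this.values) {
      lines.push(key ? `${this.name}{${key}} ${value}` : `${this.name} ${value}`);
    }
    return lines.join('\n') + '\n';
  }
}

const orders = new TextCounter('orders_processed_total', 'Total orders processed');
orders.inc({ status: 'success' });
orders.inc({ status: 'success' });
orders.inc({ status: 'failed' });
process.stdout.write(orders.render());
```

Each label combination becomes its own time series, which is why high-cardinality labels (user IDs, request IDs) are usually best avoided.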
Retention
| Plan | Metric retention | Resolution |
|---|---|---|
| Starter | 7 days | 15s scrape interval |
| Growth | 90 days | 15s scrape interval |
| Enterprise | 1 year | 15s scrape interval, long-term downsampled |
Loki
Loki aggregates logs from all app instances and indexes them by label. Logs are collected from container stdout/stderr automatically.
Log collection
No agent installation is needed. Promtail runs as a DaemonSet on cluster nodes, tails container logs, and ships them to Loki. All output to stdout and stderr is captured.
Structured logging
Loki indexes labels rather than log content, but JSON-formatted log lines can be parsed at query time, which makes every field filterable. If your app outputs structured logs:
{"level":"error","msg":"database connection failed","host":"db.example.com","retry":3,"duration_ms":1200}
You can query by any field in Grafana:
{app="my-app"} | json | level="error" | duration_ms > 1000
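A minimal sketch of a logger that produces lines in this shape, using only Node.js built-ins. The function names formatLog and log are illustrative; any logging library that writes one JSON object per line to stdout works the same way.

```javascript
// Serialize one log event as a single-line JSON object.
// Keys are emitted in insertion order: level, msg, then extra fields.
function formatLog(level, msg, fields = {}) {
  return JSON.stringify({ level, msg, ...fields });
}

// Write the event to stdout, one object per line, so the `| json`
// parser can extract each field at query time.
function log(level, msg, fields) {
  process.stdout.write(formatLog(level, msg, fields) + '\n');
}

log('error', 'database connection failed', {
  host: 'db.example.com',
  retry: 3,
  duration_ms: 1200,
});
```

Keeping numeric fields such as duration_ms as JSON numbers rather than strings keeps comparisons like duration_ms > 1000 unambiguous.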
LogQL queries
Loki uses LogQL for queries. Common patterns:
# All logs for an app
{app="my-app"}
# Error logs only
{app="my-app"} |= "error"
# JSON-parsed with field filter
{app="my-app"} | json | status >= 500
# Rate of errors over time
rate({app="my-app"} |= "error" [5m])
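To illustrate what the last query computes, the sketch below emulates it over an in-memory list of log entries: keep the lines that contain "error", count those inside the trailing 5-minute window, and divide by the window length in seconds. The entry shape and the errorRate name are assumptions made for this example; Loki evaluates the real query server-side.

```javascript
// Emulate rate({app="my-app"} |= "error" [5m]).
// `entries` is a list of { ts: <epoch seconds>, line: <string> }.
function errorRate(entries, nowSeconds, windowSeconds = 300) {
  const matching = entries.filter(
    (e) =>
      e.ts > nowSeconds - windowSeconds &&
      e.ts <= nowSeconds &&
      e.line.includes('error') // |= is a case-sensitive substring match
  );
  // rate() reports matching entries per second over the window.
  return matching.length / windowSeconds;
}

const now = 1000;
const entries = [
  { ts: 990, line: 'error: upstream timeout' },
  { ts: 900, line: 'request ok' },
  { ts: 800, line: 'error: db unavailable' },
  { ts: 600, line: 'error: outside the 5m window' },
];

console.log(errorRate(entries, now)); // 2 matching lines / 300 seconds
```

Because the result is per second, a value of 0.1 means roughly 30 matching lines over the 5-minute window.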
Retention
| Plan | Log retention |
|---|---|
| Starter | 24 hours |
| Growth | 30 days |
| Enterprise | 1 year |
Tempo (distributed tracing)
Grafana Tempo stores distributed traces collected via OpenTelemetry. Tracing is available on Growth and Enterprise plans.
Enabling tracing
Configure your app to send traces to the OpenTelemetry Collector endpoint:
ss env set my-app OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.sprintsail-system:4317
ss env set my-app OTEL_SERVICE_NAME=my-app
Or use the platform-provided auto-instrumentation (no code changes):
ss app update my-app --tracing auto
Auto-instrumentation is available for Node.js, Python, Java, and .NET.
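If you configure tracing manually, a startup check can catch a missing or malformed endpoint before the app silently drops spans. The sketch below reads the two variables set above and validates them with Node.js built-ins. readOtelConfig is a hypothetical helper, not part of Sprintsail or the OpenTelemetry SDK, which reads these variables on its own.

```javascript
// Validate the OpenTelemetry environment variables at startup.
function readOtelConfig(env = process.env) {
  const endpoint = env.OTEL_EXPORTER_OTLP_ENDPOINT;
  const serviceName = env.OTEL_SERVICE_NAME;
  if (!endpoint || !serviceName) {
    throw new Error(
      'OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_SERVICE_NAME must both be set'
    );
  }
  const url = new URL(endpoint); // throws on a malformed endpoint
  return { serviceName, host: url.hostname, port: Number(url.port) };
}

// Values taken from the commands above, passed explicitly for illustration.
const otelConfig = readOtelConfig({
  OTEL_EXPORTER_OTLP_ENDPOINT: 'http://otel-collector.sprintsail-system:4317',
  OTEL_SERVICE_NAME: 'my-app',
});
console.log(otelConfig);
```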
Viewing traces
Traces appear in Grafana under the Tempo data source. You can:
- Search by trace ID, service name, or duration
- View the full request waterfall across services
- Correlate traces with logs and metrics (click from a trace span to see matching logs)
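A request waterfall across services depends on each service passing the trace context along. OpenTelemetry does this by default with the W3C Trace Context traceparent header. The sketch below parses that header so the trace ID can, for example, be attached to log lines. It assumes, without confirmation from the platform, that correlation keys on a shared trace ID; the function name and the example header value are illustrative, and this is not a complete validator.

```javascript
// Parse a W3C Trace Context `traceparent` header:
//   version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
function parseTraceparent(header) {
  const match =
    /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // All-zero trace or span IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // The lowest flag bit marks the trace as sampled.
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
console.log(ctx);
```

Adding the parsed traceId as a field in structured log lines is one common way to make logs findable from a span.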
OpenTelemetry Collector
The OpenTelemetry Collector runs in the management cluster and acts as the central telemetry pipeline:
- Receives traces, metrics, and logs from apps via OTLP (gRPC and HTTP)
- Processes data (batching, sampling, attribute enrichment)
- Exports to Prometheus, Loki, and Tempo
The collector is pre-configured. Apps only need to set the OTLP endpoint to start sending telemetry.
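As an illustration of the sampling step, the sketch below shows one common approach: deterministic ratio sampling keyed on the trace ID, so that all spans of a trace receive the same keep-or-drop decision. This is a conceptual sketch only. The collector's actual sampling policy and ratio are not documented here, and shouldSample is a hypothetical function.

```javascript
// Decide whether to keep a trace, given a sampling ratio between 0 and 1.
// Every span of a trace carries the same trace ID, so the decision is
// consistent across services without any coordination.
function shouldSample(traceId, ratio) {
  // Treat the last 8 hex characters (32 bits) of the ID as a number
  // in the range [0, 2^32).
  const value = parseInt(traceId.slice(-8), 16);
  return value < ratio * 2 ** 32;
}

const exampleTraceId = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSample(exampleTraceId, 1));    // ratio 1 keeps every trace
console.log(shouldSample(exampleTraceId, 0));    // ratio 0 drops every trace
console.log(shouldSample(exampleTraceId, 0.1));  // depends only on the trace ID
```

The approach relies on trace IDs being uniformly random, which holds for IDs generated by OpenTelemetry SDKs.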
Grafana
Each organization gets a Grafana instance at:
https://grafana.{org}.sprintsail.com
Log in with your Sprintsail credentials (SSO via Dex).
Pre-built dashboards
Every organization starts with these dashboards:
| Dashboard | Contents |
|---|---|
| App Overview | Request rate, error rate, latency percentiles, instance count |
| Resource Usage | CPU and memory per app and instance, trending |
| Services | Database connections, query latency, cache hit rate |
| Deployments | Deploy timeline, build duration, rollback events |
| Logs Explorer | Full-text log search with filters |
Custom dashboards
Create custom dashboards using any combination of Prometheus, Loki, and Tempo data sources. Dashboards are saved per-organization and are accessible to all org members.
Alerting via Grafana
In addition to CLI-based alerts (ss alerts create), you can create alert rules directly in Grafana with full PromQL/LogQL conditions, notification channels, and silence windows.
Next steps
- Monitoring guide: practical log streaming and alerting setup
- Architecture: where the observability stack runs
- Scaling: autoscaling uses the same metrics