Operations
Day-2 concerns: what to monitor, how to log, when to alert, what to do when something goes wrong. This page is the operational counterpart to Maintenance (which covers periodic tasks like key rotation).
Health endpoints
Two endpoints expose service health:
| Endpoint | Purpose | What it checks | Use for |
|---|---|---|---|
| `GET /health` | Liveness | Process is running and the HTTP server is reachable | Load balancer, container `HEALTHCHECK` |
| `GET /ready` | Readiness | DB is reachable + encryption key decrypts | Deploy gates, alert on outage |
Both return JSON:
```json
// /health
{ "status": "ok" }

// /ready
{ "status": "ok", "checks": { "db": "ok", "encryption": "ok" } }
```

A degraded /ready:

```json
{ "status": "degraded", "checks": { "db": "fail: connection refused", "encryption": "ok" } }
```

Alert on /ready, not /health
/health will keep returning 200 even with a dead database — it only proves the process is up. Always alert on /ready failures.
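A minimal external probe along these lines can drive a cron job or uptime monitor (a sketch only; `check_ready`, `ATTESTO_URL`, and the example deployment URL are names invented here):

```sh
#!/bin/sh
# Returns nonzero when /ready is not returning 200, suitable for alerting.
check_ready() {
  curl -fsS --max-time 10 "$1/ready" >/dev/null
}

# usage (hypothetical deployment URL):
#   check_ready "https://attesto.example.com" || echo "page the on-call" >&2
```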
Health check configuration on Fly
fly.toml configures both:
- `/health` every 30s with 5s timeout (liveness probe)
- `/ready` every 60s with 10s timeout, 30s grace period (readiness probe)
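For orientation, a fly.toml sketch matching those intervals might look like the following (field names follow Fly's `[[http_service.checks]]` schema; check it against your actual file rather than copying this verbatim):

```toml
[[http_service.checks]]
  method   = "GET"
  path     = "/health"
  interval = "30s"
  timeout  = "5s"

[[http_service.checks]]
  method       = "GET"
  path         = "/ready"
  interval     = "60s"
  timeout      = "10s"
  grace_period = "30s"
```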
A failing /ready during deploy aborts the rollout and keeps the old machine serving. A failing /ready after deploy marks the machine as unhealthy; Fly's load balancer stops routing to it.
Logging
Structured JSON logs to stdout. Every line has at minimum ts, level, and msg:
```json
{"ts":"2026-04-25T03:01:30.123Z","level":"info","msg":"listening","port":8080,"env":"production"}
{"ts":"2026-04-25T03:01:53.428Z","level":"info","msg":"http_request","reqId":"req_…","method":"POST","path":"/v1/apple/verify","status":200,"latencyMs":143,"tenantId":"tenant_…"}
{"ts":"2026-04-25T03:02:11.876Z","level":"warn","msg":"validation_audit_enabled_no_retention","note":"…"}
{"ts":"2026-04-25T03:02:31.122Z","level":"fatal","msg":"startup_failed","error":"…"}
```

What gets logged
Always:
- `listening` once at startup (with port + env)
- `shutdown` once on SIGINT/SIGTERM (with signal)
- `http_request` per request, with method, path, status, latency, tenantId
- `validation_audit_enabled_no_retention` once at boot (a reminder if the audit log is enabled; see Maintenance)
On error:
- `startup_failed` (fatal) for any boot-time configuration / init error
- Per-request errors include the AppError code in structured form, never the full stack trace in production (`NODE_ENV=production` redacts stack traces; only `errorClass` is kept)
What is NOT logged
- Apple `.p8` contents
- Google service-account JSON
- Raw API keys
- Webhook secrets
- Full `signedPayload` bodies
If you see these in logs, that's a bug — open an issue.
Log analysis tips
Common queries against Fly logs (or any log shipper):
```sh
# All errors in the last hour
fly logs -a attesto | jq -c 'select(.level=="error" or .level=="fatal")'

# Per-tenant verify volume
fly logs -a attesto | jq -r 'select(.path=="/v1/apple/verify") | .tenantId' | sort | uniq -c

# p99 latency for the verify endpoint
fly logs -a attesto | jq -r 'select(.msg=="http_request" and .path=="/v1/apple/verify") | .latencyMs' \
  | sort -n | awk '{ a[NR] = $1 } END { if (NR) print a[int(NR * 0.99) + 1] }'

# Rate-limit denials
fly logs -a attesto | jq -c 'select(.error=="RATE_LIMITED")'
```

Metrics worth watching
Attesto doesn't ship a Prometheus endpoint in v0.1.0; use Fly's built-in metrics or scrape these from logs:
| Metric | Healthy range | Why |
|---|---|---|
| `/v1/apple/verify` p99 latency | <500ms | Mostly bounded by Apple's API; spikes mean Apple is slow or your DB is slow |
| `/v1/google/verify` p99 latency | <800ms | Google's API is slower than Apple's; OAuth refreshes add up |
| HTTP 5xx rate | <0.1% | Anything above suggests upstream or DB issues |
| `RATE_LIMITED` denials | 0 in normal traffic | Spikes mean a tenant is misbehaving or your limits are too tight |
| Webhook delivery success rate | >99% | Persistent failures indicate a tenant's callback URL is broken |
| `webhook_deliveries.status='pending'` count | <100 typical | Backlog; if growing, the dispatcher is wedged |
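Several of these can be scraped straight from the `http_request` log lines. For example, a 5xx-rate check (a sketch; `five_xx_rate` is a name coined here, and it assumes the one-JSON-object-per-line format shown under Logging):

```sh
# Fraction of http_request log lines whose status is 5xx.
five_xx_rate() {
  grep '"msg":"http_request"' \
    | awk '{ total++ } /"status":5[0-9][0-9]/ { err++ } END { if (total) printf "%.3f\n", err / total }'
}

# usage: fly logs -a attesto | five_xx_rate
```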
Scaling
Vertical (stronger machines)
For most loads, the default Fly shared-cpu-1x / 512MB RAM machine is sufficient. Bump up if:
- Sustained CPU >70% (check `fly logs -a … | jq …` or Fly metrics)
- Heap usage >300MB (increase memory)
- Apple/Google verify p99 climbing without external cause (check upstream metrics first)
Horizontal (more machines)
Attesto's verify path is fully stateless — adding machines scales it linearly. Scale via:
```sh
fly scale count 3 -a attesto
```

Webhook dispatcher caveat
The webhook dispatcher is single-instance in v0.1.0. If you scale horizontally, all replicas will pick up pending rows from webhook_deliveries and double-deliver to your callbacks.
Until the v0.2 multi-instance dispatcher (FOR UPDATE SKIP LOCKED) lands:
- For verify-heavy workloads with light webhooks: scale freely; the webhook dupes are tolerable
- For webhook-heavy workloads: stay at `count 1`
A workaround pattern: run two Fly apps from the same image, `attesto-verify` (scaled to N replicas) and `attesto-webhooks` (replica count 1). There is no `WEBHOOK_DISPATCHER_DISABLED` flag today, so the verify app's replicas will still occasionally pick up deliveries; in practice most of their traffic is verify.
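For reference, the multi-instance claim pattern mentioned above typically looks like the following (a sketch only, not Attesto's actual v0.2 code; column names are assumed from the `webhook_deliveries` references on this page):

```sql
-- Each dispatcher instance claims a batch; SKIP LOCKED makes concurrent
-- claimers skip rows another instance has already locked.
BEGIN;
SELECT id
FROM webhook_deliveries
WHERE status = 'pending'
ORDER BY created_at
LIMIT 10
FOR UPDATE SKIP LOCKED;
-- ...deliver the claimed rows, then mark them delivered...
COMMIT;
```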
Rate-limit tuning
Defaults: RATE_LIMIT_PER_SECOND=100, RATE_LIMIT_BURST=200 per tenant per process. With N machines, the effective cap per tenant is N × RATE_LIMIT_BURST.
If you scale to 4 machines but want each tenant capped at the equivalent of 100 RPS overall:
```sh
fly secrets set -a attesto RATE_LIMIT_PER_SECOND=25 RATE_LIMIT_BURST=50
```

Or accept the multiplier as a soft ceiling; exceeding 100 × N is unlikely for most tenants.
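Since the limits are per process, dividing the fleet-wide target by machine count gives the per-machine values. A sketch (variable names here are illustrative):

```sh
# Per-tenant fleet-wide targets, split evenly across N machines.
MACHINES=4
FLEET_RPS=100
FLEET_BURST=200

echo "RATE_LIMIT_PER_SECOND=$(( FLEET_RPS / MACHINES ))"   # prints RATE_LIMIT_PER_SECOND=25
echo "RATE_LIMIT_BURST=$(( FLEET_BURST / MACHINES ))"      # prints RATE_LIMIT_BURST=50
```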
Database pool sizing
DATABASE_URL accepts standard Postgres connection params. The internal pool defaults to ~10 connections per process; for high-concurrency deployments tune via the connection URL:
```
postgres://user:pass@host/db?max=20&idle_timeout=30
```

For Fly Postgres, monitor connection count vs the configured max_connections (default 100 on the smallest cluster). If you're approaching the cap, either raise it on the Postgres side or use PgBouncer in front.
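To see how close you are to the cap, a query like this works on stock Postgres:

```sql
-- Connections currently in use vs the server's configured ceiling
SELECT (SELECT count(*) FROM pg_stat_activity) AS in_use,
       current_setting('max_connections')::int AS ceiling;
```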
Monitoring with a BI tool
Attesto deliberately doesn't ship a built-in admin UI or dashboard — the right tool for operator monitoring is an off-the-shelf BI product pointed at a read-only Postgres user.
Recommended setup
```sql
-- One-time, on your Postgres
CREATE USER monitoring WITH PASSWORD '<strong random>';
GRANT CONNECT ON DATABASE attesto TO monitoring;
GRANT USAGE ON SCHEMA public TO monitoring;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO monitoring;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT ON TABLES TO monitoring;
```

Point your tool of choice at this user. Three popular options:
| Tool | License | Best for |
|---|---|---|
| Metabase | OSS (also paid SaaS) | Drag-and-drop dashboards, easiest learning curve |
| Grafana | OSS | Time-series-heavy views, extensive alerting |
| Apache Superset | OSS | More query flexibility, steeper learning curve |
All three run in a single Docker container; on Fly, a ~$5/mo machine is enough.
Useful starter queries
Verify volume by tenant, last 24h:
```sql
SELECT tenant_id, source, count(*), avg(latency_ms)::int AS avg_latency_ms
FROM validation_audit
WHERE occurred_at > now() - interval '24 hours'
GROUP BY tenant_id, source
ORDER BY count(*) DESC;
```

`valid = false` rate by tenant; a spike here usually means a tenant onboarding regression:
```sql
SELECT tenant_id,
       count(*) FILTER (WHERE valid = false) AS invalid,
       count(*) AS total,
       (count(*) FILTER (WHERE valid = false))::float / count(*) AS rate
FROM validation_audit
WHERE occurred_at > now() - interval '7 days'
GROUP BY tenant_id
HAVING count(*) > 100
ORDER BY rate DESC;
```

Webhook delivery health:
```sql
SELECT tenant_id, status, count(*)
FROM webhook_deliveries
WHERE created_at > now() - interval '7 days'
GROUP BY tenant_id, status
ORDER BY tenant_id, status;
```

API key activity, to find unused keys to revoke:
```sql
SELECT id, tenant_id, name, last_used_at,
       extract(day FROM now() - coalesce(last_used_at, created_at)) AS days_idle
FROM api_keys
WHERE revoked_at IS NULL
ORDER BY last_used_at DESC NULLS LAST;
```

Lock down access
The BI tool sees all tenant data, including the HMAC-keyed identifier hashes in validation_audit (which can't be reversed without the master key, but are still sensitive). Treat it as sensitive infrastructure:
- Put it behind Cloudflare Access (free tier, email SSO) or Tailscale
- Don't expose on the public internet without auth
- Use a strong dedicated password for the `monitoring` user, different from any application credential
- Periodically rotate it (`ALTER USER monitoring WITH PASSWORD '<new>'`)
Why not build it into Attesto?
Two reasons:
- Security surface. An `/admin/*` HTTP route would need its own auth, rate limiting, audit logging, etc. — an attack surface that has to be maintained alongside the verify endpoints. A read-only Postgres user plus an external tool sidesteps it entirely.
- Dashboard iteration speed. Metabase / Grafana dashboards take minutes to build and modify; an equivalent custom UI takes weeks. The BI tool's authors are better at dashboard UX than you'll be at re-implementing it.
Backup and disaster recovery
Database backups
Fly Postgres takes daily snapshots automatically (retained 7 days on the free tier, 30 on paid). To recover:
```sh
fly postgres backup list -a attesto-db
fly postgres backup restore -a attesto-db <backup-id>
```

For self-hosted, use pg_dump on a schedule:

```sh
docker compose exec db pg_dump -U attesto attesto | gzip > backup-$(date +%F).sql.gz
```

Encryption key recovery
There is no key recovery. Lose ATTESTO_ENCRYPTION_KEY and every encrypted credential becomes unrecoverable. The mitigation is good backup hygiene:
- Store the key in a password manager BEFORE first deploy
- For team setups, use a shared password manager with controlled access
- Document who has access in your runbook so you know who to call when it's needed at 2am
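One low-effort addition: record a fingerprint of the key in your runbook so anyone can verify a password-manager copy without exposing the key itself (a sketch; assumes a POSIX shell with `sha256sum` available):

```sh
# Print a fingerprint of the deployed key; compare against the runbook copy.
printf '%s' "$ATTESTO_ENCRYPTION_KEY" | sha256sum | awk '{ print $1 }'
```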
Tenant data restore
If a tenant accidentally deletes their credentials (or needs them restored from a backup):
```sql
-- in psql against a restored backup
INSERT INTO apple_credentials
SELECT * FROM old_backup.apple_credentials WHERE tenant_id = 'tenant_…';
```

Then ask the tenant to verify a transaction to confirm.
Incident response patterns
Apple / Google API outage
Symptoms: spike in APPLE_API_ERROR or GOOGLE_API_ERROR (502 status). Action:
- Check Apple Developer System Status or Google Cloud Status
- If the upstream is degraded, tenant verifications queue up; Attesto will continue retrying. Communicate to tenants that verifications are delayed.
- Don't restart the service unless `/ready` is failing. Apple/Google outages don't affect Attesto's own health.
DB outage
Symptoms: /ready fails with db: fail: connection refused. Action:
- Check Fly Postgres status: `fly status -a attesto-db`
- If down: scale Postgres back up or restore from backup
- The Attesto app keeps running: `/health` returns 200, `/ready` returns 503. New verify requests fail with `INTERNAL_ERROR`; webhook deliveries queued in-memory are lost if the app restarts
Bad deploy
Symptoms: deploy completed, but /ready flapping or new errors in logs. Action:
- `fly releases list -a attesto` to find the previous good release
- `fly releases rollback -a attesto v<previous>`
- Investigate the failed release in a feature branch, not in prod
Webhook callback URL is failing
Symptoms: a tenant's webhook_deliveries.status='failed' count climbs. Action:
- Look at `last_response_code` and `last_response_body` for that tenant's deliveries. Typical values: `404` (URL changed), `500` (their backend is broken), connection refused (their service is down)
- Reach out to the tenant
- Once they fix it, incoming events from then on deliver normally; failed deliveries don't auto-retry beyond the 7h12m schedule
What's next
- Maintenance — periodic tasks (key rotation, retention)
- Troubleshooting — symptom-keyed problem-solving
- Testing — running tests in CI and locally