ZAHLEN

API User Guide

Chapter 16 - Production Best Practices

Key rotation | Retry strategies | Monitoring

Audience

Merchants, developers, and integration engineers responsible for deploying, operating, and supporting a production Zahlen integration.

Version 1.0 | Source baseline: zahlen_deploy_0616A.tar.gz | June 2026


Learning objectives

By the end of this chapter, you should be able to rotate API keys without downtime, implement safe network and application retries, monitor the complete Zahlen workflow, and define a practical production-readiness process.

A correct API integration is not automatically a production-ready integration. Production systems must assume credentials will be rotated, networks will fail, requests may be repeated, dependencies may slow down, and operators will need enough evidence to understand what happened.

Canonical payment schedule

Zahlen payment retries follow the fixed Day 1, Day 2, Day 6, and Day 16 schedule. HTTP retries, queue retries, webhook retries, and worker retries are technical recovery mechanisms; they must never create extra payment authorization attempts outside that schedule.

‌Production operating principles
- Fail closed when authentication or tenant context cannot be resolved.
- Use stable identifiers and idempotency keys for every repeatable write operation.
- Separate payment retry scheduling from API transport retry behavior.
- Collect enough telemetry to correlate merchant activity with Zahlen request, decision, batch, job, and outcome identifiers.
- Test key rotation, throttling, partial failures, and dependency outages before launch.

‌Key rotation

API keys are production credentials. Rotation should be planned as a normal operational process, not reserved only for emergencies. A safe rotation uses a controlled overlap period so the replacement key can be deployed and verified before the previous key is revoked.

‌Recommended rotation procedure

1. Create a replacement key for the correct tenant, merchant, environment, and service.
2. Store the new key in the approved secret manager. Do not place it in source control, a ticket, or a chat message.
3. Deploy the new key to one canary instance or a small percentage of traffic.
4. Verify authenticated health, payment-event, decision, and outcome requests with the new key.
5. Roll the new key out to all application instances, workers, and scheduled jobs.
6. Confirm that traffic using the old key has stopped.
7. Revoke the old key and monitor for rejected use attempts.
8. Record the rotation event, key identifier, operator, time, and validation evidence in the audit trail.

Do not rotate by overwrite alone

Replacing a secret everywhere at once without overlap can create an avoidable outage. Use dual-key overlap when the administrative policy permits it, then revoke the retired credential after verification.
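
The "confirm that traffic using the old key has stopped" check can be made an explicit gate before revocation. The following is a minimal sketch, assuming merchant-side request logs record a safe key ID and a timestamp for each request. The names `can_revoke` and `QUIET_PERIOD`, the 24-hour window, and the key IDs are hypothetical and are not part of the Zahlen API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: how long the old key must be unused before revocation.
QUIET_PERIOD = timedelta(hours=24)


def can_revoke(old_key_id, usage_log, now, quiet_period=QUIET_PERIOD):
    """Return True when old_key_id has not been used within quiet_period.

    usage_log is an iterable of (key_id, timestamp) pairs taken from
    merchant-side request logs; it never contains the secret itself.
    """
    last_seen = max(
        (ts for key_id, ts in usage_log if key_id == old_key_id),
        default=None,
    )
    return last_seen is None or now - last_seen >= quiet_period


now = datetime(2026, 6, 16, 12, 0, tzinfo=timezone.utc)
log = [
    ("key_old", now - timedelta(hours=30)),   # last use of the old key
    ("key_new", now - timedelta(minutes=1)),  # replacement key is live
]
print(can_revoke("key_old", log, now))
```

A gate like this only reflects the traffic that reaches the log source it reads, so scheduled jobs and rarely used workers should be covered by the quiet period before the old key is revoked.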

‌Key storage and exposure prevention

Control	Production expectation
Secret storage	Use a managed secret store or protected runtime injection mechanism.
Application logs	Never log the complete key; log only a safe key ID or fingerprint.
Client location	Keep production keys in server-side systems, not browsers or mobile applications.
Environment separation	Use different keys for development, staging, production, and independent services.
Access scope	Grant secret access only to the workloads and operators that require it.
Revocation	Revoke immediately after confirmed compromise or unauthorized disclosure.
Audit	Retain creation, rotation, revocation, and failed-use evidence.
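
The "safe key ID or fingerprint" expectation can be met by logging a truncated hash instead of the key. This is a sketch under assumptions: the `fp_` prefix, the 12-character length, and the example key value are illustrative choices, and it presumes API keys are long random strings so a truncated SHA-256 digest cannot practically be reversed.

```python
import hashlib


def key_fingerprint(api_key: str, length: int = 12) -> str:
    """Return a short, non-reversible fingerprint that is safe to log."""
    digest = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
    return "fp_" + digest[:length]


# Hypothetical key value used only for illustration.
fingerprint = key_fingerprint("zk_test_example_not_a_real_key")
print(fingerprint)
```

The same key always yields the same fingerprint, so log entries can be grouped by credential without the secret ever appearing in a log line.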

‌Compromise response

1. Identify the affected key without redistributing the secret.
2. Revoke or disable the key according to incident policy.
3. Issue and deploy a replacement credential through the controlled rotation process.
4. Review API activity by tenant, endpoint, time, source, and status code.
5. Investigate unexpected outcome reporting, event ingestion, or decision traffic.

‌Retry strategy: classify before retrying

The client must determine whether a failure is safe to repeat. A retry decision should consider the HTTP method, status code, idempotency behavior, whether the server may have completed the operation, and the maximum retry budget.

Condition	Automatic retry?	Recommended behavior
GET with timeout or transient 5xx	Usually	Retry with bounded exponential backoff and jitter.
POST with stable Idempotency-Key	Carefully	Repeat the same logical request with the same key.
POST without idempotency guarantee	No automatic blind retry	Reconcile using stable identifiers before repeating.
400 or 422	No	Correct the payload; do not retry unchanged.
401 or 403	No	Repair authentication or authorization.
404	Usually no	Verify tenant-scoped identifier and endpoint.
409	Reconcile	Compare the original request and idempotency key.
429	Yes, later	Honor Retry-After when present and apply jitter.
500 or 503	Carefully	Retry within a bounded budget using idempotency.
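
The classification table can be expressed as one function that a client consults before any retry. This is a minimal sketch of the table above, not a Zahlen client library. `RetryAction` and `classify` are hypothetical names, and the sketch treats "usually no" for 404 as "do not retry automatically".

```python
from enum import Enum


class RetryAction(Enum):
    RETRY = "retry with bounded backoff and jitter"
    RETRY_LATER = "honor Retry-After, then retry with jitter"
    RECONCILE = "reconcile before repeating"
    DO_NOT_RETRY = "fix the request or credentials first"


def classify(method, status=None, has_idempotency_key=False, timed_out=False):
    """Map one failed request to a retry action, following the table above."""
    if status in (400, 401, 403, 404, 422):
        return RetryAction.DO_NOT_RETRY
    if status == 409:
        return RetryAction.RECONCILE
    if status == 429:
        return RetryAction.RETRY_LATER
    if timed_out or (status is not None and status >= 500):
        # Only reads and idempotent writes are safe to repeat blindly.
        safe_to_repeat = method == "GET" or has_idempotency_key
        return RetryAction.RETRY if safe_to_repeat else RetryAction.RECONCILE
    # Anything else is not a failure covered by the table.
    return RetryAction.DO_NOT_RETRY


print(classify("POST", 503, has_idempotency_key=True).name)
print(classify("POST", timed_out=True).name)
```

Keeping the decision in one place makes it harder for a single call site to add a blind retry for a POST that carries no idempotency key.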

Uncertain result

A timeout does not prove that the server did nothing. For a write request, treat the result as uncertain until you reconcile it using the idempotency key or durable business identifier.

‌Exponential backoff with jitter
Backoff reduces pressure on a recovering service. Jitter prevents many clients from retrying at the same instant. Always cap both the delay and the total number of attempts.
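import random
import time

BASE_DELAY = 1.0    # seconds before the first retry
MAX_DELAY = 30.0    # cap on any single delay, applied before jitter
MAX_ATTEMPTS = 5    # cap on the total number of retries

for retry_number in range(MAX_ATTEMPTS):
    # Exponential growth (1, 2, 4, 8, 16 seconds), capped at MAX_DELAY.
    delay = min(MAX_DELAY, BASE_DELAY * (2 ** retry_number))
    # Apply +/-25% jitter so clients do not retry in lockstep.
    delay *= random.uniform(0.75, 1.25)
    # A real client would repeat the request here and stop on success.
    time.sleep(delay)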
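
For 429 responses, the backoff delay should not undercut a `Retry-After` value sent by the server. The sketch below combines the two, assuming the header follows the standard HTTP forms (delta-seconds or an HTTP-date). The function names are hypothetical, and the jitter range differs from the loop above so that it only ever adds time.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Parse Retry-After as delta-seconds or an HTTP-date; None if unusable."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_delay(retry_number, retry_after=None, base=1.0, cap=30.0):
    """Backoff delay that never undercuts the server's Retry-After hint."""
    backoff = min(cap, base * (2 ** retry_number))
    server_hint = retry_after_seconds(retry_after)
    wait = backoff if server_hint is None else max(backoff, server_hint)
    # Jitter only adds time, so the server's minimum wait is still honored.
    return wait * random.uniform(1.0, 1.25)


print(retry_after_seconds("120"))
```

The attempt count still needs its own cap, because a long `Retry-After` can exceed the local delay cap.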

‌Retry budget guidance

- Use a small maximum attempt count for synchronous user-facing requests.
- Use longer, durable retry queues for asynchronous processing when appropriate.
- Do not retry forever; move exhausted work to an alert or review queue.
- Preserve the same idempotency key for the same logical operation.
- Use a new idempotency key only for a genuinely new operation.
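
The budget rules above can be sketched as a bounded retry loop that reuses one idempotency key across attempts. This is an illustration under assumptions: `send` is a hypothetical callable standing in for the HTTP call, it is assumed to raise `TimeoutError` or `ConnectionError` on transient failure, and `RetryBudgetExhausted` is an invented exception name.

```python
import random
import time
import uuid


class RetryBudgetExhausted(Exception):
    """Raised so the caller can route the work to an alert or review queue."""


def send_with_retries(send, payload, max_attempts=3, base=1.0, cap=30.0,
                      sleep=time.sleep):
    """Call send(payload, idempotency_key) until success or budget exhaustion."""
    # One key per logical operation, created once and reused on every attempt.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return send(payload, idempotency_key)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                break  # budget spent; do not sleep again
            delay = min(cap, base * (2 ** attempt)) * random.uniform(0.75, 1.25)
            sleep(delay)
    raise RetryBudgetExhausted(idempotency_key)


# Usage with a stand-in transport that fails once, then succeeds.
seen_keys = []

def fake_send(payload, key):
    seen_keys.append(key)
    if len(seen_keys) == 1:
        raise TimeoutError("simulated timeout")
    return {"status": "accepted"}

result = send_with_retries(fake_send, {"event_id": "evt_example"},
                           sleep=lambda seconds: None)
print(result, len(set(seen_keys)))
```

In a durable retry queue, the key would be stored with the job or derived from a stable business identifier, so that it survives a process restart instead of being generated in memory.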

‌Technical retries versus payment retries

A production integration has several different retry layers. They must remain separate so a network recovery does not accidentally become another charge attempt.

Retry layer	Purpose	May create a new payment attempt?
HTTP request retry	Recover from network or server failure	No, not by itself
Queue delivery retry	Redeliver internal work	No, not by itself
Webhook delivery retry	Redeliver notification payload	No
Worker retry	Re-run a failed technical task	No, unless the task is explicitly authorized and idempotent
Zahlen payment retry schedule	Execute the business-authorized payment authorization attempt	Yes, only on Day 1, Day 2, Day 6, or Day 16

Critical safeguard

Store a durable payment-attempt record keyed by subscription or billing cycle and attempt number. Before sending an authorization to the processor, verify that the planned attempt matches the authorized Zahlen schedule and has not already been executed.
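
One way to implement this safeguard is a uniqueness constraint on the attempt record, so a redelivered job cannot claim the same attempt twice. The sketch below uses SQLite for illustration only. The table name, the `claim_attempt` function, and the mapping of attempt numbers 1 to 4 onto Day 1, 2, 6, and 16 are assumptions, not part of the Zahlen contract.

```python
import sqlite3

# Assumption: attempt numbers 1-4 map onto the fixed schedule days.
SCHEDULE_DAYS = {1: 1, 2: 2, 3: 6, 4: 16}


def open_store(path=":memory:"):
    """Open the attempt store and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payment_attempts ("
        " billing_cycle_id TEXT NOT NULL,"
        " attempt_number INTEGER NOT NULL,"
        " schedule_day INTEGER NOT NULL,"
        " PRIMARY KEY (billing_cycle_id, attempt_number))"
    )
    return conn


def claim_attempt(conn, billing_cycle_id, attempt_number, schedule_day):
    """Return True only if the attempt is on schedule and not yet recorded."""
    if SCHEDULE_DAYS.get(attempt_number) != schedule_day:
        return False  # outside the authorized schedule
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payment_attempts VALUES (?, ?, ?)",
                (billing_cycle_id, attempt_number, schedule_day),
            )
    except sqlite3.IntegrityError:
        return False  # already claimed, e.g. by a redelivered queue message
    return True


store = open_store()
print(claim_attempt(store, "cycle_example", 3, 6))  # first claim
print(claim_attempt(store, "cycle_example", 3, 6))  # duplicate delivery
```

A production record would likely also carry a status and the processor reference, so that an attempt that was claimed but never confirmed can be reconciled rather than silently repeated.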

‌Monitoring the end-to-end workflow

Monitor the complete commercial workflow, not only API uptime. A healthy health endpoint does not prove that decisioning, outcomes, investigation runs, or webhooks are functioning correctly.

Signal	Why it matters	Example alert condition
Authentication failure rate	Detects revoked, expired, or misconfigured keys	Sustained increase in 401 responses
Authorization failure rate	Detects plan, role, or route policy problems	Unexpected increase in 403 responses
429 rate	Shows rate-limit or quota pressure	Repeated throttling above normal baseline
Request latency	Detects slow dependencies or policy evaluation	High p95 or p99 latency
5xx / 503 rate	Shows service or dependency failure	Error ratio exceeds agreed threshold
Decision completion	Confirms event-to-decision flow	Events remain without decisions beyond expected time
Outcome reporting lag	Shows learning-loop interruption	Decision executed but outcome not reported
Webhook failure rate	Detects unavailable callbacks or consumer errors	Delivery failures or retry backlog
Investigation-run backlog	Shows ingestion or bridge pressure	Runs remain non-terminal or unpopulated
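
Several of these alert conditions reduce to ratios and percentiles over a time window. The following sketch evaluates three of them from raw status codes and latencies. The thresholds (2% for 401, 1% for 5xx, 800 ms for p95) are placeholder assumptions to be replaced with agreed values, and the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_window(status_codes, latencies_ms,
                    max_401_ratio=0.02, max_5xx_ratio=0.01, max_p95_ms=800):
    """Return the names of the alert conditions breached in one time window."""
    total = len(status_codes)
    alerts = []
    if total == 0:
        return alerts
    if sum(code == 401 for code in status_codes) / total > max_401_ratio:
        alerts.append("authentication_failure_rate")
    if sum(500 <= code <= 599 for code in status_codes) / total > max_5xx_ratio:
        alerts.append("server_error_ratio")
    if latencies_ms and percentile(latencies_ms, 95) > max_p95_ms:
        alerts.append("p95_latency")
    return alerts


print(evaluate_window([200] * 95 + [401] * 5, [120] * 100))
```

Workflow signals such as decision completion and outcome lag need a different input: the age of the oldest event without a decision, or of the oldest executed decision without an outcome.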

‌Correlation and structured logging

Production logs should make it possible to trace one payment event across merchant systems and Zahlen without exposing credentials or prohibited payment data.

Identifier	Use in logs
event_id	Merchant-created correlation for one payment event
payment_event_batch_id / batch_id	Groups events submitted together
upload_job_id	Connects ingestion with background processing and investigation
request_id	Correlates an API request with support and audit records
decision_id	Identifies the retry recommendation
outcome_id	Identifies the reported execution result
subscription_id / billing_cycle_id	Connects activity to the merchant billing workflow
safe key ID or fingerprint	Identifies credential use without exposing the secret

{
  "event": "zahlen_api_request",
  "endpoint": "/v1/_next/retry-decision",
  "method": "POST",
  "status_code": 200,
  "latency_ms": 162,
  "event_id": "evt_20260616_0001",
  "request_id": "req_example",
  "decision_id": "dec_example",
  "idempotent_replay": false
}

Sensitive-data rule

Do not log API keys, full card numbers, CVV values, passwords, raw bank credentials, or unrestricted request bodies. Prefer allow-listed structured fields and merchant-side tokens.
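
An allow-list is simpler to enforce when log lines are built by one helper that drops every field it does not recognize. This is a minimal sketch: the field set mirrors the identifiers named in this chapter, while `key_fingerprint` as a field name and `build_log_line` as a function name are assumptions.

```python
import json

# Assumption: the allow-list mirrors the identifiers listed in this chapter.
ALLOWED_FIELDS = frozenset({
    "event", "endpoint", "method", "status_code", "latency_ms",
    "event_id", "batch_id", "upload_job_id", "request_id", "decision_id",
    "outcome_id", "subscription_id", "billing_cycle_id", "key_fingerprint",
    "idempotent_replay",
})


def build_log_line(fields):
    """Serialize only allow-listed fields; everything else is dropped."""
    safe = {name: value for name, value in fields.items()
            if name in ALLOWED_FIELDS}
    return json.dumps(safe, sort_keys=True)


line = build_log_line({
    "event": "zahlen_api_request",
    "status_code": 200,
    "request_id": "req_example",
    "api_key": "must-never-be-logged",   # dropped: not on the allow-list
    "request_body": {"card": "..."},     # dropped: unrestricted body
})
print(line)
```

An allow-list fails safe: a field added to a request later stays out of the logs until someone deliberately permits it, which a deny-list cannot guarantee.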

‌Health checks and synthetic tests

Use layered checks. A connectivity probe verifies that the service responds; an authenticated synthetic test verifies that the commercial integration works. Synthetic tests must use approved test tenants, test credentials, and non-production payment evidence.

Check	What it proves
GET /v1/health	Service is reachable and returning health metadata
GET /v1/version	Deployed API and application version can be identified
Authenticated read	Key resolution and tenant-scoped authorization work
Synthetic event ingestion	Schema validation and ingestion path work
Synthetic decision request	Decision contract and policy evaluation work
Synthetic outcome report	Learning-loop write path works
Webhook test delivery	Callback, verification, deduplication, and processing work

- Run smoke tests after deployment, configuration change, key rotation, and dependency maintenance.
- Use unique test identifiers so synthetic activity can be filtered from business reporting.
- Alert when a test fails repeatedly, but avoid aggressive probes that create rate-limit pressure.
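
Layered checks are easiest to interpret when they run in order and stop at the first failing layer, because that layer names the broken part of the integration. The runner below is a sketch only: each check is a caller-supplied function, and the stand-in lambdas replace real calls against an approved test tenant.

```python
def run_layered_checks(checks):
    """Run (name, check) pairs in order and stop at the first failure.

    Each check is a zero-argument callable returning True on success.
    Returns (passed_names, failed_name), where failed_name is None if all pass.
    """
    passed = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False  # an exception counts as a failed layer
        if not ok:
            return passed, name
        passed.append(name)
    return passed, None


# Stand-in checks; real ones would call the test tenant with test credentials.
layers = [
    ("health", lambda: True),
    ("authenticated_read", lambda: True),
    ("synthetic_decision", lambda: False),
    ("synthetic_outcome", lambda: True),
]
passed, failed = run_layered_checks(layers)
print(passed, failed)
```

Stopping at the first failure also limits probe traffic during an incident, which fits the guidance on avoiding rate-limit pressure.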

‌Deployment and change management
1. Validate request models against the current /v1 contract or discovery metadata.
2. Run unit, integration, contract, and tenant-isolation tests.
3. Deploy to staging with staging-only keys and data stores.
4. Run synthetic event, decision, outcome, and webhook tests.
5. Deploy gradually to production and observe error rate, latency, and throughput.
6. Verify the fixed Day 1, Day 2, Day 6, and Day 16 payment schedule in production configuration.
7. Record the release, configuration version, and verification evidence.

Rollback readiness
- Keep the previous application version and configuration available for rollback.
- Do not roll back databases or durable events without a tested migration strategy.
- Preserve idempotency and identifiers across deployment boundaries.
- After rollback, reconcile uncertain writes and verify outcome-reporting continuity.

‌Production-readiness checklist

Area	Ready when
Credentials	Keys are secret-managed, environment-specific, rotatable, and auditable.
Tenant safety	No client-supplied tenant ID is trusted as the ownership boundary.
Idempotency	Repeatable POST operations use stable logical keys and reconciliation.
Retry policy	400/401/403/404/422 are not blindly retried; 429/5xx use bounded backoff.
Payment schedule	Processor attempts are restricted to Day 1, Day 2, Day 6, and Day 16.
Timeouts	Every external request has connect and read timeouts.
Monitoring	Authentication, 429, 5xx, latency, decisions, outcomes, webhooks, and runs are observed.
Logging	Correlation IDs are captured without logging secrets or prohibited card data.
Testing	Key rotation, throttling, timeouts, duplicate delivery, and partial failure have been tested.
Operations	Runbooks, owners, escalation paths, and rollback procedures are documented.

Go-live rule

Do not launch solely because a happy-path request succeeded. Go live only after failure behavior, security controls, observability, and operator response have been demonstrated in a production-like environment.

‌Practical incident playbooks
‌Authentication failures spike
1. Check recent deployments and secret-manager changes.
2. Identify affected key IDs and environments without exposing secret material.
3. Compare 401 failures by endpoint, tenant, and application instance.
4. Rotate or replace misconfigured or compromised keys.
5. Review audit activity and confirm recovery after deployment.

429 responses increase
1. Confirm whether the limit is short-window rate limiting or longer-window quota exhaustion.
2. Check for loops, duplicate jobs, or unexpected traffic growth.
3. Verify that clients honor Retry-After and use jitter.
4. Reduce concurrency or batch safely where appropriate.
5. Review plan and quota settings only after abnormal traffic has been ruled out.

Outcome reporting falls behind
1. Compare completed processor attempts with reported outcome IDs.
2. Check queue depth, worker health, and failed outcome requests.
3. Replay only with stable identifiers and idempotency.
4. Reconcile missing records without creating another payment attempt.
5. Confirm the learning loop has resumed and the backlog is decreasing.
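
The first step of the outcome-reporting playbook, comparing completed processor attempts with reported outcomes, amounts to a set difference. The sketch below assumes both record sets can be joined on `decision_id`; that join key, the record shapes, and the function name are assumptions for illustration.

```python
def missing_outcomes(completed_attempts, reported_outcomes):
    """Return completed attempts that have no reported outcome yet.

    completed_attempts: dicts with a "decision_id", from merchant-side records.
    reported_outcomes: dicts with "decision_id" and "outcome_id".
    """
    reported = {outcome["decision_id"] for outcome in reported_outcomes}
    return [attempt for attempt in completed_attempts
            if attempt["decision_id"] not in reported]


attempts = [
    {"decision_id": "dec_a", "billing_cycle_id": "cycle_1"},
    {"decision_id": "dec_b", "billing_cycle_id": "cycle_2"},
]
outcomes = [{"decision_id": "dec_a", "outcome_id": "out_a"}]
print(missing_outcomes(attempts, outcomes))
```

The result lists outcome reports to replay. It involves no call to the processor, so reconciliation cannot create another payment attempt.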
‌Chapter summary
- Rotate keys with overlap, verification, revocation, and audit evidence.
- Classify failures before retrying and use stable idempotency for uncertain writes.
- Use bounded exponential backoff with jitter for 429 and transient server failures.
- Keep technical retries separate from the fixed Day 1, Day 2, Day 6, and Day 16 payment schedule.
- Monitor authentication, throttling, latency, decisions, outcomes, webhooks, and investigation runs.
- Log durable correlation identifiers while minimizing sensitive data.
- Treat failure testing, runbooks, and rollback readiness as launch requirements.

Next step

Use Appendix A for the OpenAPI specification, Appendix B for complete JSON examples, and Appendix C for troubleshooting guidance.