Key rotation | Retry strategies | Monitoring
Audience |
Merchants, developers, and integration engineers responsible for deploying, operating, and supporting a production Zahlen integration. |
Version 1.0 | Source baseline: zahlen_deploy_0616A.tar.gz | June 2026
Learning objectives |
By the end of this chapter, you should be able to rotate API keys without downtime, implement safe network and application retries, monitor the complete Zahlen workflow, and define a practical production-readiness process. |
A correct API integration is not automatically a production-ready integration. Production systems must assume credentials will be rotated, networks will fail, requests may be repeated, dependencies may slow down, and operators will need enough evidence to understand what happened.
Canonical payment schedule |
Zahlen payment retries follow the fixed Day 1, Day 2, Day 6, and Day 16 schedule. HTTP retries, queue retries, webhook retries, and worker retries are technical recovery mechanisms; they must never create extra payment authorization attempts outside that schedule. |
Production operating principles
Fail closed when authentication or tenant context cannot be resolved.
Use stable identifiers and idempotency keys for every repeatable write operation.
Separate payment retry scheduling from API transport retry behavior.
Collect enough telemetry to correlate merchant activity with Zahlen request, decision, batch, job, and outcome identifiers.
Test key rotation, throttling, partial failures, and dependency outages before launch.
Key rotation
API keys are production credentials. Rotation should be planned as a normal operational process, not reserved only for emergencies. A safe rotation uses a controlled overlap period so the replacement key can be deployed and verified before the previous key is revoked.
Recommended rotation procedure
Create a replacement key for the correct tenant, merchant, environment, and service.
Store the new key in the approved secret manager. Do not place it in source control, a ticket, or a chat message.
Deploy the new key to one canary instance or a small percentage of traffic.
Verify authenticated health, payment-event, decision, and outcome requests with the new key.
Roll the new key out to all application instances, workers, and scheduled jobs.
Confirm that traffic using the old key has stopped.
Revoke the old key and monitor for rejected use attempts.
Record the rotation event, key identifier, operator, time, and validation evidence in the audit trail.
Do not rotate by overwrite alone |
Replacing a secret everywhere at once without overlap can create an avoidable outage. Use dual-key overlap when the administrative policy permits it, then revoke the retired credential after verification. |
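The canary step and the overlap period can be sketched as a deterministic rollout rule. This is a minimal illustration, not a Zahlen mechanism: the function names, the placeholder key IDs `key_new` and `key_old`, and the hash-based bucketing are all assumptions introduced for the example. Only key IDs are handled; the secrets themselves stay in the secret manager.

```python
import hashlib


def rollout_bucket(instance_id: str) -> int:
    """Map an instance ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(instance_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def select_key_id(
    instance_id: str,
    new_key_percent: int,
    new_key_id: str = "key_new",  # hypothetical key ID
    old_key_id: str = "key_old",  # hypothetical key ID
) -> str:
    """Return which key ID an instance should load during the overlap period.

    new_key_percent is the share of instances (0-100) moved to the new key.
    """
    if not 0 <= new_key_percent <= 100:
        raise ValueError("new_key_percent must be between 0 and 100")
    if rollout_bucket(instance_id) < new_key_percent:
        return new_key_id
    return old_key_id
```

Because the bucket is derived from the instance ID, raising the percentage only ever moves instances from the old key to the new one, which keeps the rollout easy to reason about before the old key is revoked.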
Key storage and exposure prevention
Control | Production expectation |
Secret storage | Use a managed secret store or protected runtime injection mechanism. |
Application logs | Never log the complete key; log only a safe key ID or fingerprint. |
Client location | Keep production keys in server-side systems, not browsers or mobile applications. |
Environment separation | Use different keys for development, staging, production, and independent services. |
Access scope | Grant secret access only to the workloads and operators that require it. |
Revocation | Revoke immediately after confirmed compromise or unauthorized disclosure. |
Audit | Retain creation, rotation, revocation, and failed-use evidence. |
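One way to log a safe fingerprint instead of the key is shown below. This is a sketch under assumptions: the function name, the SHA-256 choice, and the 12-character length are illustrative, and a key ID issued by the platform is preferable when one is available.

```python
import hashlib


def key_fingerprint(api_key: str, length: int = 12) -> str:
    """Return a short one-way fingerprint of an API key for use in logs.

    The fingerprint identifies which credential was used without
    revealing the credential itself.
    """
    digest = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
    return digest[:length]
```

The same key always yields the same fingerprint, so log lines from different instances can be grouped by credential during a rotation or an incident review.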
Compromise response
Identify the affected key without redistributing the secret.
Revoke or disable the key according to incident policy.
Issue and deploy a replacement credential through the controlled rotation process.
Review API activity by tenant, endpoint, time, source, and status code.
Investigate unexpected outcome reporting, event ingestion, or decision traffic.
Retry strategy: classify before retrying
The client must determine whether a failure is safe to repeat. A retry decision should consider the HTTP method, status code, idempotency behavior, whether the server may have completed the operation, and the maximum retry budget.
Condition | Automatic retry? | Recommended behavior |
GET with timeout or transient 5xx | Usually | Retry with bounded exponential backoff and jitter. |
POST with stable Idempotency-Key | Carefully | Repeat the same logical request with the same key. |
POST without idempotency guarantee | No automatic blind retry | Reconcile using stable identifiers before repeating. |
400 or 422 | No | Correct the payload; do not retry unchanged. |
401 or 403 | No | Repair authentication or authorization. |
404 | Usually no | Verify the tenant-scoped identifier and the endpoint path. |
409 | Reconcile | Compare the original request and idempotency key. |
429 | Yes, later | Honor Retry-After when present and apply jitter. |
500 or 503 | Carefully | Retry within a bounded budget using idempotency. |
Uncertain result |
A timeout does not prove that the server did nothing. For a write request, treat the result as uncertain until you reconcile it using the idempotency key or durable business identifier. |
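The classification table above can be expressed as a small decision function. This is a hedged sketch: `RetryAction` and `classify` are hypothetical names, and the handling of statuses not listed in the table (treated as "do not retry automatically") is an assumption.

```python
from enum import Enum
from typing import Optional


class RetryAction(Enum):
    RETRY = "retry"          # repeat with bounded backoff and jitter
    RECONCILE = "reconcile"  # check server state before repeating
    FIX = "fix"              # correct the request or credentials first


def classify(method: str, status: Optional[int], has_idempotency_key: bool) -> RetryAction:
    """Classify a failed request. status is None for a timeout or dropped connection."""
    if status in (400, 401, 403, 404, 422):
        return RetryAction.FIX
    if status == 409:
        return RetryAction.RECONCILE
    if status == 429:
        return RetryAction.RETRY  # the caller should wait as instructed first
    if status is None or 500 <= status <= 599:
        # The server may have completed the operation, so a write is only
        # repeated blindly when an idempotency key protects it.
        if method.upper() == "GET" or has_idempotency_key:
            return RetryAction.RETRY
        return RetryAction.RECONCILE
    return RetryAction.FIX  # assumption: anything else is not retried automatically
```

For example, `classify("POST", None, False)` returns `RetryAction.RECONCILE`, which reflects the uncertain-result rule for writes that time out without an idempotency guarantee.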
Exponential backoff with jitter
Backoff reduces pressure on a recovering service. Jitter prevents many clients from retrying at the same instant. Always cap both the delay and the total number of attempts.
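import random
import time

BASE_DELAY = 1.0    # seconds; delay before the first retry
MAX_DELAY = 30.0    # seconds; upper bound on any single delay
MAX_ATTEMPTS = 5    # retry budget

for retry_number in range(MAX_ATTEMPTS):
    # Exponential growth: 1, 2, 4, 8, 16 seconds before jitter.
    delay = BASE_DELAY * (2 ** retry_number)
    # Apply +/-25% jitter, then cap so no delay exceeds MAX_DELAY.
    delay = min(MAX_DELAY, delay * random.uniform(0.75, 1.25))
    time.sleep(delay)  # the request attempt itself is omitted from this sketch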
Retry budget guidance
Use a small maximum attempt count for synchronous user-facing requests.
Use longer, durable retry queues for asynchronous processing when appropriate.
Do not retry forever; move exhausted work to an alert or review queue.
Preserve the same idempotency key for the same logical operation.
Use a new idempotency key only for a genuinely new operation.
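The budget guidance above can be combined into one bounded loop that reuses a single idempotency key. This is a sketch under stated assumptions: `send_with_retries` and the shape of the `send` callback are hypothetical, the set of transient status codes is illustrative, and the server is assumed to deduplicate requests that carry the same key.

```python
import random
import time
import uuid
from typing import Callable, Optional, Tuple

TRANSIENT = {429, 500, 502, 503, 504}  # illustrative set of retryable statuses


def send_with_retries(
    send: Callable[[str], Tuple[Optional[int], Optional[float]]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[int], str, int]:
    """Call send(idempotency_key) until it succeeds or the budget is spent.

    send returns (status, retry_after_seconds); status is None on a timeout.
    Returns (last_status, idempotency_key, attempts_used).
    """
    idempotency_key = str(uuid.uuid4())  # one key for the whole logical operation
    status: Optional[int] = None
    for attempt in range(1, max_attempts + 1):
        status, retry_after = send(idempotency_key)
        if status is not None and status not in TRANSIENT:
            return status, idempotency_key, attempt  # success or non-retryable
        if attempt == max_attempts:
            break  # budget exhausted: hand off to an alert or review queue
        delay = base_delay * 2 ** (attempt - 1)
        delay = min(max_delay, delay * random.uniform(0.75, 1.25))
        if retry_after is not None:
            delay = max(delay, retry_after)  # never retry sooner than the server asked
        sleep(delay)
    return status, idempotency_key, max_attempts
```

The `sleep` parameter exists so the loop can be exercised in tests without waiting. A caller that receives a transient status after the final attempt should route the work to review rather than start a new loop with a new key.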
Technical retries versus payment retries
A production integration has several different retry layers. They must remain separate so a network recovery does not accidentally become another charge attempt.
Retry layer | Purpose | May create a new payment attempt? |
HTTP request retry | Recover from network or server failure | No, not by itself |
Queue delivery retry | Redeliver internal work | No, not by itself |
Webhook delivery retry | Redeliver notification payload | No |
Worker retry | Re-run a failed technical task | No, unless the task is explicitly authorized and idempotent |
Zahlen payment retry schedule | Execute the business-authorized authorization attempt | Yes, only on Day 1, Day 2, Day 6, or Day 16 |
Critical safeguard |
Store a durable payment-attempt record keyed by subscription or billing cycle and attempt number. Before sending an authorization to the processor, verify that the planned attempt matches the authorized Zahlen schedule and has not already been executed. |
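The safeguard can be sketched as a guard that must succeed before any authorization is sent. This is an illustration only: `AttemptLedger` is an in-memory stand-in for a durable store, and the mapping of attempt number 1 through 4 onto Day 1, Day 2, Day 6, and Day 16 is an assumption made for the example. In production, the check and the record must be a single atomic operation, for instance a unique constraint on the cycle and attempt number.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

AUTHORIZED_DAYS = (1, 2, 6, 16)  # fixed schedule stated in this chapter


@dataclass
class AttemptLedger:
    """In-memory stand-in for a durable payment-attempt store."""

    executed: Set[Tuple[str, int]] = field(default_factory=set)

    def claim(self, billing_cycle_id: str, attempt_number: int, schedule_day: int) -> bool:
        """Return True only if this attempt is on schedule and not yet executed."""
        if not 1 <= attempt_number <= len(AUTHORIZED_DAYS):
            return False  # no fifth attempt exists in the schedule
        if AUTHORIZED_DAYS[attempt_number - 1] != schedule_day:
            return False  # planned day does not match the authorized schedule
        key = (billing_cycle_id, attempt_number)
        if key in self.executed:
            return False  # a technical retry must not become a second charge
        self.executed.add(key)
        return True
```

A worker that is re-run after a crash calls `claim` again, receives `False`, and skips the processor call, so a technical retry cannot produce an extra authorization.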
Monitoring the end-to-end workflow
Monitor the complete commercial workflow, not only API uptime. A passing health check does not prove that decisioning, outcomes, investigation runs, or webhooks are functioning correctly.
Signal | Why it matters | Example alert condition |
Authentication failure rate | Detects revoked, expired, or misconfigured keys | Sustained increase in 401 responses |
Authorization failure rate | Detects plan, role, or route policy problems | Unexpected increase in 403 responses |
429 rate | Shows rate-limit or quota pressure | Repeated throttling above normal baseline |
Request latency | Detects slow dependencies or policy evaluation | High p95 or p99 latency |
5xx / 503 rate | Shows service or dependency failure | Error ratio exceeds agreed threshold |
Decision completion | Confirms event-to-decision flow | Events remain without decisions beyond expected time |
Outcome reporting lag | Shows learning-loop interruption | Decision executed but outcome not reported |
Webhook failure rate | Detects unavailable callbacks or consumer errors | Delivery failures or retry backlog |
Investigation-run backlog | Shows ingestion or bridge pressure | Runs remain non-terminal or unpopulated |
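A few of the alert conditions in the table can be evaluated from a window of request samples, as sketched below. The function names and every threshold are hypothetical; real thresholds should come from the agreed baseline for the integration.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_alerts(
    status_codes: Sequence[int],
    latencies_ms: Sequence[float],
    max_401_ratio: float = 0.02,  # hypothetical threshold
    max_5xx_ratio: float = 0.01,  # hypothetical threshold
    max_p95_ms: float = 500.0,    # hypothetical threshold
) -> List[str]:
    """Return the names of alert conditions breached in this sample window."""
    alerts: List[str] = []
    total = len(status_codes)
    if total:
        if sum(1 for s in status_codes if s == 401) / total > max_401_ratio:
            alerts.append("authentication_failure_rate")
        if sum(1 for s in status_codes if 500 <= s <= 599) / total > max_5xx_ratio:
            alerts.append("server_error_rate")
    if latencies_ms and percentile(latencies_ms, 95) > max_p95_ms:
        alerts.append("p95_latency")
    return alerts
```

Workflow signals such as decision completion and outcome reporting lag need state from the merchant's own records and are not covered by this request-level sketch.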
Correlation and structured logging
Production logs should make it possible to trace one payment event across merchant systems and Zahlen without exposing credentials or prohibited payment data.
Identifier | Use in logs |
event_id | Merchant-created correlation for one payment event |
payment_event_batch_id / batch_id | Groups events submitted together |
upload_job_id | Connects ingestion with background processing and investigation |
request_id | Correlates an API request with support and audit records |
decision_id | Identifies the retry recommendation |
outcome_id | Identifies the reported execution result |
subscription_id / billing_cycle_id | Connects activity to the merchant billing workflow |
safe key ID or fingerprint | Identifies credential use without exposing the secret |
{
  "event": "zahlen_api_request",
  "endpoint": "/v1/_next/retry-decision",
  "method": "POST",
  "status_code": 200,
  "latency_ms": 162,
  "event_id": "evt_20260616_0001",
  "request_id": "req_example",
  "decision_id": "dec_example",
  "idempotent_replay": false
}
Sensitive-data rule |
Do not log API keys, full card numbers, CVV values, passwords, raw bank credentials, or unrestricted request bodies. Prefer allow-listed structured fields and merchant-side tokens. |
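An allow-list can be enforced at the point where log records are built, so that unexpected fields are dropped rather than written. This is a minimal sketch: `ALLOWED_FIELDS` and `build_log_record` are hypothetical names, and the field set is drawn from the identifiers listed in this chapter plus an assumed `key_fingerprint` field.

```python
import json
from typing import Any, Dict

# Fields permitted in structured logs; everything else is discarded.
ALLOWED_FIELDS = frozenset({
    "event", "endpoint", "method", "status_code", "latency_ms",
    "event_id", "batch_id", "upload_job_id", "request_id",
    "decision_id", "outcome_id", "subscription_id", "billing_cycle_id",
    "key_fingerprint", "idempotent_replay",
})


def build_log_record(fields: Dict[str, Any]) -> str:
    """Serialize only allow-listed fields as a JSON log line."""
    safe = {name: value for name, value in fields.items() if name in ALLOWED_FIELDS}
    return json.dumps(safe, sort_keys=True)
```

Dropping unknown fields by default means a newly added request attribute cannot leak into logs until someone deliberately adds it to the allow-list.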
Health checks and synthetic tests
Use layered checks. A connectivity probe verifies that the service responds; an authenticated synthetic test verifies that the commercial integration works. Synthetic tests must use approved test tenants, test credentials, and non-production payment evidence.
Check | What it proves |
GET /v1/health | Service is reachable and returning health metadata |
GET /v1/version | Deployed API and application version can be identified |
Authenticated read | Key resolution and tenant-scoped authorization work |
Synthetic event ingestion | Schema validation and ingestion path work |
Synthetic decision request | Decision contract and policy evaluation work |
Synthetic outcome report | Learning-loop write path works |
Webhook test delivery | Callback, verification, deduplication, and processing work |
Run smoke tests after deployment, configuration change, key rotation, and dependency maintenance.
Use unique test identifiers so synthetic activity can be filtered from business reporting.
Alert when a test fails repeatedly, but avoid aggressive probes that create rate-limit pressure.
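Layered checks are most useful when they run in order and report the first layer that fails. The sketch below shows that ordering only; `first_failing_layer` is a hypothetical helper, and the probe callables are placeholders for real requests made with approved test tenants and test credentials.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first failure, or None.

    A probe that raises is treated as a failed check, so an unreachable
    dependency is reported instead of crashing the smoke test.
    """
    for name, probe in checks:
        try:
            passed = probe()
        except Exception:
            passed = False
        if not passed:
            return name
    return None
```

Ordering the list from connectivity to authenticated reads to synthetic writes means a failure name points directly at the lowest broken layer, and later probes are skipped so a failing system is not loaded further.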
Deployment and change management
Validate request models against the current /v1 contract or discovery metadata.
Run unit, integration, contract, and tenant-isolation tests.
Deploy to staging with staging-only keys and data stores.
Run synthetic event, decision, outcome, and webhook tests.
Deploy gradually to production and observe error rate, latency, and throughput.
Verify the fixed Day 1, Day 2, Day 6, and Day 16 payment schedule in production configuration.
Record the release, configuration version, and verification evidence.
Rollback readiness
Keep the previous application version and configuration available for rollback.
Do not roll back databases or durable events without a tested migration strategy.
Preserve idempotency and identifiers across deployment boundaries.
After rollback, reconcile uncertain writes and verify outcome-reporting continuity.
Production-readiness checklist
Area | Ready when |
Credentials | Keys are secret-managed, environment-specific, rotatable, and auditable. |
Tenant safety | No client-supplied tenant ID is trusted as the ownership boundary. |
Idempotency | Repeatable POST operations use stable logical keys and reconciliation. |
Retry policy | 400/401/403/404/422 are not blindly retried; 429/5xx use bounded backoff. |
Payment schedule | Processor attempts are restricted to Day 1, Day 2, Day 6, and Day 16. |
Timeouts | Every external request has connect and read timeouts. |
Monitoring | Authentication, 429, 5xx, latency, decisions, outcomes, webhooks, and runs are observed. |
Logging | Correlation IDs are captured without logging secrets or prohibited card data. |
Testing | Key rotation, throttling, timeouts, duplicate delivery, and partial failure have been tested. |
Operations | Runbooks, owners, escalation paths, and rollback procedures are documented. |
Go-live rule |
Do not launch solely because a happy-path request succeeded. Go live only after failure behavior, security controls, observability, and operator response have been demonstrated in a production-like environment. |
Practical incident playbooks
Authentication failures spike
Check recent deployments and secret-manager changes.
Identify affected key IDs and environments without exposing secret material.
Compare 401 failures by endpoint, tenant, and application instance.
Rotate or replace misconfigured or compromised keys.
Review audit activity and confirm recovery after deployment.
429 responses increase
Confirm whether the limit is short-window rate limiting or longer-window quota exhaustion.
Check for loops, duplicate jobs, or unexpected traffic growth.
Verify that clients honor Retry-After and use jitter.
Reduce concurrency or batch safely where appropriate.
Review plan and quota settings only after abnormal traffic has been ruled out.
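When verifying that clients honor Retry-After, note that HTTP permits the header to carry either a number of seconds or an HTTP date. The parser below handles both forms using the Python standard library; the function name is hypothetical, and whether Zahlen sends one form or the other is not stated in this chapter.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header value to a non-negative delay in seconds.

    Returns None when the header is absent or cannot be parsed, so the
    caller can fall back to its own backoff schedule.
    """
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```

Adding jitter on top of the returned delay keeps many throttled clients from resuming at the same instant.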
Outcome reporting falls behind
Compare completed processor attempts with reported outcome IDs.
Check queue depth, worker health, and failed outcome requests.
Replay only with stable identifiers and idempotency.
Reconcile missing records without creating another payment attempt.
Confirm the learning loop has resumed and backlog is decreasing.
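The first step of this playbook is a set comparison between what the processor completed and what was reported. The sketch below assumes each completed attempt is stored with the `decision_id` that triggered it; the function name and the attempt identifiers are hypothetical.

```python
from typing import Dict, Iterable, List


def attempts_missing_outcomes(
    completed_attempts: Dict[str, str],
    reported_decision_ids: Iterable[str],
) -> List[str]:
    """Return attempt IDs whose decision has no reported outcome.

    completed_attempts maps a merchant attempt ID to its decision_id.
    The result is sorted so repeated runs produce a stable work list.
    """
    reported = set(reported_decision_ids)
    return sorted(
        attempt_id
        for attempt_id, decision_id in completed_attempts.items()
        if decision_id not in reported
    )
```

The output is a list of records to report, not a list of payments to repeat. Each replayed outcome report should reuse its original identifiers and idempotency key.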
Chapter summary
Rotate keys with overlap, verification, revocation, and audit evidence.
Classify failures before retrying and use stable idempotency for uncertain writes.
Use bounded exponential backoff with jitter for 429 and transient server failures.
Keep technical retries separate from the fixed Day 1, Day 2, Day 6, and Day 16 payment schedule.
Monitor authentication, throttling, latency, decisions, outcomes, webhooks, and investigation runs.
Log durable correlation identifiers while minimizing sensitive data.
Treat failure testing, runbooks, and rollback readiness as launch requirements.
Next step |
Use Appendix A for the OpenAPI specification, Appendix B for complete JSON examples, and Appendix C for troubleshooting guidance. |