ZAHLEN

API User Guide

Chapter 16 - Production Best Practices

Key rotation | Retry strategies | Monitoring


Audience

Merchants, developers, and integration engineers responsible for deploying, operating, and supporting a production Zahlen integration.


Version 1.0 | Source baseline: zahlen_deploy_0616A.tar.gz | June 2026


Learning objectives

By the end of this chapter, you should be able to rotate API keys without downtime, implement safe network and application retries, monitor the complete Zahlen workflow, and define a practical production-readiness process.

A correct API integration is not automatically a production-ready integration. Production systems must assume credentials will be rotated, networks will fail, requests may be repeated, dependencies may slow down, and operators will need enough evidence to understand what happened.


Canonical payment schedule

Zahlen payment retries follow the fixed Day 1, Day 2, Day 6, and Day 16 schedule. HTTP retries, queue retries, webhook retries, and worker retries are technical recovery mechanisms; they must never create extra payment authorization attempts outside that schedule.


    1. Production operating principles

      • Fail closed when authentication or tenant context cannot be resolved.

      • Use stable identifiers and idempotency keys for every repeatable write operation.

      • Separate payment retry scheduling from API transport retry behavior.

      • Collect enough telemetry to correlate merchant activity with Zahlen request, decision, batch, job, and outcome identifiers.

      • Test key rotation, throttling, partial failures, and dependency outages before launch.

    2. Key rotation

      API keys are production credentials. Rotation should be planned as a normal operational process, not reserved only for emergencies. A safe rotation uses a controlled overlap period so the replacement key can be deployed and verified before the previous key is revoked.

      Recommended rotation procedure

      1. Create a replacement key for the correct tenant, merchant, environment, and service.

      2. Store the new key in the approved secret manager. Do not place it in source control, a ticket, or a chat message.

      3. Deploy the new key to one canary instance or a small percentage of traffic.

      4. Verify authenticated health, payment-event, decision, and outcome requests with the new key.

      5. Roll the new key out to all application instances, workers, and scheduled jobs.

      6. Confirm that traffic using the old key has stopped.

      7. Revoke the old key and monitor for rejected use attempts.

      8. Record the rotation event, key identifier, operator, time, and validation evidence in the audit trail.


        Do not rotate by overwrite alone

        Replacing a secret everywhere at once without overlap can create an avoidable outage. Use dual-key overlap when the administrative policy permits it, then revoke the retired credential after verification.
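
      The overlap-then-revoke sequence can be sketched as a small gate that refuses revocation until the replacement key is verified and the old key has gone quiet. This is an illustrative sketch only: `RotationState`, its fields, and the 24-hour quiet period are hypothetical and are not part of the Zahlen API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class RotationState:
    """Hypothetical record of one key rotation in progress."""
    new_key_verified: bool                  # step 4: checks passed with the new key
    rollout_complete: bool                  # step 5: every instance uses the new key
    old_key_last_used: Optional[datetime]   # last request observed with the old key


def can_revoke_old_key(state: RotationState, now: datetime,
                       quiet_period: timedelta = timedelta(hours=24)) -> bool:
    """Return True only when revoking the old key should not cause an outage."""
    if not (state.new_key_verified and state.rollout_complete):
        return False
    if state.old_key_last_used is None:
        return True  # the old key was never observed in traffic
    # Step 6: require a quiet period with no traffic on the old key.
    return now - state.old_key_last_used >= quiet_period


# Example: the old key was last seen 30 hours ago, so revocation may proceed.
now = datetime(2026, 6, 16, 12, 0, tzinfo=timezone.utc)
state = RotationState(True, True, now - timedelta(hours=30))
print(can_revoke_old_key(state, now))
```

      The quiet period is a judgment call; it should be at least as long as the interval of the least frequent scheduled job that uses the key.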

    3. Key storage and exposure prevention


      Control | Production expectation
      Secret storage | Use a managed secret store or protected runtime injection mechanism.
      Application logs | Never log the complete key; log only a safe key ID or fingerprint.
      Client location | Keep production keys in server-side systems, not browsers or mobile applications.
      Environment separation | Use different keys for development, staging, production, and independent services.
      Access scope | Grant secret access only to the workloads and operators that require it.
      Revocation | Revoke immediately after confirmed compromise or unauthorized disclosure.
      Audit | Retain creation, rotation, revocation, and failed-use evidence.
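
      One way to obtain a safe fingerprint for logs is to hash the key and keep a short prefix of the digest. The sketch below is an assumption about how a merchant might do this, not a Zahlen-defined format; the function name and the 12-character length are hypothetical. If Zahlen issues a key ID alongside the secret, logging that ID is simpler.

```python
import hashlib


def key_fingerprint(api_key: str, length: int = 12) -> str:
    """Return a short, non-reversible identifier for an API key.

    The fingerprint is a prefix of the SHA-256 digest, so the same key
    always maps to the same value and the secret itself never reaches logs.
    """
    digest = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
    return digest[:length]


# Example with a made-up key value.
print(key_fingerprint("zk_test_example"))
```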


      Compromise response

      • Identify the affected key without redistributing the secret.

      • Revoke or disable the key according to incident policy.

      • Issue and deploy a replacement credential through the controlled rotation process.

      • Review API activity by tenant, endpoint, time, source, and status code.

      • Investigate unexpected outcome reporting, event ingestion, or decision traffic.

    4. Retry strategy: classify before retrying

      The client must determine whether a failure is safe to repeat. A retry decision should consider the HTTP method, status code, idempotency behavior, whether the server may have completed the operation, and the maximum retry budget.


      Condition | Automatic retry? | Recommended behavior
      GET with timeout or transient 5xx | Usually | Retry with bounded exponential backoff and jitter.
      POST with stable Idempotency-Key | Carefully | Repeat the same logical request with the same key.
      POST without idempotency guarantee | No automatic blind retry | Reconcile using stable identifiers before repeating.
      400 or 422 | No | Correct the payload; do not retry unchanged.
      401 or 403 | No | Repair authentication or authorization.
      404 | Usually no | Verify tenant-scoped identifier and endpoint.
      409 | Reconcile | Compare the original request and idempotency key.
      429 | Yes, later | Honor Retry-After when present and apply jitter.
      500 or 503 | Carefully | Retry within a bounded budget using idempotency.


      Uncertain result

      A timeout does not prove that the server did nothing. For a write request, treat the result as uncertain until you reconcile it using the idempotency key or durable business identifier.
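
      The classification rules in this section can be expressed as a single function that runs before any retry is attempted. The following is a simplified sketch: the `RetryAction` names are hypothetical, and treating 502 and 504 as transient alongside 500 and 503 is an assumption that should be checked against the actual API contract.

```python
from enum import Enum
from typing import Optional


class RetryAction(Enum):
    RETRY = "retry"                # safe to repeat with bounded backoff
    RECONCILE = "reconcile"        # look up the original result before repeating
    DO_NOT_RETRY = "do_not_retry"  # fix the request or credentials instead


# 500 and 503 come from the chapter; 502 and 504 are an assumption.
TRANSIENT_SERVER_ERRORS = {500, 502, 503, 504}


def classify(method: str, status: Optional[int],
             has_idempotency_key: bool) -> RetryAction:
    """Map one failed request to an action.

    status is None when the request timed out or the connection failed,
    which means the server may or may not have completed the operation.
    """
    safe_to_repeat = method.upper() == "GET" or has_idempotency_key
    if status is None or status in TRANSIENT_SERVER_ERRORS:
        return RetryAction.RETRY if safe_to_repeat else RetryAction.RECONCILE
    if status == 429:
        return RetryAction.RETRY   # the request was rejected, not processed
    if status == 409:
        return RetryAction.RECONCILE
    return RetryAction.DO_NOT_RETRY  # 400, 401, 403, 404, 422, and anything else
```

      A timed-out POST without an idempotency key maps to RECONCILE rather than RETRY, which is the uncertain-result case described above.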

    5. Exponential backoff with jitter

      Backoff reduces pressure on a recovering service. Jitter prevents many clients from retrying at the same instant. Always cap both the delay and the total number of attempts.

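      import random
      import time

      BASE_DELAY = 1.0    # seconds before the first retry
      MAX_DELAY = 30.0    # cap on the backoff delay, before jitter
      MAX_ATTEMPTS = 5    # cap on the total number of retries

      for retry_number in range(MAX_ATTEMPTS):
          # Exponential backoff: 1, 2, 4, 8, 16 seconds, capped at MAX_DELAY.
          delay = min(MAX_DELAY, BASE_DELAY * (2 ** retry_number))
          # Jitter of +/-25% so that clients do not retry in lockstep.
          delay *= random.uniform(0.75, 1.25)
          time.sleep(delay)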
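
      The delay calculation becomes more useful when it wraps a real call and stops as soon as the call succeeds. The sketch below is one possible shape, not a Zahlen client: `call_with_backoff`, the `Attempt` callable, and the injectable `sleep` parameter are hypothetical, and the caller is assumed to send the same idempotency key on every attempt.

```python
import random
import time
from typing import Callable, Optional, Tuple

# One attempt returns (status_code, retry_after_seconds or None).
Attempt = Callable[[], Tuple[int, Optional[float]]]

RETRYABLE = (429, 500, 503)


def call_with_backoff(attempt: Attempt,
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call attempt() until it returns a non-retryable status or the budget is spent."""
    status = 0
    for attempt_number in range(max_attempts):
        status, retry_after = attempt()
        if status not in RETRYABLE:
            return status
        if attempt_number == max_attempts - 1:
            break  # budget exhausted; do not sleep after the final attempt
        # Capped exponential backoff with +/-25% jitter.
        delay = min(max_delay, base_delay * (2 ** attempt_number))
        delay *= random.uniform(0.75, 1.25)
        if retry_after is not None:
            # Never retry sooner than the server asked; add a little jitter on top.
            delay = max(delay, retry_after + random.uniform(0.0, 1.0))
        sleep(delay)
    return status
```

      Passing `sleep` as a parameter keeps the function testable without real waiting. The final status is returned even when the budget runs out, so the caller can decide whether to reconcile or escalate.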

Retry budget guidance

    6. Technical retries versus payment retries

      A production integration has several different retry layers. They must remain separate so a network recovery does not accidentally become another charge attempt.


      Retry layer | Purpose | May create a new payment attempt?
      HTTP request retry | Recover from network or server failure | No, not by itself
      Queue delivery retry | Redeliver internal work | No, not by itself
      Webhook delivery retry | Redeliver notification payload | No
      Worker retry | Re-run a failed technical task | No, unless the task is explicitly authorized and idempotent
      Zahlen payment retry schedule | Execute the business-authorized authorization attempt | Yes, only on Day 1, Day 2, Day 6, or Day 16


      Critical safeguard

      Store a durable payment-attempt record keyed by subscription or billing cycle and attempt number. Before sending an authorization to the processor, verify that the planned attempt matches the authorized Zahlen schedule and has not already been executed.
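
      The safeguard can be sketched with a table whose primary key makes a duplicate attempt impossible to record. This is a minimal illustration using SQLite: the table name, column names, and function names are hypothetical, and the mapping of attempt numbers 1 to 4 onto Day 1, Day 2, Day 6, and Day 16 is an assumption about how a merchant numbers attempts.

```python
import sqlite3

# Attempt number -> schedule day. The days come from the canonical schedule;
# numbering the attempts 1..4 is an assumption made for this sketch.
SCHEDULE_DAYS = {1: 1, 2: 2, 3: 6, 4: 16}


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Open the attempt ledger. Use a file path or a real database in production."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payment_attempts ("
        " billing_cycle_id TEXT NOT NULL,"
        " attempt_number INTEGER NOT NULL,"
        " schedule_day INTEGER NOT NULL,"
        " PRIMARY KEY (billing_cycle_id, attempt_number))"
    )
    return conn


def claim_attempt(conn: sqlite3.Connection, billing_cycle_id: str,
                  attempt_number: int, schedule_day: int) -> bool:
    """Return True only if the attempt is on schedule and not yet claimed."""
    if SCHEDULE_DAYS.get(attempt_number) != schedule_day:
        return False  # off-schedule: never send this to the processor
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payment_attempts VALUES (?, ?, ?)",
                (billing_cycle_id, attempt_number, schedule_day),
            )
    except sqlite3.IntegrityError:
        return False  # this attempt was already claimed by an earlier run
    return True
```

      A worker would call `claim_attempt` first and send the authorization only when it returns True, so a redelivered queue message or a re-run job cannot produce a second charge for the same attempt.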

    7. Monitoring the end-to-end workflow

      Monitor the complete commercial workflow, not only API uptime. A passing health check does not prove that decisioning, outcomes, investigation runs, or webhooks are functioning correctly.


      Signal | Why it matters | Example alert condition
      Authentication failure rate | Detects revoked, expired, or misconfigured keys | Sustained increase in 401 responses
      Authorization failure rate | Detects plan, role, or route policy problems | Unexpected increase in 403 responses
      429 rate | Shows rate-limit or quota pressure | Repeated throttling above normal baseline
      Request latency | Detects slow dependencies or policy evaluation | High p95 or p99 latency
      5xx / 503 rate | Shows service or dependency failure | Error ratio exceeds agreed threshold
      Decision completion | Confirms event-to-decision flow | Events remain without decisions beyond expected time
      Outcome reporting lag | Shows learning-loop interruption | Decision executed but outcome not reported
      Webhook failure rate | Detects unavailable callbacks or consumer errors | Delivery failures or retry backlog
      Investigation-run backlog | Shows ingestion or bridge pressure | Runs remain non-terminal or unpopulated
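
      Two of these signals, the server error ratio and tail latency, can be computed directly from a window of recent requests. The sketch below assumes the merchant already collects status codes and latencies; the function names and the 5% and 1000 ms thresholds are hypothetical placeholders for whatever thresholds the team agrees on.

```python
import math
from typing import Sequence


def error_ratio(status_codes: Sequence[int]) -> float:
    """Fraction of responses in the window that were 5xx."""
    if not status_codes:
        return 0.0
    return sum(1 for status in status_codes if status >= 500) / len(status_codes)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("percentile requires at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_alert(status_codes: Sequence[int], latencies_ms: Sequence[float],
                 max_error_ratio: float = 0.05,
                 max_p95_ms: float = 1000.0) -> bool:
    """Alert when either the error ratio or the p95 latency exceeds its threshold."""
    return (error_ratio(status_codes) > max_error_ratio
            or percentile(latencies_ms, 95) > max_p95_ms)
```

      In practice these calculations usually live in the monitoring platform rather than application code; the sketch only makes the alert conditions concrete.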

    8. Correlation and structured logging

      Production logs should make it possible to trace one payment event across merchant systems and Zahlen without exposing credentials or prohibited payment data.


      Identifier | Use in logs
      event_id | Merchant-created correlation for one payment event
      payment_event_batch_id / batch_id | Groups events submitted together
      upload_job_id | Connects ingestion with background processing and investigation
      request_id | Correlates an API request with support and audit records
      decision_id | Identifies the retry recommendation
      outcome_id | Identifies the reported execution result
      subscription_id / billing_cycle_id | Connects activity to the merchant billing workflow
      safe key ID or fingerprint | Identifies credential use without exposing the secret


      {
        "event": "zahlen_api_request",
        "endpoint": "/v1/_next/retry-decision",
        "method": "POST",
        "status_code": 200,
        "latency_ms": 162,
        "event_id": "evt_20260616_0001",
        "request_id": "req_example",
        "decision_id": "dec_example",
        "idempotent_replay": false
      }


Sensitive-data rule

Do not log API keys, full card numbers, CVV values, passwords, raw bank credentials, or unrestricted request bodies. Prefer allow-listed structured fields and merchant-side tokens.
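
An allow-list is straightforward to enforce at the point where log records are built: any field that is not explicitly permitted is dropped before serialization. The sketch below is illustrative; `build_log_line` and the `key_fingerprint` field name are hypothetical, while the other field names follow the identifiers used in this chapter.

```python
import json

# Only these fields may appear in a log record. "key_fingerprint" is a
# hypothetical field name for the safe key ID or fingerprint.
ALLOWED_FIELDS = {
    "event", "endpoint", "method", "status_code", "latency_ms",
    "event_id", "request_id", "decision_id", "outcome_id",
    "batch_id", "upload_job_id", "subscription_id", "billing_cycle_id",
    "key_fingerprint", "idempotent_replay",
}


def build_log_line(fields: dict) -> str:
    """Serialize only allow-listed fields; everything else is silently dropped."""
    safe = {name: value for name, value in fields.items() if name in ALLOWED_FIELDS}
    return json.dumps(safe, sort_keys=True)


# Example: the api_key and card_number fields never reach the log.
print(build_log_line({
    "event": "zahlen_api_request",
    "status_code": 200,
    "api_key": "made-up-secret",
    "card_number": "0000000000000000",
}))
```

Dropping unknown fields by default means a newly added request attribute stays out of the logs until someone deliberately adds it to the list.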

    9. Health checks and synthetic tests

      Use layered checks. A connectivity probe verifies that the service responds; an authenticated synthetic test verifies that the commercial integration works. Synthetic tests must use approved test tenants, test credentials, and non-production payment evidence.


      Check | What it proves
      GET /v1/health | Service is reachable and returning health metadata
      GET /v1/version | Deployed API and application version can be identified
      Authenticated read | Key resolution and tenant-scoped authorization work
      Synthetic event ingestion | Schema validation and ingestion path work
      Synthetic decision request | Decision contract and policy evaluation work
      Synthetic outcome report | Learning-loop write path works
      Webhook test delivery | Callback, verification, deduplication, and processing work


      • Run smoke tests after deployment, configuration change, key rotation, and dependency maintenance.

      • Use unique test identifiers so synthetic activity can be filtered from business reporting.

      • Alert when a test fails repeatedly, but avoid aggressive probes that create rate-limit pressure.
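
      Layered checks can be run in order by a small driver that stops at the first failure, so a broken connectivity probe does not trigger a burst of deeper requests. This is a sketch under assumptions: `run_smoke_checks` and the `send` callable are hypothetical, and only the two paths named in this section are included because the paths for the authenticated and synthetic checks are not given here.

```python
from typing import Callable, List, Sequence, Tuple

# (check name, HTTP method, path). Both paths appear in this section.
CHECKS: Sequence[Tuple[str, str, str]] = (
    ("health", "GET", "/v1/health"),
    ("version", "GET", "/v1/version"),
)


def run_smoke_checks(send: Callable[[str, str], int],
                     checks: Sequence[Tuple[str, str, str]] = CHECKS
                     ) -> List[Tuple[str, bool]]:
    """Run checks in order and stop at the first failure.

    send(method, path) performs the request and returns the HTTP status
    code; it is supplied by the caller so that the test tenant, test
    credentials, and timeouts are configured in one place.
    """
    results: List[Tuple[str, bool]] = []
    for name, method, path in checks:
        ok = 200 <= send(method, path) < 300
        results.append((name, ok))
        if not ok:
            break  # deeper checks would only add load to a failing service
    return results
```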

    10. Deployment and change management

      1. Validate request models against the current /v1 contract or discovery metadata.

      2. Run unit, integration, contract, and tenant-isolation tests.

      3. Deploy to staging with staging-only keys and data stores.

      4. Run synthetic event, decision, outcome, and webhook tests.

      5. Deploy gradually to production and observe error rate, latency, and throughput.

      6. Verify the fixed Day 1, Day 2, Day 6, and Day 16 payment schedule in production configuration.

      7. Record the release, configuration version, and verification evidence.

        Rollback readiness

        • Keep the previous application version and configuration available for rollback.

        • Do not roll back databases or durable events without a tested migration strategy.

        • Preserve idempotency and identifiers across deployment boundaries.

        • After rollback, reconcile uncertain writes and verify outcome-reporting continuity.

    11. Production-readiness checklist


      Area | Ready when
      Credentials | Keys are secret-managed, environment-specific, rotatable, and auditable.
      Tenant safety | No client-supplied tenant ID is trusted as the ownership boundary.
      Idempotency | Repeatable POST operations use stable logical keys and reconciliation.
      Retry policy | 400/401/403/404/422 are not blindly retried; 429/5xx use bounded backoff.
      Payment schedule | Processor attempts are restricted to Day 1, Day 2, Day 6, and Day 16.
      Timeouts | Every external request has connect and read timeouts.
      Monitoring | Authentication, 429, 5xx, latency, decisions, outcomes, webhooks, and runs are observed.
      Logging | Correlation IDs are captured without logging secrets or prohibited card data.
      Testing | Key rotation, throttling, timeouts, duplicate delivery, and partial failure have been tested.
      Operations | Runbooks, owners, escalation paths, and rollback procedures are documented.


      Go-live rule

      Do not launch solely because a happy-path request succeeded. Go live only after failure behavior, security controls, observability, and operator response have been demonstrated in a production-like environment.

    12. Practical incident playbooks

      Authentication failures spike

      1. Check recent deployments and secret-manager changes.

      2. Identify affected key IDs and environments without exposing secret material.

      3. Compare 401 failures by endpoint, tenant, and application instance.

      4. Rotate or replace misconfigured or compromised keys.

      5. Review audit activity and confirm recovery after deployment.

      429 responses increase

      1. Confirm whether the limit is short-window rate limiting or longer-window quota exhaustion.

      2. Check for loops, duplicate jobs, or unexpected traffic growth.

      3. Verify that clients honor Retry-After and use jitter.

      4. Reduce concurrency or batch safely where appropriate.

      5. Review plan and quota settings only after abnormal traffic has been ruled out.

      Outcome reporting falls behind

      1. Compare completed processor attempts with reported outcome IDs.

      2. Check queue depth, worker health, and failed outcome requests.

      3. Replay only with stable identifiers and idempotency.

      4. Reconcile missing records without creating another payment attempt.

      5. Confirm the learning loop has resumed and backlog is decreasing.
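
      The comparison step in the outcome-reporting playbook is a set difference, and a safe replay depends on deriving the same idempotency key every time. The sketch below illustrates both ideas; the function names and the key format are hypothetical, and whether the outcome endpoint accepts an idempotency key in this form is an assumption to verify against the API contract.

```python
from typing import Iterable, List


def find_unreported_attempts(completed_attempt_ids: Iterable[str],
                             reported_attempt_ids: Iterable[str]) -> List[str]:
    """Return processor attempts that have no reported outcome, in sorted order."""
    return sorted(set(completed_attempt_ids) - set(reported_attempt_ids))


def outcome_idempotency_key(attempt_id: str) -> str:
    """Derive the same key on every replay so a repeated report is deduplicated."""
    return f"outcome-report:{attempt_id}"


# Example: two of three completed attempts still need an outcome report.
for attempt_id in find_unreported_attempts(["a1", "a2", "a3"], ["a2"]):
    print(attempt_id, outcome_idempotency_key(attempt_id))
```

      The replay reports only what already happened at the processor; it never sends a new authorization.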

    13. Chapter summary

      • Rotate keys with overlap, verification, revocation, and audit evidence.

      • Classify failures before retrying and use stable idempotency for uncertain writes.

      • Use bounded exponential backoff with jitter for 429 and transient server failures.

      • Keep technical retries separate from the fixed Day 1, Day 2, Day 6, and Day 16 payment schedule.

      • Monitor authentication, throttling, latency, decisions, outcomes, webhooks, and investigation runs.

      • Log durable correlation identifiers while minimizing sensitive data.

      • Treat failure testing, runbooks, and rollback readiness as launch requirements.


Next step

Use Appendix A for the OpenAPI specification, Appendix B for complete JSON examples, and Appendix C for troubleshooting guidance.