ZAHLEN

Appendix C - Troubleshooting

Zahlen API User Guide

For merchants, developers, and integration engineers


Version 1.0 | Source baseline: zahlen_deploy_0616A.tar.gz | June 2026


Commercial developer experience | Tenant-safe operations | Explainable retry intelligence



Purpose

This appendix provides a repeatable diagnostic method for common Zahlen API integration failures. Begin with evidence, preserve identifiers, and change only one variable at a time. Do not use troubleshooting as a reason to bypass authentication, tenant isolation, idempotency, quotas, or the fixed payment retry schedule.

The fastest path to resolution is usually not a larger retry loop. It is a precise answer to four questions: Which operation failed? What did the server return? Which durable identifiers were created? Did the failure occur before or after Zahlen accepted the logical operation?

    1. Before You Troubleshoot

      • Record the environment hostname and the exact route, method, and UTC timestamp.

      • Capture the HTTP status, response body, response headers, and any request or correlation identifier.

      • Identify the API key by safe key ID or fingerprint; never paste the secret into tickets or chat.

      • Preserve merchant-side event IDs, idempotency keys, batch IDs, upload job IDs, decision IDs, and outcome IDs.

      • Confirm whether the client retried and whether each retry reused the same logical identifiers.

      • Redact payment tokens, customer identifiers, and sensitive metadata before sharing evidence.


      Payment schedule boundary

      HTTP retries, polling, webhook redelivery, and administrative remediation must never create extra payment authorizations. Zahlen payment attempts follow the fixed schedule: Day 1, Day 2, Day 6, and Day 16.
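One way to keep this boundary visible in client code is a small guard that rejects any attempt day outside the fixed schedule. The sketch below is illustrative only; `ALLOWED_ATTEMPT_DAYS`, `is_scheduled_attempt_day`, and `next_scheduled_day` are hypothetical names, not part of the Zahlen API.

```python
# Fixed payment schedule stated in this guide: Day 1, Day 2, Day 6, Day 16.
ALLOWED_ATTEMPT_DAYS = (1, 2, 6, 16)


def is_scheduled_attempt_day(day: int) -> bool:
    """Return True only when `day` is one of the fixed schedule days."""
    return day in ALLOWED_ATTEMPT_DAYS


def next_scheduled_day(after_day: int):
    """Return the next schedule day strictly after `after_day`, or None."""
    for day in ALLOWED_ATTEMPT_DAYS:
        if day > after_day:
            return day
    return None  # The schedule is exhausted; no further attempt exists.
```

A client could call `is_scheduled_attempt_day` before any code path that might trigger an authorization, so that HTTP retries and webhook redeliveries cannot introduce an unscheduled attempt.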


    2. Five-Minute Triage


      | Step | Question | Evidence |
      |------|----------|----------|
      | 1 | Is the API reachable? | GET /v1/health, DNS, TLS, network path |
      | 2 | Is authentication working? | X-API-Key presence, environment, key status |
      | 3 | Is the request contract valid? | Content-Type, JSON syntax, required fields, types |
      | 4 | Was the operation accepted? | HTTP status and returned durable identifiers |
      | 5 | Is downstream processing complete? | Batch status, upload_job_id, decision/outcome resources |

    3. Quick Symptom Matrix


      | Symptom | Most likely causes | First action |
      |---------|--------------------|--------------|
      | No response / timeout | DNS, TLS, firewall, proxy, service unavailability | Test /v1/health with a short timeout; preserve whether the POST result is uncertain. |
      | 401 on every request | Missing, invalid, revoked, malformed, or wrong-environment key | Verify X-API-Key injection and key status without logging the secret. |
      | 403 | Valid identity lacks permission, plan capability, role, or route access | Confirm endpoint authorization and public-versus-admin boundary. |
      | 404 | Wrong ID, wrong tenant, wrong path, or resource not yet created | Verify identifier source, tenant context, and exact route. |
      | 409 | Idempotency conflict or resource state conflict | Compare the original idempotency key and request body. |
      | 422 | Schema validation failure | Correct required fields, types, constraints, and unknown properties. |
      | 429 | Rate limit or quota enforcement | Honor Retry-After, back off, inspect traffic and quota usage. |
      | 500 / 503 | Server or dependency failure | Retry only when safe, with bounded backoff and idempotency. |
      | Decision has no retry day | Intentional WAIT/STOP/review outcome or insufficient evidence | Follow explicit decision and reason; do not invent a date. |
      | Batch appears incomplete | Asynchronous processing or pagination | Check status, counts, has_more, offset, and upload_job_id. |
      | Webhook duplicates | At-least-once delivery or provider retry | Deduplicate durably and make the consumer repeat-safe. |
      | Investigation run not visible | Admin authorization or tenant context mismatch | Verify /v1/admin access and authenticated tenant context. |


    4. Evidence Packet for Support

      • Environment and base URL (without embedded credentials).

      • HTTP method and path.

      • UTC timestamp and client timeout value.

      • Safe API key ID/fingerprint.

      • Status code, response headers, and redacted response body.

      • event_id, batch_id, upload_job_id, request_id, decision_id, outcome_id, or subscription_id as applicable.

      • Idempotency-Key and whether the body changed between attempts.

      • Minimal redacted request that reproduces the problem.

      • Expected behavior versus observed behavior.
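The evidence items above can be collected in one structure so that redaction happens before anything is shared. This is a minimal sketch under stated assumptions: the `EvidencePacket` class, its field names, and the `SENSITIVE_HEADERS` list are hypothetical and should be adapted to the actual support process.

```python
import json
from dataclasses import asdict, dataclass, field

# Header names whose values must never appear in shared evidence (assumed list).
SENSITIVE_HEADERS = {"x-api-key", "authorization", "cookie"}


def redact_headers(headers: dict) -> dict:
    """Replace sensitive header values with a fixed marker."""
    return {
        name: "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


@dataclass
class EvidencePacket:
    environment: str
    method: str
    path: str
    utc_timestamp: str
    client_timeout_s: float
    key_fingerprint: str  # safe key ID or fingerprint, never the secret
    status_code: int
    headers: dict = field(default_factory=dict)
    identifiers: dict = field(default_factory=dict)  # event_id, batch_id, ...
    expected: str = ""
    observed: str = ""

    def to_json(self) -> str:
        """Serialize the packet with sensitive header values redacted."""
        data = asdict(self)
        data["headers"] = redact_headers(self.headers)
        return json.dumps(data, indent=2, sort_keys=True)
```

Serializing through `to_json` rather than dumping the raw object keeps header redaction in a single place.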

    5. Connectivity, DNS, and TLS

      1. Health and version checks


        curl -i --connect-timeout 5 --max-time 15 \
          https://api.example.com/v1/health

        curl -i --connect-timeout 5 --max-time 15 \
          https://api.example.com/v1/version


A successful health response confirms basic reachability and service response. It does not prove that a merchant key is valid for authenticated routes.

      2. Common connectivity failures


        | Failure | Diagnostic direction |
        |---------|----------------------|
        | Could not resolve host | Verify hostname spelling, DNS records, VPN, and local resolver. |
        | Connection refused | Confirm port, reverse proxy, service state, and firewall policy. |
        | Connection timed out | Check routing, security groups, proxy egress, and service capacity. |
        | TLS certificate error | Confirm hostname matches the certificate and the trust store is current. |
        | Unexpected redirect | Use the documented HTTPS base URL; do not send credentials across unsafe redirects. |
        | HTML returned instead of JSON | Check route, proxy configuration, and whether an operator login page intercepted the request. |


        Uncertain POST result

        A client timeout does not prove the server rejected the request. Before creating a new logical operation, query by the stable event or decision identifiers, or repeat only with the same idempotency key where supported.
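The reconcile-before-resend order described above can be sketched as follows. Both callables are supplied by the caller and are hypothetical: `lookup_existing` stands for a query by stable identifiers, and `resend_with_same_key` stands for repeating the POST with the original idempotency key and body.

```python
def resolve_uncertain_post(lookup_existing, resend_with_same_key):
    """Decide what to do after a POST timed out without a response.

    `lookup_existing()` returns the already-created resource or None.
    `resend_with_same_key()` repeats the original POST unchanged.
    """
    existing = lookup_existing()
    if existing is not None:
        # The server accepted the first request; do not create a new operation.
        return ("reconciled", existing)
    # Nothing found: repeat the same logical operation, never a new one.
    return ("resent", resend_with_same_key())
```

The resend path is reached only when the lookup finds nothing, so a timeout alone never produces a second logical operation.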

    6. Authentication and Authorization

      1. Diagnosing HTTP 401

        1. Confirm the header name is exactly X-API-Key.

        2. Confirm the key is being read from the intended secret or environment variable.

        3. Check for leading/trailing whitespace, line breaks, shell quoting, or accidental truncation.

        4. Confirm the key belongs to the selected development, staging, or production environment.

        5. Confirm the key is active and has not been revoked or rotated out.


          curl -i -X POST https://api.example.com/v1/payment-events \
            -H 'Content-Type: application/json' \
            -H 'X-API-Key: zk_live_REPLACE_ME' \
            -d '{"events":[{"event_id":"evt_auth_test_001"}]}'
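The whitespace, line-break, and quoting checks from step 3 can be automated without ever printing the secret. The sketch below is an assumption-laden illustration: `key_fingerprint` and `key_hygiene_problems` are hypothetical helpers, and the truncated SHA-256 fingerprint is a local convention, not a Zahlen-defined key ID.

```python
import hashlib


def key_fingerprint(api_key: str) -> str:
    """Return a short SHA-256 fingerprint that is safe to log."""
    digest = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
    return digest[:12]


def key_hygiene_problems(api_key) -> list:
    """List formatting problems that commonly cause HTTP 401."""
    if not api_key:
        return ["key is missing or empty"]
    problems = []
    if api_key != api_key.strip():
        problems.append("leading or trailing whitespace")
    if "\n" in api_key or "\r" in api_key:
        problems.append("embedded line break")
    if api_key[0] in "'\"" or api_key[-1] in "'\"":
        problems.append("stray shell quote characters")
    return problems
```

Logging `key_fingerprint(key)` together with the problem list gives support enough to identify the credential while the secret itself stays out of tickets and chat.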


      2. Diagnosing HTTP 403

        HTTP 403 normally means the credential was recognized but the caller is not allowed to perform the operation. Review plan capability, route authorization, merchant/tenant assignment, and whether the endpoint is administrative.


        Public versus administrative access

        Do not assume that a merchant X-API-Key authorizes /v1/admin/* routes. Administrative investigation-run and governance APIs require a separately approved administrative context.


      3. Suspected key compromise

        • Revoke the affected key immediately.

        • Create a replacement through the approved lifecycle.

        • Review audit/activity by key ID, tenant, route, time, and source network.

        • Inspect quota spikes and unusual endpoint access.

        • Preserve evidence and notify authorized contacts.

    7. Validation and Request Construction

      1. HTTP 400 versus 422


        | Status | Typical meaning | Client action |
        |--------|-----------------|---------------|
        | 400 | Malformed JSON, invalid business condition, or route-specific bad request | Correct the request; do not retry unchanged. |
        | 422 | Schema validation failure | Read field-level errors and update serialization/types. |


      2. Strict models

        The main Zahlen request models forbid unknown top-level properties. A misspelled field is a client bug, not an extension point. Place only approved custom data inside the documented metadata object where available.

        {
          "events": [{
            "event_id": "evt_001",
            "amount_mnor": 2999
          }]
        }

        // Incorrect: amount_mnor is misspelled and should be amount_minor.


      3. Frequent validation causes

        • Missing required event_id for payment events.

        • Empty events array or more than 10,000 payment events.

        • More than 500 events in a legacy retry-decision batch.

        • attempt_number below the documented minimum.

        • Negative amount or latency values where a nonnegative constraint applies.

        • String values sent for integer fields such as amount_minor.

        • Unknown top-level properties.

        • callback_url outside the supported length or an empty webhook events list.

      4. Local validation practice

        Use typed request models, JSON Schema validation, and contract tests before sending traffic. Test null, omitted optional fields, maximum batch size, and one unknown-field failure.
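A client-side pre-flight check might look like the sketch below. It uses only the standard library and covers a few of the causes listed in this section. The allowed field set is an assumed subset for illustration, and treating unknown event-level properties as errors is an assumption based on the misspelled-field example; the published schema remains the authority.

```python
ALLOWED_EVENT_FIELDS = {"event_id", "amount_minor", "metadata"}  # assumed subset
MAX_EVENTS = 10_000  # payment-event ceiling stated in this guide


def validate_payment_events(payload: dict) -> list:
    """Return human-readable validation errors; an empty list means valid."""
    errors = []
    unknown_top = set(payload) - {"events"}
    if unknown_top:
        errors.append(f"unknown top-level properties: {sorted(unknown_top)}")
    events = payload.get("events")
    if not isinstance(events, list) or not events:
        errors.append("events must be a non-empty array")
        return errors
    if len(events) > MAX_EVENTS:
        errors.append(f"events exceeds {MAX_EVENTS} items")
    for index, event in enumerate(events):
        if not isinstance(event, dict):
            errors.append(f"events[{index}]: must be an object")
            continue
        unknown = set(event) - ALLOWED_EVENT_FIELDS
        if unknown:
            errors.append(f"events[{index}]: unknown properties {sorted(unknown)}")
        if not event.get("event_id"):
            errors.append(f"events[{index}]: event_id is required")
        amount = event.get("amount_minor")
        # bool is a subclass of int in Python, so reject it explicitly.
        if amount is not None and (isinstance(amount, bool) or not isinstance(amount, int)):
            errors.append(f"events[{index}]: amount_minor must be an integer")
        elif isinstance(amount, int) and amount < 0:
            errors.append(f"events[{index}]: amount_minor must be nonnegative")
    return errors
```

Run against the misspelled example in "Strict models", this check reports `amount_mnor` as an unknown property before any traffic is sent.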

    8. Resource Visibility, 404, and Tenant Isolation

      1. Why a valid ID can still return 404

        • The identifier belongs to another tenant and is intentionally not visible.

        • The client used an event ID where a batch ID was required.

        • The resource has not been created or downstream processing is incomplete.

        • The path is incorrect, including singular/plural or nested resource differences.

        • The caller used the wrong environment.


          Do not bypass tenant filters

          Cross-tenant empty or 404 results are expected security behavior. Never add tenant_id to a request body, query string, or form to force visibility. Tenant ownership is resolved from authenticated context.


      2. Identifier chain


        | Identifier | Created by | Troubleshooting use |
        |------------|------------|---------------------|
        | event_id | Merchant | Find one submitted event and avoid duplicate logical events. |
        | batch_id / payment_event_batch_id | Zahlen | Retrieve batch state, summaries, decisions, and processor results. |
        | upload_job_id | Zahlen | Correlate ingestion with asynchronous processing and investigation runs. |
        | request_id | Zahlen | Trace one decision request across logs and support. |
        | decision_id | Zahlen | Link recommendation to the observed retry outcome. |
        | outcome_id | Zahlen | Confirm the recovery-learning record was accepted. |
        | subscription_id | Zahlen | Manage and diagnose one webhook subscription. |

    9. Conflicts, Idempotency, and Duplicate Effects

      1. Diagnosing HTTP 409

        A conflict can indicate that an idempotency key was reused with a different request body, that the resource already exists in an incompatible state, or that the requested transition is not allowed.

        1. Locate the original local operation record.

        2. Compare the exact idempotency key and serialized body.

        3. Confirm whether the first request returned a durable identifier.

        4. Do not generate a new key merely to force a second effect.

        5. Reconcile the server resource before retrying.

      2. Idempotency rules

        • One logical operation gets one stable key.

        • The same logical operation reuses the same key and body.

        • A materially different operation uses a new key.

        • Store the key until the operation is terminal and reconciled.

        • Never base the key only on the HTTP attempt number.


          Idempotency-Key: order-8842-attempt-2


Payment attempt identity

The attempt number in an idempotency key identifies the scheduled payment attempt. It does not authorize additional attempts beyond Day 1, Day 2, Day 6, and Day 16.
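Comparing the key and serialized body, as the 409 diagnosis steps require, is easier when the client records a fingerprint of each body it sends. The following is a minimal sketch: `IdempotencyLedger` is a hypothetical in-memory stand-in for a durable store, and the canonical-JSON hashing is a local convention rather than the server's own comparison rule.

```python
import hashlib
import json


def body_fingerprint(body: dict) -> str:
    """Hash a canonical JSON serialization so equal bodies hash equally."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotencyLedger:
    """In-memory stand-in for a durable store of key -> body fingerprint."""

    def __init__(self):
        self._records = {}

    def check(self, key: str, body: dict) -> str:
        """Return 'new', 'replay', or 'conflict' for this key and body."""
        fingerprint = body_fingerprint(body)
        stored = self._records.get(key)
        if stored is None:
            self._records[key] = fingerprint
            return "new"
        # Same key with a different body is what produces a 409 conflict.
        return "replay" if stored == fingerprint else "conflict"
```

A `conflict` result found locally points to a client bug before the request is sent; the fix is to reconcile the original operation, not to mint a new key.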

    10. Rate Limits, Quotas, and HTTP 429

      1. Immediate client response

        1. Stop immediate repeated retries.

        2. Read Retry-After and rate/quota metadata when supplied.

        3. Use bounded exponential backoff with randomized jitter.

        4. Preserve the idempotency key for the same logical operation.

        5. Alert when throttling is sustained or increases suddenly.


          import random

          # Exponential backoff capped at max_delay, then jitter of plus or minus 25%.
          delay = min(max_delay, base_delay * (2 ** retry_number))
          delay = delay * random.uniform(0.75, 1.25)
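The formula above can be wrapped in a function that also honors Retry-After when the server supplies it. This is a sketch, not a reference implementation: `backoff_delay` and its defaults are hypothetical, and it assumes Retry-After carries a number of seconds rather than an HTTP date.

```python
import random


def backoff_delay(retry_number, base_delay=1.0, max_delay=60.0,
                  retry_after=None, rng=random):
    """Return the number of seconds to wait before the next HTTP retry."""
    delay = min(max_delay, base_delay * (2 ** retry_number))
    delay *= rng.uniform(0.75, 1.25)  # randomized jitter
    if retry_after is not None:
        try:
            # Assumption: Retry-After is delta-seconds, not an HTTP date.
            delay = max(delay, float(retry_after))
        except ValueError:
            pass  # Unparseable header: keep the computed delay.
    return delay
```

Because jitter is applied after the cap, the result can exceed `max_delay` by up to 25 percent, which matches the original formula. Clamp again after the jitter step if a hard ceiling is required.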


      2. Diagnose before increasing capacity


        | Pattern | Possible cause | Action |
        |---------|----------------|--------|
        | Single key dominates | Integration loop or compromised credential | Stop the client, rotate/revoke if needed, inspect activity. |
        | Large synchronized spike | Fleet restart or missing jitter | Spread retries and add concurrency control. |
        | Steady legitimate growth | Plan or quota no longer fits volume | Coordinate an approved capacity change. |
        | Many small requests | Inefficient integration | Batch where appropriate without exceeding schema limits. |
        | Repeated same operation | Missing idempotency/reconciliation | Fix logical retry behavior before raising limits. |


        Payment-event ingestion accepts 1 to 10,000 events per request, while the legacy retry-decision batch accepts up to 500. These are request-schema ceilings, not guaranteed throughput targets.

    11. Server Errors, Timeouts, and Safe Retries

      1. HTTP 500 and 503

        Treat 500 and 503 as potentially transient, but do not retry every POST blindly. A server may have completed the operation before a downstream failure or network interruption prevented the response from reaching the client.


        | Operation | Automatic retry guidance |
        |-----------|--------------------------|
        | GET resource | Usually safe with bounded backoff. |
        | POST retry decision | Reuse the same idempotency key and body. |
        | POST retry outcome | Reuse stable identifiers and idempotency where supported. |
        | POST payment-event batch | Use stable event IDs and reconcile batch/job identifiers. |
        | Validation failure | Do not retry until corrected. |


      2. Retry budget

        • Set connect and total request timeouts.

        • Cap retry count and total elapsed time.

        • Use exponential backoff with jitter.

        • Open a circuit or reduce concurrency during sustained dependency failure.

        • Move irrecoverable work to a durable review queue instead of looping forever.
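A retry budget that caps both the retry count and the total elapsed time can be expressed in a few lines. The class below is a hypothetical sketch; the default limits are placeholders, not Zahlen recommendations.

```python
import time


class RetryBudget:
    """Caps both the number of HTTP retries and the total elapsed time."""

    def __init__(self, max_retries=4, max_elapsed_s=60.0, clock=time.monotonic):
        self.max_retries = max_retries
        self.max_elapsed_s = max_elapsed_s
        self._clock = clock
        self._started = clock()
        self.retries_used = 0

    def allow_retry(self) -> bool:
        """Consume one retry if the budget permits; otherwise return False."""
        elapsed = self._clock() - self._started
        if self.retries_used >= self.max_retries or elapsed >= self.max_elapsed_s:
            return False  # Budget exhausted: route the work to a review queue.
        self.retries_used += 1
        return True
```

When `allow_retry` returns False, the caller would hand the operation to a durable review queue rather than keep looping. This budget governs HTTP retries only and has no bearing on the payment schedule.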


          Never confuse transport and payment retries

          Retrying an HTTP request is a technical recovery mechanism. Retrying a card authorization is a payment action governed by the Day 1, Day 2, Day 6, and Day 16 schedule.

    12. Payment Events and Batch Processing

      1. Event not found after submission

        1. Check whether the POST succeeded and returned payment_event_batch_id or batch_id.

        2. Verify event_id spelling and environment.

        3. Retrieve the batch and inspect accepted/rejected counts.

        4. Check upload_job_id and processing status.

        5. Allow for asynchronous completion before declaring loss.

      2. Batch counts do not match expectations


        | Field | Meaning |
        |-------|---------|
        | submitted | Number presented to the batch endpoint. |
        | accepted | Number accepted for processing. |
        | rejected | Number rejected at ingestion. |
        | event_count / total_events | Known events associated with the batch. |
        | returned | Number included in the current paginated response. |
        | has_more | Whether another page remains. |


        • Use an offset of 0 or greater and a limit from 1 through 1,000.

        • Continue while has_more is true.

        • Do not compare returned with total_events as if they were the same metric.

        • Inspect invalid_rows, error_count, and ingestion details when available.
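The pagination rules above can be sketched as a generator that keeps requesting pages while has_more is true. `fetch_page` is a caller-supplied function that performs the HTTP GET. The `events` key used for the page items is an assumption for illustration; check the actual response shape.

```python
def iter_batch_events(fetch_page, limit=500):
    """Yield every event in a batch by following has_more.

    `fetch_page(offset, limit)` performs the request and returns the decoded
    JSON page as a dict. The "events" item key is an assumed field name.
    """
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1,000")
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        events = page.get("events", [])
        yield from events
        # Stop on the last page, or on an empty page to avoid looping forever.
        if not page.get("has_more") or not events:
            break
        offset += len(events)
```

Counting the yielded items and comparing the total with total_events, rather than comparing a single page's returned value, avoids the metric mix-up described above.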

      3. Decision or processor result missing

        A payment event, its retry decision, and its processor result are separate resources and may become available at different times. Check batch status, decision source, processor-results source, and upload job processing before escalating.

    13. Retry Decisions and Outcomes

      1. Decision has no retry day

        A nullable retry day can be intentional. Follow the explicit decision/action, reason code, reason detail, and policy source. Do not invent a date or fall back to an ad hoc retry schedule.
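Client code can make the nullable case explicit instead of substituting a default. In the sketch below, the field names `retry_day`, `decision`, and `reason_code` are assumptions about the decision payload, and `plan_from_decision` is a hypothetical helper.

```python
FIXED_SCHEDULE_DAYS = (1, 2, 6, 16)


def plan_from_decision(decision: dict) -> dict:
    """Map a decision payload to a local plan without inventing a date."""
    retry_day = decision.get("retry_day")  # may legitimately be None
    if retry_day is not None and retry_day not in FIXED_SCHEDULE_DAYS:
        raise ValueError(f"retry_day {retry_day} is outside the fixed schedule")
    return {
        "action": decision.get("decision"),
        "reason_code": decision.get("reason_code"),
        "retry_day": retry_day,
        # A missing retry day means nothing is scheduled; no fallback date.
        "schedule_retry": retry_day is not None,
    }
```

Passing `None` through unchanged keeps an intentional WAIT or STOP outcome visible to downstream code instead of hiding it behind an ad hoc date.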

      2. Unexpected decision

        • Verify the integration is using the intended legacy or next-generation contract.

        • Confirm attempt_number and decline evidence.

        • Compare issuer BIN, card brand, amount units, currency, and timestamps.

        • Review policy_source, matched_policy_id, reason codes, and confidence.

        • Confirm that optional fields were not silently omitted by client serialization.

      3. Outcome does not match a decision

        • Send request_id and decision_id whenever available.

        • Use the same token and attempt_number as the executed operation.

        • Report the actual processor result and timestamp.

        • Inspect matched_by in the outcome response.

        • Do not report recovery merely because a retry was scheduled.


          Confidence is not a guarantee

          Confidence describes evidence strength. It does not promise authorization success, and it should not override the explicit decision or fixed retry schedule.

    14. Investigation Runs

      1. Run is not listed

        • Confirm the caller has administrative authorization.

        • Confirm the correct tenant and environment.

        • Verify the upload_job_id or job_id from the ingestion response.

        • Check whether the run has been created yet.

        • Do not assume a merchant API key authorizes /v1/admin/investigation-runs.

      2. Run is complete but reporting is empty

        1. Confirm run status and row counts.

        2. Confirm Recovery Truth population.

        3. Confirm radar and issuer-health generation.

        4. Confirm monitoring-event and timeline population.

        5. Confirm cohort memory and classification persistence.

        6. Confirm reporting or command-center composition.


        Backfill is remediation, not the normal path

        On a clean deployment, completed runs should populate downstream resources through automatic bridges. Use previewed, tenant-scoped backfill only after locating a specific missing bridge.


      3. Polling never reaches terminal state

        Increase the polling interval, inspect runtime health and worker status, and stop after an application-defined timeout. Do not create a second run merely because the first is slow.
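A polling loop with a growing interval and an application-defined timeout might look like the sketch below. `get_status` is a caller-supplied function that reads the run status, and the terminal status names are assumptions, not documented values.

```python
import time

TERMINAL_STATES = {"completed", "failed"}  # assumed status names


def poll_until_terminal(get_status, timeout_s=300.0, interval_s=2.0,
                        max_interval_s=30.0, sleep=time.sleep,
                        clock=time.monotonic):
    """Poll `get_status()` until a terminal state or the timeout elapses.

    Returns the terminal status, or None when the timeout is reached.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() + interval_s > deadline:
            return None  # Give up and escalate; do not create a second run.
        sleep(interval_s)
        interval_s = min(max_interval_s, interval_s * 1.5)  # widen the interval
```

A `None` result is the signal to inspect runtime health and worker status, not to start another run.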

    15. Webhook Troubleshooting

      1. Subscription creation fails

        • Confirm callback_url is present and within 8 to 2,048 characters.

        • Provide 1 to 20 event names.

        • Remove unknown top-level fields.

        • Verify merchant authentication and plan capability.

        • Confirm HTTPS and production callback policy with the active deployment contract.

      2. Deliveries fail


        | Check | Why |
        |-------|-----|
        | Subscription status and URL | The subscription may be deleted, disabled, or misconfigured. |
        | DNS and TLS | Zahlen must reach and trust the callback endpoint. |
        | Consumer response time | Slow handlers can cause timeouts and redelivery. |
        | HTTP response code | Non-success responses normally trigger delivery failure handling. |
        | Signature/verification policy | A stale secret or wrong raw-body handling can reject valid deliveries. |
        | Dependency health | The callback may be up while its database or queue is unavailable. |


      3. Duplicate and out-of-order deliveries

        • Store a stable delivery/event identifier with a uniqueness constraint.

        • Acknowledge quickly and process asynchronously.

        • Make business effects idempotent.

        • Do not assume ordering across retries or event types.

        • Quarantine unknown event types rather than failing the entire consumer.
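Durable deduplication with a uniqueness constraint can be sketched with the standard-library sqlite3 module. The table and column names are hypothetical, and which delivery or event identifier to store depends on the active webhook contract.

```python
import sqlite3


def open_dedup_store(path=":memory:"):
    """Create a store whose primary key enforces one row per delivery ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS deliveries (delivery_id TEXT PRIMARY KEY)"
    )
    return conn


def first_time_seen(conn, delivery_id: str) -> bool:
    """Record the delivery; return False when it was already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO deliveries (delivery_id) VALUES (?)", (delivery_id,)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # Duplicate delivery: acknowledge, but skip the effect.
```

A handler would acknowledge every delivery quickly and enqueue the business effect only when `first_time_seen` returns True. In production, pass a file path or use a shared database so that the record survives restarts.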


          Verification contract is deployment-specific

          The subscription schema does not define a universal signing algorithm or header. Use the active Zahlen webhook outcome and verification contract; never invent one from examples.

    16. Observability and Correlation

      1. Minimum structured log fields


        | Category | Recommended fields |
        |----------|--------------------|
        | Request | timestamp, environment, method, route, status, latency_ms |
        | Identity | safe key_id/fingerprint, merchant_id where returned |
        | Correlation | event_id, batch_id, upload_job_id, request_id, decision_id, outcome_id |
        | Retry | idempotency_key fingerprint, retry_count, next_delay_ms |
        | Error | error category, safe message, response request ID |
        | Webhook | subscription_id, event type, delivery ID, verification result |


      2. Do not log

        • Full API keys or webhook secrets.

        • Full payment card numbers, CVV, passwords, or raw bank credentials.

        • Unredacted payment tokens or customer data unless explicitly approved.

        • Entire metadata objects without a classification and retention policy.
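A log filter can apply these rules before a record reaches the logging pipeline. The sketch below is illustrative: the sensitive field names are an assumed list, and `safe_log_record` is a hypothetical helper that redacts secrets, fingerprints the idempotency key, and keeps only the key names of the metadata object.

```python
import hashlib

# Field names treated as sensitive in this sketch (an assumed list).
SENSITIVE_FIELDS = {"api_key", "webhook_secret", "payment_token",
                    "card_number", "cvv"}


def safe_log_record(record: dict) -> dict:
    """Return a copy of `record` that is safe to write to structured logs."""
    safe = {}
    for name, value in record.items():
        if name in SENSITIVE_FIELDS:
            safe[name] = "[REDACTED]"
        elif name == "idempotency_key":
            # Log a short fingerprint instead of the key itself.
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            safe["idempotency_key_fingerprint"] = digest[:12]
        elif name == "metadata":
            # Keep key names only; values need a classification policy first.
            safe["metadata_keys"] = sorted(value) if isinstance(value, dict) else []
        else:
            safe[name] = value
    return safe
```

Correlation identifiers such as event_id and batch_id pass through unchanged, so the record stays useful for tracing.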

      3. Monitoring signals

        • Authentication failure rate.

        • HTTP 422 rate by field/path.

        • HTTP 429 rate and quota utilization.

        • Decision latency and error rate.

        • Outcome reporting lag and match rate.

        • Webhook failure, retry, and duplicate rate.

        • Unresolved or long-running investigation runs.

    17. Escalation and Production Checklist

      1. Escalate immediately when

        • One tenant can view another tenant’s data.

        • An API key is exposed or used after revocation.

        • Duplicate payment authorizations may have occurred.

        • Audit or correlation evidence is missing for a privileged action.

        • Sustained errors affect many tenants or production payment flows.

      2. Troubleshooting checklist


        | Area | Ready when |
        |------|------------|
        | Connectivity | Health/version checks work from the production network. |
        | Authentication | Key injection, rotation, and revocation are tested. |
        | Contracts | Required, optional, null, unknown-field, and maximum-size tests pass. |
        | Idempotency | Uncertain POST results reconcile without duplicate effects. |
        | Rate limits | 429 handling honors Retry-After and uses jittered backoff. |
        | Payment schedule | No technical retry can create attempts outside Days 1, 2, 6, and 16. |
        | Correlation | All durable identifiers appear in structured logs. |
        | Webhooks | Verification, deduplication, asynchronous processing, and replay are tested. |
        | Investigation runs | Completed runs can be traced into downstream reporting. |
        | Security | Secrets and sensitive payment data are redacted from logs and support evidence. |


Final rule

Fix the first broken boundary or durable link in the evidence chain. Do not hide failures with synthetic data, bypass tenant filters, raise quotas without diagnosis, or create uncontrolled payment retries.