For merchants, developers, and integration engineers
Version 1.0 | Source baseline: zahlen_deploy_0616A.tar.gz | June 2026
Commercial developer experience | Tenant-safe operations | Explainable retry intelligence
Purpose |
This appendix provides a repeatable diagnostic method for common Zahlen API integration failures. Begin with evidence, preserve identifiers, and change only one variable at a time. Do not use troubleshooting as a reason to bypass authentication, tenant isolation, idempotency, quotas, or the fixed payment retry schedule. |
The fastest path to resolution is usually not a larger retry loop. It is a precise answer to four questions: Which operation failed? What did the server return? Which durable identifiers were created? Did the failure occur before or after Zahlen accepted the logical operation?
Before You Troubleshoot
Record the environment hostname and the exact route, method, and UTC timestamp.
Capture the HTTP status, response body, response headers, and any request or correlation identifier.
Identify the API key by safe key ID or fingerprint; never paste the secret into tickets or chat.
Preserve merchant-side event IDs, idempotency keys, batch IDs, upload job IDs, decision IDs, and outcome IDs.
Confirm whether the client retried and whether each retry reused the same logical identifiers.
Redact payment tokens, customer identifiers, and sensitive metadata before sharing evidence.
Payment schedule boundary |
HTTP retries, polling, webhook redelivery, and administrative remediation must never create extra payment authorizations. Zahlen payment attempts follow the fixed schedule: Day 1, Day 2, Day 6, and Day 16. |
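One way to keep transport retries from turning into extra authorizations is a small guard in the merchant code path that performs the charge. The sketch below is illustrative only: `may_authorize` and `attempted_days` are hypothetical names, and how an integration maps calendar time to a schedule day is not defined by this appendix.

```python
SCHEDULE_DAYS = (1, 2, 6, 16)  # fixed schedule stated in this appendix


def may_authorize(schedule_day, attempted_days):
    """Allow an authorization only for a scheduled day that has not been used."""
    return schedule_day in SCHEDULE_DAYS and schedule_day not in attempted_days


attempted = set()  # assumption: production code keeps this in durable storage
for http_try in range(3):  # three HTTP tries for the same Day 2 attempt
    if may_authorize(2, attempted):
        attempted.add(2)  # record the attempt before calling the processor

print(sorted(attempted))  # [2]
```

Three HTTP tries still yield a single recorded Day 2 attempt, which is the property the boundary above requires.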
Five-Minute Triage
Step | Question | Evidence |
1 | Is the API reachable? | GET /v1/health, DNS, TLS, network path |
2 | Is authentication working? | X-API-Key presence, environment, key status |
3 | Is the request contract valid? | Content-Type, JSON syntax, required fields, types |
4 | Was the operation accepted? | HTTP status and returned durable identifiers |
5 | Is downstream processing complete? | Batch status, upload_job_id, decision/outcome resources |
Symptom | Most likely causes | First action |
No response / timeout | DNS, TLS, firewall, proxy, service unavailability | Test /v1/health with a short timeout; preserve whether the POST result is uncertain. |
401 on every request | Missing, invalid, revoked, malformed, or wrong-environment key | Verify X-API-Key injection and key status without logging the secret. |
403 | Valid identity lacks permission, plan capability, role, or route access | Confirm endpoint authorization and public-versus-admin boundary. |
404 | Wrong ID, wrong tenant, wrong path, or resource not yet created | Verify identifier source, tenant context, and exact route. |
409 | Idempotency conflict or resource state conflict | Compare the original idempotency key and request body. |
422 | Schema validation failure | Correct required fields, types, constraints, and unknown properties. |
429 | Rate limit or quota enforcement | Honor Retry-After, back off, inspect traffic and quota usage. |
500 / 503 | Server or dependency failure | Retry only when safe, with bounded backoff and idempotency. |
Decision has no retry day | Intentional WAIT/STOP/review outcome or insufficient evidence | Follow explicit decision and reason; do not invent a date. |
Batch appears incomplete | Asynchronous processing or pagination | Check status, counts, has_more, offset, and upload_job_id. |
Webhook duplicates | At-least-once delivery or provider retry | Deduplicate durably and make the consumer repeat-safe. |
Investigation run not visible | Admin authorization or tenant context mismatch | Verify /v1/admin access and authenticated tenant context. |
Evidence Packet for Support
Environment and base URL (without embedded credentials).
HTTP method and path.
UTC timestamp and client timeout value.
Safe API key ID/fingerprint.
Status code, response headers, and redacted response body.
event_id, batch_id, upload_job_id, request_id, decision_id, outcome_id, or subscription_id as applicable.
Idempotency-Key and whether the body changed between attempts.
Minimal redacted request that reproduces the problem.
Expected behavior versus observed behavior.
Health and version checks
curl -i --connect-timeout 5 --max-time 15 https://api.example.com/v1/health
curl -i --connect-timeout 5 --max-time 15 https://api.example.com/v1/version
A successful health response confirms basic reachability and service response. It does not prove that a merchant key is valid for authenticated routes.
Common connectivity failures
Failure | Diagnostic direction |
Could not resolve host | Verify hostname spelling, DNS records, VPN, and local resolver. |
Connection refused | Confirm port, reverse proxy, service state, and firewall policy. |
Connection timed out | Check routing, security groups, proxy egress, and service capacity. |
TLS certificate error | Confirm hostname matches the certificate and the trust store is current. |
Unexpected redirect | Use the documented HTTPS base URL; do not send credentials across unsafe redirects. |
HTML returned instead of JSON | Check route, proxy configuration, and whether an operator login page intercepted the request. |
Uncertain POST result |
A client timeout does not prove the server rejected the request. Before creating a new logical operation, query by the stable event or decision identifiers, or repeat only with the same idempotency key where supported. |
Diagnosing HTTP 401
Confirm the header name is exactly X-API-Key.
Confirm the key is being read from the intended secret or environment variable.
Check for leading/trailing whitespace, line breaks, shell quoting, or accidental truncation.
Confirm the key belongs to the selected development, staging, or production environment.
Confirm the key is active and has not been revoked or rotated out.
curl -i -X POST https://api.example.com/v1/payment-events \
-H 'Content-Type: application/json' \
-H 'X-API-Key: zk_live_REPLACE_ME' \
-d '{"events":[{"event_id":"evt_auth_test_001"}]}'
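The whitespace, line-break, and quoting checks listed above can be run against the loaded key without ever printing it. This is a minimal sketch: `check_api_key` and `key_fingerprint` are hypothetical helpers, and the truncated SHA-256 fingerprint is an assumption, not a Zahlen key ID format. Prefer the safe key ID from your key management records when one exists.

```python
import hashlib


def key_fingerprint(secret):
    """Return a short, non-reversible fingerprint that is safe to log."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()[:12]


def check_api_key(secret):
    """Return a list of problems found in the key value, never the key itself."""
    if secret is None or secret == "":
        return ["key is missing or empty"]
    problems = []
    if secret != secret.strip():
        problems.append("leading or trailing whitespace")
    if "\n" in secret or "\r" in secret:
        problems.append("embedded line break")
    if secret[0] in "'\"" or secret[-1] in "'\"":
        problems.append("stray shell quote characters")
    return problems


# A key read from a file often carries a trailing newline.
print(check_api_key("zk_live_REPLACE_ME\n"))
```

Log only the returned problem list and the fingerprint, never the raw value.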
Diagnosing HTTP 403
HTTP 403 normally means the credential was recognized but the caller is not allowed to perform the operation. Review plan capability, route authorization, merchant/tenant assignment, and whether the endpoint is administrative.
Public versus administrative access |
A merchant X-API-Key should not be assumed to authorize /v1/admin/* routes. Administrative investigation-run and governance APIs require separately approved administrative context. |
Suspected key compromise
Revoke the affected key immediately.
Create a replacement through the approved lifecycle.
Review audit/activity by key ID, tenant, route, time, and source network.
Inspect quota spikes and unusual endpoint access.
Preserve evidence and notify authorized contacts.
HTTP 400 versus 422
Status | Typical meaning | Client action |
400 | Malformed JSON, invalid business condition, or route-specific bad request | Correct the request; do not retry unchanged. |
422 | Schema validation failure | Read field-level errors and update serialization/types. |
Strict models
Key Zahlen request models forbid unknown top-level properties. A misspelled field is a client bug, not an extension point. Place only approved custom data inside the documented metadata object where available.
{
"events": [{
"event_id": "evt_001", "amount_mnor": 2999
}]
}
// Incorrect: amount_mnor is misspelled and should be amount_minor.
Frequent validation causes
Missing required event_id for payment events.
Empty events array or more than 10,000 payment events.
More than 500 events in a legacy retry-decision batch.
attempt_number below the documented minimum.
Negative amount or latency values where a nonnegative constraint applies.
String values sent for integer fields such as amount_minor.
Unknown top-level properties.
callback_url outside the supported length or an empty webhook events list.
Local validation practice
Use typed request models, JSON Schema validation, and contract tests before sending traffic. Test null, omitted optional fields, maximum batch size, and one unknown-field failure.
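A strict local check might look like the sketch below. The allowed field set is a hypothetical subset chosen for illustration, not the Zahlen schema; take the real field list, required fields, and limits from the published contract. Only the 1 to 10,000 event ceiling and the integer, nonnegative `amount_minor` rule come from this appendix.

```python
# Hypothetical subset of event fields; replace with the documented schema.
ALLOWED_EVENT_FIELDS = {"event_id", "amount_minor", "attempt_number", "metadata"}


def validate_event(event):
    """Return a list of validation errors for one event object."""
    if not isinstance(event, dict):
        return ["event must be an object"]
    errors = ["unknown property: " + name
              for name in sorted(set(event) - ALLOWED_EVENT_FIELDS)]
    event_id = event.get("event_id")
    if not isinstance(event_id, str) or not event_id:
        errors.append("event_id is required")
    amount = event.get("amount_minor")
    if amount is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(amount, bool) or not isinstance(amount, int):
            errors.append("amount_minor must be an integer")
        elif amount < 0:
            errors.append("amount_minor must be nonnegative")
    return errors


def validate_payment_events(body):
    """Validate a payment-events request body before sending it."""
    errors = ["unknown top-level property: " + name
              for name in sorted(set(body) - {"events"})]
    events = body.get("events")
    if not isinstance(events, list) or not 1 <= len(events) <= 10_000:
        errors.append("events must contain 1 to 10,000 items")
        return errors
    for index, event in enumerate(events):
        errors.extend("events[%d]: %s" % (index, e) for e in validate_event(event))
    return errors


print(validate_payment_events({"events": [{"event_id": "evt_001", "amount_mnor": 2999}]}))
```

Running this in contract tests catches the misspelled-field case locally instead of as an HTTP 422 in production.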
Why a valid ID can still return 404
The identifier belongs to another tenant and is intentionally not visible.
The client used an event ID where a batch ID was required.
The resource has not been created or downstream processing is incomplete.
The path is incorrect, including singular/plural or nested resource differences.
The caller used the wrong environment.
Do not bypass tenant filters |
Cross-tenant empty or 404 results are expected security behavior. Never add tenant_id to a request body, query string, or form to force visibility. Tenant ownership is resolved from authenticated context. |
Identifier chain
Identifier | Created by | Troubleshooting use |
event_id | Merchant | Find one submitted event and avoid duplicate logical events. |
batch_id / payment_event_batch_id | Zahlen | Retrieve batch state, summaries, decisions, and processor results. |
upload_job_id | Zahlen | Correlate ingestion with asynchronous processing and investigation runs. |
request_id | Zahlen | Trace one decision request across logs and support. |
decision_id | Zahlen | Link recommendation to the observed retry outcome. |
outcome_id | Zahlen | Confirm the recovery-learning record was accepted. |
subscription_id | Zahlen | Manage and diagnose one webhook subscription. |
Diagnosing HTTP 409
A conflict can indicate that an idempotency key was reused with a different request body, that the resource already exists in an incompatible state, or that the requested transition is not allowed.
Locate the original local operation record.
Compare the exact idempotency key and serialized body.
Confirm whether the first request returned a durable identifier.
Do not generate a new key merely to force a second effect.
Reconcile the server resource before retrying.
Idempotency rules
One logical operation gets one stable key.
The same logical operation reuses the same key and body.
A materially different operation uses a new key.
Store the key until the operation is terminal and reconciled.
Never base the key only on the HTTP attempt number.
Idempotency-Key: order-8842-attempt-2
Payment attempt identity |
The attempt number in an idempotency key identifies the scheduled payment attempt. It does not authorize additional attempts beyond Day 1, Day 2, Day 6, and Day 16. |
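The rules above can be enforced on the client by recording each key together with a fingerprint of the body it was first sent with. The sketch below is an assumption-laden illustration: `IdempotencyLedger` is a hypothetical class, the key format simply mirrors the example header above, and the in-memory dictionary stands in for the durable store a real integration would need.

```python
import hashlib
import json


def body_fingerprint(body):
    """Hash a canonical JSON form so key order does not change the result."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotencyLedger:
    """Maps each idempotency key to the body it was first used with."""

    def __init__(self):
        self._records = {}  # assumption: replace with durable storage

    def key_for(self, order_id, payment_attempt):
        # The attempt number is the scheduled payment attempt, not the HTTP try.
        return "order-%s-attempt-%d" % (order_id, payment_attempt)

    def register(self, key, body):
        """Record the key, or fail if it was already used with another body."""
        fingerprint = body_fingerprint(body)
        if self._records.setdefault(key, fingerprint) != fingerprint:
            raise ValueError("idempotency key reused with a different body")
        return key


ledger = IdempotencyLedger()
key = ledger.key_for("8842", 2)
ledger.register(key, {"amount_minor": 2999, "event_id": "evt_001"})
ledger.register(key, {"event_id": "evt_001", "amount_minor": 2999})  # same body, accepted
print(key)  # order-8842-attempt-2
```

Catching the mismatch locally surfaces the bug before the server answers with HTTP 409.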
Immediate client response to HTTP 429
Stop immediate repeated retries.
Read Retry-After and rate/quota metadata when supplied.
Use bounded exponential backoff with randomized jitter.
Preserve the idempotency key for the same logical operation.
Alert when throttling is sustained or increases suddenly.
import random
# Exponential backoff capped at max_delay, then +/-25% randomized jitter.
delay = min(max_delay, base_delay * (2 ** retry_number))
delay = delay * random.uniform(0.75, 1.25)
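Reading Retry-After takes some care because HTTP permits the header to carry either a number of seconds or an HTTP date. The helper below is a sketch, with `retry_after_seconds` as a hypothetical name; whether Zahlen sends one form or both is not stated in this appendix, so the code accepts either and returns `None` when the value is unusable.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value to seconds, or None if unusable."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():  # delta-seconds form, for example "30"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds("30"))  # 30.0
```

A client can then wait for the larger of this value and its own jittered backoff delay.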
Diagnose before increasing capacity
Pattern | Possible cause | Action |
Single key dominates | Integration loop or compromised credential | Stop the client, rotate/revoke if needed, inspect activity. |
Large synchronized spike | Fleet restart or missing jitter | Spread retries and add concurrency control. |
Steady legitimate growth | Plan or quota no longer fits volume | Coordinate an approved capacity change. |
Many small requests | Inefficient integration | Batch where appropriate without exceeding schema limits. |
Repeated same operation | Missing idempotency/reconciliation | Fix logical retry behavior before raising limits. |
Payment-event ingestion accepts 1 to 10,000 events per request, while the legacy retry-decision batch accepts up to 500. These are request-schema ceilings, not guaranteed throughput targets.
HTTP 500 and 503
Treat 500 and 503 as potentially transient, but do not retry every POST blindly. A server may have completed the operation before a downstream failure or network interruption prevented the response from reaching the client.
Operation | Automatic retry guidance |
GET resource | Usually safe with bounded backoff. |
POST retry decision | Reuse the same idempotency key and body. |
POST retry outcome | Reuse stable identifiers and idempotency where supported. |
POST payment-event batch | Use stable event IDs and reconcile batch/job identifiers. |
Validation failure | Do not retry until corrected. |
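The guidance in this table can be condensed into a small decision helper for client code. This is a sketch: `retry_guidance` and its return labels are hypothetical, and treating 429, 500, and 503 as the only retryable statuses is an assumption drawn from this appendix rather than a complete policy.

```python
RETRYABLE_STATUSES = {429, 500, 503}


def retry_guidance(method, status, has_stable_identity):
    """Suggest a next step for a failed request.

    has_stable_identity means the request carries an idempotency key or
    stable merchant identifiers such as event_id.
    """
    if status not in RETRYABLE_STATUSES:
        return "correct-or-investigate"  # e.g. 400, 401, 403, 404, 409, 422
    if method.upper() == "GET":
        return "retry-with-backoff"
    if has_stable_identity:
        return "retry-same-key-and-body"
    return "reconcile-before-retry"


print(retry_guidance("POST", 503, True))  # retry-same-key-and-body
```

The last branch covers a POST whose effect is uncertain and that has no stable identity to replay safely.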
Retry budget
Set connect and total request timeouts.
Cap retry count and total elapsed time.
Use exponential backoff with jitter.
Open a circuit or reduce concurrency during sustained dependency failure.
Move irrecoverable work to a durable review queue instead of looping forever.
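The budget described above can be sketched as a wrapper that caps both the retry count and the total elapsed time. `call_with_budget` and `TransientError` are hypothetical names, the default limits are placeholders, and the wrapped `operation` is assumed to reuse the same idempotency key and body on every call.

```python
import random
import time


class TransientError(Exception):
    """Raised by the wrapped operation for retryable failures such as 500 or 503."""


def call_with_budget(operation, max_retries=4, max_elapsed=30.0,
                     base_delay=0.5, max_delay=8.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call operation until it succeeds or the retry budget is spent."""
    start = clock()
    retry_number = 0
    while True:
        try:
            return operation()
        except TransientError:
            delay = min(max_delay, base_delay * (2 ** retry_number))
            delay *= random.uniform(0.75, 1.25)  # jitter spreads fleet retries
            out_of_retries = retry_number >= max_retries
            out_of_time = (clock() - start) + delay > max_elapsed
            if out_of_retries or out_of_time:
                raise  # caller moves the work to a durable review queue
            sleep(delay)
            retry_number += 1
```

Passing `sleep` and `clock` as parameters keeps the budget logic testable without real waiting.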
Never confuse transport and payment retries |
Retrying an HTTP request is a technical recovery mechanism. Retrying a card authorization is a payment action governed by the Day 1, Day 2, Day 6, and Day 16 schedule. |
Event not found after submission
Check whether the POST succeeded and returned payment_event_batch_id or batch_id.
Verify event_id spelling and environment.
Retrieve the batch and inspect accepted/rejected counts.
Check upload_job_id and processing status.
Allow for asynchronous completion before declaring loss.
Batch counts do not match expectations
Field | Meaning |
submitted | Number presented to the batch endpoint. |
accepted | Number accepted for processing. |
rejected | Number rejected at ingestion. |
event_count / total_events | Known events associated with the batch. |
returned | Number included in the current paginated response. |
has_more | Whether another page remains. |
Use offset >= 0 and a limit from 1 through 1,000.
Continue while has_more is true.
Do not compare returned with total_events as if they were the same metric.
Inspect invalid_rows, error_count, and ingestion details when available.
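The pagination rules above can be sketched as a loop that follows `has_more` and advances `offset` by the number of rows actually returned. `fetch_page` is a hypothetical callable that wraps the HTTP request, and the `items` field name is an assumption; only `offset`, `limit`, and `has_more` come from this appendix.

```python
def fetch_all_events(fetch_page, limit=1000, max_pages=10_000):
    """Collect every page of a batch listing, following has_more."""
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1,000")
    items = []
    offset = 0
    for _ in range(max_pages):  # hard stop so a bad response cannot loop forever
        page = fetch_page(offset=offset, limit=limit)
        rows = page.get("items", [])  # assumption: rows arrive under "items"
        items.extend(rows)
        if not page.get("has_more") or not rows:
            return items
        offset += len(rows)
    raise RuntimeError("pagination did not terminate")


# Stand-in for the API: 2,500 rows served in pages.
data = list(range(2500))


def fake_page(offset, limit):
    rows = data[offset:offset + limit]
    return {"items": rows, "has_more": offset + len(rows) < len(data)}


print(len(fetch_all_events(fake_page)))  # 2500
```

Compare the final collected count, not a single page's `returned` value, against `total_events`.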
Decision or processor result missing
A payment event, its retry decision, and its processor result are separate resources and may become available at different times. Check batch status, decision source, processor-results source, and upload job processing before escalating.
Decision has no retry day
A nullable retry day can be intentional. Follow the explicit decision/action, reason code, reason detail, and policy source. Do not invent a date or fall back to an ad hoc retry schedule.
Unexpected decision
Verify the integration is using the intended legacy or next-generation contract.
Confirm attempt_number and decline evidence.
Compare issuer BIN, card brand, amount units, currency, and timestamps.
Review policy_source, matched_policy_id, reason codes, and confidence.
Confirm that optional fields were not silently omitted by client serialization.
Outcome does not match a decision
Send request_id and decision_id whenever available.
Use the same token and attempt_number as the executed operation.
Report the actual processor result and timestamp.
Inspect matched_by in the outcome response.
Do not report recovery merely because a retry was scheduled.
Confidence is not a guarantee |
Confidence describes evidence strength. It does not promise authorization success, and it should not override the explicit decision or fixed retry schedule. |
Investigation run is not listed
Confirm the caller has administrative authorization.
Confirm the correct tenant and environment.
Verify the upload_job_id or job_id from the ingestion response.
Check whether the run has been created yet.
Do not assume a merchant API key authorizes /v1/admin/investigation-runs.
Run is complete but reporting is empty
Confirm run status and row counts.
Confirm Recovery Truth population.
Confirm radar and issuer-health generation.
Confirm monitoring-event and timeline population.
Confirm cohort memory and classification persistence.
Confirm reporting or command-center composition.
Backfill is remediation, not the normal path |
On a clean deployment, completed runs should populate downstream resources through automatic bridges. Use previewed, tenant-scoped backfill only after locating a specific missing bridge. |
Polling never reaches terminal state
Increase the polling interval, inspect runtime health and worker status, and stop after an application-defined timeout. Do not create a second run merely because the first is slow.
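A polling loop with a growing interval and a hard timeout might look like the sketch below. The terminal status names are hypothetical placeholders for the documented run states, and `get_status` stands for whatever call retrieves the existing run; the function never creates a new run.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}  # hypothetical status names


def poll_until_terminal(get_status, timeout=600.0, interval=2.0,
                        max_interval=60.0, sleep=time.sleep,
                        clock=time.monotonic):
    """Poll an existing run until it is terminal or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() + interval > deadline:
            raise TimeoutError("run did not reach a terminal state in time")
        sleep(interval)
        interval = min(max_interval, interval * 1.5)  # back off gradually
```

On timeout, escalate with the run identifier rather than submitting the same work again.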
Webhook subscription creation fails
Confirm callback_url is present and between 8 and 2,048 characters long.
Provide 1 to 20 event names.
Remove unknown top-level fields.
Verify merchant authentication and plan capability.
Confirm HTTPS and production callback policy with the active deployment contract.
Deliveries fail
Check | Why |
Subscription status and URL | The subscription may be deleted, disabled, or misconfigured. |
DNS and TLS | Zahlen must reach and trust the callback endpoint. |
Consumer response time | Slow handlers can cause timeouts and redelivery. |
HTTP response code | Non-success responses normally trigger delivery failure handling. |
Signature/verification policy | A stale secret or wrong raw-body handling can reject valid deliveries. |
Dependency health | The callback may be up while its database or queue is unavailable. |
Duplicate and out-of-order deliveries
Store a stable delivery/event identifier with a uniqueness constraint.
Acknowledge quickly and process asynchronously.
Make business effects idempotent.
Do not assume ordering across retries or event types.
Quarantine unknown event types rather than failing the entire consumer.
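Durable deduplication can be sketched with a uniqueness constraint on the delivery identifier. The example uses SQLite from the Python standard library purely for illustration; the table layout and `record_delivery` are hypothetical, and which payload field serves as the stable delivery identifier depends on the active webhook contract.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store; the primary key enforces one row per delivery."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS deliveries ("
        "delivery_id TEXT PRIMARY KEY, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL)"
    )
    return conn


def record_delivery(conn, delivery_id, event_type, payload):
    """Return True if the delivery is new, False if it is a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO deliveries VALUES (?, ?, ?)",
                (delivery_id, event_type, payload),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = open_store()
print(record_delivery(conn, "dlv_001", "example.event", "{}"))  # True
print(record_delivery(conn, "dlv_001", "example.event", "{}"))  # False
```

The handler can acknowledge the webhook right after a successful insert and leave business processing to a background worker that reads from the table.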
Verification contract is deployment-specific |
The subscription schema does not define a universal signing algorithm or header. Use the active Zahlen webhook outcome and verification contract; never invent one from examples. |
Minimum structured log fields
Category | Recommended fields |
Request | timestamp, environment, method, route, status, latency_ms |
Identity | safe key_id/fingerprint, merchant_id where returned |
Correlation | event_id, batch_id, upload_job_id, request_id, decision_id, outcome_id |
Retry | idempotency_key fingerprint, retry_count, next_delay_ms |
Error | error category, safe message, response request ID |
Webhook | subscription_id, event type, delivery ID, verification result |
Do not log
Full API keys or webhook secrets.
Full payment card numbers, CVV, passwords, or raw bank credentials.
Unredacted payment tokens or customer data unless explicitly approved.
Entire metadata objects without a classification and retention policy.
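A recursive redaction pass before logging or attaching evidence is one way to apply this list. The key names below are hypothetical examples, not a complete inventory of sensitive Zahlen fields; extend the set to match the data your integration actually handles.

```python
# Hypothetical key names; compared case-insensitively.
SENSITIVE_KEYS = {
    "x-api-key", "authorization", "webhook_secret", "token",
    "payment_token", "card_number", "cvv", "password", "metadata",
}


def redact(value):
    """Return a copy with sensitive values replaced, at any nesting depth."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


record = {
    "headers": {"X-API-Key": "zk_live_REPLACE_ME"},
    "events": [{"event_id": "evt_001", "token": "tok_123"}],
}
print(redact(record))
```

Identifiers such as `event_id` pass through unchanged, so the redacted record is still usable for correlation.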
Monitoring signals
Authentication failure rate.
HTTP 422 rate by field/path.
HTTP 429 rate and quota utilization.
Decision latency and error rate.
Outcome reporting lag and match rate.
Webhook failure, retry, and duplicate rate.
Unresolved or long-running investigation runs.
Escalate immediately when
One tenant can view another tenant’s data.
An API key is exposed or used after revocation.
Duplicate payment authorizations may have occurred.
Audit or correlation evidence is missing for a privileged action.
Sustained errors affect many tenants or production payment flows.
Troubleshooting checklist
Area | Ready when |
Connectivity | Health/version checks work from the production network. |
Authentication | Key injection, rotation, and revocation are tested. |
Contracts | Required, optional, null, unknown-field, and maximum-size tests pass. |
Idempotency | Uncertain POST results reconcile without duplicate effects. |
Rate limits | 429 handling honors Retry-After and uses jittered backoff. |
Payment schedule | No technical retry can create attempts outside Days 1, 2, 6, and 16. |
Correlation | All durable identifiers appear in structured logs. |
Webhooks | Verification, deduplication, asynchronous processing, and replay are tested. |
Investigation runs | Completed runs can be traced into downstream reporting. |
Security | Secrets and sensitive payment data are redacted from logs and support evidence. |
Final rule |
Fix the first broken boundary or durable link in the evidence chain. Do not hide failures with synthetic data, bypass tenant filters, raise quotas without diagnosis, or create uncontrolled payment retries. |