Zahlen Documentation
8.1 Troubleshooting Guide
Phase 8 — Supporting Documentation
This guide helps operators, supervisors, and technical teams diagnose ingestion failures, replay mismatches, telemetry gaps, watermark issues, and routing inconsistencies in Zahlen.
The Troubleshooting Guide explains how to investigate common operational problems in Zahlen without losing sight of the platform’s central purpose: turning payment behavior into reliable issuer intelligence.
Troubleshooting in Zahlen is not only a technical activity. It is an evidence-quality activity. When ingestion fails, replay diverges, telemetry is missing, watermarks stop advancing, or routing behaves unexpectedly, the operator must determine whether the issue affects operational confidence, governance trust, or downstream decision-making.
This chapter provides a structured approach to diagnosing problems. It defines each issue type, explains why it matters, describes likely causes, and recommends safe operator actions.
Operator Principle: When troubleshooting Zahlen, first protect evidence integrity. Do not treat missing data, replay divergence, telemetry gaps, or routing anomalies as cosmetic issues. Each can change how much confidence operators should place in the resulting issuer intelligence.
A troubleshooting mindset is the disciplined approach an operator uses to separate symptoms from causes.
A symptom is the visible problem, such as an empty dashboard, a failed ingestion run, or a missing alert. A cause is the underlying condition, such as a malformed CSV, a missing response_code field, an unadvanced watermark, or a replay mismatch.
Zahlen troubleshooting should move from the outer user-facing surface toward the underlying evidence path. Operators should start with what the page or report shows, then confirm, in order: the run, the input data, the canonical mapping, downstream event creation, and replay and telemetry status.
| Troubleshooting Layer | Question to Ask | Why It Matters |
| --- | --- | --- |
| User-facing surface | What did the operator see? | Defines the visible symptom and affected workflow. |
| Run or job record | Did the analysis, ingestion, or health run complete? | Shows whether the system executed the expected workflow. |
| Input evidence | Was the CSV, API event, or stream payload valid? | Determines whether the system had usable evidence. |
| Canonical mapping | Were source fields mapped to canonical fields correctly? | Protects response_code, issuer identity, recovery, and retry semantics. |
| Derived signals | Were issuer-health events, alerts, telemetry, or tasks generated? | Shows whether data moved through the intelligence pipeline. |
| Replay and governance | Can the conclusion be reconstructed and trusted? | Determines whether the result is governance-ready. |
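The outer-to-inner ordering of the layers can be sketched as a small checklist runner. This is a minimal illustration, not part of Zahlen: the layer keys and the check callables are hypothetical placeholders that an operator or engineer would replace with real lookups.

```python
from typing import Callable, Dict, Optional

# Layer names mirror the troubleshooting layers in this guide.
# The keys themselves are hypothetical identifiers chosen for this sketch.
LAYERS = [
    "user_facing_surface",
    "run_or_job_record",
    "input_evidence",
    "canonical_mapping",
    "derived_signals",
    "replay_and_governance",
]


def first_failing_layer(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Walk the layers outer-to-inner and return the first one that fails.

    A layer with no registered check is treated as unverified and returned,
    so a missing check is never silently counted as a pass.
    """
    for layer in LAYERS:
        check = checks.get(layer)
        if check is None or not check():
            return layer
    return None  # every layer was checked and passed
```

Stopping at the first failing layer keeps the investigation anchored to the evidence path: there is little value in examining replay status before confirming that the input evidence was usable.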
An ingestion failure occurs when Zahlen cannot accept, parse, validate, normalize, or process incoming payment evidence.
Ingestion may occur through CSV upload, API submission, or event-stream integration. The failure mode depends on the channel, but the operational meaning is the same: Zahlen's downstream intelligence may not have received usable evidence.
Ingestion failures matter because issuer intelligence depends on the completeness and correctness of incoming events. If ingestion fails, alerts may not appear, issuer health may not update, recovery curves may be incomplete, and replay evidence may be unavailable.
| Common Ingestion Symptom | Likely Cause | Recommended Action |
| --- | --- | --- |
| CSV upload fails | The file is malformed, empty, too large, missing a header row, or not readable as CSV. | Re-export the file as UTF-8 CSV, confirm the header row, and retry with a small sample if needed. |
| Run completes but no findings appear | The file may lack issuer identity, response_code, recovery outcome, or retry lifecycle fields. | Review canonical field mappings and confirm issuer_bin, response_code, recovered, and retry_day equivalents. |
| API events are rejected | Required fields may be missing, invalid, unauthorized, or incorrectly formatted. | Review validation errors, tenant context, event_id, event_at, and canonical field mapping. |
| API events are accepted but not visible | Events may not have produced downstream signals, or processing may be delayed. | Check event processing status, platform events, run health, and watermark advancement. |
| Streaming ingestion stalls | Consumer lag, topic mismatch, schema drift, or worker failure may be present. | Check stream topic, consumer group, event envelope, worker heartbeat, and replay offset status. |
CSV ingestion problems should be diagnosed by confirming both file validity and analytical usefulness.
A file can be technically valid but operationally weak. For example, a CSV may load successfully while missing the issuer_bin or response_code fields needed for issuer analysis. In that situation, the issue is not upload failure. The issue is evidence incompleteness.
| Diagnostic Check | What It Confirms | Operational Meaning |
| --- | --- | --- |
| Header row exists | The file has named columns. | Zahlen can attempt source-to-canonical mapping. |
| response_code is present or mappable | Decline and authorization behavior can be interpreted. | Issuer response-code analysis can run. |
| issuer_bin is present or mappable | Issuer identity can be grouped. | Issuer cognition can operate. |
| event_at or lifecycle timestamp is present | Events can be ordered. | Timeline and replay reconstruction improve. |
| retry_day or retry lifecycle data exists | Retry windows can be interpreted. | Recovery curve analysis becomes more reliable. |
| recovered or success field exists | Recovery outcomes can be measured. | Recovery rates and marginal recovery can be calculated. |
Troubleshooting Note: If a CSV run completes but produces weak findings, inspect the schema before assuming the analysis engine failed. Missing canonical evidence fields are a common cause of empty or low-confidence results.
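A schema inspection of this kind can be approximated before upload with a short header check. The sketch below uses only the Python standard library. The canonical field names come from this guide; the alias lists are illustrative assumptions and do not represent Zahlen's actual mapping rules.

```python
import csv
import io

# Canonical names are taken from this guide. The aliases are hypothetical
# examples of source column names that might map to each canonical field.
CANONICAL_ALIASES = {
    "issuer_bin": {"issuer_bin", "bin"},
    "response_code": {"response_code", "resp_code"},
    "event_at": {"event_at", "timestamp"},
    "retry_day": {"retry_day"},
    "recovered": {"recovered", "success"},
}


def missing_canonical_fields(csv_text: str) -> list:
    """Return the canonical fields that no header column maps to."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        raise ValueError("file is empty: no header row")
    columns = {name.strip().lower() for name in header}
    return [
        field
        for field, aliases in CANONICAL_ALIASES.items()
        if not (columns & aliases)  # no overlap means the field is unmappable
    ]
```

An empty result means the file is at least mappable; a non-empty result points to evidence incompleteness rather than an upload failure.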
A replay mismatch occurs when Zahlen's replay process does not reproduce the expected historical conclusion.
Replay is the process of reconstructing prior results from preserved evidence and deterministic logic. Replay is central to governance because it allows operators to verify that a conclusion was not caused by unstable processing, hidden state, non-deterministic ordering, or incomplete evidence.
A replay mismatch should be treated as an integrity issue. It does not always mean the original result was wrong, but it does mean the evidence path requires review before the result is used for strong governance decisions.
| Replay Symptom | Likely Cause | Recommended Action |
| --- | --- | --- |
| Replay produces a different result | Input ordering, transformation logic, baseline version, or evidence set changed. | Compare input digest, output digest, event ordering, and evaluation version. |
| Replay cannot find source events | Event lineage or durable storage may be incomplete. | Check event repository, run artifacts, platform events, and retention settings. |
| Replay completes partially | Some evidence was available but not all required records were present. | Treat the replay as limited and review missing evidence before escalation. |
| Replay fails with validation errors | Historical events may no longer satisfy current schema or mapping rules. | Review schema compatibility and canonical field migrations. |
| Replay divergence appears after code changes | Evaluation logic may have changed without compatibility controls. | Confirm whether the change was intentional and whether historical replay contracts were updated. |
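Comparing digests, ordering, and evaluation version can be sketched as follows. This is an illustration under stated assumptions: the guide does not specify how Zahlen computes digests, so the SHA-256-over-canonical-JSON approach and the record keys `input_digest`, `output_digest`, and `evaluation_version` are hypothetical.

```python
import hashlib
import json


def evidence_digest(events: list) -> str:
    """SHA-256 over a canonical JSON encoding of the events, in the given order.

    Key order inside each event is normalized, but the sequence of events is
    not, so a reordered evidence set yields a different digest.
    """
    payload = json.dumps(events, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def differs_only_by_order(a: list, b: list) -> bool:
    """True when both lists hold the same events in a different sequence."""
    def canon(events):
        return sorted(json.dumps(e, sort_keys=True) for e in events)
    return evidence_digest(a) != evidence_digest(b) and canon(a) == canon(b)


def compare_replay(original: dict, replay: dict) -> list:
    """List which recorded attributes differ between the original run and replay."""
    keys = ("input_digest", "output_digest", "evaluation_version")
    return [k for k in keys if original.get(k) != replay.get(k)]
```

If the input digests match but the output digests do not, the divergence lies in processing rather than evidence. If the inputs differ only by order, non-deterministic ordering is the first suspect.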
Not every replay mismatch has the same severity. Severity depends on whether the mismatch affects a dashboard count, a telemetry summary, an incident recommendation, a governance decision, or a public-safe intelligence signal.
| Replay Severity | Definition | Recommended Response |
| --- | --- | --- |
| Low | Replay mismatch affects a non-critical display or derived label without changing the operational conclusion. | Document the mismatch and correct the display or mapping issue. |
| Medium | Replay mismatch affects a signal used by operators but not yet escalated. | Review evidence, rerun replay, and avoid escalation until resolved. |
| High | Replay mismatch affects an incident, recommendation, or supervisor decision. | Pause governance reliance and perform evidence-lineage review. |
| Critical | Replay mismatch affects public-safe intelligence, audit evidence, or cross-domain governance. | Quarantine the signal and escalate for governance review. |
Governance Rule: A replay-divergent signal should not be treated as fully governance-ready. Resolve the mismatch, explain the limitation, or quarantine the signal before using it for formal decisions.
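The severity levels can be expressed as a lookup from what the mismatch affects to the recommended response. This is a sketch only: the labels in `AFFECTED_SEVERITY` are hypothetical names chosen for illustration, and taking the highest severity among all affected items is an assumption rather than a documented Zahlen rule.

```python
from enum import IntEnum


class ReplaySeverity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical labels for what a replay mismatch touches.
AFFECTED_SEVERITY = {
    "display_label": ReplaySeverity.LOW,
    "operator_signal": ReplaySeverity.MEDIUM,
    "incident": ReplaySeverity.HIGH,
    "recommendation": ReplaySeverity.HIGH,
    "supervisor_decision": ReplaySeverity.HIGH,
    "public_safe_intelligence": ReplaySeverity.CRITICAL,
    "audit_evidence": ReplaySeverity.CRITICAL,
    "cross_domain_governance": ReplaySeverity.CRITICAL,
}

# Response wording follows the severity definitions in this guide.
RECOMMENDED_RESPONSE = {
    ReplaySeverity.LOW: "Document the mismatch and correct the display or mapping issue.",
    ReplaySeverity.MEDIUM: "Review evidence, rerun replay, and avoid escalation until resolved.",
    ReplaySeverity.HIGH: "Pause governance reliance and perform evidence-lineage review.",
    ReplaySeverity.CRITICAL: "Quarantine the signal and escalate for governance review.",
}


def classify_mismatch(affected: set) -> ReplaySeverity:
    """Return the highest severity among everything the mismatch affects."""
    unknown = affected - set(AFFECTED_SEVERITY)
    if not affected or unknown:
        raise ValueError(f"cannot classify; unknown or empty labels: {sorted(unknown)}")
    return max(AFFECTED_SEVERITY[label] for label in affected)
```

Rejecting unknown labels, instead of defaulting them to a low severity, keeps an unclassified mismatch from being quietly treated as harmless.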
A telemetry gap occurs when platform-processing evidence is missing, incomplete, delayed, or not linked to the underlying issuer signal.
Telemetry explains how the platform processed, enriched, validated, and interpreted evidence. It may include ingestion counts, truth matching results, external enrichment status, warning counts, platform event creation, worker status, processing lag, and enrichment outcomes.
Telemetry gaps matter because they reduce the operator’s ability to understand evidence quality. A signal may still be useful without complete telemetry, but the operator should know which processing context is missing.
| Telemetry Symptom | Likely Cause | Recommended Action |
| --- | --- | --- |
| truth_confidence_band shows NONE | Truth enrichment may not have matched evidence or may not have run. | Check truth_matches_found, truth_matched_by, and external_status before interpreting as a failed signal. |
| external_status shows NOT_RUN | External enrichment or validation was not executed for the run. | Treat the result as internal telemetry-only for that enrichment dimension. |
| Telemetry event count is zero | Telemetry generation may not be wired for that workflow or no telemetry was produced. | Check whether the route, job, or service emits telemetry for that path. |
| Telemetry exists but is not linked | Correlation identifiers, run identifiers, or signal identifiers may be missing. | Review correlation_id, run_id, job_id, issuer context, and event linkage. |
| Telemetry appears delayed | Processing lag or worker delay may be present. | Check latest event time, worker heartbeat, and queue depth. |
Truth data refers to validated reference evidence used to confirm, enrich, or calibrate an observed payment signal.
If truth fields show NONE, zero, or NOT_RUN, the operator should not automatically conclude that the underlying issuer signal is invalid. The correct interpretation is that the signal was not linked to truth evidence for that run or enrichment path.
For example, a telemetry context that shows zero truth-linked events and external_status of NOT_RUN may still indicate that the CSV analysis ran successfully. It simply means live or external truth enrichment was not executed or did not produce matches.
Operator Interpretation: A telemetry gap weakens enrichment context, not necessarily the underlying payment evidence. Separate the question “did the issuer signal exist?” from the question “was the signal externally truth-linked?”
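The distinction between a missing signal and missing enrichment can be made explicit in code. The field names `truth_confidence_band`, `truth_matches_found`, and `external_status`, and the values `NONE` and `NOT_RUN`, come from this guide. The returned labels are hypothetical, and the sketch assumes that any `external_status` other than `NOT_RUN` means enrichment was executed.

```python
def interpret_truth_context(telemetry: dict) -> str:
    """Describe the enrichment state without judging the issuer signal itself.

    None of the returned labels means "the signal is invalid"; they only say
    how much truth context accompanies it.
    """
    band = telemetry.get("truth_confidence_band", "NONE")
    matches = telemetry.get("truth_matches_found", 0)
    external = telemetry.get("external_status", "NOT_RUN")

    if external == "NOT_RUN":
        return "enrichment_not_run"       # internal telemetry only
    if matches == 0 or band == "NONE":
        return "enrichment_ran_no_match"  # ran, but nothing was linked
    return "truth_linked"
```

A CSV analysis that completed with `external_status` of `NOT_RUN` would be labeled `enrichment_not_run` here, which describes the enrichment path and not the validity of the run.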
A watermark issue occurs when Zahlen cannot reliably determine how far an ingestion, replay, monitoring, or event-processing workflow has advanced.
A watermark is a progress marker. It may record the latest processed event, offset, timestamp, run identifier, replay epoch, or stream position. Watermarks help prevent duplicate processing, missed events, and uncertain replay boundaries.
Watermark issues matter because they affect operational continuity. If a watermark does not advance, downstream signals may stop updating. If a watermark advances incorrectly, events may be skipped. If a watermark regresses unexpectedly, duplicate processing may occur.
| Watermark Symptom | Likely Cause | Recommended Action |
| --- | --- | --- |
| Watermark does not advance | Processing may be stalled, no qualifying events exist, or persistence failed. | Check event counts, worker status, processing logs, and repository writes. |
| Watermark advances but no output appears | Events may be processed but filtered, suppressed, or not converted into signals. | Review eligibility filters, thresholds, and downstream event creation. |
| Watermark jumps unexpectedly | The processor may have skipped events or used an incorrect offset. | Compare event counts, source offsets, run summaries, and persisted watermark history. |
| Watermark resets to older value | Persistence, environment isolation, or state directory mismatch may be present. | Verify storage path, environment configuration, and deployment state. |
| Replay watermark differs from live watermark | Replay and live processing may use different namespaces or epochs. | Confirm replay namespace, environment classification, and replay-safe boundary rules. |
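The first four symptoms can be told apart by comparing two persisted watermark values with the processed count from the run summary. The sketch below assumes an integer, offset-style watermark that increases by one per processed event; timestamp or epoch watermarks would need a different comparison. The function name and return labels are hypothetical.

```python
def classify_watermark(previous: int, current: int, processed_count: int) -> str:
    """Classify watermark movement between two checkpoints.

    Assumes an offset-style watermark where each processed event advances
    the position by exactly one.
    """
    if current < previous:
        return "regressed"      # risk of duplicate processing
    if current == previous:
        return "not_advanced"   # stalled, no qualifying events, or a failed write
    if current - previous > processed_count:
        return "jumped"         # moved further than the events processed
    return "advanced"
```

A result of `not_advanced` is not a diagnosis by itself: the source event count still has to be checked to tell a stalled processor from a quiet period.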
Watermark troubleshooting should begin with the source event count and end with downstream signal verification.
| Step | Question | Evidence to Review |
| --- | --- | --- |
| Confirm source events | Did new source events exist for the processing window? | Input records, event stream, CSV rows, API receipts. |
| Confirm processor execution | Did the worker or service run? | Run history, worker heartbeat, job record, logs. |
| Confirm processed count | Did the service process any events? | Run summary, processed count, skipped count. |
| Confirm persisted watermark | Was progress written durably? | Watermark repository, state directory, database record. |
| Confirm downstream output | Were signals, alerts, or platform events created? | Issuer-health rows, alerts, event store, dashboard counts. |
Operational Warning: A watermark issue can silently affect confidence. The dashboard may look calm because no new events were processed, not because issuer behavior was healthy.
A routing inconsistency occurs when alerts, incidents, tasks, action-queue items, or escalation guidance do not appear in the expected operational destination.
Routing is the process of assigning an operational item to the correct queue, owner, severity, priority, workflow, or supervisor path. In Zahlen, routing may move an issuer-health alert into an incident, a task, an action queue, an escalation recommendation, or a supervisor dashboard.
Routing inconsistencies matter because they affect operator response. If a serious issuer signal is routed incorrectly, it may not receive timely investigation. If a low-confidence signal is routed too aggressively, operators may waste time or over-escalate.
| Routing Symptom | Likely Cause | Recommended Action |
| --- | --- | --- |
| Alert exists but no incident appears | Auto-creation rules may not have run or thresholds may not have been met. | Check incident creation settings, alert severity, confidence, and auto-create workflow. |
| Incident exists but no action-queue item appears | Task creation or queue routing may not have been triggered. | Review task linkage, routing service output, and queue eligibility. |
| Item routed to wrong queue | Routing rules may map severity, issuer country, metric, or owner incorrectly. | Review routing reason, target queue, severity, priority, and rule configuration. |
| Escalation guidance appears unexpectedly | Aging, unowned, unresolved, or priority rules may be triggering guidance. | Review item age, owner assignment, resolution status, and escalation reason. |
| Supervisor dashboard count differs from queue | Filters, refresh timing, or aggregation logic may differ. | Compare query filters, latest refresh, severity filters, and source tables. |
Operational routing should be diagnosed by following the item from original signal to final operator surface.
The operator should identify the source signal, confirm whether it generated an alert, determine whether the alert created an incident or task, review the routing reason, and confirm whether the item reached the expected queue or supervisor surface.
| Routing Check | What It Confirms | Why It Matters |
| --- | --- | --- |
| Source signal | The original issuer-health or monitoring signal exists. | Confirms that routing had evidence to act on. |
| Alert creation | The source signal generated an alert. | Shows whether the alerting threshold was met. |
| Incident creation | The alert created or linked to an incident. | Shows whether case workflow began. |
| Task creation | The incident or alert created an operational task. | Shows whether work entered the action path. |
| Queue assignment | The task was assigned to the expected queue. | Supports operator workflow correctness. |
| Escalation guidance | The system recommended escalation based on defined conditions. | Supports supervisor coordination. |
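Following an item from signal to operator surface amounts to finding the first broken link in a chain. The sketch below simplifies the workflow to a strictly linear chain, which is an assumption: in practice a task may be created directly from an alert without an incident. The stage names and the shape of the `item` dictionary are hypothetical.

```python
# Hypothetical stage names for a simplified, linear routing chain.
ROUTING_STAGES = ["source_signal", "alert", "incident", "task", "queue_assignment"]


def trace_routing(item: dict) -> dict:
    """Follow an item through the routing stages and report where the chain stops.

    `item` maps each stage name to an identifier, or to None (or no entry)
    when that stage was never reached.
    """
    reached = []
    for stage in ROUTING_STAGES:
        if not item.get(stage):
            return {"reached": reached, "missing": stage}
        reached.append(stage)
    return {"reached": reached, "missing": None}
```

The `missing` stage tells the operator which rule set to review next: a chain that stops at `incident`, for example, points toward incident creation settings and thresholds rather than queue eligibility.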
A dashboard inconsistency occurs when counts, statuses, or tables differ across pages in ways that are not immediately clear.
Some differences are expected. One page may show alerts, another may show action-queue tasks, and another may show incidents. These are related but not identical objects. An alert is a signal. An incident is a case. A task is an operational work item. Escalation guidance is a recommendation layer. Counts may differ because they represent different workflow stages.
A true inconsistency occurs when the same object type should match across surfaces but does not, or when a workflow relationship is expected but missing.
| Visible Difference | Possible Explanation | Recommended Check |
| --- | --- | --- |
| Alerts count differs from queue count | Not every alert may create a queue item, or filters may differ. | Compare severity filters, queue eligibility, and routing rules. |
| Incidents count differs from alerts count | Incidents may be grouped by issuer cohort rather than one incident per alert. | Check incident IDs and cohort grouping logic. |
| Supervisor dashboard differs from Action Queue | Supervisor may aggregate escalation or ownership fields differently. | Compare source query, filters, and refresh timing. |
| System Health shows completed run but Monitor has empty Radar | Issuer-health events may exist without crossing Radar promotion thresholds. | Check Radar promotion thresholds and behavior-feed eligibility. |
| Latest timestamp differs across pages | Pages may summarize different objects or refresh at different times. | Check object type, run time, alert time, and page refresh cadence. |
Operator Note: Do not assume count differences are errors. First identify whether the pages are counting the same object type: events, alerts, incidents, tasks, escalations, runs, or public-safe signals.
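One way to apply this rule is to compare counts only between surfaces that report the same object type under the same filters. The sketch below is illustrative: the record shape (`page`, `object_type`, `filters`, `count`) is a hypothetical structure, not a Zahlen API, and it ignores refresh timing.

```python
from collections import defaultdict


def true_inconsistencies(surfaces: list) -> list:
    """Report count disagreements among surfaces that count the same thing.

    Surfaces are grouped by object type and filter set. Differences between
    groups (for example alerts versus tasks) are expected and not reported.
    """
    groups = defaultdict(list)
    for s in surfaces:
        key = (s["object_type"], tuple(sorted(s.get("filters", {}).items())))
        groups[key].append(s)

    findings = []
    for (object_type, _filters), members in groups.items():
        counts = {m["page"]: m["count"] for m in members}
        if len(set(counts.values())) > 1:
            findings.append({"object_type": object_type, "counts": counts})
    return findings
```

An alerts page showing 12 and a queue showing 7 would produce no finding, because they count different objects. Two pages that both count tasks under the same filters but disagree would be flagged for a closer look at source queries and refresh timing.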
Environment problems occur when the running system uses an unexpected database, state directory, jobs directory, API state path, environment namespace, or deployment configuration.
Configuration mistakes can create confusing symptoms. For example, a run may complete in one environment while the dashboard reads another environment’s database. A service may write job artifacts to one directory while the route expects another. A replay process may use a different namespace from live processing.
| Configuration Symptom | Likely Cause | Recommended Action |
| --- | --- | --- |
| Run exists but dashboard does not show it | The route may read a different database or jobs directory. | Verify database path and job artifact directory for the running service. |
| Dev site differs from local results | Different environment, database, or deployment version is active. | Confirm service deployment, environment variables, and source version. |
| Replay behavior differs by environment | Replay namespace or environment classification differs. | Confirm replay namespace and environment-isolation settings. |
| Watermark state disappears after restart | State path may be non-durable or misconfigured. | Verify persistent state directory and service permissions. |
| API state differs from UI state | API and UI may point to different state paths. | Check service configuration and route dependencies. |
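Many of these symptoms reduce to two components resolving the same setting to different values. A minimal comparison is sketched below; the configuration keys are hypothetical placeholders, since this guide does not name Zahlen's actual settings.

```python
# Hypothetical setting names; substitute the keys the deployment really uses.
DEFAULT_KEYS = ("database_path", "state_dir", "jobs_dir", "environment_namespace")


def config_drift(a: dict, b: dict, keys: tuple = DEFAULT_KEYS) -> dict:
    """Return the settings whose values differ between two components.

    Each entry maps the setting name to a (value_in_a, value_in_b) pair.
    A setting missing on one side is reported with None for that side.
    """
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Running this against the effective configuration of the API service and the UI service, or of the replay and live processors, gives a short list of paths and namespaces to verify before suspecting the analysis logic.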
Escalation criteria define when an issue should move from routine troubleshooting to supervisor, governance, or engineering review.
A routine issue can be resolved by correcting input data, mapping fields, rerunning a job, or reviewing filters. A governance issue affects replay consistency, tenant safety, public-safe publication, audit evidence, or cross-domain trust. An engineering issue affects code, persistence, workers, service configuration, or route integration.
| Escalation Type | When to Escalate | Recommended Recipient |
| --- | --- | --- |
| Operator escalation | The issue affects workflow assignment, unresolved tasks, or investigation clarity. | Supervisor or operations lead. |
| Governance escalation | The issue affects replay consistency, lineage, confidence, tenant safety, or public-safe signals. | Governance reviewer or compliance owner. |
| Engineering escalation | The issue appears to involve code, database schema, worker execution, persistence, or deployment. | Engineering team. |
| Security escalation | The issue may expose tenant data or violate access boundaries. | Security or platform owner. |
| Product escalation | The issue reflects confusing workflow design or ambiguous operator experience. | Product owner or documentation owner. |
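The escalation types can be encoded as trigger sets so that an issue affecting several areas is sent to every relevant recipient. This is a sketch: the trigger labels are hypothetical shorthand for the conditions described in this guide, and returning all matching types rather than a single one is an assumption.

```python
# Escalation types and recipients follow this guide; trigger labels are
# hypothetical shorthand chosen for this sketch.
ESCALATION_RULES = [
    ("operator", {"workflow_assignment", "unresolved_task", "investigation_clarity"}),
    ("governance", {"replay_consistency", "lineage", "confidence",
                    "tenant_safety", "public_safe_signal"}),
    ("engineering", {"code", "database_schema", "worker_execution",
                     "persistence", "deployment"}),
    ("security", {"tenant_data_exposure", "access_boundary"}),
    ("product", {"workflow_design", "operator_experience"}),
]

RECIPIENTS = {
    "operator": "Supervisor or operations lead",
    "governance": "Governance reviewer or compliance owner",
    "engineering": "Engineering team",
    "security": "Security or platform owner",
    "product": "Product owner or documentation owner",
}


def escalations_for(affected: set) -> list:
    """Return every escalation type whose triggers overlap the affected areas.

    An empty list means nothing here calls for escalation, so the issue can
    stay in routine troubleshooting.
    """
    return [name for name, triggers in ESCALATION_RULES if affected & triggers]
```

For example, an issue touching both `persistence` and `replay_consistency` would return the governance and engineering types together, reflecting that one problem can need more than one reviewer.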
A troubleshooting record is a concise written account of the issue, investigation steps, findings, and resolution.
Troubleshooting records matter because Zahlen is an operational intelligence platform. When evidence quality or governance trust is affected, the organization should preserve what happened and how the issue was resolved.
| Record Field | Definition | Why It Matters |
| --- | --- | --- |
| Issue summary | A short description of the visible problem. | Helps future readers understand the symptom. |
| Affected page or workflow | The dashboard, route, job, export, replay, or ingestion path involved. | Locates the problem in the operator experience. |
| Evidence reviewed | The run, file, event, alert, incident, task, or telemetry records checked. | Documents the evidence path. |
| Root cause | The underlying condition that caused the issue. | Prevents repeated troubleshooting. |
| Impact assessment | The operational or governance impact of the issue. | Explains whether confidence was affected. |
| Resolution | The fix or corrective action taken. | Creates durable operational memory. |
| Follow-up | Any remaining work, tests, monitoring, or documentation updates. | Ensures the issue is fully closed. |
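The record fields map naturally onto a small data structure. The sketch below is one possible shape, assuming Python dataclasses. The class name, attribute names, and the `is_closed` rule are illustrative choices, not a Zahlen schema.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class TroubleshootingRecord:
    """One troubleshooting record; attributes mirror the record fields above."""
    issue_summary: str
    affected_workflow: str
    evidence_reviewed: list
    root_cause: str
    impact_assessment: str
    resolution: str
    follow_up: list = field(default_factory=list)

    def is_closed(self) -> bool:
        # Assumption: a record is closed once a resolution is written
        # and no follow-up items remain.
        return bool(self.resolution.strip()) and not self.follow_up

    def to_dict(self) -> dict:
        """Plain-dict form, convenient for storing or exporting the record."""
        return asdict(self)
```

Keeping follow-up items as an explicit list makes it visible when an issue has a fix in place but is not yet fully closed.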
The following matrix provides a high-level guide for common issue patterns.
| Problem Area | First Check | Second Check | Likely Next Action |
| --- | --- | --- | --- |
| Ingestion failure | Input file or payload validity. | Canonical field mapping. | Correct schema, resubmit, or review validation logs. |
| Replay mismatch | Input and output digests. | Event ordering and lineage. | Quarantine or escalate if governance-impacting. |
| Telemetry gap | Telemetry event count and external status. | Correlation and truth matching fields. | Interpret as enrichment-limited or fix telemetry linkage. |
| Watermark issue | Source event count and worker execution. | Persisted watermark and downstream outputs. | Repair state, rerun processing, or escalate engineering. |
| Routing inconsistency | Source signal and alert creation. | Incident/task routing reason. | Review routing rules and queue eligibility. |
| Dashboard count mismatch | Object type being counted. | Filters and refresh timing. | Confirm whether mismatch is expected or a true defect. |
Troubleshooting in Zahlen should protect evidence quality, replay safety, governance confidence, and operator trust.
Ingestion failures indicate that incoming evidence may be missing, malformed, unmapped, or unprocessed. Replay mismatches indicate that historical conclusions may not be reconstructing as expected. Telemetry gaps indicate missing processing or enrichment context. Watermark issues indicate uncertainty about processing progress. Routing inconsistencies indicate that operational work may not be reaching the expected queue, owner, or supervisor surface.
The safest troubleshooting approach is to follow the evidence path from user-facing symptom to source event, canonical mapping, derived signal, telemetry, replay, routing, and governance status.
A well-documented troubleshooting practice makes Zahlen more operationally trustworthy because it preserves not only what the platform observed, but also how the organization resolved uncertainty when something did not behave as expected.