Zahlen Documentation
8.1 — Troubleshooting Guide

Phase 8 — Supporting Documentation

This guide helps operators, supervisors, and technical teams diagnose ingestion failures, replay mismatches, telemetry gaps, watermark issues, and routing inconsistencies in Zahlen.

Chapter Purpose

The Troubleshooting Guide explains how to investigate common operational problems in Zahlen without losing sight of the platform’s central purpose: turning payment behavior into reliable issuer intelligence.

Troubleshooting in Zahlen is not only a technical activity. It is an evidence-quality activity. When ingestion fails, replay diverges, telemetry is missing, watermarks stop advancing, or routing behaves unexpectedly, the operator must determine whether the issue affects operational confidence, governance trust, or downstream decision-making.

This chapter provides a structured approach to diagnosing problems. It defines each issue type, explains why it matters, describes likely causes, and recommends safe operator actions.

Operator Principle

When troubleshooting Zahlen, first protect evidence integrity. Do not treat missing data, replay divergence, telemetry gaps, or routing anomalies as cosmetic issues. Each can change how much confidence operators should place in the resulting issuer intelligence.

Troubleshooting Mindset

A troubleshooting mindset is the disciplined approach an operator uses to separate symptoms from causes.

A symptom is the visible problem, such as an empty dashboard, a failed ingestion run, or a missing alert. A cause is the underlying condition, such as a malformed CSV, a missing response_code field, an unadvanced watermark, or a replay mismatch.

Zahlen troubleshooting should move from the outer user-facing surface toward the underlying evidence path. Operators should start with what the page or report shows, then confirm, in order: the run, the input data, the canonical mapping, downstream event creation, and replay and telemetry status.

Troubleshooting Layer	Question to Ask	Why It Matters
User-facing surface	What did the operator see?	Defines the visible symptom and affected workflow.
Run or job record	Did the analysis, ingestion, or health run complete?	Shows whether the system executed the expected workflow.
Input evidence	Was the CSV, API event, or stream payload valid?	Determines whether the system had usable evidence.
Canonical mapping	Were source fields mapped to canonical fields correctly?	Protects response_code, issuer identity, recovery, and retry semantics.
Derived signals	Were issuer-health events, alerts, telemetry, or tasks generated?	Shows whether data moved through the intelligence pipeline.
Replay and governance	Can the conclusion be reconstructed and trusted?	Determines whether the result is governance-ready.
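
The outer-to-inner order in the table above can be sketched as an ordered walk that stops at the first layer whose check fails. The Python below is a minimal illustration and not part of Zahlen: the layer identifiers and the per-layer check functions are hypothetical placeholders that a team would replace with its own lookups.

```python
from typing import Callable, Dict, Optional

# Layer identifiers mirror the table rows; the names are illustrative only.
LAYERS = [
    "user_facing_surface",
    "run_or_job_record",
    "input_evidence",
    "canonical_mapping",
    "derived_signals",
    "replay_and_governance",
]

def first_failing_layer(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Walk the layers from outer to inner and return the first failing one.

    Each check is a hypothetical zero-argument function that returns True
    when its layer looks healthy. Layers without a check are skipped.
    """
    for layer in LAYERS:
        check = checks.get(layer)
        if check is not None and not check():
            return layer
    return None
```

Stopping at the first failing layer keeps the investigation focused: a canonical mapping problem usually explains missing derived signals, so the inner layers do not need separate diagnosis until the outer one is resolved.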

Ingestion Failures

An ingestion failure occurs when Zahlen cannot accept, parse, validate, normalize, or process incoming payment evidence.

Ingestion may occur through CSV upload, API submission, or event-stream integration. The failure mode depends on the channel, but the operational meaning is the same: Zahlen’s downstream intelligence may not have received usable evidence.

Ingestion failures matter because issuer intelligence depends on the completeness and correctness of incoming events. If ingestion fails, alerts may not appear, issuer health may not update, recovery curves may be incomplete, and replay evidence may be unavailable.

Common Ingestion Symptom	Likely Cause	Recommended Action
CSV upload fails	The file is malformed, empty, too large, missing a header row, or not readable as CSV.	Re-export the file as UTF-8 CSV, confirm the header row, and retry with a small sample if needed.
Run completes but no findings appear	The file may lack issuer identity, response_code, recovery outcome, or retry lifecycle fields.	Review canonical field mappings and confirm issuer_bin, response_code, recovered, and retry_day equivalents.
API events are rejected	Required fields may be missing, invalid, or incorrectly formatted, or the request may be unauthorized.	Review validation errors, tenant context, event_id, event_at, and canonical field mapping.
API events are accepted but not visible	Events may not have produced downstream signals, or processing may be delayed.	Check event processing status, platform events, run health, and watermark advancement.
Streaming ingestion stalls	Consumer lag, topic mismatch, schema drift, or worker failure may be present.	Check stream topic, consumer group, event envelope, worker heartbeat, and replay offset status.
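
For rejected API events, a pre-submission check can surface the same kinds of problems the table describes. The sketch below is an assumption-laden example, not Zahlen's actual validation logic: event_id, event_at, issuer_bin, and response_code are named in this guide, while tenant_id is a hypothetical stand-in for "tenant context", and the required-field list itself is illustrative.

```python
from datetime import datetime

# Hypothetical required fields. tenant_id is an assumed name for tenant context.
REQUIRED_FIELDS = ("tenant_id", "event_id", "event_at", "issuer_bin", "response_code")

def validate_event(event: dict) -> list:
    """Return a list of validation error strings (empty when none are found)."""
    errors = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if event.get(name) in (None, "")
    ]
    event_at = event.get("event_at")
    if event_at not in (None, ""):
        try:
            # Older Python versions accept only a subset of ISO 8601 here.
            datetime.fromisoformat(event_at)
        except (TypeError, ValueError):
            errors.append("event_at is not an ISO 8601 timestamp")
    return errors
```

Returning every error at once, rather than failing on the first, lets the operator correct a payload in a single pass before resubmitting.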

How to Diagnose CSV Ingestion Problems

CSV ingestion problems should be diagnosed by confirming both file validity and analytical usefulness.

A file can be technically valid but operationally weak. For example, a CSV may load successfully while missing the issuer_bin or response_code fields needed for issuer analysis. In that situation, the issue is not upload failure. The issue is evidence incompleteness.

Diagnostic Check	What It Confirms	Operational Meaning
Header row exists	The file has named columns.	Zahlen can attempt source-to-canonical mapping.
response_code is present or mappable	Decline and authorization behavior can be interpreted.	Issuer response-code analysis can run.
issuer_bin is present or mappable	Issuer identity can be grouped.	Issuer cognition can operate.
event_at or lifecycle timestamp is present	Events can be ordered.	Timeline and replay reconstruction improve.
retry_day or retry lifecycle data exists	Retry windows can be interpreted.	Recovery curve analysis becomes more reliable.
recovered or success field exists	Recovery outcomes can be measured.	Recovery rates and marginal recovery can be calculated.

Troubleshooting Note

If a CSV run completes but produces weak findings, inspect the schema before assuming the analysis engine failed. Missing canonical evidence fields are a common cause of empty or low-confidence results.
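
A schema inspection of this kind can be sketched in a few lines. The example below assumes a hypothetical alias table for source-to-canonical mapping; only the canonical field names (issuer_bin, response_code, event_at, retry_day, recovered) come from this guide, and the aliases are invented for illustration.

```python
import csv
import io

# Canonical names are from the guide; the alias sets are assumptions.
CANONICAL_ALIASES = {
    "issuer_bin": {"issuer_bin", "bin"},
    "response_code": {"response_code", "resp_code"},
    "event_at": {"event_at", "timestamp"},
    "retry_day": {"retry_day"},
    "recovered": {"recovered", "success"},
}

def missing_canonical_fields(csv_text: str) -> list:
    """Return the canonical fields that no header column can be mapped to."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        # An empty file has no header row, so nothing is mappable.
        return sorted(CANONICAL_ALIASES)
    present = {name.strip().lower() for name in header}
    return sorted(
        canonical
        for canonical, aliases in CANONICAL_ALIASES.items()
        if not (aliases & present)
    )
```

A file that loads cleanly but returns a non-empty list here is the "technically valid but operationally weak" case: the upload succeeded, but the evidence is incomplete for the analyses that depend on the missing fields.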

Replay Mismatches

A replay mismatch occurs when Zahlen’s replay process does not reproduce the expected historical conclusion.

Replay is the process of reconstructing prior results from preserved evidence and deterministic logic. Replay is central to governance because it allows operators to verify that a conclusion was not caused by unstable processing, hidden state, non-deterministic ordering, or incomplete evidence.

A replay mismatch should be treated as an integrity issue. It does not always mean the original result was wrong, but it does mean the evidence path requires review before the result is used for strong governance decisions.

Replay Symptom	Likely Cause	Recommended Action
Replay produces a different result	Input ordering, transformation logic, baseline version, or evidence set changed.	Compare input digest, output digest, event ordering, and evaluation version.
Replay cannot find source events	Event lineage or durable storage may be incomplete.	Check event repository, run artifacts, platform events, and retention settings.
Replay completes partially	Some evidence was available but not all required records were present.	Treat the replay as limited and review missing evidence before escalation.
Replay fails with validation errors	Historical events may no longer satisfy current schema or mapping rules.	Review schema compatibility and canonical field migrations.
Replay divergence appears after code changes	Evaluation logic may have changed without compatibility controls.	Confirm whether the change was intentional and whether historical replay contracts were updated.
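
Comparing digests can separate "the evidence set changed" from "the same evidence arrived in a different order". The sketch below is one possible way to compute such digests with SHA-256 over canonical JSON. It is an assumption about how a digest might be built, not a description of how Zahlen computes its input or output digests, and the function names are hypothetical.

```python
import hashlib
import json

def _canonical(event: dict) -> str:
    # Sorted keys and fixed separators give a stable text form per event.
    return json.dumps(event, sort_keys=True, separators=(",", ":"))

def ordered_digest(events: list) -> str:
    """Digest that changes when event content or event order changes."""
    h = hashlib.sha256()
    for event in events:
        h.update(_canonical(event).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()

def content_digest(events: list) -> str:
    """Digest that ignores order, so only the evidence set matters."""
    lines = sorted(_canonical(event) for event in events)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()

def classify_input_drift(original: list, replayed: list) -> str:
    """Label the difference between the original and replayed inputs."""
    if ordered_digest(original) == ordered_digest(replayed):
        return "identical"
    if content_digest(original) == content_digest(replayed):
        return "reordered"
    return "evidence_changed"
```

If the inputs are "identical" but the outputs still differ, the remaining suspects from the table are transformation logic and baseline or evaluation version rather than the evidence itself.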

How to Interpret Replay Mismatch Severity

Not every replay mismatch has the same severity. Severity depends on whether the mismatch affects a dashboard count, a telemetry summary, an incident recommendation, a governance decision, or a public-safe intelligence signal.

Replay Severity	Definition	Recommended Response
Low	Replay mismatch affects a non-critical display or derived label without changing the operational conclusion.	Document the mismatch and correct the display or mapping issue.
Medium	Replay mismatch affects a signal used by operators but not yet escalated.	Review evidence, rerun replay, and avoid escalation until resolved.
High	Replay mismatch affects an incident, recommendation, or supervisor decision.	Pause governance reliance and perform evidence-lineage review.
Critical	Replay mismatch affects public-safe intelligence, audit evidence, or cross-domain governance.	Quarantine the signal and escalate for governance review.
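
The severity table can be expressed as a small classification function. The scope tags below are hypothetical labels derived from the definitions in the table; a real deployment would need its own way of determining what a mismatch affects.

```python
from enum import Enum

class ReplaySeverity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

# Hypothetical tags for what a mismatch touches, grouped per the table.
CRITICAL_SCOPES = {"public_safe_intelligence", "audit_evidence", "cross_domain_governance"}
HIGH_SCOPES = {"incident", "recommendation", "supervisor_decision"}
MEDIUM_SCOPES = {"operator_signal"}

def replay_severity(affected_scopes: set) -> ReplaySeverity:
    """Return the highest severity implied by the affected scopes."""
    if affected_scopes & CRITICAL_SCOPES:
        return ReplaySeverity.CRITICAL
    if affected_scopes & HIGH_SCOPES:
        return ReplaySeverity.HIGH
    if affected_scopes & MEDIUM_SCOPES:
        return ReplaySeverity.MEDIUM
    # Anything else is treated as a non-critical display or label issue.
    return ReplaySeverity.LOW
```

Checking the most severe scopes first means a mismatch that touches both a display label and audit evidence is classified by its worst effect.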

Governance Rule

A replay-divergent signal should not be treated as fully governance-ready. Resolve the mismatch, explain the limitation, or quarantine the signal before using it for formal decisions.

Telemetry Gaps

A telemetry gap occurs when platform-processing evidence is missing, incomplete, delayed, or not linked to the underlying issuer signal.

Telemetry explains how the platform processed, enriched, validated, and interpreted evidence. It may include ingestion counts, truth matching results, external enrichment status, warning counts, platform event creation, worker status, processing lag, and enrichment outcomes.

Telemetry gaps matter because they reduce the operator’s ability to understand evidence quality. A signal may still be useful without complete telemetry, but the operator should know which processing context is missing.

Telemetry Symptom	Likely Cause	Recommended Action
truth_confidence_band shows NONE	Truth enrichment may not have matched evidence or may not have run.	Check truth_matches_found, truth_matched_by, and external_status before interpreting as a failed signal.
external_status shows NOT_RUN	External enrichment or validation was not executed for the run.	Treat the result as internal telemetry-only for that enrichment dimension.
Telemetry event count is zero	Telemetry generation may not be wired for that workflow or no telemetry was produced.	Check whether the route, job, or service emits telemetry for that path.
Telemetry exists but is not linked	Correlation identifiers, run identifiers, or signal identifiers may be missing.	Review correlation_id, run_id, job_id, issuer context, and event linkage.
Telemetry appears delayed	Processing lag or worker delay may be present.	Check latest event time, worker heartbeat, and queue depth.

How to Interpret Missing Truth Data

Truth data refers to validated reference evidence used to confirm, enrich, or calibrate an observed payment signal.

If truth fields show NONE, zero, or NOT_RUN, the operator should not automatically conclude that the underlying issuer signal is invalid. The correct interpretation is that the signal was not linked to truth evidence for that run or enrichment path.

For example, a telemetry context that shows zero truth-linked events and an external_status of NOT_RUN is still consistent with a CSV analysis that ran successfully. It simply means live or external truth enrichment was not executed or did not produce matches.

Operator Interpretation

A telemetry gap weakens enrichment context, not necessarily the underlying payment evidence. Separate the question “did the issuer signal exist?” from the question “was the signal externally truth-linked?”
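
The distinction between "enrichment did not run" and "enrichment ran but found nothing" can be made explicit in code. The sketch below uses the field names mentioned in this guide (truth_confidence_band, truth_matches_found, external_status); the dictionary shape, the default values, and the returned labels are assumptions for illustration.

```python
def interpret_truth_context(telemetry: dict) -> str:
    """Describe the truth-enrichment context without judging the signal itself."""
    band = telemetry.get("truth_confidence_band", "NONE")
    matches = telemetry.get("truth_matches_found", 0)
    external = telemetry.get("external_status", "NOT_RUN")

    if external == "NOT_RUN":
        # Enrichment was never executed, so absence of matches says nothing.
        return "enrichment_not_run"
    if matches == 0 or band == "NONE":
        # Enrichment executed but did not link the signal to truth evidence.
        return "enrichment_ran_no_match"
    return "truth_linked"
```

None of the three labels states that the issuer signal is invalid. The function answers only the second of the two operator questions: whether the signal was externally truth-linked.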

Watermark Issues

A watermark issue occurs when Zahlen cannot reliably determine how far an ingestion, replay, monitoring, or event-processing workflow has advanced.

A watermark is a progress marker. It may record the latest processed event, offset, timestamp, run identifier, replay epoch, or stream position. Watermarks help prevent duplicate processing, missed events, and uncertain replay boundaries.

Watermark issues matter because they affect operational continuity. If a watermark does not advance, downstream signals may stop updating. If a watermark advances incorrectly, events may be skipped. If a watermark regresses unexpectedly, duplicate processing may occur.
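
The three failure patterns described above (no advance, incorrect advance, regression) can be distinguished by comparing consecutive watermark values. The sketch below assumes an integer, offset-style watermark in which each processed event advances the offset by exactly one. That is a simplifying assumption: timestamp or epoch watermarks would need a different comparison.

```python
def classify_watermark_step(previous: int, current: int, processed_count: int) -> str:
    """Classify one watermark transition for an offset-style watermark."""
    if current < previous:
        # Regression: events may be processed twice.
        return "regressed"
    if current == previous:
        # No advance: processing stalled or no qualifying events existed.
        return "not_advanced"
    if current - previous > processed_count:
        # The offset moved further than the processed count explains.
        return "jumped"
    return "advanced"
```

A "not_advanced" result is ambiguous on its own. It needs to be read alongside the source event count to tell a stalled processor apart from a quiet period with no qualifying events.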

Watermark Symptom	Likely Cause	Recommended Action
Watermark does not advance	Processing may be stalled, no qualifying events exist, or persistence failed.	Check event counts, worker status, processing logs, and repository writes.
Watermark advances but no output appears	Events may be processed but filtered, suppressed, or not converted into signals.	Review eligibility filters, thresholds, and downstream event creation.
Watermark jumps unexpectedly	The processor may have skipped events or used an incorrect offset.	Compare event counts, source offsets, run summaries, and persisted watermark history.
Watermark resets to an older value	A persistence failure, environment-isolation problem, or state directory mismatch may be present.	Verify storage path, environment configuration, and deployment state.
Replay watermark differs from live watermark	Replay and live processing may use different namespaces or epochs.	Confirm replay namespace, environment classification, and replay-safe boundary rules.

Watermark Troubleshooting Workflow

Watermark troubleshooting should begin with the source event count and end with downstream signal verification.

Step	Question	Evidence to Review
Confirm source events	Did new source events exist for the processing window?	Input records, event stream, CSV rows, API receipts.
Confirm processor execution	Did the worker or service run?	Run history, worker heartbeat, job record, logs.
Confirm processed count	Did the service process any events?	Run summary, processed count, skipped count.
Confirm persisted watermark	Was progress written durably?	Watermark repository, state directory, database record.
Confirm downstream output	Were signals, alerts, or platform events created?	Issuer-health rows, alerts, event store, dashboard counts.

Operational Warning

A watermark issue can silently affect confidence. The dashboard may look calm because no new events were processed, not because issuer behavior was healthy.
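
One way to tell a calm dashboard from a stalled one is to compare the newest source event time with the time the watermark last covered. The sketch below is illustrative: the function name, the returned labels, and the 15-minute lag threshold are all assumptions, not Zahlen defaults.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def quiet_dashboard_reason(
    latest_source_event_at: Optional[datetime],
    watermark_at: Optional[datetime],
    max_lag: timedelta = timedelta(minutes=15),  # assumed threshold
) -> str:
    """Explain why a dashboard might show no new activity."""
    if latest_source_event_at is None:
        # Nothing arrived upstream, so a quiet dashboard is expected.
        return "no_source_events"
    if watermark_at is None or latest_source_event_at - watermark_at > max_lag:
        # Source events exist beyond what processing has covered.
        return "processing_behind"
    return "caught_up"
```

Only the "caught_up" result supports reading a quiet dashboard as evidence about issuer behavior. The other two results describe the pipeline, not the issuers.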

Routing Inconsistencies

A routing inconsistency occurs when alerts, incidents, tasks, action-queue items, or escalation guidance do not appear in the expected operational destination.

Routing is the process of assigning an operational item to the correct queue, owner, severity, priority, workflow, or supervisor path. In Zahlen, routing may move an issuer-health alert into an incident, a task, an action queue, an escalation recommendation, or a supervisor dashboard.

Routing inconsistencies matter because they affect operator response. If a serious issuer signal is routed incorrectly, it may not receive timely investigation. If a low-confidence signal is routed too aggressively, operators may waste time or over-escalate.

Routing Symptom	Likely Cause	Recommended Action
Alert exists but no incident appears	Auto-creation rules may not have run or thresholds may not have been met.	Check incident creation settings, alert severity, confidence, and auto-create workflow.
Incident exists but no action-queue item appears	Task creation or queue routing may not have been triggered.	Review task linkage, routing service output, and queue eligibility.
Item routed to wrong queue	Routing rules may map severity, issuer country, metric, or owner incorrectly.	Review routing reason, target queue, severity, priority, and rule configuration.
Escalation guidance appears unexpectedly	Aging, unowned, unresolved, or priority rules may be triggering guidance.	Review item age, owner assignment, resolution status, and escalation reason.
Supervisor dashboard count differs from queue	Filters, refresh timing, or aggregation logic may differ.	Compare query filters, latest refresh, severity filters, and source tables.

How to Troubleshoot Operational Routing

Operational routing should be diagnosed by following the item from original signal to final operator surface.

The operator should identify the source signal, confirm whether it generated an alert, determine whether the alert created an incident or task, review the routing reason, and confirm whether the item reached the expected queue or supervisor surface.

Routing Check	What It Confirms	Why It Matters
Source signal	The original issuer-health or monitoring signal exists.	Confirms that routing had evidence to act on.
Alert creation	The source signal generated an alert.	Shows whether the alerting threshold was met.
Incident creation	The alert created or linked to an incident.	Shows whether case workflow began.
Task creation	The incident or alert created an operational task.	Shows whether work entered the action path.
Queue assignment	The task was assigned to the expected queue.	Supports operator workflow correctness.
Escalation guidance	The system recommended escalation based on defined conditions.	Supports supervisor coordination.
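
Following an item from source signal to queue amounts to finding the first missing link in a chain. The sketch below simplifies the workflow to a strictly linear chain and uses hypothetical identifier fields; in practice a task may be created directly from an alert without an incident, so a missing incident is not always a defect.

```python
from typing import Optional

# (check name from the table, hypothetical field holding the linked identifier)
ROUTING_CHAIN = (
    ("source_signal", "source_signal_id"),
    ("alert_creation", "alert_id"),
    ("incident_creation", "incident_id"),
    ("task_creation", "task_id"),
    ("queue_assignment", "queue"),
)

def first_missing_link(item: dict) -> Optional[str]:
    """Return the first routing check with no linked identifier, or None."""
    for check_name, field in ROUTING_CHAIN:
        if not item.get(field):
            return check_name
    return None
```

The returned check name indicates where to look next: "alert_creation" points toward alerting thresholds, while "queue_assignment" points toward routing rules and queue eligibility.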

Dashboard and Count Inconsistencies

A dashboard inconsistency occurs when counts, statuses, or tables differ across pages in ways that are not immediately clear.

Some differences are expected. One page may show alerts, another may show action-queue tasks, and another may show incidents. These are related but not identical objects. An alert is a signal. An incident is a case. A task is an operational work item. Escalation guidance is a recommendation layer. Counts may differ because they represent different workflow stages.

A true inconsistency occurs when the same object type should match across surfaces but does not, or when a workflow relationship is expected but missing.

Visible Difference	Possible Explanation	Recommended Check
Alert count differs from queue count	Not every alert may create a queue item, or filters may differ.	Compare severity filters, queue eligibility, and routing rules.
Incident count differs from alert count	Incidents may be grouped by issuer cohort rather than one incident per alert.	Check incident IDs and cohort grouping logic.
Supervisor dashboard differs from Action Queue	Supervisor may aggregate escalation or ownership fields differently.	Compare source query, filters, and refresh timing.
System Health shows completed run but Monitor has empty Radar	Issuer-health events may exist without crossing Radar promotion thresholds.	Check Radar promotion thresholds and behavior-feed eligibility.
Latest timestamp differs across pages	Pages may summarize different objects or refresh at different times.	Check object type, run time, alert time, and page refresh cadence.

Operator Note

Do not assume count differences are errors. First identify whether the pages are counting the same object type: events, alerts, incidents, tasks, escalations, runs, or public-safe signals.
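
The order of checks in this note (object type first, then filters, then the counts themselves) can be sketched as a short decision function. The parameter names and result labels are hypothetical, and refresh timing is left out for brevity.

```python
def assess_count_difference(
    type_a: str, count_a: int, filters_a: dict,
    type_b: str, count_b: int, filters_b: dict,
) -> str:
    """Decide whether two surface counts are expected to match."""
    if type_a != type_b:
        # Alerts, incidents, and tasks are different workflow stages.
        return "different_object_types"
    if filters_a != filters_b:
        # Same object type, but the pages are not asking the same question.
        return "filters_differ"
    if count_a != count_b:
        return "true_inconsistency"
    return "consistent"
```

Only the "true_inconsistency" result suggests a defect. The first two results describe differences that are expected until the pages are compared on equal terms.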

Environment and Configuration Problems

Environment problems occur when the running system uses an unexpected database, state directory, jobs directory, API state path, environment namespace, or deployment configuration.

Configuration mistakes can create confusing symptoms. For example, a run may complete in one environment while the dashboard reads another environment’s database. A service may write job artifacts to one directory while the route expects another. A replay process may use a different namespace from live processing.

Configuration Symptom	Likely Cause	Recommended Action
Run exists but dashboard does not show it	The route may read a different database or jobs directory.	Verify database path and job artifact directory for the running service.
Dev site differs from local results	Different environment, database, or deployment version is active.	Confirm service deployment, environment variables, and source version.
Replay behavior differs by environment	Replay namespace or environment classification differs.	Confirm replay namespace and environment-isolation settings.
Watermark state disappears after restart	State path may be non-durable or misconfigured.	Verify persistent state directory and service permissions.
API state differs from UI state	API and UI may point to different state paths.	Check service configuration and route dependencies.
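
Many of these symptoms come down to two components reading different settings. A side-by-side comparison of the effective configuration can make that visible. In the sketch below the configuration keys are hypothetical names for the paths and namespaces discussed in this section; actual setting names will depend on the deployment.

```python
# Hypothetical setting names for the paths and namespaces discussed above.
COMPARED_KEYS = (
    "environment",
    "database_path",
    "jobs_dir",
    "state_dir",
    "replay_namespace",
)

def config_mismatches(writer: dict, reader: dict) -> dict:
    """Return {key: (writer_value, reader_value)} for settings that differ."""
    return {
        key: (writer.get(key), reader.get(key))
        for key in COMPARED_KEYS
        if writer.get(key) != reader.get(key)
    }
```

For example, comparing the configuration of the service that writes job artifacts with that of the route that reads them would show a jobs_dir mismatch directly, instead of leaving it to be inferred from a run that exists but does not appear on the dashboard.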

Escalation Criteria

Escalation criteria define when an issue should move from routine troubleshooting to supervisor, governance, or engineering review.

A routine issue can be resolved by correcting input data, mapping fields, rerunning a job, or reviewing filters. A governance issue affects replay consistency, tenant safety, public-safe publication, audit evidence, or cross-domain trust. An engineering issue affects code, persistence, workers, service configuration, or route integration.

Escalation Type	When to Escalate	Recommended Recipient
Operator escalation	The issue affects workflow assignment, unresolved tasks, or investigation clarity.	Supervisor or operations lead.
Governance escalation	The issue affects replay consistency, lineage, confidence, tenant safety, or public-safe signals.	Governance reviewer or compliance owner.
Engineering escalation	The issue appears to involve code, database schema, worker execution, persistence, or deployment.	Engineering team.
Security escalation	The issue may expose tenant data or violate access boundaries.	Security or platform owner.
Product escalation	The issue reflects confusing workflow design or ambiguous operator experience.	Product owner or documentation owner.

Recommended Troubleshooting Record

A troubleshooting record is a concise written account of the issue, investigation steps, findings, and resolution.

Troubleshooting records matter because Zahlen is an operational intelligence platform. When evidence quality or governance trust is affected, the organization should preserve what happened and how the issue was resolved.

Record Field	Definition	Why It Matters
Issue summary	A short description of the visible problem.	Helps future readers understand the symptom.
Affected page or workflow	The dashboard, route, job, export, replay, or ingestion path involved.	Locates the problem in the operator experience.
Evidence reviewed	The run, file, event, alert, incident, task, or telemetry records checked.	Documents the evidence path.
Root cause	The underlying condition that caused the issue.	Prevents repeated troubleshooting.
Impact assessment	The operational or governance impact of the issue.	Explains whether confidence was affected.
Resolution	The fix or corrective action taken.	Creates durable operational memory.
Follow-up	Any remaining work, tests, monitoring, or documentation updates.	Ensures the issue is fully closed.
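
The record fields above map naturally onto a small data structure. The sketch below is one possible shape, with field names adapted from the table; the class itself and the completeness check are illustrative rather than an existing Zahlen schema.

```python
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class TroubleshootingRecord:
    issue_summary: str
    affected_workflow: str
    evidence_reviewed: List[str]
    root_cause: str
    impact_assessment: str
    resolution: str
    follow_up: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of required fields that are still empty.

        follow_up is optional, since a fully closed issue may have none.
        """
        return [
            name
            for name, value in asdict(self).items()
            if name != "follow_up" and not value
        ]
```

A record with an empty root_cause or resolution can still be saved while an investigation is open. The missing_fields check simply makes it visible that the record is not yet complete.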

Quick Reference: Troubleshooting Decision Matrix

The following matrix provides a high-level guide for common issue patterns.

Problem Area	First Check	Second Check	Likely Next Action
Ingestion failure	Input file or payload validity.	Canonical field mapping.	Correct schema, resubmit, or review validation logs.
Replay mismatch	Input and output digests.	Event ordering and lineage.	Quarantine or escalate if governance-impacting.
Telemetry gap	Telemetry event count and external status.	Correlation and truth matching fields.	Interpret as enrichment-limited or fix telemetry linkage.
Watermark issue	Source event count and worker execution.	Persisted watermark and downstream outputs.	Repair state, rerun processing, or escalate engineering.
Routing inconsistency	Source signal and alert creation.	Incident/task routing reason.	Review routing rules and queue eligibility.
Dashboard count mismatch	Object type being counted.	Filters and refresh timing.	Confirm whether mismatch is expected or a true defect.

Chapter Summary

Troubleshooting in Zahlen should protect evidence quality, replay safety, governance confidence, and operator trust.

Ingestion failures indicate that incoming evidence may be missing, malformed, unmapped, or unprocessed. Replay mismatches indicate that historical conclusions may not be reconstructing as expected. Telemetry gaps indicate missing processing or enrichment context. Watermark issues indicate uncertainty about processing progress. Routing inconsistencies indicate that operational work may not be reaching the expected queue, owner, or supervisor surface.

The safest troubleshooting approach is to follow the evidence path from user-facing symptom to source event, canonical mapping, derived signal, telemetry, replay, routing, and governance status.

A well-documented troubleshooting practice makes Zahlen more operationally trustworthy because it preserves not only what the platform observed, but also how the organization resolved uncertainty when something did not behave as expected.