Zahlen Documentation

3.7 - System Health Documentation

Health Runs, Replay Integrity, Watermark Advancement, Event Durability, and Operational Survivability

Operator Manual - Phase 3

Purpose of this chapter
This chapter explains how operators should use the System Health surface to confirm that Zahlen is processing issuer-health runs reliably, preserving replay integrity, advancing watermarks correctly, emitting durable events, and maintaining operational survivability.

Overview

The System Health documentation describes the operational health layer of Zahlen. In the current operator navigation, the System surface is represented by the Issuer Health Runs Health page. This page is intentionally narrow and operational. It does not attempt to explain every issuer behavior pattern. Instead, it answers a more fundamental question: is the issuer-health processing system functioning correctly and preserving the evidence needed for trustworthy operations?

This distinction is important because the System Health surface supports every higher-level intelligence layer. Dashboard metrics, monitor alerts, incident workspaces, action queues, supervisor views, and network intelligence surfaces all depend on reliable run execution, durable event generation, replay-safe evidence, and stable operational continuity.

In practice, System Health is the operator’s first confirmation that Zahlen is alive, processing, and preserving operational truth. If this layer is unstable, higher-level intelligence should be interpreted cautiously until the underlying processing health is understood.

Health Runs

A health run is a recorded execution of the issuer-health processing pipeline. The run represents a bounded processing event in which Zahlen reads issuer-related operational inputs, evaluates signals, updates health state, and emits downstream platform events where appropriate.

The Health card indicates whether the run-history subsystem is currently reporting a healthy state. A healthy state means that recent processing completed without failure and that the system has enough run-history evidence to support normal operator use. A warning, degraded, or failed state indicates that the operational foundation should be reviewed before relying on downstream alerts or dashboard conclusions.

The Runs metric counts the total number of recorded issuer-health runs, which gives operators a quick view of execution history. A small number may be normal in a new environment, while a count that suddenly stops growing may indicate that ingestion, scheduling, or background processing has stopped.

The Completed metric counts runs that finished successfully. Completed runs are important because they indicate that the pipeline reached a terminal successful state and produced usable operational evidence. The Failed metric counts runs that did not complete successfully. A failed run should be treated as an operational issue because it may prevent issuer-health signals from reaching dashboards, alerts, action queues, or network intelligence surfaces.

Dry Runs are executions used for validation, testing, or non-production rehearsal rather than normal operational processing. They matter because they allow operators and engineers to test ingestion and replay behavior without treating the result as a live operational event.

Term	Operational Meaning	How Operators Should Interpret It
Health	The current high-level operating state of the issuer-health run subsystem.	Healthy indicates normal processing. Warning, degraded, or failed states should trigger investigation before relying on downstream intelligence.
Runs	The number of recorded issuer-health processing executions.	A rising count indicates continuing processing activity. A stalled count may indicate scheduler, ingestion, or runtime interruption.
Completed	The number of runs that reached a successful terminal state.	Completed runs confirm that usable operational evidence was produced.
Failed	The number of runs that did not complete successfully.	Failures should be reviewed because they may interrupt monitoring, alerting, incident creation, or event emission.
Dry Runs	Validation or rehearsal executions that do not represent ordinary live processing.	Dry runs are useful for testing but should not be confused with normal production evidence.
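
The relationship between the four counters can be illustrated with a minimal Python sketch. The RunRecord type, its field names, and the status strings are assumptions made for illustration and are not taken from the Zahlen schema. The sketch also assumes that a dry run is included in Runs and counted under its own status, which the page may or may not do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    # Hypothetical record shape; the real run schema may differ.
    run_id: str
    status: str            # assumed values: "completed" or "failed"
    dry_run: bool = False  # True for validation or rehearsal executions


def summarize_runs(runs):
    """Derive the Runs, Completed, Failed, and Dry Runs counters from run records."""
    return {
        "runs": len(runs),
        "completed": sum(1 for r in runs if r.status == "completed"),
        "failed": sum(1 for r in runs if r.status == "failed"),
        "dry_runs": sum(1 for r in runs if r.dry_run),
    }


example_runs = [
    RunRecord("r1", "completed"),
    RunRecord("r2", "completed"),
    RunRecord("r3", "failed"),
    RunRecord("r4", "completed", dry_run=True),
]
summary = summarize_runs(example_runs)
```

Under these assumptions, a non-zero failed count is the value that would prompt a review of run history before relying on downstream surfaces.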

Replay Integrity

Replay integrity is the ability of Zahlen to reconstruct operational conclusions from preserved event lineage and deterministic processing rules. In a payment intelligence system, replay integrity is essential because operators need to trust that historical evidence can be reprocessed, audited, and compared without producing unexplained changes in conclusions.

The Latest Start and Latest End cards support replay integrity indirectly by showing the execution window of the most recent run. Latest Start records when processing began. Latest End records when processing completed. Together, these timestamps allow operators to confirm that the run was bounded and did not remain stuck in an incomplete execution state.

Latest Status indicates the terminal status of the most recent run. A completed status indicates that the run finished normally. A failed, partial, or stuck status indicates that the run may not have produced complete, replay-safe evidence.

Replay integrity is not only a technical concern. It is a governance concern. If a payment intelligence platform cannot reconstruct why it produced a conclusion, then the conclusion becomes difficult to trust during audits, incident reviews, supervisor escalations, or public-safe intelligence generation.

Term	Operational Meaning	How Operators Should Interpret It
Replay Integrity	The preservation of enough deterministic event lineage and processing structure to reconstruct historical operational conclusions.	Strong replay integrity means operators can trust historical analysis. Weak replay integrity means conclusions may be difficult to audit or reproduce.
Latest Start	The timestamp when the most recent issuer-health run began.	Use it to confirm recent processing activity and identify stale or missing runs.
Latest End	The timestamp when the most recent issuer-health run completed.	Use it to detect stuck, incomplete, or unusually long processing windows.
Latest Status	The final state of the most recent run.	Completed supports normal trust. Failed or partial statuses require operational review.
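
One way to read Latest Start, Latest End, and Latest Status together is sketched below in Python. The function name, the returned labels, and both thresholds (max_duration and max_age) are hypothetical values chosen for illustration; appropriate limits depend on how often runs are expected in a given environment.

```python
from datetime import datetime, timedelta, timezone


def classify_latest_run(latest_start, latest_end, latest_status, now,
                        max_duration=timedelta(hours=1),
                        max_age=timedelta(hours=24)):
    """Classify the most recent run from its execution window and status."""
    if latest_start is None:
        return "missing"       # no run has been recorded at all
    if latest_end is None:
        # The run started but has not recorded an end timestamp.
        return "stuck" if now - latest_start > max_duration else "running"
    if latest_status != "completed":
        return "needs_review"  # failed or partial terminal status
    if now - latest_end > max_age:
        return "stale"         # completed, but too long ago
    return "ok"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
result = classify_latest_run(
    now - timedelta(minutes=30), now - timedelta(minutes=20), "completed", now
)
```

The sketch treats a run without an end timestamp as unbounded, which matches the concern described above: a run that never closes its window cannot be assumed to have produced complete evidence.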

Watermark Advancement

Watermark advancement is the mechanism by which Zahlen tracks incremental processing progress. A watermark represents a durable marker that tells the system how far it has progressed through an ordered stream or batch of operational events.

The Watermark Advanced metric indicates whether a run moved the processing boundary forward. A positive advancement means that the system processed new evidence and updated its durable progress marker. A value of zero may be normal when no new input exists, but it may also indicate that data did not flow, that duplicate evidence was ignored, or that incremental processing did not progress.

Watermarks are important because they protect the platform from duplicate processing, missing event windows, inconsistent replay behavior, and ambiguous operational boundaries. In a production-scale event-driven architecture, watermark durability becomes one of the central controls that allows the system to scale without losing deterministic continuity.

Operators do not need to manage watermarks manually during ordinary use. However, operators should understand that watermark behavior explains whether the system is advancing through new operational evidence or simply confirming that no new evidence was available.

Term	Operational Meaning	How Operators Should Interpret It
Watermark	A durable processing marker that records how far Zahlen has progressed through ordered operational evidence.	Watermarks help prevent duplicate processing and support deterministic incremental execution.
Watermark Advanced	A count or indicator showing whether the latest run moved the processing boundary forward.	A positive value indicates new progress. Zero may be normal when idle, but repeated zero values should be interpreted alongside run activity and input availability.
Incremental Processing	Processing that handles only new or not-yet-processed evidence rather than rebuilding everything from scratch.	Incremental processing supports scale, but it requires trustworthy watermarks.
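
The idea of watermark advancement can be shown with a small Python sketch. It assumes that each event carries a monotonically increasing sequence number and that the watermark stores the highest sequence already processed. This is a generic incremental-processing pattern, not a description of how Zahlen stores or advances its watermarks.

```python
def process_increment(events, watermark):
    """Process only events beyond the watermark.

    events: iterable of (sequence, payload) tuples, in any order.
    Returns (new_watermark, advanced), where advanced is the number of
    newly processed events.
    """
    # Keep only events that have not yet been processed, in sequence order.
    new_events = sorted((e for e in events if e[0] > watermark),
                        key=lambda e: e[0])
    if not new_events:
        return watermark, 0   # nothing new: the watermark does not move
    return new_events[-1][0], len(new_events)


events = [(1, "a"), (2, "b"), (3, "c")]
first = process_increment(events, watermark=1)          # two new events
second = process_increment(events, watermark=first[0])  # same input again
```

Running the same input a second time advances nothing, which is the duplicate-protection property described above. It also shows why a zero value alone is ambiguous: the function returns zero both when no input arrived and when all input was already processed.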

Event Durability

Event durability is the ability of Zahlen to preserve operational events so that downstream systems can rely on them. In the System Health surface, Total Rows, Total Processed, and Platform Events help operators understand whether evidence entered the system, whether it was processed, and whether it generated downstream event records.

Total Rows represents the number of input rows observed by the run. This value describes the size of the evidence set presented to the processing pipeline. Total Processed represents the number of rows that were successfully converted into usable operational evidence. If Total Rows and Total Processed differ, the difference may reflect normal filtering or duplicate suppression, or it may point to validation failures or an ingestion mismatch.

Platform Events represent durable operational events emitted by the processing layer. These events are important because they connect System Health to the rest of Zahlen. Alerts, dashboards, investigations, network intelligence, and governance surfaces depend on the presence of reliable platform events.

Event durability matters because operational intelligence cannot be stronger than the evidence it preserves. If events are not durable, then downstream intelligence may become incomplete, non-replayable, or difficult to audit.

Term	Operational Meaning	How Operators Should Interpret It
Total Rows	The number of input rows observed during the processing run.	Use this to understand the size of the evidence set entering the system.
Total Processed	The number of rows successfully converted into usable operational evidence.	Compare this against Total Rows to identify filtering, validation, or ingestion issues.
Platform Events	Durable operational events emitted for downstream monitoring, alerting, investigation, and governance systems.	A healthy event count confirms that processed evidence is moving into the broader Zahlen intelligence architecture.
Event Durability	The preservation of operational events in a form that can be trusted by downstream systems and replay processes.	Weak event durability undermines dashboards, alerts, investigations, and governance confidence.
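
The comparison of Total Rows, Total Processed, and Platform Events can be expressed as a simple reconciliation check. The Python sketch below is illustrative only: the function name and the 5 percent tolerance are assumptions, and the findings are prompts for review rather than proof of a fault, since filtering and duplicate suppression can legitimately reduce the processed count.

```python
def reconcile_run(total_rows, total_processed, platform_events,
                  max_unprocessed_ratio=0.05):
    """Return a list of findings worth reviewing for a single run."""
    findings = []
    if total_processed > total_rows:
        findings.append("processed count exceeds rows observed")
    elif total_rows > 0:
        unprocessed = total_rows - total_processed
        if unprocessed / total_rows > max_unprocessed_ratio:
            findings.append(
                f"{unprocessed} of {total_rows} rows were not processed"
            )
    if total_processed > 0 and platform_events == 0:
        findings.append(
            "rows were processed but no platform events were emitted"
        )
    return findings


clean = reconcile_run(1000, 990, 12)
suspect = reconcile_run(1000, 600, 0)
```

An empty list means the three values are consistent within the assumed tolerance. A run that processes rows but emits no events is flagged because downstream surfaces depend on those events, although it can be legitimate when no signals warranted emission.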

Operational Survivability

Operational survivability is the ability of the platform to continue preserving deterministic reasoning, event continuity, replay integrity, and system visibility during instability or scale. System Health is one of the primary surfaces operators use to confirm that this foundation remains intact.

A survivable platform does not merely function when conditions are normal. It preserves operational trust when runs fail, evidence volume changes, event emission fluctuates, watermarks stall, or replay validation becomes necessary.

In Zahlen, operational survivability connects technical health to institutional trust. A healthy System surface tells operators that the platform is preserving the foundation needed for issuer intelligence. A degraded System surface tells operators that higher-level intelligence may need careful interpretation until the underlying processing issue is resolved.

System Health should therefore be reviewed during startup, after ingestion changes, after large CSV uploads, after event-stream changes, after replay validation work, and whenever dashboards or alerts appear unexpectedly quiet or unusually noisy.

Term	Operational Meaning	How Operators Should Interpret It
Operational Survivability	The platform’s ability to preserve deterministic reasoning, event continuity, replay integrity, and operator visibility during adverse or changing conditions.	Strong survivability means Zahlen remains trustworthy under stress. Weak survivability means operators should investigate the processing foundation.
Processing Continuity	The continued execution of issuer-health runs over time.	Gaps in continuity may indicate scheduler, ingestion, or runtime issues.
System Visibility	The ability of operators to see whether processing, event emission, and replay foundations are healthy.	Visibility reduces uncertainty and helps operators avoid trusting stale or incomplete intelligence.

Status Counts and Mode Counts

The Status Counts section summarizes run outcomes by state. For example, a completed status count shows how many runs reached a successful terminal state. This table helps operators quickly understand whether the run history is dominated by healthy execution or operational failure.

The Mode Counts section summarizes the types of processing modes that produced runs. A mode describes the operational pathway used by the run, such as CSV job signal synchronization. Mode visibility matters because different processing pathways may have different reliability, evidence, replay, and operational implications.

CSV job signal synchronization is a mode in which signals derived from uploaded or processed CSV job outputs are synchronized into issuer-health events. This mode is important because it connects first-time analysis workflows to the broader issuer-health monitoring and operational intelligence layers.

Term	Operational Meaning	How Operators Should Interpret It
Status Counts	A summary of run outcomes grouped by terminal status.	Use this to identify whether failures, partial runs, or completed runs dominate the recent run history.
Mode Counts	A summary of run executions grouped by processing mode.	Use this to understand which ingestion or processing pathways are active.
CSV Job Signal Sync	A processing mode that synchronizes CSV-derived job signals into issuer-health operational evidence.	This mode confirms that CSV analysis is feeding downstream monitoring and intelligence surfaces.
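
Status Counts and Mode Counts are both group-by summaries over the run history. A minimal Python sketch using collections.Counter is shown below. The dictionary keys and the identifier csv_job_signal_sync are assumed spellings for illustration and may not match the values displayed on the page.

```python
from collections import Counter

# Hypothetical run history; keys and values are illustrative only.
runs = [
    {"status": "completed", "mode": "csv_job_signal_sync"},
    {"status": "completed", "mode": "csv_job_signal_sync"},
    {"status": "failed", "mode": "csv_job_signal_sync"},
]

# Group runs by terminal status and by processing mode.
status_counts = Counter(run["status"] for run in runs)
mode_counts = Counter(run["mode"] for run in runs)

# Share of runs that failed, as a quick indicator of which outcome dominates.
failure_share = status_counts["failed"] / len(runs) if runs else 0.0
```

Counter returns zero for a status that never occurred, so a history with no failures yields a failure share of zero without special handling.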

Recommended Operator Workflow

Operators should begin by reviewing the Health card. A healthy state supports normal confidence in downstream surfaces. If health is degraded or failed, the operator should review run history before relying on dashboards, alerts, or investigations.

Next, operators should compare Runs, Completed, and Failed. A healthy environment should show recent completed runs and limited or no failures. Failed runs should be treated as operational evidence that the pipeline may not be fully preserving intelligence continuity.

Operators should then review Latest Start, Latest End, and Latest Status. These fields reveal whether recent processing occurred and whether it completed successfully. A missing or stale latest timestamp may indicate that ingestion or scheduling is not active.

After confirming run status, operators should inspect Watermark Advanced. This field should be interpreted in context. Zero advancement may be normal during idle periods, but repeated zero values combined with expected new input may indicate a processing or ingestion problem.

Finally, operators should compare Total Rows, Total Processed, and Platform Events. These values show whether input evidence was observed, converted into usable records, and emitted as durable operational events. Significant discrepancies should be investigated before trusting downstream conclusions.
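
The review order described above can be summarized as an ordered checklist. The Python sketch below assumes a hypothetical snapshot dictionary whose keys mirror the card names on the page. The expected_new_input flag is an additional assumption that stands in for the operator's own knowledge of whether new input was expected.

```python
def review_system_health(snapshot):
    """Return concerns in the recommended review order (empty list if none)."""
    concerns = []
    # Step 1: Health card.
    if snapshot["health"] != "healthy":
        concerns.append("health is not healthy: review run history")
    # Step 2: Runs, Completed, and Failed.
    if snapshot["failed"] > 0:
        concerns.append("failed runs present: review failures")
    # Step 3: Latest Start, Latest End, and Latest Status.
    if snapshot["latest_end"] is None or snapshot["latest_status"] != "completed":
        concerns.append("latest run did not complete: check scheduling and ingestion")
    # Step 4: Watermark Advanced, interpreted in context.
    if snapshot["expected_new_input"] and snapshot["watermark_advanced"] == 0:
        concerns.append("watermark did not advance despite expected input")
    # Step 5: Total Rows, Total Processed, and Platform Events.
    if snapshot["total_processed"] < snapshot["total_rows"]:
        concerns.append("processed rows below observed rows: check filtering or validation")
    if snapshot["total_processed"] > 0 and snapshot["platform_events"] == 0:
        concerns.append("no platform events emitted for processed rows")
    return concerns


healthy_snapshot = {
    "health": "healthy", "failed": 0,
    "latest_end": "2024-01-01T12:00:00Z", "latest_status": "completed",
    "expected_new_input": True, "watermark_advanced": 40,
    "total_rows": 40, "total_processed": 40, "platform_events": 6,
}
concerns = review_system_health(healthy_snapshot)
```

Each concern is a prompt to investigate rather than a verdict. Several of these conditions can be normal in context, such as a difference between rows observed and rows processed that results from filtering.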

Operator interpretation
The System Health page is not a substitute for issuer investigation. It is the foundation that tells the operator whether issuer investigation can be trusted. If System Health is unstable, higher-level conclusions should be treated as provisional until processing continuity, event durability, and replay integrity are confirmed.

Relationship to Other Operator Pages

The Dashboard depends on System Health because dashboard metrics are only meaningful when the underlying run and event pipeline is current. The Monitor Console depends on System Health because issuer-health alerts and monitoring surfaces require reliable processing and event emission. The Investigation Workspace depends on System Health because incident evidence must be traceable and replay-safe.

The Action Queue depends on System Health because operational tasks are derived from alert and signal activity. The Supervisor Dashboard depends on System Health because escalation pressure and workload visibility require durable and current operational evidence. The Network Intelligence Dashboard depends on System Health because ecosystem intelligence requires stable event lineage and replay-safe aggregation.

For this reason, System Health is best understood as the operational foundation beneath the entire Zahlen intelligence environment.

Summary

The System Health documentation explains how operators should interpret the processing foundation of Zahlen. The page shows whether issuer-health runs are executing, whether processing is completing, whether watermarks are advancing, whether events are durable, and whether the platform is preserving the evidence required for replay-safe operational intelligence.

In an ordinary payment dashboard, system health may be treated as an infrastructure detail. In Zahlen, System Health is part of the intelligence model. It verifies that the platform’s conclusions are supported by current, durable, replay-safe operational evidence.

A healthy System surface means that operators can proceed with greater confidence into dashboards, monitoring, investigations, action queues, supervisor workflows, and network intelligence. An unhealthy System surface is an early warning that the operational evidence foundation should be reviewed before acting on higher-level conclusions.