How HTI Uses Metrics and Telemetry to Predict Failures in Clusters and Replications

Traditional monitoring of distributed data infrastructures, such as clusters and replication topologies, operates under a fundamentally reactive paradigm. It is designed to respond to binary state failures: a node is online or offline, replication is active or broken, disk usage has exceeded 90% or not. This model is insufficient for complex systems, as a catastrophic failure is not an instantaneous event, but the end result of a cascade of subtle degradations that are perfectly observable if we know what to look for.

Modern reliability engineering demands a shift in focus: from failure detection to failure prediction. This is achieved through the collection and analysis of high-granularity telemetry, not to generate alerts, but to identify the precursors of instability. The analysis does not focus on the absolute value of a metric, but on its temporal derivative (the rate of change), its correlation with other seemingly unrelated metrics, and deviations from historical patterns (baselines).
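
A minimal sketch of the baseline side of this analysis, assuming metric samples are already being collected at fixed intervals by a poller (the class name and thresholds below are illustrative, not a specific HTI tool):

```python
from collections import deque
from statistics import mean, pstdev

class BaselineDetector:
    """Flags samples that deviate strongly from a rolling historical baseline."""

    def __init__(self, window_size: int = 360, z_threshold: float = 3.0):
        # 360 samples at a 10-second interval is roughly one hour of history
        self.window = deque(maxlen=window_size)
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Returns True when the new sample is an outlier versus the baseline."""
        anomaly = False
        if len(self.window) >= 30:  # require enough history before judging
            mu = mean(self.window)
            sigma = pstdev(self.window) or 1e-9  # guard against a flat baseline
            anomaly = abs(value - mu) / sigma > self.z_threshold
        self.window.append(value)
        return anomaly
```

The same structure applies unchanged to replication lag, commit latency, or queue depth; only the sampled metric changes.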

At HTI Tecnologia, the 24/7 support of critical PostgreSQL, Oracle, SQL Server, and other DBMS environments is built on this philosophy of predictive analysis. Our specialists use telemetry not to see what broke, but to model what is about to break. This approach is the difference between an avoided downtime and a severity 1 incident in progress.

This technical article details the specific methods and metrics we use to predict and prevent failures in two of the most critical high-availability architectures: database clusters and replications.

The Fallacy of “Traffic Light” Monitoring in Distributed Systems

Distributed systems do not fail monolithically. The failure of a Galera cluster or a PostgreSQL replication rarely begins with an explicit error in the logs. It starts as a latent problem in one of the following three domains:

  • Network: Increased latency, packet loss, or bandwidth saturation.
  • Node I/O: Degradation of the disk subsystem on a single node.
  • Workload: A change in the application’s query pattern that overloads a specific component of the system (e.g., the write certification mechanism).

“Traffic light” monitoring (green/red) is blind to these degradations. A node can be “online” (responding to pings), but with such high disk latency that its participation in the cluster becomes a bottleneck for the entire system. Replication can be “active” (no errors), but with a growing lag that represents an imminent RPO (Recovery Point Objective) risk.

Predictive analysis, on the other hand, focuses on the stress indicators that precede the failure.

4 Predictive Indicators that HTI Monitors in Clusters and Replications

Below, we detail four examples of predictive analyses that we apply, which transcend the scope of generic monitoring tools.

1. Replication: Analysis of the “Replication Lag” Derivative

  • What traditional monitoring measures: The Replication Lag in seconds or bytes. An alert is triggered if lag > 300 seconds.
  • Why this fails: The absolute lag is a lagging indicator; it informs you about a problem that is already occurring. A 60-second lag might be acceptable if it is stable, while a 20-second lag could be a sign of impending disaster if it was at 2 seconds a minute ago. The isolated metric has no context.
  • We focus on the first and second derivatives of the lag (its velocity and acceleration).
    • Lag Velocity Analysis (d(lag)/dt): We collect the lag value (e.g., pg_wal_lsn_diff in PostgreSQL, Seconds_Behind_Master in MySQL) at short intervals (10-15 seconds). A consistently positive derivative indicates that the replica is not able to apply the changes at the same speed they arrive. Even if the absolute lag is still within the threshold, this is a precursor to failure: the system is in a state of imbalance that, if not corrected, will inevitably lead to an SLA violation (see the sketch after this list).
  • Causal Correlation: An increase in the lag’s derivative triggers an automated correlational analysis to find the root cause. The main hypotheses we investigate are:
    • I/O Contention on the Replica: We correlate the increased lag with the replica’s disk subsystem metrics (iowait, disk queue depth). Often, the cause is a reporting or backup workload running on the replica, which competes for I/O with the log application process.
    • Long Transactions on the Primary: A transaction that remains open for a long time on the primary prevents cleanup processes (like VACUUM in PostgreSQL) from reclaiming dead rows, and the resulting bloat and the burst of changes released when it finally commits both degrade the replica’s ability to keep up.
    • Network Latency: We correlate the lag with the round-trip latency and packet loss between the primary and replica nodes.
  • Technical Example: In a MySQL cluster, we observed that the derivative of Seconds_Behind_Master became positive. The correlational analysis showed that the Innodb_buffer_pool_wait_free counter on the replica was increasing. This indicated that the replica was struggling to find free pages in the buffer pool, a sign of I/O contention. The investigation revealed that an analytics job was running queries with full table scans on the replica. Optimizing these queries resolved the growing lag trend before it became an operational problem.
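
A minimal sketch of the lag-velocity check referenced in the list above, assuming the lag is sampled every 10-15 seconds (the sampler callback and thresholds are illustrative):

```python
import time

def watch_lag_velocity(sample_lag, interval_s=10, trend_samples=12, min_velocity=0.5):
    """Polls replication lag and flags a sustained positive first derivative.

    sample_lag: a callable returning the current lag, e.g. seconds from
    Seconds_Behind_Master or bytes computed with pg_wal_lsn_diff().
    Returns (lag, velocity) once 'trend_samples' consecutive readings show
    the replica falling behind faster than 'min_velocity' units per second.
    """
    previous = sample_lag()
    positive_streak = 0
    while True:
        time.sleep(interval_s)
        current = sample_lag()
        velocity = (current - previous) / interval_s  # d(lag)/dt
        previous = current
        positive_streak = positive_streak + 1 if velocity > min_velocity else 0
        if positive_streak >= trend_samples:
            return current, velocity  # sustained imbalance: raise a predictive alert
```

The second derivative (acceleration) is computed the same way, over the series of velocity values instead of the raw lag.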

2. Synchronous Clusters: Monitoring “Flow Control”

  • What traditional monitoring measures: The status of the nodes in the cluster (e.g., wsrep_cluster_size in Galera/Percona XtraDB Cluster).
  • Why this fails: A cluster can have all nodes “healthy” and still have its performance completely degraded due to a self-protection mechanism called Flow Control. Flow Control is activated when a node cannot keep up with the cluster’s write speed, causing the faster nodes to pause to wait for it. For the application, this manifests as a “freeze” or a drastic increase in commit latency.
  • We directly monitor the Flow Control metrics and their precursors.
    • State Metrics: In Galera-based clusters, we monitor wsrep_flow_control_paused (the fraction of time the node has spent paused) and wsrep_flow_control_sent (the number of times the node has signaled the cluster to slow down). An increase in these metrics is an unequivocal sign that at least one node is acting as a bottleneck.
    • “Noisy Node” Analysis: Flow Control is a symptom. The cause is a “noisy node” (or “slow node”). To identify it predictively, we monitor the write-set receive (apply) queue depth (wsrep_local_recv_queue and its average, wsrep_local_recv_queue_avg) and the send queue depth (wsrep_local_send_queue) on each node individually. A node whose queues are consistently deeper than those of its peers is the candidate to cause the next Flow Control event (see the sketch after this list).
    • Correlation with Node Resources: The cause of a node being slow is usually the same as a slow replica: I/O contention, CPU contention, or network fragmentation. Our analysis correlates the cluster’s queue metrics with the operating system’s resource metrics on each node.
  • Technical Example: In a Percona XtraDB cluster, wsrep_flow_control_paused alerts started to appear. Instead of just logging the event, our retrospective analysis of the telemetry history showed that one of the three nodes had been showing a gradual increase in wsrep_local_recv_queue for hours. The investigation on that specific node revealed a hardware problem in its disk subsystem that was not severe enough to cause a failure but made it marginally slower than the others, enough to degrade the entire cluster under load.
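
A minimal sketch of the per-node queue comparison described in this section, assuming each node’s wsrep status counters are collected over the same window into a dictionary (the 2x ratio and the 0.1 floor are illustrative thresholds):

```python
def find_noisy_node(node_status: dict[str, dict[str, float]], ratio: float = 2.0):
    """node_status maps node name -> averaged Galera status values, e.g.
    {'wsrep_local_recv_queue_avg': 3.2, 'wsrep_flow_control_sent': 14.0}.

    Returns (node, queue_depth) for the node whose receive queue is
    disproportionately deep, i.e. the likely trigger of the next Flow
    Control event, or None if the cluster looks balanced.
    """
    queues = {node: status.get("wsrep_local_recv_queue_avg", 0.0)
              for node, status in node_status.items()}
    for node, depth in queues.items():
        peers = [v for n, v in queues.items() if n != node]
        peer_avg = sum(peers) / max(len(peers), 1)
        if depth > ratio * max(peer_avg, 0.1):  # floor avoids noise on idle clusters
            return node, depth
    return None
```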

3. Data Drift Detection

  • What traditional monitoring measures: The state of replication (active/inactive).
  • Why this fails: Replication can be technically “working” (the process is running and applying events), but logical errors, DBMS bugs, or manual interventions can lead to a silent data divergence between the primary and the replicas. This is one of the most dangerous types of failures, as it can go unnoticed for weeks and completely invalidate the purpose of the replica as a failover or backup copy.
  • We implement a data consistency verification layer.
    • Table Checksumming: We use tools like pt-table-checksum (from the Percona Toolkit, for MySQL-based topologies) and equivalent chunk-based comparison routines for PostgreSQL to perform low-impact checksum verifications in the background. These tools divide the tables into chunks and compare the checksums between the primary and the replica. Any divergence is detected and logged long before it becomes a business problem (see the sketch after this list).
    • Validation Metrics Monitoring: Running the checksum is only half the solution. We ingest the results of these tools into our monitoring system, and we do not only alert on a detected divergence: we also monitor the duration of each checksum run. An increase in the time required to check a table can indicate a performance problem on the replica or an increase in the write load, both of which are precursors to problems.
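
A minimal sketch of the ingestion step, assuming pt-table-checksum writes to its default percona.checksums table and that pymysql is available; hosts and credentials are placeholders:

```python
import pymysql

# Divergence query adapted from the pt-table-checksum documentation: it is
# run on each replica to find chunks whose checksum or row count differs
# from the values recorded on the primary.
DIVERGENCE_SQL = """
    SELECT db, tbl, COUNT(*) AS divergent_chunks
    FROM percona.checksums
    WHERE master_cnt <> this_cnt
       OR master_crc <> this_crc
       OR ISNULL(master_crc) <> ISNULL(this_crc)
    GROUP BY db, tbl
"""

def replica_divergences(host: str, user: str, password: str):
    """Returns (db, table, divergent_chunk_count) rows for one replica."""
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute(DIVERGENCE_SQL)
            return list(cur.fetchall())
    finally:
        conn.close()
```

The per-chunk execution times that the tool records in the same table are what feed the duration trend mentioned above.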

4. Consensus Mechanism Saturation

  • What traditional monitoring measures: The average commit latency.
  • Why this fails: The average can hide dangerous outliers. In a cluster, the latency of a commit is determined by the time it takes for the transaction to be replicated and certified by a quorum of nodes. A problem in the communication between the nodes can cause latency spikes that are “diluted” in the average but severely affect the user experience.
  • We analyze the health of the group communication layer.
    • Group Round-Trip Latency: In clusters like Galera, we monitor wsrep_evs_repl_latency, which reports the group communication replication latency (minimum, average, maximum, and standard deviation over a sampling window). We analyze the tail of this distribution (maximum values and high percentiles such as p95/p99 built from repeated samples), not the average. An increase in p99 latency indicates that the network is starting to become a bottleneck for consensus, even if the average latency seems normal (see the sketch after this list).
    • Potential “Split Brain” Analysis: We monitor the network topology between the cluster nodes. A “flapping” event on a network switch, causing intermittent connectivity loss between subsets of nodes, is a precursor to a split brain event. Our network telemetry analysis looks for these patterns of instability.
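
A minimal sketch of the tail-latency comparison, assuming inter-node round-trip samples (or the maximum values parsed from wsrep_evs_repl_latency) are grouped into time windows; the 1.5x growth factor is illustrative:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile, to avoid external dependencies."""
    ordered = sorted(samples)
    rank = max(math.ceil(pct / 100 * len(ordered)) - 1, 0)
    return ordered[rank]

def consensus_latency_regressing(current_window: list[float],
                                 baseline_window: list[float],
                                 growth: float = 1.5) -> bool:
    """True when the p99 of group-communication latency has grown markedly
    versus the baseline window, even if the averages still look similar."""
    return percentile(current_window, 99) > growth * percentile(baseline_window, 99)
```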

Human Expertise: The Final Layer of Predictive Analysis

The collection of telemetry and the application of statistical analysis can be automated. However, contextual interpretation and the decision to intervene require human expertise. Does a rising lag derivative call for the same response when the cause is I/O contention as when it is a long-running transaction? The answer depends on the workload, the criticality of the application, and the business objectives.

This is where the partnership with HTI Tecnologia transcends the simple provision of a monitoring service.

  • Multi-Platform Expertise: Our team understands the idiosyncrasies of each replication and clustering mechanism, be it Oracle Data Guard, SQL Server Always On, PostgreSQL logical replication, or Galera clusters. This ensures that the analysis is accurate.
  • 24/7 Incident Management: The prediction of a failure is useless without the ability to act to prevent it, at any time. Our 24/7 Support and Sustaining service ensures that an expert not only sees the predictive alert but also executes the action plan to mitigate the risk.

From Reaction to Prevention

The management of high-availability distributed data systems requires an evolution from reactive monitoring to predictive analysis. The focus must shift from failure indicators to stress and degradation indicators.

Analyzing the rate of change of metrics, correlating events between different layers of the system, and understanding the internal mechanisms of each technology are the pillars of this approach. It is a process that combines automation in data collection with human expertise in interpretation, ensuring that failures are predicted and avoided, rather than just remediated.

Is your operation still stuck in the cycle of alerts and remediation? Schedule a conversation with one of our specialists and discover how predictive analysis can ensure your business continuity.

Schedule a conversation with one of our specialists and discover the blind spots in your monitoring.

Schedule a meeting here

Visit our Blog

Learn more about databases

Learn about monitoring with advanced tools
