Oracle RAC: Monitoring and Preventing Performance-Tipped Evictions
After my last blog post on Oracle Real Application Clusters (RAC), I was asked to discuss both node health and how performance problems can affect a RAC database. RAC's architecture enables failover, distributes workload, and offers an option to scale performance, but only when all nodes play well together. When one node drags behind or becomes unstable, RAC has no choice but to protect the rest of the cluster, so help me, Oracle Gods. That protection can take the form of node eviction, which is disruptive but often avoidable with proactive monitoring and intervention.
Why Node Health Matters
In a RAC environment, node health is constantly evaluated to ensure that each member of the cluster can respond to workload demands, maintain consistency, and prevent split-brain scenarios. Oracle Clusterware uses heartbeat mechanisms, voting disks, and network checks to determine node status. If one node becomes unresponsive or performs below an acceptable threshold, it can be evicted (forcefully removed) from the cluster.
This eviction is not a bug, but a protective feature. It can also come as a surprise when the root cause isn’t hardware failure, but performance degradation.
How Oracle Monitors Node Health
Oracle RAC uses several mechanisms to assess the health of nodes:
- Cluster Heartbeats: Exchanged between nodes via the private interconnect and voting disks. Failure to receive a heartbeat within a specific interval can trigger eviction.
- Cluster Health Monitor (CHM): Monitors OS-level metrics like CPU, memory, and I/O at sub-second intervals.
- Hang Manager: Detects database hangs and attempts resolution, or escalates to eviction (hang details can be captured with HANGANALYZE dumps).
- Clusterware Daemons (OHASD, CRSD): The Oracle High Availability Services and Cluster Ready Services daemons initiate restarts or evictions when they detect resource failures.
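As a quick illustration of how these mechanisms surface in practice, here is a minimal Python sketch that scans Clusterware alert log text for eviction-related CRS message numbers. The message prefixes (CRS-1610, CRS-8011, CRS-8013) are the ones that appear in the log excerpts later in this post; the sample lines and the helper itself are illustrative, not an official parsing tool.

```python
import re

# Eviction-related Clusterware message numbers seen in this post's examples:
# CRS-1610 (heartbeat/eviction warning), CRS-8011/8013 (reboot advisories).
EVICTION_MSGS = re.compile(r"CRS-(1610|8011|8013)")

def find_eviction_warnings(log_lines):
    """Return (line_number, text) pairs for eviction-related entries."""
    return [(n, line.strip())
            for n, line in enumerate(log_lines, start=1)
            if EVICTION_MSGS.search(line)]

# Illustrative sample lines modeled on the log excerpts in this post.
sample = [
    "[cssd(3474)]CRS-1610:node DBnode01 (0) at 90% heartbeat fatal",
    "[crsd(2211)]some unrelated resource state change",
    "[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129",
]
for n, text in find_eviction_warnings(sample):
    print(n, text)
```

Grepping for these message numbers across saved alert logs is a cheap first pass when you are reconstructing the timeline of an eviction.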
When Performance Tips the Scale
Node eviction isn't always due to a host-level crash. In many cases, performance bottlenecks cause a node to respond slowly or not at all, tipping it into eviction territory. An almost limitless number of situations can lead to an eviction; the examples below show how diverse the issues affecting RAC node stability can be.
Example 1: CPU Starvation
If a node’s CPU is pinned at 100% due to runaway processes or OS contention, it may not respond to heartbeat checks quickly enough. RAC considers this node unresponsive and evicts it. This can happen even if the database itself is still technically “running”.
Alert log example:
2025-05-21 10:45:12.123
[cssd(3474)]CRS-1610:node DBnode01 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2025-05-21 10:45:12.245
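One way to watch for CPU starvation before CSSD does is to track load average per core. The sketch below uses Python's `os.getloadavg()` (Unix-only); the 0.8-per-core threshold is my own illustrative assumption, not an Oracle default.

```python
import os

def cpu_pressure(threshold_per_core=0.8):
    """Return (1-minute load per core, whether it exceeds the threshold)."""
    load1, _, _ = os.getloadavg()          # 1-minute load average (Unix)
    cores = os.cpu_count() or 1
    per_core = load1 / cores
    return per_core, per_core > threshold_per_core

per_core, overloaded = cpu_pressure()
print(f"load per core: {per_core:.2f}, overloaded: {overloaded}")
```

A cron job or monitoring agent running a check like this can page you well before the node stops answering heartbeats.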
Example 2: Network Latency
A saturated or flaky interconnect can lead to heartbeat timeouts and poor Global Enqueue Service/Global Cache Service (GES/GCS) performance, especially under heavy load. If the nodes can't communicate with each other, it's essential that the culprit node be evicted so communication can resume.
CSSD log example:
Fatal NI connect error 12170.
The network between racnode1 and racnode3 exceeded 2000 ms latency.
Example 3: I/O Bottlenecks
When a node can't read from or write to shared storage in time (e.g., due to inefficient, overloaded, or degraded storage), it may hang or time out during disk operations. RAC can interpret this as a node failure. All nodes must have timely, full read/write access to the shared storage layer, or node eviction is imminent.
ASM alert example:
Disk group DATA slow to respond for 3.4s. IO hang suspected.
Eviction initiated for DBnode4.
Clusterware alert log (combined) example:
[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, component: cssagent, with timestamp: L-2009-05-05-10:03:25.340
[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 27630, network timeout 28500, last heartbeat from CSSD at epoch seconds 1241543005.340, 4294967295 milliseconds ago based on invariant clock value of 93235653
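A crude but useful self-check is to time a small synchronous write, the same class of operation CSS performs against the voting disks. This Python sketch writes 4 KiB and fsyncs it to a temporary file; in practice you would point it at the storage backing your voting disks and ASM disk groups, and the acceptable latency is an assumption you tune for your environment, not an Oracle setting.

```python
import os
import tempfile
import time

def write_latency_s(path, size=4096):
    """Seconds taken to write `size` bytes and fsync them to disk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    start = time.monotonic()
    os.write(fd, b"\0" * size)
    os.fsync(fd)                       # force the write to stable storage
    elapsed = time.monotonic() - start
    os.close(fd)
    return elapsed

with tempfile.TemporaryDirectory() as d:
    latency = write_latency_s(os.path.join(d, "probe.dat"))
    print(f"4 KiB synchronous write took {latency:.4f}s")
```

If a probe like this regularly takes seconds instead of milliseconds, you are in the territory the ASM alert above describes, and eviction is a realistic risk.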
Proactive Monitoring Strategies
There are various utilities packaged with Oracle Grid Infrastructure (Clusterware) that DBAs use to monitor and manage a RAC cluster. Because of RAC's complexity, the cluster as a whole must be monitored, not just the individual nodes (instances) that make up the RAC environment.
Utilities and Tools
- Oracle Cluster Health Monitor (CHM / oclumon)
  - Use oclumon dumpnodeview -n <nodename> to analyze real-time node metrics.
  - Gather and review the CHM data for the time of eviction before it ages out of the repository.
- Oracle Grid Infrastructure Logs
  - Investigate logs under $GRID_HOME/log/<node>/cssd/ and $GRID_HOME/log/<node>/crsd/
- AWR and ADDM Reports
  - Review interconnect latencies, wait events, and cluster wait classes.
- Oracle Monitoring Products
  - Visual dashboards that alert on CPU spikes, I/O latency, and memory bottlenecks before they become eviction events.
- Network and Storage Health Checks
  - Regular checks on interconnect bandwidth, jitter, and disk throughput using tools like iperf, orion, and OSWatcher.
  - For OSWatcher, consider using oswnetstat and/or oswprvtnet for diagnosis assistance if issues occur.
  - Check network communication and UDP settings.
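When you save oclumon output for later analysis, a small script can pull out the headline numbers. CHM output formats differ across Grid Infrastructure versions, so the sample text and the regex below are illustrative, not a parser for any specific release.

```python
import re

def extract_cpuload(dumpnodeview_text):
    """Pull the cpuload value out of saved `oclumon dumpnodeview` text.

    Hypothetical helper: the 'cpuload:' field name mirrors CHM node-view
    output, but check the format your Grid Infrastructure version emits.
    """
    m = re.search(r"cpuload:\s*([\d.]+)", dumpnodeview_text)
    return float(m.group(1)) if m else None

# Illustrative sample shaped like a CHM node view, not captured output.
sample = """
Node: racnode1 Clock: '2025-05-21 10.45.00' SerialNo: 4711
SYSTEM: #pcpus: 8 cpuload: 6.91 memfree: 102400
"""
print("cpuload:", extract_cpuload(sample))
```

Trending a value like this over time is how you spot a node drifting toward trouble days before an eviction, rather than reconstructing it afterwards.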
Best Practices to Avoid Performance-Based Eviction
- Reserve CPU using cgroups or CPU pinning for Oracle processes.
- Consider using Oracle Resource Manager to assist in throttling resource usage for Oracle processes and users.
- Use high-throughput private interconnects, ideally on a separate NIC and VLAN.
- Implement storage QoS to ensure database I/O isn’t starved.
- Monitor CHM for early warning signs like high OS load averages or paging.
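To go with that last point, here is a tiny Linux-specific sketch that reads the swap-in/swap-out counters from /proc/vmstat. Sustained growth in these counters between samples means the node is paging, which is exactly the kind of early warning CHM raises; the field names are standard Linux, but any alerting threshold is an assumption you set for your environment.

```python
def swap_activity():
    """Return cumulative swap-in/out page counts from /proc/vmstat (Linux)."""
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, _, value = line.partition(" ")
            if key in ("pswpin", "pswpout"):
                counters[key] = int(value)
    return counters

# Compare two samples taken a minute apart; steady growth means paging.
print(swap_activity())
```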
Avoid overcommitting nodes with excessive workloads or background jobs in the first place; scale or optimize before over-commitment becomes an issue. A healthy RAC environment is designed so that if a node is evicted, the remaining node(s) can carry the workload reallocated to them.
In Summary
Node eviction in Oracle RAC isn't just about hardware failures. Those absolutely happen, but eviction is often a direct result of system-level performance issues. By understanding and monitoring the right metrics, you can catch the early warning signs before Oracle is forced to take drastic action. RAC is powerful, but it demands visibility and care to operate optimally.
By understanding how Oracle evaluates node health, and by setting up tools and policies to keep your cluster balanced, you can ensure that performance never comes at the cost of stability.