🛡️ Guardian Self-Healing System

Updated May 15, 2026

The Guardian is the manager's autonomous watchdog service, responsible for monitoring server process health, tracking resource consumption, and providing self-healing capabilities through automated crash recovery.

📝 Service Overview

  • File Path: src-tauri/src/services/guardian.rs
  • Architecture: Low-overhead OS process monitoring built on the sysinfo crate.
  • Core Functionality: Crash Detection, Auto-Restart Orchestration, Resource Telemetry (CPU/RAM).

🚀 Key Features

1. High-Performance Monitoring (⚡)

The Guardian uses a "Partial Refresh" strategy to monitor server health with minimal impact on game performance:

  • PID Tracking: Securely maps database server IDs to their active OS Process IDs.
  • Efficient Telemetry: Only refreshes relevant process data from the OS kernel, avoiding the heavy overhead of full system hardware scans.
  • Resource Profiling: Calculates real-time Memory Usage (MB) and CPU Usage (%) for every running ARK instance.
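
The PID tracking and resource profiling described above can be sketched roughly as follows. This is a minimal, std-only illustration, not the actual contents of guardian.rs: the `PidRegistry` type, its method names, and the `bytes_to_mb` helper are hypothetical, and the sketch assumes "MB" means 1024 × 1024 bytes.

```rust
use std::collections::HashMap;

/// Hypothetical registry mapping database server IDs to OS process IDs.
#[derive(Default)]
pub struct PidRegistry {
    pids: HashMap<i64, u32>,
}

impl PidRegistry {
    /// Record the PID of a started server, returning any PID it replaced.
    pub fn register(&mut self, server_id: i64, pid: u32) -> Option<u32> {
        self.pids.insert(server_id, pid)
    }

    /// Forget a server once its process is stopped or confirmed dead.
    pub fn unregister(&mut self, server_id: i64) -> Option<u32> {
        self.pids.remove(&server_id)
    }

    /// Look up the PID to pass to the process refresh.
    pub fn pid_of(&self, server_id: i64) -> Option<u32> {
        self.pids.get(&server_id).copied()
    }

    /// All tracked PIDs in ascending order: the only processes that
    /// need refreshing on each monitoring pass.
    pub fn tracked_pids(&self) -> Vec<u32> {
        let mut pids: Vec<u32> = self.pids.values().copied().collect();
        pids.sort_unstable();
        pids
    }
}

/// Convert a byte count to megabytes (assumption: 1 MB = 1024 * 1024 bytes).
pub fn bytes_to_mb(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```

Keeping the registry limited to tracked servers means each monitoring pass only needs the PIDs returned by `tracked_pids`, which is what makes a partial refresh possible.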

2. Autonomous Self-Healing (💊)

  • Crash Detection: The Guardian identifies process failures within seconds of a crash event.
  • Configurable Auto-Restart: Administrators can toggle "Auto-Restart" on a per-server basis. When enabled, the Guardian autonomously re-provisions the server process if it terminates unexpectedly.
  • Crash Counter: Tracks cumulative failure counts for each server, helping developers identify unstable mod configurations or hardware bottlenecks.
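
One way to picture the per-server auto-restart toggle and crash counter is the sketch below. The `HealingPolicy` and `CrashAction` names are illustrative assumptions, as is the choice to treat auto-restart as disabled until it is explicitly enabled; the real service may behave differently.

```rust
use std::collections::HashMap;

/// What the watchdog should do after an unexpected exit.
#[derive(Debug, PartialEq, Eq)]
pub enum CrashAction {
    Restart,
    LeaveStopped,
}

/// Hypothetical per-server healing state.
#[derive(Default)]
pub struct HealingPolicy {
    auto_restart: HashMap<i64, bool>,
    crash_counts: HashMap<i64, u32>,
}

impl HealingPolicy {
    /// Toggle auto-restart for a single server.
    pub fn set_auto_restart(&mut self, server_id: i64, enabled: bool) {
        self.auto_restart.insert(server_id, enabled);
    }

    /// Cumulative number of crashes recorded for a server.
    pub fn crash_count(&self, server_id: i64) -> u32 {
        self.crash_counts.get(&server_id).copied().unwrap_or(0)
    }

    /// Record an unexpected exit and decide whether to restart.
    /// Assumption: servers without an explicit setting are not restarted.
    pub fn on_crash(&mut self, server_id: i64) -> CrashAction {
        let count = self.crash_counts.entry(server_id).or_insert(0);
        *count = count.saturating_add(1);
        if self.auto_restart.get(&server_id).copied().unwrap_or(false) {
            CrashAction::Restart
        } else {
            CrashAction::LeaveStopped
        }
    }
}
```

The counter is incremented whether or not a restart follows, so it still reflects instability on servers where auto-restart is turned off.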

3. Crash Intelligence & Logging

  • Event Journal: Maintains a persistent rolling log of the last 100 crash events across the entire cluster.
  • Detailed Audits: Each crash log includes the server name, precise UTC timestamp, and a record of whether the self-healing system successfully triggered a recovery.
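
A rolling log capped at 100 entries is commonly built on a double-ended queue. The sketch below shows that technique in memory only; persistence is out of scope here, and the `CrashEvent` and `CrashLog` types and their field names are assumptions chosen to match the fields listed above.

```rust
use std::collections::VecDeque;
use std::time::SystemTime;

/// Maximum number of crash events retained, as stated for the Event Journal.
const MAX_CRASH_EVENTS: usize = 100;

/// Hypothetical crash record: server name, timestamp, recovery outcome.
#[derive(Debug, Clone)]
pub struct CrashEvent {
    pub server_name: String,
    pub timestamp: SystemTime,
    pub auto_restarted: bool,
}

/// Hypothetical in-memory rolling log shared by all servers.
pub struct CrashLog {
    events: VecDeque<CrashEvent>,
}

impl CrashLog {
    pub fn new() -> Self {
        Self { events: VecDeque::with_capacity(MAX_CRASH_EVENTS) }
    }

    /// Append an event, evicting the oldest one when the log is full.
    pub fn record(&mut self, server_name: &str, auto_restarted: bool) {
        if self.events.len() == MAX_CRASH_EVENTS {
            self.events.pop_front();
        }
        self.events.push_back(CrashEvent {
            server_name: server_name.to_string(),
            timestamp: SystemTime::now(),
            auto_restarted,
        });
    }

    /// Iterate over retained events, oldest first.
    pub fn entries(&self) -> impl Iterator<Item = &CrashEvent> {
        self.events.iter()
    }

    /// Number of events currently retained.
    pub fn len(&self) -> usize {
        self.events.len()
    }
}
```

Evicting from the front before pushing to the back keeps memory use bounded while preserving chronological order for display.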

4. Developer API

The Guardian exposes several Tauri commands for frontend integration:

  • get_server_health: Fetches detailed metrics for a specific server.
  • set_auto_restart: Updates the autonomous healing policy for a server.
  • get_crash_log: Retrieves the cluster-wide failure history for display in the admin panel.

🛠️ Technical Details

Server Health Model

pub struct ServerHealth {
    pub server_id: i64,
    pub is_alive: bool,
    pub crash_count: u32,
    pub memory_mb: f64,
    pub cpu_percent: f32,
    pub auto_restart_enabled: bool,
}

Lightweight Process Refresh

To ensure minimal impact on the host system, the service uses a surgical refresh pattern:

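use sysinfo::{Pid, ProcessesToUpdate, System};

let mut sys = System::new();
let pid = Pid::from_u32(pid);
// Refresh only the tracked process; `true` removes its entry if it has exited.
sys.refresh_processes(ProcessesToUpdate::Some(&[pid]), true);
let process = sys.process(pid);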

🎨 Developer Notes

  • Safety First: The Guardian is "crash-aware" but not "restart-happy." It validates process state multiple times before confirming a failure to prevent race conditions during server reboots.
  • Visibility: All Guardian actions (registration, auto-restarts, crash logging) are output to the manager's console with a distinct 🛡️ icon for easy log auditing.
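
The "Safety First" note above describes confirming a failure only after repeated checks. A minimal sketch of that idea is a counter of consecutive missed checks, shown below. The `FailureDetector` type and the threshold of three checks are illustrative assumptions; the number of checks the Guardian actually performs is not stated here.

```rust
/// Hypothetical number of consecutive missed checks before a crash is confirmed.
pub const CONFIRMATIONS_REQUIRED: u32 = 3;

/// Result of a single liveness observation.
#[derive(Debug, PartialEq, Eq)]
pub enum Liveness {
    Alive,
    /// Process missing, but not yet for enough consecutive checks.
    Suspect(u32),
    ConfirmedDead,
}

/// Tracks consecutive missed checks for one server process.
#[derive(Default)]
pub struct FailureDetector {
    consecutive_misses: u32,
}

impl FailureDetector {
    /// Feed in one check result; any successful check resets the streak.
    pub fn observe(&mut self, process_found: bool) -> Liveness {
        if process_found {
            self.consecutive_misses = 0;
            return Liveness::Alive;
        }
        self.consecutive_misses = self.consecutive_misses.saturating_add(1);
        if self.consecutive_misses >= CONFIRMATIONS_REQUIRED {
            Liveness::ConfirmedDead
        } else {
            Liveness::Suspect(self.consecutive_misses)
        }
    }
}
```

Because a single successful check resets the streak, a server that briefly disappears during a manual reboot is not reported as crashed unless it stays missing for the full number of checks.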