How RecoveryPulse Works

RecoveryPulse monitors your websites and automatically recovers them when issues are detected. Here's how it works.

Recovery actions run only if you enable them, and you control the commands. Grant RecoveryPulse only the privileges it needs, and set retry limits so recovery cannot loop forever.

1. Continuous Monitoring

RecoveryPulse checks your websites at configurable intervals (default: 60 seconds). Each check verifies:

HTTP Status: Ensures the expected status code (usually 200) is returned
Response Time: Measures how long the response takes
SSL Certificate: Validates your SSL certificate and checks expiry
Content Match: Optionally verifies specific text appears on the page
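
To make the four checks concrete, here is a minimal Python sketch of how one check result might be evaluated. It is an illustration, not RecoveryPulse's implementation: `CheckResult`, `evaluate_check`, and the thresholds `max_response_seconds` and `min_cert_days_left` are hypothetical names and values chosen for this example. The text above only says response time is measured, so treating a slow response as a failure is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class CheckResult:
    """What one probe of a site observed (hypothetical structure)."""
    status_code: int
    response_seconds: float
    body: str
    # Timezone-aware expiry time; None when the site has no certificate.
    cert_expires_at: Optional[datetime] = None


def evaluate_check(
    result: CheckResult,
    expected_status: int = 200,
    max_response_seconds: float = 5.0,   # assumed threshold
    required_text: Optional[str] = None,
    min_cert_days_left: int = 14,        # assumed threshold
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the reasons the check failed; an empty list means it passed."""
    now = now or datetime.now(timezone.utc)
    failures: List[str] = []

    # HTTP Status
    if result.status_code != expected_status:
        failures.append(f"status {result.status_code}, expected {expected_status}")

    # Response Time
    if result.response_seconds > max_response_seconds:
        failures.append(f"response took {result.response_seconds:.1f}s")

    # SSL Certificate expiry
    if result.cert_expires_at is not None:
        days_left = (result.cert_expires_at - now).days
        if days_left < min_cert_days_left:
            failures.append(f"certificate expires in {days_left} days")

    # Content Match (optional)
    if required_text is not None and required_text not in result.body:
        failures.append("required text not found")

    return failures


# Example: a slow 200 response that is missing the expected text.
checked_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
sample = CheckResult(
    status_code=200,
    response_seconds=7.2,
    body="Maintenance page",
    cert_expires_at=checked_at + timedelta(days=90),
)
print(evaluate_check(sample, required_text="Welcome", now=checked_at))
```

Returning a list of reasons rather than a single pass/fail flag keeps the detail that an incident report would need.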

2. Incident Detection

When a check fails, RecoveryPulse waits for a second failure to confirm the issue isn't transient. After two consecutive failures:

An incident is created with full details
The site status is marked as "down"
Notifications are sent (if configured)
Auto-recovery begins (if enabled)
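
The two-failure confirmation rule can be sketched as a small state tracker. This is a simplified Python illustration under stated assumptions: `SiteMonitorState` and its event names are hypothetical, and marking the site "up" again on the first passing check is an assumption, since the text above does not describe how incidents are closed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SiteMonitorState:
    failure_threshold: int = 2          # two consecutive failures, as described above
    consecutive_failures: int = 0
    status: str = "up"
    events: List[str] = field(default_factory=list)

    def record_check(self, passed: bool) -> None:
        """Update the site state with the outcome of one check."""
        if passed:
            # Any passing check resets the streak, so a single blip is ignored.
            self.consecutive_failures = 0
            if self.status == "down":
                self.status = "up"
                self.events.append("incident_resolved")  # assumed behavior
            return

        self.consecutive_failures += 1
        if self.status == "up" and self.consecutive_failures >= self.failure_threshold:
            # Confirmed outage: this is where an incident would be created,
            # notifications sent, and auto-recovery started.
            self.status = "down"
            self.events.append("incident_created")


state = SiteMonitorState()
for passed in [True, False, True, False, False]:
    state.record_check(passed)
print(state.status, state.events)
```

In this run the first failure is followed by a pass and is treated as transient. Only the last two failures in a row open an incident.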

3. Automated Recovery

RecoveryPulse connects to your server via SSH and executes recovery actions in order. After each action, it checks whether the site is back online before proceeding to the next one.

Typical Recovery Flow:

Restart application service
Wait 30 seconds, check site
If still down: restart web server
Wait 30 seconds, check site
If still down: restart database
Continue until recovered or max attempts reached
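
The flow above amounts to an escalation loop: run an action, wait, check, and stop as soon as the site responds. The Python sketch below shows that shape under stated assumptions. `run_recovery` and its parameters are hypothetical names, actions are plain callables rather than real SSH commands, and each executed action counts as one attempt. Whether RecoveryPulse cycles through the list again after the last action is not stated, so this sketch does not.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


def run_recovery(
    actions: Sequence[Tuple[str, Callable[[], None]]],
    site_is_up: Callable[[], bool],
    wait_seconds: float = 30.0,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Run actions in order until the site recovers.

    Returns the name of the action after which the site came back,
    or None if the attempt limit was reached or the actions ran out.
    """
    for attempt, (name, run_action) in enumerate(actions, start=1):
        if attempt > max_attempts:
            break                 # stop escalating; manual intervention needed
        run_action()
        sleep(wait_seconds)       # give the service time to restart
        if site_is_up():
            return name
    return None


# Simulated run: the site comes back after the second action.
log = []
check_results = iter([False, True])
recovered_by = run_recovery(
    actions=[
        ("restart_app", lambda: log.append("restart_app")),
        ("restart_nginx", lambda: log.append("restart_nginx")),
        ("restart_mysql", lambda: log.append("restart_mysql")),
    ],
    site_is_up=lambda: next(check_results),
    sleep=lambda seconds: None,   # skip real waiting in the example
)
print(recovered_by, log)
```

Passing `sleep` and `site_is_up` in as arguments keeps the loop testable without a real server or a real 30-second wait.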

Available Recovery Actions

Action	Description	When to Use
`restart_app`	Restarts your application service via systemd	First action for most issues
`restart_nginx`	Restarts the nginx web server	502/504 errors, proxy issues
`restart_apache`	Restarts Apache web server	Apache-based setups
`restart_mysql`	Restarts MySQL database	Database connection errors
`restart_postgresql`	Restarts PostgreSQL database	PostgreSQL setups
`restart_php_fpm`	Restarts PHP-FPM service	PHP applications
`clear_nginx_cache`	Clears nginx cache and reloads	Stale cache issues
`rollback_nginx_config`	Restores nginx.conf from backup	After config changes
`reboot_server`	Full server reboot	Last resort
`custom_script`	Run any custom command	Special recovery needs

Best Practices

Rule Order

Start with the least disruptive actions (app restart) and escalate to more drastic measures (server reboot) only if needed.

Wait Times

Give services enough time to restart fully before checking. 30 seconds is sufficient for most services.

Max Attempts

Set a reasonable limit (5-10) to prevent infinite recovery loops. Some issues need manual intervention.

SSH Security

Use a dedicated SSH key for RecoveryPulse with limited sudo permissions for only the commands you need.
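
One way to keep sudo permissions limited is to derive them from the actions you have actually enabled. The Python sketch below generates one sudoers rule per enabled action. Everything in it is illustrative: the `ACTION_COMMANDS` mapping, the binary paths, the unit names, and the `recoverypulse` account name are assumptions, because this document does not list the exact commands RecoveryPulse runs.

```python
from typing import Dict, Iterable, List

# Hypothetical action-to-command mapping. Binary paths and service unit names
# vary by distribution, so adjust these to match your server.
ACTION_COMMANDS: Dict[str, str] = {
    "restart_nginx": "/usr/bin/systemctl restart nginx",
    "restart_postgresql": "/usr/bin/systemctl restart postgresql",
    "reboot_server": "/usr/sbin/reboot",
}


def sudoers_lines(user: str, enabled_actions: Iterable[str]) -> List[str]:
    """Build one passwordless sudoers rule per enabled action, and nothing more."""
    lines: List[str] = []
    for action in enabled_actions:
        if action not in ACTION_COMMANDS:
            raise ValueError(f"no command defined for action: {action}")
        lines.append(f"{user} ALL=(root) NOPASSWD: {ACTION_COMMANDS[action]}")
    return lines


# "recoverypulse" is a hypothetical dedicated account name.
for line in sudoers_lines("recoverypulse", ["restart_nginx", "restart_postgresql"]):
    print(line)
```

Each rule names a full command with its arguments, so the account cannot restart other services or run arbitrary commands as root. Review the generated lines and validate them with `visudo` before installing them.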
