How RecoveryPulse Works
RecoveryPulse monitors your websites and automatically attempts to recover them when it detects a problem. The process has three stages: monitoring, incident detection, and automated recovery.
1. Continuous Monitoring
RecoveryPulse checks your websites at configurable intervals (default: 60 seconds). Each check verifies:
- HTTP Status: Ensures the expected status code (usually 200) is returned
- Response Time: Measures how long the response takes
- SSL Certificate: Validates your SSL certificate and checks expiry
- Content Match: Optionally verifies specific text appears on the page
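The checks above can be pictured as one function that turns a single response into a pass/fail verdict. This is a minimal sketch, not RecoveryPulse's actual implementation: `CheckConfig`, `CheckResult`, `evaluate_check`, and the `min_cert_days` setting are hypothetical names chosen for illustration, and the inputs (status code, timing, certificate days remaining, page body) are assumed to have been collected already.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CheckConfig:
    """Hypothetical per-site settings; field names are illustrative only."""
    expected_status: int = 200
    must_contain: Optional[str] = None  # content match is optional
    min_cert_days: int = 0              # assumption: fail once the certificate has expired


@dataclass
class CheckResult:
    ok: bool
    response_seconds: float             # recorded for reporting, not judged here
    failures: List[str] = field(default_factory=list)


def evaluate_check(status: int, response_seconds: float, cert_days_left: int,
                   body: str, config: CheckConfig) -> CheckResult:
    """Compare one response against the configured expectations."""
    failures = []
    if status != config.expected_status:
        failures.append(f"HTTP status {status}, expected {config.expected_status}")
    if cert_days_left < config.min_cert_days:
        failures.append(f"SSL certificate has {cert_days_left} days left")
    if config.must_contain is not None and config.must_contain not in body:
        failures.append(f"content match failed: {config.must_contain!r} not found")
    return CheckResult(ok=not failures, response_seconds=response_seconds,
                       failures=failures)
```

A check passes only when the list of failures is empty, so a single result can report several problems at once (for example, a 502 status together with a missing content match).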
2. Incident Detection
When a check fails, RecoveryPulse waits for a second failure to confirm the issue isn't transient. After two consecutive failures:
- An incident is created with full details
- The site status is marked as "down"
- Notifications are sent (if configured)
- Auto-recovery begins (if enabled)
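The two-failure confirmation rule can be sketched as a small counter that resets on every passing check. This is an illustration under stated assumptions rather than RecoveryPulse's own code: the class name `IncidentDetector` and its methods are hypothetical, and marking the site "up" again on the first passing check is an assumption the text above does not spell out.

```python
class IncidentDetector:
    """Opens an incident only after N consecutive failed checks (2, per the text)."""

    def __init__(self, failures_to_confirm: int = 2):
        self.failures_to_confirm = failures_to_confirm
        self.consecutive_failures = 0
        self.status = "up"

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True only when a new incident opens."""
        if check_passed:
            # Assumption: any passing check clears the failure streak.
            self.consecutive_failures = 0
            self.status = "up"
            return False
        self.consecutive_failures += 1
        confirmed = self.consecutive_failures >= self.failures_to_confirm
        if confirmed and self.status != "down":
            self.status = "down"
            return True  # caller would create the incident, notify, start recovery
        return False
```

Because the counter resets on success, a fail/pass/fail sequence never opens an incident, which is the point of waiting for a second consecutive failure.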
3. Automated Recovery
RecoveryPulse connects to your server via SSH and executes recovery actions in order. After each action, it checks whether the site is back online before moving on to the next one.
- Restart application service
- Wait 30 seconds, check site
- If still down: restart web server
- Wait 30 seconds, check site
- If still down: restart database
- Continue until recovered or max attempts reached
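The escalation sequence above amounts to a loop: run an action, wait, re-check, and stop as soon as the site responds. The sketch below is hypothetical and makes several assumptions: `run_recovery` and its parameters are invented names, each action counts as one attempt toward the limit, and the SSH step is abstracted behind a `run_action` callable so the example stays self-contained.

```python
import time
from typing import Callable, Optional, Sequence


def run_recovery(actions: Sequence[str],
                 run_action: Callable[[str], object],
                 site_is_up: Callable[[], bool],
                 wait_seconds: float = 30,
                 max_attempts: int = 5,
                 sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Run actions in order; return the action that restored the site, or None."""
    # Assumption: each action consumes one attempt, so at most
    # max_attempts actions are tried before giving up.
    for action in list(actions)[:max_attempts]:
        run_action(action)      # in practice, executed on the server over SSH
        sleep(wait_seconds)     # give the service time to come back
        if site_is_up():
            return action       # recovered; skip the more disruptive actions
    return None                 # still down; manual intervention needed
```

Called with `["restart_app", "restart_nginx", "restart_mysql"]`, the loop stops after `restart_nginx` if the site comes back at that point, and the database is never restarted.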
Available Recovery Actions
| Action | Description | When to Use |
|---|---|---|
| restart_app | Restarts your application service via systemd | First action for most issues |
| restart_nginx | Restarts the nginx web server | 502/504 errors, proxy issues |
| restart_apache | Restarts the Apache web server | Apache-based setups |
| restart_mysql | Restarts the MySQL database | Database connection errors |
| restart_postgresql | Restarts the PostgreSQL database | PostgreSQL setups |
| restart_php_fpm | Restarts the PHP-FPM service | PHP applications |
| clear_nginx_cache | Clears the nginx cache and reloads | Stale cache issues |
| rollback_nginx_config | Restores nginx.conf from backup | After config changes |
| reboot_server | Full server reboot | Last resort |
| custom_script | Runs any custom command | Special recovery needs |
Best Practices
Rule Order
Start with least disruptive actions (app restart) and escalate to more drastic measures (server reboot) only if needed.
Wait Times
Give services enough time to restart fully before re-checking the site. 30 seconds is sufficient for most services.
Max Attempts
Set a reasonable limit (5-10) to prevent infinite recovery loops. Some issues need manual intervention.
SSH Security
Use a dedicated SSH key for RecoveryPulse with limited sudo permissions for only the commands you need.