Engineering · 10 min read

One OpenClaw Instance. Four Monitoring Channels. Zero 3 AM Phone Calls in Six Weeks.

[Illustration: an octopus monitoring server dashboards at 3 AM while a lobster sleeps peacefully with headphones nearby]

Here's exactly how we set it up.

Last year we had a pretty standard ops setup. Datadog for metrics, PagerDuty for routing, a four-person on-call rotation.

It worked. Mostly.

The problem wasn't the tools. The problem was the 3 AM calls about things that didn't need a human.

Connection pool spikes that resolve themselves. Disk space warnings on a volume with auto-expansion. A brief latency blip from a deployment that was already rolling back.

Our on-call engineer would wake up, squint at their phone, open a laptop, check the dashboard, confirm it was nothing, and go back to sleep.

Two to three times a week. For months.

The Thing Everyone Does (That Doesn't Work)

The standard playbook: tune your alerts.

Raise thresholds. Add cooldown windows. Create runbooks. Maybe hire a dedicated night-shift SRE if the budget allows.

We tried all of that. Tuning helped — fewer false positives, sure.

But you can't tune your way out of a fundamental problem: most alerts need triage, not action. Someone has to look at the context, check if it's real, and decide whether to escalate.

That "someone" was always a human. At 3 AM. On a Tuesday.

What Changed

We started using OpenClaw — an open-source, self-hosted AI gateway — as an always-on ops agent.

Not as a replacement for monitoring. As an intelligent layer between our monitoring stack and our team.

The architecture is simple.

One OpenClaw instance runs on a small VPS. It connects to four channels: Slack (engineering team), Telegram (on-call lead), email (audit trails), and Discord (broader ops group).

The agent doesn't "monitor" anything directly. It uses two mechanisms OpenClaw provides out of the box: cron jobs and heartbeats.

Cron: The Precision Instrument

OpenClaw has a built-in scheduler that persists jobs to disk and survives restarts. You define a schedule, a prompt, and where to deliver the output.

We set up four cron jobs:

1. The health check (every 15 minutes)

An isolated cron job that hits our internal status endpoint, parses the JSON, and reports anomalies. If everything is green, it stays silent.

openclaw cron add \
  --name "Infra health" \
  --every "15m" \
  --session isolated \
  --message "Fetch https://status.internal/api/health. \
    If all services are healthy, reply HEARTBEAT_OK. \
    If any service is degraded, summarize which ones \
    and since when." \
  --announce \
  --channel slack \
  --to "channel:C0ALERTCHAN"

2. The morning briefing (daily at 7 AM)

A summary of overnight events. What fired, what resolved on its own, what needs human attention today.

openclaw cron add \
  --name "Morning ops brief" \
  --cron "0 7 * * *" \
  --tz "Europe/Berlin" \
  --session isolated \
  --message "Summarize overnight monitoring events. \
    List alerts that fired, their resolution status, \
    and any items requiring human follow-up today." \
  --model opus \
  --announce \
  --channel slack \
  --to "channel:C0OPSCHAN"

3. The weekly trend report (Mondays at 9 AM)

Uses a more powerful model to spot patterns across a week of incident data.

openclaw cron add \
  --name "Weekly ops trends" \
  --cron "0 9 * * 1" \
  --tz "Europe/Berlin" \
  --session isolated \
  --message "Analyze this week's ops incidents. \
    Identify recurring patterns, services with \
    increasing error rates, and suggest preventive \
    actions." \
  --model opus \
  --thinking high \
  --announce \
  --channel slack \
  --to "channel:C0OPSCHAN"

4. The escalation relay (via webhook)

When a critical alert fires, our monitoring tool hits an OpenClaw webhook. The agent triages it — checks recent context, correlates with known issues — and decides whether to wake the on-call lead via Telegram or just log it.

Each runs in an isolated session. No shared history. No context leakage. Clean runs every time.
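The escalation relay's decision boils down to a few checks. Here's a minimal Python sketch of that logic — not OpenClaw code (the agent applies equivalent reasoning from its prompt), and the payload fields and known-transient fingerprints are illustrative assumptions:

```python
# Illustrative triage logic for the escalation relay. Not OpenClaw
# code: the agent reaches equivalent decisions from its prompt.
# Payload fields and the known-transient list are assumptions.

KNOWN_TRANSIENTS = {"connection-pool-spike", "deploy-latency-blip"}

def triage(alert: dict) -> str:
    """Return 'escalate' to wake the on-call lead, else 'log'."""
    if alert.get("severity") != "critical":
        return "log"  # non-critical alerts never wake anyone
    if alert.get("fingerprint") in KNOWN_TRANSIENTS:
        return "log"  # known self-resolving pattern
    if alert.get("acknowledged"):
        return "log"  # someone already acked it in Slack
    return "escalate"

print(triage({"severity": "critical", "fingerprint": "db-replica-lag"}))
# -> escalate
```

The interesting part in production is the middle check: the agent correlates against recent context rather than a hard-coded list, which is exactly what static alert rules can't do.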

Heartbeat: The Awareness Layer

Cron handles the scheduled stuff. But ops isn't all scheduled.

OpenClaw's heartbeat system runs periodic agent turns in the main session. The agent wakes up every 30 minutes, checks if anything needs attention, and goes back to sleep if not.

We use it for ambient awareness. Our HEARTBEAT.md is tiny:

# Ops heartbeat

- Check #alerts for unacknowledged items older than 10 minutes
- If any critical alert is unacked for 30+ minutes, escalate via Telegram
- If nothing needs attention, reply HEARTBEAT_OK

The heartbeat has full main-session context. It remembers the last few hours of conversation.

So when it sees a new alert, it can check: "Did someone already acknowledge this in Slack five minutes ago?" If yes, it stays quiet.

That's the piece you can't get from static alert rules.

Multi-Channel Routing

Here's where it gets interesting.

OpenClaw routes replies back to the channel where a message came from. But you can also deliver across multiple channels at once.

Our setup:

  • Slack gets the detailed technical output. Full context, runbook links, related recent incidents.
  • Telegram gets the urgent stuff. Short and actionable. "DB replica lag > 30s on prod-east. Runbook: [link]. Ack in Slack to suppress escalation."
  • Email gets the audit trail. A daily digest of every decision the agent made.
  • Discord gets the weekly summary for the broader team.

One agent. Four channels. Each gets what it needs in the format it needs.

OpenClaw supports WhatsApp, Telegram, Slack, Discord, Signal, iMessage, and more — all from the same gateway process. No Zapier. No webhook middleware. One config file.
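For orientation, the channel section of that config looks something like the sketch below. Treat the shape as illustrative only (OpenClaw's actual schema may differ between versions), and the tokens are placeholders:

```json
{
  "channels": {
    "slack": { "botToken": "xoxb-PLACEHOLDER" },
    "telegram": { "botToken": "123456:PLACEHOLDER" },
    "discord": { "botToken": "PLACEHOLDER" }
  }
}
```

Each channel just needs its bot credentials; the gateway process handles connecting to all of them.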

What We Got

Six weeks in:

  • 3 AM phone calls went from ~3 per week to zero. The agent handles triage while the team sleeps. Real escalations (two in six weeks) still wake the on-call lead via Telegram.
  • Mean-time-to-acknowledge dropped from 14 minutes to under 30 seconds for automated triage.
  • Alert fatigue fell off a cliff. The team sees curated, contextualized summaries instead of raw alert spam. The #alerts channel went from 200+ messages a day to about 15 that matter.

This isn't hypothetical. This is production.

The Honest Caveats

This approach has limits.

The agent isn't running kubectl commands or restarting services. It's triaging, correlating, and routing. Actual remediation is still human (for now).

You need to be thoughtful about what you let it do. We run OpenClaw on our own infrastructure — it's MIT-licensed and self-hosted by design. Your monitoring data never leaves your network.

The model can get it wrong. We had one case where it misclassified a genuine disk issue as a known false positive because the symptoms looked similar. We caught it in the morning brief.

So we trust-but-verify. The agent handles the 2 AM noise. Humans review the morning summary. Remediation requires a human in the loop.

If You Want to Try This

The whole setup took about two hours. Most of that was deciding our escalation logic.

Here's the minimum:

  • Spin up an OpenClaw instance (getting started). A small VPS or even a Raspberry Pi works.
  • Connect your channels. Slack + Telegram is a good starting pair. Each takes about five minutes to configure.
  • Add your first cron job. Start with a simple health check that hits an endpoint and reports anomalies. You'll iterate from there.

The cron docs and heartbeat docs cover the full API. The cron vs heartbeat guide helps you pick which to use for what.

The Bottom Line

Your on-call team is burning hours on triage that an agent can handle in seconds.

The tools exist. They're open source. They run on your hardware.

The question isn't whether AI can handle overnight ops triage. It's why your team is still doing it manually.