Fleet Monitor: Podman Compose–Powered Linux Server Monitoring
Posted: February 23, 2026 to Insights.
Monitoring a Linux fleet should be repeatable, transparent, and easy to automate. If you already use Podman to run containers on your hosts, it makes sense to use Podman Compose to stand up your observability stack too. Fleet Monitor is a practical blueprint for doing exactly that: a compact, composable set of containers that monitor system health, services, and applications across many servers with minimal friction and strong security defaults.
This post lays out a full architecture, explains why Podman is a strong fit, and walks through how to deploy, scale, and operate the stack. You’ll get real-world examples, tips for reliable alerts, and suggestions for extending the stack as your needs grow.
Why Podman for a monitoring stack
Podman’s design aligns well with how most organizations prefer to run observability on Linux:
- Rootless by default: Run containers under unprivileged users for reduced blast radius and simpler audits.
- Daemonless: No single long-lived daemon process. Each container is a child process, simplifying lifecycle management.
- Systemd integration: Generate systemd units and let the OS manage container services, logging, and restarts.
- Compatible orchestration surface: Podman Compose supports familiar docker-compose.yml syntax with many of the same semantics, while staying inside the Podman ecosystem.
- SELinux-aware: Labels and confinement options help lock down persistent volumes and mount points on SELinux-enabled distributions.
For many small and mid-sized fleets, this means you can keep your monitoring stack on a single, modest VM or bare-metal box, and still get the security and reliability you want without adding another orchestrator.
Core architecture overview
Fleet Monitor is built around a mature open-source toolchain that has proven itself in production at every scale. You can adopt it incrementally and swap components over time without deep lock-in.
Metrics collection and storage
- Prometheus: Time-series database and metrics scraper. It pulls metrics over HTTP from targets on a fixed interval, evaluates alerting rules, and stores recent history.
- Node Exporter: Agent on each Linux server that exposes host-level metrics (CPU, memory, disk, network, filesystems, systemd, and more).
- Podman exporter: A lightweight exporter that exposes per-container metrics (CPU, memory, block I/O, restarts) for containers running under Podman.
- (Optional) cAdvisor alternative: If you’ve standardized on cAdvisor elsewhere, note that it is typically used with Docker and containerd. For Podman, a dedicated Podman exporter or embedded Podman metrics endpoint is the better fit.
Dashboards and visualization
- Grafana: Flexible dashboarding with alerting, annotations, and a rich app ecosystem. Grafana reads from Prometheus and provides shared, templated dashboards for the whole fleet.
Alerting and notification
- Alertmanager: Routes, deduplicates, silences, and delivers alerts from Prometheus to email, Slack, PagerDuty, Opsgenie, and webhooks.
Logs and traces (optional, recommended)
- Loki: Log aggregation designed for efficiency. It pairs well with Prometheus’s label-centric model.
- Promtail: Lightweight log forwarder for journald and file logs on each server, with relabeling support.
Synthetic checks (optional)
- Blackbox Exporter: Probes endpoints (HTTP, TCP, ICMP) from one or more vantage points to catch user-visible failures.
Network and security model
A strong default approach is to run the stack rootless on a dedicated monitoring host. Use a dedicated non-privileged user such as “monitor” and keep long-lived state on bind-mounted directories owned by that user. When using SELinux, assign appropriate labels to volume directories so the containers can write safely without disabling SELinux globally.
Expose Grafana and Alertmanager via a reverse proxy on HTTPS (Caddy, Traefik, or Nginx), keeping Prometheus and exporter endpoints on private subnets or host firewalls. Use separate networks inside Podman Compose so the UI tier is reachable from users, while scrape targets remain internal.
On scrape targets (the Linux servers), run node_exporter and optional promtail as systemd units. Only expose their listen ports to your monitoring host’s IPs through host firewalls. If you cannot open ports across networks, consider SSH reverse tunnels or a small satellite Prometheus that remote_writes back to a central instance.
Deploying with Podman Compose
At its simplest, you can deploy Fleet Monitor on a single VM with persistent volumes. The following process is representative; adapt to your distro and conventions.
Prerequisites
- Podman and podman-compose installed.
- A service account (e.g., “monitor”) with a home directory on a filesystem with enough space for Prometheus and Loki data.
- Firewall rules that allow UI access (Grafana/Alertmanager) and inbound scrapes from the monitoring host to exporters on target servers.
Directory layout
- config/prometheus/ (prometheus.yml, rules/*.yml, file_sd/*.json)
- config/grafana/ (provisioning/datasources, provisioning/dashboards, dashboards/*.json)
- config/alertmanager/ (alertmanager.yml)
- data/prometheus/ (Prometheus TSDB)
- data/grafana/ (Grafana database)
- data/loki/ (if using Loki)
- compose.yml (Podman Compose file)
Compose services
Define services for Prometheus, Grafana, Alertmanager, and optional Loki. Mount the config and data directories as volumes. Use a user-defined bridge network for internal connectivity. Expose only Grafana and Alertmanager ports publicly; keep Prometheus UI internal or restricted to VPN/IP allowlist.
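A minimal compose.yml along these lines might look like the following sketch. Image tags, ports, and paths are illustrative assumptions; pin versions and adjust mounts to match your environment:

```yaml
# compose.yml — illustrative sketch; image tags and paths are assumptions
version: "3"

networks:
  monitoring:   # internal scrape/query network
  frontend:     # the TLS reverse proxy reaches Grafana here

services:
  prometheus:
    image: docker.io/prom/prometheus:v2.53.0
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d
      - --web.enable-lifecycle            # allows POST /-/reload
    volumes:
      - ./config/prometheus:/etc/prometheus:Z
      - ./data/prometheus:/prometheus:Z
    networks: [monitoring]

  grafana:
    image: docker.io/grafana/grafana:11.1.0
    volumes:
      - ./config/grafana/provisioning:/etc/grafana/provisioning:Z
      - ./data/grafana:/var/lib/grafana:Z
    networks: [monitoring, frontend]
    ports:
      - "127.0.0.1:3000:3000"   # publish only through the TLS reverse proxy

  alertmanager:
    image: docker.io/prom/alertmanager:v0.27.0
    volumes:
      - ./config/alertmanager:/etc/alertmanager:Z
    networks: [monitoring]
```

The `:Z` suffixes relabel the bind mounts for SELinux; they are safe here because each directory is dedicated to a single container.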
Bringing it up
- From the project directory, run podman-compose up -d to start all services.
- Verify container health logs with podman logs.
- Configure your reverse proxy to publish Grafana at HTTPS.
Configuring Prometheus to scrape your fleet
Prometheus works best when you declare targets explicitly and group them by role. Use file-based service discovery (file_sd) to decouple infrastructure inventory from Prometheus configuration. External automation tools (Ansible, Terraform, custom scripts) can update JSON target files without reloading Prometheus configuration each time.
Static and file-based discovery
- prometheus.yml references one or more file_sd_configs pointing to file patterns, e.g., config/prometheus/file_sd/node/*.json.
- Each JSON file lists targets (host:port) and labels such as job, environment, and region.
- Reload Prometheus with a POST to /-/reload (this requires the --web.enable-lifecycle flag) or by sending SIGHUP when rule files change; file_sd target updates are picked up automatically.
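In prometheus.yml, a file_sd-based scrape job for the node role looks like the following; the job name and file path are illustrative:

```yaml
scrape_configs:
  - job_name: node
    file_sd_configs:
      - files:
          - /etc/prometheus/file_sd/node/*.json
        refresh_interval: 1m   # how often Prometheus re-reads the target files
```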
Labels you’ll appreciate later
- job: The exporter role (node, podman, blackbox_http, etc.).
- env: Environment (prod, staging, dev).
- region/az: Physical or cloud locality to understand blast radius.
- team: Ownership for routing alerts.
- instance: Canonical hostname or FQDN (Prometheus sets one by default but make it consistent).
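A file_sd target file carrying these labels could look like the following; hostnames and label values are placeholders (job comes from the scrape config itself):

```json
[
  {
    "targets": ["web-01.example.internal:9100", "web-02.example.internal:9100"],
    "labels": {
      "env": "prod",
      "region": "eu-west",
      "team": "platform"
    }
  }
]
```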
Scrape hygiene
- Stick to a default 15s scrape interval with a 10s timeout. For low-power edge nodes, relax the interval to 30s.
- Keep federation and remote_write off at first; add later if cardinality or retention grows.
- Set external_labels on Prometheus for identity if you later aggregate metrics across instances.
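These defaults map directly onto the global section of prometheus.yml; the external label value is an illustrative placeholder:

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  external_labels:
    prometheus: fleet-monitor-main   # identifies this instance if you aggregate later
```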
Visualizing with Grafana
Grafana brings the fleet to life by turning labels into filters and providing reusable building blocks. Use provisioning to seed data sources (Prometheus and Loki) and import a few curated dashboards to start, then layer your own panels.
Dashboards that pay for themselves
- Node overview: CPU saturation, memory pressure (including cache/buffer context), disk usage and IOPS, network throughput, top processes by CPU and RSS.
- Container overview: Podman exporter metrics showing per-container CPU/memory, restart counts, and throttling.
- Filesystem health: Per-mount usage, inode exhaustion risk, filesystem errors (exposed via node_exporter collectors).
- Latency and reachability: Blackbox HTTP/ICMP dashboards mapping failures by region or environment.
Practical templating
- Templated variables: environment, region, job, and instance. Save the defaults but let teams bookmark filtered views.
- Annotations: Pull Alertmanager events into Grafana for context during incidents.
- Links: Add panel links to runbooks or service pages that map from labels to documentation.
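Data source provisioning is a small YAML file under provisioning/datasources. The URLs below assume the Compose service names resolve on the internal network, which is the usual podman-compose behavior:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # Compose service name on the monitoring network
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```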
Monitoring Linux servers across the fleet
The minimal agent footprint is node_exporter and (optionally) promtail. Both can run as systemd services with little to no custom configuration. Keep the listening interfaces locked down and enable only the collectors you need.
Node Exporter on each server
- Install from your distro’s package repo or vendor tarball.
- Run it as an unprivileged user, listening on port 9100 and bound to a private interface.
- SELinux: label the binary and unit file directories as appropriate if enforcing mode is on.
- Collectors: default is fine; consider enabling systemd collector to track service failures.
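A typical systemd unit for node_exporter looks like the sketch below; the binary path, user, and listen address are assumptions to adapt to your install:

```ini
# /etc/systemd/system/node_exporter.service — illustrative unit
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
Wants=network-online.target

[Service]
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=10.0.0.5:9100 \
  --collector.systemd
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now node_exporter` and confirm metrics with a curl to the listen address from the monitoring host.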
Promtail for logs (to Loki)
- Use the journald target to capture system logs with labels for unit, priority, and hostname.
- Scrape application log files with per-path pipelines (multiline parsing for stack traces).
- Relay to a central Loki over HTTPS with basic auth or tokens.
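A minimal promtail config covering the journald case might look like this; the Loki URL and credentials are placeholders:

```yaml
# promtail-config.yml — sketch; URL and auth are placeholders
server:
  http_listen_port: 9080

positions:
  filename: /var/lib/promtail/positions.yaml   # resume point across restarts

clients:
  - url: https://loki.example.internal/loki/api/v1/push
    basic_auth:
      username: promtail
      password_file: /etc/promtail/password

scrape_configs:
  - job_name: journal
    journal:
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
      - source_labels: ['__journal__hostname']
        target_label: hostname
```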
Exporting Podman container metrics
- Deploy a Podman exporter on hosts that run containers. It discovers containers via Podman’s API or sockets and exposes metrics on a dedicated port.
- Label containers (e.g., com.example.team, com.example.service) and propagate these as Prometheus labels through relabeling.
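Exactly how container labels surface depends on the exporter you choose. Assuming it exposes them in the cAdvisor-style `container_label_*` form (verify against your exporter's /metrics output), a metric_relabel_configs block can promote them to clean Prometheus labels:

```yaml
scrape_configs:
  - job_name: podman
    file_sd_configs:
      - files: ['/etc/prometheus/file_sd/podman/*.json']
    metric_relabel_configs:
      # Assumes cAdvisor-style container_label_* metric labels — check
      # your exporter's actual output before relying on these names.
      - source_labels: ['container_label_com_example_team']
        target_label: team
      - source_labels: ['container_label_com_example_service']
        target_label: service
```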
Alerts that matter
Alert overload kills signal. Start with high-confidence, actionable alerts, then expand carefully. Ensure every alert has owners, runbooks, and clear thresholds.
Essential host alerts
- Instance down: Prometheus reports the target down (up == 0) for 5 minutes, tagged by environment and region to assess impact.
- Filesystem usage: Predictive alert when time-to-full is under a few days (based on growth rate), and a hard alert at 90–95% usage.
- Memory pressure: High working set and page fault rates, not just “low free memory.” Focus on sustained major page faults and swap activity.
- CPU saturation: High CPU usage plus run queue length over cores for N intervals.
- Systemd unit failures: New failed units detected by node_exporter’s systemd collector.
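The first two host alerts above translate into rules like the following; thresholds and severities are illustrative starting points, not recommendations for every fleet:

```yaml
# rules/hosts.yml — starting-point rules; thresholds are illustrative
groups:
  - name: hosts
    rules:
      - alert: InstanceDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} ({{ $labels.env }}/{{ $labels.region }}) is unreachable"

      - alert: FilesystemAlmostFull
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes) < 0.10
        for: 15m
        labels:
          severity: warn
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} has under 10% free"

      - alert: FilesystemFullInFourDays
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h],
                         4 * 86400) < 0
        for: 1h
        labels:
          severity: warn
```

The predict_linear rule implements the "time-to-full" alert: it extrapolates the last six hours of growth four days forward and fires if the projection crosses zero.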
Container and service alerts
- Container restarts: Sudden bursts of restarts in 5–10 minutes.
- Throttling: CPU throttled periods exceeding a safe budget, hinting at mis-sized cgroups.
- Blackbox probes: SLA-based latency SLO breaches and 5xx/connection failures for public endpoints.
Alert routing that reduces noise
- Route by team and service with labels team and service.
- Group alerts by cluster/region to collapse stormy symptoms.
- Silence during maintenance with labels or schedules in Alertmanager.
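In alertmanager.yml, that routing policy looks roughly like this; receiver names, the Slack webhook, and the PagerDuty key are placeholders:

```yaml
# alertmanager.yml — sketch; receivers and credentials are placeholders
route:
  receiver: default
  group_by: ['alertname', 'region']   # collapse storms per region
  routes:
    - matchers: ['team="platform"']
      receiver: platform-slack
    - matchers: ['severity="page"']
      receiver: oncall-pagerduty

receivers:
  - name: default
    email_configs:
      - to: ops@example.internal
  - name: platform-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME
        channel: '#platform-alerts'
  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_ME
```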
Operating the stack day to day
Observability is software you operate. A few operational habits keep Fleet Monitor reliable and predictable.
Upgrades
- Pin images to known-good tags and bump in controlled batches. Keep release notes handy for Prometheus and Grafana breaking changes.
- Use rolling restarts for exporters; they are stateless and quick to recover.
Backups and retention
- Prometheus TSDB: Rely on redundancy and remote_write for long-term durability rather than file-level backups. If you must back up locally, use the TSDB snapshot API (requires --web.enable-admin-api) or stop Prometheus first.
- Grafana: Back up the SQLite/PG DB and the dashboards/provisioning directories regularly.
- Loki: If running monolith, back up its index/chunks if durability is a requirement; otherwise accept it as a best-effort cache and route critical logs to archival storage too.
Capacity planning
- Cardinality: Watch the number of active series. Labels that explode (like per-request IDs) should be avoided at ingestion time.
- Retention: Start with 15–30 days of Prometheus retention. Increase only when necessary; older data can live in a cheaper long-term store.
- Storage: For Prometheus, favor fast local SSD. Isolate the data directory from noisy neighbors.
Troubleshooting checklist
When something’s off, check these first:
- Targets are “down”: Firewall or listen address mismatches. Verify that node_exporter binds to the interface you scrape from.
- No metrics after upgrade: Exporter flags changed. Compare running flags to previous unit files.
- High scrape durations: Network latency, overloaded targets, or expensive exporter collectors. Reduce scrape concurrency or disable heavy collectors.
- Missing container metrics: Podman exporter cannot access the Podman socket. Ensure the exporter’s user is in the right group and mounts the socket correctly.
- SELinux denials: Check audit logs and add proper labels to volume paths. Remember that :Z applies a private per-container label while :z applies a shared one, so avoid :Z on volumes used by more than one container.
Extending the stack
As your fleet grows or your services diversify, add exporters and data paths incrementally:
- Database exporters: Postgres, MySQL, Redis for internal KPIs and health checks.
- Web servers: Nginx/Apache exporters to see request rates, latencies, and saturation.
- SNMP exporter: Network devices, UPSes, and PDUs for power and link health.
- Blackbox multi-region: Run blackbox exporters in multiple regions to compare user experience.
- Remote_write: Ship a subset of metrics to a long-term store like VictoriaMetrics or Thanos for historical analysis.
Security hardening
Monitoring tools can reveal sensitive metadata. Secure them like production systems.
- Transport security: Put Grafana, Alertmanager, and any public Prometheus endpoints behind TLS. If you keep Prometheus UI internal-only, restrict by firewall and VPN.
- AuthN/AuthZ: Use Grafana’s OIDC/SAML with your identity provider. Restrict admin privileges and use folders and teams for dashboard permissions.
- Network segmentation: Separate the scrape network from user-facing UI. Block exporter ports at borders and only allow from the monitoring host(s).
- Secrets management: Avoid embedding credentials in compose files. Use environment files, secrets support, or one-time mounts. Scope tokens to read-only where possible.
- Container profiles: Drop capabilities, use read-only filesystems where possible, and set resource limits to prevent runaway containers.
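Those container hardening measures map to a Compose service fragment like the one below; key support varies somewhat by podman-compose version, so treat this as a sketch to verify against your setup:

```yaml
# Hardened service fragment for compose.yml — illustrative
services:
  prometheus:
    image: docker.io/prom/prometheus:v2.53.0
    read_only: true                 # root filesystem is immutable
    cap_drop:
      - ALL                         # Prometheus needs no extra capabilities
    security_opt:
      - no-new-privileges:true
    mem_limit: 2g                   # cap a runaway container; size to your fleet
    tmpfs:
      - /tmp
    volumes:
      - ./data/prometheus:/prometheus:Z   # the only writable bind mount
```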
Alternative topologies
While a single-host stack is perfect for many teams, you can grow horizontally without rearchitecting.
- Centralized Prometheus with remote exporters: Standard model, simple to run, low moving parts.
- Hierarchical Prometheus: Region-level Prometheus instances scrape local targets and remote_write a subset to a global Prometheus for cross-region dashboards.
- Edge satellites: Tiny Prometheus instances at edge sites scrape locally (resilient to WAN outages) and ship condensed metrics upstream when links are healthy.
- High availability: Run two Prometheus instances scraping the same targets and deduplicate in Alertmanager; keep Grafana reading from both or a gateway.
Real-world examples
Example 1: Mixed on-prem and cloud
A small team runs 40 on-prem servers and 25 cloud instances. They deploy Fleet Monitor on a single VM with 4 vCPU and 32 GB RAM, SSD-backed. Node exporters run on every host, and Prometheus scrapes over private links. A blackbox exporter probes public endpoints from two regions. Alerts route to Slack during work hours and PagerDuty after hours. They use 15 days of Prometheus retention locally and remote_write critical service KPIs to a compact, long-term VictoriaMetrics instance for 180-day trending. The operations team curates three core dashboards; service teams maintain their own panels in dedicated Grafana folders.
Example 2: Edge devices on shaky links
An industrial integrator monitors 120 edge nodes (small fanless PCs) across factories connected via unreliable cellular links. Each factory hosts a tiny Prometheus and node exporters on all nodes. The factory-level Prometheus keeps 7 days of data and federates a low-cardinality subset upstream nightly. Alerts are handled locally where possible (SMS or on-site displays) to avoid missing incidents during outages. Grafana in the cloud aggregates high-level KPIs per factory to evaluate uptime SLAs.
Example 3: Security-first environment
A research lab has strict segmentation. No incoming connections to research nodes are allowed. Node exporters are bound to loopback only. Instead, each node runs a small agent that periodically pushes a snapshot of metrics to a bridge host via SSH, which then transforms them into a local exporter endpoint scraped by Prometheus. While pull-based scraping is the norm, this push-bridge pattern meets their policy while preserving the label model used throughout Fleet Monitor. Logs flow via promtail over mutually authenticated TLS to a central Loki instance.
Cost and performance considerations
- Scrape intervals: Aggressive scrapes increase cardinality and disk churn. For infrastructure metrics, 15s is typically enough granularity. Per-request application metrics can often be aggregated into histograms at 30–60s windows.
- Label discipline: Reserve high-cardinality labels for logs or traces, not metrics. Pre-aggregate metrics at the application when necessary.
- Storage classes: Use local SSD for Prometheus; network-attached storage can add latency and variance in compactions.
- Federation vs. remote_write: Use remote_write to centralize long-term retention; use federation to curate a model of “what matters” for cross-site dashboards and alerts.
Implementation blueprint
Step 1: Prepare the monitoring host
- Create a “monitor” user with limited sudo, dedicated home on SSD-backed storage.
- Install Podman and podman-compose; enable lingering for the user if you plan rootless systemd units.
- Create directories for configs and data; set ownership and SELinux labels where needed.
Step 2: Author configs
- Prometheus: prometheus.yml with scrape and evaluation intervals, alerting rules, and file_sd references.
- Alertmanager: Routes by team/env; set receivers for Slack/email/PagerDuty; define inhibition rules to prevent alert storms.
- Grafana provisioning: One Prometheus data source, one Loki data source; bootstrap a “Fleet” folder with a node overview dashboard.
- (Optional) Loki: Single-binary config with filesystem storage for pilot deployments.
Step 3: Define Compose services
- Prometheus service mounts config and data directories, sets proper command-line flags (retention, TSDB paths, external labels), and joins a dedicated network.
- Grafana service mounts provisioning and data directories; expose port 3000 internally and publish through a TLS-terminating reverse proxy.
- Alertmanager service mounts alertmanager.yml and a data dir for silences; expose only internally and link Grafana to it.
- Loki service (if used) with a single volume; promtail agents configured to send to it over HTTPS.
Step 4: Deploy exporters
- Node exporter on each server via your configuration manager; firewall rules allow 9100 from the monitoring host.
- Podman exporter wherever containers run; mount the Podman socket read-only and ensure proper user/group permissions.
- Promtail on nodes where logs matter; forward journald and application logs with consistent labels for team, env, and service.
Step 5: Automate inventory
- Generate file_sd JSON from Ansible inventory or cloud APIs. Emit labels for env, team, region, and service.
- Run a scheduled job (cron or a systemd timer) to refresh inventory files and reload Prometheus when topology changes.
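The refresh job can be a short script. This Python sketch uses a hypothetical inline inventory and groups hosts with identical label sets into file_sd target groups; in practice the inventory would come from Ansible, a CMDB, or a cloud API, and the output would be written to the path prometheus.yml's file_sd_configs watches:

```python
import json

def render_file_sd(inventory, port=9100):
    """Group hosts with identical label sets into file_sd target groups."""
    groups = {}
    for item in inventory:
        labels = {k: v for k, v in item.items() if k != "host"}
        key = tuple(sorted(labels.items()))           # stable grouping key
        groups.setdefault(key, {"targets": [], "labels": labels})
        groups[key]["targets"].append("%s:%d" % (item["host"], port))
    return list(groups.values())

# Hypothetical inventory — replace with your Ansible/CMDB/cloud source.
inventory = [
    {"host": "web-01.example.internal", "env": "prod", "team": "platform", "region": "eu-west"},
    {"host": "web-02.example.internal", "env": "prod", "team": "platform", "region": "eu-west"},
    {"host": "db-01.example.internal", "env": "prod", "team": "data", "region": "eu-west"},
]

targets = render_file_sd(inventory)
# Write where file_sd_configs points, e.g.:
# with open("config/prometheus/file_sd/node/node.json", "w") as f:
#     json.dump(targets, f, indent=2)
print(json.dumps(targets, indent=2))
```

Because file_sd picks up changes automatically, the job only needs to trigger a Prometheus reload when rule files (not target files) change.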
Step 6: Curate dashboards and alerts
- Import a well-regarded Node Exporter dashboard as a starting point.
- Add panels that reflect your SLOs, not just system metrics. Tie panels to runbooks via links.
- Write alerting rules with thresholds and durations that match observed baselines. Tag each with owner and severity.
Step 7: Harden and observe the observer
- Put Grafana and Alertmanager behind HTTPS with OAuth-based login.
- Restrict Prometheus UI to a private network; monitor the monitor with a second lightweight watchdog (e.g., external probe that checks Grafana health).
- Set up log retention and alerts for the monitoring stack itself (disk space, container restarts, scrape failures).
Patterns for scale and resilience
- Sharding by responsibility: One Prometheus per domain (infrastructure, databases, applications) to reduce blast radius and cardinality collisions.
- Write-ahead log (WAL) tuning: If you see slow restarts after crashes, allocate enough IOPS and consider reducing scrape concurrency.
- Retention per tier: Short retention locally for freshness and real-time alerting; long-term retention in a cost-effective remote store for trends.
- HA-lite: Two Prometheus instances scraping the same targets with deduplication in Alertmanager and Grafana’s unified views.
Common gotchas and how to avoid them
- Exporter sprawl: Keep a registry of exporters and their owners. Avoid one-off exporters that nobody maintains.
- Dashboard drift: Version dashboards in Git. Promote changes via PRs and auto-provision them so UIs aren’t hand-edited snowflakes.
- Alert fatigue: Start with a small, curated set of alerts. Review and prune monthly. Add on-call previews and burn-rate-based alerts for SLOs.
- Inconsistent labeling: Define a label schema early and enforce it in exporters and relabeling rules.
- Missing runbooks: Require a runbook URL for any page-worthy alert.
From pilot to production
Roll out Fleet Monitor to a subset of servers first. Validate that dashboards answer real questions and that alerts are scarce but always correct. As confidence grows, expand coverage, introduce blackbox checks from multiple regions, and layer in application metrics. Keep the stack modest—Podman Compose keeps it understandable—yet durable with good storage and network hygiene. The result is a monitoring platform that feels native to Linux, scales with your fleet, and remains operationally boring in the best possible way.
Taking the Next Step
With Podman Compose as the backbone, Fleet Monitor delivers a lean, portable stack—Prometheus, Grafana, Alertmanager, and optional Loki—that feels native to Linux and is easy to reason about. By standardizing labels, automating discovery, and curating dashboards and alerts around SLOs, you get signal over noise and a system you can trust. Hardened endpoints, reproducible configs, and modest scaling patterns keep it resilient as your fleet grows. Start with a small pilot, iterate in Git, and expand as you validate real operational wins. Spin up the compose file, ship your first metrics, and let the data guide your next improvements.