There’s a specific kind of silence that should worry you. It’s the silence of a production system with no monitoring. No alerts, no dashboards, no logs being shipped anywhere useful. Everything appears fine - right up until a customer emails to say the app has been down for two hours and nobody noticed. The absence of bad news is not the same as the presence of good news. If you’re not measuring, you’re guessing.
The dashboard you check once and forget
Most teams do have something. A Grafana instance someone set up during a hackathon. A free-tier Datadog account with default dashboards nobody looks at. Maybe some CloudWatch alarms that fire so often they’ve been muted in the team Slack channel. This is worse than having nothing, because it creates the illusion of coverage. The team believes they have monitoring. What they actually have is decoration.
Useful observability isn’t about having dashboards. It’s about being able to answer a specific question when something goes wrong: what changed, when, and what did it affect? If your monitoring can’t help you answer that in under five minutes, it’s not monitoring - it’s furniture.
What a minimal viable stack actually looks like
You don’t need a six-figure observability platform to get this right. For most early-stage startups, three things cover ninety percent of the ground:
Structured logging. Your application should emit logs in a consistent, parseable format - JSON, not free-text strings that require regex archaeology to query. Ship them somewhere centralised. If grep on an SSH session is your log analysis strategy, you’re one bad deploy away from a very long night.
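As a sketch of what "consistent and parseable" means in practice, here's one-line-per-event JSON logging built on Python's standard `logging` module. The formatter class and the context fields (`request_id`, `duration_ms`) are illustrative choices, not a prescribed schema - the point is that every log line becomes a queryable object:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Carry through structured context passed via the `extra=` argument.
        for key in ("request_id", "user_id", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits something like:
# {"ts": "...", "level": "INFO", "logger": "checkout",
#  "msg": "payment captured", "request_id": "abc123", "duration_ms": 84}
logger.info("payment captured", extra={"request_id": "abc123", "duration_ms": 84})
```

Once every service emits lines like this, "show me all slow requests for user X" is a query, not an archaeology project.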
Metrics on the things that matter. Request latency, error rates, saturation of key resources. Not fifty dashboards tracking everything your cloud provider exposes - four or five panels that tell you whether the system is healthy right now. The RED method (Rate, Errors, Duration) is a solid starting point. If you can see those three things per service, you’re ahead of most startups.
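To make the RED method concrete, here's a minimal in-process tracker that produces exactly those three numbers over a sliding window. It's a sketch - in production you'd export these to Prometheus or a hosted equivalent rather than compute them in-process, and the class name and window size are arbitrary - but the three values it returns are the ones worth putting on those four or five panels:

```python
import time
from collections import deque

class RedMetrics:
    """Minimal RED (Rate, Errors, Duration) tracker over a sliding window."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, duration_ms, is_error)

    def record(self, duration_ms, error=False, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, duration_ms, error))
        self._evict(now)

    def _evict(self, now):
        # Drop samples older than the window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()

    def snapshot(self, now=None):
        now = time.monotonic() if now is None else now
        self._evict(now)
        n = len(self.samples)
        if n == 0:
            return {"rate_per_s": 0.0, "error_rate": 0.0, "p95_ms": 0.0}
        durations = sorted(d for _, d, _ in self.samples)
        errors = sum(1 for _, _, e in self.samples if e)
        return {
            "rate_per_s": n / self.window,           # Rate
            "error_rate": errors / n,                # Errors
            "p95_ms": durations[min(n - 1, int(n * 0.95))],  # Duration
        }
```

Wrap `record()` around each request handler and `snapshot()` gives you a per-service health check in one call.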
Alerting that means something. Every alert should require a human decision. If an alert fires and the correct response is “ignore it,” delete the alert. Alert fatigue is the fastest way to ensure your team ignores the one notification that actually matters. Page on symptoms, not causes. “Error rate above 5% for five minutes” is actionable. “CPU at 80%” usually isn’t.
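The "above 5% for five minutes" shape is worth spelling out, because it's what separates a page from noise: a single bad scrape shouldn't wake anyone, a sustained breach should. As a sketch (the class name and thresholds are illustrative - alerting systems like Prometheus express the same idea with a `for:` duration on the rule):

```python
class SustainedAlert:
    """Fire only when a symptom stays above threshold for `for_seconds`."""

    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.breach_started = None  # when the condition first held, or None

    def evaluate(self, value, now):
        if value <= self.threshold:
            self.breach_started = None  # condition cleared; reset the timer
            return False
        if self.breach_started is None:
            self.breach_started = now   # breach begins; start counting
        return now - self.breach_started >= self.for_seconds

# "Error rate above 5% for five minutes" as a rule:
error_alert = SustainedAlert(threshold=0.05, for_seconds=300)
```

A spike at one scrape returns `False`; only five continuous minutes above 5% pages a human - which is exactly the property that keeps the alert worth keeping.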
The cost of not knowing
The real expense of poor observability isn’t the tooling you didn’t buy. It’s the engineering time burned on every incident. Without proper instrumentation, debugging becomes a manual, heroic effort - someone SSHing into boxes, tailing log files, trying to reconstruct a timeline from memory and Slack messages. A problem that should take twenty minutes to diagnose takes half a day. Multiply that across every incident, every quarter, and you’ve spent more on firefighting than the monitoring would have cost.
There’s a subtler cost too. Without observability, you can’t make informed decisions about your infrastructure. You don’t know which services are over-provisioned, which endpoints are slow, or where your actual bottlenecks are. Every capacity decision becomes a guess, and guesses trend expensive - teams over-provision because they’re afraid of what they can’t see.
Start with questions, not tools
The mistake most teams make is starting with a tool and hoping insight follows. Start with the questions instead. What do we need to know when something breaks? What does “healthy” look like for each service? Who needs to be alerted, and about what? Answer those first, and the tooling decisions become obvious. The best observability setup is the one your team actually uses - not the one with the most features.
If your current setup amounts to “we’ll check the logs if someone complains,” that’s a gap worth closing before it closes itself. Get in touch - we can help you figure out what to measure and how to make it useful.