This post summarizes what I learnt about Datadog, a third-party monitoring service.
- Presents multiple dashboards to keep an eye on how well everything’s running
- Parsing of logs from Heroku, JMX
- Using analytics from Datadog to help build self-healing systems
- Agent-based monitoring, arbitrary configuration of log sources
- dogstatsd - Aggregation of metrics and stats
- Log collector: logs are forwarded to Datadog’s back end
- Can execute specific commands to collect information, run health checks, …
- Events and metrics can be tagged arbitrarily for organization’s sake
- Usage rate: roughly 4.5 million metrics collected per hour, i.e. about 1,250 metrics per second
- Integrations for other products (Slack, PagerDuty, et al.).
- Monitors are built with a visual construction kit:
- Dropdowns, pick-and-place.
- Graphs, charts
- Timeboards and Screenboards
- Averages, time between events, events per unit of time, above/below (or equal to) a threshold
- How to treat those metrics (sums, differences, averages)
- Types of alerts
- Basic widgets are used to construct the dashboards. Each widget is configurable.
- Basic health checks (OK/NOT OK) are possible.
- Application tracing - a Python library that you can include in your application; it adds native Datadog monitoring hooks.
- Only in the us-east-1 region.
- Some reliability issues (could be AWS, could be them). They’re working on it.
- If an app doesn’t come up, it doesn’t send an event, so they don’t know that something went wrong.
- Because all teams are generating events, it’s hard to figure out what’s actually going on. Sometimes the tags don’t make sense…
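
To make the dogstatsd and tagging notes above more concrete, here is a minimal sketch of how a tagged metric can be built and sent to a locally running agent using only the Python standard library. It assumes the agent listens for dogstatsd traffic over UDP on `127.0.0.1:8125` (the usual default, but check your own setup). The helper names `format_metric` and `send_metric`, the metric name, and the tags are hypothetical and only for illustration; in practice you would more likely use Datadog's own client library.

```python
import socket


def format_metric(name, value, metric_type="c", tags=None, sample_rate=None):
    """Build a dogstatsd-style datagram: name:value|type|@rate|#tag1,tag2.

    Hypothetical helper for illustration; not part of any Datadog library.
    """
    parts = [f"{name}:{value}", metric_type]
    if sample_rate is not None:
        parts.append(f"@{sample_rate}")
    if tags:
        # Tags are comma-separated; "key:value" is a convention, not a requirement.
        parts.append("#" + ",".join(tags))
    return "|".join(parts)


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send one datagram over UDP (fire and forget, no delivery guarantee).

    Host and port are assumptions about where the agent is listening.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Count one request, tagged so it can be filtered by environment and team.
    datagram = format_metric(
        "web.requests", 1, "c", tags=["env:prod", "team:payments"]
    )
    send_metric(datagram)
```

Agreeing on a small, fixed set of tag keys across teams (for example `env` and `team`, as assumed here) is one way to reduce the "tags don't make sense" problem noted above, since every event can then be filtered the same way.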