Personal Programming Notes

To err is human; to debug, divine.

Datadog

This post summarizes what I learnt about Datadog, a third-party monitoring service.

Features

  • Presents multiple dashboards to keep an eye on how well everything’s running
  • Parsing of logs from Heroku, JMX
  • Using analytics from Datadog to help build self-healing systems
  • Agent-based monitoring, arbitrary configuration of log sources
    • dogstatsd - Aggregation of metrics and stats
    • Log collector: logs are forwarded to Datadog’s back end
  • The agent can execute specific commands to collect information, run health checks, …
  • Events and metrics can be tagged arbitrarily for organization’s sake
  • Usage rate: roughly 4.5 million metrics collected per hour, or about 1,250 metrics per second
  • Integrations for other products (Slack, PagerDuty, and others).
  • Monitors are built with a visual construction kit:
    • Dropdowns, pick-and-place.
    • Graphs, charts
    • Timeboards and Screenboards
    • Averages, time between events, events/unit of time, above/below (and/or equal to)
    • How to treat those metrics (sums, differences, averages)
    • Types of alerts
  • Basic widgets are used to construct the dashboards. Each widget is configurable.
  • Basic health checks (OK/NOT OK) are possible.
  • Application tracing: a Python library that you include in your application to add native Datadog monitoring hooks.
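
To make the dogstatsd and tagging bullets more concrete, here is a minimal sketch of what a tagged metric looks like on the wire. It assumes the plain-text DogStatsD datagram format (`name:value|type|@rate|#tags`) and a local agent listening on UDP port 8125, which is the usual default. The metric names and tags are made up for illustration, and in practice you would more likely use Datadog's own client library than raw sockets.

```python
import socket


def format_metric(name, value, metric_type="c", tags=None, sample_rate=None):
    """Build one DogStatsD datagram: name:value|type[|@rate][|#tag1,tag2]."""
    parts = [f"{name}:{value}", metric_type]
    if sample_rate is not None:
        parts.append(f"@{sample_rate}")
    if tags:
        # Tags are comma-separated, conventionally written as key:value.
        parts.append("#" + ",".join(tags))
    return "|".join(parts)


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to the agent (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Hypothetical counter ("c") tagged by environment and owning team.
    payload = format_metric(
        "checkout.requests", 1, "c", tags=["env:prod", "team:payments"]
    )
    print(payload)
    send_metric(payload)
```

Because the transport is UDP, the send does not block or fail if no agent is listening, which is convenient for application code but also means a lost datagram goes unnoticed.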

Notable shortcomings

  • Only in us-east-1 region.
  • Some reliability issues (could be AWS, could be them). They’re working on it.
  • If an app never comes up, it never sends an event, so nothing signals that something went wrong.
  • Because all teams are generating events, it’s hard to figure out what’s actually going on. Sometimes the tags don’t make sense…
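
The "app never came up" problem is usually tackled by alerting on silence rather than on events. The sketch below is not a Datadog feature or API, just an illustration of the idea under some assumptions: each service is expected to report a heartbeat timestamp, and `find_silent_services`, the service names, and the five-minute threshold are all hypothetical.

```python
import time


def find_silent_services(expected, last_seen, now=None, max_silence_s=300):
    """Return expected services with no heartbeat within max_silence_s seconds.

    expected:  iterable of service names that should be reporting
    last_seen: dict mapping service name -> Unix timestamp of last heartbeat
    """
    now = time.time() if now is None else now
    silent = []
    for service in expected:
        seen = last_seen.get(service)
        # A service that has never reported counts as silent too.
        if seen is None or now - seen > max_silence_s:
            silent.append(service)
    return sorted(silent)


if __name__ == "__main__":
    heartbeats = {"api": 1000, "worker": 600}
    print(find_silent_services({"api", "worker", "cron"}, heartbeats, now=1000))
```

The key design choice is that the list of expected services lives outside the apps themselves, so a service that never started still shows up as missing.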