What you could learn from ‘Practical Monitoring’ by Mike Julian (2017, 137 pages)
Modern software architecture makes finding out what is going wrong more complex than ever before. In his book ‘Practical Monitoring’, Mike Julian offers good advice on how to find and diagnose faults, failures and errors.
Monitoring anti-patterns
- Tool obsession – think about the mission and goal, rather than the tool or approach
- Monitoring is a job – everyone needs to consciously include monitoring in their roles. Think about monitoring when you design, build and run software
- Checkbox monitoring – make sure you define what ‘working’ means and monitor that
- Using monitoring as a crutch
- Manual configuration
Monitoring design patterns
Composable monitoring. Use multiple specialised tools and couple them loosely together to form a monitoring platform with the following components:
- Data collection – use counters or gauges
- Data storage – store time-series data in a TSDB; other data types should be stored differently
- Visualisation – show important data visually, making it easy to understand, access and filter
- Analytics and reporting – create reports to make sure services and third parties are living up to their
- Alerting – only alert on things you need to act on or make a decision about
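The counter and gauge distinction from the data collection bullet can be sketched in a few lines of Python. This is only an illustration, not code from the book: the class names `Counter` and `Gauge` and the metric names are made up for the example.

```python
class Counter:
    """A value that only ever goes up, e.g. total requests served."""

    def __init__(self) -> None:
        self.value = 0

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """A point-in-time reading that can go up or down, e.g. queue length."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


# Hypothetical metrics for illustration.
requests_total = Counter()
queue_length = Gauge()

for _ in range(3):
    requests_total.increment()  # one increment per request handled

queue_length.set(12)  # reading at one moment
queue_length.set(7)   # a later reading replaces the earlier one
```

A counter answers "how many so far?", while a gauge answers "what is it right now?".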
Monitoring from the user perspective. Users care whether the app works, not how many nodes you are running
- start monitoring where users interact with your code
- monitor response codes (especially 5xx codes)
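As a rough sketch of watching response codes, the function below works out what share of responses were 5xx server errors. The function name `error_rate` and the sample status codes are assumptions for illustration; in practice the codes would come from your web server or load balancer logs.

```python
from collections import Counter


def error_rate(status_codes):
    """Return the fraction of responses that were 5xx server errors."""
    if not status_codes:
        return 0.0
    # Group codes by their first digit: 2xx, 3xx, 4xx, 5xx.
    classes = Counter(code // 100 for code in status_codes)
    return classes[5] / len(status_codes)


# Made-up sample of recent responses.
sample = [200, 200, 503, 404, 500, 200, 201, 302]
rate = error_rate(sample)
```

Tracking the rate rather than the raw count keeps the number meaningful as traffic grows or shrinks.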
Buy not build.
- Do not build, unless you are Netflix, Google or Facebook, the overheads are huge
- You will not have the expertise
- You will not invest as much money in improvement as the companies that provide SaaS
- No really, use SaaS
- Realistically, you will need to re-architect your monitoring every 2 to 3 years
- Keep improving little and often
Monitoring and alerts. Differentiate between FYI and action
- FYI – something is working
- Action – someone needs to do something
- Top tips – stop using emails, write Runbooks, delete and tune alerts
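The FYI versus action split can be expressed as a simple routing rule. The sketch below is an assumption-laden illustration: the `Alert` class, the `route` function and the destination names `"pager"` and `"log"` are all hypothetical, not something the book prescribes.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    needs_action: bool
    runbook: str = ""  # hypothetical path to the runbook for this alert


def route(alert: Alert) -> str:
    """Pick a destination: interrupt a human only when action is required."""
    if alert.needs_action:
        return "pager"
    return "log"


# Hypothetical alerts for illustration.
disk_full = Alert("disk full", needs_action=True, runbook="runbooks/disk.md")
backup_ok = Alert("backup finished", needs_action=False)
```

Anything that lands in the "log" bucket and is never read is a good candidate for deletion when you tune your alerts.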
Runbooks. Write Runbooks for each of your services
- What is the service, what does it do?
- Who is responsible for it?
- What dependencies does it have?
- What does the infrastructure for it look like?
- What metrics and logs does it emit, and what do they mean?
- What alerts are set up for it and why?
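One way to keep Runbooks complete is to treat the questions above as required sections and check for gaps. This is a minimal sketch under my own assumptions: the section keys, the `missing_sections` helper and the draft content are invented for the example.

```python
# Section keys mirror the six questions above; the names are hypothetical.
REQUIRED_SECTIONS = [
    "description",
    "owner",
    "dependencies",
    "infrastructure",
    "metrics_and_logs",
    "alerts",
]


def missing_sections(runbook: dict) -> list:
    """Return the required sections that are absent or empty."""
    return [section for section in REQUIRED_SECTIONS if not runbook.get(section)]


# A made-up, half-finished runbook.
draft = {
    "description": "Sends order confirmation emails",
    "owner": "payments team",
    "dependencies": ["SMTP relay", "orders database"],
}

gaps = missing_sections(draft)
```

A check like this could run in CI so that a new service cannot ship with an incomplete Runbook.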
Monitoring the business
- Find the KPIs or OKRs that will drive success
- Monitor these by default
Monitor front end
- Monitor page load times for actual users
- Monitor JS or other framework exceptions
- Keep track of page load time with your CI system
- fit logging and monitoring to your apps by default
- do the basics first, request/response times, database read/write times
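"Doing the basics first" for request and database timings can be as small as a decorator. The sketch below is illustrative only: the `timed` decorator, the in-memory `timings` store and the `load_user` function are hypothetical, and a real setup would send the durations to a metrics backend instead of a dictionary.

```python
import functools
import time

# Hypothetical in-memory store: metric name -> list of durations in seconds.
timings = {}


def timed(name):
    """Record how long each call to the wrapped function takes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the call raised an exception.
                elapsed = time.perf_counter() - start
                timings.setdefault(name, []).append(elapsed)
        return wrapper
    return decorator


@timed("db.read")
def load_user(user_id):
    time.sleep(0.01)  # stand-in for a real database read
    return {"id": user_id}


load_user(42)
```

Because the timing lives in a decorator, it can be added to every handler and query function by default rather than bolted on later.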
Monitor your build and release pipelines
- when a deploy started and ended, which build it was and who deployed it
- see who/what keeps breaking the environment
- Heartbeat – frequently check that your apps are up
- Distributed tracing – tag every request with a request ID
- Distributed tracing is complex and tough to get right; only do this after instrumenting your apps with metrics and logs
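Tagging every request with a request ID is the simplest building block of tracing. Here is a minimal sketch, assuming the commonly used `X-Request-ID` header; the header choice and the `ensure_request_id` helper are my assumptions, not something the book specifies.

```python
import uuid

# A widely used convention, assumed here for illustration.
REQUEST_ID_HEADER = "X-Request-ID"


def ensure_request_id(headers: dict) -> dict:
    """Return a copy of headers carrying a request ID, reusing any existing one."""
    tagged = dict(headers)
    if not tagged.get(REQUEST_ID_HEADER):
        tagged[REQUEST_ID_HEADER] = str(uuid.uuid4())
    return tagged


# The first service to see the request assigns the ID...
incoming = ensure_request_id({"Accept": "application/json"})
# ...and every downstream call passes the same ID along.
outgoing = ensure_request_id(incoming)
```

If each service also writes the ID into its log lines, you can follow one request across services with a plain log search, well before adopting a full tracing system.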
Server monitoring. Automate the monitoring of all your servers/hosts
- CPU (% used)
- Memory (% used vs free)
- Load (how many processes are waiting to be served by the CPU)
- SSL certificates (especially expiration)
- Database servers (especially queries per second)
- Load balancers
- Message queues (queue length and consumption rate)
- Caching (hit/miss ratio)
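As one example from the list above, SSL certificate expiry is easy to turn into a number you can alert on. The sketch below uses Python's standard `ssl.cert_time_to_seconds` to parse a certificate's `notAfter` string; the `days_until_expiry` helper, the sample date and the 14-day threshold are assumptions for illustration.

```python
import ssl
from datetime import datetime, timezone

WARN_BELOW_DAYS = 14  # hypothetical alert threshold


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter time.

    `not_after` uses the format certificates report,
    e.g. "Jan 31 00:00:00 2024 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


# Fixed inputs so the example is repeatable.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
remaining = days_until_expiry("Jan 31 00:00:00 2024 GMT", now)
needs_renewal = remaining < WARN_BELOW_DAYS
```

In a real check the `notAfter` string would come from the certificate your server actually presents, and `now` would be the current time.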
I learnt a lot from this book.
- Be explicit on what you are monitoring, and why
- Create Runbooks to make it easy for anyone to help
- Invest in monitoring, and continue to invest in monitoring, but start with the basics first
You can buy Practical Monitoring from Amazon UK here