by Oliver
26. October 2017 10:00
Although choosing the best session was not easy, the award goes to Sasha Goldshtein for his session on performance detective work. The session was very well prepared, had a clear goal and path, was packed with insights into the challenges and problems of performance investigation work, and offered hand-crafted solutions such as etrace, his own real-time ETW event tracing tool.
What follows is an excerpt of Sasha Goldshtein's presentation.
Structure of a Performance Investigation
- Obtain the problem description
- Build a system diagram
- Run a quick performance checklist
- Understand which component is exhibiting the problem
- Investigate thoroughly
- Find the root cause
- Resolve the issue
- Verify resolution
- Conduct and document a postmortem
Performance Metrics, Goals, Monitoring
- Performance metrics don’t live in a vacuum!
- Derive performance metrics from business goals
- Monitor these metrics in your APM solution, home-made dashboard, or collection script, and get alerts
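As an illustration of deriving a metric from a business goal, here is a minimal sketch of a collection-script check. The goal itself ("99% of requests finish within 500 ms"), the function names, and the sample data are all assumptions made up for this example; they are not from the presentation.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def check_goal(latencies_ms, pct=99, limit_ms=500):
    """Compare an observed latency percentile against a hypothetical goal."""
    observed = percentile(latencies_ms, pct)
    return {
        "observed_ms": observed,
        "limit_ms": limit_ms,
        "alert": observed > limit_ms,  # raise an alert when the goal is missed
    }

if __name__ == "__main__":
    # 98 fast requests and 2 slow ones: the average looks fine, the goal is missed.
    samples = [100] * 98 + [900, 1200]
    print(check_goal(samples))
```

In a real setup the alert would feed a dashboard or paging system; the point is only that the threshold comes from the business goal, not from what the tool happens to report by default.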
Investigation Anti-Methods
- Make assumptions
- Trust “instincts” and irrational beliefs
- Look under the street light
- Use random tools
- Blame the tools
The USE Method
USE: Utilization, Saturation, Errors
- Build a functional diagram of the system, including hardware/software resources
- For each resource, identify utilization, saturation, and errors
- Understand, resolve, and verify errors, excessive saturation/utilization, under-utilization
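The USE checklist can be sketched as a small table walk: one record per resource, three numbers each, and a rule that flags what deserves a closer look. The `Resource` type, the threshold values, and the sample numbers below are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: float   # amount of queued work, e.g. run-queue length
    errors: int         # error count in the observation window

def use_findings(resources, high_util=0.8, low_util=0.05):
    """Return (resource name, finding) pairs for everything worth investigating.
    The thresholds are assumptions; pick values that fit your system."""
    findings = []
    for r in resources:
        if r.errors > 0:
            findings.append((r.name, "errors"))
        if r.saturation > 0:
            findings.append((r.name, "saturation"))
        if r.utilization >= high_util:
            findings.append((r.name, "high utilization"))
        elif r.utilization <= low_util:
            findings.append((r.name, "under-utilization"))
    return findings

if __name__ == "__main__":
    system = [
        Resource("cpu", utilization=0.95, saturation=3, errors=0),
        Resource("disk", utilization=0.30, saturation=0, errors=2),
        Resource("nic", utilization=0.01, saturation=0, errors=0),
    ]
    for name, finding in use_findings(system):
        print(f"{name}: {finding}")
```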
Statistics Lie: Be Careful With Statistics
- Averages are meaningless
- Medians are almost meaningless
- Percentiles are OK if you know what you’re doing
- Find good visualizations for your performance data
- Beware coordinated omission
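Coordinated omission happens when a load generator waits for a slow response before sending the next request: the requests that should have been sent during the stall are never measured, so the recorded data looks better than reality. One common correction, sketched here under the assumption that requests were meant to go out at a fixed interval, is to backfill the samples that went missing during each stall. HdrHistogram offers a similar correction; the function below is only a simplified stand-in with a hypothetical name.

```python
def correct_coordinated_omission(latencies_ms, expected_interval_ms):
    """Backfill the samples a stalled load generator failed to record.

    For every latency longer than the expected interval between requests,
    add the latencies that the skipped requests would have seen:
    value - interval, value - 2 * interval, ... down to the interval itself.
    """
    if expected_interval_ms <= 0:
        return list(latencies_ms)
    corrected = []
    for value in latencies_ms:
        corrected.append(value)
        missing = value - expected_interval_ms
        while missing >= expected_interval_ms:
            corrected.append(missing)
            missing -= expected_interval_ms
    return corrected

if __name__ == "__main__":
    # One request stalled for a full second while requests were due every 250 ms.
    recorded = [10, 10, 1000]
    print(correct_coordinated_omission(recorded, expected_interval_ms=250))
```

With the backfilled samples, one slow response no longer counts as a single outlier among many fast ones, which is closer to what users hitting the system at that moment would have experienced.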
Look at histograms, or sometimes even percentile plots (also known as cumulative distribution charts), to really understand your data, for example your performance traces. A dinosaur-shaped scatter plot makes the point well: very differently shaped data can lead to the same statistics values.
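To see the effect without a plotting library, here is a small sketch with two made-up latency data sets that share the same average but have very different shapes. The data and the `text_histogram` helper are assumptions for illustration only.

```python
import statistics

def text_histogram(samples, bucket_width):
    """Group samples into fixed-width buckets.
    Returns (bucket start, count) pairs sorted by bucket start."""
    counts = {}
    for s in samples:
        bucket = int(s // bucket_width) * bucket_width
        counts[bucket] = counts.get(bucket, 0) + 1
    return [(b, counts[b]) for b in sorted(counts)]

# Two hypothetical traces, in milliseconds, with identical averages.
steady = [50] * 100                # every request takes 50 ms
bimodal = [10] * 90 + [410] * 10   # most are fast, one in ten is very slow

if __name__ == "__main__":
    for name, data in (("steady", steady), ("bimodal", bimodal)):
        print(name, "mean:", statistics.mean(data),
              "median:", statistics.median(data))
        for start, count in text_histogram(data, bucket_width=100):
            print(f"  {start:>4} ms+ {'#' * (count // 5)} ({count})")
```

The averages match, yet only the histogram reveals that one data set hides a slow mode that every tenth request falls into.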
Conduct a Postmortem – Do It!
- Document the steps taken to identify, diagnose, resolve, and verify the problem
- Which tools did you use? Can they be improved?
- Where were the bottlenecks in your investigation?
- Can you add monitoring for sysadmins/ops?
- Can you add instrumentation for investigators?
- How do we triage this problem automatically next time it happens?
And now, happy performance hunting!