Sunday, January 8, 2012

Tips to perform RCA of a production crash

1) Concentrate

Concentration arises when the intellect directs the mind to the present action without meandering into unproductive channels of the past and future.

2) Be consistent

Consistency is when the intellect directs all thoughts and activities towards a goal.

3) Do not assume, expect the unexpected (Keep your mind open)


For example, in a bulk insert/update scenario, it is tempting to assume that the developer committed at regular intervals, rather than after every query or only once at the end of one big transaction. Verify it instead of assuming.
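To make the bulk insert case concrete, here is a minimal sketch of committing at regular intervals, so that you know what to look for in the code under investigation. It uses Python's built-in sqlite3 module purely for illustration; the `events` table, the `bulk_insert` helper and the batch size are assumptions, not something taken from a real system.

```python
import sqlite3

def bulk_insert(conn, rows, batch_size=1000):
    """Insert rows, committing every batch_size rows; return the number of commits."""
    commits = 0
    cur = conn.cursor()
    for i, row in enumerate(rows, start=1):
        # Hypothetical table, used only for this illustration.
        cur.execute("INSERT INTO events (id, payload) VALUES (?, ?)", row)
        if i % batch_size == 0:
            conn.commit()
            commits += 1
    # Commit the final partial batch if some rows are still pending.
    if conn.in_transaction:
        conn.commit()
        commits += 1
    return commits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
rows = [(n, f"payload-{n}") for n in range(2500)]
commit_count = bulk_insert(conn, rows, batch_size=1000)
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(commit_count, row_count)
```

If the real code commits after every row, or only once at the very end, that difference from what you assumed may itself be a clue.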

4) Analyze log

Check different logs for any traces, clues, etc. Many times we may not be able to find the root cause from a single log. But after analyzing several logs, the same events may repeat, and that repetition can reveal a trend which helps find the root cause.
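One way to look for such a trend is to count how often each kind of error appears and in which hour. The sketch below is only an illustration: the log line format, the `LINE_RE` pattern and the sample lines are assumptions, so adjust them to whatever your logs actually look like.

```python
import re
from collections import Counter

# Assumed line shape: "2012-01-08 10:15:02 ERROR message text"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2} (ERROR|FATAL) (.*)$")

def normalize(message):
    """Replace numbers so that similar messages collapse into one signature."""
    return re.sub(r"\d+", "<n>", message)

def error_trends(lines):
    """Count error lines per message signature and per hour."""
    by_signature = Counter()
    by_hour = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an ERROR/FATAL line in the assumed format
        day, hour, _level, message = m.groups()
        by_signature[normalize(message)] += 1
        by_hour[f"{day} {hour}:00"] += 1
    return by_signature, by_hour

# Made-up sample lines, standing in for lines read from several log files.
sample = [
    "2012-01-08 10:15:02 ERROR Timeout after 30 s on connection 17",
    "2012-01-08 10:47:40 ERROR Timeout after 30 s on connection 42",
    "2012-01-08 10:48:01 INFO Heartbeat ok",
    "2012-01-08 11:02:13 FATAL OutOfMemory in worker 3",
]
signatures, hours = error_trends(sample)
for signature, count in signatures.most_common():
    print(count, signature)
```

Feeding the lines of several log files into the same counters is what makes a repeating signature stand out.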

5) Some issues have patterns and trends and some do not - Monitor all layers, tiers, etc. Correlate. Generally it's easier to find the root cause if there are trends.
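Correlating can be as simple as putting the events of all tiers on one timeline and looking at what happened just before the crash. This is a rough sketch under assumptions: the tier names, the events and the 60 second window are invented for illustration, and each per-tier list is assumed to be sorted by time already.

```python
import heapq
from datetime import datetime, timedelta

def merged_timeline(*streams):
    """Merge per-tier event lists (each already sorted by time) into one timeline."""
    return list(heapq.merge(*streams, key=lambda event: event[0]))

def events_near(timeline, moment, window_seconds=60):
    """Return the events that fall within window_seconds of the given moment."""
    window = timedelta(seconds=window_seconds)
    return [event for event in timeline if abs(event[0] - moment) <= window]

# Made-up events as (timestamp, tier, message) tuples.
t = datetime(2012, 1, 8, 10, 0, 0)
web = [(t + timedelta(seconds=50), "web", "HTTP 500 spike")]
app = [(t + timedelta(seconds=30), "app", "thread pool exhausted")]
db = [
    (t + timedelta(seconds=5), "db", "lock wait timeout"),
    (t + timedelta(minutes=10), "db", "nightly backup started"),
]

timeline = merged_timeline(web, app, db)
crash = t + timedelta(seconds=55)  # hypothetical crash time
suspects = events_near(timeline, crash)
for when, tier, message in suspects:
    print(when.time(), tier, message)
```

Events that sit close together in time are only candidates for a relationship; they still need to be confirmed from the logs and from the people who know the system.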

6) Sometimes it makes sense not to change anything. If you do change something, keep a history of the changes for analysis - we may need to revert them.
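A history of changes does not need a special tool. As a minimal sketch, the journal below records each change with its old value, so that a revert plan can be read back in reverse order. The `ChangeJournal` class and the setting names are hypothetical, chosen only for illustration.

```python
import json
from datetime import datetime, timezone

class ChangeJournal:
    """Keeps a record of every change made while investigating an outage."""

    def __init__(self):
        self.entries = []

    def record(self, target, old, new, reason):
        self.entries.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "target": target,
            "old": old,
            "new": new,
            "reason": reason,
        })

    def revert_plan(self):
        """Return (target, value_to_restore) pairs, newest change first."""
        return [(e["target"], e["old"]) for e in reversed(self.entries)]

    def dump(self):
        """Serialize the history, for example to attach to a post mortem."""
        return json.dumps(self.entries, indent=2)

journal = ChangeJournal()
# Hypothetical settings and values.
journal.record("db.pool_size", 20, 50, "suspected connection starvation")
journal.record("jvm.heap_max", "2g", "4g", "OutOfMemory seen in worker log")
plan = journal.revert_plan()
print(journal.dump())
```

Reverting in the opposite order of the changes keeps the system on a path it has already been through.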

7) Talk to the developers - they often know, or at least have a sense of, what the problem is or what is causing it.

8) Failures are rich in learning - Always do a post mortem or root cause analysis. We need to know how to prevent the failure next time. Publish a post mortem document - Identify what happened, how, when, what can be done to prevent similar problems in the future, etc.

9) Keep calm and be human (Be blameless)
There are two ways of reacting when an outage is due to a human mistake. One is obvious :)

The other - Don't get angry at people over outages. Otherwise people may hide problems and stop communicating, transparency is discouraged, and small problems get ignored until they turn into big problems.

If people are afraid to speak up, you extend outages!

10) Share it - Practice deliberate sharing. If it matters, share it. You never know who will benefit, and it could be you.
