Subscribe to Blog via Email
Follow me on TwitterMy Tweets
As many companies judge a DBA’s worth on the uptime of a database, nothing tests a DBA’s mettle like database service outages where the cause is indeterminate at the time of the outage. The secondary challenge is that once the database is brought back up, the DBA must then take the necessary steps to analyze the data and figure out what caused the outage, hopefully resolving or reporting to ensure outages to not re-occur. The DBA’s goal should not just be to resolve the issue, but to provide a high level, clearly documented report of what occurred and what steps have been/scheduled to be performed to address the problem.
This is where a DBA’s skills as a detective, attorney and researcher come into play and where I have discovered many of us, often due to the natural complexity of database environments, can find challenges. Having a solid set of steps to follow post an outage is extremely important to offering your manager solid data to show that you know what went wrong and to offer recommendations to deter a repeat incident.
As many of us can be “distracted” by benign performance issues that had nothing to do with the outage, it is important to keep ourselves on the correct path, performing solid detective work and following the right culprit until we are able to collect the evidence needed when it’s time to convince the judge or jury, (i.e. manager or development/application team…) To do so takes careful research, so what path is less riddled with benign errors that could distract us? Where should we start?
The Alert Log and Trace Files
The alert log should always be your first stop once you have determined the database is out of danger and can be returned to an accessible state.
It is commonly the best idea to first work from the background trace files to the Oracle trace files unless the alert log has directed you to a specific session level trace file to inspect.
With the data between the alert log and the trace files, you should start to have a clear picture of the events that lead up to the outage.
The trace files will commonly offer the DBA…
All of this data should be captured in the final report, without too much detail, just the facts.
You now know what happened, you may not know, as of yet why. You must look at your options to find out what tools/options are available to collect historical data of the incident.
If you have the Diagnostics and Performance Tuning Pack, then you are in the “with ease” group! (I plan to write a post about ADDM, AWR and ASH soon, so I am doing a bit of chicken before the egg on this one, my apologies if you have not used the following reports from the command line! Luckily, there are a number of solid resources out there for these reports…)
If you have Oracle Enterprise Manager, then you have a GUI interface that may offer you “lovely” pictures for your report, (which is always appreciated by non-DBA’s that have to decipher what might seem simple to you!)
Now, for those that aren’t so lucky and are not blessed with the advanced tools, (I won’t scold you for not convincing your managers and companies to purchase the valuable tools above…:)) but I will offer you the next option down in options:
AWR/Statspack review of what we are looking for:
Trace and Profile:
If you have the SQL_ID, you can pull the SQL and if you can pull the SQL, you should be able, albeit not in real-time or leading up to your database outage, perform a trace and discover what type of waits are occurring.
As many outages are due to a combination of SQL statements that are “incompatible” with each other due to performance, design challenges or resource limitations, collecting traces for each individual statement can offer a valuable amount of data to back up a report.
Creating Your Report
3. Dates to address any issues with responsible parties assigned to address them.
Last Recommendations for Success
Rarely are database outages simple, so utilizing all resources without blinding ourselves with too much data is very important. I commonly find that we are either walking around, narrowed in on a target, missing great data under our noses or confused to utter distraction by all the data coming at us from all directions. With that said,