The Tao of a Database Outage

February 27, 2011 Kellyn Gorman

As many companies judge a DBA’s worth on the uptime of a database, nothing tests a DBA’s mettle like database service outages where the cause is indeterminate at the time of the outage. The secondary challenge is that once the database is brought back up, the DBA must then take the necessary steps to analyze the data and figure out what caused the outage, hopefully resolving or reporting to ensure outages to not re-occur. The DBA’s goal should not just be to resolve the issue, but to provide a high level, clearly documented report of what occurred and what steps have been/scheduled to be performed to address the problem.

This is where a DBA’s skills as a detective, attorney and researcher come into play and where I have discovered many of us, often due to the natural complexity of database environments, can find challenges. Having a solid set of steps to follow post an outage is extremely important to offering your manager solid data to show that you know what went wrong and to offer recommendations to deter a repeat incident.

As many of us can be “distracted” by benign performance issues that had nothing to do with the outage, it is important to keep ourselves on the correct path, performing solid detective work and following the right culprit until we are able to collect the evidence needed when it’s time to convince the judge or jury, (i.e. manager or development/application team…) To do so takes careful research, so what path is less riddled with benign errors that could distract us? Where should we start?

The Alert Log and Trace Files

The alert log should always be your first stop once you have determined the database is out of danger and can be returned to an accessible state.

Inspect the log at the time leading up to the outage for any error messages
- Note the type of error message, is it memory, I/O, background or session related?
- Is there an error number involved that can be researched on the web or with Oracle Support?
- Has the error occurred more than once or leading up to the outage the only time, (patterns are important…)
Taking the time of the error in the alert log into perspective, was there any corresponding trace files produced in the trace/diagnostic directories, (11g) or bdump/udump directories, (10g à)?
- Inspect for larger than normal trace files. Incidents will show themselves in more data being produced in the trace files.
- Is there a pattern to what trace files and what size?
If your alert log showed an issue with a specific type of area in Oracle, this should tell you what background trace files to inspect:
- Memory- inspect the memory manager trace files, (mnon, mmnl)
- Archiver- The archiver trace file.
- Log writer- lgwr or logwriter trace file.
- Sessions/undo- smon
- Recovery- reco and smon
- Processes- pmon
- Queuing/Streams- (qmnc, qxxx)
- Oracle Jobs- (jxxx, cjqx)
- Checkpoint- ckpt
- Session Trace files- <sid>_<pid>_ora.trc files

It is commonly the best idea to first work from the background trace files to the Oracle trace files unless the alert log has directed you to a specific session level trace file to inspect.

With the data between the alert log and the trace files, you should start to have a clear picture of the events that lead up to the outage.

The trace files will commonly offer the DBA…

What type of process lead or was part of a combination of issues that lead to the outage.
The SID, the SERIAL#, SQL_ID and even the login information for the process(es) that lead up to the outage.
What the error messages and/or numbers were involved in the issue.
The times of each step leading up to the incident to give a clear picture of what went wrong.

All of this data should be captured in the final report, without too much detail, just the facts.

You now know what happened, you may not know, as of yet why. You must look at your options to find out what tools/options are available to collect historical data of the incident.

If you have the Diagnostics and Performance Tuning Pack, then you are in the “with ease” group! (I plan to write a post about ADDM, AWR and ASH soon, so I am doing a bit of chicken before the egg on this one, my apologies if you have not used the following reports from the command line! Luckily, there are a number of solid resources out there for these reports…)

If you have a snap ID for the ending period just before your crash, you are able to easily pull and ADDM or AWR report.
If you have this snap ID and an offending SQL_ID, then I would pull a SQL_ID specific AWR report for the time of the incident, ($ORACLE_HOME/rdbms/admin/awrsqrpt.sql).
If you do not have the snap ID for a considerable time before the outage, then ASH is king and you can simply run an ASH report up to the time of the outage and collect very valuable data about what was occurring at that time in the database.

If you have Oracle Enterprise Manager, then you have a GUI interface that may offer you “lovely” pictures for your report, (which is always appreciated by non-DBA’s that have to decipher what might seem simple to you!)

Capture your data as soon as you are able after the incident from OEM. As Oracle will aggregate this data once it is shown as “historical”, the actual severity of the image can change drastically from real-time to the historical data.
If you at first do not see anything amiss, especially if you are inspecting historical data, carefully move the window in small sections and see how the data is displayed differently. Aggregated data can “hide” issues unless you happen on the right window.
Inspect the OEM data before AND after the outage or database hang. Many times this can give you a full picture of what occurred and if the database was actually hung, vs. an outage, the data will often still be collected at the session level, even if the top activity grid panel shows there was none at the time.

Now, for those that aren’t so lucky and are not blessed with the advanced tools, (I won’t scold you for not convincing your managers and companies to purchase the valuable tools above…:)) but I will offer you the next option down in options:

Statspack:

Same rules as above. Attempt to collect the single window of snapshots before the outage and as close to the outage as possible.

AWR/Statspack review of what we are looking for:

Look for out of the ordinary issues with high waits and sql statements that you normally wouldn’t see.
What are the highest wait events? Is this normal for this database? Do you know what is normal for this database? J
Do any of the waits or issues in the report correspond to the trace files?

Trace and Profile:

If you have the SQL_ID, you can pull the SQL and if you can pull the SQL, you should be able, albeit not in real-time or leading up to your database outage, perform a trace and discover what type of waits are occurring.

As many outages are due to a combination of SQL statements that are “incompatible” with each other due to performance, design challenges or resource limitations, collecting traces for each individual statement can offer a valuable amount of data to back up a report.

Creating Your Report

No more than 2 pages.
Should have the following sections:

Summary of outage
Timeline of outage
Findings that back up the summary.
Recommendations to address or deter from outage re-occurring.

3. Dates to address any issues with responsible parties assigned to address them.

Last Recommendations for Success

Rarely are database outages simple, so utilizing all resources without blinding ourselves with too much data is very important. I commonly find that we are either walking around, narrowed in on a target, missing great data under our noses or confused to utter distraction by all the data coming at us from all directions. With that said,

Do utilize resources from application, network and/or development support that can offer insight to any area of the outage that you may not have a clear view of.
Help them stay on the path, too. The alert log and trace files, along with your research have provided you with a solid picture of what happened inside the database and lead to the outage, so if anyone appears to be distracting from the issue at hand, simply say, “thank you, but this does not correspond with my data I’ve collected in the database outage.”
Always have data to back up your report. If you have a theory without data, do NOT put it in the report. You must have data to back up what you are saying occurred. There is no quicker way to lose coworker and/or customer support than to report something that you can’t back up.

2 thoughts on “The Tao of a Database Outage”

Fahd Mirza

March 1, 2011 at 8:17 am

In order to fully exploit the worth of RCA, both DBA and its clients (developes or managers) should appreciate that this is serious activity and not something to fill the belly of emails to the upper management. Proper time should be given to DBA to fully trace, identify, analyze, slice and dice and recommend the findings and further actions in RCA.
Pingback: One Year With Oracle - Oracle - Oracle - Toad World

Comments are closed.

You May Also Like

Why Go To Oracle?

EM13c- BI Publisher Reports

Three Months at Microsoft

2 thoughts on “The Tao of a Database Outage”