The Good, the Bad and The Ugly

Two weeks ago, after a new mart had performed excellent for 24hrs, a sudden and detrimental decline in performance was seen. I’d been trying to track this issue for almost two months, but had only seen it in two or three processes total and never to this level. I had some history from previous DBA’s, but they didn’t know what was causing it and when a mart has a limited lifespan before you are dropping and recreating it from a 6TB warehouse- any DBA knows, this is not going to be easy.

The main goal of the mart is to support a specific set of repeat processes to complete a function of the business. One of two larger processes of this main function was clearly shown as the culprit. Upon an attempt to see if improvement could be gained easily by utilizing an OEM SQL Profile, I received the error:

“The optimizer could not merge the view at line ID 52 of the execution plan.”
In my experience, this error is an indicator that a view, (either created by the developer or Oracle itself as it hashes, sorts, etc. so verify first) is causing issues for Oracle’s Cost Based Optimizer and the DBA involved should note this to the developer’s/IT Management that this view should be known as “volatile” to the environment. This means that it can cause performance issues due to it’s complexity and how it impacts the Oracle CBO.
Another reason the SQL Profile was not a good choice? The CBO recognized the WORST plan as the best choice when comparing the results of what it felt was the “better execution plan”.

How to track down the poor performing plan vs. the correct hashing plan? I’ve found a certain indicator in large warehouse environments- nested loops with rowid range scans as a solid indicator that Oracle has made a very poor choice.

Many different scenarios can cause this to exist, including a combination of Oracle parameters and statistics. I’m a new DBA and due to the number of “experts” that set up these databases and where their specialties lie, the Cost based optimizer and Oracle database parameters seem to be uncharted territories, so a little trust is still being gained in this arena… 🙂

For the problem at hand though, querying the database will show extensive time remaining, (20-200hrs) for “ROWID RANGE SCANS” per parallel slave session in long operations:

select opname, totalwork, time_remaining/60/60, sid from v$session_longops
where time_remaining>0 order by time_remaining desc;

Querying temp space usage will show large sorts per parallel slave sorting data that will not coorelate to the data set size, (30-90GB), where the data set that is being sorted should only be around 2-4GB:

select vs.sid, vs.osuser, vs. process, vs.sql_id, vtu.segtype, ((vtu.blocks*8)/1024)MB, vtu.tablespace from v$tempseg_usage vtu, v$session vs
where vtu.session_num=vs.serial# order by blocks desc;

Through the six simultaneous processes, I was able to trace three different execution plans for the same statement. The original(the good) from when performance was solid, the second, (the bad) which was sub-optimal, but would still finish with just a little extra temp used and then the third, what I called “the ugly” as these had estimated times of completion between 50-200 hours.
The difference in “The Ugly” plan was that the CBO, based off the hashing through the view, had deemed the statistics stale on all the objects involved, treating them, almost like a cluster table and updating the statistics through dynamic sampling. The statistics change cascaded through the system, affecting each of the jobs kicked off post the change. This was easy for me to see as a DBA, but I was contacted with a simple, panicked complaint of “the database has stopped processing anything!” from the users. This is always something to keep in mind as a DBA or developer- all it takes is ONE table to throw an execution plan off, doesn’t matter how well you collected statistics on the others.

Note: These views are very involved, so the explain plans are quite complex and incredibly long- and obviously I need to work on how best to present plans like these to a blog. Due to this if you are interested, I will be happy to email the full plans to anyone interested.

Statement Impacted:
CREATE TABLE dly_temp1_XXX (IBHID,FO_DAY_ID,LO_DAY_ID,FONO_DAY_ID,LONO_DAY_ID,FOFO_DAY_ID,LOFO_DAY_ID,FORO “tons of case statements and such”
STG1_OUTPUT_VW_NL_R48;
<–VIEW with more case statements to two tables The GOOD!!
the optimal plan for this statement, leading with an index on the largest table and using it to isolate the rows before heading into the table partition, then the global temp tables. No rowid range scans here and temp space usage is never over the dataset being worked with.
The BAD:Performs a full table scan well after it’s created it’s own temp tables as part of the process. This is sub-optimal and will not have the rowid range scans, but will still utilize tons of temp space for sorting and hashing for the wrong reasons.
The Ugly:
The CBO decides the global temp table is the best choice, (which could easily happen if there were no dynamic sampling, so I want to collect statistics dynamically, have the correct amount of temp allocated to sorting and hashing…) but we do not want a bind peek issue due to askew sizes on partitions, etc. on the permanent tables, dynamic sampling, etc. We have what we need, we need to keep Oracle on course.

Short Term Solution to Correct the Problem: During the intial build of these static tables, I collected my statistics VERY CAREFULLY with my own scripts. I know what tables need histograms on indexed columns, what tables and indexes need statistics dropped, what columns need gathered with columns size 1, the exact sample size, etc. I really need Oracle to stop working hard here and undoing all the work we put in during the build process.

DIAL BACK THE STATS to when the execution path was correct with the best statistics for the environment. OEM is very good about showing me exactly WHERE it went wrong, so I was able to simply dial back to before this time, (or query when the LAST_ANALYZED for the tables involved are to see the update to the stats if I did not have the convenience of OEM):

SQL> exec dbms_stats.restore_table_stats (‘DM_SCHMA’,’ITEM_DETL’,’17-DEC-09 04.00.00.000000000 PM -04:00′);
SQL> exec dbms_stats.restore_table_stats (‘DM_SCHMA’,’ORDER_DETL’,’17-DEC-09 04.00.00.000000000 PM -04:00′);

Now lock the statistics on the tables required to stay static, vs. the temp tables involved to deter the plan from changing:
SQL> exec dbms_stats.lock_table_stats(‘DM_SCHMA’,’ITM_DETL’);
SQL> exec dbms_stats.lock_table_stats(‘DM_SCHMA,’ORDR_DETL);
To allow for the most stable execution plans, I ended up locking down 18 tables that were called by the total views utilized in the main processing of the mart.

Long Term Solution: Since this was a case of dynamic sampling “pulling the rug out from under us”, updating statistics on tables that are in a static, read-only state, my choice was to add steps post the build and “proper” update of statistics to lock the stats on all tables involved in the views with a DBMS_STATS.LOCK_TABLE_STATS. I still rely on dynamic sampling for my temp tables and a few of my other tables, so this is a solid compromise that is easily placed into the datamart build scripts when these tables are first created and after the correct statistics script is run.

Print Friendly
December 29th, 2009 by

facebook comments:

  • Facebook
  • Google+
  • LinkedIn
  • Twitter