Oracle

Passive-Aggressive Dynamic Sampling CBO with 11.2.0.3.0

Didn’t know that was the name of a little-known feature in 11g, eh? OK, so I may have named it that, but this is where I’m at right now with dynamic sampling at many of the shops I’ve worked at.  I know it’s due to the common sizes and types of databases I’ve worked on, too.  Many of them have unique hybrid databases that require unique solutions.

These databases often have both transactional and ETL processing/reporting, along with ad hoc queries from GUI tools.  For best results we’ve implemented the following in most of them:

1.  Throughout ETL loads, I work with developers to collect stats effectively to produce consistent plans that provide solid performance.  As they load and change a large enough percentage of an object to impact load performance, we update stats, import stats, etc., to ensure that the CBO has what it needs to do its job.  The nightly job collects on stale objects and ensures that it knows what is stale beforehand (it’s surprising how many are unfamiliar with how stale stats information is gathered; yes, maybe that could be simplified, but right now it’s built as effectively as it can be).  We don’t update stats if it’s not necessary to do so.  A rough sketch of this kind of stale-only gathering follows this list.

2.  Parallel is offered only where it can provide a performance increase, and only at a degree of parallelism (DOP) that offers the best performance considering the number of processes (the magic number here is so often “4”…).  No Auto DOP is set, nothing that may cause inconsistent performance, as consistency is what we are looking for in most shops.
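
The stale-only gathering mentioned in item 1 looks roughly like the following; the schema name is just a placeholder, not the client’s actual job:

-- flush pending DML monitoring info so the dictionary knows what is stale before gathering
EXEC DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO;

-- gather only on objects flagged as stale, leaving everything else alone
BEGIN
  DBMS_STATS.GATHER_SCHEMA_STATS(
    ownname => 'ETL_SCHEMA',     -- placeholder schema
    options => 'GATHER STALE',   -- stale objects only
    cascade => TRUE);            -- include indexes
END;
/

Checking staleness ahead of time is simply a query against DBA_TAB_STATISTICS where STALE_STATS = 'YES'.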

So this is where dynamic sampling in 11.2.0.3.0, parallel and I do not get along.  The goal, after taking a dynamic sample, is to provide an updated execution plan for the CBO to follow.  In 11.2.0.3.0, if parallel is used, the CBO has the option to perform dynamic sampling at level 6, even if you have your database parameter set to level 2 and your stats are up to date.  If I have developers who are assured that they already have adequate stats and the best plan in place, what does dynamic sampling offer me beyond a game of Russian roulette?
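
For reference, the instance-level setting the CBO is effectively overriding here can be checked from SQL*Plus; the output will look roughly like this, with 2 being the default value:

SQL> SHOW PARAMETER optimizer_dynamic_sampling

NAME                                 TYPE        VALUE
------------------------------------ ----------- -----
optimizer_dynamic_sampling           integer     2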

Example #1

Note, the only difference in the two runs is the hint:

First Run:  SELECT DISTINCT /*+ parallel(4) */ .....
9264 rows selected.
Elapsed: 00:07:22:93
Second Run: SELECT DISTINCT /*+ parallel(4) dynamic_sampling(0) */ ....
9264 rows selected.
Elapsed: 00:03:23:98
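
For anyone wanting to reproduce this kind of comparison, the trace behind the TKPROF output below can be captured along these lines (the identifier and file names are placeholders):

ALTER SESSION SET tracefile_identifier = 'dyn_samp_test';
ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';
-- run the statement under test, then:
ALTER SESSION SET EVENTS '10046 trace name context off';

$ tkprof <trace_file>.trc dyn_samp_test.prf sys=no sort=exeela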

Yet, when you look at TKPROF for the trace, it appears the statements took the same amount of time and did the same thing-

First Run:

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.35       0.35          0          2          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch      619    192.17     199.84    6809688    6762205         67        9264
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      621    192.53     200.20    6809688    6762207         67        9264

Second Run:

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.34       0.35          0          2          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch      619    191.11     198.71    6809146    6762205         67        9264
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      621    191.46     199.06    6809146    6762207         67        9264

So what was going on outside of the actual execution that required an extra 4 minutes, more than doubling the elapsed time?

Yes, a dynamic sample of a large table so the CBO could ensure it had more reliable data to make an educated decision on the execution plan to use.  Yes, the stats were up to date.  No, the plan did not change.

If you haven’t noted the actual SQL executed for a dynamic sample, it can be located in the trace file and starts with the following:

SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE NO_PARALLEL(SAMPLESUB)

You can then follow it down through the column and sample information to see the table it is sampling on.  It’s simple enough to note the execution time in SQL Monitor, through OEM, AWR or queries.  This is a SQL statement executed just like any other statement.
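
A generic sketch of locating these sampling statements in the shared pool and seeing how long they ran (not taken from the system in question):

SELECT sql_id,
       executions,
       ROUND(elapsed_time/1e6, 2) AS elapsed_secs,
       SUBSTR(sql_text, 1, 60)    AS sql_text
  FROM v$sql
 WHERE sql_text LIKE 'SELECT /* OPT_DYN_SAMP */%'
 ORDER BY elapsed_time DESC;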

If you run a SQL_ID-specific AWR report (awrsqrpt.sql), you will see the following after the execution plan:

Note 
----- 
 - dynamic sampling used for this statement (level=6)
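
That report is just the standard script shipped under the Oracle home; it prompts for the snapshot range and the SQL_ID of interest:

SQL> @?/rdbms/admin/awrsqrpt.sql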

Why should I be against this?  The CBO took what it felt was a quick sample and wanted to offer me a better plan.  The problem is, it didn’t offer me a better plan.  It stuck with the original and ended up doubling my elapsed time, which the user did notice!

Why did the CBO decide to take the dynamic sample if the stats were up to date?  The ETL had updated a few rows on a column with skewed data, which caused it to question whether it had the right data on that column for a join.  Now again, we know as DBAs that it would not make a difference in the plan that the CBO SHOULD choose, but the CBO didn’t know this, and this is one more reason we need DBAs.
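
Confirming that the column stats really were current is a quick dictionary check along these lines; the owner, table and column names here are placeholders:

SELECT column_name,
       num_distinct,
       histogram,
       last_analyzed
  FROM dba_tab_col_statistics
 WHERE owner       = 'ETL_SCHEMA'    -- placeholder
   AND table_name  = 'BIG_TABLE1'
   AND column_name = 'JOIN_COL';     -- placeholder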

———————————————————————-

Example #2

Now, Example #1 was the lesser of two evils in my database world.  The second example is when it thinks it CAN offer me a better plan with the limited information it retrieves from the dynamic sample.  The plan shown contains just a limited view, demonstrating the buffer sort at step 36 that really would be seen at step 8 in the optimal plan.  How many would have blamed the CBO for this, though?  The choice to perform the buffer sort on all the rows before any filtering was made due to the dynamic sample on the table, and due to the fact that the design of the table and the join tables is a bit challenging.  The biggest challenge, though, is that this is a vendor app, so a change request can be initiated, but until then, what do you do?

 

| Id | Operation         | Name       | Rows  | Bytes |TempSpc| Cost (%CPU)| Time      |
-----------------------------------------------------------------------------------------
| 31 | VIEW              |            |  279K | 5452K |       |  4420  (1) | 00:00:54  |
| 32 | HASH GROUP BY     |            |  279K | 5179K | 8104K |  4420  (1) | 00:00:54  |
| 33 | TABLE ACCESS FULL | JOIN_TBL4  |  294K | 5461K |       |  2743  (1) | 00:00:33  |
| 34 | TABLE ACCESS FULL | JOIN_TBL3  |  294K |   13M |       |  2743  (1) | 00:00:33  |
| 35 | TABLE ACCESS FULL | JOIN_TBL2  |  294K | 8911K |       |  2743  (1) | 00:00:33  |
| 36 | BUFFER SORT       |            | 2617K |       |       |    90M (1) | 301:11:06 |
| 37 | TABLE ACCESS FULL | BIG_TABLE1 | 2617K |       |       | 13531  (1) | 00:02:43  |
Note 
----- 
 - dynamic sampling used for this statement (level=6)

The users called me after the execution hadn’t completed in a couple of hours, and I was asked to kill it and explain what had occurred.  When looking at the different plan hash values, it was easy to see that once dynamic sampling came into play, all consistency in executions went out.  Sometimes the performance was acceptable; other times, dynamic sampling was not doing us any favors.
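
A sketch of the kind of history check that shows the plan flip-flopping; the SQL_ID substitution variable is just a placeholder:

-- multiple plan_hash_values for one sql_id across snapshots = inconsistent executions
SELECT snap_id,
       plan_hash_value,
       executions_delta,
       ROUND(elapsed_time_delta/1e6, 2) AS elapsed_secs
  FROM dba_hist_sqlstat
 WHERE sql_id = '&sql_id'
 ORDER BY snap_id;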

Now, there is going to be a lot of finger pointing, but I commonly prefer to see it as it is.

The database design and code are so often partly to blame (views upon views create complex scenarios that, combined with dynamic sampling, can create situations where less than optimal plans are chosen).  If your design includes lookup tables or other small tables joined to very large ones, you can also experience less than optimal performance, as the dynamic sample can lead to an incorrect join order.  The SQL may often be sub-optimal and should be written in a way that benefits the choices of the optimizer, but for many companies the code is offered by a vendor with limited flexibility, and you are simply left with performance challenges and few answers.  The goal here is to stop the performance challenge; as everybody knows, the database is guilty until proven innocent, so it’s your job to create a stable environment.

So what did we do to stop it?

There are a couple of ways to address this if you prove that dynamic sampling is partially to blame for the poor performance scenario in the database (each option is sketched out after this list):

1.  To stop *most* dynamic sampling from occurring, you can set the optimizer_dynamic_sampling parameter to 0, but understand that you have just addressed this at the database level, not at a session or code level.  This will stop the feature from being available to anything in the database, including code that may benefit from it.  I prefer to choose this as a last resort.

2.  You can alter the session and set optimizer_dynamic_sampling=0.  This is at the session level, so only that session is impacted.

3.  If it is only one or two volatile statements, you can add the following hint:  /*+ dynamic_sampling(0) */  and now we are at the SQL level, which is what I prefer if at all possible.
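
As a quick sketch of what each of the three options looks like in practice (the statement in the last example is only a placeholder):

-- 1. database level, the last resort
ALTER SYSTEM SET optimizer_dynamic_sampling = 0;

-- 2. session level, e.g. at the start of the ETL run
ALTER SESSION SET optimizer_dynamic_sampling = 0;

-- 3. statement level, only for the volatile SQL
SELECT DISTINCT /*+ parallel(4) dynamic_sampling(0) */ ...
  FROM big_table1 ...;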

Luckily for this client, they were able to alter the session at the beginning of the ETL run to disable dynamic sampling, and this not only removed the extra elapsed time from the executions, it returned the consistent performance that they required.  Dynamic sampling has its place and benefit, but for a few of my clients, it’s not a beneficial feature for their design.

 

Kellyn

http://about.me/dbakevlar