CBO, Statistics and a Rebel (DBA) With a Cause

This is not an “I’m against Oracle Best Practices” article.

This is not an “I’m preaching against histograms” article.

This is not an article telling folks that they should EVER take the steps or make the choices presented in this article.

I’m also going to refer to the wonderful post by Greg Rahn as to the reasons why what works in my environment would most likely NOT work in yours:

http://structureddata.org/2008/10/14/dbms_stats-method_opt-and-for-all-indexed-columns/

What this article IS:

  1. An honest account of how a unique database design was impacted when changing from a previously researched method of stats collection to the recommended choices after an 11g upgrade.
  2. The reasons, including the high-level data, (to protect the innocent, which happens to be the DBA writing this post, due to her confidentiality agreement…) that led to the findings.
  3. The steps that were utilized, (and some that are still underway) to correct the performance issues introduced.
  4. Why a DBA should know their data/database, research, test, research, test some more and trust their knowledge.

 

Many folks in the Oracle arena were aware that I was undertaking a complex and challenging 11g upgrade.  This upgrade was challenging due to multiple datamarts, built using RMAN’s duplicate command, that would require me to perform a series of upgrades, one after another, successfully.  It was an “all or nothing” upgrade scenario and, due to the interval, I had a very limited timeline to make it right, with no ability to roll back the upgrade.

The initial source database environment was upgraded to 11.2.0.2.0 with the latest patch on a Tuesday, and the first datamart to be built off of the upgrade was initialized, with all the script changes, that Friday in mid-December.  There were a couple of challenges, including the office closing down for Christmas week, (if no one is using the databases, how do you really know there is something wrong from your user base??  There were positives and negatives; I took the view that fewer users in the database meant I had downtime to work through any issues!)  The first build was completed and, outside of a few minor performance issues, the initial upgrade phase was successful and progressed along through the initial mart build.

The 10g environment had been offered a slot of my time back in late 2009-early 2010 for careful performance tuning recommendations.  The database is a dual-purpose database of over 12TB, made up of large, wide main tables, over 100 columns wide for most, with only 2-4 columns per table appearing in any where clause.  Much of the data is used during a load process and almost exclusively in that load process.  Due to this unique design, I recommended moving much of the stats collection into the code level.  The DBA staff had previously relied on dynamic sampling as much of the safety net, and by moving statistics collection into the code, much of the performance challenge stabilized.  There was a nightly job that was the “catch-all”, creating stats collection calls for objects listed as “STALE” in the DBA_TAB_MODIFICATIONS table.  Due to partitioning strategies that were based on reporting and not on data loads, hints were incorporated into the load processes to stabilize many of the erratic plans that were impacting performance and to deter poor choices from bind peeking.  As we moved the developers from performing DML transactions against the large partitions to a more mature choice of partition exchange, the database performance continued to stabilize and the business grew more satisfied.  The remaining performance issues were considered non-issues, and no further performance tuning was really addressed in the source database, except to recognize that some would be required with the upgrade to 11g, plus a few one-offs that came up as new code was introduced.
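The nightly catch-all job itself isn’t shown in this post, but a minimal sketch of that style of job, keyed off the staleness tracking Oracle maintains, might look something like this (the schema name is illustrative, not the real one):

```sql
-- Simplified sketch of a nightly "catch-all" stats job (illustrative only):
-- flush in-memory DML monitoring info, then gather on anything marked stale.
BEGIN
  DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO;
  FOR r IN (SELECT owner, table_name
              FROM dba_tab_statistics
             WHERE owner = 'DW_USR'          -- schema name is illustrative
               AND stale_stats = 'YES')
  LOOP
    DBMS_STATS.GATHER_TABLE_STATS(
      ownname    => r.owner,
      tabname    => r.table_name,
      method_opt => 'FOR ALL INDEXED COLUMNS',
      cascade    => TRUE);
  END LOOP;
END;
/
```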

I had learned quickly in 9i and 10g to know my database when it came to how to collect statistics, and for this database, with large, wide tables, the same 2-4 indexed columns used in all where clauses and execution plans that needed to be stable, the following held particularly satisfying results:

METHOD_OPT=> ‘FOR ALL INDEXED COLUMNS’

The database simply did not care about the other columns and ran well with a streamlined, efficient data dictionary that did not offer many more options than the basic information.  The culture among the developers and former DBA’s had grown into one that, due to design challenges, had implemented a solid “code with hints” formula, the only one I had ever seen work successfully.  It was an abrupt change, considering in my former two positions I had spent most of my time removing hints to let the optimizer do its job after correcting statistics!  With this new environment, the source database only needed to retain information about the tables as a whole and the indexed columns that would be used in the load and reporting processes.

  • The load process would use the indexed columns.
  • The reporting processes used the partitions.

For the 10g environment, stats collection on stale objects could run anywhere from 10 minutes to 8 hours.  The DBA_TAB_MODIFICATIONS table was cleared out nightly and plans were stable.

The 11g upgrade had been in place for the main design for one week when I proceeded forward with the *standard* for 11g statistics gathering.  All the DBA’s were in agreement that I would proceed forward with this post-upgrade, but until they attended Maria Colgan’s sessions at RMOUG in mid-February, I don’t know if they truly realized what a change I was undertaking, (they’ve mostly left performance tuning and statistics to me, happily…)

Maria Colgan has been one of my favorites for quite some time, so of course, her blog posts are referenced here as some of the most important ones to me in my decision for what we should do for stats gathering in our forthcoming 11g databases:

https://blogs.oracle.com/datawarehousing/entry/managing_optimizer_statistics

I’d already utilized many of her recommendations, albeit hacked for the unique mart environments with much success when it came to partitioned objects:

http://optimizermagic.blogspot.com/2009/02/maintaining-statistics-on-large.html

I have extended stats on a number of specific objects in my marts, and even one table where I drop the stats and lock the table stats, as dynamic sampling is the ONLY way the CBO utilizes this table correctly on a consistent basis, (always have to have one odd one out!)  Everything is scripted and has run automatically and consistently since December 2009, providing, (again!) consistent performance in our multiple mart environments, where prior to December 2009 the performance of the mart environments was erratic and beyond anything the former DBA’s could guarantee.
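The exact scripts aren’t included here, but the two techniques just mentioned can be sketched with standard DBMS_STATS calls (all object and column names below are made up):

```sql
-- Illustrative only; object and column-group names are made up.
-- Extended stats on a column group the CBO consistently misjudges:
SELECT DBMS_STATS.CREATE_EXTENDED_STATS(
         ownname   => 'DW_USR',
         tabname   => 'SOME_TBL',
         extension => '(COL_A,COL_B)')
  FROM dual;

-- And, for the one odd table out: drop its stats and lock them so the
-- CBO has to rely on dynamic sampling every time.
EXEC DBMS_STATS.DELETE_TABLE_STATS(ownname => 'DW_USR', tabname => 'ODD_TBL');
EXEC DBMS_STATS.LOCK_TABLE_STATS(ownname => 'DW_USR', tabname => 'ODD_TBL');
```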

I knew I would not have the issues in my source database with any of the above concerns, including large partitions, as we were collecting stats at the partition level in the code during load/creation time.

Greg Rahn, Randolf Geist and others all had incredibly informative and universally similar messages on what should and should not be done in an 11g database, including improvements in how 11g’s stats collection features functioned.  I really was only interested in one change and that was to METHOD_OPT.

I’ve always been a DBA engrossed by anything Cost Based Optimizer, (CBO) related and considered myself well-versed in the area, specifically in the challenges of warehouse databases.

I then made my first “should know better” mistake.  I stopped researching my own environment, stopped testing further stats collection changes from 10g to 11g, and blindly trusted what I was reading.  Hey, I was only going to give the CBO more information, let 11g make the decisions on the method of how to collect the statistics; everything else in my scripts would remain the same.

CBO performance is dependent on a number of factors for the feature to function optimally in any database, and Oracle offers us auto stats gathering jobs and a scheduler to run these jobs, straight out of the box.  The largest dependencies of this offering are:

  1.  Design of the database.
  • Is the database actually designed so that joins can be effective and efficient?
  • Is the partitioning strategy well thought out and evenly distributed?
  • Do the correct indices exist, or are there extra indices that should be reviewed and removed?
  • Are there hints, bugs or any other obstacles to the CBO doing its job?

  2.  Data
  • How is the data loaded vs. queried?  Is this process very different from the requirements?
  • Is the database multi-use, so that transactional and reporting workloads reside in the same database, resulting in concurrency issues?
  • Is the database so large that the auto scheduler window runs out of time each day, building up a queue of objects that never have stats gathered?

I had collected my 10g SQL Baselines and had these to return to at any time.  I was confident in my ability to tell a good plan from a bad plan, (I work with a DBA currently who first approached me at a conference a couple of years ago because he wanted to know how I knew a slide in my presentation was a mistake, having put in a “bad plan” for a specific SQL statement…  This all goes back to the depth some of us obsessed DBA’s get into our data… :))
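The capture scripts aren’t part of this post, but one hedged way to bank pre-upgrade plans is to stage them in a SQL Tuning Set on 10g and then load them as SQL plan baselines once on 11g (the STS name here is made up):

```sql
-- Hedged sketch: plans banked in a 10g SQL Tuning Set can be loaded as
-- SQL plan baselines once on 11g, to fall back on if plans regress.
DECLARE
  n PLS_INTEGER;
BEGIN
  n := DBMS_SPM.LOAD_PLANS_FROM_SQLSET(
         sqlset_name  => 'PRE_11G_STS',   -- STS captured pre-upgrade
         sqlset_owner => 'DW_USR');       -- owner is illustrative
  DBMS_OUTPUT.PUT_LINE(n || ' plans loaded as baselines');
END;
/
```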

Now, with this knowledge in hand, I implemented the following change:

METHOD_OPT=> 'FOR ALL INDEXED COLUMNS'   <-- removed

Yes, I simply removed the method option, allowing Oracle to fall back to its default of FOR ALL COLUMNS SIZE AUTO and make the choice it felt was best.  This is when, slowly but surely, all $#%& broke loose.

 

Now many of you will quickly assume it was in the form of poor performance, and you would be correct.  The reason behind the performance, and what was most challenging about it, was not as clear.

I did have a few incidents of bind peeking, ignored hints and the like.

select * from table(dbms_xplan.display_awr('7rjk8mr85s2nq'));

Wow…21 different plans for the same statement…

INSERT INTO DW_USR.CT1 (COL1,COL2,COL3,COL4,COL5,COL6,COL7.....
COL18,COL19,COL20)
SELECT * FROM ( SELECT S.COL1, S.COL2, S.COL3, ROW_NUMBER ()
OVER (PARTITION BY S.COL5, S.COL6,S.COL7, S.COL8
ORDER BY S.COL12) COL6,S.COL2, S.COL12, S.COL15,
S.COL19, S.COL5, S.COL15, S.COL22,
S.COL27, S.LAST_FILE_ID, S.COL28,
S.COL29, S.COL31, S.COL33,
S.COL35, S.COL36, S.COL37,
S.COL38, S.COL39, S.COL40,
S.COL43, S.COL44, S.F1, S.F2, S.F3,
S.F4, S.F5, S.F6, S.F7, S.F8, S.F9, S.F10,
S.COL45, S.COL46, S.COL47,
S.COL48, S.COL81
FROM ( SELECT /*+ use_hash(cts ca) */
CTS.COL5, CA.COL3, CTS.COL6,
CTS.COl76 AS COL6, CTS.COL2,
CTS.COL12, CTS.COL15, NULL AS COL19,
CTS.COL5, CTS.COL15, CTS.COL22,
CTS.COL27, CTS.COL7 AS LFID,
CTS.COL28, CTS.COL17, CTS.COL31,
CTS.COL33, CTS.COL35, CTS.COL36,
CTS.COL37, CTS.COL38,
CTS.COL39, CTS.COL40 AS COL40, :B5 AS
COL43, :B5 AS COL44, CTS.F1, CTS.F2,
CTS.F3, CTS.F4, CTS.F5, CTS.F6, CTS.F7, CTS.F8,
CTS.F9, CTS.F10, CTS.COL45, :B4 AS COL46, :B3 AS
COL47, CTS.COL48, CTS.COL49, CTS.COL50,
CTS.COL51, CTS.COL52,
CTS.COL53, CTS.COL54,
CTS.COL55, CTS.COL48,
CTS.COL28
FROM DW_USR.CTS1 CTS,
(SELECT COL2, COL3, COL5
FROM (SELECT COL2, COL3,
COL5, ROW_NUMBER () OVER(PARTITION BY COL2, COL5
ORDER BY COL43 DESC, IND_ID) C_RNK
FROM DW_USR.CA1  WHERE COL2 = :B1 ) C
WHERE C_RNK = 1) CA
WHERE CTS.P_ID = :B2 -- partition ID on stage table
AND CTS.COL2 = :B1 -- skewed partition indicator for final tables
AND CTS.COL2 = CA.COL2 (+)
AND CTS.COL5 = CA.COL5 (+)
AND CTS.COL72 = 'N' ) S ) X
WHERE X.COL3 IS NOT NULL
AND NOT EXISTS
( SELECT /*+ use_hash(ct) */ 1  -- Hint, yeah, who cares??
FROM DW_USR.CT1 CT
WHERE X.COL9 = CT.COL9
AND X.COL4 = CT.COL4
AND X.COL2 = CT.COL2
AND X.COL2 = CT.COL3
AND CT.COL3 = :B1 );
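For anyone wanting to hunt down similar plan-flippers across a workload, a simple sketch against AWR history (the threshold of 5 is arbitrary):

```sql
-- Sketch: find statements that have flipped between many plans in AWR.
SELECT sql_id, COUNT(DISTINCT plan_hash_value) AS plan_cnt
  FROM dba_hist_sqlstat
 GROUP BY sql_id
HAVING COUNT(DISTINCT plan_hash_value) > 5
 ORDER BY plan_cnt DESC;
```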

This is where I remind everyone that yes, a hint can be ignored by the CBO.  A hint is simply a recommendation to the database.  If Oracle believes it has more complete information than your hint, it can continue to make its own decision about the path it will take in the database.

  • The data is skewed across partitions.
  • The partitioning key is different on the ca table than on the other tables.
  • The plan changes were not just a different join order; they involved different index usage, too.

How is this possible?  For the cts staging table, the actual P_ID, or partitioning key, is a random number.  There are 1500 of these partitions, mapped out in a table that marks them as used or unused.  The original concept for the design was that this staging table was a “parking lot” and any load process could choose an empty space, (partition) and load into it.  This left us open to considerable challenges for the CBO to make a solid decision, when the lot could have a smart car in a spot one time and a semi-truck with trailer in the same spot the next run.  I don’t find this a flaw in the CBO, but one in the design that the developer and the DBA group needed to address.

 

PERFORMANCE TANK

Two weeks since the change:

  • Very minor performance hits, outside of what I consider “why we should not insert/update/delete against large partitions” issues.  Can anyone say “REWRITE!!”?
  • The stats job starts to run long… longer each day.  Not wanting to complete.
  • The SYSTEM and SYSAUX tablespaces receive multiple additional space requests, (escalated and performed by after-hours support, so not first noticed by the main DBA staff, with multiple environments and tablespace requests nightly as it is…)

 

Due to a performance challenge on a report, I ran a 10046 trace.  A red flag triggers when I start noticing large counts and long runs against certain data dictionary objects; those involved with histograms and object stats are clearly apparent:

 

SQL ID: 96g93hntrzjtr Plan Hash: 2239883476     COST: RULE

 

select /*+ rule */ bucket_cnt, row_cnt, cache_cnt, null_cnt, timestamp#,
sample_size, minimum, maximum, distcnt, lowval, hival, density, col#,
spare1, spare2, avgcln
from
hist_head$ where obj#=:1 and intcol#=:2
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        7      0.00       0.00          0          0          0           0
Execute    209      0.02       7.01          0          0          0           0
Fetch      209      0.00       3.04          6        836          0         209
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      425      0.02      10.06          6        836          0         209

 

SQL ID: db78fxqxwxt7r Plan Hash: 3312420081  COST:  RULE

 

select /*+ rule */ bucket, endpoint, col#, epvalue
from
histgrm$ where obj#=:1 and intcol#=:2 and row#=:3 order by bucket
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        4      0.00       0.00          0          0          0           0
Execute     56      0.01       7.03          0          0          0           0
Fetch       56      0.00       5.21         26        168          0         805
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      116      0.01      12.25         26        168          0         805

 

I knew I was seeing some large waits, but I was collecting stats on system-level objects and fixed stats, and besides, these internal queries use the Rule Based Optimizer.  What was behind the sudden, consistent high fetch counts and times on these data dictionary calls for the optimizer?

I then noted the allocation of more space to the SYSTEM and SYSAUX tablespaces and queried the largest segments from those two tablespaces:
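The exact script isn’t in the post, but a query along these lines produced the list below:

```sql
-- Sketch of the segment-size check: top 10 segments in SYSTEM/SYSAUX.
SELECT *
  FROM (SELECT segment_name, ROUND(bytes / 1024 / 1024) AS mb
          FROM dba_segments
         WHERE tablespace_name IN ('SYSTEM', 'SYSAUX')
         ORDER BY bytes DESC)
 WHERE ROWNUM <= 10;
```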

 

 SEGMENT_NAME                             MB
1              C_OBJ#_INTCOL#                           6723
2              I_WRI$_OPTSTAT_H_OBJ#_ICOL#_ST           6113
3              WRI$_OPTSTAT_HISTGRM_HISTORY             4697
4              I_WRI$_OPTSTAT_H_ST                      3038
5              WRH$_ACTIVE_SESSION_HISTORY              1910
6              SYS_LOB0000008982C00004$$                1096
7              WRH$_SQL_PLAN                            1088
8              WRI$_OPTSTAT_HISTHEAD_HISTORY            971
9              I_WRI$_OPTSTAT_HH_OBJ_ICOL_ST            935
10             I_H_OBJ#_COL#                            804

 

I then checked a few partitioned tables, noting HOW many histograms were being collected on all the wide tables vs. what would ever be utilized in the simple queries, (remember, the queries here want ALL the data for anything in 2-4 columns…)  Previously we would have had column stats only on the indexed columns, with little histogram data.

 

Of the main five wide tables, I had the following histogram distribution:

OBJECT                 INDEXED COLUMNS                        HISTOGRAM COLUMNS
CA_TBL                      2%, (3)                                                  80%
CT_TBL                      4%, (5)                                                  84%
I_TBL                       2%, (3)                                                  70%
HH                          2% (3)                                                   74%
ILS                         1% (1)                                                   90%
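A sketch of how the histogram side of that distribution can be pulled from the dictionary (the owner is illustrative; the indexed-column percentages came from comparing against the indexed columns per table):

```sql
-- Sketch: per-table counts of columns carrying stats vs. columns with
-- histograms (owner is illustrative):
SELECT table_name,
       COUNT(*) AS cols_with_stats,
       SUM(CASE WHEN histogram <> 'NONE' THEN 1 ELSE 0 END) AS hist_cols
  FROM dba_tab_col_statistics
 WHERE owner = 'DW_USR'
 GROUP BY table_name
 ORDER BY hist_cols DESC;
```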

 

The final counts came up and I had 30GB of new stats info *bloating* my SYSTEM and SYSAUX tablespaces that I hadn’t had before, and that was causing performance issues of a kind many don’t really think of.

The performance waits added by the additional statistics on the database, which you also need to add to all load processes, as they collect statistics at the TIME we load the partition, increased processing time by 18% overall.

Now, with this said, we need to think about how stats are retained in the data dictionary, both for use by the CBO and if we needed to revert.

Would I gain anything in performance by reverting back?

Upon testing I realized:

  • A simple reversion to the 10g plans would not solve this issue.
  • A simple change back in how we collect statistics would not correct the issue, short of clearing all the old stats out of the data dictionary once past the retention time.

I chose to perform some lower-level tests against the stats changes.  I just happened to have a couple of copies of the main schema for our source database lying around!  They are utilized to build our marts, but available for testing afterwards.

I performed the following:

  • Changed the retention from 31 days to 10:
exec dbms_stats.alter_stats_history_retention(10);
  • Tried to purge the stats from the data dictionary, but after two days of waiting for the delete statement to clear out some 30GB of stats, finally gave up and cleared it manually:
exec DBMS_STATS.PURGE_STATS(SYSDATE-10);
alter table WRI$_OPTSTAT_IND_HISTORY move;
  • After this, needed to rebuild my indexes…
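The move leaves the OPTSTAT history indexes UNUSABLE, so the rebuilds can be generated rather than typed; a sketch:

```sql
-- Sketch: generate rebuild statements for the OPTSTAT history indexes
-- left UNUSABLE by the table move.
SELECT 'ALTER INDEX ' || owner || '.' || index_name || ' REBUILD;'
  FROM dba_indexes
 WHERE table_name LIKE 'WRI$_OPTSTAT%'
   AND status = 'UNUSABLE';
```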

 

Well, that’s a little better… from 30GB down to 12.6GB.

 

Choosing the following statement as a test example I could point at a couple of partitions, (there were others…):

SELECT I.I_ID, I.FN_COL, I.ROWID, I.ORA_ROWSCN
FROM CA_TBL C, I_TBL I
WHERE C.M_ID = 41891 -- partitioning key
AND C.C_NBR = 2000065698902 -- unique column, part of index
AND I.I_ID = C.I_ID; -- unique column, part of index

 

I chose one of the largest partitions in the CA_TBL and noted that it had stats collected on 100% of the columns, and 80% of those contained histograms.

The execution time was 12.02 seconds.

Ran a 10053 trace file throughout the steps…

 

Clean out ALL STATS:

EXEC DBMS_STATS.DELETE_TABLE_STATS(ownname=>'<OWNER>', tabname=>'CA_TBL', PARTNAME=>'P_41891');

 

Recollected the table stats:

EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname=>'<OWNER>', tabname=>'CA_TBL', partname=>'P_41891', method_opt=>'FOR ALL INDEXED COLUMNS', cascade=>FALSE, degree=>8);
EXEC DBMS_STATS.GATHER_INDEX_STATS(ownname=>'DW_PROD', indname=>'CA_ACCT_PK', partname=>'P_41891');

Cleared the shared pool and ran again…

Elapsed: 00:00:00.01

Yes, I traced it, yes, I grabbed a 10053 trace, too.  I wanted to know WHY performance was so much better!

 

It used the SAME explain plan both times:

First RUN:

============
Plan Table
============
----------------------------------------------------------+-----------------------------------+---------------+
| Id  | Operation                            | Name       | Rows  | Bytes | Cost  | Time      | Pstart| Pstop |
----------------------------------------------------------+-----------------------------------+---------------+
| 0   | SELECT STATEMENT                     |            |       |       |  1965 |           |       |       |
| 1   |  NESTED LOOPS                        |            |       |       |       |           |       |       |
| 2   |   NESTED LOOPS                       |            |     1 |    52 |  1965 |  60:21:57 |       |       |
| 3   |    PARTITION LIST SINGLE             |            |     1 |    26 |  1964 |  60:19:06 | KEY   | KEY   |
| 4   |     INDEX RANGE SCAN                 | CA_PK |     1 |    26 |  1964 |  60:19:06 | 2636  | 2636  |
| 5   |    INDEX UNIQUE SCAN                 | I_PK     |     1 |       |     1 |  00:02:51 |       |       |
| 6   |   TABLE ACCESS BY GLOBAL INDEX ROWID | I_TBL |     1 |    26 |     1 |  00:02:51 | ROW LOCATION| ROW LOCATION|
----------------------------------------------------------+-----------------------------------+---------------+
Predicate Information:
----------------------
4 - access("C"."M_ID"=41891)
4 - filter(TO_NUMBER("C"."C_NBR")=2000065698902)
5 - access("I"."IND_ID"="C"."I_ID")

 

Second RUN:

============
Plan Table
============
----------------------------------------------------------+-----------------------------------+---------------+
| Id  | Operation                            | Name       | Rows  | Bytes | Cost  | Time      | Pstart| Pstop |
----------------------------------------------------------+-----------------------------------+---------------+
| 0   | SELECT STATEMENT                     |            |       |       |   624 |           |       |       |
| 1   |  NESTED LOOPS                        |            |       |       |       |           |       |       |
| 2   |   NESTED LOOPS                       |            |     1 |    52 |   624 |  19:10:52 |       |       |
| 3   |    PARTITION LIST SINGLE             |            |     1 |    26 |   623 |  19:08:01 | KEY   | KEY   |
| 4   |     INDEX RANGE SCAN                 | CA_PK |     1 |    26 |   623 |  19:08:01 | 2636  | 2636  |
| 5   |    INDEX UNIQUE SCAN                 | I_PK     |     1 |       |     1 |  00:02:51 |       |       |
| 6   |   TABLE ACCESS BY GLOBAL INDEX ROWID | I_TBL |     1 |    26 |     1 |  00:02:51 | ROW LOCATION| ROW LOCATION|
----------------------------------------------------------+-----------------------------------+---------------+
Predicate Information:
----------------------
4 - access("C"."M_ID"=41891)
4 - filter(TO_NUMBER("C"."C_NBR")=2000065698902)
5 - access("I"."I_ID"="C"."I_ID")

 

But you’ll notice the cost has gone down significantly for CA_PK on the CA_TBL.  The time estimate has decreased significantly.  Why?

 

In the 10053 Trace File, the Before and AFTER:

FIRST RUN-

Index Stats::
Index: CA_PK  Col#: 1 4 2  PARTITION [2635]
LVLS: 2  #LB: 19641  #DK: 16052378  LB/K: 1.00  DB/K: 1.00  CLUF: 11032002.00
LVLS: 2  #LB: 19641  #DK: 16052378  LB/K: 1.00  DB/K: 1.00  CLUF: 11032002.00
Index: CA_IX1  Col#: 4  PARTITION [2635]
LVLS: 2  #LB: 9776  #DK: 15506817  LB/K: 1.00  DB/K: 1.00  CLUF: 11030574.00
LVLS: 2  #LB: 9776  #DK: 15506817  LB/K: 1.00  DB/K: 1.00  CLUF: 11030574.00
Access path analysis for CA_TBL1

SECOND RUN-

Index Stats::
Index: CA_ PK  Col#: 1 4 2
    USING COMPOSITE STATS
LVLS: 2  #LB: 7467054  #DK: 5267937617  LB/K: 1.00  DB/K: 1.00  CLUF: 4906174903.00
Index: CA_IX1  Col#: 4
USING COMPOSITE STATS
LVLS: 2  #LB: 4006915  #DK: 3749763777  LB/K: 1.00  DB/K: 1.00  CLUF: 4911402731.00
Access path analysis for CA_TBL

 

By using “FOR ALL INDEXED COLUMNS”, composite stats were built for the index in question, and this resulted in better costing for the query used in the test than the individual column stats that had been created.

 

The second gain was due to the data dictionary cleanup, which removed the waits on the internal SQL statements:

SQL ID: 96g93hntrzjtr Plan Hash: 2239883476

 

select /*+ rule */ bucket_cnt, row_cnt, cache_cnt, null_cnt, timestamp#,
sample_size, minimum, maximum, distcnt, lowval, hival, density, col#,
spare1, spare2, avgcln
from
hist_head$ where obj#=:1 and intcol#=:2
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        6      0.00       0.00          0          0          0           0
Execute     84      0.00       0.01          0          0          0           0
Fetch       84      0.00       0.04          8        330          0          78
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      174      0.00       0.05          8        330          0          78

 

SQL ID: db78fxqxwxt7r Plan Hash: 3312420081

 

select /*+ rule */ bucket, endpoint, col#, epvalue
from
histgrm$ where obj#=:1 and intcol#=:2 and row#=:3 order by bucket
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        3      0.00       0.00          0          0          0           0
Execute     36      0.00       0.00          0          0          0           0
Fetch       36      0.00       0.05         11        108          0         623
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       75      0.00       0.05         11        108          0         623

 

Ahhhh…much better!!

 

Why did I delete the old statistics before collecting with a new method option?

I’ve always been fascinated when comparing DBA_TAB_COLUMNS vs. DBA_TAB_COL_STATISTICS and DBA_PART_HISTOGRAMS.  I’ve often noted the discrepancy between these views when, instead of purging and recollecting stats, a different method option is chosen, leaving old histogram data in place, and how this can impact the choices for the CBO:

 

  • Columns with “old” histogram/column data appear more attractive than columns with updated statistics when natural growth occurs.
  • Incorrect execution plans that confuse the optimizer.
  • Misleading data for the DBA/Developer when viewing statistical info from the column stats views vs. the histogram or historical stats data.
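A sketch of the kind of comparison I mean, lining up every column a table owns against whatever column-level stats and histogram data are still sitting in the dictionary (owner and table names are illustrative):

```sql
-- Sketch: spot leftover or stale column stats/histograms per column.
SELECT c.column_name, s.histogram, s.last_analyzed
  FROM dba_tab_columns c
  LEFT OUTER JOIN dba_tab_col_statistics s
    ON s.owner = c.owner
   AND s.table_name = c.table_name
   AND s.column_name = c.column_name
 WHERE c.owner = 'DW_USR'
   AND c.table_name = 'CA_TBL'
 ORDER BY s.last_analyzed NULLS FIRST;
```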

 

My First Recommendations

  1. Decrease the retention policy on statistics from 31 days to 7 to purge the stats from the data dictionary quicker.
  2. Attempt to test, on one of the datamarts, DBMS_STATS.PURGE_STATS(sysdate-7) and see how long it would take to purge the old stats, (the mart contained the source data dictionary from the RMAN duplicate…)
  3. Rebuild the supporting “OPTSTAT” indexes that would now be quite fragmented from the purging process.
  4. Only collect statistics for indexed columns from this point on, documenting the scenario for historical purposes.

 

Upon taking on this task, step #1 was quick.  Step #2, not so much.

DBMS_STATS.PURGE_STATS utilizes a delete statement.  Have you ever tried to run a single delete on constantly accessed, bloated data dictionary tables, non-parallel, that must complete before it will commit?

 

Two attempts were made to purge stats from the data dictionary; both lasted over two days.  The first was forced to end due to business requirements; the second ended during a sudden hiccup on the network, (why DBA’s should nohup or screen EVERYTHING! :))

 

Final Recommendation

The final recommendation, as I am in my final week with the company, was to pull back the retention time and let the stats age out naturally.

I also recommended that, since partition-level stats are collected in the code during the load stage, the best choice was to delete the stats before recollecting, but do so with GRANULARITY=>'AUTO', so as to not collect stats again on partition-level objects already covered.
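As a sketch, that recollection for one of the wide tables would look something like this (schema and table names are illustrative):

```sql
-- Sketch of the recommended recollection: delete first so no old histogram
-- data lingers, then gather with GRANULARITY=>'AUTO' so partitions whose
-- stats the load code already maintains are not redundantly re-gathered.
BEGIN
  DBMS_STATS.DELETE_TABLE_STATS(ownname => 'DW_USR', tabname => 'CA_TBL');
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname     => 'DW_USR',
    tabname     => 'CA_TBL',
    granularity => 'AUTO',
    method_opt  => 'FOR ALL INDEXED COLUMNS');
END;
/
```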

 

Moral

There is always one exception to the story.  The day we say that something should ALWAYS be done a certain way, or blindly do something because that is the way it is supposed to be done, without using our intellect and skill to verify it is the right choice for our environment, is the day we make a mistake.

I still believe in the standard practice for how statistics are best collected in Oracle 11g, but I am also aware that I just happen to be working in an environment with the combination that did not benefit from the common, best choice.


March 9th, 2012 by

facebook comments:

  • “… If Oracle believes it has more complete information than your hint, it can continue to make its own decision about the path it will take in the database.” Would that be more like: Optimizer will obey a hint if it can. Sometimes query transformations make hints impossible to follow.

  • Yes, that too… 🙂

  • What a great post. Thank you!
    It just comes down to 2 points (for me):
    * there is NO ‘best practice’ than test-think-understand
    * Oracles default are wisely chosen – for the 80% ‘ordinary’ systems
    Martin

  • Thanks Martin!
    I agree completely. The other DBA’s heard me say it over and over, “I had such high hopes!” I really wanted us to be in that 80%, but I know what we do here and how we do it makes many of the “rules of thumb” we have as DBA’s get thrown right out the window as soon as you walk in the door! 🙂

  • Hi Kellyn,

    Interesting article.

    >This is where I remind everyone that yes, a hint can be ignored by the CBO. A hint is simply a recommendation to the database. If Oracle believes it has more complete information than your hint, it can continue to make its own decision about the path it will take in the database.

    I prefer Jonathan Lewis’s description along the lines that a hint is an optimizer directive that will be obeyed provided it is valid.

    The problem with later versions of Oracle is that specifying a hint correctly so that it is valid post query-transformation stage is tricky.

    Upgrades are particularly revealing of poor/changing hinting syntax.

    I’ve been dealing with an upgrade recently that involves a lot of large SQL and a lot of single hints (index or use_nl) for queries with 8 or 9 tables and really been hammering home that a single hint here or there is not plan stability.

    > It used the SAME explain plan both times
    > But you’ll notice the cost has gone down significantly for CA_PK on the CA_TBL. The time has decreased significantly.

    I can understand how the change in stats might affect the performance of recursive SQL related to internal stats-related objects but I’d expect the overall impact of this to be relatively small.

    But for the application SQL, the plan is the same.
    Surely any reduction in cost or estimate time is arbitrary.
    If the plan is the same, surely all that has changed is the estimated execution time and estimated cost?

    If all that’s changed is the stats and that didn’t affect the plan then surely they’re not going to affect the actual execution of the query but something else perhaps?

    If something tangible to a sql statement’s performance has changed and it’s not the plan and it’s not related to caching (?) and you’ve not changed anything physical about the structures involved then what?

    >Moral
    Never say never, never say always

    Cheers,
    Dominic

  • Hi Dominic,
    Thank you for your comments and you are right- the only change being the internal stats related object being the source of the over-all performance gain is correct.
    If you look into the 10053 excerpts below that, it was the change from individual column stats to composite stats for the multi-column index. I should have ALSO included that the CA_PK is three columns, not ONE!! Apologies for that info being out of the post.
    Thank you again,
    Kellyn

  • My understanding, in most cases, was that a hint ensured that a specific plan was considered.

    If the plan resulting from the hint is still not theoretically the cheapest, then it will be rejected.

    Dave

    P.S. great post.

