I had worked hard on a report, 47 SQL statements in all to tune it down from 5 hrs to under 30 minutes.  The first runs had been quite successful, so when a third run sent an alert on temp usage, I knew something was wrong.

     SID PROCESS      MACHINE SQL_TEXT    SQL_ID     TOTAL MB
-------- ------------ -------------------- ---------------------
     507 1137	      HOST SELECT *** 7t3muww36xhzn  45516
     600 1139	      HOST SELECT *** 7t3muww36xhzn  45516
     525 1132	      HOST SELECT *** 7t3muww36xhzn  45516
     509 1135	      HOST SELECT *** 7t3muww36xhzn  45516

I checked the stats first, as one of the fixes was to ensure the staging tables in this process were collecting stats after the initial feeds came in, but both tables involved showed valid statistics:

SQL> select num_rows, last_analyzed from dba_tab_partitions 2 where table_name='<I_STAGE>' 3 and partition_name='P170';

  NUM_ROWS LAST_ANAL
---------- ---------
 480900000 17-OCT-11
SQL> select last_analyzed from dba_tables 2 where table_name='<SML_TBL>';

LAST_ANAL
---------
17-OCT-11

I ran a quick AWR report for the specific SQL_ID to see what I was dealing with, execution plan wise..


              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     46198 18-Oct-11 09:00:09       282       7.9
  End Snap:     46200 18-Oct-11 10:00:13       245       8.4
   Elapsed:               60.07 (mins)
   DB Time:            1,874.62 (mins)

 

SQL ID: 7t3muww36xhzn           DB/Inst: PRODUCTION/PROD  Snaps: 46198-46200
-> 1st Capture and Last Capture Snap IDs
   refer to Snapshot IDs witin the snapshot range
-> SELECT ***

    Plan Hash           Total Elapsed                 1st Capture   Last Capture
#   Value                    Time(ms)    Executions       Snap ID        Snap ID
--- ---------------- ---------------- ------------- ------------- --------------
1   324636810               4,785,428             4         46199          46199
2   4097803110                  1,047             1         46200          46200
          -------------------------------------------------------------

Plan 1(PHV: 324636810)
----------------------

Plan Statistics                 DB/Inst: PRODBASE/prodbase  Snaps: 46198-46200
-> % Total DB Time is the Elapsed Time of the SQL statement divided
   into the Total Database Time multiplied by 100

Stat Name                                Statement   Per Execution % Snap
---------------------------------------- ---------- -------------- -------
Elapsed Time (ms)                         4,785,428    1,196,357.0     4.3
CPU Time (ms)                             3,450,070      862,517.6     8.4
Executions                                        4            N/A     N/A
Buffer Gets                                 759,453      189,863.3     0.0
Disk Reads                                  683,619      170,904.8     2.8
Parse Calls                                      35            8.8     0.0
Rows                                              0            0.0     N/A
Execution Plan
------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name        | Rows  | Bytes | Cost  | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |             |       |       |     9 |       |       |        |      |
|   1 |  COUNT STOPKEY                |             |       |       |       |       |       |        |      |
|   2 |   PX COORDINATOR              |             |       |       |       |       |       |        |      |
|   3 |    PX SEND QC (ORDER)         | :TQ10002    |     1 |   240 |     9 |       |       |  Q1,02 | P->S | QC (ORDER)
|   4 |     VIEW                      |             |     1 |   240 |     9 |       |       |  Q1,02 | PCWP |
|   5 |      SORT ORDER BY STOPKEY    |             |     1 |   120 |     9 |       |       |  Q1,02 | PCWP |
|   6 |       PX RECEIVE              |             |     1 |   240 |       |       |       |  Q1,02 | PCWP |
|   7 |        PX SEND RANGE          | :TQ10001    |     1 |   240 |       |       |       |  Q1,01 | P->P | RANGE
|   8 |         SORT ORDER BY STOPKEY |             |     1 |   240 |       |       |       |  Q1,01 | PCWP |
|   9 |          HASH JOIN            |             |     1 |   120 |     5 |       |       |  Q1,01 | PCWP |
|  10 |           PX RECEIVE          |             |     1 |   109 |     3 |       |       |  Q1,01 | PCWP |
|  11 |            PX SEND BROADCAST  | :TQ10000    |     1 |   109 |     3 |       |       |  Q1,00 | P->P | BROADCAST
|  12 |             PX BLOCK ITERATOR |             |     1 |   109 |     3 |   KEY |   KEY |  Q1,00 | PCWC |
|  13 |              TABLE ACCESS FULL| I_STAGE     |     1 |   109 |     3 |   KEY |   KEY |  Q1,00 | PCWP |
|  14 |           PX BLOCK ITERATOR   |             |  5002 | 55022 |     2 |       |       |  Q1,01 | PCWC |
|  15 |            TABLE ACCESS FULL  | SML_TBL     |  5002 | 55022 |     2 |       |       |  Q1,01 | PCWP |
------------------------------------------------------------------------------------------------------------------------

This is the one that was eating up all the temp! Note that even though I checked stats, stats were correct as of the previous day, no changes to the partition stats, the execution plan only shows one row, which anyone who listens to Maria Colgan knows, that’s just Oracle giving you the benefit of the doubt and saying, “I don’t think there’s any rows in this object, (or sub-object in this case..) but I’ll give you 1 row for the fun of it!”

Second execution plan in the report is the one I desired:

Plan 2(PHV: 4097803110)
-----------------------

Plan Statistics                 DB/Inst: PRODUCTION/PROD  Snaps: 46198-46200
-> % Total DB Time is the Elapsed Time of the SQL statement divided
   into the Total Database Time multiplied by 100

Stat Name                                Statement   Per Execution % Snap
---------------------------------------- ---------- -------------- -------
Elapsed Time (ms)                             1,047        1,046.7     0.0
CPU Time (ms)                                   967          967.0     0.0
Executions                                        1            N/A     N/A
Buffer Gets                                   2,007        2,007.0     0.0
Disk Reads                                        3            3.0     0.0
Parse Calls                                       9            9.0     0.0
Rows                                             40           40.0     N/A
User I/O Wait Time (ms)                           1            N/A     N/A
Execution Plan
------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name        | Rows  | Bytes |TempSpc| Cost  | Pstart| Pstop |    TQ  |IN-OUT| PQ
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |             |       |       |       |  2834 |       |       |        |      |
|   1 |  COUNT STOPKEY                |             |       |       |       |       |       |       |        |      |
|   2 |   PX COORDINATOR              |             |       |       |       |       |       |       |        |      |
|   3 |    PX SEND QC (ORDER)         | :TQ10002    |  1272K|   291M|       |  2834 |       |       |  Q1,02 | P->S | QC
|   4 |     VIEW                      |             |  1272K|   291M|       |  2834 |       |       |  Q1,02 | PCWP |
|   5 |      SORT ORDER BY STOPKEY    |             |  1272K|   152M|   389M|  2834 |       |       |  Q1,02 | PCWP |
|   6 |       PX RECEIVE              |             |    40 |  9600 |       |       |       |       |  Q1,02 | PCWP |
|   7 |        PX SEND RANGE          | :TQ10001    |    40 |  9600 |       |       |       |       |  Q1,01 | P->P | RA
|   8 |         SORT ORDER BY STOPKEY |             |    40 |  9600 |       |       |       |       |  Q1,01 | PCWP |
|   9 |          HASH JOIN            |             |  1272K|   152M|       |     7 |       |       |  Q1,01 | PCWP |
|  10 |           PX RECEIVE          |             |  5005 | 55055 |       |     2 |       |       |  Q1,01 | PCWP |
|  11 |            PX SEND BROADCAST  | :TQ10000    |  5005 | 55055 |       |     2 |       |       |  Q1,00 | P->P | BR
|  12 |             PX BLOCK ITERATOR |             |  5005 | 55055 |       |     2 |       |       |  Q1,00 | PCWC |
|  13 |              TABLE ACCESS FULL| SML_TBL     |  5005 | 55055 |       |     2 |       |       |  Q1,00 | PCWP |
|  14 |           PX BLOCK ITERATOR   |             |  1271K|   139M|       |     4 |   KEY |   KEY |  Q1,01 | PCWC |
|  15 |            TABLE ACCESS FULL  | I_STAGE     |  1271K|   139M|       |     4 |   KEY |   KEY |  Q1,01 | PCWP |
------------------------------------------------------------------------------------------------------------------------

So what changed? What impacted my statistics?

Upon investigation, I came to the conclusion that it is a combination of a “feature” with what I think is a bug in 10g dbms_stats.

A search of stats processing showed that during the one process that was executing against the P170 partition on the I_STAGE, there were a number of other partitions in this same table having stats gathered post loading.

declare v_stage_table_name varchar2(64); begin select min(stage_table_name) into v_stage_table_name from stage_tables 
where stage_table_type_cd = '<I_STAGE>'; dbms_stats.gather_table_stats (ownname => 'dw_user', tabname => v_stage_table_name, 
partname => 'P450' ,estimate_percent =>.01, granularity=>'PARTITION', method_opt=>'for all columns size 1', 
no_invalidate=> false, cascade=>false, degree=>4); end;

Now the key here in the statement above is:

no_invalidate=>false

If you read the description for this from Oracle:

no_invalidate Does not invalidate the dependent cursors if set to TRUE. The procedure invalidates the dependent cursors immediately if set to FALSE. Use DBMS_STATS.AUTO_INVALIDATE. to have Oracle decide when to invalidate dependent cursors. This is the default. The default can be changed using the SET_PARAM Procedure.

The surmised bug is one where even though the dbms_stats being performed by another process is partition level, the invalidation of the cursors is across all partitions in the object, causing them all to be invalidated, requiring them to re-parse the SQL.  (There are a number of similar bugs already documented in 10.2.0.4.0 for partition level statistics gathering…)

The feature to allow Oracle to re-parse and take advantage of the newest statistics information in the data dictionary resulted in a poor performance challenge in this instance, as the cursors were invalidated on a process that needed no changes to statistics.

I tested repeatedly against partitions, collecting stats with the no_invalidate set to false or true and even to AUTO to see what would occur and it consistently impacted my cursors against other partitions.  I can find not documented bug, but as many know, I’m about to move everything to 11g  in short order and expect it would be a waste of time to pursue it to far…

I notified the Java developer who owns this code to please update the no_invalidate=>true to correct the performance impact short term and look forward to 11g bugs to replace my exhaustion on 10g ones! :)

« »