Oracle

But…. I Collected Stats!–Updated!!

*****After much adieu, I’ve finally returned to this post to rewrite what concerned me and befuddled others… :)*****

I’ve worked with some great developers.  Developers who can write code in their sleep to perform the most complex functions and aggregations, (makes me quite envious, as I need to actually think how to write code… :))  Most of these developers are still tripped up by the Oracle Cost Based Optimizer, (CBO) and honestly, I think they should be.  Just how I have to think about how to code, it should be expected that they need to think about how the CBO works.  It is one of those fundamental differences between DBA’s and Developer’s mindsets that make each group special.

*****This is where I’m going to rewrite this post.  Reason?  I rewrote the original query to disguise my company’s SQL as I’m always requested to do, but I disguised it so well that it didn’t have the same challenges as the original or address something that I really wanted to discuss.  I’m still going to change the column and table names, but I’m going to use the original query, as this is a topic I really want to discuss honestly, (forgive me boss for I have sinned…. J)  We worked with a number of different options with this query, with the sort on this wide table for a couple days.  It’s a new design that is part of a new project and it was time to really think outside the box, (which I will get into at the end of this post. *********

Our example for the evening is a query that is performing poorly.  The (REAL) query is a select on ONE table with parallel to gather a performance gain from temp tablespace groups and includes an order by:

select /*+ PARALLEL(t1, 4) */  t1.* from TBL1 t1 
order by t1.MAIN_ID;

(Now let’s admit to the real performance challenge here…) The query isn’t returning after a full day, you note that the long ops on the process states it has 4138 hrs left and climbing, but you notice, if you simply take out the order by, the query returns in a matter of minutes ALL OF THE DATA.  Why, when you put in the order by, it refuses to return anything?

As the sort is the pinpoint of the latency, a quick look at the most common work area for this large of a sort would be in order-  TEMP usage:

select vst.sql_text,  swa.sid, swa.tablespace

, swa.operation_type
, trunc(swa.work_area_size/1024/1024) “PGA MB”
, trunc(swa.max_mem_used/1024/1024)”Mem MB”
, trunc(swa.tempseg_size/1024/1024)”Temp MB”
from v$sql_workarea_active swa, v$session vs, v$sqltext vst
where swa.sid=vs.sid
and swa.sql_id=vs.sql_id
and vs.sql_id=vst.sql_id
and vst.piece=0;

 SQL_TEXT                                SID TABLESPACE   OPERATION PGA MB Mem MB Temp MB

select /*+ PARALLEL(t1, 4) */    234 TEMP_A              SORT (v2)      47           1024         96179
select /*+ PARALLEL(t1, 4) */    377 TEMP_B              SORT (v2)      58           1024         116256
select /*+ PARALLEL(t1, 4) */    260 TEMP_C              SORT (v2)      76           1024         146017
select /*+ PARALLEL(t1, 4) */    304 TEMP_D              SORT (v2)      47           1024         96269

Note the difference in the parallel slave use of temp.  This is a clear indicator that Oracle is doing what it can with what it knows about these objects, but something is amiss.

What are we working with here?
select segment_name, bytes/1024/1024 MB from dba_segments
where segment_name=('TBL1')
group by segment_name;
 
TBL1= 243GB
 
Rest of info:
TBL1=558 million rows, 238 columns, last analyzed, 2 hrs ago.
 
Now that we know what we are dealing with-  Why are the sorts out of whack on the tables?  The tables have been analyzed. 
 
Oracle can only sort data if it knows what it is dealing with.  What is required to sort data correctly?  Accurate information and as a sort is at a column level, it must be at the COLUMN level.  The question the DBA should ask themselves is not "Have statistics been gathered?" but if "Have statistics been gathered correctly for how the data is being utilized?"
 
For the DBA, this is where the DBA_TAB_COLUMNS view can come in very handy.  It can quickly show you if the info that is required for the where clause is present for the CBO.
 
select table_name, column_name, num_distinct, low_value, high_value, num_nulls,
last_analyzed, global_stats, user_stats, histogram from dba_tab_columns
where table_name in ('TBL1','TBL2)
and column_name in ('MAIN_ID');

Yeah, I think the database is trying to tell us something… 🙂
TBL1 MAIN_ID NO NO NONE

How can the database know how to best sort the data if it doesn’t know anything about the columns it is supposed to join or sort on??

Table statistics were collected, but no column information was collected.  The most likely instigator would be method_opt=’FOR ALL INDEXED COLUMNS’ and due to these tables not having any indexes, no column data was collected.  Recollect the statistics with method_opt=’FOR ALL COLUMNS SIZE 1′ .

exec dbms_stats.gather_table_stats( ownname=>'OWNER1',tabname=>'TBL1', ESTIMATE_PERCENT=>.00001, METHOD_OPT=> 'FOR ALL COLUMNS SIZE 1',CASCADE=TRUE,DEGREE=>8);
Now, the CBO will know a bit more about what it's working with!
select table_name, column_name, num_distinct, low_value, high_value, num_nulls,
last_analyzed, global_stats, user_stats, histogram from dba_tab_columns
where table_name in ('TBL1')
and column_name in ('MAIN_ID');
TBL1 MAIN_ID 586368000 C40D1A2050 C50A4736594A 0 10/25/2010 8:41:53 PM YES NO NONE
 
Returning to run the original query, now that the table has column stats: 
select /*+ PARALLEL(t1, 4) */ t1.* from TBL1 t1
order by t1.MAIN_ID;

The final performance and usage?  The example above was able to write the data to file in 5hrs and the TEMP usage was in line with the amount of data that was being sorted.  There is still a delay in pulling the data to the application layer, but that is a fight for another day.  The final answer for this query was simply the statistics, but the challenge for the people involved still revolves around the wide table and how to ease some of the strain on the database due to the sort:

SID TEMP      OPERATION PGA MB Mem MB Temp MB

307 TEMP_C SORT (v2)        1029           1024         1739

247 TEMP_A SORT (v2)        1024           1024        1755
286 TEMP_D SORT (v2)        1022           1024         1783
354 TEMP_B SORT (v2)        1024           1024         1755

****Coskan asked why I threw in the index and forced the query in the original version with a hint.  This was an attempt to “dimensionalize” the wide table.  Oracle will not utilize an index like a dimension table would be utilized in a star schema and perform a hash join unless you use a hint.  The test was how long would it take and how efficient would it be to sort on the indexed column vs. the table?

There were multiple tests, the first attempting to use the index, but due to the sort, it refused the parallelism which was required to get through the 558 million rows.  The second choice was to perform the following, which I will admit-  worked great for Oracle to write out to a file, did not work well for the application to utilize and perform the same! L

CREATE table MID_DATA tablespace USR_DATA1 as select MAIN_ID from TBL1;

This took only 42 minutes to create.  This was then used to join on the MAIN_ID and the order by.

SELECT TBL1.<each column but the MAIN_ID>, MD.MAIN_ID
from TBL1, MID_DATA MD
where TBL1.MAIN_ID=MD.MAIN_ID
order by MD.MAIN_ID;

I was able to query AND extract the data to an output file on a remote backup server to simulate what the application would do in 5 hrs.  I thought I had a winner here, but unfortunately, the application still sorted the entire table and temp usage went through the roof, (we run into that challenge often where a solution in Oracle does not solve the problem at the application SQL level due to incompatibilities in how the application presents the SQL, may rewrite the query, etc.)

There is a reason we have star schemas and when Oracle is unable to profit from one due to one reason or another, it can be an incredible battle for the DBA and developers who work in these environments.

6 thoughts on “But…. I Collected Stats!–Updated!!

  • A nice post giving more basic way to start looking into the where the problem is.Thanks for sharing.

  • Coskan Gundogar

    Kellyn,

    If you still needed hint then why did you gathered stats ? Till I saw the hints I was thinking stats worked like charms are they there as a typo or despite the fresh stats it is still not working ?

  • timurakhmadeev

    Hi Kellyn

    >This is a clear indicator that Oracle is doing what it can with what it knows about these objects, but something is amiss.
    Well, this looks nice for a DBA who feels the database 🙂 but there's absolutely not enough information to make a judgement that missing statistics was the real problem in your case. Why? Because:
    1) run time execution plans are missing
    2) OPERATION_ID column from v$sql_workarea_active is missing
    3) database structure was changed
    4) second query is forces to use new index
    5) statistics is gathered with estimate_percent=10^(-5), which is about 1.6*10^4 rows, which is about 2.5 MB of table data, which is quite small comparing to 243 GB

  • Kellyn Pedersen

    Coskan and Tim- you are both right… I thought about this post after I'd gone to bed, knowing I'd written this up way too quick and thinking, "I jumped into the conclusion without explaining anything and didn't really give enough information…" among other rambling thoughts and OK, admission here, cold medicine going through my brain…
    I'm going to revisit this in a couple days when everything settles down and step through the information more distinctly and go through Tim's 1-5 list. I think they are very valuable points that although I can do that with a developer face to face, I didn't do that here- so how is anyone supposed to understand how I got from point A, to point B and then onto point F?? 🙂
    Thanks guys!
    Kellyn

  • Hi Kellyn,

    I only now come across this post. I’ve recently encountered a similar thing regarding statistics resp. cardinality estimates and the PGA memory allocation of workareas (see here: http://oracle-randolf.blogspot.com/2011/01/hash-aggregation.html) but in my cases this behaviour could only be reproduced with HASH based aggregations, but sort operations didn’t expose the same.

    Could you share a little bit more details – which Oracle version / release was this?

    Are you sure that your successful attempt was not influenced by less concurrent load so that the “global memory bound” parameter allowed for larger workareas?

    Did you have a chance to develop a generic test case based on your actual case that you could share?

    It would be interesting to see how such a thing could happen to a sort operation – because according to Joze’s paper and my tests a sort operation should be able to consume all memory available up to the “global memory bound” no matter what the initial workarea size estimate based on the execution plan says.

    Randolf

  • Hi Randolf!
    I think you are onto something there. At the time I was working on 10.2.0.4.3 on Linux and we had a number of one-off patches that demanded we often used the hint, NO_USE_HASH_AGGREGATION in many queries as we hit the bug.

    Due to this I was constantly working with the developers to work around it, not perform them and so most likely avoided a lot of what you were experiencing.

    I definitely had a less concurrent load, but with heavy individual processing. The environment in question possessed a batch queue that only allowed six of these processes to run simultaneously.

    Let me know if you have any other questions, even though I’m no longer there, I still can easily speak to others who still are.

Comments are closed.