DBA Kevlar

Tips, tricks, (and maybe a few rants) so more DBAs become bulletproof!

When Parallel is Wrong

The Lead Developer and I have been working very, very hard on tuning the main process that produces much of the revenue for our company. The tuning exercise has been very successful but we’ve run into some challenging on the pre-processing that occurs for each week to build the mart that the newly tuned process runs on afterwards. Due to this, I’ve been granted some time to monitor the build process a little closer than previously and although I have a lot of tuning to do, some of the largest challenges comes in the form of queries that the analysts have to execute as part of the final checks before we toggle over to the newly created mart.

These statements all have the same where clause, two different tbl_keys, (12 or 48), pulling the entire table, (each table between 90-380GB) and then ordering the data by the i_id. They all have the same parallel hints, just different tables substituted into the statements. The reason for the sort is that the data is then pulled into an application to finish the final step. The application is incapable of dealing with this type of large sorts, so they decided to do it internal to the database. The final data is used to verify that the correct data is present in the mart of the members that were fed into the primary database that the mart is based off of for the timeline.

select /*+ full(t) parallel (t 8) */ t.*
from tbl_summary t WHERE (tbl_key=12 and usr_id=0)
 

order by i_id ;

The above statement is going to result in 16 parallel slaves, pulling the entire data set for user class=0 vs. a specific id, for the last 12 months of data, then sort it by the usr_id. Note the first hint- FULL. Realize, as this data is across multiple partitions, we will not see a “partition list single” in the explain plan, but “px iterator” for the database to correctly identify the individuals that meet the criteria for each partition. Once it completes this step, then the challenge comes in sorting all that data and the problem with the second hint- PARALLEL.

I love parallel. I am thrilled with the performance gains I can attain when hashing large tables when accessing multiple partitions. I do not love parallel when someone decides to put a group by or order by into the statement. If any sorting has to occur on the statement, I’ve been seeing some odd behavior in about 80% of the executions. It appears that when sorting, it’s not Oracle sorting the data all at once, but sorting one parallel slave’s data set at a time. This creates an escalation in temp space usage that can become insanely difficult for Oracle to process efficiently and performance degrades immensely.

This is what occurred with the statements for the analysts.
With Parallel:
select /*+ full(t) parallel (t 8) */ t.*
 
from tbl_summary t WHERE (tbl_key=12 and usr_id=0)
order by i_id ;

Elapsed Time Breakdown
SQL Time (sec) 47,951.17
PL/SQL Time (sec) 0.00
Java Time (sec) 0.00

Activity By Time
Elapsed Time (sec) 47,951.17
CPU Time (sec) 9,734.81
Wait Time (sec) 38,216.35

Activity By Waits
Remaining Waits(20.9%)
User I/O Waits(58.8%)
CPU(20.3%)

I decided to query the temp usage, as I knew the above query would need to sort the ib_id and would impact the performance of the parallel execution.  The first and last sql_id in the table are the queries in question with parallel, which the one owned by kpedersen is my run at it without parallel:

SID
OSUSER
PROCESS
SQL_ID
SEGTYPE
MB
TABLESPACE
 
571
usr1
26584
029y9k9qvgtp2
SORT
61.5
TEMP
 
772
usr1
26580
029y9k9qvgtp2
SORT
90
TEMP
     
673
usr1
26582
029y9k9qvgtp2
SORT
131.75
TEMP
     
626
usr1
26590
029y9k9qvgtp2
SORT
4370.25
TEMP
   
837
usr1
26592
029y9k9qvgtp2
SORT
41248.75
TEMP
Part 48
with parallel
767
usr1
6434
94bu54yp2ay2v
SORT
107.25
TEMP
     
703
usr1
6436
94bu54yp2ay2v
SORT
73.5
TEMP
     
568
usr1
6438
94bu54yp2ay2v
SORT
278.5
TEMP
     
635
usr1
6442
94bu54yp2ay2v
SORT
1248.25
TEMP
     
811
usr1
6440
94bu54yp2ay2v
SORT
3679.75
TEMP
Create table for Cognitive
data
655
Mine
5976:4628
arz8vhqzt80y3
SORT
2584.25
TEMP
Mine Without Parallel
   
600
usr1
26656
bazjcx55wky5u
SORT
2517.25
TEMP
     
827
usr1
26658
bazjcx55wky5u
SORT
29089.5
TEMP
Part 12
with parallel
 

Now after seeing the waits for temp, the long operations, (441 minutes left on the sort in temp…) I decide to say enough it enough and attempt the execution without the hints- just plain ol’ query of this huge table, nothing more, (see above in the table, showing my temp usage for the single process, as “Mine”.) Fifteen minutes, yes, that’s right, fifteen minutes and I had the first rows back on my screen:

SQL Text-
select t.* from tbl_summary t
WHERE (tbl_key=12 and usr_id=0) order by i_id;

DETAILS on Without Parallel:
Elapsed Time Breakdown
SQL Time (sec) 945.25
PL/SQL Time (sec) 0.00
Java Time (sec) 0.00

Activity By Time
Elapsed Time (sec) 945.25
CPU Time (sec) 370.67
Wait Time (sec) 574.58

Activity By Waits
Remaining Waits(1.1%)
User I/O Waits(59.7%)
CPU(39.2%)

Moral:
Carefully test any parallel processing when sorting is involved. You may not receive performance benefits as expected.

Print Friendly

facebook comments:

7 thoughts on “When Parallel is Wrong
  • Asif Momen says:

    Nice post.

    Even I have experienced performance degradation in past when PARALLEL and ORDER BY were used together.

  • iso775533eac says:

    看看blog調整心情,又要來繼續工作,大家加油.........................

  • Randolf says:

    Kellyn,

    just an idea: I think in another post you've mentioned that you might have some code that uses manual PGA settings (workarea_size_policy = manual).

    What I could imagine is that you encounter an odd combination of the SAS bug (the SORT_AREA_SIZE parameter won't be propagated to the parallel slaves, so they might be using 65535 bytes (!, yes 64K, the default if the value is not specified in the instance parameters) and a skewed spread of the work across the parallel slaves, eventually leading to the situation that you end up with a few parallel slaves with a SORT_AREA_SIZE of 64K doing all the sorts…

    Just an idea – I might be totally wrong.

    I would like to get a better understanding what oddities exactly you observe with your parallel execution, this sounds interesting…

    Randolf

  • Uwe Hesse says:

    I suppose you tried using a Temporary Tablespace Group, giving multiple Temp tablespaces to the parallel processes? They have been there in 2010 at least http://docs.oracle.com/cd/B19306_01/server.102/b14231/tspaces.htm#sthref1197

    Kind regards
    Uwe

  • dbakevlar says:

    Hi Uwe,
    This was during this time and temp tablespace groups, although they assist, especially when they reside on SSD with this issue, the problem really was in the type of sort requirements for this query that was hindered when the parallel slaves attempted to sort, then resort each slave into the final result. This process was rewritten/designed during the same time as we introduced the other temp tablespace groups on SSD and we simply didn’t have to perform this step anymore, preferring a better way of presenting the data, (in other words, we let SAS do the work on their side…. :)

  • Hi,

    Thanks for the post.

    Two questions though:

    1. If the sorting is done one parallel server at a time (serially?) why should the Temp usage increase? I understand if the performance drops (allocate/deallocate temp, loose time coordinating,…) but wouldn’t the temp space used by one parallel server be reused by the next?

    2. (a more silly question!): Would it help if one rewrites this [or perhaps the similar ones] query as a “select-order by” on an inline view which is running in parallel? Is it possible in a way that query is not re-written by CBO so that the inline view is run in full first?

    Thanks

  • dbakevlar says:

    If you test out certain queries, some will actually perform better as serial vs. parallel. It’s not that common, but it does happen. Now, the idea that “more parallel is better” is not true. There is commonly a “sweet spot” of four or eight DOP, (degree of parallel) that will be found as best performance before you “move” your bottleneck from the amount of processes to another aspect of the database, (concurrency, application, etc…)

    1. The writes to a single TEMP file with parallel on a single TEMP file in a non-ASM filegroup is extensively longer due to the way temp files are written to vs. standard datafiles. Now, if you are using, let’s say, DOP of 4 with a TEMP tablespace with four temp files on ASM, you will see improved performance as each file will be written to by each parallel slave. You can monitor this type of writes easily.

    2. Rewrites of a query to improve performance will, by far and in my opinion, give greater gains than just throwing hardware at anything. There are times, (due to vendor code that can’t be changed…) when a change to improve performance by tuning a process with parallel to utilize more hardware options are beneficial.

    Thanks,
    Kellyn

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

  • Facebook
  • Google+
  • LinkedIn
  • Twitter