When Parallel is Wrong
The Lead Developer and I have been working very, very hard on tuning the main process that produces much of the revenue for our company. The tuning exercise has been very successful but we’ve run into some challenging on the pre-processing that occurs for each week to build the mart that the newly tuned process runs on afterwards. Due to this, I’ve been granted some time to monitor the build process a little closer than previously and although I have a lot of tuning to do, some of the largest challenges comes in the form of queries that the analysts have to execute as part of the final checks before we toggle over to the newly created mart.
These statements all have the same where clause, two different tbl_keys, (12 or 48), pulling the entire table, (each table between 90-380GB) and then ordering the data by the i_id. They all have the same parallel hints, just different tables substituted into the statements. The reason for the sort is that the data is then pulled into an application to finish the final step. The application is incapable of dealing with this type of large sorts, so they decided to do it internal to the database. The final data is used to verify that the correct data is present in the mart of the members that were fed into the primary database that the mart is based off of for the timeline.
select /*+ full(t) parallel (t 8) */ t.*
from tbl_summary t WHERE (tbl_key=12 and usr_id=0)
order by i_id ;
The above statement is going to result in 16 parallel slaves, pulling the entire data set for user class=0 vs. a specific id, for the last 12 months of data, then sort it by the usr_id. Note the first hint- FULL. Realize, as this data is across multiple partitions, we will not see a “partition list single” in the explain plan, but “px iterator” for the database to correctly identify the individuals that meet the criteria for each partition. Once it completes this step, then the challenge comes in sorting all that data and the problem with the second hint- PARALLEL.
I love parallel. I am thrilled with the performance gains I can attain when hashing large tables when accessing multiple partitions. I do not love parallel when someone decides to put a group by or order by into the statement. If any sorting has to occur on the statement, I’ve been seeing some odd behavior in about 80% of the executions. It appears that when sorting, it’s not Oracle sorting the data all at once, but sorting one parallel slave’s data set at a time. This creates an escalation in temp space usage that can become insanely difficult for Oracle to process efficiently and performance degrades immensely.
This is what occurred with the statements for the analysts.
With Parallel:
select /*+ full(t) parallel (t 8) */ t.*
from tbl_summary t WHERE (tbl_key=12 and usr_id=0)
order by i_id ;
Elapsed Time Breakdown
SQL Time (sec) 47,951.17
PL/SQL Time (sec) 0.00
Java Time (sec) 0.00
Activity By Time
Elapsed Time (sec) 47,951.17
CPU Time (sec) 9,734.81
Wait Time (sec) 38,216.35
Activity By Waits
Remaining Waits(20.9%)
User I/O Waits(58.8%)
CPU(20.3%)
I decided to query the temp usage, as I knew the above query would need to sort the ib_id and would impact the performance of the parallel execution. The first and last sql_id in the table are the queries in question with parallel, which the one owned by kpedersen is my run at it without parallel:
SID
|
OSUSER
|
PROCESS
|
SQL_ID
|
SEGTYPE
|
MB
|
TABLESPACE
|
|||
571
|
usr1
|
26584
|
029y9k9qvgtp2
|
SORT
|
61.5
|
TEMP
|
|||
772
|
usr1
|
26580
|
029y9k9qvgtp2
|
SORT
|
90
|
TEMP
|
|||
673
|
usr1
|
26582
|
029y9k9qvgtp2
|
SORT
|
131.75
|
TEMP
|
|||
626
|
usr1
|
26590
|
029y9k9qvgtp2
|
SORT
|
4370.25
|
TEMP
|
|||
837
|
usr1
|
26592
|
029y9k9qvgtp2
|
SORT
|
41248.75
|
TEMP
|
Part 48
|
with parallel
|
|
767
|
usr1
|
6434
|
94bu54yp2ay2v
|
SORT
|
107.25
|
TEMP
|
|||
703
|
usr1
|
6436
|
94bu54yp2ay2v
|
SORT
|
73.5
|
TEMP
|
|||
568
|
usr1
|
6438
|
94bu54yp2ay2v
|
SORT
|
278.5
|
TEMP
|
|||
635
|
usr1
|
6442
|
94bu54yp2ay2v
|
SORT
|
1248.25
|
TEMP
|
|||
811
|
usr1
|
6440
|
94bu54yp2ay2v
|
SORT
|
3679.75
|
TEMP
|
Create table for Cognitive
data |
||
655
|
Mine
|
5976:4628
|
arz8vhqzt80y3
|
SORT
|
2584.25
|
TEMP
|
Mine Without Parallel
|
||
600
|
usr1
|
26656
|
bazjcx55wky5u
|
SORT
|
2517.25
|
TEMP
|
|||
827
|
usr1
|
26658
|
bazjcx55wky5u
|
SORT
|
29089.5
|
TEMP
|
Part 12
|
with parallel
|
Now after seeing the waits for temp, the long operations, (441 minutes left on the sort in temp…) I decide to say enough it enough and attempt the execution without the hints- just plain ol’ query of this huge table, nothing more, (see above in the table, showing my temp usage for the single process, as “Mine”.) Fifteen minutes, yes, that’s right, fifteen minutes and I had the first rows back on my screen:
SQL Text-
select t.* from tbl_summary t
WHERE (tbl_key=12 and usr_id=0) order by i_id;
DETAILS on Without Parallel:
Elapsed Time Breakdown
SQL Time (sec) 945.25
PL/SQL Time (sec) 0.00
Java Time (sec) 0.00
Activity By Time
Elapsed Time (sec) 945.25
CPU Time (sec) 370.67
Wait Time (sec) 574.58
Activity By Waits
Remaining Waits(1.1%)
User I/O Waits(59.7%)
CPU(39.2%)
Moral:
Carefully test any parallel processing when sorting is involved. You may not receive performance benefits as expected.
Nice post.
Even I have experienced performance degradation in past when PARALLEL and ORDER BY were used together.
看看blog調整心情,又要來繼續工作,大家加油.........................
Kellyn,
just an idea: I think in another post you've mentioned that you might have some code that uses manual PGA settings (workarea_size_policy = manual).
What I could imagine is that you encounter an odd combination of the SAS bug (the SORT_AREA_SIZE parameter won't be propagated to the parallel slaves, so they might be using 65535 bytes (!, yes 64K, the default if the value is not specified in the instance parameters) and a skewed spread of the work across the parallel slaves, eventually leading to the situation that you end up with a few parallel slaves with a SORT_AREA_SIZE of 64K doing all the sorts…
Just an idea – I might be totally wrong.
I would like to get a better understanding what oddities exactly you observe with your parallel execution, this sounds interesting…
Randolf
I suppose you tried using a Temporary Tablespace Group, giving multiple Temp tablespaces to the parallel processes? They have been there in 2010 at least http://docs.oracle.com/cd/B19306_01/server.102/b14231/tspaces.htm#sthref1197
Kind regards
Uwe
Hi,
Thanks for the post.
Two questions though:
1. If the sorting is done one parallel server at a time (serially?) why should the Temp usage increase? I understand if the performance drops (allocate/deallocate temp, loose time coordinating,…) but wouldn’t the temp space used by one parallel server be reused by the next?
2. (a more silly question!): Would it help if one rewrites this [or perhaps the similar ones] query as a “select-order by” on an inline view which is running in parallel? Is it possible in a way that query is not re-written by CBO so that the inline view is run in full first?
Thanks
Pingback: AWR Warehouse and SYSDBA - Oracle - Oracle - Toad World