Locating UNKNOWN SQL_ID Info in OEM Through AWR

Rarely are reports based off large snapshot variances helpful to a DBA unless you come across an odd situation such as this one…Better yet, we need to know a little bit about our AWR tables behind our reports so we can piece together what the reports leave out…:)

Scenario:  After-hours support has killed a session after high temp usage has occurred.  You, as the primary DBA, are left to look into the issue the next day.
Your first attempt to inspect the issue is through Enterprise Manager, (OEM) and you are surprised that very little activity is actually showing up that
is of concern.  You can see pink areas, something that would flag you, but they aren’t showing up in the active sessions in the left-  only showing in the right.
What’s going on here?

Now, taking the right side into consideration, you do see four sessions, easily assumed to be parallel slaves, processing away.
When drilling down into the left side SQL_ID sessions, you not that none of the top right side sessions are showing up for the SQL_ID’s in question.

When drilling down into the right side four sessions, SID’s, all of them show no SQL_ID’s connected and are listed as “UNKNOWN”.

How as a DBA, do you locate this data?  Your first quest will be to run an ASH report, but it will only show you the problem waits, type of waits, not the SQL that was the
cause of the issue.  How do you find the SQL/SQL_ID for the SID’s?

ASH Report for the snapshot in question:

  DB Id    DB Name      Inst Num Instance
----------- ------------ -------- ------------
 xxxxxxxxx DBMART               1 mart1
Specify the Report Type
~~~~~~~~~~~~~~~~~~~~~~~
Enter 'html' for an HTML report, or 'text' for plain text
Defaults to 'html'
Enter value for report_type: text
Type Specified:  text
Defaults to current database
Using database id: xxxxxxxxxxxxxxx
Defaults to current instance
Using instance number: 1
ASH Samples in this Workload Repository schema
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Oldest ASH sample available:  12-Sep-11 12:20:34   [  24306 mins in the past]
Latest ASH sample available:  29-Sep-11 09:26:08   [      0 mins in the past]
Specify the timeframe to generate the ASH report
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Enter begin time for report:
--    Valid input formats:
--      To specify absolute begin time:
--        [MM/DD[/YY]] HH24:MI[:SS]
--        Examples: 02/23/03 14:30:15
--                  02/23 14:30:15
--                  14:30:15
--                  14:30
--      To specify relative begin time: (start with '-' sign)
--        -[HH24:]MI
--        Examples: -1:15  (SYSDATE - 1 Hr 15 Mins)
--                  -25    (SYSDATE - 25 Mins)
Defaults to -15 mins
Enter value for begin_time: 09/27/11 16:00:00
Report begin time specified: 09/27/11 16:00:00
Enter duration in minutes starting from begin time:
Defaults to SYSDATE - begin_time
Press Enter to analyze till current time
Enter value for duration: 240
Enter a name for the report or a default will be used, (I prefer a naming convention of:
<report type, (ash, awr, addm)>_<sid>_<snapshot>.<txt/html>
View the ASH report for the time period, going to the top events first:
Event                          % Event  P1 Value, P2 Value, P3 Value % Activity
------------------------------ ------- ----------------------------- ----------
Parameter 1                Parameter 2                Parameter 3
-------------------------- -------------------------- --------------------------
PX Deq Credit: send blkd         58.37           "268501009","3","0"       1.97
sleeptime/senderid         passes                     qref
                                                 "268501010","3","0"       1.97
                                                 "268501010","4","0"       1.97

There’s our issue!

Now using this info, we know we are looking at a parallel, (PX) issue that we can then take to the next
step to identify the process and see if it is connected to one of our SID’s we noted in the OEM console:

   Sid, Serial# % Activity Event                             % Event
--------------- ---------- ------------------------------ ----------
User                 Program                          # Samples Active     XIDs
-------------------- ------------------------------ ------------------ --------
     2170, 3249      16.27 PX Deq Credit: send blkd            14.75
DM_USER             oracle@host1 (P020)             11K/14K [ 74%]        0
                           CPU + Wait for CPU                   1.46
                                                      1,059/14K [  7%]        0
     2138,  585      16.13 PX Deq Credit: send blkd            14.55
DM_USER             oracle@host1 (P022)             11K/14K [ 73%]        0
                           CPU + Wait for CPU                   1.53
                                                      1,107/14K [  8%]        0
     2130,  581      16.11 PX Deq Credit: send blkd            14.53
DM_USER             oracle@host1 (P023)             11K/14K [ 73%]        0
                           CPU + Wait for CPU                   1.54
                                                      1,113/14K [  8%]        0
     2129,  533      16.03 PX Deq Credit: send blkd            14.54
DM_USER             oracle@host1 (P021)             11K/14K [ 73%]        0
                           CPU + Wait for CPU                   1.41
                                                      1,024/14K [  7%]        0

Note that is also reports back the same four sids, all showing as “UNKNOWN” (2170, 2129,2130 and 2138) as we see in the OEM console..)
It just doesn’t have any SQL or SQL_ID’s to connect with them.

The coordinating SID is caught in the Top sessions running PQ’s, so we will see this as our identifier, even if it wasn’t seen in the OEM console:

Sid,Srl# (Inst) % Activity SQL ID        Event                          % Event
--------------- ---------- ------------- ----------------------------- --------
User                 Program
-------------------- ------------------------------
2113,UNKNOWN(1)      80.89               PX Deq Credit: send blkd         58.37

This is very important, as we need this to track further, outside of the report.

Using the coordinator, then query the DBA_HIST_ACTIVE_SESS_HISTORY from during the snapshots to capture the main SQL_ID:

select * from sys.DBA_HIST_ACTIVE_SESS_HISTORY 
where session_id=2113
and snap_id between 717 and 727;
SAMPLE_TIME SESSION_ID SESSION_SERIAL# USER_ID SQL_ID
27-SEP-11 10.33.39.946 PM 2113 143 255 5dpcst074nk81
27-SEP-11 10.33.49.976 PM 2113 143 255 5dpcst074nk81

You can then pull the data that is connected to the parent that will tell you what session experienced the problem:

select * from sys.WRH$_SQLTEXT 
where sql_id='5dpcst074nk81';
SNAP_ID DBID  SQL_ID  SQL_TEXT COMMAND_TYPE REF_COUNT
760 2778270765 5dpcst074nk81 <CLOB>   1            0

the clob then shows the SQL involved in the executed session, which then tells you what your pain is on:

create table dm_user.tbl_iot
(
  col1 ,
  col2 ,
  col3 ,
  col4....
constraint tbl_pk
primary key (col1))
organization index.... <--Here's our key, it's an IOT, which includes the PK!
tablespace mart_data1 pctfree 0 nologging parallel 4 as....

We can now find the secondary SQL_ID that correponds to most of the combined hit to the primary SQL that will show in the following statement:

select distinct SQL_ID, optimizer_cost from dba_hist_sqlstat 
where snap_id between 717 and 727
and plsexec_time_delta=0
order by optimizer_cost desc;
SQL_ID OPTIMIZER_COST
fjbbxnyc66t7c 256350
5dpcst074nk81 253906 <--Here was the one shown, it will be the one above!
1xg7tkc156a1m 5477
24hc2470c87up 5302
dyd4b36t1ppph 4865
gfjvxb25b773h 2331
3wy90uysqgcfx 1974
a55tay7577psc 1922
00a15fn17bx7p 1912
b92jqqxwt3tfd 1725
cf621qmts91wf 1028
5atpa8vj2gakz 1007
7qskskbx91q7t 992

Now run an awrsqrpt.sql to see the hit for the SQL_ID in question:

If you have never run one of these, it’s a great report for unique info on a SQL Report for a given run.  For this one, we just want to see WHAT was running:
You will need your snapshot ID’s and the SQL_ID to create the report and the report resides in $ORACLE_HOME/rdbms/admin.  Yes, I prefer the text version of the HTML, but to each their own… :)

 

WORKLOAD REPOSITORY SQL Report

Snapshot Period Summary
DB Name         DB Id    Instance     Inst Num Release     RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
DBMART         xxxxxxxxx mart1               1 10.2.0.4.0  NO  host1
              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:       709 27-Sep-11 14:30:31        57       6.3
  End Snap:       727 27-Sep-11 23:30:53        37       5.1
   Elapsed:              540.37 (mins)
   DB Time:            2,726.16 (mins)
SQL Summary                               DB/Inst: MARTF/martf  Snaps: 709-727
                Elapsed
   SQL Id      Time (ms)
------------- ----------
fjbbxnyc66t7c ##########
CREATE UNIQUE INDEX "DM_USER"."TBL_PK" on "DM_USER"."TBL_IOT"("IBID") INDEX
ONLY TOPLEVEL TABLESPACE "MART_DATA1" PCTFREE 0 NOLOGGING parallel 4 as select
col1 , col2 , col3 , col4 , col5 , col6
          -------------------------------------------------------------
SQL ID: fjbbxnyc66t7c                     DB/Inst: MARTF/martf  Snaps: 709-727
-> 1st Capture and Last Capture Snap IDs
   refer to Snapshot IDs witin the snapshot range
-> CREATE UNIQUE INDEX "DM_USER"."TBL_PK" on "DM_USER"."TBL_IOT"("COL...
    Plan Hash           Total Elapsed                 1st Capture   Last Capture
#   Value                    Time(ms)    Executions       Snap ID        Snap ID
--- ---------------- ---------------- ------------- ------------- --------------
1   3019078213            115,239,821             1           710            726
          -------------------------------------------------------------
Plan 1(PHV: 3019078213)
-----------------------
Plan Statistics                           DB/Inst: DBMART/mart1  Snaps: 709-727
-> % Total DB Time is the Elapsed Time of the SQL statement divided
   into the Total Database Time multiplied by 100
Stat Name                                Statement   Per Execution % Snap
---------------------------------------- ---------- -------------- -------
Elapsed Time (ms)                        ##########  115,239,820.7    70.5
CPU Time (ms)                            ##########   23,668,620.1    45.7
Executions                                        1            N/A     N/A
Buffer Gets                              ##########  940,627,725.0    70.9
Disk Reads                                7,761,573    7,761,573.0    19.4
Parse Calls                                       9            9.0     0.0
Rows                                              0            0.0     N/A
User I/O Wait Time (ms)                   1,949,762            N/A     N/A
Cluster Wait Time (ms)                            0            N/A     N/A
Application Wait Time (ms)                        0            N/A     N/A
Concurrency Wait Time (ms)                  301,390            N/A     N/A
Invalidations                                     0            N/A     N/A
Version Count                                    17            N/A     N/A
Sharable Mem(KB)                              9,285            N/A     N/A
          -------------------------------------------------------------
          -------------------------------------------------------------
Execution Plan
------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost  | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------
|   0 | CREATE INDEX STATEMENT   |          |       |       |   256K|       |       |        |      |            |
|   1 |  PX COORDINATOR          |          |       |       |       |       |       |        |      |            |
|   2 |   PX SEND QC (ORDER)     | :TQ10001 |   637M|   229G|       |       |       |  Q1,01 | P->S | QC (ORDER) |
|   3 |    INDEX BUILD UNIQUE    | PRIM_PK  |       |       |       |       |       |  Q1,01 | PCWP |            |
|   4 |     SORT CREATE INDEX    |          |   637M|   229G|       |       |       |  Q1,01 | PCWP |            |
|   5 |      PX RECEIVE          |          |   637M|   229G|   136K|       |       |  Q1,01 | PCWP |            |
|   6 |       PX SEND RANGE      | :TQ10000 |   637M|   229G|   136K|       |       |  Q1,00 | P->P | RANGE      |
|   7 |        PX BLOCK ITERATOR |          |   637M|   229G|   136K|     1 |     4 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS FULL| PRIM_TBL |   637M|   229G|   136K|     1 |     4 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------
Note
-----
   - cpu costing is off (consider enabling it)
   - estimated index size: 257G bytes
Full SQL Text
SQL ID       SQL Text
------------ -----------------------------------------------------------------
fjbbxnyc66t7 CREATE UNIQUE INDEX "DM_USER"."TBL_PK" on "DM_USER"."TBL_IOT"
             "("COL1") INDEX ONLY TOPLEVEL TABLESPACE "MART_DATA1" PCTFREE 0 NO
             LOGGING parallel 4 as select ...<want to see the full SQL statement, yes
             this report will show you the output...>

We now can see that our “offender” was the primary key creation on the IOT create statement.

Now you have the Information on the “Unknown” offender that was showing in OEM.  Nothing can hide from a DBA for long! :)

Oracle Open World 2011, Prep Time!

I’ve spent the week updating, packing and communicating in preparation for my trip to Oracle Open World 2011.  As with most folks that are attending, there is a lot to prepare for, but I have that added challenge of three kids who find this a fine opportunity to drive their father to distraction, as my ex-husband is responsible with managing their week, which he is unaccustomed to.   Although I’m thankful for Google calendar that will tell my ex where and when each child has to be at all times, what homework, social events, etc., it is not a lot of fun for him, as the schedule is commonly kept from my home and not his during the week so finding a way to keep it from becoming too chaotic for him may not be feasible…

My preparation has included a few last minute tasks for RMOUG, which turned out to be pretty painless, so that wasn’t impacting.  I think one of my biggest challenges was a final rush to try and get what identification I needed sync’d with the correct name, as we all know, the government agencies are far from efficient when someone changes their name, no matter if it’s marriage or in may case, a divorce, (ask me how my identity crisis is going, considering how often I need to use a different last name depending on what company/agency I’m speaking with at any given time…sigh…)

Upon checking my calendar, it appears that I have a dinner to attend every night that we are in San Francisco.  The company for each of these events, of course, is wonderful- some of the folks I know and many I only know virtually and look forward to meeting for the first time.  I’m looking forward to the Pythian Blogger’s meetup on Wednesday night.  It’s a great opportunity to meet up with ex-coworkers and meet fellow bloggers-  always a fun, enterprising group!

I’m looking forward to meeting a long list of folks, including, Dan Norris, Lisa Dobson, Eddie Awad, Yuri Velikanov, many more folks than I can possibly list…) and those that I haven’t seen in quite a while, Alex Gorbachev, Cary Milsap, Deb Lilley, Carol Dacko and again, more than I can list… :) )

There will be many that were unable to make it this year, including a last minute cancel due to family needs by Chet Justice.  He will be missed by many and want to pass a long a thank you for my daughter Cait and I for the picture on oraclenerd.com.  We love our matching “DBA Goth Princess” T-shirts.  When asking my 16yr old son to take our picture, his response was, “You know Mom, there is this great feature called a self-timer found on cameras…”  I had to respond with “You know, there is this great feature called, I keep feeding you…”  He took a great picture of Cait and I for Chet’s site… :)

Packing will continue tonight,  along with more planning in the calendars, an attempt to get my tasks for the latest Agile sprint at work done and a hope that what clothes I pack will work the slightly different weather in San Francisco.  Yes, I’m bringing my cowboy boots… :)

See you there!

 

SQL Server and Distributed Transaction Tuning

Users complained that a monthly financial report would no longer run SQLServer Reporting Services, (SSRS.)   Upon investigation, it was found that this was a stored procedure that ran from one Annex database, sourcing from another and outer joins to a SQLServer database on a remote server through a linked server configuration.  In attempts to run the report,  my SQL Server Profile traces on both the source SQL Server and the remote SQLServer resulted in consistent sp_reset_connection results from the source and no activity in the remote.

 

I ran my trusty and favorite script to tell me what processes were taking the most resources and quickly realized with statement was the cause:

SELECT  TOP 1000
[Object_Name] = object_name(st.objectid),
creation_time, last_execution_time, total_cpu_time = total_worker_time / 1000,
avg_cpu_time = (total_worker_time / execution_count) / 1000,
min_cpu_time = min_worker_time / 1000, max_cpu_time = max_worker_time / 1000,
last_cpu_time = last_worker_time / 1000, total_time_elapsed = total_elapsed_time / 1000 ,
avg_time_elapsed = (total_elapsed_time / execution_count) / 1000,
min_time_elapsed = min_elapsed_time / 1000, max_time_elapsed = max_elapsed_time / 1000,
avg_physical_reads = total_physical_reads / execution_count,
avg_logical_reads = total_logical_reads / execution_count,
execution_count,
SUBSTRING(st.text, (qs.statement_start_offset/2) + 1,
((CASE statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE qs.statement_end_offset
END
-- qs.statement_start_offset
) /2) + 1) as statement_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY
sys.dm_exec_sql_text(qs.sql_handle) st
WHERE Object_Name(st.objectid) IS NOT NULL
--AND DB_NAME(st.dbid) = '
ORDER BY   total_cpu_time DESC
db_name(st.dbid), total_worker_time / execution_count  DESC

 

Now the important data from here is that the report came up at the top promptly, as it was an challenge to be sure:

Object_Name execution_time total_cpu_time min_cpu_time max_cpu_time last_cpu_time avg_logical_reads
Rpt_ClientRevenueDownloadKa

33:15.1

1040436

507989

532447

532447

4701439

Note that the min and max CPU times are very consistent and the average logical reads are out of this world!  The other nice thing about this report, is that it will tell you exactly, the code that is causing the issue.  You have the choice then of copying the code out of the final column and running and explain plan to see what is the issue or executing a test to see the problem.  The code in question was a distributed query, so an explain plan often does not show what a test of the query would, so I ran the one statement, which did return to the remote server, but with frustrating results:

What you see in the diagram above is both a choice by the server to parallelize the query, which is flooding the server with both waits on parallel and network IO.   The query did end up returning, but only after approximately 10 minutes.

Something is coming across the network to run against the remote server and after verifying statistics, etc. on the remote, I returned to the source server and looked at the query.

The query required a two step temp table to be created before joining to the tables across the linked server.  The issue here, is that this data was flooding both the network as well as the remote database.  The following, without the temp table, completes very quickly:

SELECT
--ISNULL(atv.NAME, 'UNKWN') AS name,
ISNULL(parentCrmA.NAME, 'UNKWN') AS parent,
month(t.DT1) AS pmonth, YEAR(t.DT2) AS pyear,
RIGHT('0' + CONVERT(VARCHAR(3), MONTH(t.DT2)) + CONVERT(VARCHAR(5), YEAR(t.DT2)), 6) AS fdte,
t.DT2 AS pdate, t.P_AMT*(-1) AS amt,
'rev_amt' AS r_amt, 'adj_amt' AS a_amt,
a.GLA_NO AS GLNo, --atv.crmVert AS vertical,
crmA.OID_NAME AS oidname, s.S_NAME AS spc,
c.C_NAME AS Cname, crmA.NC_NAME AS c_cname,
crmA.I_CD AS i_cd
--INTO      #FINData ß2

nd

 step temp table commented out, along with its columns…
FROM        R_SERVER.FIN_app.dbo.tglTbl t
INNER JOIN  R_SERVER.FIN_app.dbo.tglAtbl a ON t.GLAKey = a.GLAKey
LEFT OUTER JOIN   R_SERVER.FIN_app.dbo.tglItbl i ON t.TNo = i.TNo
LEFT OUTER JOIN   R_SERVER.FIN_app.dbo.tglCtbl c ON i.CKey = c.CKey
--LEFT OUTER JOIN @AccountToVerticalMap atv ON c.CID = atv.cName COLLATE SQL_Latin1_General_CP1_CI_AS
LEFT OUTER JOIN   R_SERVER.FIN_app.dbo.tglStbl s ON i.PSKey = s.SPKey
LEFT OUTER JOIN   R_SERVER.FIN_app.dbo.tglARtbl ar ON t.ARKey = ar.ARKey
LEFT OUTER JOIN   S_SERVER_DB.dbo.Atbl crmA ON c.CID = crmA.Name COLLATE SQL_Latin1_General_CP1_CI_AS AND crmA.DeletionS_CD = 0 AND crmA.S_CD = 0
LEFT OUTER JOIN   S_SERVER_DB.dbo.Atbl parentCrmA ON cA.PA_ID= parentCA.A_IDAND parentCA.DS_CD = 0 AND parentCrmA.S_CD = 0
WHERE a.GLANo between 4000 and 4999

:03 seconds, in fact… J 

The IO issue is the code creates the two temp objects, then attempts to send this data across the network to the linked server and onto the remote database to process the request.  As this is a monthly report, you have to ask yourself, is a temp table really required??  Not in this instance and it is actually the bottleneck to performance.

 

ORIGINAL:

with cAName(MemName)
AS (SELECT      c.CID
FROM  R_SERVER.FIN_app.dbo.tglCtbl c
WHERE c.CID is not null
UNION ALL
SELECT      ar.ARCode
FROM  R_SERVER.FIN_app.dbo.tglARtbl ar
WHERE ar.ARCode is not null
)
INSERT      INTO @ActVertMap
SELECT      DISTINCT
a.Name AS Name,
a.New_Vertical AS Vert
FROM  S_SERVER_DB.dbo.Atbl a
LEFT OUTER JOIN cAName c ON a.Name = c.Name COLLATE Latin1_General_CI_AS
WHERE a.Name is not null
order by a.Name

 

SIMPLE  REWRITE:

drop table cacct_wrk
go
SELECT      c.CID  as "MemName" into cacct_wrk
FROM  R_SERVER.FIN_app.dbo.tglCtbl c
WHERE c.CID is not null
UNION ALL
SELECT      ar.ARCode
FROM  R_SERVER.FIN_app.dbo.tglARtbl ar
WHERE ar.ARCode is not null
go
drop table ActVertMap_wrk
go
SELECT         DISTINCT
a.Name AS mapName,
a.New_Vertical AS mapVert
into ActVertMap_wrk
FROM  S_SERVER_DB.dbo.Atbl a
LEFT OUTER JOIN cacct_tst c ON a.Name = c.MemName COLLATE Latin1_General_CI_AS
WHERE a.Name is not null
order by a.Name
go

 

FINAL REPORTING QUERY REWRITE:

SELECT
ISNULL(atv.NAME, 'UNKWN') AS name,
ISNULL(parentCrmA.NAME, 'UNKWN') AS parent,
month(t.DT1) AS pmonth, YEAR(t.DT2) AS pyear,
RIGHT('0' + CONVERT(VARCHAR(3), MONTH(t.DT2)) + CONVERT(VARCHAR(5), YEAR(t.PDate)), 6) AS fdte,
t.DT2 AS pdate, t.P_AMT*(-1) AS amt,
'r_amt' AS r_amt, 'a_amt' AS a_amt,
a.GL_NO AS GLANo, atv.M_VERT AS vertical,
cA.O_ID AS oidname, s.S_NAME AS spc,
c.C_NAME AS FIN_cname, cA.NC_NAME AS ccname,
cA.I_CD AS icode --INTO      #FINData
FROM        R_SERVER.FIN_app.dbo.tglTbl t
INNER JOIN  R_SERVER.FIN_app.dbo.tglAtbl a ON t.GLAKey = a.GLAKey
LEFT OUTER JOIN   R_SERVER.FIN_app.dbo.tglItbl i ON t.TNo = i.TNo
LEFT OUTER JOIN   R_SERVER.FIN_app.dbo.tglCtbl c ON i.CKey = c.CKey
LEFT OUTER JOIN   ActVertMap_wrk atv ON c.CID = atv.mName
COLLATE SQL_Latin1_General_CP1_CI_AS
LEFT OUTER JOIN   R_SERVER.FIN_app.dbo.tglStbl s ON i.PSKey = s.SPKey
LEFT OUTER JOIN   R_SERVER.FIN_app.dbo.tglARtbl ar ON t.ARKey = ar.ARKey
LEFT OUTER JOIN   S_SERVER_DB.dbo.Atbl cA ON c.CID = cA.Name COLLATE SQL_Latin1_General_CP1_CI_AS AND cA.DeletionS_CD = 0 AND cA.S_CD = 0
LEFT OUTER JOIN   S_SERVER_DB.dbo.Atbl parentCA ON crmA.P_ID= parentCA.A_ID AND parentCA.DeletionS_CD = 0 AND parentCA.S_CD = 0
WHERE a.GLANo between 4000 and 4999

And this returns in .03 seconds!! :D

 

Using the correct feature for the goal and the use of the code is important.  Where using a temporary object might be a great option for local processing, the limitations of the optimizer in any platform to utilize that temporary object must be taken into consideration.  This was a complex process involving three databases that with growth, resulted in poor performance.

Keep it Simple and Other Superman Challenges

“Oh what tangled web we weave when complexity is added we can’t see…”

How many times has complexity of design in an application, outside of the database, lead to the database blamed for slow performance?   This is where a manager thanks the technical Gods for a DBA with great Sherlock Holmes skills to track down and prove the database not only innocent, but figure out what the real problem is.  The database is guilty until proven innocent, we all know that, so when user’s come to the DBA and demands, “Why can’t I run this report?  I should be able to run a simple report, shouldn’t I?”  We have to take out the Holmes pipe and figure out what is going on.

My latest mystery landed me in a web, which would have been funny if not happening to me.  The exact demand from the users above was rightfully requested of the DBA group.  Being a good, responsible DBA,  we quickly took responsibility when we discovered a bug that was impacting stats collection with parallel degree set, provided a work around while the SR was being worked with MOS and proceeded forward to ensure the user received their report, ensuring business revenue.

As the trace is my first choice in this type of challenge, I attempted to trace the session numerous times, wanting to ensure I’d fixed the issue, along with a future goal of performance tuning, but found I was unable to pin it down and found that I was really seeing numerous sessions involved with this process, not just one to create the report.

The users, running the report to test my fix, experienced another failure with the following  error:  ” Failed to read the Report Stream from Report Server.”  Now this is a vague error and rarely one that *really* has to do with the Oracle database.  I could see the stats failure via initial data load logs but nothing really explained after removing this roadblock as to why the users were still experiencing failures.

As this report was kicked off through our web interface, supported by APEX, I went to the application support specialist for answers.   He appeared just as frustrated when he informed me that the report was actually called from APEX, but was phycially run from SSRS, (SQL Server Reporting Services.)  I promptly asked for the report and was dismayed to find out, depending on the parameters passed, it could contain anywhere from 24 to 50 database calls to complete the report.

What appeared to be one, simple, but slow report to the user, was actually a very large collection of queries, ran from a SQL Server Reporting Repository, submitted through an APEX web interface on a second server and then to the production database on a third.

Ahhh, but I’m not through yet.  I pulled the queries from the report, formatted them for a command line run and then passed in the parameters for the large feed that had been failing.  When I ran this from the command line, no failures occurred and I was able to trace everything for my future performance tuning.  I was surprised to see that there really wasn’t that much that was wrong with the SQL, only a few queries of the many that would require time and offer the biggest bang for the buck.   I then ran the report through the web interface hoping to find a way to trace it and my run through the web interface WAS SUCCESSFUL!  It also ran in 1/2 the time that it took for the version that reported in the logs that failed.   I already had the APEX and SSRS guys looking into anything on their side that might indicate a timeout or failure cause, but I still felt like we were missing something.

I approached the user and asked them where they were kicking the report off from and found out that they were executing it from a terminal server web session, not from their workstation’s web interface.  I quickly emailed the Windows Admin, who also administers the terminal servers and asked her if she was aware of any reporting session or terminal server timeout.  She  informed me that they disconnect any idle sessions after three hours.  As this session would show as idle, only submitting the report, but requiring a connection to be active for the remainder of the run, this was the actual issue.  Terminal server was dropping the session while it was attempting to complete a report that through it’s entire batch process across three servers, took almost five hours… :(

The users were then requested to run the report only from a workstation web interface.  The results were successful and the database was again innocent.

Moral of the story:  Complexity is the enemy of production processing.  As developers, application designers and DBA’s, it is often seen as proof to the world of our God-like creative powers.  This is where we also quickly move from being an asset to a liability to the business.  Do not create that which can not be easily supported and kept in the life-cycle of technology as technology grows and changes .  It is a fool-hardy endeavor that quickly costs the company money, time and nerves…  Want to be a superstar?  Stop being someone that to support his technical choices, requires a Superman effort…