“Oh what tangled web we weave when complexity is added we can’t see…”
How many times has complexity of design in an application, outside of the database, lead to the database blamed for slow performance? This is where a manager thanks the technical Gods for a DBA with great Sherlock Holmes skills to track down and prove the database not only innocent, but figure out what the real problem is. The database is guilty until proven innocent, we all know that, so when user’s come to the DBA and demands, “Why can’t I run this report? I should be able to run a simple report, shouldn’t I?” We have to take out the Holmes pipe and figure out what is going on.
My latest mystery landed me in a web, which would have been funny if not happening to me. The exact demand from the users above was rightfully requested of the DBA group. Being a good, responsible DBA, we quickly took responsibility when we discovered a bug that was impacting stats collection with parallel degree set, provided a work around while the SR was being worked with MOS and proceeded forward to ensure the user received their report, ensuring business revenue.
As the trace is my first choice in this type of challenge, I attempted to trace the session numerous times, wanting to ensure I’d fixed the issue, along with a future goal of performance tuning, but found I was unable to pin it down and found that I was really seeing numerous sessions involved with this process, not just one to create the report.
The users, running the report to test my fix, experienced another failure with the following error: ” Failed to read the Report Stream from Report Server.” Now this is a vague error and rarely one that *really* has to do with the Oracle database. I could see the stats failure via initial data load logs but nothing really explained after removing this roadblock as to why the users were still experiencing failures.
As this report was kicked off through our web interface, supported by APEX, I went to the application support specialist for answers. He appeared just as frustrated when he informed me that the report was actually called from APEX, but was phycially run from SSRS, (SQL Server Reporting Services.) I promptly asked for the report and was dismayed to find out, depending on the parameters passed, it could contain anywhere from 24 to 50 database calls to complete the report.
What appeared to be one, simple, but slow report to the user, was actually a very large collection of queries, ran from a SQL Server Reporting Repository, submitted through an APEX web interface on a second server and then to the production database on a third.
Ahhh, but I’m not through yet. I pulled the queries from the report, formatted them for a command line run and then passed in the parameters for the large feed that had been failing. When I ran this from the command line, no failures occurred and I was able to trace everything for my future performance tuning. I was surprised to see that there really wasn’t that much that was wrong with the SQL, only a few queries of the many that would require time and offer the biggest bang for the buck. I then ran the report through the web interface hoping to find a way to trace it and my run through the web interface WAS SUCCESSFUL! It also ran in 1/2 the time that it took for the version that reported in the logs that failed. I already had the APEX and SSRS guys looking into anything on their side that might indicate a timeout or failure cause, but I still felt like we were missing something.
I approached the user and asked them where they were kicking the report off from and found out that they were executing it from a terminal server web session, not from their workstation’s web interface. I quickly emailed the Windows Admin, who also administers the terminal servers and asked her if she was aware of any reporting session or terminal server timeout. She informed me that they disconnect any idle sessions after three hours. As this session would show as idle, only submitting the report, but requiring a connection to be active for the remainder of the run, this was the actual issue. Terminal server was dropping the session while it was attempting to complete a report that through it’s entire batch process across three servers, took almost five hours… 🙁
The users were then requested to run the report only from a workstation web interface. The results were successful and the database was again innocent.
Moral of the story: Complexity is the enemy of production processing. As developers, application designers and DBA’s, it is often seen as proof to the world of our God-like creative powers. This is where we also quickly move from being an asset to a liability to the business. Do not create that which can not be easily supported and kept in the life-cycle of technology as technology grows and changes . It is a fool-hardy endeavor that quickly costs the company money, time and nerves… Want to be a superstar? Stop being someone that to support his technical choices, requires a Superman effort…