But…. I Collected Stats!–Updated!!

*****After much adieu, I’ve finally returned to this post to rewrite what concerned me and befuddled others… :) *****

payday loans lenders online

I’ve worked with some great developers.  Developers who can write code in their sleep to perform the most complex functions and aggregations, (makes me quite envious, as I need to actually think how to write code… :) )  Most of these developers are still tripped up by the Oracle Cost Based Optimizer, (CBO) and honestly, I think they should be.  Just how I have to think about how to code, it should be expected that they need to think about how the CBO works.  It is one of those fundamental differences between DBA’s and Developer’s mindsets that make each group special.

*****This is where I’m going to rewrite this post.  Reason?  I rewrote the original query to disguise my company’s SQL as I’m always requested to do, but I disguised it so well that it didn’t have the same challenges as the original or address something that I really wanted to discuss.  I’m still going to change the column and table names, but I’m going to use the original query, as this is a topic I really want to discuss honestly, (forgive me boss for I have sinned…. J)  We worked with a number of different options with this query, with the sort on this wide table for a couple days.  It’s a new design that is part of a new project and it was time to really think outside the box, (which I will get into at the end of this post. *********

Our example for the evening is a query that is performing poorly.  The (REAL) query is a select on ONE table with parallel to gather a performance gain from temp tablespace groups and includes an order by:

select /*+ PARALLEL(t1, 4) */  t1.* from TBL1 t1 
order by t1.MAIN_ID;

(Now let’s admit to the real performance challenge here…) The query isn’t returning after a full day, you note that the long ops on the process states it has 4138 hrs left and climbing, but you notice, if you simply take out the order by, the query returns in a matter of minutes ALL OF THE DATA.  Why, when you put in the order by, it refuses to return anything?

As the sort is the pinpoint of the latency, a quick look at the most common work area for this large of a sort would be in order-  TEMP usage:

select vst.sql_text,  swa.sid, swa.tablespace

, swa.operation_type
, trunc(swa.work_area_size/1024/1024) “PGA MB”
, trunc(swa.max_mem_used/1024/1024)”Mem MB”
, trunc(swa.tempseg_size/1024/1024)”Temp MB”
from v$sql_workarea_active swa, v$session vs, v$sqltext vst
where swa.sid=vs.sid
and swa.sql_id=vs.sql_id
and vs.sql_id=vst.sql_id
and vst.piece=0;

 SQL_TEXT                                SID TABLESPACE   OPERATION PGA MB Mem MB Temp MB

select /*+ PARALLEL(t1, 4) */    234 TEMP_A              SORT (v2)      47           1024         96179
select /*+ PARALLEL(t1, 4) */    377 TEMP_B              SORT (v2)      58           1024         116256
select /*+ PARALLEL(t1, 4) */    260 TEMP_C              SORT (v2)      76           1024         146017
select /*+ PARALLEL(t1, 4) */    304 TEMP_D              SORT (v2)      47           1024         96269

Note the difference in the parallel slave use of temp.  This is a clear indicator that Oracle is doing what it can with what it knows about these objects, but something is amiss.

What are we working with here?
select segment_name, bytes/1024/1024 MB from dba_segments
where segment_name=('TBL1')
group by segment_name;
 
TBL1= 243GB
 
Rest of info:
TBL1=558 million rows, 238 columns, last analyzed, 2 hrs ago.
 
Now that we know what we are dealing with-  Why are the sorts out of whack on the tables?  The tables have been analyzed. 
 
Oracle can only sort data if it knows what it is dealing with.  What is required to sort data correctly?  Accurate information and as a sort is at a column level, it must be at the COLUMN level.  The question the DBA should ask themselves is not "Have statistics been gathered?" but if "Have statistics been gathered correctly for how the data is being utilized?"
 
For the DBA, this is where the DBA_TAB_COLUMNS view can come in very handy.  It can quickly show you if the info that is required for the where clause is present for the CBO.
 
select table_name, column_name, num_distinct, low_value, high_value, num_nulls,
last_analyzed, global_stats, user_stats, histogram from dba_tab_columns
where table_name in ('TBL1','TBL2)
and column_name in ('MAIN_ID');

Yeah, I think the database is trying to tell us something… :)
TBL1 MAIN_ID NO NO NONE

How can the database know how to best sort the data if it doesn’t know anything about the columns it is supposed to join or sort on??

Table statistics were collected, but no column information was collected.  The most likely instigator would be method_opt=’FOR ALL INDEXED COLUMNS’ and due to these tables not having any indexes, no column data was collected.  Recollect the statistics with method_opt=’FOR ALL COLUMNS SIZE 1′ .

exec dbms_stats.gather_table_stats( ownname=>'OWNER1',tabname=>'TBL1', ESTIMATE_PERCENT=>.00001, METHOD_OPT=> 'FOR ALL COLUMNS SIZE 1',CASCADE=TRUE,DEGREE=>8);
Now, the CBO will know a bit more about what it's working with!
select table_name, column_name, num_distinct, low_value, high_value, num_nulls,
last_analyzed, global_stats, user_stats, histogram from dba_tab_columns
where table_name in ('TBL1')
and column_name in ('MAIN_ID');
TBL1 MAIN_ID 586368000 C40D1A2050 C50A4736594A 0 10/25/2010 8:41:53 PM YES NO NONE
 
Returning to run the original query, now that the table has column stats: 
select /*+ PARALLEL(t1, 4) */ t1.* from TBL1 t1
order by t1.MAIN_ID;

The final performance and usage?  The example above was able to write the data to file in 5hrs and the TEMP usage was in line with the amount of data that was being sorted.  There is still a delay in pulling the data to the application layer, but that is a fight for another day.  The final answer for this query was simply the statistics, but the challenge for the people involved still revolves around the wide table and how to ease some of the strain on the database due to the sort:

SID TEMP      OPERATION PGA MB Mem MB Temp MB

307 TEMP_C SORT (v2)        1029           1024         1739

247 TEMP_A SORT (v2)        1024           1024        1755
286 TEMP_D SORT (v2)        1022           1024         1783
354 TEMP_B SORT (v2)        1024           1024         1755

****Coskan asked why I threw in the index and forced the query in the original version with a hint.  This was an attempt to “dimensionalize” the wide table.  Oracle will not utilize an index like a dimension table would be utilized in a star schema and perform a hash join unless you use a hint.  The test was how long would it take and how efficient would it be to sort on the indexed column vs. the table?

There were multiple tests, the first attempting to use the index, but due to the sort, it refused the parallelism which was required to get through the 558 million rows.  The second choice was to perform the following, which I will admit-  worked great for Oracle to write out to a file, did not work well for the application to utilize and perform the same! L

CREATE table MID_DATA tablespace USR_DATA1 as select MAIN_ID from TBL1;

This took only 42 minutes to create.  This was then used to join on the MAIN_ID and the order by.

SELECT TBL1.<each column but the MAIN_ID>, MD.MAIN_ID
from TBL1, MID_DATA MD
where TBL1.MAIN_ID=MD.MAIN_ID
order by MD.MAIN_ID;

I was able to query AND extract the data to an output file on a remote backup server to simulate what the application would do in 5 hrs.  I thought I had a winner here, but unfortunately, the application still sorted the entire table and temp usage went through the roof, (we run into that challenge often where a solution in Oracle does not solve the problem at the application SQL level due to incompatibilities in how the application presents the SQL, may rewrite the query, etc.)

There is a reason we have star schemas and when Oracle is unable to profit from one due to one reason or another, it can be an incredible battle for the DBA and developers who work in these environments.

Follow up: When an RMan Clone Goes Bad

A follow up to this previous post.  I thought I had this one down-  had followed the standard protocol when you receive a 600 error.  I looked it up on MOS and had my bug number, documentation and proof of what the bug was, right?

Gotta love being a DBA-  the day you think you’ve figured out, something comes up to surprise you and you learn something new. 

ORA-00600: internal error code, arguments: [kcvhvdf_1], [], [], [], [], [], [], []

Odd version of bug 9314439: RECOVERY FAILS ON CLONED DATABASE ORA-600 [KCVHVDF_1] which is an 11.1.0.7 bug, but I was receiving in a 10.2.0.4.3 database clone.

 
Nope, nice try, Kellyn, (oh-oh, she’s started to refer to herself in the third person, it can’t be good!)   This is where the DBA Gods come down and say, “You ain’t that hot and here’s a reminder…”  :)
 
This 600 error had an RMAN error that went with it:  RMAN-06136: ORACLE error from auxiliary database: ORA-03113: end-of-file on communication channel
  
When the error showed up the next time, it seemed to confirm the bug.  Then the third time it appeared, but with numerous other errors in the duplicate.  This sent up red flags for me along with the other DBA I’ve worked with for years.  As he started to look into the feasibility of saving the duplicate process, I started searching through the miscellaneous errors. 
  
RMAN-03009: failure of sql command on clone_default channel at …
RMAN-06136: ORACLE error from auxiliary database
ORA-19563: header validation failed for file
 “Media Recovery Start
 
WARNING! Recovering data file 1 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.”  Ohhh, what fun! :)

 
The clone had failed on multiple steps-

  •  Recovery of multiple datafiles.
  • The switch on the datafiles once recovered.
  • Creation of the temp tablespace. 

 As we’d been experiencing some odd sqlnet disconnect errors on another database server, we decided to go back to the original 3113 error.  There weren’t any 3113/3114/3135 errors, but the miscellaneous errors did make sense if the disconnects had occured at the OS level instead of the database session level.
 
Using the times of the failures in the duplicate log, I went to the sqlnet.log for the target database.  For each and every error in the duplicate log, there was a corresponding error in the sqlnet.log for the same time:
Time:
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT= ))

 Ahhh…the true source of the problem reveals itself…  The duplicate was at a state where it wasn’t likely to be successfully recovered, so a restart from scratch would be required.  I issued a drop database, which only resulted in dropping the control files, redo logs and system tablespace’s datafile-  all the rest of the datafiles had to be removed at the OS level. 
I then tweaked the sqlnet.ora and re-ran the duplicate, extending the expire_time to ensure we could stay connected.   Suddenly, no bug and the duplicate completed successfully…
 
Now, the final question is-  All the people that have been experiencing the bug 9314439, are they really experiencing a bug or is the 600 error really a 12170 disconnect error due to sqlnet/network issues?

The Superman Conundrum

I used to collect comic books when I was a kid, so superheroes are close and dear to my heart, but just not as crazy about the expectation that technical support experts be held to the expecation of infallible like Superman.
I’ve both experienced and have discussed with other technical folks regarding this situation that, as of late and in all common sense, just doesn’t make any sense in the technical world.

I’ve nicknamed this technical environment situation the “Superman Conundrum”.   In my history I’ve  *survived* three companies who’ve treated their technical environment in this manner.  This is the kind of company where hardware, software, applications and processes fail on a consistent basis with the expectation that DBA’s and other support staff will come in and save the day/night/weekends….you get the picture…

The reason for the environment being in such poor condition is commonly due to the following:

  • Lack of consistent, demanded maintenance schedule.
  • Inability or refusal to replace hardware/software/code before it becomes an impacting problem.
  • Inefficient resource head count to address problems pro-actively, placing the staff in a constant fire-fighting mode as neglected systems start to unravel.
  • Poor planning of projects, (weak development/testing/production implementation phase.)
  • An expectation when issues arise, (and often this is not in writing or even spoken, but when fires erupt and they say jump…) that someone will come in and do whatever necessary to save the day, (often employing duct tape and baling wire to get them through the demand to get the system back up and running after the impacting issue…)
  • Lack of understanding from business management on what it takes to build a strong foundation to support the technical environment for the long run.

For a DBA in this type of environment, it can be both invigorating and terrifying-  kind of like being on the high wire without a net.  It’s thrilling to come in and save the day, to be able to bring back a system or save a process that everyone thought was doomed.  Users and Managers think you walk on water, so it is something akin to an adrenaline high.  If it starts to become a repeat performance…often, you then begin to realize, just how foolish and dangerous it is to depend on any human being to be this infallible just because something in the environment should have been automated/built with redundancy/mirrored/designed without such flaws and that’s when you recognize that missing net below you….  Management and peers outside the demands to be Supermen do not see the harm of the missing net, you are Supermen, what’s a little fall?  You are infallible, right?

Many of these environments also start to play what I like to call the “blame game”.  It first appears that the people involved are attempting to help ensure that processes are put in place to keep the problem from repeating itself.  The slight difference is that it results in simply pointing fingers at the “Supermen” they once praised, instead telling them what they did wrong, what they should have done, (in hindsight of course) differently to keep the problem from occurring, (and of course, regularly demands more checks, double checks, etc. to compensate for what flaws are in the environment/software/processes, etc assigned to, you guessed it- Supermen…) or just outright complain without even any suggestions.  When it comes to anyone who actually takes the time to suggest an option to correcting the problem at its source, (i.e. fix the code, replace the insufficient hardware, upgrade, tune, etc…) this is quickly and quietly turned down or refused due to time/resource/budget constraints.

There is a backlash condition post the blame game.  Little by little, the environment, if on it’s destined course, becomes more and more overwhelming for the “supermen”.  This is an exhausting situation, as personal life is intruded upon and professionally it’s not so great to constantly be in fire-fighting mode.  In turn, their morale is whittled away at by the blame game until the “supermen” start playing the “why even try anymore game”.  You begin to see the looks of hopelessness on the supermen faces.  It’s there when they walk in the office, it’s there with the first sarcastic joke of the day and it’s there when you receive email at 2am in the morning in regards to the latest fire they had to use their “super breath” on to put out.

Soon, the turnover starts.  The supermen know it’s coming, some may warn you, most won’t say a word and will just quietly leave.  The one thing you should know is that the technical world is a small place and even if they don’t let you in on the chaos that exists in your environment, that doesn’t mean they won’t share this information with peers in their field when asked.  This can be detrimental to companies when they are looking for new “supermen” and can’t figure out why they can’t get any solid resumes.  I have a “list” of companies that I refuse to interview with.  Trusted sources have listed them as poor work environments and I have surprised more than one technical recruiter when I’ve been aware of an odd quirk or two in a company that they later on verified.  Being a DBA is tough enough, so if we can avoid a poorly managed environment, we’ll commonly do so! 

So as a manager, what can you do to try to quell the “Superman Conundrum”? 

  • Challenge the culture, this is a constant task that must be repeated.  When the business/management expects technical personnel to make “supermen leaps”, demand that the requester find a solid, long term solution by the next request or that it will be turned down.
  • Be a buffer between your “Supermen” and the business.  Do not let them be placed in a “blame game” situation or take heat for what logically is a failure in design/hardware/process.  Work alone with the superman held accountable to find out what they think needs to change to deter the breakdown and push for a solid resolution.
  • Demand full project plans and full project cycles that include padded development, test and user testing schedules.  Use the “Pay Forward” model to show value in time invested before technical projects go to production vs. the cost post going production for outages and breakdowns.
  • Design, build and code for the LEAST OPTIMAL scenario.  Anything built for only the most optimal scenario should be expected to experience outages and endure high support cost in both man hours and upgrades.
  • Always remember “You get what you pay for” and “Nothing is free”.  In this technical onslaught of open source and free tools, applications, etc. there needs to be an honest look at what you are really getting for the money.  If you require an enterprise environment and believe you can get by on the freebie version, you are likely lying to yourself and making unreasonable demands of your people.  You should not go on the sales pitch-  try the product out, it might live up to your needs and then again, it might not, so don’t go production with it the first week!

If you’re a DBA in this type of environment, one who, like Superman, thinks you are not just going to save the day, but going to stabilize and productionize it, I can tell you from going through this myself, it’s not a whole lot of fun.  You often feel like you’ve taken two steps forward only to be thrusted 2 1/2 steps back.  You may implement automation, request meetings, document processes and standards.  The problem always lies in that this is not where management’s focus is.  They are sure, that if you were just more perfect, more like Superman, then the environment would run just peachy.  It will always come down to either you or the database that is at fault and you must either change that culture of thinking or hope you have a manager around who can help fight this battle for you so you can simply do your job.

It takes a lot of power to take on this kind of culture fight, but heck, you’re Superman, right? :)

The Rants of a SQL Server DBA

Part of this is going to be the SQL Server DBA in me ranting, so be patient and know as it says in the lovely site disclaimer, this is MY OPIONION and yes, I stick by this opinion 100%.

I have very strong feelings on who and how a SQL Server, especially one with SSRS is installed.  Microsoft’s continued mis-marketing, (albeit successful for sales) and ease of installation has created database environments that are misconfigured, poor-performing and poorly designed-  BY DEFAULT. 
Few experts in .Net or SSRS, when asked, know what transaction logs, filegroups, lock escalation or sp/dbcc procedures are.  If you don’t know what each and every one of these are for, you shouldn’t be installing SQL Server.  If you don’t know why the tempdb and transaction logs should be on separate spindles or why it’s not a good idea to have one, monstrous, huge drive, (see the first part of this sentence for a clue…) then you shouldn’t be installing SQL Server.

What are the rules of thumb a DBA follows when we are installing?

  • Install the SQL Server as a dedicated SQL Server admin account that has the least amount of privileges required to perform all database tasks, (If I see one more database owned by “domain\standard user of database” I’m going to break something…:))
  • If this is a production database, high use, I’m going to want multiple filegroups, (Oracle DBA’s can think tablespaces at this point… :) )  I am good with the primary filegroup for the standard tables, but I want another for indexes, a third, fourth, fifth, etc. for high use objects.
  • I do want multiple drives and I do not want to place my binaries in the default location on the C:\ drive.  The last thing my NT Admin needs is me filling up the drive that the OS resides on.  Give me a designated SQL Server binaries drive with plenty of room for upgrades, it’s only going to get larger.
  • I want drives for my data, my indexes, tempdb, transaction logs and backups.  Do not skimp on space and I am more than happy to tell you the ratio of sizes needed for the backups per retention period. 
  • Don’t give me RAID5 for my databases, you’ll only tick me off when people start complaining about the I/O issues.  I know you get less MB in the end, but trust me, RAID0+1/RAID10 is worth the cost.
  • Don’t give admin access to every Tom, Dick and Harry to the box.  You don’t do it to your Oracle servers, why would you do it to these database servers? 
  • Last, but not least-  treat them with the respect any database server deserves.  Back them up, move the files off to tape, secure the systems, leave the database admin work to a DBA.

So, what started the rant?  One of my poor NT Admin’s went through a challenging process that renamed a service account.  After he corrected all this procedure broke, we arrived a couple days later to find that no one could run any of the reports through the web from this main reporting server.

It turns out, when this SQL Server with SSRS was installed and configured for our company by a third party vendor it was performed with the service account that the NT Admin was forced to rename. Due to this, it was the “db owner” of the Reporting Server, including all the encryption internally and cached authentication had “run out”.

I found the error immediately, as there were also SQL Server Agent jobs attempting to authenticate with the service account, so the following message was reported in SQL Server’s error logs:
Message
[298] SQLServer Error: 15404, Could not obtain information about Windows NT group/user ‘DOMAIN\SERVICE_ACCT’, error code 0xffff0002. [SQLSTATE 42000] (ConnIsLoginSysAdmin)
It wasn’t as simple as renaming the database owner for the ReportServer database, as there are encryption keys that are created at the time of installation.  I will say, they have simplified the process for 2008 vs. earlier versions though! :)

• Change the ownership of the ReportServer and ReportServerTempdb databases to the DOMAIN\NEW_SRVC_ACCT database, (as it should have originally been performed as.) using the stored proc sp_changedbowner.
use
 
go
exec sp_changedbowner ‘DOMAIN\NEW_SRVC_ACCT’
go
• Start the Reporting Services Configuration Manager and connect to the database repository for the SSRS.

  1. Click on Service Account and change from “Network Service” to “Use Another Account”.  Type in the username and password of the new DB Owner you specified in the sp_changedbowner step.
  2. Click Apply
  3. Click on “Encyrption Keys” on the left and choose to backup, (always, always backup after every change.)
  4. Choose a secure local location, (and also make a copy of these on your backup server each time..) and password protect it.
  5. Click Apply.
  6. You have now officially reset everything in the Network service from the old domain user account to the new one.  You now have to reset all back to the Network service.
  7. Click again on Service Account and Change the “Use Another Account” back to “Network Service”
  8. Click Apply
  9. Click on “Encryption Keys” and make another backup, saving to a new file name each time you perform this step, (date and time in file name helps…)  Save off a copy of the final encryption keys to the backup server is essential if you wish to recovery the SSRS some day, so DON’T FORGET!!
  10. Change the ownership of any other databases with a simple execution of the sp_changedbowner as seen above, (only SSRS requires the other steps…) and change any agent jobs or other services that are running as the now missing service account. 
  • Test reports and you should be good to go.

Many issues, like the one above, can be avoided if a DBA works with the NT Admin to ensure the server is build correctly to support SQL Server and if a DBA performs the installation and configuration of the database server.

**Added 10/07/10 after an email or two with a couple other SQL Server DBA’s-
IF you don’t know any of the topics that I listed for requirements to install SQL Server, it should also be a requirement to know these before you are aliased as DBO or given admin privileges on a SQL Server box-  nuff’ said!!
~done rant!~ :)

The Seasons of a DBA

As busy as I am these days, I am seeing a light at the end of the tunnel, (no, it’s not a train!)  We have a new DBA training that is doing bang up job and I do believe there is some lull in the demands of our busy season. As a database administrator, I’m happiest when I have a number of demanding tasks, along with mysteries to challenge me that often, the business isn’t even aware of the level of importance it is to having resolved until I’ve implemented the resolution and they have reaped the benefits. 

I have two solid managers that allow me a wide berth to allocate many of my own tasks, along with the ones that they designated high priority.  I have quite a wide bandwidth and my ability to multi-task at a dizzying degree allows me to focus on more tasks than most people are comfortable with.  I’m a very low-maintenance employee, so it is rare for them to have to come smack me with something hard for not working on what I should be, (although when I’m telling one of them what I think he needs to hear than what I think he wants to hear, I’m sure he would like to hit me up-side the head just for the fun of it…:))

One of these managers is gracious enough to call me his “pinch hitter in the ninth inning of the world series”.  As for many business’ that world series is now-  we are exceptionally busy, have been building to this time in the year when demands run high and systems have to perform top-notch.  This requires all of my energy, time (and often brain cells) to ensuring I’m in top-notch, gung-ho, fire-fighting mode.  The demand to place my project DBA skills on the back burner during these crucial service times where quick-thinking, ability to assess situations at blinding speed and low impact, high resolution skills are essential.

As rewarding as this is, (and shows you just what you are made of!), I’m looking forward to the opportunities ahead to bring my project skills back to a fore-front.  A quality technical project, well done is something to take great pride in.  Once the season slows a bit, I get to work with new technology, test and implement new projects that will offer our business more opportunities to grow that we identified during the last number of months.  To return to new feature and design projects that are put on hold to ensure that priorities are met for the business at it’s high time, is a nice change of pace and widens the area of interest for any technical resource.

I am about to take on a number of projects involving 11g features with ASM, OEMGC, data guard, Apex, partitioning and parallel execution.  I have another set of project work that involves SQL Server reporting services new features, including work that bridges to our Oracle environment and work more with our MySQL environments. This is when a DBA gets to stretch their technical legs and learn new features and gain knowledge that is put on hold while supporting the current production environment. 

While working on these projects, research will need to be performed to ensure we are making the best decisions every step of the way.  This then allows another aspect of education that we rarely get when we are in fire-fighting mode-  learning in a slower, more controlled environment vs. hurried and very specific knowledge to solve a specific problem that is creating the “fire”.  Each of these decisions will be carefully discussed with the DBA team and documented.  No hurried trouble-shooting document added to handbooks or emails sent with “up to the minute required” information.  These are quality documents that contain all of the data that is needed to support the project work being performed, including the goal, scope and requirements for the project, (and should for any company’s project!), not just the technical details behind the project.

As the adrenaline rush ceases and demands/hours come back more in line with where we all like to be,  there is a sense of satisfaction knowing that the business needs were satisfied and that we were able to deliver what people needed to be successful in their own positions to offer the company more success in turn.

As a DBA, you can’t ask for a whole lot more… :)