The Good, the Bad and The Ugly

Two weeks ago, after a new mart had performed excellent for 24hrs, a sudden and detrimental decline in performance was seen. I’d been trying to track this issue for almost two months, but had only seen it in two or three processes total and never to this level. I had some history from previous DBA’s, but they didn’t know what was causing it and when a mart has a limited lifespan before you are dropping and recreating it from a 6TB warehouse- any DBA knows, this is not going to be easy.

payday loans lenders online

The main goal of the mart is to support a specific set of repeat processes to complete a function of the business. One of two larger processes of this main function was clearly shown as the culprit. Upon an attempt to see if improvement could be gained easily by utilizing an OEM SQL Profile, I received the error:

“The optimizer could not merge the view at line ID 52 of the execution plan.”
In my experience, this error is an indicator that a view, (either created by the developer or Oracle itself as it hashes, sorts, etc. so verify first) is causing issues for Oracle’s Cost Based Optimizer and the DBA involved should note this to the developer’s/IT Management that this view should be known as “volatile” to the environment. This means that it can cause performance issues due to it’s complexity and how it impacts the Oracle CBO.
Another reason the SQL Profile was not a good choice? The CBO recognized the WORST plan as the best choice when comparing the results of what it felt was the “better execution plan”.

How to track down the poor performing plan vs. the correct hashing plan? I’ve found a certain indicator in large warehouse environments- nested loops with rowid range scans as a solid indicator that Oracle has made a very poor choice.

Many different scenarios can cause this to exist, including a combination of Oracle parameters and statistics. I’m a new DBA and due to the number of “experts” that set up these databases and where their specialties lie, the Cost based optimizer and Oracle database parameters seem to be uncharted territories, so a little trust is still being gained in this arena… :)

For the problem at hand though, querying the database will show extensive time remaining, (20-200hrs) for “ROWID RANGE SCANS” per parallel slave session in long operations:

select opname, totalwork, time_remaining/60/60, sid from v$session_longops
where time_remaining>0 order by time_remaining desc;

Querying temp space usage will show large sorts per parallel slave sorting data that will not coorelate to the data set size, (30-90GB), where the data set that is being sorted should only be around 2-4GB:

select vs.sid, vs.osuser, vs. process, vs.sql_id, vtu.segtype, ((vtu.blocks*8)/1024)MB, vtu.tablespace from v$tempseg_usage vtu, v$session vs
where vtu.session_num=vs.serial# order by blocks desc;

Through the six simultaneous processes, I was able to trace three different execution plans for the same statement. The original(the good) from when performance was solid, the second, (the bad) which was sub-optimal, but would still finish with just a little extra temp used and then the third, what I called “the ugly” as these had estimated times of completion between 50-200 hours.
The difference in “The Ugly” plan was that the CBO, based off the hashing through the view, had deemed the statistics stale on all the objects involved, treating them, almost like a cluster table and updating the statistics through dynamic sampling. The statistics change cascaded through the system, affecting each of the jobs kicked off post the change. This was easy for me to see as a DBA, but I was contacted with a simple, panicked complaint of “the database has stopped processing anything!” from the users. This is always something to keep in mind as a DBA or developer- all it takes is ONE table to throw an execution plan off, doesn’t matter how well you collected statistics on the others.

Note: These views are very involved, so the explain plans are quite complex and incredibly long- and obviously I need to work on how best to present plans like these to a blog. Due to this if you are interested, I will be happy to email the full plans to anyone interested.

Statement Impacted:
CREATE TABLE dly_temp1_XXX (IBHID,FO_DAY_ID,LO_DAY_ID,FONO_DAY_ID,LONO_DAY_ID,FOFO_DAY_ID,LOFO_DAY_ID,FORO “tons of case statements and such”
STG1_OUTPUT_VW_NL_R48;
<–VIEW with more case statements to two tables The GOOD!!
the optimal plan for this statement, leading with an index on the largest table and using it to isolate the rows before heading into the table partition, then the global temp tables. No rowid range scans here and temp space usage is never over the dataset being worked with.
The BAD:Performs a full table scan well after it’s created it’s own temp tables as part of the process. This is sub-optimal and will not have the rowid range scans, but will still utilize tons of temp space for sorting and hashing for the wrong reasons.
The Ugly:
The CBO decides the global temp table is the best choice, (which could easily happen if there were no dynamic sampling, so I want to collect statistics dynamically, have the correct amount of temp allocated to sorting and hashing…) but we do not want a bind peek issue due to askew sizes on partitions, etc. on the permanent tables, dynamic sampling, etc. We have what we need, we need to keep Oracle on course.

Short Term Solution to Correct the Problem: During the intial build of these static tables, I collected my statistics VERY CAREFULLY with my own scripts. I know what tables need histograms on indexed columns, what tables and indexes need statistics dropped, what columns need gathered with columns size 1, the exact sample size, etc. I really need Oracle to stop working hard here and undoing all the work we put in during the build process.

DIAL BACK THE STATS to when the execution path was correct with the best statistics for the environment. OEM is very good about showing me exactly WHERE it went wrong, so I was able to simply dial back to before this time, (or query when the LAST_ANALYZED for the tables involved are to see the update to the stats if I did not have the convenience of OEM):

SQL> exec dbms_stats.restore_table_stats (‘DM_SCHMA’,'ITEM_DETL’,’17-DEC-09 04.00.00.000000000 PM -04:00′);
SQL> exec dbms_stats.restore_table_stats (‘DM_SCHMA’,'ORDER_DETL’,’17-DEC-09 04.00.00.000000000 PM -04:00′);

Now lock the statistics on the tables required to stay static, vs. the temp tables involved to deter the plan from changing:
SQL> exec dbms_stats.lock_table_stats(‘DM_SCHMA’,’ITM_DETL’);
SQL> exec dbms_stats.lock_table_stats(‘DM_SCHMA,’ORDR_DETL);
To allow for the most stable execution plans, I ended up locking down 18 tables that were called by the total views utilized in the main processing of the mart.

Long Term Solution: Since this was a case of dynamic sampling “pulling the rug out from under us”, updating statistics on tables that are in a static, read-only state, my choice was to add steps post the build and “proper” update of statistics to lock the stats on all tables involved in the views with a DBMS_STATS.LOCK_TABLE_STATS. I still rely on dynamic sampling for my temp tables and a few of my other tables, so this is a solid compromise that is easily placed into the datamart build scripts when these tables are first created and after the correct statistics script is run.

Recovering from “DBA Stupid Trick, #8006″

…also known as follow up post on “dropping unused columns…”

In my last post I discussed how to drop unused columns on a compressed, partitioned table. The reason I had come across this task, as I had listed in the post, was that a developer had a huge update that would have taken over 24 hrs that we could have done in 70 seconds if we utilized an exchange partition option.

This update on the two partitions needed to be performed BEFORE I could get the maintenance window I needed to drop the unused columns. Being the “too smart for my own britches” DBA that I am, I figured out a work around that I thought I would share and then also share the backlash of being too smart for your own britches… :)

To work around the compressed partition with unused columns, I was able to still utilize the exchange partition option with the following steps:

Instead of a CTAS, I had the developer create the table with the appropriate unused columns in the correct order as they would have been originally, the actual name of the columns does not matter, it’s the original COLUMN #, along with datatype, values and if they are nullable that’s important:

select segcol#,segcollength,name
from sys.col$
where obj# in
(select object_id from dba_objects where owner=”and object_name = ‘TABLE_NAME>’)
order by col#;

You will gather the following information on the Unused columns from your source table:

COLUMN NAME: Original Column # DATATYPE Value Nullable
SYS_C00017_09121318:40:16$ 7 NUMBER 22 Y
SYS_C00028_09121318:40:46$ 8 DATE Y
SYS_C00029_09121318:41:04$ 9 DATE Y

create table OWNER.NEW_TAB4712
( MEMBER_ID NUMBER not null,
CUSTOMER_NBR VARCHAR2(30) not null,
IND_ID NUMBER not null,
ACTIVITY_DT DATE,
FLEX_1 VARCHAR2(512),
FLEX_2 VARCHAR2(512),
COLUMN7 NUMBER, <–New Columns in place with the right values and nullable!
COLUMN8 DATE,
COLUMN9 DATE,
FLEX_3 VARCHAR2(512),
FLEX_4 VARCHAR2(512),
FLEX_5 VARCHAR2(512),
FLEX_6 VARCHAR2(512),
FLEX_7 VARCHAR2(512),
FLEX_8 VARCHAR2(512),
FLEX_9 VARCHAR2(512),
FLEX_10 VARCHAR2(512))
tablespace TBL_DATA pctfree 10 initrans 1;

ALTER TABLE NEW_TAB4712 set unused column COLUMN7;
ALTER TABLE NEW_TAB4712 set unused column COLUMN8;
ALTER TABLE NEW_TAB4712 set unused column COLUMN9;

With the tables moved to unused, these columns will be invisible to your transactions, but available to match your columns for exchange partition.

If interested, you will see that the columns on your new table now match the columns in dba_tab_cols for your source partitioned table, including unused columns, (not counting the unused column names, but remember, that’s not important):

select * from dba_tab_cols
where table_name=”
<–replace with your table name here.. and column_id is null;

Now, insert data to your new table that was in your partition:
Insert into new_tab4712 select * from source_table
where member_id=4712; <–Our partitioning key

The developer now updated the rows he needed in the new_tab4712, preparing it for the exchange. Now commit the work and exchange the partition:

ALTER TABLE source_table EXCHANGE PARTITION P_4712 WITH TABLE new_tab4712 UPDATE INDEXES;

You will not receive the mismatch error on the unused columns and able to successfully exchange a partition on a compressed table with unused columns!

NOW, let’s discuss the backlash… :)

When I wrote out these steps for my developer as a learning exercise, as well as show the benefits of exchange partition for fixing data issues, I didn’t tell him that after he exchanged the partition, (this developer is production support, so he is the one who actually does this work…) we would need to ensure that it was in the APPROPRIATE TABLESPACE!

Now this would not seem to be THAT big of a deal for most shops, but we build new datamart each week, based off a RMAN recovery of this production system, with a list of tablespaces that are excluded, so as not to bring over unnecessary objects.

Yes, you guessed it, the two partitions he performed the earlier steps on and exchanged into the 400GB partitioned table were in a different tablespace than the rest of the table and not brought over in the initial mart build step.

The mart build step that is sourced off this 400GB table errored out due to the error:
ERROR at line 1:
ORA-00376: file 167 cannot be read at this time
ORA-01110: data file 167: ‘/u01/app/oracle/product/10.2.0/db_1/dbs/MISSING00167′

Oh yeah, dumb DBA trick, (Kellyn strikes head against keyboard several times at this point!!) I was bound and determined that the other DBA, who is on call this week, was not going to have to start over on the mart build, already three days in before this was breeched, so I started digging in and figuring out how to correct it, (I think the other DBA has started taking bets at this point on if I’d be able to get myself out of this tangled mess, too!:))

Similar steps to above, so I won’t go into full detail, but these were the steps:

1. Build export/import script from production to the new mart that exports the two missing partitions and would import them back into the schema in the mart.
2. Drop the global index, (and yes, any indexes must be dropped on the table with the missing partitions, painful, painful, hurts, yes!)
3. Drop the two partitions from the table that were on the tablespace/datafiles not recovered in the initial step of the mart build, (with the index(es) gone, you no longer will receive the ORA-0376 error when attempting a drop partition.)
4. Create new, empty partitions with the appropriate values, SPECIFYING the correct tablespace that DOES exist and that they should have been in to begin with.
5. Run the shell script that exported out and imported in the partitions, both parition imports set to ignore=Y so it would simply import the partitions back into the pre-created partitions in step 4.
6. Recreate the global index that was needed as part of the mart build processes.

Problem fixed- the mart build was back on track, losing about 16 hours, 80% of that due to the global index creation, (remind me why I hate global indexes, no matter if we need them at times or not! )

Final step- MOVE THE DARN TWO PARTITIONS IN THE PRODUCTION SYSTEM to where they SHOULD BE, so I never, NEVER have to go through this again! :)

ORA-14097: column type or size mismatch in ALTER TABLE EXCHANGE PARTITION

A developer needed to do some serious data changes to a partition. I thought I’d be a helpful DBA and suggested he do a CTAS of the one partition, make the change and we’d exchange the partition.

He was thrilled when the update statement that I’d killed after 24 hrs of no successful runs was cut down to 70 seconds with the CTAS and then all we needed to do was have me exchange it back in-

ALTER TABLE TAB1 EXCHANGE PARTITION P4706 WITH TABLE TAB_CPY_4706;

ERROR at line 1:
ORA-14097: column type or size mismatch in ALTER TABLE EXCHANGE PARTITION

I did a quick compare of the table columns, datatypes and if the columns allowed nulls- no differences- so what’s up?

select col#, name
from sys.col$
where obj# in
(select object_id from dba_objects where owner=’TAB_OWNER’ and object_name = ‘TAB1′)
order by col#;

COL# NAME
0 SYS_C00029_09042414:39:06$
0 SYS_C00017_09042414:38:51$
0 SYS_C00028_09042414:39:00$
1 Column1
2 Column2 …

The columns #’s with 0 are “Unused Columns”, a feature that allows you to move, (or set) columns to unused and then drop them at a maintenance window if none is available at the time the column needs to be removed from a table. The column is then invisible to standard processing, but will cause issues in partitioned tables if you attempt to exchange a partition.

There is a way to create your CTAS, then add the unused columns to the table with all the appropriate names, datatypes, etc. to the table you wish to exchange with, then attempt to exchange that, (first setting the columns to unused in the table before exchanging, but so that it matches the existing partition format) but I haven’t attempted it, only heard of the theory.

The long-term fix for this, as exchanging partitions would be a nice load option without fancy workarounds, is to drop the unused columns. This is my goal and what I decided needed to be achieved this weekend so we didn’t go through this again:

ALTER TABLE TAB1 DROP UNUSED COLUMNS;

ERROR at line 1:
ORA-39726: unsupported add/drop column operation on compressed tables

Here’s the second challenge for the day- this large, partitioned table, happens to be compressed and you can’t drop unused columns from a compressed table. This feature is only available on tables that have compression disabled on them.

The long term fix then demanded that I uncompress all the partitions, (one partition at a time, updating the indexes to allow miscellaneous processing to continue over the weekend…) and then set the table to nocompress option. Note- This is a table move command, so you must ensure you have enough room for table movement and to uncompress the object and indexes involved in the tablespaces!

ALTER TABLE TAB1 MOVE PARTITION P1000 NOCOMPRESS UPDATE INDEXES;

ALTER TABLE TAB1 NOCOMPRESS;

Once this was completed, then I could drop my unused columns successfully:

ALTER TABLE TAB1 DROP UNUSED COLUMNS;

This is a long operation, as the process must drop the columns from every partition, which can be a considerable effort in a 250GB table, as this one was. After finally dropping those pesky columns, I re-added compression to the table and compressed the appropriate partitions.

No one said DBA work was always interesting, but today it was a battle I could win! :)

The Makings of a Great DBA Team

As I’ve spent two “tours of duty” as a Lead DBA in my career, I often keep in mind what makes a great DBA team, noting both when companies realize what it takes and when they don’t. I’ve watched great DBA’s walk out the door because during the interview process someone decided the candidate didn’t have the EXACT skills of another DBA or they just weren’t sure what they were looking for due to the fact that the DBA role can have such a wide range of skill demands when the manager really needed to look at what skill were ALREADY in place in the company with the existing Database Administrators.

My new manager, like my previous one, deserves a lot of credit, when it comes to building a solid DBA team. He noticed the value of having different skill sets in the wide demanding arena of database administration and supports us well. Many managers do not see this value, hiring on the idea that all DBA’s are the same and that they should be able to do all the same work vs. complimenting. Even though our main support is the same database administration tasks, we each specialize in different areas and are allowed to develop our stronger gifts in the areas that benefit the company as a whole.

I’m currently working with a DBA that I worked with at a previous company for almost 2 years. The DBA, Terry is one of these guys that is solid as a rock, nothing phasing him, (I swear the building could be on fire, which when you announced it to him frantically, he’d simply look at you and say, “Oh…really…OK, well I’ll get my stuff then…” :) ) He’s methodical, old-fashioned DBA that has incredibly strong development skills, but very quiet and keeps to himself.

This is a strong contrast to yours truly, who is all about the database, operations and performance tuning vs. development, all-over the board, multi-tasking 80 different tasks at once and is like Martha Stewart on acid. I’m quite out-spoken and even our new manager joked that in my interview, I said more than Terry had in two months of employment!

Although I can’t speak for Terry, (and yet many times do without even trying..:)) I’ve always felt he was a perfect balance for my working style. We actually compliment each other well and with our different personalities, we are able to appeal to different user’s needs in a DBA.

Technically, I don’t know how often I have frantically solved 90% of a problem, having 90% of the “picture” per say and I know I’m missing something, but just can’t figure it out. Having done all this leg work, it is a simple task to take my picture, that if we were to correlate it to a puzzle, I have stripes and a tail, so I’m thinking I’ve got a Tiger- take all this info to Terry and he having a complete different outlook on the database, is able to pick up where I left off, work in his steady, methodical way and come up with the last 10%.
Surprising how often my tiger ends up being a zebra, BTW… :)
I appreciate this in Terry- that he knows I sometimes think so fast, so furiously, figure out so much, that I’ve simply overlooked a simple, but absolutely essential piece that needed his eyes to locate, (otherwise I will sit there for hours going in circles!) Sometimes I feel stupid afterwards for not figuring out the last 10%, but most have to admit, having me do the amount of work that I can in a short amount of time and then have Terry come in and put a nice bow on and complete it? Not too shabby…:)

This is where we come back to what makes a great DBA team. It’s rarely that the best team have the same skills, but how their skills complement each other and balance out the other’s weaknesses. I believe Terry and I learn from each other consistently, yet don’t have the same interests in database administration and are happy to separate the environments by these interests, yet you know you can undoubtedly count on the other- the level of commitment is the same.

Although Terry and I were the only DBA’s at our previous company, there is a third DBA at our current one. Sanjay’s last day is on Friday and he will be departing for a position that he felt was a better fit for his stage in his career and commute, but I am sad to see him leave. This man has the best attitude and I’ll take attitude anyday, but he has the skills to go with it. His knowledge of physical design and hardware will surely be missed and I wish I would have had more opportunity to work with him. We had excellent conversations regarding the cost based optimizer, database design and other areas that I specialized in and he had a sincere interest in learning more about. As Terry and I already had a comfortable partnership, the one we had with Sanjay was just starting to build, but with respect has to come respect with decision and I do respect his to leave.

What I hope can be gleamed from this post, (or semi-rant/praise, whatever you wish to tag it as…) is that database administration is an incredibly involved and demanding arena. Rarely will you find a DBA that covers the entire venue and/or covers it well, so diversifying your group is essential. Choose DBA’s that have differing specialties with a solid foundation of database administration skills and similar commitments to each other. Diversity is the key to strength in your DBA team!