When RMAN Cloning Goes Bad…

One of the options for cloning an environment with limited disk space  involves creating a parameters file for the RMAN clone memory script to run that contains the “skip tablespace” feature. 
This allows the DBA to retain a working list of tablespaces that contain data that isn’t required in development/test or for other cloning purposes when performing the actual clone.

payday loans lenders online

startup auxiliary nomount;

run {
allocate auxiliary channel c1 device type disk;
…so on and so forth for all the channels desired…
duplicate target database to
SKIP TABLESPACE USERS,TBL_DATA1,TBL_INDX1,….;
}

This can also be extremely helpful in environments where disk allocation towards a clone is limited and restoring all of production isn’t an option.
 
If this choice is made, you will see the following messages in the clone log signifying that the tablespaces have been “skipped” by the clone:
 
“Datafile 127 skipped by request

Datafile 201 skipped by request
Datafile 249 skipped by request
Datafile 250 skipped by request”

and in the memory script section of the clone you will see a copy of your skip tablespace command listing out the tablespaces you requested to skip. 

Problem with this feature is that it can also create a lot of “white noise, benign” errors until you can clean up post the clone if you have a clone failure.

In a recent clone, I was the unlucky recipient of a connectivity problem due to a change in sqlnet parameters that impacted the clone at a very inopportune time:

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 09/27/2010 19:30:06
RMAN-03015: error occurred in stored script Memory Script
RMAN-06136: ORACLE error from auxiliary database: ORA-03113: end-of-file on communication channel

As the clone was finishing up with the last of the recovery and creating the temp files, I was just a few steps short of the “open database resetlogs” completion of the clone and knew the best course was to pull the SVN that the clone was working with from the log and restart the clone recovery:

run

{
set until scn 383434989336;
recover clone database;
alter clone database open resetlogs;
};

I received an unexpected and unwelcome error upon running the new recovery step:
alter database recover if needed

start until cancel using backup controlfile
Media Recovery Start
Mon Sep 27 19:39:21 2010
Errors in file /xxxxx/udump/_ora_20448.trc:
ORA-00600: internal error code, arguments: [kcvhvdf_1], [], [], [], [], [], [], []

Ouch! This appears to be an odd version of bug 9314439: RECOVERY FAILS ON CLONED DATABASE ORA-600 [KCVHVDF_1] which is an 11.1.0.7 bug, but I was receiving in a 10.2.0.4.3 database clone.  The issue was, I didn’t know it at the time-  the bug above describes how to check for datafile inconsistent dbid in the headers and as you’ll see, that’s not exactly where my problem ended up.. :)

Next was checksum errors:
ORA-00367: checksum error in log file header

ORA-00316: log 1 of thread 1, type 0 in header is not log file
ORA-00312: online log 1 thread 1: /xxxxx/redo01a.rdo

I ran the checks on each of the files and everything came back satisfactory on the checksum.
Next, armed with the information from bug 9314439, I check the mismatch in datafile to dbid for v$database:
select distinct(FHDBI) from x$kcvfh;
and
select DBID from v$database;
They matched…(of course they did, it wasn’t  a”datafile” that was causing me the issue…)

In a clone that utilizes the “skip tablespace” feature that also consists of a complicated integrity constraint design or materialized views, the final step of the clone, to drop and clean up the “missing” datafiles, can often fail.  This will result in the following messages in the alert log for the cloned database:

File #264 found in data dictionary but not in controlfile.

Creating OFFLINE file ‘MISSING00264′ in the controlfile.
This file can no longer be recovered so it must be dropped.
File #285 found in data dictionary but not in controlfile.
Creating OFFLINE file ‘MISSING00285′ in the controlfile.
This file can no longer be recovered so it must be dropped.

Through all of this, I had an mview refresh job to drop so as to stop some of the “white noise” errors that were populating the log and making the real errors even more difficult to locate.

ORA-12012: error on auto execute of job 141

ORA-00376: file 185 cannot be read at this time
ORA-01111: name for data file 185 is unknown – rename to correct file
ORA-01110: data file 185: ‘/xxxx/dbs/MISSING00185′

As the clone finally completed its failed attempts to clean up the complicated missing datafiles, it exited, opened the database and then the real issue began to finally show itself:

Errors in file /xxxxx/bdump/_dbw0_19081.trc:

ORA-01186: file 1003 failed verification tests
ORA-01157: cannot identify/lock data file 1003 – see DBWR trace file
ORA-01110: data file 1003: ‘/target_not_duplicate_dir/temp_1_03.dbf’
File 1003 not verified due to error ORA-01157

Uhmmmm….I don’t have directory paths for the target database on the clone destination server…what is up with that?? 
 
Since it’s tempfiles, I did a quick look at v$tempfile and it showed that all the tempfiles that were mapped and created as part of the duplicate and converted in the dbfile_convert parameter, show the original, TARGET database’s directory path!  No datafiles had this problem, only the tempfiles form the clone.
 
I have a helpful link that I keep around for looking at the X$ table definitions:
http://www.orafaq.com/wiki/X$_Table
 
Due to this, I was then able to change my query to:
Select distinct(FHDBI) from X$KCVFHMRR;
 
This resulted in TWO distinct DBID’s, showing a very similar bug to 9314439, but only effecting the tempfiles for the failed recovery in 10.2.0.4.3 vs. what they have seen for this bug in 11g.  The data dictionary had mismatched the DBID in the headers for the tempfiles, not the datafiles.
 
The clone consisted of a secondary temp tablespace that was created post the clone, so the resolution was to switch to the secondary temp tablespaces, drop the corrupt temp tablespace “with contents and datafiles”, (it refused to let go otherwise!) and to recreate the original temp tablepace.
ALTER DATABASE DEFAULT TEMORARY TABLESPACE OTHR_TEMP;

As this is a temp tablespace group, I am required to drop the tablespaces that comprise the group:
DROP TABLESPACE TEMP_4 INCLUDING CONTENTS AND DATAFILES;
DROP TABLESPACE TEMP_3 INCLUDING CONTENTS AND DATAFILES;
DROP TABLESPACE TEMP_2 INCLUDING CONTENTS AND DATAFILES;
DROP TABLESPACE TEMP_1 INCLUDING CONTENTS AND DATAFILES;

CREATE TEMPORARY TABLESPACE TEMP_1 TEMPFILE ‘/xxxxx/temp_1_01.dbf’ size 51200M,
‘/xxxxx/temp_1_02.dbf’ size 51200M
Tablespace group temp;

CREATE TEMPORARY TABLESPACE TEMP_2 TEMPFILE ‘…;
ALTER TABLESPACE TEMP_2 TABLESPACE GROUP TEMP;
So on and so forth… recreating each of the temporary tablespaces and adding them to the tablespace group.
I then needed to redirect to use the larger temp tablespace group I’d just recreated.
ALTER DATABASE DEFAULT TEMPORARY TABLESPACE TEMP;

I then ran the post clean up script and proceeded with the final steps of the clone successfully.  The challenge here it to remember, that as a DBA, you have a lot of features that may save you space and time in a clone, but can also make it really difficult to track down the real problem or at least delay you from finding it when it comes to the chaos it can fill your alert log with!

Building Right or Build Twice- That is the Question…

This question seems to pop into my mind consistently over the years as a DBA.  I’m a “build it right or don’t build it at all” kind of DBA, but due to my gift for finding problems and fixing them, I find myself more and more often performing the second build on processes/procedures/designs, which I often would like to avoid.  I went through this repeatedly at a previous shop and it’s still fresh in my mind, even today…

Don’t get me wrong-  I think it’s a noble cause when you first come into a new shop and it’s either been neglected or didn’t have the DBA support it may have needed and the update in TLC is always appreciated, but if it’s something I’m revisiting because of a lack of requirements or a production change that wasn’t thoroughly benchmark tested, I find me getting a little cranky-  (usually with myself since I expect that I should be able to read minds at this point! :) )

I believe to utilize your DBA and development team efficiently, not wasting money on resources, building “right” is essential.  Designing the initial database, as well as future development should be reviewed repeatedly to verify you can answer the following questions with a resounding “YES!”:

  • Is the main schema design scalable and easily adaptable to our business?
  • Does the daily/weekly/monthly processing require no manual intervention by support staff?
  • Have we performed capacity planning and does the hardware meet the foreseen demand and growth of the environment?
  • Has the schema owners been secured and application security been set to the DBA satisfaction?
  • Are all database files secured on the server from access to anyone outside the database group?
  • Are naming conventions documented and followed in the database environment?
  • Have code reviews been adopted, new code/processing plan meeting requirements been adopted?
  • Have different groups been assigned responsibility for the aspects they must be accountable for that will impact the database environment, (i.e. application support on call for apps, network admins for network issues, etc.)

The companies I’ve experienced DBA’s overstressed and over-worked in the past are commonly places who can not answer yes to the questions above.  Without these requirements in place, the databases become kin to a house with a foundation made of Popsicle sticks, (often built on a side of a mountain prepped for a future mudslide, BTW…:)) 

If a DBA is not there for the initial build of a database, as often is the case, and is just the “unlucky home buyer” of the database, then there is a unique opportunity in front of you to attempt to correct and change the culture that has built this rickety structure. Changing a culture is a difficult challenge-  I won’t try to pull the wool over any one’s eyes.   It is attainable and if you have management support, you will already have a huge head start! 

If you do have management’s support, what steps should you take to stabilize an environment that is expecting it’s DBA’s to be infallible instead of it’s databases?

  • Communication-  Demand daily meetings.  10-15 minutes to discuss what is happening and who is doing what.  Often folks live in their own little world and are unaware of how many folks are impacting the database, thinking there’s in the “one little process” requiring manual intervention.
  • Automation-  What can be automated?  Anything and everything that requires manual work or intervention by a DBA or developer in production should be stopped as soon as possible.  Make it a priority to automate it with procedures, scripts, crons, whatever it takes, make it hands off!
  • “White Noise”-  This is what I call all the informational emails or “fluff” in important emails that can mask or make someone miss what we really need to know about, (like failures!)  Remove emails that are just to inform people processes/jobs “have completed”.  If they are concerned about failures, create scripts on secondary servers to monitor for the main ones, but all email should be “just the facts” and pages should be narrowed down to production issues that require a DBA to address.
  • Worst Case Scenario Development-  AKA, developers and DBA’s should develop all processes with the lightest footprint on the database.  Develop the code with the idea it is going to always be running against the heaviest load on it, the most limiting resources and the least amount of time requirements.  This is the code that will last-  this is the code that will not demand to be revisited by the DBA or developer in the near future, (we’re hoping we can have code that won’t have to be revisited at all, remember? :) )
  • Database=Two year old Toddler-  What does this mean?  For me, this means I am either the  DBA Mommy or often feel like a daycare provider.  Poorly developed databases are like toddlers and they will throw tantrums-  A LOT!  I need to be vigil, I need to have the resource time in my schedule to spend time with them, monitor their behavior and know when something is wrong with them.  Now, if you build them right, they will mature quite nicely and require less time, but please remember, the teenage years are often still ahead of us.
  • 500,000 Miles without an Oil Change-  This is how some companies run their databases.  Down the road, 80 mph, and then ask their DBA’s, “Can you change the oil while we continue down the road?  Oh, and while you’re under the hood, could you replace the transmission with a new one, too?”  Designate maintenance windows-  they should be a priority, not an after-thought.  Understand all that is required daily in automated maintenance to ensure the system continues to run, too-  statistics collection, backups and cleanup.  These are required for the health of the database, not just for the DBA’s sanity.
  • Make a List, Check it Twice-  The last task is the hardest.  Keep an up to date list of what makes up that “foundation of popsicle sticks”.  What are the design/process flaws in the environment that need to be corrected to build a sound, solid database and design small projects to correct them.  This will pay-forward in so many ways.  The goal should be to address objects/processes in the system that are not scalable with the database or require intervention due to design flaws.  Selling these projects to the business should include dollars saved in support/hardware or revenue that can be generated post the fix.

The subject of building/designing robust database environments is one I take very seriously and I may take this subject a bit too much to heart at times.   I was raised with a tough, dedicated work ethic- something I received from both my parents-  rural Canadians are a pretty tough bunch…   Taking on a challenge is right up my alley and I’m just not one to give up.  Can I accomplish it?  Eh yah, you betcha! :P