When RMAN Cloning Goes Bad…
One of the options for cloning an environment with limited disk space involves creating a parameters file for the RMAN clone memory script to run that contains the “skip tablespace” feature.
This allows the DBA to retain a working list of tablespaces that contain data that isn’t required in development/test or for other cloning purposes when performing the actual clone.
startup auxiliary nomount;
allocate auxiliary channel c1 device type disk;
…so on and so forth for all the channels desired…
duplicate target database to
SKIP TABLESPACE USERS,TBL_DATA1,TBL_INDX1,….;
This can also be extremely helpful in environments where disk allocation towards a clone is limited and restoring all of production isn’t an option.
If this choice is made, you will see the following messages in the clone log signifying that the tablespaces have been “skipped” by the clone:
“Datafile 127 skipped by request
Datafile 201 skipped by request
Datafile 249 skipped by request
Datafile 250 skipped by request”
and in the memory script section of the clone you will see a copy of your skip tablespace command listing out the tablespaces you requested to skip.
Problem with this feature is that it can also create a lot of “white noise, benign” errors until you can clean up post the clone if you have a clone failure.
In a recent clone, I was the unlucky recipient of a connectivity problem due to a change in sqlnet parameters that impacted the clone at a very inopportune time:
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-03002: failure of Duplicate Db command at 09/27/2010 19:30:06
RMAN-03015: error occurred in stored script Memory Script
RMAN-06136: ORACLE error from auxiliary database: ORA-03113: end-of-file on communication channel
As the clone was finishing up with the last of the recovery and creating the temp files, I was just a few steps short of the “open database resetlogs” completion of the clone and knew the best course was to pull the SVN that the clone was working with from the log and restart the clone recovery:
set until scn 383434989336;
recover clone database;
alter clone database open resetlogs;
I received an unexpected and unwelcome error upon running the new recovery step:
alter database recover if needed
start until cancel using backup controlfile
Media Recovery Start
Mon Sep 27 19:39:21 2010
Errors in file /xxxxx/udump/_ora_20448.trc:
ORA-00600: internal error code, arguments: [kcvhvdf_1], , , , , , , 
Ouch! This appears to be an odd version of bug 9314439: RECOVERY FAILS ON CLONED DATABASE ORA-600 [KCVHVDF_1] which is an 220.127.116.11 bug, but I was receiving in a 10.2.0.4.3 database clone. The issue was, I didn’t know it at the time- the bug above describes how to check for datafile inconsistent dbid in the headers and as you’ll see, that’s not exactly where my problem ended up..
Next was checksum errors:
ORA-00367: checksum error in log file header
ORA-00316: log 1 of thread 1, type 0 in header is not log file
ORA-00312: online log 1 thread 1: /xxxxx/redo01a.rdo
I ran the checks on each of the files and everything came back satisfactory on the checksum.
Next, armed with the information from bug 9314439, I check the mismatch in datafile to dbid for v$database:
select distinct(FHDBI) from x$kcvfh;
select DBID from v$database;
They matched…(of course they did, it wasn’t a”datafile” that was causing me the issue…)
In a clone that utilizes the “skip tablespace” feature that also consists of a complicated integrity constraint design or materialized views, the final step of the clone, to drop and clean up the “missing” datafiles, can often fail. This will result in the following messages in the alert log for the cloned database:
File #264 found in data dictionary but not in controlfile.
Creating OFFLINE file ‘MISSING00264′ in the controlfile.
This file can no longer be recovered so it must be dropped.
File #285 found in data dictionary but not in controlfile.
Creating OFFLINE file ‘MISSING00285′ in the controlfile.
This file can no longer be recovered so it must be dropped.
Through all of this, I had an mview refresh job to drop so as to stop some of the “white noise” errors that were populating the log and making the real errors even more difficult to locate.
ORA-12012: error on auto execute of job 141
ORA-00376: file 185 cannot be read at this time
ORA-01111: name for data file 185 is unknown – rename to correct file
ORA-01110: data file 185: ‘/xxxx/dbs/MISSING00185′
As the clone finally completed its failed attempts to clean up the complicated missing datafiles, it exited, opened the database and then the real issue began to finally show itself:
Errors in file /xxxxx/bdump/_dbw0_19081.trc:
ORA-01186: file 1003 failed verification tests
ORA-01157: cannot identify/lock data file 1003 – see DBWR trace file
ORA-01110: data file 1003: ‘/target_not_duplicate_dir/temp_1_03.dbf’
File 1003 not verified due to error ORA-01157
Uhmmmm….I don’t have directory paths for the target database on the clone destination server…what is up with that??
Since it’s tempfiles, I did a quick look at v$tempfile and it showed that all the tempfiles that were mapped and created as part of the duplicate and converted in the dbfile_convert parameter, show the original, TARGET database’s directory path! No datafiles had this problem, only the tempfiles form the clone.
I have a helpful link that I keep around for looking at the X$ table definitions:
Due to this, I was then able to change my query to:
Select distinct(FHDBI) from X$KCVFHMRR;
This resulted in TWO distinct DBID’s, showing a very similar bug to 9314439, but only effecting the tempfiles for the failed recovery in 10.2.0.4.3 vs. what they have seen for this bug in 11g. The data dictionary had mismatched the DBID in the headers for the tempfiles, not the datafiles.
The clone consisted of a secondary temp tablespace that was created post the clone, so the resolution was to switch to the secondary temp tablespaces, drop the corrupt temp tablespace “with contents and datafiles”, (it refused to let go otherwise!) and to recreate the original temp tablepace.
ALTER DATABASE DEFAULT TEMORARY TABLESPACE OTHR_TEMP;
As this is a temp tablespace group, I am required to drop the tablespaces that comprise the group:
DROP TABLESPACE TEMP_4 INCLUDING CONTENTS AND DATAFILES;
DROP TABLESPACE TEMP_3 INCLUDING CONTENTS AND DATAFILES;
DROP TABLESPACE TEMP_2 INCLUDING CONTENTS AND DATAFILES;
DROP TABLESPACE TEMP_1 INCLUDING CONTENTS AND DATAFILES;
CREATE TEMPORARY TABLESPACE TEMP_1 TEMPFILE ‘/xxxxx/temp_1_01.dbf’ size 51200M,
‘/xxxxx/temp_1_02.dbf’ size 51200M
Tablespace group temp;
CREATE TEMPORARY TABLESPACE TEMP_2 TEMPFILE ‘…;
ALTER TABLESPACE TEMP_2 TABLESPACE GROUP TEMP;
So on and so forth… recreating each of the temporary tablespaces and adding them to the tablespace group.
I then needed to redirect to use the larger temp tablespace group I’d just recreated.
ALTER DATABASE DEFAULT TEMPORARY TABLESPACE TEMP;
I then ran the post clean up script and proceeded with the final steps of the clone successfully. The challenge here it to remember, that as a DBA, you have a lot of features that may save you space and time in a clone, but can also make it really difficult to track down the real problem or at least delay you from finding it when it comes to the chaos it can fill your alert log with!