
Follow-up: When an RMAN Clone Goes Bad

A follow-up to this previous post. I thought I had this one down: I'd followed the standard protocol when you receive a 600 error, looked it up on MOS, and had my bug number, documentation and proof of what the bug was, right?

Gotta love being a DBA: the day you think you've figured it out, something comes up to surprise you and you learn something new.

ORA-00600: internal error code, arguments: [kcvhvdf_1], [], [], [], [], [], [], []

It looked like an odd version of bug 9314439, RECOVERY FAILS ON CLONED DATABASE ORA-600 [KCVHVDF_1], which is an 11.1.0.7 bug, but I was receiving it in a 10.2.0.4.3 database clone.

 
Nope, nice try, Kellyn (oh-oh, she's started to refer to herself in the third person, it can't be good!). This is where the DBA Gods come down and say, "You ain't that hot and here's a reminder…"  🙂
 
This 600 error had an RMAN error that went with it:

RMAN-06136: ORACLE error from auxiliary database: ORA-03113: end-of-file on communication channel
  
When the error showed up the next time, it seemed to confirm the bug. Then it appeared a third time, this time with numerous other errors in the duplicate. This sent up red flags for me and for the other DBA I've worked with for years. While he looked into the feasibility of saving the duplicate process, I started searching through the miscellaneous errors.
  
RMAN-03009: failure of sql command on clone_default channel at …
RMAN-06136: ORACLE error from auxiliary database
ORA-19563: header validation failed for file

"Media Recovery Start
WARNING! Recovering data file 1 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command."

Ohhh, what fun! 🙂

 
The clone had failed on multiple steps (a sketch of the kind of duplicate run that drives them follows the list):

  • Recovery of multiple datafiles.
  • The switch to the datafiles once recovered.
  • Creation of the temp tablespace.
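
Those steps are the back half of a normal RMAN duplicate, which restores the datafiles to the auxiliary instance, media-recovers them, switches the clone to the recovered copies, and finally re-creates the tempfiles. A bare-bones sketch of that kind of run is below; the connect strings, channel name and CLONE database name are placeholders, not our actual script.

$ rman target sys@PROD auxiliary sys@CLONE

RMAN> run {
        # at least one disk channel for the auxiliary (clone) instance
        allocate auxiliary channel aux1 device type disk;
        # duplicate restores the datafiles, recovers them, switches the clone
        # to the recovered copies and re-creates the tempfiles
        duplicate target database to CLONE nofilenamecheck;
      }

All three failures land after the restore itself, which is why a flaky network between the target and the auxiliary can poison the run that late in the game.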

As we'd been experiencing some odd sqlnet disconnect errors on another database server, we decided to go back to the original 3113 error. There weren't any further 3113/3114/3135 errors, but the miscellaneous errors did make sense if the disconnects had occurred at the OS level instead of at the database session level.
 
Using the times of the failures in the duplicate log, I went to the sqlnet.log for the target database. For each and every error in the duplicate log, there was a corresponding error in the sqlnet.log at the same time (a rough sketch of that cross-check follows the excerpt):
Time:
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT= ))
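
Repeating that cross-check is nothing fancier than lining the two logs up by time; a sketch is below, where the log names, locations and the sample time are assumptions for illustration only.

# failure times, pulled from the duplicate's log (name and path assumed)
$ grep -i 'RMAN-06136\|ORA-00600' /tmp/duplicate_clone.log

# then read the target server's sqlnet.log around each of those times,
# substituting each failure time for the sample '14:32'
# (the default location under $ORACLE_HOME/network/log is an assumption)
$ grep -A 10 '14:32' $ORACLE_HOME/network/log/sqlnet.log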

Ahhh…the true source of the problem reveals itself… The duplicate was in a state where it wasn't likely to be recovered successfully, so a restart from scratch was required. I issued a drop database, which only resulted in dropping the control files, redo logs and the SYSTEM tablespace's datafile; all the rest of the datafiles had to be removed at the OS level.
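
For anyone cleaning up after a dead duplicate, the sequence looked roughly like the sketch below; the SID and datafile path are placeholders, and how much DROP DATABASE removes for you depends on how far the duplicate got before it died.

$ export ORACLE_SID=CLONE        # the failed auxiliary instance (placeholder SID)
$ sqlplus / as sysdba
SQL> startup mount exclusive restrict
SQL> drop database;
SQL> exit
$ rm /u02/oradata/CLONE/*.dbf    # leftover datafiles; path is a placeholder

DROP DATABASE only removes the files the control file knows about, which is why the half-restored datafiles survived it and needed the rm.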
I then tweaked the sqlnet.ora and re-ran the duplicate, extending the expire_time to ensure we could stay connected. Suddenly, no bug, and the duplicate completed successfully…
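
The change itself is a one-liner. A sketch of the relevant sqlnet.ora entry is below; the ten-minute value is purely illustrative, not necessarily what we settled on, and the parameter lives in the sqlnet.ora on the database server side of the connection.

# sqlnet.ora on the database server; the value is in minutes, 10 is illustrative
SQLNET.EXPIRE_TIME = 10

Dead connection detection then sends a probe packet at that interval, which as a side effect keeps an otherwise quiet connection from being timed out by the network in between.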
 
Now, the final question: for all the people who have been experiencing bug 9314439, are they really hitting a bug, or is the 600 error actually a 12170 disconnect error due to sqlnet/network issues?
