Saving you 15 minutes is a blog about all the little things you hit on a day to day basis which could probably save another DBA about 15 minutes, if they knew about it…
Databases upgrades on large production environments using clustering technology not replicated in pre-production environments are pieces of work which are only enjoyable once you have finished. The day before an upgrade I was copying some RMAN pieces to a test server to then do a quick test upgrade. I had done this many times before as part of the upgrade plan and I found myself with a bit of spare time so I thought I would give it another go, just in case.
Process is simple with the layout identical to production, restore controlfiles, mount the database restore the database, recover, simple. However when I came to the restore I kept getting this error out of RMAN: ORA-01861: literal does not match format string.
I spent a bit of time trying to work out what was going wrong, in fact probably too much time, as what I was trying to do was so simple, in a way it makes it more confusing. Anyway I started to look around on MOS and found this, RMAN Recovery Session Fails with ORA-1861.
It seems that my controlfile has an invalid date, in fact by dumping out the controlfile to a text file (SQL> alter session set events ‘immediate trace name controlf level 20′;) and then using a Unix split command (the output was too big to look at in one chunk) I found the issue. It seems that the latest controlfile backup had a completion time of 31st February, that well known made up date.
How this happened I don’t know and I am currently trying to work it out. Here is the entry from the dump.
RECID #477319 Recno 5319 Record timestamp 02/25/14 16:29:49 piece #1 copy #1 pool 0 Backup set key: stamp=831602190, count=467255 V$RMAN_STATUS: recid=18446741324928359424, stamp=18446741324928359416 Flags: <concurrent access> Device: DISK Handle: /u01…. Media-Handle: Comment: Tag: TAG20131116T003630 Completion time 02/31/14 02:17:24
What was slightly more interesting is when I came back to replicate this issue I couldn’t, it was using a controlfile backup from the following day. I dumped out the controlfile and then try to locate the same record.
RECID #467208 Recno 8008 Record timestamp 11/16/13 00:36:34 piece #1 copy #1 pool 0 Backup set key: stamp=831602190, count=467255 V$RMAN_STATUS: recid=18446741324928359552, stamp=18446741324928359544 Flags: <deleted> <concurrent access> Device: DISK Handle: /u01 Media-Handle: Comment: Tag: TAG20131116T003630 Completion time 11/16/13 00:36:35
Of course I didn’t hit the same issue when I came to do the restore so you could argue what’s the problem. For me it is an issue of consistency, I don’t like intermittent issues which I can’t control. So in fact every day since I have hit this I have been doing test restores but I can’t replicate the issue. As we have more confidence in the process now it seems academic to try and work out what caused the original rogue record. I guess which each full restore and re-cataloguing of the recovery area all the backup pieces will have different IDs etc. Further dumping of the controlfiles has shown Completion Times in the future which doesn’t seem right but as these are valid dates it won’t through up the same literal error. Anyway if you have ever come across this before and you can shed some light on the Oracle internals of controlfiles then maybe you can save me 15 minutes.