Today we had some trouble with a cluster that uses Business Copy to back up its data while the application is running. A framework runs the backup process automatically: it tells Oracle to prepare for the backup (no more write queries for the next x minutes), then pairs the PVOL and SVOL LUNs inside the XP box, so that all the data is synchronized from the PVOL to the SVOL. After that the pairs are split, and a vgchgid is needed to change the VGID on the SVOL halves, because otherwise LVM would see the same volume group ID on two sets of disks. Then the SVOLs with the new LVM header are mounted under another mount point, and voilà: we have an instant, verbatim snapshot of the original volume group in a consistent state. The tape drives can then make a decent backup from it, and it can take all night long with no impact on production. But if the configuration (I mean the /etc/horcm*.conf files) is wrong, this ‘thingy’ can be very dangerous…
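Before looking at the configuration files, here is a rough dry-run outline of the cycle described above. It is only a sketch, not the real framework: every step is printed, nothing is executed. The group name `bcgrp`, the device file, and the timeout value are placeholders I made up. Using pairresync for the pairing step and releasing the database right after the split are my assumptions, since the steps above don't spell those details out.

```shell
#!/bin/sh
# Dry-run outline of one Business Copy backup cycle.
# Nothing is executed: each step is only printed.
GROUP=bcgrp                  # hypothetical device group name from horcm*.conf
SVOL_RAW=/dev/rdsk/c20t0d1   # hypothetical raw device file of one SVOL
TIMEOUT=3600                 # hypothetical value passed to pairevtwait -t

step() { echo "would run: $*"; }

bc_backup_cycle() {
    # 1. quiesce the database (how this is done is framework specific)
    step "put the database into backup mode"
    # 2. resynchronize PVOL -> SVOL and wait until the group is in PAIR state
    step pairresync -g "$GROUP"
    step pairevtwait -g "$GROUP" -s pair -t "$TIMEOUT"
    # 3. split the pairs and wait until the group is suspended
    step pairsplit -g "$GROUP"
    step pairevtwait -g "$GROUP" -s psus -t "$TIMEOUT"
    # 4. release the database again (assumed position in the sequence)
    step "take the database out of backup mode"
    # 5. give the SVOL halves a new VGID so LVM accepts them as a separate VG
    step vgchgid "$SVOL_RAW"
    # 6. the rest depends on the local volume group layout
    step "import, activate and mount the SVOL volume group elsewhere"
}

bc_backup_cycle
```

Printing the steps first makes it easy to review the order before anything touches the XP box.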
# ll /etc/horcm*.conf
-r--r--r-- 1 root sys 32710 Feb 22 2008 /etc/horcm.conf
-r--r--r-- 1 root root 1891 Mar 24 11:43 /etc/horcm0.conf
-r--r--r-- 1 root root 1873 Mar 24 11:44 /etc/horcm1.conf
These horcm[0-9].conf files hold the mappings for the PVOL-SVOL pairs. Each file describes one level of the Business Copy: e.g. horcm0.conf is the conf file for the first level of PVOLs, and horcm1.conf contains the SVOLs matching these PVOLs. We can create multiple levels of PVOL-SVOL pairs and pair all of them in an automated way, so we can keep multiple snapshots, one for each weekday and so on… Our storage provider uses a three-tier classification of LUNs. A C-Class is the simplest: a single LUN straight from one of the storage boxes. A B-Class is a normal pair of disks for LVM/Veritas mirroring. An A-Class has three separate disks: one SMPL for LVM/Veritas-based mirroring, and one PVOL-SVOL pair from a separate box for this Business Copy.
Box No. 1.                     Box No. 2.
SMPL <-----------------------> PVOL | SVOL
Between the SMPL and the PVOL there is LVM/Veritas-based mirroring, and inside Box No. 2 there is the Business Copy-based mirroring, which only copies data while the PVOL and SVOL halves are in a so-called pair state. This way several TB can be synchronized right inside the box, and it takes only a few minutes.
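Going back to the horcm[0-9].conf files: since a wrong mapping is the dangerous part, a small sanity check can compare the two files before a backup run. The sketch below assumes the usual layout with section keywords such as HORCM_DEV on their own line, and that the first two columns of a HORCM_DEV entry are the group name and the device name, which have to match on both sides. The function names are mine, and the check says nothing about whether the LUNs behind those names are the right ones.

```shell
#!/bin/sh
# Compare the group/device names in the HORCM_DEV sections of two
# instance files. Function names are hypothetical helpers.

horcm_dev_pairs() {
    # Print "group dev_name" for every entry in the HORCM_DEV section.
    awk '
        /^HORCM_/ { in_dev = ($1 == "HORCM_DEV"); next }
        in_dev && NF >= 2 && $1 !~ /^#/ { print $1, $2 }
    ' "$1" | sort
}

horcm_check() {
    a=$(horcm_dev_pairs "$1")
    b=$(horcm_dev_pairs "$2")
    if [ "$a" = "$b" ]; then
        echo "OK: same group/device names in $1 and $2"
    else
        echo "MISMATCH between $1 and $2"
        return 1
    fi
}

# Only run against the real files when they exist on this host.
if [ -r /etc/horcm0.conf ] && [ -r /etc/horcm1.conf ]; then
    horcm_check /etc/horcm0.conf /etc/horcm1.conf
fi
```

A mismatch here would not have caught our incident (the configuration was fine), but it is a cheap check to run before each pairing.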
In our case it wasn’t a matter of wrong configuration. The PVOL halves were mapped to a non-existent (i.e. decommissioned) machine, the backup process was somehow initiated in a way we can’t explain, and all our C-Class LUNs (normal performance, non-mirrored) were overwritten with the contents of some PVOLs from inside the box. This is not how it is supposed to work, and it destroyed all the data of one of our highly productive SAP databases. Ouch. As a side effect, it turned all of our SMPL-based simple C-Class LUNs into SVOL instances. Once the trouble was known to us, I opened a high-priority ticket with our storage team to check the XP box/FC net/cabling. They threw it back with the usual “it’s all right”. As usual, the problem wasn’t cleared up right away, so to gain more time I opened a case with HP support to be completely sure that the HW side of our server was OK. At that time I had the feeling that the problem was not with our HW/SW but on the storage side, and HP support just strengthened this feeling. I had already filed a SW case for this too when I noticed the slight difference in the xpinfo output: the C-Class disks were displayed as SVOL, though they should have been SMPL.
The solution: I told the storage guys to set the LUNs back to the SMPL state, after which we could do a pvchange for all the affected LUNs and a vgcfgrestore. The data was lost, so there was no other way than to fall back on the tape backups and do a restore, after creating new filesystems with newfs. Here is a snippet from the syslog after we pvchange’d the LUNs back into the VG:
Jul 24 20:54:19 hppmp02 LVM: pvchange -a y /dev/dsk/c89t2d5
Jul 24 20:54:19 hppmp02 vmunix: LVM: VG 64 0x110000: Data in one or more logical volumes on PV 31 0x592500 was lost when the disk was replaced.
Jul 24 20:54:19 hppmp02 vmunix: This occured because the disk contained the only copy of the data.
Jul 24 20:54:19 hppmp02 vmunix: Prior to using these logical volumes, restore the data from backup.
Jul 24 20:54:19 hppmp02 vmunix: LVM: VG 64 0x110000: PVLink 31 0x592500 Recovered.
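For reference, the recovery can be outlined per LUN roughly like this. Again it is a dry-run sketch that only prints the commands. Only the pvchange line is taken from the syslog above; the volume group name `vgsap`, the logical volume name and the VxFS filesystem type are placeholders. I have put vgcfgrestore before pvchange -a y, which is the usual order when an LVM header has to be rewritten, so treat the order as an assumption too.

```shell
#!/bin/sh
# Dry-run outline of the recovery after storage set the LUNs back to SMPL.
# Nothing is executed: each step is only printed.
VG=vgsap          # hypothetical volume group name
DISKS="c89t2d5"   # device from the syslog; list the other affected LUNs here
LVOL=lvol1        # hypothetical logical volume name

step() { echo "would run: $*"; }

recover_luns() {
    for d in $DISKS; do
        # write the saved LVM configuration back to the raw device
        step vgcfgrestore -n "/dev/$VG" "/dev/rdsk/$d"
        # reattach the physical volume to the volume group
        step pvchange -a y "/dev/dsk/$d"
    done
    # the old contents are gone, so create a fresh filesystem ...
    step newfs -F vxfs "/dev/$VG/r$LVOL"
    # ... and bring the data back from the tape backup
    step "restore the data from tape"
}

recover_luns
```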