A Monday morning we found several messages like this from Bacula:
09-Sep 18:27 Tape-A JobId 15196: Please mount Volume "A0000038" or label a new one for: Job: sites.2012-09-08_01.50.03_40 Storage: "LTO-4" (/dev/nst0) Pool: Default Media type: LTO-4
That was Bacula complaining about the wrong volume on the drive. However, after examining the status of storage the weird thing was that the tape with barcode “A0000038″ was actually mounted on the drive, it was purged and had 0 bytes used… WTF?
We started suspecting our tape rotation algorithm was not taking into account that maybe purged tapes were not used if data retention hadn’t expired, but Automatic Volume Recycling says otherwise and some tests on other purged tapes with expired data retention failed too.
Several days of searching similar issues later, it seemed clear that there was something wrong only with some purged tapes from the media catalog. So i came up with a quick test for each one of our purged tapes in order to make sure that they were labeled correctly. First, i loaded “A0000038″ and read its label:
# mtx -f /dev/sg4 load 22 Loading media from Storage Element 22 into drive 0...done # dd if=/dev/nst0 bs=65000 count=1 | od -a | head 0+0 records in 0+0 records out 0 bytes (0 B) copied, 3.81024 s, 0.0 kB/s 0000000
You can see that it failed to find any Bacula label, so next i tried the same test with another tape in full state that had been used the day before:
# mtx -f /dev/sg4 load 16 Loading media from Storage Element 16 into drive 0...done # dd if=/dev/nst0 bs=65000 count=1 | od -a | head 0+1 records in 0+1 records out 64512 bytes (65 kB) copied, 3.73563 s, 17.3 kB/s 0000000 G ; = < nul nul nul D nul nul nul nul B B 0 2 0000020 nul nul stx gs L etx O sub del del del ~ nul nul nul nul 0000040 nul nul nul 240 B a c u l a 1 . 0 i 0000060 m m o r t a l nl nul nul nul nul vt nul eot dc3 0000100 1 8 O si ff nul eot dc3 J + W stx p nul nul nul 0000120 nul nul nul nul nul nul nul nul nul nul nul nul nul A 0 0 0000140 0 0 0 0 6 nul nul D e f a u l t nul B 0000160 a c k u p nul L T O - 4 nul v m - h 0000200 o s t - 4 nul T a p e - A nul V e r 0000220 . 5 . 0 . 1 2 4 F e b r u
(Note that device /dev/nst0 has a minimum block size of 64512 bytes, otherwise dd would raise a Cannot allocate memory error.)
This time the Bacula label was found on the tape (the Bacula 1.0 immortal string is a clue), so we were now on the right track. Then i reviewed our wiki for the tape change procedure and there was an optional step in the tape purging section explaining how to blank a tape from another Bacula director before labeling it for this one. Therefore, it seemed that someone had been blanking the tapes after purging them and Bacula didn’t know what to make with them even though they had the right barcodes.
The fix was quick: i tested every purged tape to make a list of the blanked ones, then deleted them from the catalog and labeled them back from the bconsole:
* delete volume="A0000038" * label barcodes slots=22 pool=Default
Bacula is happy again, and we are too!