Arxiu d'etiquetes: bacula

Please mount Volume or label a new one?

A Monday morning we found several messages like this from Bacula:

09-Sep 18:27 Tape-A JobId 15196: Please mount Volume "A0000038" or label a new one for:
    Job:          sites.2012-09-08_01.50.03_40
    Storage:      "LTO-4" (/dev/nst0)
    Pool:         Default
    Media type:   LTO-4

That was Bacula complaining about the wrong volume on the drive. However, after examining the status of storage the weird thing was that the tape with barcode “A0000038” was actually mounted on the drive, it was purged and had 0 bytes used… WTF?

We started suspecting our tape rotation algorithm was not taking into account that maybe purged tapes were not used if data retention hadn’t expired, but Automatic Volume Recycling says otherwise and some tests on other purged tapes with expired data retention failed too.

Several days of searching similar issues later, it seemed clear that there was something wrong only with some purged tapes from the media catalog. So i came up with a quick test for each one of our purged tapes in order to make sure that they were labeled correctly. First, i loaded “A0000038” and read its label:

# mtx -f /dev/sg4 load 22
Loading media from Storage Element 22 into drive 0...done
# dd if=/dev/nst0 bs=65000 count=1 | od -a | head
0+0 records in
0+0 records out
0 bytes (0 B) copied, 3.81024 s, 0.0 kB/s
0000000

You can see that it failed to find any Bacula label, so next i tried the same test with another tape in full state that had been used the day before:

# mtx -f /dev/sg4 load 16
Loading media from Storage Element 16 into drive 0...done
# dd if=/dev/nst0 bs=65000 count=1 | od -a | head
0+1 records in
0+1 records out
64512 bytes (65 kB) copied, 3.73563 s, 17.3 kB/s
0000000   G   ;   =   < nul nul nul   D nul nul nul nul   B   B   0   2
0000020 nul nul stx  gs   L etx   O sub del del del   ~ nul nul nul nul
0000040 nul nul nul 240   B   a   c   u   l   a       1   .   0       i
0000060   m   m   o   r   t   a   l  nl nul nul nul nul  vt nul eot dc3
0000100   1   8   O  si  ff nul eot dc3   J   +   W stx   p nul nul nul
0000120 nul nul nul nul nul nul nul nul nul nul nul nul nul   A   0   0
0000140   0   0   0   0   6 nul nul   D   e   f   a   u   l   t nul   B
0000160   a   c   k   u   p nul   L   T   O   -   4 nul   v   m   -   h
0000200   o   s   t   -   4 nul   T   a   p   e   -   A nul   V   e   r
0000220   .       5   .   0   .   1       2   4       F   e   b   r   u

(Note that device /dev/nst0 has a minimum block size of 64512 bytes, otherwise dd would raise a Cannot allocate memory error.)

This time the Bacula label was found on the tape (the Bacula 1.0 immortal string is a clue), so we were now on the right track. Then i reviewed our wiki for the tape change procedure and there was an optional step in the tape purging section explaining how to blank a tape from another Bacula director before labeling it for this one. Therefore, it seemed that someone had been blanking the tapes after purging them and Bacula didn’t know what to make with them even though they had the right barcodes.

The fix was quick: i tested every purged tape to make a list of the blanked ones, then deleted them from the catalog and labeled them back from the bconsole:

* delete volume="A0000038"
* label barcodes slots=22 pool=Default

Bacula is happy again, and we are too! 🙂

Reduir els correus de Bacula

El Bacula és un sistema de gestió de backups professional en programari lliure (també es pot contractar suport empresarial, si cal). En la configuració predeterminada és costum enviar un correu per cada treball que indiqui si ha finalitzat correctament o no. Però en un entorn amb una pila de servidors i diversos treballs per servidor, això implica rebre diàriament molts correus que habitualment indiquen que tot ha anat bé. En el meu cas, són entre 60-70 correus diaris.

Afortunadament, si hom disposa d’un sistema de monitoratge compatible amb Nagios, pot utilitzar un connector que examina els logs del bacula director per veure si s’ha produït algun error. Per tant, ja no cal seguir rebent aquest allau diari de correus que indiquen que els treballs han acabat bé. Per fer-ho només cal que canvieu la configuració dels Messages anomenats Standard al fitxer /etc/bacula/bacula-dir.conf i on posava «mail» hi poseu «mail on error»:

Messages {
  Name = Standard
  mailcommand = "/usr/lib/bacula/bsmtp -h localhost -f \"\(Bacula\) \<%r\>\" -s \"Bacula: %t %e of %c %l\" %r"
  operatorcommand = "/usr/lib/bacula/bsmtp -h localhost -f \"\(Bacula\) \<%r\>\" -s \"Bacula: Intervention needed for %j\" %r"
  mail on error = bacula@example.com = all, !skipped
  operator = bacula@example.com = mount
  console = all, !skipped, !saved
  append = "/var/lib/bacula/log" = all, !skipped
  catalog = all
}

Amb aquest canvi, tots els treballs que utilitzin els missatges estàndard passaran a enviar correus només en cas que es produeixi algun error. Però potser us interessa que els treballs de recuperació sí que notifiquin si han acabat bé i així podeu evitar d’estar pendents dels logs, oi? Doncs és ben fàcil també: només cal crear un nou tipus de missatge pels treballs de recuperació que tingui «mail» enlloc del «mail on error» i indicar-ho a la secció corresponent:

Job {
  Name = "RestoreFiles"
  Type = Restore
  Client = bacula-fd
  Storage = Tape
  FileSet = "Full Set"
  Pool = Default
  Messages = Restore
  Where = /tmp/bacula-restores
}

Messages {
  Name = Restore
  mailcommand = "/usr/lib/bacula/bsmtp -h localhost -f \"\(Bacula\) \<%r\>\" -s \"Bacula: %t %e of %c %l\" %r"
  operatorcommand = "/usr/lib/bacula/bsmtp -h localhost -f \"\(Bacula\) \<%r\>\" -s \"Bacula: Intervention needed for %j\" %r"
  mail = bacula@example.com = all, !skipped
  operator = bacula@example.com = mount
  console = all, !skipped, !saved
  append = "/var/lib/bacula/log" = all, !skipped
  catalog = all
}

Es tracta d’un petit canvi que pot augmentar significativament la productivitat d’un equip de sysadmins amb una pila de servidors dels quals es fan còpies de seguretat cada dia.