<font size=2 face="sans-serif">Hello all,</font><br><br><font size=2 face="sans-serif">Our annual broad crawl is still underway.
It began on 8th October and must finish before the end of December. We
have already crawled more than 68.52 TB. The crawl is taking more time
than last year because of several technical problems.</font><br><br><font size=2 face="sans-serif">One problem came from an unresponsive
Heritrix process: the process was no longer reachable, even by the
HarvestController. The communication port was blocked and new instances
of Heritrix were created but stopped instantly. Bert, from the IT team,
has activated a new monitoring script that will kill the HarvestController
in case of a hung Heritrix process.</font><br><br><font size=2 face="sans-serif">The second source of problems came from
the infrastructure: the disks were saturated, causing excessive latency
for both read and write operations. As a result, we had to reduce the
number of threads, which slows down the crawl
overall. </font><br><font size=2 face="sans-serif">Other infrastructure problems: </font><br><font size=2 face="sans-serif">- A CPU on a physical machine failed.
The physical machine was removed from the pool and all its virtual machines
(VMs) were moved to another physical machine, but the network connection was
lost for four VMs during the migration. Consequently, several jobs failed.
We also had to replace a virtual hard disk on a VM; because the
deduplication index failed to copy, several jobs were launched and then failed.</font><br><font size=2 face="sans-serif">- Due to an oversight, the maximum number
of files that a process can have open at the same time was not
increased for the broad crawl, so several jobs were launched and then failed.
</font><br><br><font size=2 face="sans-serif">But the most important problem comes
from the broker. Several times during the crawl, the broker crashed, causing
all active jobs to fail, both those of the broad crawl and those
of the selective crawls. So far we have been unable to find the cause
of the crashes. The saturated hard disks, with their latencies, may be responsible
for it. Bert will investigate further.</font><br><br><font size=2 face="sans-serif">Best regards,</font><br><font size=2 face="sans-serif">The BnF digital legal deposit team</font><font face="sans-serif"><hr />
</font>