[Netarchivesuite-curator] BnF NAS update for December

peter.stirling at bnf.fr peter.stirling at bnf.fr
Tue Dec 4 11:48:55 CET 2018

Hello all,

Our annual broad crawl is still underway. It began on 8th October and must 
finish before the end of December. We have already crawled more than 68.52 
TB. The crawl is taking more time than last year because of several 
technical problems.

One problem came from an unresponsive Heritrix process: the Heritrix 
process was no longer reachable even by the HarvestController. The 
communication port was blocked and new instances of Heritrix were created 
but stopped instantly. Bert, from the IT team, has activated a new 
monitoring script that will kill the HarvestController in case of a hung 
Heritrix process.
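
For readers who want to picture such a watchdog, here is a minimal sketch 
in Python. It is not the script Bert activated: the host, port, PID and 
failure threshold are hypothetical, and it assumes that "hung" can be 
detected by a TCP connection to the Heritrix communication port failing 
several times in a row.

```python
import os
import signal
import socket


def port_is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def should_kill(check_results: list[bool], max_failures: int = 3) -> bool:
    """Return True only when the last max_failures checks all failed."""
    recent = check_results[-max_failures:]
    return len(recent) == max_failures and not any(recent)


def kill_process(pid: int) -> None:
    """Send SIGTERM to a process (here, a hypothetical HarvestController PID)."""
    os.kill(pid, signal.SIGTERM)
```

A scheduler such as cron could call port_is_reachable at regular 
intervals, keep the results in a list, and call kill_process once 
should_kill returns True. Requiring several consecutive failures avoids 
killing a harvest because of a single slow response.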

The second source of problems was the infrastructure: the disks were 
saturated, causing excessive latency for both read and write operations. 
We therefore had to reduce the number of threads, which slows down the 
crawl overall. 
Other infrastructure problems: 
- A CPU on a physical machine failed. The physical machine was removed 
from the pool and all its virtual machines (VMs) were moved to another 
physical machine, but the network connection was lost for four VMs during 
the migration. Consequently several jobs failed. Moreover, we had to 
replace a virtual hard disk on a VM and, because the deduplication index 
failed to copy, several jobs were launched and failed.
- Due to an oversight, the maximum number of files that a process can have 
open at the same time was not increased for the broad crawl, and several 
jobs were launched and failed. 
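
The open-files oversight in the last item is the kind of thing a 
pre-flight check can catch before jobs are launched. The sketch below is 
only an illustration using Python's standard resource module (Unix 
only); the required value is a hypothetical parameter, not the setting 
used at BnF.

```python
import resource  # Unix-only standard library module


def limit_is_sufficient(soft_limit: int, required: int) -> bool:
    """Return True if soft_limit allows at least `required` open files."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no limit is set
    return soft_limit >= required


def current_nofile_limit_ok(required: int) -> bool:
    """Check the open-file soft limit of the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return limit_is_sufficient(soft, required)
```

Calling current_nofile_limit_ok with the value expected for the broad 
crawl before starting a harvester would flag a limit left at its default.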

The most serious problem, however, comes from the broker. Several times 
during the crawl the broker crashed, causing all active jobs of the broad 
crawl, and even those of the selective crawls, to fail. So far it has been 
impossible to find the reason for the crashes. The saturated hard disks 
and their latency may be responsible. Bert will investigate further.

Best regards,
The BnF digital legal deposit team