[Netarchivesuite-curator] BnF NAS update for December

peter.stirling at bnf.fr
Tue Dec 5 10:26:23 CET 2017


Hello all,

Our first broad crawl with NAS5 and H3 is finished! We crawled 101.55 TB 
in 6 weeks. We encountered 4 problems during this crawl:

- a storage saturation problem with our new infrastructure (we lost 16 
jobs from the broad crawl and a few jobs from selective crawls)
- an out-of-memory problem on the GUI and the broker (with no data loss)
- the use of public_suffixes.dat, introduced in NAS5, made H3 create many 
per-host queues for the domain blogspot.com instead of a single queue for 
the domain
- some second-level TLDs were also created as domains, which broadened the 
crawl scopes
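
To illustrate the queue problem in the third point, here is a minimal 
sketch of how a suffix list can change the way hosts are grouped. It is 
not the NAS or H3 code: the function names and the two small suffix sets 
are hypothetical stand-ins for public_suffixes.dat, and it assumes the 
list treats blogspot.com as a suffix, as the behaviour we observed 
suggests.

```python
from collections import defaultdict

# Toy illustration only: NOT the NetarchiveSuite or Heritrix 3 implementation.
# The suffix sets below are hypothetical stand-ins for public_suffixes.dat.


def registered_domain(host, suffixes):
    """Return the longest matching suffix plus one extra label to its left."""
    labels = host.lower().split(".")
    # Candidates go from longest to shortest, so the first hit is the
    # longest suffix that matches.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return candidate  # the host is itself a suffix
            return ".".join(labels[i - 1:])
    # No suffix matched: fall back to the last two labels (an assumption).
    return ".".join(labels[-2:])


def group_by_domain(hosts, suffixes):
    """Group hosts under the domain that would own their queue."""
    groups = defaultdict(list)
    for host in hosts:
        groups[registered_domain(host, suffixes)].append(host)
    return dict(groups)


hosts = ["alice.blogspot.com", "bob.blogspot.com", "carol.blogspot.com"]

# Without blogspot.com as a suffix, all blogs share one domain.
without_blogspot = group_by_domain(hosts, {"com"})

# With blogspot.com as a suffix, every blog becomes its own domain.
with_blogspot = group_by_domain(hosts, {"com", "blogspot.com"})

print(sorted(without_blogspot))  # ['blogspot.com']
print(sorted(with_blogspot))
```

In this sketch the three example hosts fall into one group when only 
"com" is a suffix, and into three separate groups once "blogspot.com" is 
also listed, which is the kind of per-host split we saw in the queues.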

We received only 5 complaints from web publishers compared to around 15 in 
2016. During the coming weeks, we are going to analyse the crawl reports 
and the quality of the archives to produce a report on the crawl.

In parallel, we had scheduling issues: our daily news crawls stopped three 
times. Two jobs were submitted with the same ID, which changed the status 
of the selective harvest from active to inactive.

Best regards,
The BnF digital legal deposit team

