[Netarchivesuite-curator] Netarchive NAS update for March

Sabine Schostag sas at kb.dk
Tue Apr 4 07:51:29 CEST 2017


Dear all,

Here is a brief update from KB, Denmark.



On March 8 we started our first broad crawl for 2017; the first step has a budget limit of 10 MB per domain. We had many problems with this first broad crawl using Heritrix 3 and NAS 5.2.2. Most likely one of the problems was the job scheduling: jobs changed their state, and there was a lot of manual firefighting work. Step one finished on March 26.

With our new strategy for the selective crawls, we had stopped the front-page-only crawls that ran 6 times a day for news sites, as we were afraid of overloading the site owners' servers. A couple of weeks ago we resumed the 6 daily front page crawls for the national news sites, so far without complaints from the site owners.

We have NSF performance problems with the Wayback calendar display, and we still cannot display pages that use the https protocol.

Because of the way it works, the free text search index can lag 3-4 months behind. At the moment it is about 2 weeks behind.



Best,

Sabine





________________________________
From: Netarchivesuite-curator <netarchivesuite-curator-bounces at ml.sbforge.org> on behalf of peter.stirling at bnf.fr <peter.stirling at bnf.fr>
Sent: 24 March 2017 16:30
To: netarchivesuite-curator at ml.sbforge.org
Subject: [Netarchivesuite-curator] BnF NAS update for March

Hello all,

After performing our last tests on NetarchiveSuite 5.3 and Heritrix 3, we went into production and started our first crawls this week! We will give more details in our next update.

The beginning of the year is also the time for writing our annual report. In 2016, we crawled 125.47 TB of data, including the largest broad crawl in our collection (90.5 TB). This year we chose to study the top level domains (TLDs) in the broad crawl to measure the impact of including new regional TLDs in the seed list. The use of a TLD varies from one region to another (commercial purposes, public purposes, personal websites...), and the number of active websites is not proportional to the geographical area. We also analysed EPUB files, as we did last year, to see whether anything has changed: their number is quite similar, but the number of domains hosting them is growing. Overall, we exceeded our predictions because of the increase in the average size of the harvested files.

Best regards,
The BnF digital legal deposit team