<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none"><!-- p { margin-top: 0px; margin-bottom: 0px; }--></style>
</head>
<body dir="ltr" style="font-size:12pt;color:#000080;background-color:#FFFFFF;font-family:Calibri,Arial,Helvetica,sans-serif;">
<p>Dear all,</p>
<p>Here is a brief update from KB, Denmark.</p>
<p> </p>
<p>On March 8 we started our first broad crawl for 2017; the first step had a budget limit of 10 MB per domain. We had a lot of problems with this first broad crawl using Heritrix 3 and NAS 5.2.2. Most likely one of the problems was job scheduling: jobs changed
their state and there was a lot of manual “putting out fires” work. The crawl finished on March 26.</p>
<p><br>
Under our new strategy for selective crawls, we had stopped running front-page-only crawls of news sites 6 times a day, because we were afraid of overloading the site owners’ servers. A couple of weeks ago we resumed the 6 daily front-page crawls for
the national news sites, so far without complaints from the site owners.</p>
<p><br>
We have NSF performance problems with the Wayback calendar display, and we still can’t display pages that use the HTTPS protocol.</p>
<p><br>
The free-text search index can lag 3-4 months behind because of the way it works. At the moment it is about 2 weeks behind.</p>
<p> </p>
<p>Best,</p>
<p>Sabine</p>
<p> </p>
<p> </p>
<p></p>
<div style="color: rgb(33, 33, 33);">
<hr tabindex="-1" style="width: 98%; display: inline-block;">
<div id="divRplyFwdMsg" dir="ltr"><font color="#000000" face="Calibri, sans-serif" style="font-size: 11pt;"><b>From:</b> Netarchivesuite-curator &lt;netarchivesuite-curator-bounces@ml.sbforge.org&gt; on behalf of peter.stirling@bnf.fr &lt;peter.stirling@bnf.fr&gt;<br>
<b>Sent:</b> 24 March 2017 16:30<br>
<b>To:</b> netarchivesuite-curator@ml.sbforge.org<br>
<b>Subject:</b> [Netarchivesuite-curator] BnF NAS update for March</font>
<div> </div>
</div>
<div><font face="sans-serif" size="2">Hello all,</font><br>
<br>
<font face="sans-serif" size="2">After completing our final tests of NetarchiveSuite 5.3 and Heritrix 3, we went into production and started our first crawls this week! We will give more details in our next update.</font><br>
<br>
<font face="sans-serif" size="2">The beginning of the year is also the time for writing our annual report. In 2016 we crawled 125.47 TB of data, including the largest broad crawl in our collection (90.5 TB). This year we chose to study the top-level domains
(TLDs) in the broad crawl to measure the impact of including new regional TLDs in the seed list. The use of a TLD varies from one region to another (commercial purposes, public purposes, personal websites...), and the number of active websites is not proportional
to the geographical area. We also analysed EPUB files, as we did last year, to see whether anything has changed: their number is about the same, but the number of domains hosting them is growing. Overall, we exceeded our predictions because of the increase
in the average size of the harvested files. </font><br>
<br>
<font face="sans-serif" size="2">Best regards,</font><br>
<font face="sans-serif" size="2">The BnF digital legal deposit team</font><br>
<font face="sans-serif">
</font></div>
</div>
</body>
</html>