[Netarchivesuite-curator] Brief November update from Netarchive

Sabine Schostag sas at statsbiblioteket.dk
Mon Nov 11 16:16:12 CET 2013


Dear all,

Here, in brief, is what we are working on at Netarchive:

We upgraded our production environment to version 4.2. I am sure the new features will save us some time. :)

We are working on improving our documentation. I am quite sure that we can migrate most of it to NAS using the extended fields. Andreas from ONB has promised to fix the bugs in the extended fields, so we are looking forward to the next upgrade of NAS and hope that Andreas will find the time for the bug fixes before the next release in January.

We started our 4th broad crawl of 2013 about one week ago.

At the beginning of October we started an event crawl on the local and regional elections, which will take place on the 19th of November.

Best, Sabine


SABINE SCHOSTAG
LIBRARIAN, WEB CURATOR
DIRECT +45 8946 2148

THE NETARCHIVE
STATSBIBLIOTEKET
STATE AND UNIVERSITY LIBRARY
VICTOR ALBECKS VEJ 1
8000 AARHUS C
DENMARK

VAT NO. 1010 0682

From: netarchivesuite-curator-bounces at ml.sbforge.org [mailto:netarchivesuite-curator-bounces at ml.sbforge.org] On Behalf Of peter.stirling at bnf.fr
Sent: Friday, November 08, 2013 5:35 PM
To: netarchivesuite-curator at ml.sbforge.org
Subject: [Netarchivesuite-curator] BnF NAS update for November


Hello,

We started our 2013 broad crawl on the 21st of October, with a list of just over 4 million domains. We have had a few problems:

- New management hardware in the storage rack had some unpredictable side effects. Under heavy load, writes to ARC files were delayed for too long, which caused timeout errors in Heritrix. We therefore had to reduce the number of threads per job. This has worked, although we still have to be careful when launching our selective crawls.
- The broad crawl also caused problems for the creation of the deduplication index for our daily crawl of subscription press. We now put the jobs from the broad crawl on pause each day while the index is created, which means we lose about 40 minutes per day.
- We discovered a bug in the new version of NAS that affected the number of configurations per job in the broad crawl: when a selective crawl was launched before all the jobs of the broad crawl had been created, the number of configurations per job was set to the number used in the selective crawls (500 instead of 3,500). The bug seems to occur at random, so we hadn't seen it during our tests. We suspended the launch of selective crawls until all the jobs for the broad crawl had been created, and will need to do the same for the second stage of the broad crawl.

We have managed to work round these problems and the crawl is continuing, but the first stage will take slightly longer than planned.

Also during November we will be starting a crawl for the centenary of World War I. The first crawl will be based on the official commemoration sites, and we will be expanding the scope for subsequent crawls during the period 2014-2018.

Best regards,
The BnF digital legal deposit team






________________________________

Take part in the Grande Collecte 1914-1918<http://www.bnf.fr/fr/la_bnf/anx_actu_bib/a.grande_collecte_14-18.html>
