[Netarchivesuite-curator] KB/SB NAS update for July to September

Sabine Schostag sas at statsbiblioteket.dk
Mon Sep 14 17:39:50 CEST 2015

Dear all,
Here is what we worked on at Netarchive from July to September.

  *   Broad crawl: Last week we finished the first step of our 3rd broad crawl (limit 10 MB)
  *   Event crawls:
     *   We finished the event crawl on the parliamentary elections at the end of June, but continued harvesting social media profiles connected to this event crawl.
     *   Last week we started a new event crawl on the European refugee crisis, that is to say mainly social media activities connected to the Danish handling of refugees.
     *   We are preparing an event crawl on the referendum on the Danish opt-out on rules for political asylum etc.
  *   Fulltext search: Colleagues from the IT department at SB have experimented with image search. Some of the curators at KB/SB have seen a demonstration: it looks quite exciting.
  *   Harvest problems: We had problems with both our broad and selective crawls in July. We harvested almost nothing for about one week.
  *   Access policy and strategy: We started working on this issue.
  *   Newspaper paywalls: We have problems harvesting content behind paywalls. We need IP validation access, because we do not have enough technical resources to implement other solutions. We are now focusing on the following question:
     *   Can Netarchive offer to pay for possible expenses for the implementation of IP validation?
  *   Curator-seminar, held in September: How do we want Netarchive to be in 5 years, 10 years? What is realistic?

Sabine Schostag
Web curator
STATE AND UNIVERSITY LIBRARY
Victor Albecks Vej 1
VAT NO. 1010 0682
DIRECT +45 8946 2148

From: Netarchivesuite-curator [mailto:netarchivesuite-curator-bounces at ml.sbforge.org] On Behalf Of peter.stirling at bnf.fr
Sent: Friday, September 11, 2015 3:07 PM
To: netarchivesuite-curator at ml.sbforge.org
Subject: [Netarchivesuite-curator] BnF NAS update for September

Hello all,

First of all, we're pleased to announce that Marie Chouleur arrived on the 1st September to take up the post of head of digital legal deposit at the BnF.

During July and August, we performed technical tests and started a trial run for the broad crawl:
- we tried to include IDNs: some of them had to be rewritten in correct UTF-8 syntax, but this did not work with NAS and Heritrix, so we will have to wait for Heritrix 3 to crawl these specific domains.
- the new storage array was delivered and different kinds of configurations were tried before finding the right one. There were some communication problems between the crawlers and the array.
- we changed the operating system of the servers from CentOS 5 to CentOS 6, which turned out to be a lot of work. At first, we put CentOS 6 on the crawlers, but access to the indexer performed much worse than under the old system. As a consequence, each job almost stopped working after two or three hours. We tried several configurations before we eventually moved the index to an external NFS server on the storage array.

Our engineers have now solved these problems and are doing some final tests. We hope to be able to start the real crawl in mid-September.

Best regards,
The BnF digital legal deposit team
