[Netarchivesuite-curator] Netarchive NAS update for September/October 2016

Sabine Schostag sas at statsbiblioteket.dk
Fri Oct 21 17:13:04 CEST 2016

Hi all

Here is an update from KB/SB:

We have finished our “little” broad crawl:

Start time | Stop time (values not preserved in the archived message)

We have nearly finished the reorganization of our selective crawls according to the new strategy:

·         Daily crawls of all national news sites

·         Daily crawls of all regional news sites

·         Weekly crawls of all local news sites

·         Monthly crawls of political parties’ sites

·         Trimonthly crawls of ministries’ and administrative bodies’ sites

·         “Streamlining” of Twitter crawls

·         Analysis of depth and frequency for a crawl of organizations and associations

We have renewed our account at Archive-IT; it is intended for Facebook crawls.

NAS 5.2 has been released for developer testing. Testing by curators is planned for the end of October.

We are upgrading the Citrix installation that gives access to Wayback.

We have tested Ilya Kreymer's W/ARC player for displaying HTTPS pages: it works fine, but there are some security issues to be fixed.

Beyond that: business as usual.

Have a nice weekend.


From: Netarchivesuite-curator [mailto:netarchivesuite-curator-bounces at ml.sbforge.org] On Behalf Of Sabine Schostag
Sent: Wednesday, August 03, 2016 4:21 PM
To: 'peter.stirling at bnf.fr' <peter.stirling at bnf.fr>; 'netarchivesuite-curator at ml.sbforge.org' <netarchivesuite-curator at ml.sbforge.org>
Subject: Re: [Netarchivesuite-curator] Netarchive NAS update for August

Hi all,

Here follows an update from KB/SB:

We are still working on the reorganization of the selective crawls.

Following our new collection strategy (extension of the selective crawls and smaller broad crawls), we now collect all national Danish news media selectively, both newspaper websites and online-only news media.
We are investigating all local news media in order to decide the frequency and depth of future crawls.

Heritrix 3 is not able to archive Facebook profiles, but Archive-IT can collect them with an API. We will collect about 100 representative open Facebook profiles at Archive-IT; at the moment we are selecting the profiles.

All the best, Sabine

Sabine Schostag
Web curator
STATE AND UNIVERSITY LIBRARY
Victor Albecks Vej 1
VAT NO. 1010 0682

From: Netarchivesuite-curator [mailto:netarchivesuite-curator-bounces at ml.sbforge.org] On Behalf Of peter.stirling at bnf.fr
Sent: Wednesday, August 03, 2016 2:00 PM
To: netarchivesuite-curator at ml.sbforge.org
Subject: [Netarchivesuite-curator] BnF NAS update for August

Hello all,

As in previous years, in July we started to work on our annual broad crawl. We have asked for seed lists from our different partners (registrars, registers and BnF databases). In 2016, we have managed to expand the number of domains from TLDs for overseas French departments (.gf, .gp, .mq, .pf) and from regional TLDs (.alsace, .bzh, .paris): at this point, we have more than 36,000 domain names from these TLDs.

In July and August, we are using our tool nas-preload to deduplicate URLs and domains from the seven different identified sources and to check the DNS response, with the aim of transferring only active domains into NAS.
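
The deduplicate-then-check-DNS step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual nas-preload code: the function names (normalize_domain, resolves, active_domains) are hypothetical, host names are compared as-is (a leading "www." is not stripped), and "active" is taken to mean simply that the system resolver returns an address.

```python
import socket
from typing import Callable, Iterable, List
from urllib.parse import urlsplit


def normalize_domain(entry: str) -> str:
    """Reduce a seed entry (URL or bare host name) to a lowercase host name."""
    entry = entry.strip()
    try:
        # urlsplit only fills in the host when a "//" separator is present.
        host = urlsplit(entry if "://" in entry else "//" + entry).hostname
    except ValueError:
        # Malformed entry (for example a broken bracketed address): skip it.
        return ""
    return (host or "").rstrip(".")


def resolves(domain: str) -> bool:
    """Assumed activity check: True if the system resolver finds an address."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def active_domains(
    sources: Iterable[Iterable[str]],
    is_active: Callable[[str], bool] = resolves,
) -> List[str]:
    """Merge several seed lists, drop duplicates, keep only active domains."""
    seen = set()
    unique = []
    for source in sources:
        for entry in source:
            domain = normalize_domain(entry)
            if domain and domain not in seen:
                seen.add(domain)
                unique.append(domain)
    # Each distinct domain is checked once, in first-seen order.
    return [domain for domain in unique if is_active(domain)]
```

The activity check is passed in as a parameter so that the merging logic can be tried without network access, for example with a stand-in function in place of the real DNS lookup.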

We have just started to work on the migration from Heritrix 1 to Heritrix 3. We plan to achieve the first stage of non-regression at the end of February 2017. It is a big challenge as we also have to adjust other tools connected to Heritrix 3.

Best regards,
The BnF Digital Legal Deposit team

Exhibitions:
Miquel Barceló. Sol y sombra <http://www.bnf.fr/fr/evenements_et_culture/anx_expositions/f.miquel_Barcelo.html> - 22 March 2016 to 28 August 2016 - BnF - François-Mitterrand
