[Netarchivesuite-curator] NAS September update from KB/SB

Sabine Schostag sas at statsbiblioteket.dk
Mon Sep 17 15:20:11 CEST 2012


Hi all,

Here is the September update from Netarchive:


§  Our third broad crawl of 2012 is well under way: the first step (with a limit of 10 MB per domain) started on Aug. 15th, and the second step (with a limit of 8 GB per domain) started on Sept. 3rd. So far we have harvested about 18 TB.


§  Some jobs from our most frequently run harvest definitions (6 times a day) have been a little tricky: they did not stop, but were just “hanging” when they were 99 % done. We located the problem: domains belonging to one of the big Danish media groups. Fortunately, they cooperated with us to solve the problem.


§  We have just started a new event crawl on a tax case involving the Danish Prime Minister’s husband. This event is of potential interest because a commission has been set up to investigate an alleged political leak, and the role of the press is also part of the case.


§  We are working on an article for the library journal “Microform and Digitisation Review” on the curatorial aspects of the work with Netarchive.


Best,
Sabine


SABINE SCHOSTAG
LIBRARIAN, WEB CURATOR
DIRECT +45 8946 2148

THE NETARCHIVE

STATSBIBLIOTEKET

STATE AND UNIVERSITY LIBRARY
VICTOR ALBECKS VEJ 1
8000 AARHUS C
DENMARK

VAT NO. 1010 0682


From: netarchivesuite-curator-bounces at ml.sbforge.org [mailto:netarchivesuite-curator-bounces at ml.sbforge.org] On Behalf Of peter.stirling at bnf.fr
Sent: Tuesday, September 04, 2012 10:32 AM
To: netarchivesuite-curator at ml.sbforge.org
Cc: DBN_DLWEB at bnf.fr
Subject: [Netarchivesuite-curator] BnF NAS Update for September


Hello all,

Here is our update for September.

This summer, BnF launched a new type of harvest. We had observed that blog platforms were not well represented in our broad crawl because of the small budget dedicated to each domain. So we prepared a selective crawl of 16 well-known French platforms (such as free.fr, skyrock.com, typepad.com). We extracted the names of sites hosted on these domains from all the host reports found in NAS (that is, reports from 2010 to 2012) and kept only those that are still active. This gave us a list of 430 000 seeds, which we harvested over a period of two weeks. We still need to do quality assurance.

Best regards,

The BnF digital legal deposit team