[Netarchivesuite-curator] NAS update for December
Sabine Schostag
sas at kb.dk
Tue Dec 5 10:23:43 CET 2017
Dear all.
Here is an update from KB DK:
Our fourth broad crawl for 2017 with a budget of 10 MB per domain started on November 14 and finished on November 23. We captured a little less than four TB.
Our event harvest of the local and regional elections on November 21 is almost finished. We will give the different definitions one or two more crawls.
Our election Facebook crawl will be run with Archive-It; we calculated that we could crawl about 1000 Facebook profiles within our account budget. Setting up the crawl takes quite some time. We are deliberately running the Facebook crawl after the elections, as we will be able to capture content retrospectively.
As mentioned before, we also used BCWeb for the election harvest. As BCWeb was only accessible internally at KB, this is something of a pilot project for the use of BCWeb with a colleague outside Netarchive. In the next couple of weeks, we will evaluate the different elements of the event harvest.
Best,
Sabine
From: Netarchivesuite-curator [mailto:netarchivesuite-curator-bounces at ml.sbforge.org] On Behalf Of peter.stirling at bnf.fr
Sent: Tuesday, November 07, 2017 10:37 AM
To: netarchivesuite-curator at ml.sbforge.org
Subject: [Netarchivesuite-curator] BnF NAS update for November
Hello all,
Our 2017 broad crawl was launched on 16 October. The settings are 1500 URLs per domain, with a limit of 3 days per job. Our prediction of the overall volume, based on our tests, seems to have been an underestimate: we had calculated around 77 TB with these settings, and after three weeks of crawling we are now expecting a final volume of around 97 TB. This is still within our overall storage budget, but we are keeping a close watch on the volume of data collected. So far we have encountered no major problems: both H3 and the new infrastructure are functioning correctly.
We are also continuing to work on updating our full-text indexing process with the aim of indexing our news crawls since 2016. We have been updating the indexing schema to follow recent developments on warc-indexer and we will be working on the organisation of the index to improve query performance. The research project that will use this index to study neologisms is starting this week, so we will be working closely with a research engineer over the next few weeks.
Best regards,
The BnF digital legal deposit team