[Netarchivesuite-curator] Monthly update from KB DK

Sabine Schostag sas at kb.dk
Tue Dec 4 10:47:59 CET 2018


Dear all,

Here is a brief update on Netarchive’s production activities.

Broad crawl
Step 2 of our third broad crawl (with a data limit of 14 GB per domain) is still ongoing. It is progressing rather slowly. The reason might be the growing centralization of web hosting sites. We also have problems with scheduling and running jobs, and with monitoring the broad crawls in “GUI open”.
Selective crawls
We often run into problems that we cannot solve without developers’ assistance, e.g.:

-          IP-validated access to content behind paywalls (the website owner claims to have established the access, but it does not work)

-          Quite a few websites are blocking our crawlers, even though they are obliged to give access under the legal deposit law


Event crawl
We ran our mini-event harvest “Week 46”: websites of local broadcast stations (both radio and television).

Special crawl
We had a follow-up to the special crawl for the manhunt by Danish police on 28 September, when the Danish Secret Service (PET) revealed that the Iranian Secret Service had been prevented from carrying out an assassination on Danish soil. We crawled foreign news media articles on the revelation.

On behalf of the Netarchive team

Best,
Sabine
