[Netarchivesuite-curator] KB DK NAS update for January

Sabine Schostag sas at kb.dk
Tue Jan 8 09:34:52 CET 2019

Dear all,

Here is an update from Netarchive:

Our 3rd broad crawl for 2018 finished on January 5. We harvested about 45 TB and more than 650 million documents in 344 harvest jobs. We noticed that our crawls run very slowly because of the throttling rules in the firewall, which we set up in order to reduce disturbances for the website owners.
Web hosting is becoming more and more centralized in Denmark. Therefore we made agreements for throttling with the web hosts.
We will rethink our strategy for the broad crawls; that is to say, we will most likely exclude more big sites (e.g. municipalities’ websites) from the “regular” broad crawls and crawl them selectively at the same frequency as the broad crawls.

We had problems with Wayback access: for ten days in December it accessed the live web instead of the archive. This happened in connection with an upgrade of our Citrix, and we had to roll back to the former version. Furthermore, Wayback (Blacklight) still does not perform very well: lots of images are missing and lots of sites using the https protocol are not displayed.
We are working on adjusting our procedures, documentation etc. according to the EU General Data Protection Regulation (GDPR).

We are looking forward to BNF’s improvements of BCWeb ☺

With best wishes for the year 2019 and for ongoing fruitful cooperation in our NAS community,
On behalf of the Netarchive team

