[Netarchivesuite-curator] Institution update from KB DK

Sabine Schostag sas at kb.dk
Tue Oct 8 11:44:02 CEST 2019


Dear all.

Here is a brief update from Netarchive:

Broad crawl
We finished our third broad crawl for 2019 (with a limit of 50 MB in step 1 and 16 GB in step 2) on 10 September. In 602 jobs we harvested a total of about 93 TB, or 187 million objects. Many sites are blocking us; we plan to address that by giving our new broad crawl harvesters new IP addresses and updating our throttling firewall rules. Simultaneously we ran the selective crawls connected to the broad crawls: Research databases, Municipalities and regions, Ministries and Government agencies, YouTube.
Now we are doing the "cleaning up" and making improvements to prepare for the next broad crawl.

Selective crawls
Getting IP-validated access to content behind paywalls is still a big issue; the difficulty is getting in touch with the right person at each website owner.

Ongoing projects

-      Implementation of SolrWayback

-      New user access procedure and form

-      Data mining/extractions from the archive: confirming with our legal department that we follow all relevant laws

-      Visual QA of HTTPS seeds: configuration of PCs for reading WARC files

On behalf of the Netarchive Team


Best, Sabine
