Questions about Crawling-Infrastructure

aponb at gmx.at aponb at gmx.at
Tue May 5 14:25:01 CEST 2020

Hi all,

as I said in the Zoom conference, I would like to know how the
situation is at your library. I also asked on Slack; you can see
the post here:
https://iipc.slack.com/archives/C1U8Y0117/p1587978571050800
I have sent this post to both the devel and the curator list, so sorry
for cross-posting!

Some time ago our IT department installed a new firewall system. It took
us a while to realize that a proxy is active which prevents
clients from accessing certain pages. We asked the department to exclude
our crawlers from this proxy, and crawling of some of the blocked pages
worked again. Later we found that we were still blocked from certain
pages. We again asked for the problem to be fixed, but were told that
these pages are categorized as "Malware - High Risk" and that no client
is allowed to access them. Of course the categorization is done
automatically by querying some kind of malware database such as
https://www.brightcloud.com/tools/url-ip-lookup.php, which produces many
false positives.

So we now have the problem that we are unable to crawl some pages which
were selected for our event crawls. For the next domain crawl it also
means that we cannot crawl a complete domain. In fact we would crawl
the internal proxy page instead.
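
To illustrate the last point, here is a minimal sketch of how such
captures could be flagged after a crawl. It assumes the proxy's page
contains a recognisable text marker; the marker strings and the function
names below are hypothetical and would have to be adapted to the actual
proxy page.

```python
# Hypothetical marker strings (assumption): the real proxy page will
# contain different text, so these need to be adapted.
BLOCK_PAGE_MARKERS = (
    "Malware - High Risk",
    "Access to this page has been blocked",
)


def looks_like_proxy_block_page(body, markers=BLOCK_PAGE_MARKERS):
    """Return True if the captured HTML contains any known marker."""
    lowered = body.lower()
    return any(marker.lower() in lowered for marker in markers)


def find_blocked_captures(captures):
    """Take (url, html_body) pairs and return the URLs whose body
    looks like the proxy page rather than the real content."""
    return [url for url, body in captures
            if looks_like_proxy_block_page(body)]
```

Running something like this over the captured HTML of a crawl would give
a list of seeds that need to be re-crawled once the block is lifted.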

The only solution our IT department has offered us is to move the whole
web archiving infrastructure into an isolated network with independent
internet access. That includes crawling, storage and access. Whether
this can be done depends on the costs and resources, which are being
evaluated now. A result is not expected very soon.

Have you already experienced this kind of problem at your institution,
and if yes, how did you handle it?
Are your crawlers allowed to access the internet without a proxy?
What is the opinion of your IT department about crawling possible
malware pages without a proxy?
Do you have special protection on the systems where you store your
(W)ARC files?
On our web archive terminals we use our default virus scanner to
prevent malware from being executed. How are your access terminals
secured?

Thanks for your time!

