[Netarchivesuite-users] Your URI/sec and KB/sec figures?

Peter Svanberg Peter.Svanberg at kb.se
Mon Jun 24 17:05:06 CEST 2019


I'll continue with my curiosity; I hope that's OK.

You mean about 70 TB fetched in about 100-120 days? (Or was the selective "mega" crawl included in the 70?)

And 20 TB is what ends up being stored in the archive?

Approximately how many URIs does this correspond to, before and after deduplication?

We have been advised to do the broad crawl in several steps with increasing max thresholds. Is that what you do in your two steps? With what different thresholds (and at what levels)?

And, with reference to the subject line: what are your typical URI/sec and KB/sec figures in a single job?

Kind regards,

Peter Svanberg


From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: Monday, June 24, 2019 3:16 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

The 70 TB is based on the NAS GUI/crawl-log numbers, before deduplication and gzip compression; about 20 TB of gzipped data is uploaded.
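(For scale, the figures above imply roughly what fraction of crawled bytes survives deduplication and gzip. A minimal sketch, assuming the 70 TB crawled / 20 TB uploaded numbers from this thread:)

```python
# Rough storage ratio implied by the thread's figures (assumption:
# ~70 TB crawled before deduplication/compression, ~20 TB gzipped
# and uploaded; both deduplication and gzip contribute to the drop).
crawled_tb = 70
stored_tb = 20
ratio = stored_tb / crawled_tb
print(f"Stored fraction: {ratio:.0%}")  # roughly 29% of crawled bytes
```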

A broad crawl runs about 2 to 2½ months. We do some job follow-up during step 2 (this part takes about 1½ months), as well as the selective broad-crawl job "mega big sites" (it runs for a month or more; there we use a different queue-assignment policy and much lower delays, and harvest only domains that can take a huge number of crawl requests!)

Best regards
Tue

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Monday, June 24, 2019 2:54 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

Thank you Tue, this is very interesting information!

About 70 TB in how many days?

You emphasize "harvested": do you mean that more data is downloaded than is archived (duplicates/irrelevant material sorted out)?

I'll get back when I have gathered the corresponding info about our environment.

Regards,

-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: Monday, June 24, 2019 12:22 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

Hi Peter

We currently have only minor performance issues during harvesting. We have almost finished our second broad crawl this year; it will end up at between 60 and 70 TB of harvested pages.
Our harvesting capacity is 90-100 Heritrix harvesters, including some virtual Umbra harvesters.
We are using physical servers for the broadcrawl harvesters and virtual servers for selective harvesters.
The five physical servers each have:
32 GB RAM, 24 CPUs, 4 TB local storage
The five virtual servers, using NFS, each have:
20 GB RAM, 8 CPUs, and 3 TB NFS storage
On each server we have between 8 and 10 Heritrix instances running, excluding the Umbra harvesters, of which there is only one per server.
Between the harvesters and the web we have a firewall, and we have throttling agreements with about five web hotels, because they blocked/throttled our harvesters.

Best regards
Tue


From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Monday, June 24, 2019 11:39 AM
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

Hello!

I discovered a Heritrix mailing list(*). Among some interesting tips on making crawls faster, I also read some speed figures far beyond anything we ever get. So I ask you: what speed values do you get?

Our latest 19 selective harvests have the following figures (from crawl-report.txt in each job's metadata WARC file):

URIs/sec: slowest job 0.83; fastest job 9.8; average 5.11
KB/sec: slowest 34; fastest 863; average 313
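(In case it's useful to others: aggregates like these can be pulled out of each job's crawl-report.txt programmatically. A minimal sketch; the exact line format is an assumption, so adjust the regex to whatever your reports actually contain. Decimal commas, as in our figures, are normalized.)

```python
# Minimal sketch: extract URIs/sec and KB/sec from crawl-report.txt
# bodies and average them across jobs. The line format assumed here
# ("URIs/sec: 0,83") is an illustration, not a guaranteed Heritrix format.
import re
import statistics

def parse_rates(report_text):
    """Return a dict with 'URIs/sec' and/or 'KB/sec' for the rates found.

    Accepts both decimal points and decimal commas (e.g. '0,83')."""
    rates = {}
    for key in ("URIs/sec", "KB/sec"):
        m = re.search(re.escape(key) + r"\s*:?\s*(\d+(?:[.,]\d+)?)", report_text)
        if m:
            rates[key] = float(m.group(1).replace(",", "."))
    return rates

# Hypothetical report snippets standing in for real crawl-report.txt files.
reports = [
    "URIs/sec: 0,83\nKB/sec: 34",
    "URIs/sec: 9,8\nKB/sec: 863",
]
uri_rates = [parse_rates(r)["URIs/sec"] for r in reports]
kb_rates = [parse_rates(r)["KB/sec"] for r in reports]
print(f"URIs/sec average: {statistics.mean(uri_rates):.2f}")
print(f"KB/sec average: {statistics.mean(kb_rates):.1f}")
```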

(I realize that, besides the NAS/Heritrix configuration, this depends greatly on hardware, memory, disk I/O, network capacity, etc., but I don't know which such figures are most relevant to add to this comparison. Suggestions?)

* https://groups.yahoo.com/neo/groups/archive-crawler/conversations/messages