[Netarchivesuite-users] Your URI/sec and KB/sec figures?; Individual new limits

Tue Hejlskov Larsen tlr at kb.dk
Wed Jun 26 20:46:35 CEST 2019


You have 2 levels of byte limits - at the domain level and at the harvest definition level.
So when you run, e.g., broad crawl step 1 with a harvest byte limit of 50 MB, a domain will stop after 50 MB even if its domain byte limit is 200 GB, and its status will be set to "byte limit reached". If the domain byte limit is only 5 MB, it will stop after 5 MB instead. But if the domain completes within the 50 MB harvest limit, its status will be set to "completed".
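
In other words, the effective limit is the smaller of the two. A minimal sketch of that rule (just an illustration, not NetarchiveSuite code):

    # Illustrative sketch only - not NetarchiveSuite code. It mirrors the rule
    # described above: the effective per-domain limit is the smaller of the
    # harvest definition's byte limit and the domain's own byte limit.
    def effective_limit_and_status(harvest_limit, domain_limit, bytes_harvested):
        """Return (effective_limit, status) for one domain in one job."""
        effective_limit = min(harvest_limit, domain_limit)
        if bytes_harvested >= effective_limit:
            return effective_limit, "byte limit reached"
        return effective_limit, "completed"

    MB, GB = 10**6, 10**9
    # Step 1: 50 MB harvest limit, 200 GB domain limit -> stops at 50 MB.
    print(effective_limit_and_status(50 * MB, 200 * GB, 50 * MB))
    # Domain byte limit of only 5 MB -> stops at 5 MB.
    print(effective_limit_and_status(50 * MB, 5 * MB, 5 * MB))
    # Domain that finishes within the 50 MB harvest limit -> "completed".
    print(effective_limit_and_status(50 * MB, 200 * GB, 12 * MB))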

We do an SQL extract after step 2 to find all the domains which hit the byte limit at 2, 4, 6, 8, 10, 12, 14 or 16 GB and decide which ones we will increase (or lower) the domain byte limit for.
We also decide which huge domains (> 16 GB) to move to the selective broad harvest definition "mega_big_sites" or to other selective broad crawls with a special harvest setup.
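
As an illustration of what such an extract could look like (the table and column names here are hypothetical - the real NAS harvest database schema differs), something along these lines lists the step 2 domains that stopped on their byte limit, bucketed by harvested volume:

    # Hypothetical sketch of the post-step-2 extract described above.
    # Table and column names are illustrative, not the actual NAS schema;
    # an in-memory SQLite table stands in for the real job database.
    import sqlite3

    THRESHOLDS_GB = [2, 4, 6, 8, 10, 12, 14, 16]

    def bucket_gb(bytes_harvested):
        """Largest threshold (in GB) the domain has reached, or None."""
        reached = [t for t in THRESHOLDS_GB if bytes_harvested >= t * 10**9]
        return max(reached) if reached else None

    conn = sqlite3.connect(":memory:")  # stand-in for the job database
    conn.execute("CREATE TABLE domain_harvest_info "
                 "(domain_name TEXT, stop_reason TEXT, bytes_harvested INTEGER)")
    conn.executemany("INSERT INTO domain_harvest_info VALUES (?, ?, ?)",
                     [("example.dk", "byte limit reached", 4 * 10**9),
                      ("small.dk", "completed", 30 * 10**6)])

    rows = conn.execute(
        "SELECT domain_name, bytes_harvested FROM domain_harvest_info "
        "WHERE stop_reason = 'byte limit reached'").fetchall()

    for domain, harvested in rows:
        print(domain, "reached the", bucket_gb(harvested), "GB bucket")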

It's a way to avoid jobs running away into crawler traps or harvesting the whole www by mistake, and to scale up the broad crawl harvesting in a controlled way.

Best regards
Tue

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Wednesday, June 26, 2019 7:09 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?; Individual new limits

Thanks for your patience!

My aim with the URL statistics was to get some hints about what speed is needed (leading to a request for more resources from the IT department).

(I have split this into several follow-up e-mails with different subjects.)

Why do you have to do SQL handling to set new max limits individually? What is the difference between that and running step 2 with a new global limit?

Regards!

Peter

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: Monday, June 24, 2019 9:26 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

See my TLR comments below

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Monday, June 24, 2019 5:05 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

I'll continue with my curiosity - I hope that's OK.

You mean about 70 TByte fetched in about 100-120 days? (Or was the selective "mega" included in 70?)
TLR>>> Yes, we run broad crawl step 1 and step 2 and the selective broad crawl jobs in parallel, because we have 2 dedicated harvester farms in AAR and CPH.

And 20 TByte is what ends up being stored in the archive?

TLR>>> yes

Approximately how many URIs does this correspond to - before and after deduplication?

TLR>>> We are talking millions/billions of URLs. Just to mention: we have 1.1 billion URLs with return code 5003 ("byte limit reached") in step 2. All URLs are recorded in the crawl logs, and the deduplicated ones are annotated there as well.
So you need to specify which type of return-code URLs you want numbers for :). We do some daily statistics for some of the return codes. The numbers are really huge!

TLR>>> Deduplication saves approx. 40-50 % and gzip compression another 40-50 %.
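
As a back-of-the-envelope check (my own rough arithmetic, not exact figures), those two reductions roughly take the ~70 TB harvested down to the ~20 TB stored:

    # Rough back-of-the-envelope check of the reduction factors quoted above.
    harvested_tb = 70
    after_dedup = harvested_tb * (1 - 0.50)   # deduplication saves ~40-50 %
    after_gzip = after_dedup * (1 - 0.45)     # gzip saves another ~40-50 %
    print(round(after_gzip, 1), "TB stored")  # ~19 TB, close to the ~20 TB uploaded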

We have been advised to do the broad crawl in several steps with increasing max thresholds - is that what you do in 2 steps? With what different thresholds (and at what levels)?

TLR>>> Yes, we have the following steps:

1) Step 1: 50 MB byte limit; all domains in the job database; duration 1-2 weeks.
We do some SQL extraction from the job database and increase the max byte limit per domain for some 10,000-20,000 domains before each new step 2 broad crawl.

2) Step 2: 16 GB byte limit; all domains which have hit the 50 MB limit; duration 1-2 months.

3) In addition we run about 3-4 big selective broad crawl harvests in parallel, with different (huge) domains taken out of the step 2 broad crawl. They run 3-6 weeks each, and each harvest creates about 10-20 jobs running in parallel in AAR, together with the normal daily selective harvests.
So we are using most of our harvester capacity for long periods during the "broad crawl" (see the sketch below).
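
A rough sketch of that ladder (my own illustration with hypothetical function names - in practice the decisions are made via the SQL extracts and the GUI):

    # Illustrative sketch of the stepwise broad crawl ladder described above.
    # Function names are hypothetical; the limits mirror the ones in this mail.
    MB, GB = 10**6, 10**9

    def after_step1(hit_byte_limit):
        # Domains that hit the 50 MB limit in step 1 go on to step 2;
        # the rest are considered complete for this broad crawl.
        return "step 2 (16 GB byte limit)" if hit_byte_limit else "completed"

    def after_step2(bytes_harvested, hit_byte_limit):
        # Huge domains (> 16 GB) are moved to the selective "mega_big_sites"
        # harvest or another selective broad crawl; for the others the domain
        # byte limit is raised or lowered before the next broad crawl.
        if hit_byte_limit and bytes_harvested >= 16 * GB:
            return "move to 'mega_big_sites' or another selective broad crawl"
        return "raise or lower the domain byte limit for the next broad crawl"

    print(after_step1(hit_byte_limit=True))           # -> step 2
    print(after_step2(16 * GB, hit_byte_limit=True))  # -> mega_big_sites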

And with reference to the subject line ... what are your typical URI/sec and KB/sec figures in a single job?

TLR>>> I have not looked into that (we have between 50-90 different jobs per day), because we are using the NAS standard setup (you got a copy of that earlier).
Earlier the main problems were domains which blocked/throttled us, and that our capacity agreements with the biggest web hotels were too low. The biggest one (a .be company) hosts about 50-75 % of all .dk domains.
After we increased our max concurrent request agreements with them to 40 MB/sec for our harvester IP ranges in AAR and CPH and upgraded to NAS 5.5, we have had no big performance issues anymore.


Kind regards

Peter Svanberg

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: Monday, June 24, 2019 3:16 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

The 70 TB is based on NAS GUI/crawl log numbers, before deduplication and gz - about 20 TB gzipped is uploaded.

A "broad crawl" runs about 2 - 2 1/2 months, and we do some job follow-up during step 2 (this part takes about 1 1/2 months) and during the selective broad crawl job "mega big sites" (it runs for a month or more; here we use another queue assignment policy and much lower delays, and harvest only domains which can take a huge number of crawling requests!).

Best regards
Tue

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Monday, June 24, 2019 2:54 PM
To: netarchivesuite-users at ml.sbforge.org<mailto:netarchivesuite-users at ml.sbforge.org>
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

Thank you Tue, this is very interesting information!

About 70 TB in how many days?

You emphasize "harvested" - do you mean that more data is downloaded than is archived (duplicates/irrelevant material sorted out)?

I'll return when I have gathered corresponding info on our environment.

Regards,

-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: Monday, June 24, 2019 12:22 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

Hi Peter

We currently have only minor performance issues during harvesting. We have almost finished our second broad crawl this year - it will end up at between 60-70 TB of harvested pages.
Our harvesting capacity is 90-100 Heritrix harvesters, including some virtual Umbra harvesters.
We are using physical servers for the broad crawl harvesters and virtual servers for the selective harvesters.
The 5 physical servers each have:
32 GB RAM, 24 CPUs, 4 TB local storage
The 5 virtual servers, using NFS, each have:
20 GB RAM, 8 CPUs and 3 TB NFS storage
On each server we have 8-10 Heritrix instances running - not counting the Umbra harvesters, of which there is only one per server.
Between the harvesters and the www we have a firewall, and we have throttling agreements with about 5 web hotels, because they blocked/throttled our harvesters.

Best regards
Tue


From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Monday, June 24, 2019 11:39 AM
To: netarchivesuite-users at ml.sbforge.org<mailto:netarchivesuite-users at ml.sbforge.org>
Subject: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

Hello!

I discovered a Heritrix mailing list(*). Among the interesting tips on making crawls faster, I also read some speed figures far beyond what we ever get. So I ask you: what speed values do you get?

Our latest 19 selective harvests have the following figures (from crawl-report.txt in each job's metadata WARC file):

URIs/sec: slowest job 0.83; fastest job 9.8; average 5.11
KB/sec: slowest 34; fastest 863; average 313

(I realize that, besides the NAS/Heritrix configuration, this depends a lot on hardware, memory, disk I/O, network capacity etc., but I don't know which of those figures are most relevant to add to this comparison. Suggestions?)
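
In case it is useful for comparison, here is a rough sketch of how such figures can be aggregated from several jobs' crawl-report.txt files (the label strings in LABELS are assumptions - different Heritrix versions use different ones, so adjust them to what your reports actually contain):

    # Sketch for aggregating speed figures from several crawl-report.txt files.
    # The label strings vary between Heritrix versions (e.g. "Processed docs/sec",
    # "Bandwidth in Kbytes/sec"), so adjust LABELS to your actual reports.
    import re
    from pathlib import Path
    from statistics import mean

    LABELS = {
        "uris_per_sec": re.compile(r"(?:URIs/sec|Processed docs/sec):?\s+([\d.,]+)"),
        "kb_per_sec": re.compile(r"(?:KB/sec|Bandwidth in Kbytes/sec):?\s+([\d.,]+)"),
    }

    def parse_report(path):
        """Pull the speed figures out of one crawl-report.txt."""
        text = Path(path).read_text()
        out = {}
        for key, pattern in LABELS.items():
            m = pattern.search(text)
            if m:
                out[key] = float(m.group(1).replace(",", "."))
        return out

    # "reports" is a placeholder directory containing one subfolder per job.
    reports = [parse_report(p) for p in Path("reports").glob("*/crawl-report.txt")]
    for key in LABELS:
        values = [r[key] for r in reports if key in r]
        if values:
            print(key, "min", min(values), "max", max(values),
                  "avg", round(mean(values), 2))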

* https://groups.yahoo.com/neo/groups/archive-crawler/conversations/messages