[Netarchivesuite-users] Your URI/sec and KB/sec figures?; Deduplication

Peter Svanberg Peter.Svanberg at kb.se
Fri Jun 28 17:29:59 CEST 2019


Okay, so the Lucene index files (not databases, I agree) are just read-only during the harvest, and the comparison is against the previous crawl.

I haven't looked into the Wayback side of things; are the NAS Waybackindexer and Aggregator standard, readily available tools?

I suppose you have found a reasonable balance, but have you considered:

1.       Also checking text/html for duplicates?

2.       Recording revisits/links also between different URLs with the same hash? That would catch all those standard pages which lots of URLs are redirected to.

3.       Avoiding duplicates (on URL+hash, or on hash alone) within one crawl? That would be more complicated, of course, probably requiring rewriting of the WARC files.

In a short test just now, the proportion of avoidable URLs when comparing on hash alone was between 12 and 14 %.
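For anyone who wants to reproduce such a figure, the kind of test I mean is sketched below: it counts fetches in a crawl.log whose content digest has already appeared earlier in the same log. It assumes the usual Heritrix crawl.log layout (status in the second column, URI in the fourth, sha1 digest in the tenth); adjust the field indices if your log format differs.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    /** Rough estimate of URLs that could be skipped if deduplication compared on hash alone. */
    public class HashDedupEstimate {
        public static void main(String[] args) throws IOException {
            Set<String> seenDigests = new HashSet<>();
            long total = 0, avoidable = 0;
            try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.trim().split("\\s+");
                    if (f.length < 10 || !f[1].startsWith("2")) continue; // successful fetches only
                    total++;
                    if (!seenDigests.add(f[9])) avoidable++;              // digest already seen earlier
                }
            }
            System.out.printf("%d of %d URLs (%.1f %%) share a digest with an earlier fetch%n",
                    avoidable, total, total == 0 ? 0.0 : 100.0 * avoidable / total);
        }
    }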

Regards,

Peter


From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: 26 June 2019 21:28
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?; Deduplication

We are using the NAS Indexserver, and there is no database - only a cache on the filesystem with generated gzipped Lucene indexes.
Every time a new harvest is generated, the Indexserver starts a new index job that collects only the crawl logs and CDX files of the previous harvest jobs, as included in the metadata WARC files for each harvest. From these, a gzipped Lucene index is generated in the cache directory. The next step is that the jobs in the harvest are generated by the job database and picked up by the active (e.g. broad crawl) harvesters. They request the index from the Indexserver cache, and it is copied to the cache on the actual harvester server, where it is used by all harvesters on that server. So here it is very important to have only 1 broad crawl harvester active in the beginning - otherwise they will bomb the FTP server on that server. Remember to do the same for each broad crawl server. Once a harvester has the cached index on its local server, it starts running, and each harvested URL is compared by URL and checksum against the cached Lucene index. If it is already there, it is removed, the duplicate is annotated in the crawl log, and a revisit is made.
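In plain code terms, that per-URL lookup amounts to something like the sketch below. This is not the actual DeDuplicator/NAS code - it assumes the previous crawl's index stores one Lucene document per URL with stored fields named "url" and "digest", which may not match the real field names.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.FSDirectory;

    import java.io.IOException;
    import java.nio.file.Paths;

    /** Checks whether a freshly fetched URL is unchanged since the previous crawl. */
    public class DuplicateCheck {
        private final IndexSearcher searcher;

        public DuplicateCheck(String indexDir) throws IOException {
            searcher = new IndexSearcher(DirectoryReader.open(FSDirectory.open(Paths.get(indexDir))));
        }

        /** True if the index already holds this URL with the same content digest. */
        public boolean isDuplicate(String url, String digest) throws IOException {
            ScoreDoc[] hits = searcher.search(new TermQuery(new Term("url", url)), 1).scoreDocs;
            if (hits.length == 0) return false;           // URL not seen in the previous crawl
            Document previous = searcher.doc(hits[0].doc);
            return digest.equals(previous.get("digest"));  // same content -> annotate duplicate, write revisit
        }
    }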

After the job is finished, we have the NAS Waybackindexer and Aggregator, which run every day. They use a local Derby database to keep track of which files have been CDX-indexed. The Waybackindexer goes through a complete archive file list, compares it with the Derby database, and indexes all new files, using the CDX files included in the job metadata WARC files plus the crawl log duplicate annotations to generate a complete CDX for all harvested objects - including the duplicates in the job WARC files. We are not using revisit records or the OpenWayback CDX indexserver.
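The CDX generation for deduplicated records boils down to something like the sketch below. Again, this is only an illustration: the duplicate annotation is assumed to look like duplicate:"originalfile,offset", the crawl.log columns are assumed to be the standard Heritrix layout, and the output is a simplified "URL timestamp file offset" line rather than a full CDX record.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Turns crawl.log duplicate annotations into index lines pointing at the original record. */
    public class DuplicateCdxSketch {
        // Assumed annotation format: duplicate:"originalfile.warc.gz,12345"
        private static final Pattern DUP = Pattern.compile("duplicate:\"([^,\"]+),(\\d+)\"");

        public static void main(String[] args) throws IOException {
            try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    Matcher m = DUP.matcher(line);
                    if (!m.find()) continue;                 // not a deduplicated entry
                    String[] f = line.trim().split("\\s+");
                    String timestamp = f[0], url = f[3];     // assumed crawl.log column positions
                    // Emit an index line that points back at the stored original payload,
                    // so replay finds the kept copy instead of the skipped duplicate.
                    System.out.println(url + " " + timestamp + " " + m.group(1) + " " + m.group(2));
                }
            }
        }
    }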

Best regards
Tue

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Wednesday, June 26, 2019 7:10 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?; Deduplication

Deduplication: You use is.hi.bok.deduplicator.DeDuplicator, which (if I understand correctly):


·         Saves visited 2xx URLs with their checksum values in a Lucene index database.

·         Decides equality via checksum and can treat any two fetched URLs as equal.

·         The index database can contain data from several earlier crawls, not only the most recent one.

You limit its use (perhaps via default parameter values) to:

·         Only considering the same URL, or an "equivalent" URL (same domain and path but a different www[0-9] hostname prefix).

·         Not applying it to mime-type text/* (the combined rule is sketched below).
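So the effective rule, as I read it, would be roughly the following - my own reconstruction from the points above, not DeDuplicator source code; in particular, the "equivalent URL" rewrite is just my guess at stripping a www[0-9] host prefix.

    import java.net.URI;
    import java.net.URISyntaxException;

    /** The deduplication rule as I read it: 2xx only, skip text/*, match on (equivalent) URL + checksum. */
    public class DedupRuleSketch {

        /** Strip a leading "www" or "www<digit>" host label so www2.example.org matches example.org. */
        static String equivalentUrl(String url) {
            try {
                URI u = new URI(url);
                String host = u.getHost() == null ? "" : u.getHost().replaceFirst("^www[0-9]?\\.", "");
                return u.getScheme() + "://" + host + (u.getRawPath() == null ? "" : u.getRawPath());
            } catch (URISyntaxException e) {
                return url;
            }
        }

        static boolean treatAsDuplicate(int status, String mimeType, String url, String digest,
                                        String indexedUrl, String indexedDigest) {
            if (status < 200 || status >= 300) return false;   // only successful fetches are considered
            if (mimeType.startsWith("text/")) return false;    // text/* is excluded by the current setup
            if (!equivalentUrl(url).equals(equivalentUrl(indexedUrl))) return false;
            return digest.equals(indexedDigest);                // same content -> write a revisit record
        }

        public static void main(String[] args) {
            System.out.println(treatAsDuplicate(200, "image/png",
                    "http://www2.example.org/logo.png", "sha1:AAA",
                    "http://example.org/logo.png", "sha1:AAA"));  // true under this reading
        }
    }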

Do you use this on a broad crawl? With one index database for the whole process, being updated from all harvesters?

Do you get 40-50 % data reduction on a broad crawl with these limits? Well, many large resources are linked from many other URLs, I suppose.

Have you considered not using those limits? Are there performance losses in having too many revisit records?

I suppose there is an equivalent module for OpenWayback which uses the same index database to find the WARC file holding the record that a revisit record points to? And also for pywb?

Regards!

Peter Svanberg

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: 24 June 2019 21:26
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

See my TLR comments below

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Monday, June 24, 2019 5:05 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

I'll continue with my curiosity - I hope that's OK.

You mean about 70 TByte fetched in about 100-120 days? (Or was the selective "mega" crawl included in the 70?)
TLR>>> Yes, we run broad crawl step 1 and step 2 and the selective broad crawl jobs in parallel, because we have 2 dedicated harvester farms in AAR and CPH.

And 20 TByte is what ends up being stored in the archive?

TLR>>> yes

Approximately how many URIs does this correspond to - before and after deduplication?

TLR>>> We are talking millions/billions of URLs - just to mention, we have 1.1 billion URLs with the 5003 "byte limit reached" status in step 2. All URLs are recorded in the crawl logs, and the deduplicated ones are annotated there as well.
So you need to specify which type of return-code URLs you want numbers for :). We do some daily statistics for some of the return codes. They are really huge numbers!

TLR>>> Deduplication gives approx. 40-50 % and gz another 40-50 %.
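(A rough consistency check of those two percentages against the volumes mentioned above: 70 TB harvested, with roughly 50-60 % left after deduplication and roughly 50-60 % of that left after gz, gives about 20-25 TB - which fits the ~20 TB stored.)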

We have been advised to do the broad crawl in several steps with increasing max thresholds - is that what you do with your 2 steps? With which different thresholds (and at what levels)?

TLR>>> Yes, we have the following steps:

1)      Step 1: 50 MB byte limit; all domains in the job database; duration 1-2 weeks.
Before each new step 2 broad crawl we do some SQL extraction from the job database and increase the max byte limit per domain for some 10,000-20,000 domains (sketched after this list).

2)      Step 2: 16 GB byte limit; all domains which have hit the 50 MB limit; duration 1-2 months.

3)      And we run about 3-4 big selective broad crawl harvests in parallel - with different (huge) domains - taken out of the step 2 broad crawl. They run 3-6 weeks each, and each harvest creates about 10-20 jobs running in parallel in AAR, together with the normal daily selective harvests.
So we are using most of our harvester capacity for long periods during the "broad crawl".
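The SQL extraction mentioned under step 1 could, in spirit, look like the sketch below. This is a hypothetical illustration only: the JDBC URL, table names (domain_harvest_stats, domain_configs) and column names are invented for the example and do not reflect the actual NAS harvest database schema.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    /** Hypothetical illustration: raise the per-domain byte limit for domains that hit the step-1 cap. */
    public class RaiseByteLimits {
        public static void main(String[] args) throws Exception {
            long step1Limit = 50L * 1024 * 1024;        // 50 MB step-1 cap
            long step2Limit = 16L * 1024 * 1024 * 1024; // 16 GB step-2 cap
            // Connection string, table and column names are invented for the example.
            try (Connection db = DriverManager.getConnection("jdbc:postgresql://harvestdb/nas", "nas", args[0]);
                 PreparedStatement find = db.prepareStatement(
                         "SELECT domain FROM domain_harvest_stats WHERE bytes_harvested >= ?");
                 PreparedStatement bump = db.prepareStatement(
                         "UPDATE domain_configs SET max_bytes = ? WHERE domain = ?")) {
                find.setLong(1, step1Limit);
                try (ResultSet hits = find.executeQuery()) {
                    while (hits.next()) {               // every domain that reached the 50 MB limit
                        bump.setLong(1, step2Limit);
                        bump.setString(2, hits.getString("domain"));
                        bump.addBatch();
                    }
                }
                bump.executeBatch();                    // typically 10,000-20,000 domains per step-2 crawl
            }
        }
    }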

And with reference to the subject line ... what are your typical URI/sec and KB/sec figures in a single job?

TLR>>> I have not looked into that (we have between 50-90 different jobs/day), because we are using the NAS standard setup (you got a copy of that earlier).
The main problem earlier was domains which blocked/throttled us, and after that, that our capacity agreements with the biggest web hotels were too low. The biggest one (a .be company) hosts about 50-75 % of all .dk domains.
After we increased our max concurrent-request agreements with them to 40 MB/sec for our harvester IP ranges in AAR and CPH, and upgraded to NAS 5.5, we have had no big performance issues anymore.


Kind regards,

Peter Svanberg

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: 24 June 2019 15:16
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

The 70 TB is based on NAS GUI/crawl log numbers - that is, before deduplication and gz - with about 20 TB uploaded as gz.

"A broadcrawl"  runs about 2 - 2 1/2 months  - and we do some job follow up during step 2 ( this part takes about 1 ½ month) and the selective broad crawl job "mega big sites" (runs for a month or more and here we use another queue assign policy and much lower delays and harvest only domains which can take a huge number of  crawling requests!)

Best regards
Tue

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Monday, June 24, 2019 2:54 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

Thank you Tue, this is very interesting information!

About 70 TB in how many days?

You emphasize "harvested" - do you mean that more data is downloaded than is archived (with duplicates/irrelevant content sorted out)?

I'll return when I have gathered corresponding info on our environment.

Regards,

-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: 24 June 2019 12:22
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

Hi Peter

Currently we have only minor performance issues during harvesting. We have almost finished our 2nd broad crawl this year - it will end up at between 60 and 70 TB of harvested pages.
Our harvesting capacity is 90-100 Heritrix harvesters, including some virtual Umbra harvesters.
We are using physical servers for the broad crawl harvesters and virtual servers for the selective harvesters.
The 5 physical servers each have:
32 GB RAM, 24 CPUs, 4 TB local storage.
The 5 virtual servers, using NFS, each have:
20 GB RAM, 8 CPUs and 3 TB NFS storage.
On each server we have between 8 and 10 Heritrix instances running - not counting the Umbra harvesters, of which there is only one per server.
Between the harvesters and the web we have a firewall, and we have throttling agreements with about 5 web hotels, because they blocked/throttled our harvesters.

Best regards
Tue


From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Monday, June 24, 2019 11:39 AM
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] Your URI/sec and KB/sec figures?

Hello!

I discovered a Heritrix mailing list(*). Amongst some interesting tips on making crawls faster, I also read some speed figures far from anything we ever get. So I ask you: what speed values do you get?

Our latest 19 selective harvests have the following figures (from crawl-report.txt in the job's metadata WARC file):

URIs/sec: slowest job 0.83; fastest job 9.8; average 5.11
KB/sec: slowest 34; fastest 863; average 313

(I realize that, besides the NAS/Heritrix configuration, this depends a lot on hardware, memory, disk I/O, network capacity etc., but I don't know which of those figures are most relevant to add to this comparison. Suggestions?)
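One way to gather such figures is sketched below: it summarises the "Processed docs/sec" and "Bandwidth in Kbytes/sec" lines across a set of extracted crawl-report.txt files. The label strings assume the usual Heritrix crawl-report.txt wording; adjust them if your version reports the values differently.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.DoubleSummaryStatistics;

    /** Summarises URIs/sec and KB/sec across crawl-report.txt files passed as arguments. */
    public class CrawlSpeedSummary {
        public static void main(String[] args) throws IOException {
            DoubleSummaryStatistics uris = new DoubleSummaryStatistics();
            DoubleSummaryStatistics kb = new DoubleSummaryStatistics();
            for (String arg : args) {
                for (String line : Files.readAllLines(Paths.get(arg))) {
                    if (line.startsWith("Processed docs/sec:")) {
                        uris.accept(parse(line));
                    } else if (line.startsWith("Bandwidth in Kbytes/sec:")) {
                        kb.accept(parse(line));
                    }
                }
            }
            System.out.printf("URIs/sec: slowest %.2f, fastest %.2f, average %.2f%n",
                    uris.getMin(), uris.getMax(), uris.getAverage());
            System.out.printf("KB/sec:   slowest %.0f, fastest %.0f, average %.0f%n",
                    kb.getMin(), kb.getMax(), kb.getAverage());
        }

        private static double parse(String line) {
            // Values may use a decimal comma depending on locale; normalise before parsing.
            return Double.parseDouble(line.substring(line.indexOf(':') + 1).trim().replace(',', '.'));
        }
    }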

* https://groups.yahoo.com/neo/groups/archive-crawler/conversations/messages