[Netarchivesuite-users] Your URI/sec and KB/sec figures?; Heritrix instances
Tue Hejlskov Larsen
tlr at kb.dk
Wed Jun 26 20:13:48 CEST 2019
Yes, in our hardware setup we get the best throughput with the broad crawl harvesters on 5 physical servers, 10 broad crawl instances on each, and the selective harvesters on 5 virtual servers running on NFS, 8 instances on each.
I admit that during the first 2-3 weeks of a broad crawl I suspect there is smoke in the server room from the physical servers, because the average load is 100-200 % according to the top command. But the servers and jobs are NOT failing. We have niced the FTP server on each machine, and we run with OS limits of 40,000 open files and 20,000 processes (nproc).
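A quick way to confirm that a harvester process actually sees raised limits like these is to read them from inside the process. The following is only a minimal sketch using Python's standard `resource` module on a Unix-like system; the target values are the ones quoted above, and the function name `limits_report` is made up for illustration.

```python
import resource

# Targets quoted in the message above; treat them as site-specific values.
WANTED = {
    "open files": (resource.RLIMIT_NOFILE, 40_000),
    "processes": (resource.RLIMIT_NPROC, 20_000),
}

def limits_report(wanted=WANTED):
    """Return {name: (soft_limit, target, ok)} for the current process."""
    report = {}
    for name, (res, target) in wanted.items():
        soft, _hard = resource.getrlimit(res)
        # RLIM_INFINITY means "unlimited", which satisfies any target.
        ok = soft == resource.RLIM_INFINITY or soft >= target
        report[name] = (soft, target, ok)
    return report

if __name__ == "__main__":
    for name, (soft, target, ok) in limits_report().items():
        print(f"{name}: soft limit {soft}, wanted {target}, {'OK' if ok else 'TOO LOW'}")
```

The soft limit is the one that matters for a running process, so that is what the sketch compares against.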
We do have growing problems with the NFS-backed virtual servers when they are heavily loaded with almost 40 selective broad crawl jobs (8 instances on each server): timeouts, stale drives, or OS panics.
It is important to separate selective harvest jobs from broad crawl jobs. A broad crawl in our environment generates about 500-600 broad crawl jobs, and if the harvesters were not split into several harvester channel pools, no daily selective harvest would be executed. The jobs would just hang in the "new" queue while the broad crawl is running, which takes about 2 months.
And on the selective harvester servers it is important to make sure that the selective broad crawl jobs do not take all the harvester instances in that pool.
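The starvation effect described above can be illustrated with a toy queue model. This is a sketch under stated assumptions, not how NetarchiveSuite schedules jobs internally: each channel is modelled as a FIFO queue served by a fixed pool of harvesters, and the channel names, job durations (48 and 2 hours), and the 50/40 split are hypothetical values chosen only for illustration.

```python
import heapq

def start_hours(jobs, pools):
    """Simulate FIFO pickup of jobs that are all submitted at hour 0.

    jobs:  list of (name, channel, duration_hours) in submission order
    pools: dict mapping channel -> number of harvester instances on it
    Returns a dict mapping job name -> hour at which it starts.
    """
    # One min-heap per channel holding the hour each harvester becomes free.
    # A list of equal values is already a valid heap.
    free_at = {channel: [0.0] * n for channel, n in pools.items()}
    starts = {}
    for name, channel, hours in jobs:
        start = heapq.heappop(free_at[channel])  # earliest free harvester
        starts[name] = start
        heapq.heappush(free_at[channel], start + hours)
    return starts

# Hypothetical workload: 550 broad crawl jobs queued ahead of one daily job.
broad = [(f"broad-{i}", "snapshot", 48.0) for i in range(550)]
daily = [("daily-selective", "selective", 2.0)]

# Case 1: every harvester listens on one shared channel.
shared_jobs = [(name, "shared", hours) for name, _, hours in broad + daily]
shared = start_hours(shared_jobs, {"shared": 90})

# Case 2: separate pools for broad crawl and selective harvests.
split = start_hours(broad + daily, {"snapshot": 50, "selective": 40})

print("shared pool:   daily job starts at hour", shared["daily-selective"])
print("separate pools: daily job starts at hour", split["daily-selective"])
```

With these made-up numbers the daily job waits 288 hours in the shared pool and starts immediately with separate pools. The same model can be used to try out a cap on how many instances in the selective pool a single large job may occupy.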
Best regards
Tue
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Wednesday, June 26, 2019 7:25 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?; Heritrix instances
You say you have 8-10 Heritrix instances per (physical or virtual) server. Is that a good way to increase throughput? And do you mean that you have that many HarvestControllerApplication<http://kw3-admprod-04.kb.se/Status/Monitor-JMXsummary.jsp?removeapplication=*&location=-&machine=*&applicationname=dk.netarkivet.harvester.heritrix3.HarvestControllerApplication&applicationinstanceid=-&httpport=-&channel=*&replicaname=*&index=0> processes on every server, but still just one snapshot channel?
Do the rest of you use this trick as well?
Regards!
Peter
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: Monday, June 24, 2019 12:22 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?
Hi Peter
We currently have only minor performance issues during harvesting. We have almost finished our second broad crawl this year; it will end up at 60-70 TB of harvested pages.
Our harvesting capacity is 90-100 Heritrix harvesters including some virtual Umbra harvesters...
We are using physical servers for the broadcrawl harvesters and virtual servers for selective harvesters.
Each of the 5 physical servers has:
32 GB RAM, 24 CPUs, 4 TB local storage
Each of the 5 virtual servers, using NFS, has:
20 GB RAM, 8 CPUs and 3 TB NFS storage
On each server we have 8-10 Heritrix instances running, except for the Umbra harvesters, of which there is only one per server.
Between the harvesters and the web we have a firewall, and we have firewall throttling agreements with about 5 web hosting providers, because they had blocked or throttled our harvesters.
Best regards
Tue
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Monday, June 24, 2019 11:39 AM
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] Your URI/sec and KB/sec figures?
Hello!
I discovered a Heritrix mailing list (*). Among some interesting tips on making the crawl faster, I also read some speed figures far beyond anything we ever get. So I ask you: what speed values do you get?
Our latest 19 selective harvests have the following figures (from crawl-report.txt in each job's metadata WARC file):
URIs/sec: slowest job 0.83; fastest job 9.8; average 5.11
KB/sec: slowest job 34; fastest job 863; average 313
(I realize that, apart from the NAS/Heritrix configuration, this depends heavily on hardware, memory, disk I/O, network capacity, etc., but I don't know which of those figures would be most relevant to add to this comparison. Suggestions?)
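For anyone who wants to produce the same slowest/fastest/average summary for their own jobs, here is a minimal sketch. It assumes the text of each crawl-report.txt has already been extracted from the metadata WARC file (not shown), and that the report contains lines of the form `URIs/sec: 5.11` and `KB/sec: 313`; the function names are made up for illustration. A decimal comma is accepted as well as a decimal point.

```python
import re
from statistics import mean

# Assumed line format: "URIs/sec: 5.11" or "KB/sec: 313" on a line of its own.
RATE_LINE = re.compile(
    r"^(URIs/sec|KB/sec):\s*([0-9]+(?:[.,][0-9]+)?)\s*$", re.MULTILINE
)

def parse_rates(report_text):
    """Extract the two rate figures from one crawl report's text."""
    return {
        key: float(value.replace(",", "."))
        for key, value in RATE_LINE.findall(report_text)
    }

def summarize(report_texts):
    """Return {metric: (slowest, fastest, average)} over all reports."""
    rates = [parse_rates(text) for text in report_texts]
    summary = {}
    for key in ("URIs/sec", "KB/sec"):
        values = [r[key] for r in rates if key in r]
        if values:  # skip a metric that no report contained
            summary[key] = (min(values), max(values), mean(values))
    return summary

if __name__ == "__main__":
    # Two tiny made-up reports, just to show the call.
    samples = [
        "crawl status: Finished\nURIs/sec: 0.83\nKB/sec: 34\n",
        "crawl status: Finished\nURIs/sec: 9,8\nKB/sec: 863\n",
    ]
    for metric, (slowest, fastest, average) in summarize(samples).items():
        print(f"{metric}: slowest {slowest}; fastest {fastest}; average {average:.2f}")
```

Note that this averages the per-job rates, which weights a short job the same as a long one; dividing total URIs by total crawl time would give a different figure.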
* https://groups.yahoo.com/neo/groups/archive-crawler/conversations/messages