<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Verdana;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0cm;
margin-right:0cm;
margin-bottom:0cm;
margin-left:36.0pt;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
mso-fareast-language:EN-US;}
p.msonormal0, li.msonormal0, div.msonormal0
{mso-style-name:msonormal;
mso-margin-top-alt:auto;
margin-right:0cm;
mso-margin-bottom-alt:auto;
margin-left:0cm;
font-size:12.0pt;
font-family:"Times New Roman",serif;}
span.EmailStyle19
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:windowtext;}
span.EmailStyle20
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:#1F497D;}
span.EmailStyle21
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:#1F497D;}
span.EmailStyle22
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:70.85pt 70.85pt 70.85pt 70.85pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="DA" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">Yes, in our hardware setup this gives the best throughput: broad crawl harvesters on 5 physical servers with 10 broad crawl instances on each, and selective harvesters on 5 virtual servers running on NFS with 8 instances on each.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">I admit that during the first 2-3 weeks of a broad crawl there is probably smoke in the server room from the physical servers, because the average load is between 100 and 200 % according to the top command.
But the servers and jobs are NOT failing. We have niced the FTP server on each server, and we run with OS limits of 40,000 open files and 20,000 nprocs.<o:p></o:p></span></p>
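<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">A minimal sketch of how the open-file and nproc limits quoted above could be checked from inside a process on a Linux harvester machine. It uses Python's standard resource module; the names TARGET_LIMITS and limit_report are made up for this illustration and are not part of NetarchiveSuite.<o:p></o:p></span></p>
<pre>
```python
import resource

# Targets taken from the message above; the names here are hypothetical.
TARGET_LIMITS = {
    "open files": (resource.RLIMIT_NOFILE, 40000),
    "nprocs": (resource.RLIMIT_NPROC, 20000),
}


def limit_report(targets=TARGET_LIMITS):
    """Return {name: (soft_limit, target, ok)} for the current process."""
    report = {}
    for name, (which, target) in targets.items():
        soft, _hard = resource.getrlimit(which)
        # RLIM_INFINITY means "unlimited", which satisfies any target.
        ok = soft == resource.RLIM_INFINITY or soft >= target
        report[name] = (soft, target, ok)
    return report


if __name__ == "__main__":
    for name, (soft, target, ok) in limit_report().items():
        status = "ok" if ok else "too low"
        print(f"{name}: soft limit {soft}, target {target}: {status}")
```
</pre>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">The check only reports the soft limits of the process that runs it, so it would have to be run as the same user, and started the same way, as the harvester instances.<o:p></o:p></span></p>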
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">We do have growing problems with the virtual servers on NFS when they are heavily loaded with almost 40 selective broad crawl jobs (8 instances on each server): timeouts, stale drives, or OS panics.
<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">It is important to separate selective harvest jobs from broad crawl jobs, because a broad crawl in our environment generates about 500-600 broad crawl jobs. If the harvesters were not separated into
several harvester channel pools, no daily selective harvest would be executed. The selective jobs would just hang in the “new” queue while the broad crawl is running, which takes about 2 months.
<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">And on the selective harvester servers it is important to make sure that the selective broad crawl jobs do not take all the harvester instances in that pool.
<o:p></o:p></span></p>
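<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">The effect of separate channel pools can be illustrated with a toy model. This is only a sketch and not how NetarchiveSuite schedules jobs internally: the function dispatch and the channel name "selective" are assumptions made for the example, and the instance counts simply reuse the 5 x 10 and 5 x 8 figures from this thread.<o:p></o:p></span></p>
<pre>
```python
def dispatch(jobs, free_instances):
    """Start queued jobs while their channel has an idle harvester instance.

    jobs: list of (job_name, channel) tuples in submission order.
    free_instances: dict mapping channel name to number of idle instances.
    Returns (started, still_new): names of started and of waiting jobs.
    """
    free = dict(free_instances)  # copy, so the caller's dict is untouched
    started, still_new = [], []
    for name, channel in jobs:
        if free.get(channel, 0) >= 1:
            free[channel] -= 1
            started.append(name)
        else:
            still_new.append(name)
    return started, still_new


# 600 broad crawl jobs are submitted first, then one daily selective job.
broad = [("broad-%d" % i, "snapshot") for i in range(600)]

# Case 1: every harvester listens on the same channel.
started_shared, waiting_shared = dispatch(
    broad + [("daily", "snapshot")], {"snapshot": 90}
)

# Case 2: selective jobs have their own pool of harvester instances.
started_split, waiting_split = dispatch(
    broad + [("daily", "selective")], {"snapshot": 50, "selective": 40}
)

print("daily started with one shared pool:", "daily" in started_shared)
print("daily started with separate pools:", "daily" in started_split)
```
</pre>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">In the shared case the daily job stays in the waiting list behind the broad crawl jobs; with its own pool it starts at once. The same idea applies inside the selective pool, where a cap on the long-running selective broad crawl jobs leaves instances free for the daily harvests.<o:p></o:p></span></p>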
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">Best regards<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">Tue<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D"><o:p> </o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span style="mso-fareast-language:DA">From:</span></b><span style="mso-fareast-language:DA"> NetarchiveSuite-users &lt;netarchivesuite-users-bounces@ml.sbforge.org&gt;
<b>On Behalf Of </b>Peter Svanberg<br>
<b>Sent:</b> Wednesday, June 26, 2019 7:25 PM<br>
<b>To:</b> netarchivesuite-users@ml.sbforge.org<br>
<b>Subject:</b> Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?; Heritrix instances<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">You say you have 8-10 Heritrix instances per (physical or virtual) server. Is that a good way to increase the throughput? And do you mean that you have that many
</span><span lang="SV"><a href="http://kw3-admprod-04.kb.se/Status/Monitor-JMXsummary.jsp?removeapplication=*&location=-&machine=*&applicationname=dk.netarkivet.harvester.heritrix3.HarvestControllerApplication&applicationinstanceid=-&httpport=-&channel=*&replicaname=*&index=0"><span lang="EN-GB" style="font-size:10.5pt;font-family:"Verdana",sans-serif;color:#336699;background:white;text-decoration:none">HarvestControllerApplication</span></a></span><span lang="EN-GB">
processes on every server, but still just one snapshot channel?<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">Do the rest of you use this trick too?<span style="color:#1F497D"><o:p></o:p></span></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">Regards!<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">Peter<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-GB" style="mso-fareast-language:SV">From:</span></b><span lang="EN-GB" style="mso-fareast-language:SV"> NetarchiveSuite-users &lt;<a href="mailto:netarchivesuite-users-bounces@ml.sbforge.org">netarchivesuite-users-bounces@ml.sbforge.org</a>&gt;
<b>On Behalf Of </b>Tue Hejlskov Larsen<br>
<b>Sent:</b> 24 June 2019 12:22<br>
<b>To:</b> <a href="mailto:netarchivesuite-users@ml.sbforge.org">netarchivesuite-users@ml.sbforge.org</a><br>
<b>Subject:</b> Re: [Netarchivesuite-users] Your URI/sec and KB/sec figures?<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">Hi Peter<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">We currently have only minor performance issues during harvesting. We have almost finished our second broad crawl this year; it will end up at between 60 and 70 TB of
<i>harvested</i> pages.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">Our harvesting capacity is 90-100 Heritrix harvesters including some virtual Umbra harvesters…<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">We use physical servers for the broad crawl harvesters and virtual servers for the selective harvesters.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">The 5 physical servers each have:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">32 GB RAM, 24 CPUs, 4 TB local storage<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">The 5 virtual servers using NFS each have:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">20 GB RAM, 8 CPUs and 3 TB NFS storage<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">On each server we have 8-10 Heritrix instances running, except for the Umbra harvesters, of which there is only one per server.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">Between the harvesters and the web we have a firewall, and we have firewall/throttling agreements with about 5 web hotels (hosting providers), because they blocked or throttled our harvesters.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">Best regards<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D">Tue<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:#1F497D"><o:p> </o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span style="mso-fareast-language:DA">From:</span></b><span style="mso-fareast-language:DA"> NetarchiveSuite-users &lt;</span><span lang="SV"><a href="mailto:netarchivesuite-users-bounces@ml.sbforge.org"><span lang="DA" style="mso-fareast-language:DA">netarchivesuite-users-bounces@ml.sbforge.org</span></a></span><span style="mso-fareast-language:DA">&gt;
<b>On Behalf Of </b>Peter Svanberg<br>
<b>Sent:</b> Monday, June 24, 2019 11:39 AM<br>
<b>To:</b> </span><span lang="SV"><a href="mailto:netarchivesuite-users@ml.sbforge.org"><span lang="DA" style="mso-fareast-language:DA">netarchivesuite-users@ml.sbforge.org</span></a></span><span style="mso-fareast-language:DA"><br>
<b>Subject:</b> [Netarchivesuite-users] Your URI/sec and KB/sec figures?<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span lang="EN-GB">Hello!<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">I discovered a Heritrix mailing list (*). Among some interesting tips on making the crawl faster, I also read some speed figures that are far from anything we ever get. So I ask you: what speed values do you get?<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">Our latest 19 selective harvests have the following figures (from crawl-report.txt in the job's metadata WARC file):<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">URIs/sec: slowest job 0.83; fastest job 9.8; average 5.11<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">KB/sec: slowest job 34; fastest job 863; average 313<o:p></o:p></span></p>
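<p class="MsoNormal"><span lang="EN-GB">For anyone who wants to produce the same kind of summary, here is a rough sketch. It assumes that crawl-report.txt consists of "key: value" lines and that the keys are spelled URIs/sec and KB/sec as above; the exact layout of the file is an assumption, the helper names are invented, and the sample snippets are made up and are not real measurements.<o:p></o:p></span></p>
<pre>
```python
from statistics import mean


def read_rates(report_text, keys=("URIs/sec", "KB/sec")):
    """Pick the requested "key: value" lines out of one crawl report."""
    rates = {}
    for line in report_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in keys:
            # Accept a decimal comma as well as a decimal point.
            rates[key.strip()] = float(value.strip().replace(",", "."))
    return rates


def summarise(reports, key):
    """Return (slowest, fastest, average) of one rate over many reports."""
    values = [read_rates(text)[key] for text in reports]
    return min(values), max(values), mean(values)


# Made-up report snippets for illustration only, NOT real measurements.
SAMPLE_REPORTS = [
    "crawl status: Finished\nURIs/sec: 1.0\nKB/sec: 100\n",
    "crawl status: Finished\nURIs/sec: 2,0\nKB/sec: 200\n",
    "crawl status: Finished\nURIs/sec: 3.0\nKB/sec: 300\n",
]

if __name__ == "__main__":
    for key in ("URIs/sec", "KB/sec"):
        slowest, fastest, average = summarise(SAMPLE_REPORTS, key)
        print(f"{key}: slowest {slowest}; fastest {fastest}; average {average}")
```
</pre>
<p class="MsoNormal"><span lang="EN-GB">The average here is a plain mean over jobs, so a short job counts as much as a long one; weighting by job duration would give a different figure.<o:p></o:p></span></p>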
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">(I realize that, besides the NAS/Heritrix configuration, this depends heavily on hardware, memory, disk I/O, network capacity etc., but I don’t know which of those figures are most relevant to add to this comparison. Suggestions?)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">* </span><span lang="SV"><a href="https://groups.yahoo.com/neo/groups/archive-crawler/conversations/messages"><span lang="EN-GB">https://groups.yahoo.com/neo/groups/archive-crawler/conversations/messages</span></a></span><span lang="EN-GB"><o:p></o:p></span></p>
</div>
</body>
</html>