[Netarchivesuite-users] Limits

Søren Vejrup Carlsen svc at kb.dk
Wed Mar 18 20:50:23 CET 2009


Hi Andreas.

>- What is the maximum number of domains per job, and what are the
>criteria for splitting up jobs within a snapshot harvest (especially if
>you start a harvest based on a previous one)?

A harvest comprises a number of domain-configurations (domain + seedlist + Heritrix order.xml).
These domain-configurations are divided into jobs based on the following criteria:
  - the domain-configurations in each job must use the same Heritrix order.xml
  - the number of bytes last fetched from the domain, which is taken as the expected size of the domain
So each job in the harvest is expected to harvest roughly the same amount of data, and
the jobs should therefore finish at about the same time.
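
As a rough illustration of the two criteria above, here is a minimal Python sketch.
It is not the NetarchiveSuite implementation: the names DomainConfig and
split_into_jobs, and the max_bytes_per_job limit, are assumptions made up for this
example. It groups configurations by order.xml and then packs each group into jobs
of roughly equal expected size.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainConfig:
    # Hypothetical record; field names are illustrative only.
    domain: str
    order_xml: str       # name of the Heritrix order.xml used
    expected_bytes: int  # bytes fetched from the domain in the previous harvest


def split_into_jobs(configs, max_bytes_per_job):
    """Group configs by order.xml, then pack each group into jobs whose
    expected size stays at or below max_bytes_per_job (an assumed limit)."""
    by_order = defaultdict(list)
    for cfg in configs:
        by_order[cfg.order_xml].append(cfg)

    jobs = []
    for order_xml in sorted(by_order):
        current, current_bytes = [], 0
        # Largest domains first, so domains of similar size end up together.
        ordered = sorted(by_order[order_xml],
                         key=lambda c: c.expected_bytes, reverse=True)
        for cfg in ordered:
            # Start a new job when adding this domain would exceed the limit.
            if current and current_bytes + cfg.expected_bytes > max_bytes_per_job:
                jobs.append(current)
                current, current_bytes = [], 0
            current.append(cfg)
            current_bytes += cfg.expected_bytes
        if current:
            jobs.append(current)
    return jobs
```

A domain larger than the limit simply gets a job of its own in this sketch; how the
real scheduler treats such cases is not covered here.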

>- The maximum size of an arc-file is defined in the order.xml, which is
>not used for the metadata.arc-file.  What's the maximum size of that
>file? Are there any limits, or will that file also be split at some stage?
Actually, there is no limit on the size of this file. We have metadata-1.arc files over 4 GBytes.
Most of this data is the crawl.log, which is stored uncompressed in the metadata-1.arc file.

And we have no plans to split this file. However, we want to move away from storing data in ARC files
and use WARC files instead.

>- What is the maximum number of filedirs in a bitarchive that I can
>configure in the settings.xml? Are there any restrictions?
There is no maximum number of filedirs. However, only the first filedir in the list
is actually used for storage; when this filedir is full, the next one in the list
is used. We have a feature request for this issue:
https://gforge.statsbiblioteket.dk/tracker/index.php?func=detail&aid=1573&group_id=7&atid=108
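
To make the fill-up order concrete, here is a small Python sketch of the behaviour
described above. It is an assumption-based illustration, not the bitarchive code:
pick_filedir and its parameters are hypothetical names, and free space is read with
the standard shutil.disk_usage call unless a replacement function is passed in.

```python
import shutil


def pick_filedir(filedirs, required_bytes, free_bytes=None):
    """Return the first filedir (in configured order) with at least
    required_bytes free, or None if every filedir is full."""
    if free_bytes is None:
        # Default: ask the operating system for the free space of each path.
        free_bytes = lambda path: shutil.disk_usage(path).free
    for path in filedirs:
        if free_bytes(path) >= required_bytes:
            return path
    return None
```

With this selection rule, later filedirs stay empty until all earlier ones are full,
which matches the one-at-a-time filling described above.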

I hope this helps

Regards
Søren
-----Original message-----
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of aponb at gmx.at
Sent: 18 March 2009 17:34
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] Limits

Hi!

As I am now doing some test snapshot harvests with around 1000 domains,
I have some questions about the behaviour of the Netarchive System when,
in future, hundreds of thousands of domains will be in use.

- What is the maximum number of domains per job, and what are the
criteria for splitting up jobs within a snapshot harvest (especially if
you start a harvest based on a previous one)?

- The maximum size of an arc-file is defined in the order.xml, which is
not used for the metadata.arc-file.  What's the maximum size of that
file? Are there any limits, or will that file also be split at some stage?

- What is the maximum number of filedirs in a bitarchive that I can
configure in the settings.xml? Are there any restrictions?

Thanks in advance for your time!
Regards
a.



