[Netarchivesuite-users] Activating object count limit in harvest definitions, configurations and jobs

Kåre Fiedler Christiansen kfc at statsbiblioteket.dk
Thu Aug 27 11:57:39 CEST 2009

On Wed, 2009-08-19 at 17:26 +0200, nicolas.giraud at bnf.fr wrote:
> Hi,
> My current task at BnF is to allow using URL count as a domain budget,
> instead of data size. I have browsed the code and have found that
> everything works at the data model and DAO level.
> One of my concerns is to understand how this will impact the process
> of splitting a harvest definition into jobs. If I have understood
> things correctly, the critical code for this is located in the method
> dk.netarkivet.harvester.datamodel.Job#canAccept(DomainConfiguration).
> I would like to have some textual explanation of the calculations
> performed here, as I do not fully understand what happens just by
> reading the code. If using URL count for the budget, should the size
> limit be set to -1 (Constants.HERITRIX_MAXBYTES_INFINITY)?
> My next concern is to insert the proper configuration in order.xml,
> but prior to asking more info about this, I have to read some doc ;)
> Cheers,
> Nicolas

Hi Nicolas.

First of all, sorry about not responding to this sooner. It had somehow
slipped my mind. If I forget to answer a request another time, don't
hesitate to ask again after a few days to jog my memory :-)

There is a rather lengthy description of how we split jobs here:
I hope that answers all of your questions, otherwise I can try to
elaborate on specific points.

I'm fairly certain you are right that canAccept is the only place where
splitting is decided, but "getExpectedNumberOfObjects", which that
method uses, is probably worth looking at as well. I actually think both
methods know about object limits to a certain degree, but someone will
have to check whether that part of the code has been kept up to date.
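
As a rough illustration of the kind of calculation involved, here is a
minimal sketch of an object-count-based acceptance check. It is not the
actual NetarchiveSuite code: the class name, method signatures, the
"total cap" and "max ratio" thresholds, and the use of -1 as the
"no limit" marker for object counts are all assumptions made for the
example. The real logic in Job#canAccept and getExpectedNumberOfObjects
should be read from the source.

```java
// Hypothetical sketch: none of these names or thresholds come from NetarchiveSuite.
final class SplitSketch {
    /** Assumed marker for "no limit", mirroring the -1 convention asked about above. */
    static final long NO_LIMIT = -1L;

    /** Effective object budget: the smaller of the limits that are actually set. */
    static long effectiveLimit(long configLimit, long harvestLimit) {
        if (configLimit == NO_LIMIT) {
            return harvestLimit;
        }
        if (harvestLimit == NO_LIMIT) {
            return configLimit;
        }
        return Math.min(configLimit, harvestLimit);
    }

    /** Expected objects: the previous count if known (>= 0), else a default, capped by the budget. */
    static long expectedObjects(long previousCount, long defaultExpectation, long limit) {
        long estimate = previousCount >= 0 ? previousCount : defaultExpectation;
        return limit == NO_LIMIT ? estimate : Math.min(estimate, limit);
    }

    /**
     * Accept a configuration only if the job stays under a total cap and the
     * largest expectation in the job is at most maxRatio times the smallest.
     */
    static boolean canAccept(long jobTotal, long jobMin, long jobMax,
                             long candidate, long maxTotal, double maxRatio) {
        if (jobTotal + candidate > maxTotal) {
            return false; // job would become too large overall
        }
        long newMin = Math.min(jobMin, candidate);
        long newMax = Math.max(jobMax, candidate);
        // Guard against a zero minimum so the ratio test stays meaningful.
        return newMax <= maxRatio * Math.max(newMin, 1L);
    }
}
```

With these assumed rules, a configuration is refused either because the
job would grow past the total cap or because its expected size differs
too much from the configurations already in the job.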

I think there may even be code that puts object limits in the order
files. It is conceivable that all you need to do is add the input field
in the user interface, but again: this code has not been used for a long
time and may not have been kept up to date.

