[Netarchivesuite-users] Few questions about netarchivesuite

Kåre Fiedler Christiansen kfc at statsbiblioteket.dk
Mon Oct 27 16:07:49 CET 2008

On Mon, 2008-10-27 at 14:12 +0100, Tomas Ukkonen wrote:
> Hi
> Thanks for your reply.
> However, I think you misunderstood my question
> so maybe I was somewhat unclear what I meant.
> > > 1) Is there any way to delete domains?
> > >
> > > You just choose Definitions => Selective Harvests => Find Harvest
> > > definition => Edit => Find domain => Remove - and the domain is gone
> Yes. I can remove domains from Selective Harvests but I would like
> to remove domains _completely_. So that it isn't listed - for an example -
> on Domain Statistics page anymore.
> For example, I have added a domain (+ its seeds) to the system that
> I would like to remove from the system for good. If I try to make
> the domain's seed list empty, I get an error message saying:
> "Parameter seedList must not be empty."
> I would like to remove domain (and its seeds) so that
> it isn't included to Snapshot Harvests anymore.
> Is this currently possible?

Short answer is: No, it is not currently possible. Domains are supposed
to live forever, to be able to browse statistics of previous harvests
etc., at least that's the current design.

It is however possible to exclude it from the snapshot harvest, although
it is a bit of a hack. You can set the domain limit in the default
configuration to 0, or you can set the Heritrix template to be a
non-harvesting template.

It may be a good idea to have the possibility of making domains
active/inactive, as with harvest definitions, but it is not currently
supported.

> > > 2) Currently ...  Is this currently the best way to archive this?
> > >
> > > If you have created a new harvest definition, you just have
> > to put new
> > > seeds into the box ...
> > >
> > > Definitions => Selective Harvests => Find Harvest
> > definition => Edit
> > > => Add seeds
> > >
> > > - and the system will find and create all domains that are
> > new in the
> > > harvest definition, and place the respective seeds in the
> > domains seed
> > > list for this harvest definition
> Hmm..
> I'm thinking I'm trying to use the system in a way that it
> wasn't meant for. I would like to have abstract 'URL lists'
> domains which I could then remove or add to Selective Harvests.
> You seem to suggest that I should make Selective Harvests
> around certain theme which I then could rerun. This is OK
> but may force you to change all harvest definitions when
> links change unless you create abstract TLD domains which
> seed lists you can edit.

The idea of abstract TLD domains will not work properly, because a lot
of the harvest limit logic assumes that seeds for a harvest are based on
the domains included in the harvest, and will extract statistics based
on domains.

NetarchiveSuite takes a very domain-centric view of the world. It
assumes seeds belong to domains, and a harvest is a set of domains with
particular configurations (seedlists and limits).

Having seeds from different domains in the same seedlist is currently
not supported.
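
To make the domain-centric model concrete, here is a minimal sketch in
Python of how a mixed list of seeds ends up as one seed list per domain.
This is not NetarchiveSuite code: the names domain_of and
group_seeds_by_domain are hypothetical, and the domain rule used here
(the last two labels of the host name) is a simplifying assumption that
ignores suffixes such as co.uk.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_of(seed_url):
    """Return a naive domain name for a seed URL.

    Assumption for illustration only: the domain is the last two
    labels of the host name, so suffixes like co.uk are not handled.
    """
    host = urlparse(seed_url).hostname
    if host is None:
        raise ValueError(f"not an absolute URL: {seed_url!r}")
    return ".".join(host.split(".")[-2:])


def group_seeds_by_domain(seeds):
    """Split a flat seed list into one seed list per domain."""
    grouped = defaultdict(list)
    for seed in seeds:
        grouped[domain_of(seed)].append(seed)
    return dict(grouped)


seeds = [
    "http://www.example.org/news/",
    "http://example.org/about.html",
    "http://blog.example.net/",
]
by_domain = group_seeds_by_domain(seeds)
for domain, seed_list in by_domain.items():
    print(domain, seed_list)
```

Under these assumptions every seed lands in exactly one domain's seed
list, which is why limits and statistics can be kept per domain, and why
a single seed list spanning several domains does not fit the model.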

> Btw. Is there any way to remove "Selective Harvests" or
> "Snapshot Harvests" ?

Again, no, because it would effectively remove any possibility of showing
the history of harvests.

> I can deactivate them but with heavier use the web
> interface will probably quickly become filled with
> deactivated old harvests.

I see the point, and I guess that some feature where you could archive
the harvests (and possibly the same with domains) would probably be a
good idea. It is not supported currently, unfortunately.

So basically, you have pointed out some deficiencies in the system, that
will have to be addressed at some point. I have added an item to the
feature request tracker where we will track the issue of adding the
possibility to archive elements:


You can follow progress of the item there.

Of course, if you are a developer, you are welcome to try to add a
patch. We would of course be helpful with pointers and discussions, if
you feel like giving it a shot.

Kaare Fiedler Christiansen - NetarchiveSuite-developer
Universitetsparken 1, 8000 Aarhus C, Denmark.
Phone: +45 89462036
