Søren Vejrup Carlsen svc at kb.dk
Wed Jun 11 18:06:42 CEST 2014

Hi Peter.
Ad 1. You cannot do that from the GUI. You must delete them directly from the database (DELETE FROM jobs WHERE job_id=X, where X is the ID of the job to remove).
Ad 2. Domain vs. seeds

A seed is a URL (http://www.kb.dk), and each seed belongs to a specific domain (loosely called a TLD here; strictly, the top-level domain is just dk), in this case kb.dk.

NAS can accept a list of TLDs (kb.dk, bold.dk, ...) and ingest them. When this happens, a Domain entity is created in NAS for each one, and a basic seed is created for that domain, currently "www." + TLD.
In NAS, you can create new seed lists for a domain and edit existing ones.
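
The ingest step can be pictured roughly as follows. This is only a sketch in Python, not NetarchiveSuite code: the `Domain` class, the `ingest_domains` function and the seed list name "defaultseeds" are assumptions made up for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """Hypothetical stand-in for a NAS Domain entity."""
    name: str
    # Seed lists keyed by name; the name "defaultseeds" is an assumption.
    seedlists: dict = field(default_factory=dict)


def ingest_domains(names):
    """Create one Domain per name, each with the basic seed "www." + name."""
    domains = {}
    for raw in names:
        name = raw.strip().lower()
        if not name or name in domains:
            continue  # skip blank entries and duplicates
        domains[name] = Domain(name, {"defaultseeds": ["www." + name]})
    return domains
```

Calling `ingest_domains(["kb.dk", "bold.dk"])` would give two Domain objects, each holding a single default seed.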

Another important concept in NAS is the DomainConfiguration. This is basically a combination of a Heritrix template and a seed list.

Each domain has at least one DomainConfiguration, which is a combination of the default seed list ("www." + TLD) and the default Heritrix template.
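
As a minimal sketch of that combination (again in Python and purely illustrative: the field names and the placeholder values "defaultconfig" and "default_template" are assumptions, not the real NAS names):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainConfiguration:
    """Hypothetical model: a Heritrix template name plus a seed list."""
    name: str
    template: str
    seeds: tuple


def default_configuration(domain, default_template="default_template"):
    """Build the configuration every domain starts out with."""
    return DomainConfiguration(
        name="defaultconfig",          # placeholder name (assumption)
        template=default_template,     # placeholder template name (assumption)
        seeds=("www." + domain,),      # the basic seed: "www." + domain
    )
```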

A snapshot harvest combines the default DomainConfigurations of all domains in NetarchiveSuite into multiple Heritrix crawl jobs, so that the work can be distributed across multiple machines.

Each crawl job is limited to harvesting the TLDs of the seeds in its combined seed list (the default seed lists of all domains in the job).
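
To make the splitting and the scope limit concrete, here is a rough Python sketch. It assumes a fixed number of domains per job, which is a simplification chosen for the example; how NAS actually decides the split is not covered here, and `snapshot_jobs` is a hypothetical name.

```python
def snapshot_jobs(default_seeds, domains_per_job):
    """Split the default seed lists of all domains into crawl jobs.

    default_seeds maps a domain name to its default seed list.
    Each job carries a combined seed list and a scope: the set of
    domains it is allowed to harvest.
    """
    if domains_per_job < 1:
        raise ValueError("domains_per_job must be at least 1")
    names = sorted(default_seeds)
    jobs = []
    for start in range(0, len(names), domains_per_job):
        chunk = names[start:start + domains_per_job]
        jobs.append({
            # combined seed list of the domains in this job
            "seeds": [seed for d in chunk for seed in default_seeds[d]],
            # the job may only harvest these domains
            "scope": set(chunk),
        })
    return jobs
```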

Ad selective harvest. When setting up a selective harvest, you can give NAS a series of seeds (URLs). In this case, Domain entities are created for any domains that do not already exist, and a seed list is added containing only the seed(s) from the URL list.

A new DomainConfiguration is created for all these seeds using a Heritrix template of your choice and some other criteria (such as maxobject and maxbytes).

When a Heritrix crawl job is created, all DomainConfigurations with the same Heritrix template can be put together into one job.
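
The grouping by template could look something like this in Python. It is a sketch under simplifying assumptions: each configuration is reduced to a (template, seeds) pair, the template names are invented, and limits such as maxobject and maxbytes are left out.

```python
from collections import defaultdict


def jobs_by_template(configurations):
    """Merge the seeds of all configurations that share a template.

    configurations is an iterable of (template_name, seeds) pairs.
    Returns a dict mapping each template name to one combined seed list.
    """
    jobs = defaultdict(list)
    for template, seeds in configurations:
        jobs[template].extend(seeds)
    return dict(jobs)


# Example input with invented template names.
configs = [
    ("template_a", ["http://facebook.com/ladygaga"]),
    ("template_b", ["http://www.kb.dk"]),
    ("template_a", ["http://www.bold.dk"]),
]
```

With the example input, the two configurations using "template_a" end up in the same job, while the one using "template_b" gets a job of its own.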

I hope this makes it easier to understand.


For Heritrix crawler-specific questions, please ask on the archive-crawler list:

Best Regards
Søren Vejrup Carlsen, NetarchiveSuite developer

Sent: 11 June 2014 17:28
Hello again,

since I'm not able to continue my NetarchiveSuite exploration without
further knowledge, I have to re-ask three of my former questions.

>> 1) How can I delete/pause scheduled jobs, the ones classified as "new"
>> on the "Job Status" page? I can't find anything on the job status page
>> or on the "Details for Job X" page. Deactivating on the Selective
>> Harvests page is too late because the job is already in the job queue.
>> I can cancel them via the Heritrix GUI one by one after they have
>> started, but that takes a lot of time for several jobs and is rather
>> unfortunate, because NAS thinks the job completed successfully if I
>> terminate the job via the Heritrix GUI.

>> 2) > Yes it is only possible to add complete domains in the general domain
>>> listing. Subdomains or subdirectories are handled through seed lists,
>>> which are either defined on domains
>> Hm, ok. So to stay with the given example
>> (http://facebook.com/ladygaga), I would have to create a new selective
>> harvest definition, add facebook.com, edit the selective harvest
>> definition, add facebook.com/ladygaga as a seed, and then only the seeds
>> are harvested? Do I have to deactivate facebook.com somehow so that
>> the domain itself is not harvested but only the subdirectory? I tried it
>> this way and only 3,588 bytes and 2 documents got harvested.
>> "Domain/Seeds for harvestdefinition ladygaga
>> Search results: 1, displaying results 1 to 1.
>> previous / next
>> facebook.com (1 Seeds)
>>        http://facebook.com/ladygaga
>> Total: 1 Domains / 1 Seeds"
>> Actually I don't really understand the concept of seeds (domains seem to
>> be easy :) ); the given information on sbforge doesn't really help and I
>> can't find anything in the Heritrix documentation.

>> 6) HTTrack has a so-called "near flag". With this it also downloads
>> content that is embedded in the harvested page, e.g. an embedded
>> YouTube video. Is something like this also possible with NAS? Or
>> would that be a case for "Missing URL Collection"?

thanks a lot, best

