[Netarchivesuite-users] Some Questions

Mikis Seth Sørensen mss at statsbiblioteket.dk
Fri Jun 13 08:54:31 CEST 2014

Hi Peter

Ad 6.
In NetarchiveSuite you have to explicitly define in the harvest configuration
what to harvest; a 'near flag' would be too imprecise for this. So
content missing from a harvest can either be added to the next harvest by
adding it to the seed list, e.g. by using the "Missing URL Collection",
or by changing the harvest configuration to collect more content.

NAS will not collect streaming content, such as YouTube videos. Collecting
and exposing streaming content is one of the ongoing problems in the web
archiving community at large.


On 6/11/14 6:06 PM, "Søren Vejrup Carlsen" <svc at kb.dk> wrote:

>Hi Peter.
>Ad 1. You cannot do that from the GUI. You must delete them from the
>database (DELETE FROM jobs WHERE job_id=X).
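>A minimal sketch of such a cleanup, assuming a queued job with ID 42 (the
>table name is as in Søren's note above; related tables vary by schema version):

```sql
-- Hypothetical example: remove queued job 42 directly from the harvest database.
-- Take a database backup first; depending on the schema version, related rows
-- (e.g. job/configuration mappings) may need to be removed as well.
DELETE FROM jobs WHERE job_id = 42;
```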
>Ad 2. Domains vs. seeds.
>A seed is a URL (http://www.kb.dk), and a seed belongs to a specific
>domain, in this case kb.dk.
>NAS can accept a list of domains (kb.dk, bold.dk, ...) and ingest them. When
>this happens, a Domain entity is created in NAS and a basic seed is
>created for that domain, currently "www." + the domain name.
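>To illustrate, reusing the domains from the example above: ingesting the list

```
kb.dk
bold.dk
```

>creates one Domain entity per line, each with a default seed of "www." plus
>the domain name:

```
http://www.kb.dk
http://www.bold.dk
```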
>In NAS you can create new seedlists for a domain and edit existing ones.
>Another important concept in NAS is the DomainConfiguration. This is
>basically a combination of a Heritrix template and a seedlist.
>Each domain has at least one DomainConfiguration, which is the combination
>of the default seedlist ("www." + domain) and the default Heritrix template.
>A snapshot harvest combines the default DomainConfiguration of
>all domains in NetarchiveSuite into multiple Heritrix crawl jobs in order
>to distribute the work onto multiple machines.
>Each crawl job is limited to harvesting the domains of the seeds in the
>combined seedlist (i.e. the union of all the default seedlists).
>Ad selective harvests. When setting up a selective harvest, you can give
>NAS a series of seeds (URLs). In this case, Domain entities are created
>for domains that do not yet exist, and a seedlist is added with the seed(s)
>from the URL list as its only content.
>A new DomainConfiguration is created for all these seeds, using a Heritrix
>template of choice and some other criteria (e.g. max objects, max bytes).
>When a Heritrix crawl job is created, all DomainConfigurations with the
>same Heritrix template can be put together into one job.
>I hope this makes it easier to understand.
>For the Heritrix-crawler-specific questions, please ask them on the
>archive-crawler list:
>Best Regards
>Søren Vejrup Carlsen, NetarchiveSuite developer
>From: NetarchiveSuite-users [netarchivesuite-users-bounces at ml.sbforge.org]
>on behalf of Peter M [imagenoise at aol.com]
>Sent: 11 June 2014 17:28
>To: netarchivesuite-users at ml.sbforge.org
>Subject: Re: [Netarchivesuite-users] Some Questions
>Hello again,
>since I'm not able to continue my NetarchiveSuite exploration without
>further knowledge, I have to re-ask three of my former questions.
>>> 1) How can I delete/pause scheduled jobs, the ones classified as "new"
>>> on the "Job Status" page? I can't find anything on the job status page
>>> or on the "Details for Job X" page. Deactivating on the Selective
>>> Harvests page is too late because the job is already in the job queue.
>>> I can cancel them via the Heritrix GUI one by one after they have
>>> started, but that takes a lot of time for several jobs and is kind of
>>> unfortunate, because NAS thinks the job completed successfully if I
>>> terminate it via the Heritrix GUI.
>>> 2) > Yes, it is only possible to add complete domains in the general
>>>> listing. Subdomains or subdirectories are handled through seed lists,
>>>> which are either defined on domains
>>> Hm, ok. So, to stay with the given example
>>> (http://facebook.com/ladygaga), I would have to create a new selective
>>> harvest definition, add facebook.com, edit the selective harvest
>>> definition, add facebook.com/ladygaga as a seed, and then only the seeds
>>> are harvested? Do I have to deactivate facebook.com somehow so that
>>> it itself is not harvested but only the subdirectory? I tried it this
>>> way and only 3,588 bytes and 2 documents got harvested.
>>> "Domain/Seeds for harvestdefinition ladygaga
>>> Search results: 1, displaying results 1 to 1.
>>> previous / next
>>> facebook.com (1 Seeds)
>>>        http://facebook.com/ladygaga
>>> Total: 1 Domains / 1 Seeds"
>>> Actually I don't really understand the concept of seeds (domains seem
>>> to be easy :) ); the given information on sbforge doesn't really help,
>>> and I can't find anything in the Heritrix documentation.
>>> 6) Httrack has a so-called "near flag". With this it also downloads
>>> content embedded on the harvested page, like e.g. an embedded
>>> YouTube video. Is something like this also possible with NAS? Or
>>> would that be a case for the "Missing URL Collection"?
>thanks a lot, best
>NetarchiveSuite-users mailing list
>NetarchiveSuite-users at ml.sbforge.org

More information about the NetarchiveSuite-users mailing list