[Netarchivesuite-users] Some Questions
Søren Vejrup Carlsen
svc at kb.dk
Wed Jun 11 18:06:42 CEST 2014
Hi Peter.
Ad 1. You cannot do that from the GUI. You must delete the jobs directly from the database (DELETE FROM jobs WHERE job_id=X, where X is the ID of the job to remove).
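As a rough sketch of that deletion done from a script rather than by hand: the example below uses an in-memory SQLite database purely as a stand-in for the real NAS database. Only the table name `jobs` and the column `job_id` come from the statement above; the helper name `delete_job` and the one-column table layout are assumptions made for illustration. Before doing this on a real installation, take a backup and check whether other tables reference the job.

```python
import sqlite3


def delete_job(conn, job_id):
    """Delete one row from the jobs table by id; return the number of rows removed."""
    # Parameterized query, so the id is never pasted into the SQL string.
    cur = conn.execute("DELETE FROM jobs WHERE job_id = ?", (job_id,))
    conn.commit()
    return cur.rowcount


# Stand-in database (assumption: the real jobs table has many more columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO jobs VALUES (?)", [(1,), (2,), (3,)])

removed = delete_job(conn, 2)
remaining = [row[0] for row in conn.execute("SELECT job_id FROM jobs ORDER BY job_id")]
```

With a different database driver the placeholder style may differ (for example `%s` instead of `?`).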
Ad 2. Domain vs. seeds
A seed is a URL (e.g. http://www.kb.dk), and each seed belongs to a specific Top-Level-domain (TLD), in this case kb.dk.
NAS can accept a list of TLDs (kb.dk, bold.dk, ...) and ingest them. When this happens, a Domain entity is created in NAS for each of them, and a basic seed is created for that domain, currently "www." + TLD.
In NAS, you can create new seedlists for a domain and edit existing ones.
Another important concept in NAS is the DomainConfiguration. This is basically a combination of a Heritrix template and a seedlist.
Each domain has at least one DomainConfiguration: the default one, which combines the default seedlist ("www." + TLD) with the default Heritrix template.
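The relationship between these entities can be sketched as a small data model. This is only an illustration of the concepts described above, not the NAS implementation: the class and function names, the configuration name "default" and the template name "default_template" are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DomainConfiguration:
    name: str
    template: str      # name of a Heritrix template
    seeds: List[str]   # the seedlist this configuration uses


@dataclass
class Domain:
    name: str
    configurations: Dict[str, DomainConfiguration] = field(default_factory=dict)


def ingest_domain(name, default_template="default_template"):
    """Create a Domain with one default configuration seeded with "www." + name."""
    domain = Domain(name)
    default_seed = "www." + name
    domain.configurations["default"] = DomainConfiguration(
        "default", default_template, [default_seed]
    )
    return domain


kb = ingest_domain("kb.dk")
default_config = kb.configurations["default"]
```

Adding a new seedlist to a domain would, in this picture, mean adding another DomainConfiguration entry that pairs the new seeds with a template.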
A snapshot harvest combines the default DomainConfiguration of every domain in NetarchiveSuite into multiple Heritrix crawl-jobs, so that the work can be distributed onto multiple machines.
Each crawl-job is limited to harvesting the TLDs of the seeds in its combined seedlist (made up of all the default seedlists in that job).
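To make the scoping rule concrete, here is a minimal sketch of a check for whether a URL falls under one of the domains a job is limited to. It only illustrates the idea; the actual scoping is done by Heritrix according to the template, and the function name `in_scope` is an assumption.

```python
from urllib.parse import urlsplit


def in_scope(url, domains):
    """Return True if the URL's host is one of the domains or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


job_domains = {"kb.dk", "bold.dk"}
inside = in_scope("http://www.kb.dk/da/index.html", job_domains)
outside = in_scope("http://www.youtube.com/watch", job_domains)
```

Under this rule a seed such as http://facebook.com/ladygaga would put all of facebook.com in scope, not just the subdirectory.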
Ad selective harvest. When setting up a selective harvest, you can give NAS a series of seeds (URLs). In this case, Domain entities are created for any domains that do not exist yet, and a seedlist is added with the seed(s) from the URL list as its only content.
A new DomainConfiguration is created for all these seeds, using a Heritrix template of your choice and some other criteria (such as maxobject, maxbytes).
When a Heritrix crawl-job is created, all DomainConfigurations with the same Heritrix template can be put together into one job.
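The grouping step can be sketched as follows: configurations that share a template end up in the same job, and their seeds are merged. This is an illustration under simplifying assumptions, not the NAS job generator; the function name, the tuple layout and the template names are hypothetical, and any further criteria the real job generation applies are ignored.

```python
from collections import defaultdict


def group_by_template(configs):
    """Map template name -> combined, de-duplicated seed list of all configurations using it."""
    jobs = defaultdict(list)
    for _domain, template, seeds in configs:
        for seed in seeds:
            if seed not in jobs[template]:
                jobs[template].append(seed)
    return dict(jobs)


# (domain, Heritrix template name, seeds) for three hypothetical configurations
configs = [
    ("kb.dk", "default_template", ["http://www.kb.dk"]),
    ("bold.dk", "default_template", ["http://www.bold.dk"]),
    ("facebook.com", "deep_template", ["http://facebook.com/ladygaga"]),
]
jobs = group_by_template(configs)
```

Here kb.dk and bold.dk share a job because they use the same template, while the facebook.com configuration gets a job of its own.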
I hope this makes it easier to understand.
-------------------------------------------
For Heritrix-specific crawler questions, please ask on the archive-crawler list:
https://groups.yahoo.com/neo/groups/archive-crawler/info
Best Regards
Søren Vejrup Carlsen, NetarchiveSuite developer
________________________________________
From: NetarchiveSuite-users [netarchivesuite-users-bounces at ml.sbforge.org] on behalf of Peter M [imagenoise at aol.com]
Sent: 11 June 2014 17:28
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Some Questions
Hello again,
since I'm not able to continue my NetarchiveSuite exploration without
further knowledge, I have to re-ask three of my former questions.
>> 1) How can I delete/pause scheduled jobs, the ones classified as "new"
>> on the "Job Status" page? I can't find anything on the job status page
>> or on the "Details for Job X" page. Deactivating on the Selective
>> Harvests page is too late because the job is already in the job queue.
>> I can cancel them via the Heritrix GUI one by one after they have started,
>> but that takes a lot of time for several jobs and is kind of unfortunate,
>> because NAS thinks the job completed successfully if I terminate the job
>> via the Heritrix GUI.
>> 2) > Yes, it is only possible to add complete domains in the general domain
>>> listing. Subdomains or subdirectories are handled through seed lists,
>>> which are either defined on domains
>>
>> hm, ok. so to stay with the given example
>> (http://facebook.com/ladygaga), I would have to create a new selective
>> harvest definition, add facebook.com, edit the selective harvest
>> definition, add facebook.com/ladygaga as seed, and then only the seeds
>> are harvested? do I have to deactivate facebook.com somehow so that
>> itself is not harvested but only the subdirectory? I tried it this way
>> and only 3,588 Bytes and 2 Documents got harvested.
>>
>> "Domain/Seeds for harvestdefinition ladygaga
>> Search results: 1, displaying results 1 to 1.
>>
>> previous / next
>> facebook.com (1 Seeds)
>> http://facebook.com/ladygaga
>>
>> Total: 1 Domains / 1 Seeds"
>>
>> Actually I don't really understand the concept of seeds (domains seem to
>> be easy :) ), the given information on sbforge doesn't really help and I
>> can't find anything in the Heritrix documentation.
>> 6) HTTrack has a so-called "near flag". With this it also downloads
>> content that is embedded on the harvested page, e.g. an embedded
>> YouTube video. Is something like this also possible with NAS? Or
>> would that be a case for "Missing URL Collection"?
thanks a lot, best
peter
_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
http://ml.sbforge.org/mailman/listinfo/netarchivesuite-users