[Netarchivesuite-users] Some Questions

Sun May 11 17:10:41 CEST 2014

Hey Mikis, hello other Users,

thanks for your answers Mikis and Thomas! I'm still trying to find my
way with NAS and there's not much info on the net, so sorry for
bothering you again.

1) How can I delete or pause scheduled jobs, the ones classified as "new"
on the "Job Status" page? I can't find anything on the job status page
or on the "Details for Job X" page. Deactivating on the Selective
Harvests page is too late because the job is already in the job queue.
I can cancel them via the Heritrix GUI one by one after they have started,
but that takes a lot of time for several jobs and is kinda unfortunate,
because NAS thinks the job completed successfully if I terminate it
via the Heritrix GUI.

2) > Yes it is only possible to add complete domains in the general domain
> listing. Subdomains or subdirectories are handled through seed lists, which
> are either defined on domains (see

hm, ok. so to stay with the given example
(http://facebook.com/ladygaga), I would have to create a new selective
harvest definition, add facebook.com, edit the selective harvest
definition, add facebook.com/ladygaga as seed, and then only the seeds
are harvested? do I have to deactivate facebook.com somehow so that
the domain itself is not harvested but only the subdirectory? I tried it
this way and only 3,588 bytes and 2 documents got harvested.

"Domain/Seeds for harvestdefinition ladygaga
Search results: 1, displaying results 1 to 1.

previous / next
facebook.com (1 Seeds)
        http://facebook.com/ladygaga

Total: 1 Domains / 1 Seeds"

Actually I don't really understand the concept of seeds (domains seem to
be easy :) ), the given information on sbforge doesn't really help and I
can't find anything in the Heritrix documentation.

3) > This is handled by starting more HarvestControllerApplication instances
> (see
>https://sbforge.org/display/NASDOC/The+Deploy+Configuration+File#TheDeployConfigurationFile-HarvestControllerApplication).

could have found that out myself, sorry for that.

meanwhile I found out that if I don't add all URLs in a selective
harvest as domains but some as seeds, Heritrix uses more than 1
active thread, whereas if I pass them all as domains I only get "1
active of 50 threads" (all domains relate to different servers). why is
that so? (probably related to my not understanding seeds :) - maybe
extending the user manual regarding this would be nice...)

4) >Hardware:
>Two machines.
>1: Index-builder. 256 GB ram, 48 CPU. Runs the 3 described software
>applications and a SOLR server. It is only responsible for building
>INDEX-files each of size 1 TB.
>When optimizing the index the SOLR server needs 32 GB ram. Completed
>index-files(size 1 TB) are copied and removed.
>It takes 10 days to build a 1 TB optimized index with 40 workers.

well, that's impressive. our small, underfunded, NGO-based archive plays in
a completely different league :/

I'd be interested in the size of the index compared to the raw data.
How big is the (optimized) index of, let's say, 1 TB of harvested average
webpages of mixed content (html, pics and maybe a few videos)?

5) If I look at /home/test/QUICKSTART/bitarkiv/filedir it seems that
for each job ID two files are created. For testing I harvested a
small, non-changing webpage; runs 1, 2 and 3 relate to the following
files:

143K Mar 28 19:58 2-1-20140328184831-00000-serve.warc
69K Mar 28 19:58 2-metadata-1.warc
143K Mar 28 20:10 3-1-20140328185959-00000-serve.warc
69K Mar 28 20:10 3-metadata-1.warc
146K Apr 14 19:47 4-1-20140414173740-00000-serve.warc
69K Apr 14 19:47 4-metadata-1.warc

Looking at the sizes of the files, it doesn't seem that the harvester
module did any de-duplication?
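
To put numbers on that impression, here is a minimal sketch that parses
the listing above and compares each run's data file with the first run.
The function names are made up for illustration, the sizes are the
rounded values from the listing, and file size is only a rough hint: it
says nothing about what is actually stored inside the WARC files.

```python
# Listing copied from above; sizes are rounded, so this is only a rough check.
LISTING = """\
143K Mar 28 19:58 2-1-20140328184831-00000-serve.warc
69K Mar 28 19:58 2-metadata-1.warc
143K Mar 28 20:10 3-1-20140328185959-00000-serve.warc
69K Mar 28 20:10 3-metadata-1.warc
146K Apr 14 19:47 4-1-20140414173740-00000-serve.warc
69K Apr 14 19:47 4-metadata-1.warc
"""


def parse_listing(text):
    """Return {job_id: {"data": size_kb, "metadata": size_kb}}."""
    jobs = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        size_kb = int(fields[0].rstrip("K"))   # e.g. "143K" -> 143
        name = fields[-1]                      # file name is the last column
        job_id = int(name.split("-", 1)[0])    # leading number is the job ID
        kind = "metadata" if "-metadata-" in name else "data"
        jobs.setdefault(job_id, {})[kind] = size_kb
    return jobs


def data_size_ratios(jobs):
    """Size of each job's data file relative to the earliest job's data file."""
    first = jobs[min(jobs)]["data"]
    return {job_id: sizes["data"] / first
            for job_id, sizes in sorted(jobs.items())}


jobs = parse_listing(LISTING)
ratios = data_size_ratios(jobs)
for job_id, ratio in ratios.items():
    print(f"job {job_id}: data file is {ratio:.2f}x the size of the first run")
```

If unchanged content had been skipped on the later runs, I would expect
the ratios for the second and third run to be clearly below 1; here
they are not.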

6) HTTrack has a so-called "near flag". With this it also downloads
content that is embedded on the harvested page, e.g. an embedded
YouTube video. Is something like this also possible with NAS? Or
would that be a case for "Missing URL Collection"?

thanks again for helping!
ciao
peter