[Netarchivesuite-users] Some Questions
Mikis Seth Sørensen
mss at statsbiblioteket.dk
Tue May 13 09:12:16 CEST 2014
Statistics from the last index:
Index size: 906GB
Arc-files: 99815 arc-files each of size 100MB
These arc-files are random pages from the Danish internet in 2007, so I
consider them a fair sample, large enough to average out outliers.
But the 2007 internet pages are slightly smaller than web pages harvested
in 2013 etc. Web page sizes tend to grow over time. I have not
seen this pattern myself yet, only heard others claim this correlation.
However, the pictures, Word documents, binaries etc. are of course not in
the index; only the text metadata we extract with Tika goes into the index.
We'll get back to you regarding your other questions shortly :-)
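
As a rough illustration of what the statistics above imply for the index-to-raw-data ratio, here is a small sketch. It assumes decimal units (1 GB = 1000 MB) and that every arc-file is exactly 100 MB; the function names are made up for this example, and the 2007 sample may not be representative of newer harvests.

```python
# Figures quoted in the statistics above.
INDEX_SIZE_GB = 906
ARC_FILE_COUNT = 99815
ARC_FILE_SIZE_MB = 100  # assumption: every arc-file is exactly 100 MB


def index_to_raw_ratio(index_gb, file_count, file_size_mb):
    """Return index size divided by raw arc data size (both in GB)."""
    raw_gb = file_count * file_size_mb / 1000  # assumption: 1 GB = 1000 MB
    return index_gb / raw_gb


def estimated_index_gb(raw_gb, ratio):
    """Scale the observed ratio to another amount of raw data."""
    return raw_gb * ratio


ratio = index_to_raw_ratio(INDEX_SIZE_GB, ARC_FILE_COUNT, ARC_FILE_SIZE_MB)
print(f"index/raw ratio: {ratio:.1%}")
print(f"estimated index for 1 TB raw: {estimated_index_gb(1000, ratio):.0f} GB")
```

With these numbers the index comes out at roughly 9% of the raw arc data, so about 91 GB of index per 1 TB harvested, under the assumptions stated.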
On 5/11/14 5:10 PM, "Peter M" <imagenoise at aol.com> wrote:
>Hey Mikis, hello other Users,
>thanks for your answers Mikis and Thomas! I'm still trying to find my
>way with NAS and there's not much info on the net, so sorry for
>bothering you again.
>1) How can I delete/pause scheduled jobs, the ones classified as "new"
>on the "Job Status" page? I can't find anything on the job status page
>or on the "Details for Job X" page. Deactivating on the Selective
>Harvests page is too late because the job is already in the job queue.
>I can cancel them via the heritrix gui one by one after they have started,
>but that takes a lot of time for several jobs and is kinda unfortunate,
>because NAS thinks the job completed successfully if I terminate the job
>2) > Yes it is only possible to add complete domains in the general domain
>> listing. Subdomains or subdirectories are handled through seed lists,
>> which are either defined on domains (see
>hm, ok. so to stay with the given example
>(http://facebook.com/ladygaga), I would have to create a new selective
>harvest definition, add facebook.com, edit the selective harvest
>definition, add facebook.com/ladygaga as seed, and then only the seeds
>are harvested? do I have to deactivate facebook.com somehow so that
>itself is not harvested but only the subdirectory? I tried it this way
>and only 3,588 Bytes and 2 Documents got harvested.
>"Domain/Seeds for harvestdefinition ladygaga
>Search results: 1, displaying results 1 to 1.
>previous / next
>facebook.com (1 Seeds)
>Total: 1 Domains / 1 Seeds"
>Actually I don't really understand the concept of seeds (domains seem to
>be easy :) ), the given information on sbforge doesn't really help and I
>can't find anything in the heritrix documentation.
>3) > This is handled by starting more HarvestControllerApplication
>could have found that out myself, sorry for that.
>meanwhile I found out that if I don't add all urls in a selective
>harvest as domains but some as seeds, heritrix is using more than 1
>active thread, whereas if I pass them all as domains I only get "1
>active of 50 threads" (all domains relate to different servers). why is
>that so (probably related to my lack of understanding of seeds :) - maybe
>extending the user manual regarding this would be nice...)?
>>1: Index-builder. 256 GB ram, 48 CPU. Runs the 3 described software
>>applications and a SOLR server. It is only responsible for building
>>INDEX-files each of size 1 TB.
>>When optimizing the index the SOLR server needs 32 GB ram. Completed
>>index-files(size 1 TB) are copied and removed.
>>It takes 10 days to build a 1 TB optimized index with 40 workers.
>well, that's impressive. our small underfunded NGO-based archive plays in
>a completely different league :/
>I'd be interested in the size of the index compared to the raw data.
>How big is the (optimized) index of let's say 1 TB of harvested average
>webpages of mixed content (html, pics and maybe a few videos)?
>5) If I look at /home/test/QUICKSTART/bitarkiv/filedir it seems that
>for each job ID two files are created. For testing I harvested a
>small, non-changing webpage; runs 1, 2 and 3 relate to the following
>143K Mar 28 19:58 2-1-20140328184831-00000-serve.warc
>69K Mar 28 19:58 2-metadata-1.warc
>143K Mar 28 20:10 3-1-20140328185959-00000-serve.warc
>69K Mar 28 20:10 3-metadata-1.warc
>146K Apr 14 19:47 4-1-20140414173740-00000-serve.warc
>69K Apr 14 19:47 4-metadata-1.warc
>Looking at the sizes of the files, it doesn't seem that the harvester
>module did any de-duplication?
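
File sizes alone may not settle this. One rough way to look inside the files is to count record types in each harvest file. The WARC format defines a `revisit` record type for content recorded as a duplicate of an earlier capture. Whether the de-duplicator in this setup writes `revisit` records, rather than noting duplicates somewhere else, is an assumption here, and `count_record_types` is a hypothetical helper written for this sketch, not part of NAS. It only handles uncompressed .warc files and can miscount if a payload itself contains lines that look like WARC headers.

```python
from collections import Counter


def count_record_types(path):
    """Count WARC-Type header values in an uncompressed .warc file."""
    counts = Counter()
    in_header = False
    with open(path, "rb") as warc:
        for line in warc:
            if line.startswith(b"WARC/"):
                # A version line marks the start of a record header block.
                in_header = True
            elif in_header and line.strip() == b"":
                # The first blank line ends the header block.
                in_header = False
            elif in_header and line.lower().startswith(b"warc-type:"):
                value = line.split(b":", 1)[1].strip()
                counts[value.decode("ascii", "replace")] += 1
    return counts
```

Running it on the harvest files of jobs 2, 3 and 4 and comparing the counts of `response` and `revisit` records would show whether later runs stored the full content again.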
>6) HTTrack has a so-called "near flag". With this it also downloads
>content embedded on the harvested page, e.g. an embedded
>youtube video. Is something like this also possible with the NAS? Or
>would that be a case for "Missing URL Collection"?
>thanks again for helping!