[Netarchivesuite-users] Some Questions

Peter M imagenoise at aol.com
Sun May 18 21:48:56 CEST 2014

Hello again,

I tried a new installation with the new 4.4 Quickstart manual. It works
fine and is much better for people who just want to do a quick try, with
the copy-and-paste wget commands and the OpenMQ installation script.
Thanks for that!

Since 4.4 has already been released, you could change this line in the
quickstart manual

wget -N -O NetarchiveSuite.zip

to

wget -N -O NetarchiveSuite.zip

I just wanted to let you know, concerning my former question 5, that I
did some new runs with 4.4 and deduplication seems to work.
Runs 1, 2 and 3 of an unchanged homepage:

1.9M May 18 18:16 2-2-20140518160641-00000-serve.warc
 74K May 18 18:16 2-metadata-1.warc
117K May 18 19:23 3-2-20140518171314-00000-serve.warc
 69K May 18 19:23 3-metadata-1.warc
117K May 18 21:00 4-2-20140518185013-00000-serve.warc
 70K May 18 21:00 4-metadata-1.warc

I wanted to have a look at the contents of the warc files, which would
be very useful for quality assessment and searching for crawler traps,
but couldn't access them. "unzip" (cannot find zipfile directory) and
"jar xv" (no message at all, but no files extracted) failed. What did I
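
For what it's worth, a .warc file is not a ZIP archive, which would
explain why unzip and jar find nothing to extract. It is a sequence of
records, each made of a "WARC/..." version line, header fields, a blank
line, and a payload of Content-Length bytes. A minimal Python sketch for
listing the records, assuming uncompressed files as in the listing above
and no folded header lines (list_warc_records is an illustrative name,
not a NetarchiveSuite tool, and the sample data is made up):

```python
import io


def list_warc_records(stream):
    """Yield the header fields (as a dict) of each record in an
    uncompressed WARC stream opened in binary mode."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            field = stream.readline()
            if field in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = field.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Read past the payload so the next loop starts at the next record.
        stream.read(int(headers["Content-Length"]))
        yield headers


# Made-up two-record sample, only to show the record layout.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: metadata\r\n"
    b"Content-Length: 2\r\n"
    b"\r\n"
    b"ok\r\n\r\n"
)

records = list(list_warc_records(io.BytesIO(sample)))
for rec in records:
    print(rec.get("WARC-Type"), rec.get("WARC-Target-URI"))
```

For a real file, the same function could be called on
open("2-2-20140518160641-00000-serve.warc", "rb") instead of the
in-memory sample.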

> Ad 4) 
> Statistics from the last index:
> Index size: 906GB
> Arc-files: 99815 arc-files each of size 100MB

ok, so roughly a relation of 1:10 index size to raw data size.
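
That estimate can be checked with a quick calculation. This is only a
sketch, assuming decimal units (1 GB = 1000 MB) and that every ARC file
is exactly the nominal 100 MB quoted above:

```python
# Figures quoted in the statistics above.
index_gb = 906
arc_files = 99815
arc_file_mb = 100  # nominal size per file; real files may differ

# Total raw data in GB, assuming 1 GB = 1000 MB.
raw_gb = arc_files * arc_file_mb / 1000

# How many times larger the raw data is than the index.
ratio = raw_gb / index_gb
print(f"raw data is about {ratio:.1f} times the index size")
```

Under these assumptions the raw data comes to just under 10 TB, so the
index is a little less than a tenth of it.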

The NetarchiveSuite 5.0-Milestone1 Release Notes say that
"The changes to the project since 4.4 are internal to the
NetarchiveSuite development", so, just out of curiosity, when might there
be an integration of Heritrix 3.2?

thanks and all the best

