[Netarchivesuite-users] Some Questions

Søren Vejrup Carlsen svc at kb.dk
Mon May 19 11:35:57 CEST 2014

Hi Peter.
Thanks for the download tips.

As for looking at the contents of the WARC files: the metadata WARC files contain no binary data, so you can just use a text reader of your choice, such as less, vi, or emacs.
You can browse the other WARC files in the same way, but the output will be uglier.
You can also try out the JWAT-TOOLS (https://sbforge.org/display/JWAT/JWAT-Tools)
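
The reason "unzip" and "jar xv" fail is that a WARC file is not a ZIP archive: an uncompressed .warc file is simply a sequence of records, each with a plain-text header block followed by its payload. If you only want an overview of what a file contains, a minimal sketch along the following lines may help. It is an illustration only, not part of NetarchiveSuite or JWAT; the function name and the sample data are made up, and it assumes an uncompressed file without folded (multi-line) header values.

```python
import io


def iter_warc_headers(stream):
    """Yield one dict of WARC headers per record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the payload; its length in bytes is given by Content-Length.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


# Hypothetical two-record sample, used here instead of a real crawl file.
SAMPLE = (
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n"
)

for record in iter_warc_headers(io.BytesIO(SAMPLE)):
    print(record.get("WARC-Type"), record.get("WARC-Target-URI"))
```

For a real file, replace io.BytesIO(SAMPLE) with a file opened in binary mode, e.g. open(path, "rb"). Listing WARC-Target-URI per record in this way can be a quick first check when looking for crawler traps.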

We are planning a NetarchiveSuite release with Heritrix 3.X support at the end of this year.

best regards
Søren V. Carlsen

-----Original Message-----
From: NetarchiveSuite-users [mailto:netarchivesuite-users-bounces at ml.sbforge.org] On behalf of Peter M
Sent: 18 May 2014 21:49
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Some Questions

Hello again,

I tried a new installation with the new 4.4 Quickstart manual. It works fine and is much better for people who just want to give it a quick try, using the copy-and-paste wget commands and the OpenMQ installation script. Thanks for that!

As 4.4 has already been released, you could change the following in the Quickstart manual from

wget -N -O NetarchiveSuite.zip

to

wget -N -O NetarchiveSuite.zip

I just wanted to let you know, concerning my earlier question 5, that I did some new runs with 4.4 and deduplication seems to work.
Runs 1, 2 and 3 of an unchanged homepage:

1.9M May 18 18:16 2-2-20140518160641-00000-serve.warc
 74K May 18 18:16 2-metadata-1.warc
117K May 18 19:23 3-2-20140518171314-00000-serve.warc
 69K May 18 19:23 3-metadata-1.warc
117K May 18 21:00 4-2-20140518185013-00000-serve.warc
 70K May 18 21:00 4-metadata-1.warc

I wanted to have a look at the contents of the warc files, which would be very useful for quality assessment and searching for crawler traps, but couldn't access them. "unzip" (cannot find zipfile directory) and "jar xv" (no message at all, but no files extracted) failed. What did I miss?

> Ad 4)
> Statistics from the last index:
> Index size: 906GB
> Arc-files: 99815 arc-files each of size 100MB

OK, so the ratio of raw data size to index size is roughly 10:1.
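
The ratio can be checked from the quoted figures with a few lines of arithmetic. This is only a rough sketch: it assumes decimal units (1 GB = 1000 MB) and that every ARC file is the full nominal 100 MB, neither of which the statistics state explicitly.

```python
# Figures taken from the quoted index statistics.
ARC_FILE_COUNT = 99815
ARC_FILE_SIZE_MB = 100   # nominal size per ARC file
INDEX_SIZE_GB = 906

MB_PER_GB = 1000         # assumption: decimal units

raw_size_gb = ARC_FILE_COUNT * ARC_FILE_SIZE_MB / MB_PER_GB
ratio = raw_size_gb / INDEX_SIZE_GB

print(f"raw data: {raw_size_gb:.1f} GB, raw-to-index ratio: {ratio:.1f}:1")
# prints "raw data: 9981.5 GB, raw-to-index ratio: 11.0:1"
```

So under these assumptions the index is about one eleventh of the raw data, which is consistent with the rough 10:1 estimate.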

The NetarchiveSuite 5.0-Milestone1 Release Notes say that "The changes to the project since 4.4 are internal to the NetarchiveSuite development", so, just out of curiosity, when might Heritrix 3.2 be integrated?

thanks and all the best

