[Netarchivesuite-users] Some Questions
Mikis Seth Sørensen
mss at statsbiblioteket.dk
Mon May 5 11:50:18 CEST 2014
Please find the answers to your questions below:
On 5/3/14 12:06 PM, "Peter M" <imagenoise at aol.com> wrote:
>Dear Netarchive Suite users,
>I have a few questions and would be very thankful for some help.
>I'm running the Quickstart installation of the 4.2 suite.
>1) How can I delete a selective harvest from the list?
>I can delete the domains, not the harvest.
NetarchiveSuite doesn't currently support deletion of harvest definitions,
and we are a bit wary of adding this feature; see NAS-1967
<https://sbforge.org/jira/browse/NAS-1967> ("Make it possible to delete
harvest definitions from the harvest database").
>2) Is it only possible to add complete domains or subdomains, and not subdirectories?
>If I, e.g., want to add "facebook.com/ladygaga" I get:
>"The following domains are illegal and cannot be added".
Yes, it is only possible to add complete domains in the general domain
listing. Subdomains and subdirectories are handled through seed lists,
which are defined either on domains or on harvest definitions.
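A seed list is in essence just a list of URLs, one per line, so part of a
domain can be targeted by listing the relevant URL as a seed (an
illustrative example, not taken from an actual configuration):

```
http://www.facebook.com/ladygaga
```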
>3) How do I run several jobs at the same time?
>In my setup (single machine) I very often get selective harvests with
>only one domain. With Heritrix being very polite/conservative (just 1
>thread per page), even with a non-fancy server and a slow connection I
>have lots of resources left. Running several instances of Heritrix at the
>same time to harvest several jobs would save a lot of time.
This is handled by starting more HarvestControllerApplication instances.
>4) You wrote in 2011 "Only a very limited number of researchers are
>currently using the Wayback access to the Danish webarchives. The
>Viewerproxy is used for Curator access to the Archive." Is that still
>the case? So how do researchers access different versions of a URL
>which has been harvested at different dates? Do they have access to the
>Harveststatus-jobdetails.jsp to press manually "Select this job for QA"?
A better wording would have been:
Only a very limited number of researchers are using the Danish
webarchives. They use Wayback to access the archived web content.
So, because of the current legislative restrictions, only a very limited
set of users can access the Danish webarchives; they have to be
researchers to be allowed to do so, and they all use Wayback.
Viewerproxy access is only available to harvest administrators, e.g.
curator super users.
>5) Your integration of fulltext search with Solr seems to have been
>successful, are you going to publish a HOWTO or make the wiki
>public? Are you using Solr Cell/ExtractingRequestHandler or custom code?
Here is the (lengthy) answer from one of our fulltext-search developers:
The short answer is custom code. I will give you a brief overview of the
software and hardware. If you have further questions you are welcome to
ask for more details.
1. WARC-INDEXER (ARC parsing)
For ARC parsing (with Tika) we use this third-party product.
It defines the Solr schema and can index a single ARC file
into a Solr server using that schema.
We have changed the schema to use doc-values (on disk) for the 6 fields
that we want to use for faceting/grouping (and defined those fields as
single-valued, as this was missing in the Solr schema and is required for
doc-values).
2. ARCHON (ARC bookkeeping application)
A very simple custom web application that keeps track of ARC files. An
ARC file's status can be new, running, completed (and assigned a shard
ID), or rejected (failed parsing). The ARC files' locations and statuses
are persisted in a DB.
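The bookkeeping flow (new → running → completed/rejected, persisted in a
DB) can be sketched in a few lines. This is an illustrative Python/SQLite
sketch, not ARCHON's actual schema or API; all table, column, and function
names here are invented:

```python
import sqlite3

# Illustrative ARCHON-style bookkeeping; schema and names are hypothetical.
def init_db(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS arcfiles (
        path     TEXT PRIMARY KEY,             -- location of the ARC file
        status   TEXT NOT NULL DEFAULT 'new',  -- new/running/completed/rejected
        shard_id INTEGER                       -- assigned on completion
    )""")

def next_unprocessed(conn):
    """Hand one 'new' ARC file to a worker and mark it running."""
    row = conn.execute("SELECT path FROM arcfiles WHERE status = 'new' "
                       "ORDER BY path LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("UPDATE arcfiles SET status = 'running' WHERE path = ?", row)
    return row[0]

def report_completed(conn, path, shard_id):
    conn.execute("UPDATE arcfiles SET status = 'completed', shard_id = ? "
                 "WHERE path = ?", (shard_id, path))

def report_rejected(conn, path):
    # Parsing failed: keep the record, but never hand the file out again.
    conn.execute("UPDATE arcfiles SET status = 'rejected' WHERE path = ?",
                 (path,))

conn = sqlite3.connect(":memory:")
init_db(conn)
conn.executemany("INSERT INTO arcfiles (path) VALUES (?)",
                 [("a.arc",), ("b.arc",)])
p = next_unprocessed(conn)            # 'a.arc' is handed out, marked running
report_completed(conn, p, shard_id=3)
```

The real ARCHON is a web application, so the hand-out and report steps
would be HTTP requests rather than direct function calls.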
3. ARCTIKA (indexing job)
Another simple custom application (job) that requests unprocessed ARC
files from ARCHON and starts up a WARC-INDEXER (OS process) for each ARC
file. ARCTIKA defines a worker pool
that runs the WARC-Indexer jobs; we use 40 concurrent workers on the
index-builder machine. When a worker has finished, it reports the status
(completed, plus the shard ID) for that ARC file back to ARCHON.
When the index has reached a predefined size (1 TB, when optimized), we
stop indexing to that shard ID.
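The worker-pool pattern described above can be sketched as follows. This
is an illustrative Python sketch only; the real ARCTIKA is in the
netsearch repository, and the indexer command line here is a placeholder:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 40  # the index-builder machine runs 40 concurrent workers

def index_arc_file(arc_path, shard_id):
    # One OS process per ARC file; 'echo' stands in for the real
    # WARC-Indexer command line, which is not shown in the mail.
    result = subprocess.run(["echo", "indexing", arc_path],
                            capture_output=True, text=True)
    status = "completed" if result.returncode == 0 else "rejected"
    return arc_path, status, shard_id  # reported back to ARCHON

def run_pool(arc_files, shard_id):
    # Threads are enough here: each worker mostly waits on its subprocess.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(lambda p: index_arc_file(p, shard_id),
                             arc_files))

results = run_pool(["a.arc", "b.arc"], shard_id=1)
```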
The hardware is two machines:
1: Index-builder. 256 GB RAM, 48 CPUs. Runs the 3 software applications
described above and a Solr server. It is only responsible for building
index files, each 1 TB in size.
When optimizing the index, the Solr server needs 32 GB RAM. Completed
index files (1 TB each) are copied off and removed.
It takes 10 days to build a 1 TB optimized index with 40 workers.
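For scale, those figures imply the following rough throughput (a
back-of-envelope calculation using decimal units, not a measurement from
the mail):

```python
# Back-of-envelope from the figures above: 1 TB of optimized index
# produced in 10 days by 40 workers.
TB = 1e12                    # bytes (decimal terabyte)
index_bytes = 1 * TB
days = 10
workers = 40

bytes_per_day = index_bytes / days              # 100 GB of index per day
bytes_per_worker_day = bytes_per_day / workers  # 2.5 GB per worker per day
mb_per_second = bytes_per_day / 86400 / 1e6     # ~1.16 MB/s sustained output
```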
2: Search cluster.
Runs a SolrCloud cluster with 25 Solr servers (shards), each assigned 8 GB
of RAM and serving a 1 TB index. The 1 TB indexes are stored on separate
SSD disks, and this is the reason for the 1 TB index limit.
The cluster can resolve queries with facets in 1-2 seconds, and the SSD
disks are critical in reaching this performance level.
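The totals implied by that setup, by simple arithmetic on the figures
above:

```python
shards = 25              # SolrCloud Solr servers, one shard each
heap_gb_per_shard = 8    # RAM assigned per Solr server
index_tb_per_shard = 1   # each shard serves a 1 TB index on its own SSD

total_heap_gb = shards * heap_gb_per_shard    # 200 GB of Solr heap overall
total_index_tb = shards * index_tb_per_shard  # 25 TB of searchable index
```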
See https://github.com/netarchivesuite/netsearch for ARCHON and ARCTIKA
Thomas Egense <teg at statsbiblioteket.dk>
>( 6) You tested openwayback 2.0 beta, is it possible to access
>https-harvested sites? )
We haven't tested https support in OpenWayback yet; this was only a
regression test to see if OpenWayback would work in our current system. We
don't have any immediate plans to switch to OpenWayback either. First
Heritrix 3, then OpenWayback :-). Https access is also a major concern for
us, so we are eager to hear of a general solution for this.
>Thanks a lot and have a nice weekend!