[Netarchivesuite-users] Programmatic access to NAS, incremental harvesting and documentation

Mikis Seth Sørensen mss at statsbiblioteket.dk
Mon Feb 9 11:25:45 CET 2015


Hi Peep

There is currently no API access to NAS. So if you would like to add new
domains and seeds, you need to modify the DB directly, either by using a
DB client app or by writing custom code for this. You can have a look at
DomainDBDAO.java
<https://github.com/netarchivesuite/netarchivesuite/blob/638c4ce7a1eca0b3e8790fdd28615f56c0e92b67/harvester/harvester-core/src/main/java/dk/netarkivet/harvester/datamodel/DomainDBDAO.java>
for inspiration, or use the DomainDBDAO directly in your code. An
alternative is to access the web GUI programmatically. We do this in our
system test using Selenium; see the system-test module for inspiration
(https://github.com/netarchivesuite/netarchivesuite/tree/master/integration-test/system-test/src/test/java/dk/netarkivet/systemtest/functional).

The NAS branch with Heritrix 3 is not stable yet. The current roadmap is:
1. Early March: Alpha release with limited ability to configure and
run a harvest.
2. Early April: Beta release able to perform harvests with correctly
running jobs, WARC files and archiving.
3. Early June: 5.0 release after testing at institutions. Initial template
migration finished.

So the hope is to have an alpha release at the beginning of March. I
would recommend waiting for the beta release before trying any real
harvests; the alpha release is just for initial sanity testing.


Best
Mikis
 

On 2/9/15, 1:18 AM, "Peep Küngas" <peep.kungas at soatrader.com> wrote:

>Hello Mikis Seth Sørensen,
>
>It was a pleasure to meet you in Tallinn. Unfortunately I could not
>participate in the development meetings. Anyway, I have some questions,
>which I hope you can briefly explain.
>
>Namely, I am wondering whether there is programmatic access to NAS (i.e.
>for adding new domains and seeds once they have been found from external
>sources)? In our case we would like to add new domains/seeds to the
>harvesting jobs as soon as we discover them, both on the Web and in
>other data sources.
>
>Additionally, what is your opinion on the stability of NAS with Heritrix
>3? There is a build, NetarchiveSuite-H3, at
>https://sbforge.org/jenkins/view/NetarchiveSuite/ with 98% of tests
>successfully executed. We are setting up a crawler at the moment to
>create a snapshot of the shallow part of the Estonian Web (estimated size
>is 10 TB), with plans to start processing it in the 2nd quarter of 2015.
>Therefore I am wondering whether we could opt for the release with
>Heritrix 3 right now.
>
>Best regards
>Peep Küngas



