[Netarchivesuite-devel] Integration of Netarchive suite with Permission management system

Thu Oct 28 18:42:20 CEST 2010

Hi Adam.
As I mentioned to you at the NAS tutorial in Vienna, integrating with NAS should be fairly tractable.

The whole state of the harvesting done by NAS is stored in a database. Currently MySQL (used by ONB), PostgreSQL (used by BNF), and Derby (used by netarkivet.dk, and the default database bundled with NetarchiveSuite) are all in use.

See the relevant system design documentation here:
http://netarkivet.dk/suite/System%20Design%203.12#System_Design_3.12.2BAC8-Detailed_Harvester_Description.Detailed_Harvester_Design_Description

The scheduler included with NAS is the mechanism that generates the harvest jobs (inserts entries into the jobs table) and submits them to a JMS queue, where they are picked up by a ready harvesting process.

These harvest jobs are based on harvest definitions. Each harvest definition defines a harvest of a number of domain configurations (a domain configuration specifies the order.xml for a specific domain, including the seed list) with a certain schedule.

Any harvest definition can be either active or inactive. In the latter case, it is ignored by the scheduler.
In the former case, the scheduler generates jobs according to the schedule assigned to the harvest definition.
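
As a rough illustration of the rule above (inactive definitions are ignored, active ones produce jobs when their schedule is due), here is a minimal, self-contained sketch. The classes HarvestDefinition and Job, the nextRun field and the generateJobs method are hypothetical stand-ins made up for this example; they are not the NetarchiveSuite classes, the JMS submission step is left out, and creating one job per domain configuration is a simplification.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative model only; these are NOT the NetarchiveSuite classes. */
class SchedulerSketch {

    /** Hypothetical stand-in for a harvest definition. */
    static final class HarvestDefinition {
        final String name;
        final boolean active;
        final Instant nextRun;                   // assumed: next time the schedule fires
        final List<String> domainConfigurations; // assumed: one entry per domain configuration

        HarvestDefinition(String name, boolean active, Instant nextRun,
                          List<String> domainConfigurations) {
            this.name = name;
            this.active = active;
            this.nextRun = nextRun;
            this.domainConfigurations = domainConfigurations;
        }
    }

    /** Hypothetical stand-in for an entry in the jobs table. */
    static final class Job {
        final String harvestName;
        final String domainConfiguration;

        Job(String harvestName, String domainConfiguration) {
            this.harvestName = harvestName;
            this.domainConfiguration = domainConfiguration;
        }
    }

    /** Skips inactive definitions and definitions whose schedule is not yet due. */
    static List<Job> generateJobs(List<HarvestDefinition> definitions, Instant now) {
        List<Job> jobs = new ArrayList<>();
        for (HarvestDefinition hd : definitions) {
            if (!hd.active || hd.nextRun.isAfter(now)) {
                continue; // ignored by the scheduler in this round
            }
            for (String cfg : hd.domainConfigurations) {
                jobs.add(new Job(hd.name, cfg)); // simplification: one job per configuration
            }
        }
        return jobs;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        List<HarvestDefinition> defs = Arrays.asList(
            new HarvestDefinition("monthly-contracts", true, now.minusSeconds(60),
                Arrays.asList("a.example/default", "b.example/default")),
            new HarvestDefinition("paused-thematic", false, now.minusSeconds(60),
                Arrays.asList("c.example/default")));
        for (Job job : generateJobs(defs, now)) {
            System.out.println(job.harvestName + " -> " + job.domainConfiguration);
        }
    }
}
```

In a real integration the equivalent state lives in the harvest database, so an external system would only need to create harvest definitions and toggle them between active and inactive, leaving job generation to the NAS scheduler.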

How to use the NetarchiveSuite Java API outside NetarchiveSuite
===============================================================

For most of our harvest tables, we have access classes (DAOs):

DomainDAO (administration of domains and their configurations)
HarvestDefinitionDAO (administration of harvest definitions: FullHarvest means snapshot harvest,
and PartialHarvest means selective harvest)
JobDAO      (administration of harvest jobs)
ScheduleDAO (administration of harvest schedules)
TemplateDAO (administration of heritrix templates)

http://netarchive.dk/apidocs/3.12.0/dk/netarkivet/harvester/datamodel/package-summary.html

For a simple example of how to use these DAO classes, look at the code for
the dk.netarkivet.harvester.tools.HarvestTemplateApplication class:

http://kb-prod-udv-001.kb.dk/netarchivesuite/maven/docs/xref/dk/netarkivet/harvester/tools/HarvestTemplateApplication.html

Furthermore, you need to tell your program where the harvest database is located by assigning a settings.xml
file to your program, and also where your NAS libraries are to be found.
I have attached an example of how we give this information to our NAS applications. Start and stop scripts of this kind are generated by our deploy program.
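
For orientation, here is a small sketch of how a standalone program might check that it has been given a settings file before touching any DAO class. It assumes the settings file is passed through the Java system property dk.netarkivet.settings.file (as in -Ddk.netarkivet.settings.file=/path/to/settings.xml, with the NAS jars on the classpath); treat that property name as an assumption and verify it against the attached script. The class SettingsLocator and its method are hypothetical helpers and not part of NetarchiveSuite.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper; not part of NetarchiveSuite. */
class SettingsLocator {

    /** Assumed name of the system property that points at settings.xml. */
    static final String SETTINGS_PROPERTY = "dk.netarkivet.settings.file";

    /** Returns the absolute path of the settings file, or fails with a clear message. */
    static Path resolveSettings(String propertyValue) {
        if (propertyValue == null || propertyValue.trim().isEmpty()) {
            throw new IllegalStateException(
                "No settings file given; start the JVM with -D" + SETTINGS_PROPERTY + "=<path>");
        }
        Path path = Paths.get(propertyValue).toAbsolutePath();
        if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
            throw new IllegalStateException("Settings file is not readable: " + path);
        }
        return path;
    }

    public static void main(String[] args) {
        Path settings = resolveSettings(System.getProperty(SETTINGS_PROPERTY));
        System.out.println("Using settings file " + settings);
    }
}
```

Failing early like this gives a clearer error than a database connection failure deep inside a DAO call.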

I hope this helps you on your way.

Best Regards

Søren V. Carlsen (QA of Netarchivesuite)

-----Original message-----
From: netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk] On behalf of Adam Brokeš
Sent: 27 October 2010 17:17
To: netarchivesuite-devel at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-devel] Integration of Netarchive suite with Permission management system

Dear all,

I work at the National Library of the Czech Republic as a crawl engineer and developer. My new task is to develop some kind of connection between our permission management system and Heritrix. The permission management system (WAdmin) is written in PHP and uses a MySQL database to store data about resources, seeds, publishers, contracts and so on. The new system should automate (or at least prepare) these kinds of crawls:

1] resources with contracts - these domains were selected by our curators and have permission from the publishers to provide access. Each resource has an assigned crawl frequency (1, 2, 6 or 12 months), and the workflow is: every month generate a seed list, crawl the list using a profile in Heritrix, and after that index the new ARCs. That means all resources in one batch have the same crawl date.
2] thematic crawls - irregular crawls based on events (using a specific profile)
3] domain crawls (2 per year) - before every harvest we get a list of all domains from our national domain registrar (it contains a lot of additional data, so we do some sorting and text processing); the cleaned list is used as the seed list with a prepared profile (which is tuned with every new version of Heritrix)
4] test crawls - curators sometimes encounter a page that could potentially cause problems for Heritrix or Wayback, so we do a test crawl to explore possible pitfalls before we try to contact the publisher

Besides that, I would like to integrate a few modules (statistics, our quality assurance workflow and a link extractor). I decided to write the system in the Grails framework, programmed in the Groovy language, which makes it easy to use the whole Java environment and its classes.

I have considered two possibilities:
1] use Heritrix 3 and its RESTful methods to control it, or integrate Heritrix more tightly into my system. I am just experimenting with that; this approach doesn't seem hard, but I would need to develop some parts that already exist in NetarchiveSuite (data model, controlling of Heritrix)
2] integrate NetarchiveSuite into our workflow, which means developing my system more as middleware between WAdmin and NS. Here I am quite unsure how much time I would need. I am writing to this list to get some idea; if any of the NS experts could point me in the right direction and estimate what the integration would take, I would be really pleased, and it would help me a lot.

For the last few days I have been experimenting with and exploring both solutions, but I am still undecided whether Heritrix is too simple and NS too complex for our needs :)

Attached is a component diagram showing how the connection could look.

One last question: I expect that we are not alone in this situation (some libraries have the same policy as we do: collect everything and give access based on permission). I discussed this with Bjarne Andersen in Singapore, but last time, in Vienna, I was told that this had been put on hold. Just curious: has anybody tried to develop something like this?

Thank you very much, and keep up that high quality standard in your product! :)

Best regards,

Adam Brokes
--
http://en.webarchiv.cz
http://brokes.net
brokes at webarchiv.cz
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: starting_NAS.sh.txt
URL: <http://ml.sbforge.org/pipermail/netarchivesuite-devel/attachments/20101028/1d7ae5b2/attachment-0002.txt>