[Netarchivesuite-devel] Integration of Netarchive suite with Permission management system

Fri Oct 29 10:19:37 CEST 2010

Hi Adam.
For more insight into how NAS itself uses the API, have a look at the Java classes in the dk.netarkivet.harvester.webinterface package, which are used by the JSP pages in the webpages/HarvestDefinition directory: http://kb-prod-udv-001.kb.dk/netarchivesuite/maven/docs/xref/dk/netarkivet/harvester/webinterface/package-summary.html

Best Regards
Søren

-----Original message-----
From: netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk] On behalf of Adam Brokeš
Sent: 27 October 2010 17:17
To: netarchivesuite-devel at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-devel] Integration of Netarchive suite with Permission management system

Dear all,

I work at the National Library of the Czech Republic as a crawl engineer and developer. My new task is to develop a connection between our permission management system and Heritrix. The permission management system (WAdmin) is written in PHP and uses a MySQL database to store data about resources, seeds, publishers, contracts and so on. The new system should automate (or at least prepare) the following kinds of crawls:

1] resources with contracts - these domains were selected by our curators and we have permission from their publishers to provide access. Each resource is assigned a crawl frequency (1, 2, 6 or 12 months) and the workflow is: every month generate a seedlist, crawl the list with a profile in Heritrix, and then index the new ARCs. That means all resources in one batch have the same crawl date.
2] thematic crawls - ad hoc crawls based on events (using a specific profile)
3] domain crawls (2 per year) - before every harvest we get a list of all domains from our national domain registry (it contains a lot of additional data, so we do some sorting and text processing); the cleaned list is used as the seedlist with a prepared profile (which is tuned with every new version of Heritrix)
4] test crawls - curators sometimes encounter a page that could cause problems for Heritrix or Wayback, so we do a test crawl to explore possible pitfalls before we try to contact the publisher
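
The monthly batch in 1] could be sketched roughly as follows. This is only an illustration under assumptions: the Resource class, its fields (in particular first_crawl) and the function names are hypothetical and do not come from WAdmin or NetarchiveSuite, and a resource is assumed to be due whenever the number of months since its first crawl is a multiple of its frequency.

```python
from dataclasses import dataclass
from datetime import date

# Crawl frequencies in months, as listed in the message.
ALLOWED_FREQUENCIES = (1, 2, 6, 12)


@dataclass
class Resource:
    """Hypothetical stand-in for a contracted resource stored in WAdmin."""
    seed_url: str
    frequency_months: int
    first_crawl: date  # assumed field: month of the first crawl


def months_between(start: date, current: date) -> int:
    """Whole calendar months from start to current (days are ignored)."""
    return (current.year - start.year) * 12 + (current.month - start.month)


def is_due(resource: Resource, batch_month: date) -> bool:
    """True if the resource belongs in the batch for batch_month."""
    if resource.frequency_months not in ALLOWED_FREQUENCIES:
        raise ValueError(f"unsupported frequency: {resource.frequency_months}")
    elapsed = months_between(resource.first_crawl, batch_month)
    return elapsed >= 0 and elapsed % resource.frequency_months == 0


def monthly_seedlist(resources, batch_month: date) -> list:
    """Sorted, de-duplicated seed URLs of all resources due this month."""
    return sorted({r.seed_url for r in resources if is_due(r, batch_month)})
```

Writing the returned list out one URL per line would give a seed file for the batch, so every resource in it ends up with the same crawl date.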

Besides that, I would like to integrate a few modules (statistics, our quality assurance workflow and a link extractor). I decided to write the system in the Grails framework, which is programmed in Groovy and makes it easy to use the whole Java environment and its classes.

I took into consideration two possibilities:
1] using Heritrix 3 and its RESTful methods to control it, or integrating Heritrix more tightly into my system. I am just experimenting with this; the approach doesn't seem hard, but I would need to develop some parts that already exist in NetarchiveSuite (data model, control of Heritrix)
2] integrating NetarchiveSuite into our workflow, which means developing my system more as middleware between WAdmin and NS. Here I am quite unsure how much time I would need. I am writing to this list to get some idea; if any of the NS experts could point me in the right direction and estimate what the integration would take, I would be really pleased and it would help me a lot.
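
To give an idea of what option 1] involves, here is a minimal sketch of preparing a job action for the Heritrix 3 REST interface. It only builds the HTTP request and does not send it. The base URL and the job name are placeholders, the helper name job_action_request is made up for illustration, and the action list is my understanding of the job actions Heritrix 3 accepts, so please check it against the version you run.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Job actions as I understand the Heritrix 3 REST interface (verify per version).
JOB_ACTIONS = ("build", "launch", "pause", "unpause", "terminate", "teardown")


def job_action_request(base_url: str, job_name: str, action: str) -> Request:
    """Build (but do not send) a POST that applies an action to a crawl job."""
    if action not in JOB_ACTIONS:
        raise ValueError(f"unknown job action: {action}")
    # Job resources are assumed to live under <base>/engine/job/<job name>.
    url = f"{base_url.rstrip('/')}/engine/job/{quote(job_name, safe='')}"
    body = urlencode({"action": action}).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Accept": "application/xml"},
    )


# Placeholder engine address and job name, for illustration only.
request = job_action_request("https://localhost:8443", "monthly-2010-10", "launch")
```

Actually sending the request would additionally need HTTP digest authentication (for example urllib.request.HTTPDigestAuthHandler) and, with the default self-signed certificate, an SSL context that accepts it. The parts that NetarchiveSuite already provides (data model, scheduling, job bookkeeping) would still have to be written around calls like this.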

For the last few days I have been experimenting with and exploring both solutions, but I am still undecided. I wonder whether Heritrix alone is too simple and NS too complex for our needs :)

The attached component diagram shows how the connection could look.

One last question: I expect that we are not alone in this situation (some libraries have the same policy as us: collect everything and give access based on permission). I discussed this with Bjarne Andersen in Singapore, but last time in Vienna I was told that it had been put on hold. Just curious, has anybody tried to develop something like this?

Thank you very much, and keep up the high quality standard of your product! :)

Best regards,

Adam Brokes
--
http://en.webarchiv.cz
http://brokes.net
brokes at webarchiv.cz