[Netarchivesuite-devel] Integration of Netarchive suite with Permission management system

Adam Brokeš adam.brokes at gmail.com
Wed Oct 27 17:17:10 CEST 2010


Dear all,

I work at the National Library of the Czech Republic as a crawl
engineer and developer. My new task is to develop a connection
between our permission management system and Heritrix. The
permission management system (WAdmin) is written in PHP and uses a
MySQL database to store data about resources, seeds, publishers,
contracts and so on. The new system should automate (or at least
prepare) the following kinds of crawls:

1] resources with contracts - these domains were selected by our
curators and have permission from their publishers for us to provide
access. Each resource has an assigned crawl frequency (1, 2, 6 or 12
months) and the workflow is: every month generate a seed list, crawl
it using a profile in Heritrix, and then index the new ARCs. That
means all resources in one batch have the same crawl date.
2] thematic crawls - ad hoc crawls based on events (using a specific profile)
3] domain crawls (2 per year) - before every harvest we get a list of
all domains from our national domain registry (it contains a lot of
additional data, so we do some sorting and text processing); the
cleaned list is used as the seed list with a prepared profile (which
is tuned with every new version of Heritrix)
4] test crawls - curators sometimes encounter a page that could
cause problems for Heritrix or Wayback, so we do a test crawl to
explore possible pitfalls before we try to contact the publisher
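
A minimal sketch of the monthly seed list step in 1], in Python for
illustration only. The Resource fields, the rule in is_due (a resource
is due when the number of months since a chosen start month is a
multiple of its crawl frequency) and the example URLs are all
assumptions; real data would come from the WAdmin database.

```python
from dataclasses import dataclass

# Crawl frequencies in months, as listed in 1] above.
ALLOWED_FREQUENCIES = (1, 2, 6, 12)


@dataclass
class Resource:
    # Hypothetical record; real data would come from the WAdmin database.
    seed_url: str
    frequency_months: int
    has_contract: bool = True


def is_due(resource, months_since_start):
    """Assumed rule: due when the month counter is a multiple of the frequency."""
    if resource.frequency_months not in ALLOWED_FREQUENCIES:
        raise ValueError(f"unsupported frequency: {resource.frequency_months}")
    return months_since_start % resource.frequency_months == 0


def monthly_seedlist(resources, months_since_start):
    """Return the sorted, de-duplicated seeds for this month's batch."""
    seeds = {
        r.seed_url
        for r in resources
        if r.has_contract and is_due(r, months_since_start)
    }
    return sorted(seeds)


# Placeholder resources, one per supported frequency.
resources = [
    Resource("http://example.org/", 1),
    Resource("http://example.com/", 2),
    Resource("http://example.net/", 6),
    Resource("http://example.edu/", 12),
]

# One seed per line, the plain-text form a seed list file would take.
print("\n".join(monthly_seedlist(resources, 6)))
```

With this rule, month 6 selects the 1-, 2- and 6-month resources and
skips the yearly one, so every resource in the batch shares one crawl
date.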

Besides that, I would like to integrate a few modules (statistics, our
quality assurance workflow and a link extractor). I decided to write
the system in the Grails framework, programmed in the Groovy language,
which makes it easy to use the whole Java environment and its classes.

I have considered two possibilities:
1] using Heritrix 3 and its RESTful methods to control it, or
integrating Heritrix more tightly into my system. I am just
experimenting with that; this approach doesn't seem hard, but I would
need to develop some parts that already exist in NetarchiveSuite
(data model, controlling Heritrix)
2] integrating NetarchiveSuite into our workflow, which means
developing my system more as middleware between WAdmin and NS. Here I
am quite unsure how much time I would need. I am writing to this
list to get some idea; if any of the NS experts could point me in
the right direction and estimate what the integration would take, I
would be really pleased and it would help me a lot.
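
For option 1], controlling Heritrix 3 over its REST interface comes
down to POSTing an "action" form field to a job resource over HTTPS.
The sketch below, in Python for illustration only, builds such
requests without sending them. The host, port, job name and list of
actions are assumptions to be checked against the Heritrix 3 API
documentation for the installed version; a real client would also need
HTTP digest authentication and handling of the engine's self-signed
certificate.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumed values for illustration; adjust to the local Heritrix 3 setup.
ENGINE_URL = "https://localhost:8443/engine"
JOB_ACTIONS = (
    "build", "launch", "pause", "unpause",
    "checkpoint", "terminate", "teardown",
)


def job_action_request(job_name, action, engine_url=ENGINE_URL):
    """Build (but do not send) a POST request applying an action to a job."""
    if action not in JOB_ACTIONS:
        raise ValueError(f"unknown job action: {action}")
    url = f"{engine_url}/job/{quote(job_name, safe='')}"
    body = urlencode({"action": action}).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Accept": "application/xml"},
    )


# A typical start sequence: build the job, launch it, then unpause it.
# "monthly-contracts" is a hypothetical job name.
for action in ("build", "launch", "unpause"):
    req = job_action_request("monthly-contracts", action)
    print(req.get_method(), req.full_url, req.data.decode("ascii"))
```

This covers only the "controlling of heritrix" part; the data model
that NetarchiveSuite already provides would still have to be written
separately.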

For the last few days I have been experimenting with and exploring
both solutions, but I am still undecided whether Heritrix isn't too
simple and NS too complex for our needs :)

Attached is a component diagram of how the connection could look.

Last question: I expect that we are not alone in this situation (some
libraries have the same policy as us - collect everything and give
access based on permission). I discussed this with Bjarne Andersen in
Singapore, but last time in Vienna I was told that this was put on
hold. Just curious, has anybody tried to develop something like this?

Thank you very much, and keep up the high quality standard of your product! :)

Best regards,

Adam Brokes
--
http://en.webarchiv.cz
http://brokes.net
brokes at webarchiv.cz
-------------- next part --------------
A non-text attachment was scrubbed...
Name: component-NSaWAH.png
Type: image/png
Size: 24205 bytes
Desc: not available
URL: <http://ml.sbforge.org/pipermail/netarchivesuite-devel/attachments/20101027/baeda765/attachment-0002.png>

