[Netarchivesuite-devel] [archive-crawler] Heritrix 3.0.0-beta test release now available

Søren Vejrup Carlsen svc at kb.dk
Mon Oct 19 14:06:56 CEST 2009


FYI

-----Original Message-----
From: archive-crawler at yahoogroups.com [mailto:archive-crawler at yahoogroups.com] On behalf of Gordon Mohr
Sent: 7 October 2009 18:35
To: archive-crawler at yahoogroups.com
Subject: [archive-crawler] Heritrix 3.0.0-beta test release now available

A new test release of Heritrix 3.0 is now available, version 3.0.0-beta.

We encourage expert Heritrix users curious about the new version or
willing to help with testing to try this beta release and share feedback.

Full information on obtaining and running this release is available on
the project wiki:

http://webarchive.jira.com/wiki/display/Heritrix/Heritrix3

== What's New in this Beta Release ==

* a first version of a tool to help users migrate their Heritrix 1.X
order.xml configurations to the Heritrix 3 configuration system

* synchronization bottlenecks affecting large crawls in the frontier and
major data structures have been removed

* an 'action' directory allows URIs and configuration/reporting scripts
to be passed to a running crawl via the filesystem

* a 'parallelQueues' setting allows multiple connections to a single
host (a new version of a feature that was once called 'valence')

* with the bundled model configuration, seed list text may include '-'
directives to exclude by SURT prefix like the traditional '+' directives
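
To make the 'action' directory and the '+'/'-' seed directives a little
more concrete, here is a minimal Python sketch. It is not taken from
Heritrix itself. The helper names (split_seed_text, drop_file), the '#'
comment handling, the example file name and the write-then-rename
precaution are all assumptions made for illustration; see the project
wiki for the file names and directive syntax the crawler actually
accepts.

```python
import os
import tempfile


def split_seed_text(text):
    """Sort seed-list lines into plain seeds, '+' lines and '-' lines.

    Assumption: blank lines and lines starting with '#' are ignored.
    The text after '+' or '-' is returned untouched; no SURT conversion
    is attempted here.
    """
    seeds, includes, excludes = [], [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("+"):
            includes.append(line[1:].strip())
        elif line.startswith("-"):
            excludes.append(line[1:].strip())
        else:
            seeds.append(line)
    return seeds, includes, excludes


def drop_file(action_dir, filename, text):
    """Place a finished file in action_dir (hypothetical helper).

    The text is first written to a temporary file next to action_dir and
    then renamed into it, so a process watching the directory never sees
    a half-written file. This assumes both paths are on one filesystem.
    """
    parent = os.path.dirname(os.path.abspath(action_dir))
    fd, tmp_path = tempfile.mkstemp(dir=parent, prefix="pending-")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    final_path = os.path.join(action_dir, filename)
    os.replace(tmp_path, final_path)
    return final_path
```

A call such as drop_file("action", "extra.seeds", seed_text) would then
hand the text to a running crawl, assuming the job's action directory is
named "action" and that the crawler recognises that file name.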

== What's New in Heritrix 3 ==

Heritrix 3 has a new, Spring-based system for configuring and
instantiating/launching crawls. The Spring-originated XML configuration
metadata format is now our format for describing crawls, as well.
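
As a rough illustration of what a Spring-style bean definition looks
like and how a setting such as 'parallelQueues' could be read back out
of one, here is a small Python sketch. The bean id and class name in the
sample are hypothetical placeholders, not the names used by the bundled
Heritrix configuration; only the general <beans>/<bean>/<property>
layout and its namespace come from Spring.

```python
import xml.etree.ElementTree as ET

# Namespace used by Spring bean definition files.
BEANS_NS = "http://www.springframework.org/schema/beans"

# Hypothetical fragment: the bean id and class are made up for this sketch.
SAMPLE = """
<beans xmlns="http://www.springframework.org/schema/beans">
  <bean id="frontier" class="example.HypotheticalFrontier">
    <property name="parallelQueues" value="2"/>
  </bean>
</beans>
"""


def read_property(xml_text, bean_id, prop_name):
    """Return the 'value' attribute of one bean property, or None."""
    root = ET.fromstring(xml_text.strip())
    ns = {"b": BEANS_NS}
    for bean in root.findall("b:bean", ns):
        if bean.get("id") != bean_id:
            continue
        for prop in bean.findall("b:property", ns):
            if prop.get("name") == prop_name:
                return prop.get("value")
    return None
```

Only properties written with a plain value attribute are handled; nested
elements such as references or lists would need more code.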

The web-based user-interface in Heritrix 3 has been streamlined and
updated to have consistent URLs and simple forms for most actions,
including viewing and editing job files or running arbitrary script code
within the context of a job. Programmatic operations against the web
interface have replaced JMX as the preferred manner to remote-control
Heritrix.
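
For anyone wanting to script against the web interface, the following
Python sketch shows the general shape of such a call. It only builds the
request and does not send it. The base URL, the /job/<name> path, the
'action' form field and the 'launch' value are assumptions for
illustration and should be checked against the project wiki for this
beta; authentication and TLS handling are left out entirely.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_job_action(base_url, job_name, action):
    """Build (but do not send) a form POST asking for an action on a job.

    Assumptions: jobs live under <base_url>/job/<job_name> and the
    requested action travels in a form field named 'action'.
    """
    url = f"{base_url.rstrip('/')}/job/{quote(job_name)}"
    body = urlencode({"action": action}).encode("ascii")
    return Request(url, data=body, method="POST")


# Hypothetical usage; sending it would need urllib.request.urlopen plus
# whatever credentials the running instance requires.
request = build_job_action("https://localhost:8443/engine", "myjob", "launch")
```

Keeping request construction separate from sending makes it easy to
inspect or log exactly what would be posted before touching a live
crawler.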

Also, Heritrix 3 moves to a model where a single job, in a single job
directory, may be relaunched in place many times (instead of creating
a new job directory before each launch).

== Limitations ==

As a prerelease test version, this release still has known gaps in
functionality, interface, and documentation; we're working towards an
official 3.0 release before the end of 2009. The current prioritized
roster of issues to be addressed is viewable in the project issue
tracker.

Distribution packages (.tar.gz or .zip) may be downloaded directly from
our Maven2 repository:

http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix/3.0.0-beta/

As always, problem reports, ideas, fix/feature contributions, and other
kinds of feedback are all welcome here on the list and on the project
wiki and JIRA issue tracker:

Heritrix Wiki: http://webarchive.jira.com/wiki/display/Heritrix
Heritrix JIRA: http://webarchive.jira.com/browse/HER

Thanks!

- Gordon @ IA



