[Netarchivesuite-users] Upload errors

Tue Jun 3 09:53:50 CEST 2008

On Mon, 2008-06-02 at 16:08 +0200, aponb at gmx.at wrote:
> > Hi there.
> > It is difficult to answer your question without looking at the settings.xml the application is using.
> > This will tell us what kind of RemoteFile it is trying to create a singleton for (so we can better figure out what the problem is).
> > In the next stable release, this information will be in the log as well.
> > So please send us your settings.xml for the harvester application
> Actually I am using the startApp and startHarvestApp functions of your harvest Script, so the port will be assigned automatically.

Yes, that will make sure that each application gets unique ports.
However, the tool will need a unique port as well. This is probably what
fails. Possibly, we should generate shell scripts for running the tools
as well, so the ports are assigned for those too...
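
To illustrate the idea (this is only a sketch, not what the current
scripts do): every process started on a machine, tools included, would
draw its port from the same counter. The process names and the base
port below are made up for the example and are not NetarchiveSuite
settings.

```python
from itertools import count


def assign_ports(process_names, base_port=8100):
    """Give every process (applications and tools alike) its own port.

    base_port is an arbitrary example value, not a NetarchiveSuite default.
    """
    ports = count(base_port)
    return {name: next(ports) for name in process_names}


# Hypothetical process names, for illustration only.
processes = [
    "HarvestControllerApplication-1",
    "HarvestControllerApplication-2",
    "upload-tool",  # a command-line tool; it needs its own port too
]

assignment = assign_ports(processes)
for name, port in assignment.items():
    print(f"{name}: {port}")
```

The point is simply that the tool is handed a port by the same
mechanism as the applications, so it cannot collide with them.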

> But it is not a single script anymore, so the values must be initialized correctly at the beginning, so maybe there is a mistake. I will check.
> It's time to use more than one machine! Do you have any suggestions when using four machines?

It really depends on your needs.

There are three things that are reasonable to distribute:

1) The archive

There are two ways to distribute the archive:
a) Replicating the archive at two (or more) locations
   You should choose this option if you need greater "bit security".
   The system gives you the option to actively check that the two
   replicas are in sync.
b) Distributing one archive location across several machines
   You should choose this option if you need more storage than you can
   access on one machine, or if you need more CPU power per megabyte in
   your archive.

2) The harvesters

Distributing the harvesters is the most obvious choice. Spreading them
over more dedicated machines gives each harvester more CPU power and
RAM.

If you are running several simultaneous harvests, you need more than
one harvester application. A dedicated harvesting machine can easily
handle more than one harvester application.

3) The viewerproxy (QA and access)

The viewerproxy application is currently mostly suited for QA, since
choosing which index is used (i.e. which parts of the archive can
currently be browsed) sets the index for all users of that
application.

Thus, to have different people accessing the archive, you need a
separate viewerproxy application for each person, or they will need to
coordinate which index is currently active.

However, again, several instances can easily be started on the same
computer.

In our installation we have chosen the following distribution:

The archive:

We have two replicated locations, geographically far apart. Each
BitarchiveApplication is running on a server with no other part of
NetarchiveSuite running, and those machines have heavy access
restrictions, to avoid any accidental loss of bits.

We have two different strategies for the two locations:
One location has lots of machines with relatively little storage per
server. This gives lots of CPU power on that location.
The other location has only one server with lots of storage mounted.

The BitarchiveMonitorApplications (one for each location) and the
ArcRepositoryServer are all running on the same machine, which we call
the "admin-machine".

The harvesters:

We have a number of dedicated harvester machines, each with four
HarvestControllerApplications running. One of the four on each machine
is selected for domain-wide snapshot harvests.

The access machines:

We have two servers in use: one for QA and one for end-user access.
Each of these is running a ViewerProxyApplication per user (we have
relatively few users; I think the number on each server is around 8,
Bjarne will know more).

On one of the machines we are also running the IndexServerApplication.

General:

Finally, the GUIApplication, which controls everything and hosts the
Scheduler, is on the same machine as the ArcRepository and the
BitarchiveMonitors (the admin-machine).

As for a recommendation:

Since I assume all four machines are currently in one geographic
location, I would suggest the following layout:

1) A bitarchive machine
Running the BitarchiveApplication. Try to restrict shell access to this
machine as much as possible.
2) A harvester machine
Running four HarvestControllerApplications. If you use the
snapshot-harvest functionality, let one have "LOWPRIORITY", to take
those jobs.
3) An access machine
Running the index server and a viewerproxy.
4) An admin machine
Running the BitarchiveMonitorApplication, the ArcRepositoryApplication,
and the GUIApplication.
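
If it helps to see it at a glance, the layout above can be written
down as a simple mapping. This is just a sketch for planning purposes:
the machine names are made up, and it is not a NetarchiveSuite
configuration format. The application names are the ones used in this
mail.

```python
# Illustrative only: machine names are invented, and this is not a
# settings.xml or deploy format.
layout = {
    "bitarchive-machine": ["BitarchiveApplication"],
    "harvester-machine": ["HarvestControllerApplication"] * 4,
    "access-machine": ["IndexServerApplication", "ViewerProxyApplication"],
    "admin-machine": [
        "BitarchiveMonitorApplication",
        "ArcRepositoryApplication",
        "GUIApplication",
    ],
}


def machines_running(app, layout):
    """Return the sorted names of the machines that run the given application."""
    return sorted(m for m, apps in layout.items() if app in apps)


# The bitarchive machine should run nothing but the bitarchive.
assert layout["bitarchive-machine"] == ["BitarchiveApplication"]

for machine, apps in layout.items():
    print(machine, "->", ", ".join(apps))
```

Writing it down like this makes it easy to check that nothing else
ends up on the bitarchive machine as the installation grows.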

In planning ahead, you need to consider whether you need a replicated
archive at another location. If so, get a server elsewhere with storage
capabilities, and start a BitarchiveApplication there. If that happens,
I think we may need to open a feature request on upgrading a current
archive to more locations, as that is currently not supported and will
need a little work to get up and running.

Also, expect that you may need more harvester machines.

Enjoy, and don't hesitate to ask again.

Best,
  Kåre
-- 
Kaare Fiedler Christiansen - NetarchiveSuite-developer
THE STATE AND UNIVERSITY LIBRARY, 
Universitetsparken 1, 8000 Aarhus C, Denmark.
Phone: +45 89462036