[Netarchivesuite-users] Comments on plans for bitpreservation

Kåre Fiedler Christiansen kfc at statsbiblioteket.dk
Thu Jan 3 10:10:34 CET 2008


On Thu, 2007-12-27 at 16:25 +0100, Lars Clausen wrote:
> Here's some comments on the plans for improved bitpreservation
> (http://netarchive.dk/suite/AssignmentGroupB2):
> 
> B.2.1 looks fine -- I think 8 man-days is more than is needed if it
> really is mostly refactoring.

It is, but I stand by my estimate.

The thing is that unit tests are not very easy to refactor. At best they
are unsuited to testing the new functionality; more probably they will
need to be rewritten altogether.

That, combined with the fact that a somewhat extensive review will be
necessary after the refactoring, means that I think we should be
careful about underestimating this task.

> B.2.2 is rather fuzzy on the database design.  Firstly, I'll give the
> tables some names so I can talk about them: The first one is Filenames,
> the second is Checks, the third is Uploadstatus.  I might be misnaming
> these based on the issues below.
> 
> The Checks table can contain entries with null checksum, can there in
> that case be more than one entry for a file in a given bitarchive?  Or
> does the checksum-less entry overwrite any older ones?  Or are old
> checks simply never removed?
> 
> The Uploadstatus table is poorly defined -- what exactly is
> admindatastate?  If it's supposed to be an ArcRepositoryEntry, then
> there's duplicated information, and besides it's not a simple type, but
> has information for each location.  If it represents the StoreState,
> there should be also one per bitarchive per file.  

You are right, the third table should definitely store StoreStates, and
thus contain the bitarchive in the table as well. This may mean that the
intended logic isn't as simple as I've described.
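
A minimal sketch of what the corrected third table might hold, assuming
one StoreState per file per bitarchive. The class name, the enum values
and the method names below are hypothetical placeholders for
illustration, not the actual NetarchiveSuite design:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical placeholder values; the real StoreState may differ.
enum StoreState { UPLOAD_STARTED, UPLOAD_COMPLETED, UPLOAD_FAILED }

// Hypothetical in-memory stand-in for the third table:
// one state per (filename, bitarchive) pair.
class UploadStatusTable {
    // filename -> (bitarchive name -> state)
    private final Map<String, Map<String, StoreState>> states = new HashMap<>();

    // Records the state of one file in one bitarchive,
    // replacing any earlier state for that same pair.
    void setState(String filename, String bitarchive, StoreState state) {
        states.computeIfAbsent(filename, f -> new HashMap<>())
              .put(bitarchive, state);
    }

    // Returns the state for the pair, or null if nothing is recorded.
    StoreState getState(String filename, String bitarchive) {
        Map<String, StoreState> perArchive = states.get(filename);
        return perArchive == null ? null : perArchive.get(bitarchive);
    }
}
```

With this shape, the same file can be fully uploaded in one bitarchive
and failed in another, which a single state per file cannot express.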

> This task is definitely more than 6 man-days, since it includes moving
> DBConnect to common and setting up a new backup timer (I'm tempted to
> call this the time to abandon the embedded DB for production, simply to
> avoid the backup issues, and let the embedded DB be only for QuickStart
> and tests).

I don't see that as any major task; it's mostly just renaming the
classes.

However, the choice of embedded or external DB can still be left open.
In our Danish deployment the database will still only be accessed from
one location, which is the web interface application.

I do agree that 6 md is probably underestimated (as are the rest of the
tasks in this group), but not due to this particular part of the task.

> One issue that has had me worried is how exactly the date of a check is
> given.  It's preferable that each operation stored (missing files check,
> checksum check etc) has the same date, but should it be the date of the
> start of the operation or the date of adding the data?  In any case, it
> should be clear that the date in addChecksumInformation and
> addFileListInformation should be constant per invocation.

Agreed. I'm not sure it matters if one or the other is chosen for dates.
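
A rough sketch of "constant per invocation", assuming the date is taken
once when the data is added. Only the method name
addChecksumInformation comes from the assignment; the signature and the
CheckRecord and ChecksStore classes are hypothetical:

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Map;

// Hypothetical row of the Checks table.
class CheckRecord {
    final String filename;
    final String checksum;
    final Date date;

    CheckRecord(String filename, String checksum, Date date) {
        this.filename = filename;
        this.checksum = checksum;
        this.date = date;
    }
}

// Hypothetical in-memory stand-in for the Checks table.
class ChecksStore {
    private final List<CheckRecord> records = new ArrayList<>();

    // Stamps every entry of one invocation with the same date.
    void addChecksumInformation(Map<String, String> checksums) {
        Date stamp = new Date(); // taken once, before the loop
        for (Map.Entry<String, String> e : checksums.entrySet()) {
            records.add(new CheckRecord(e.getKey(), e.getValue(), stamp));
        }
    }

    List<CheckRecord> getRecords() {
        return records;
    }
}
```

Using the start of the operation instead would only mean passing the
date in as a parameter rather than creating it inside the method.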

> What should addAdmindataInformation do if there's already the same or
> similar admin data there?  Still not clear on the representation here.
> Also, it would appear that you can only add all of admindata at once,
> rather than make incremental updates.  Whether or not old data is
> overridden in addAdmindataInformation, it seems like an inefficient
> approach for small additions.

It is intended to overwrite old data.

Inefficient? Well, yes, possibly. But remember this is before we add
segment information. Everything is inefficient at this stage... Once we
add segments to the methods, it will be better. Besides, we should
always be sure to have the newest admin data: what if something changed
in it?
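
A small sketch of the overwrite behaviour, assuming "overwrite" means
that each call replaces the whole previous snapshot. Only the method
name addAdmindataInformation comes from the assignment; the class, the
signature and the use of plain strings for admin data are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the stored admin data.
class AdminDataStore {
    // filename -> admin data entry (a plain string here for brevity)
    private Map<String, String> entries = new HashMap<>();

    // Replaces the complete previous snapshot with the new one,
    // so entries absent from newEntries disappear.
    void addAdmindataInformation(Map<String, String> newEntries) {
        entries = new HashMap<>(newEntries);
    }

    // Returns the stored entry, or null if the file is unknown.
    String getEntry(String filename) {
        return entries.get(filename);
    }

    int size() {
        return entries.size();
    }
}
```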

> The Javadoc for getWrongFilesInLastUpdate has a cut-n-paste error in the
> first sentence.

Whoops.

> The Javadoc for getMissingFilesInLastUpdate and
> getWrongFilesInLastUpdate talk about a 'last known update date' --
> should these be different dates for missing-files updates and checksum
> updates?  I assume from the presence of
> getDateOfLast{Missing,Wrong}FilesUpdate that they should, but it should
> be specified.

Yes, and agreed, it should.

> I get the feeling that there's some assumptions especially about admin
> data that are not clearly stated.

Possibly. I assumed that the assignment is not intended to specify every
painstakingly small detail; some of the clarifications are left to the
implementor.

> In B.2.3 I have only one question -- why does the segment URL include an
> appid?  Isn't host and directory enough to identify the files?
> Including appid means it's difficult later to give a bitarchive a
> different appid.

Because you might have more than one BitarchiveApplication on the same
host. Specifically, we do have two bitarchive applications on one host
in our own deployment.

> I agree with your solution of sending off a FileListJob to update admin
> data, but wonder if it shouldn't be something that can be done later, in
> case admin data is lost?  If so, it can be decoupled from reading old
> versions and simply be a way to manually update admin data with segment
> information.

Agreed.

> I think this will take more than 4 man-days -- it touches on many parts
> of the system, and includes a tricky rewrite of the DAO methods.

Agreed.

> I'll save B.2.4 and B.2.5 for later.

Good idea. :-) They will probably change slightly with the lessons
learned from B1-B3.

Best,
  Kåre



