[Netarchivesuite-users] Comments on plans for bitpreservation
Lars Clausen
lc at statsbiblioteket.dk
Thu Dec 27 16:25:28 CET 2007
Here are some comments on the plans for improved bitpreservation
(http://netarchive.dk/suite/AssignmentGroupB2):
B.2.1 looks fine -- I think 8 man-days is more than is needed if it
really is mostly refactoring.
B.2.2 is rather fuzzy on the database design. Firstly, I'll give the
tables some names so I can talk about them: The first one is Filenames,
the second is Checks, the third is Uploadstatus. I might be misnaming
these based on the issues below.
The Checks table can contain entries with a null checksum. In that
case, can there be more than one entry for a file in a given
bitarchive? Or does the checksum-less entry overwrite any older ones?
Or are old checks simply never removed?
The Uploadstatus table is poorly defined -- what exactly is
admindatastate? If it's supposed to be an ArcRepositoryEntry, then
there's duplicated information, and besides it's not a simple type, but
has information for each location. If it represents the StoreState,
there should also be one entry per bitarchive per file.
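To make the "one per bitarchive per file" reading concrete, here is a
rough sketch of what I have in mind. Everything in it is made up for
illustration (the class, the UploadState values, the method names); it
is not the design from the assignment page, just one way the state
could be keyed on (filename, bitarchive) instead of on the file alone.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch: upload state stored per (file, bitarchive) pair. */
class UploadStatusSketch {
    /** Illustrative states only; the real StoreState values may differ. */
    enum UploadState { STARTED, COMPLETED, FAILED }

    /** Composite key: one entry per file per bitarchive. */
    static final class Key {
        final String filename;
        final String bitarchive;

        Key(String filename, String bitarchive) {
            this.filename = filename;
            this.bitarchive = bitarchive;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return filename.equals(other.filename)
                    && bitarchive.equals(other.bitarchive);
        }

        @Override
        public int hashCode() {
            return Objects.hash(filename, bitarchive);
        }
    }

    private final Map<Key, UploadState> states = new HashMap<>();

    /** Sets the state for one file in one bitarchive, leaving others alone. */
    void setState(String filename, String bitarchive, UploadState state) {
        states.put(new Key(filename, bitarchive), state);
    }

    /** Returns the state, or null if nothing is recorded for this pair. */
    UploadState getState(String filename, String bitarchive) {
        return states.get(new Key(filename, bitarchive));
    }
}
```

With this shape, a single-entry update touches only the pair it
concerns, and a file can be COMPLETED in one bitarchive while still
FAILED in another.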
This task is definitely more than 6 man-days, since it includes moving
DBConnect to common and setting up a new backup timer (I'm tempted to
call this the time to abandon the embedded DB for production, simply to
avoid the backup issues, and let the embedded DB be only for QuickStart
and tests).
One issue that has had me worried is how exactly the date of a check is
given. It's preferable that each operation stored (missing files check,
checksum check etc) has the same date, but should it be the date of the
start of the operation or the date of adding the data? In any case, it
should be clear that the date in addChecksumInformation and
addFileListInformation should be constant per invocation.
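A minimal sketch of the "constant per invocation" idea, assuming the
date is passed in (or fixed once) by the caller. The signature, the
CheckRow class and its fields are my own invention for illustration;
only the method name addChecksumInformation comes from the plan.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: every row from one invocation shares one date. */
class ChecksumUpdateSketch {
    /** One stored check result; field names are illustrative only. */
    static final class CheckRow {
        final String filename;
        final String bitarchive;
        final String checksum; // may be null for a checksum-less entry
        final Instant checkDate;

        CheckRow(String filename, String bitarchive, String checksum,
                 Instant checkDate) {
            this.filename = filename;
            this.bitarchive = bitarchive;
            this.checksum = checksum;
            this.checkDate = checkDate;
        }
    }

    private final List<CheckRow> rows = new ArrayList<>();

    /**
     * Stores all checksums from one operation under a single date.
     * Whether that date is the start of the operation or the time of
     * insertion is the caller's decision; this method only guarantees
     * that it does not vary between rows.
     */
    void addChecksumInformation(String bitarchive,
                                Map<String, String> checksums,
                                Instant checkDate) {
        for (Map.Entry<String, String> entry : checksums.entrySet()) {
            rows.add(new CheckRow(entry.getKey(), bitarchive,
                    entry.getValue(), checkDate));
        }
    }

    List<CheckRow> rows() {
        return rows;
    }
}
```

The point is only that the timestamp is taken once, outside the loop,
rather than calling the clock for each row as it is written.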
What should addAdmindataInformation do if there's already the same or
similar admin data there? Still not clear on the representation here.
Also, it would appear that you can only add all of admindata at once,
rather than make incremental updates. Whether or not old data is
overridden in addAdmindataInformation, it seems like an inefficient
approach for small additions.
The Javadoc for getWrongFilesInLastUpdate has a cut-n-paste error in the
first sentence.
The Javadoc for getMissingFilesInLastUpdate and
getWrongFilesInLastUpdate talks about a 'last known update date' --
should these be different dates for missing-files updates and checksum
updates? I assume from the presence of
getDateOfLast{Missing,Wrong}FilesUpdate that they should, but it should
be specified.
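For what it's worth, this is how I read it: two independent dates, one
per kind of update. The sketch below is only my assumption of the
intent; the record* methods and the Optional return type are invented
here, and only the two getter names are taken from the plan.

```java
import java.time.Instant;
import java.util.Optional;

/** Hypothetical sketch: each kind of update keeps its own last date. */
class LastUpdateDatesSketch {
    private Instant lastMissingFilesUpdate; // null until a file-list update has run
    private Instant lastWrongFilesUpdate;   // null until a checksum update has run

    /** Called when a missing-files (file list) update has been stored. */
    void recordMissingFilesUpdate(Instant date) {
        lastMissingFilesUpdate = date;
    }

    /** Called when a wrong-files (checksum) update has been stored. */
    void recordWrongFilesUpdate(Instant date) {
        lastWrongFilesUpdate = date;
    }

    Optional<Instant> getDateOfLastMissingFilesUpdate() {
        return Optional.ofNullable(lastMissingFilesUpdate);
    }

    Optional<Instant> getDateOfLastWrongFilesUpdate() {
        return Optional.ofNullable(lastWrongFilesUpdate);
    }
}
```

If that is the intent, the Javadoc for each getXxxFilesInLastUpdate
method should say which of the two dates it refers to.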
I get the feeling that there are some assumptions, especially about
admin data, that are not clearly stated.
In B.2.3 I have only one question -- why does the segment URL include an
appid? Aren't host and directory enough to identify the files?
Including appid means it's difficult later to give a bitarchive a
different appid.
I agree with your solution of sending off a FileListJob to update admin
data, but wonder if it shouldn't be something that can be done later, in
case admin data is lost? If so, it can be decoupled from reading old
versions and simply be a way to manually update admin data with segment
information.
I think this will take more than 4 man-days -- it touches on many parts
of the system, and includes a tricky rewrite of the DAO methods.
I'll save B.2.4 and B.2.5 for later.
-Lars