Automated Metadata Post-Processing in the XMC Cat
Automated Metadata Post-Processing - What Is It?
Now that you have a catalog running, how do you populate it? One of the significant hurdles in
building an archive is getting the metadata. Others have identified the problem that if curating
data is left to the user, it often does not get done (this is somewhat analogous to documentation
in the realm of software development). Additionally, other researchers (Jim Gray et al.) have noted
that metadata is ephemeral - if you do not capture it when the data is generated, you cannot go back
later to get it when you need it. In the XMC Cat, we have tackled this issue by creating a hook for
domain-specific plugins that are run on files added to the metadata catalog. The
framework itself is domain independent and runs as a separate process, so it does not slow down the initial
insert of the metadata.
How it Works:
The XMC Cat has a framework that can run utilities that will process a file based on the distinguished name (DN) of
the user adding the file and the globally unique ID (GUID) of the file. As each file is added to the XMC Cat, its GUID
and the DN of its owner are added to a queue, and a separate set of threads runs each plugin for the GUID/DN pair.
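The queue-and-threads arrangement described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the XMC Cat implementation: the class name PostProcessQueueSketch, its methods, and the use of a fixed thread pool are all hypothetical. The point it shows is that adding a file only enqueues work and returns, while worker threads handle each GUID/DN pair later.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; the real service's classes and queue are not shown in this document.
final class PostProcessQueueSketch {
    // The pool's internal work queue plays the role of the GUID/DN queue.
    private final ExecutorService workers;
    private final List<String> processed = new CopyOnWriteArrayList<>();

    PostProcessQueueSketch(int threadCount) {
        this.workers = Executors.newFixedThreadPool(threadCount);
    }

    // Called when a file is inserted: queue the pair and return at once,
    // so the metadata insert itself is not delayed.
    void fileAdded(String dn, String guid) {
        workers.execute(() -> runPlugins(dn, guid));
    }

    // Placeholder for "run each registered plugin for this GUID/DN pair".
    private void runPlugins(String dn, String guid) {
        processed.add(guid + " for " + dn);
    }

    // Stops accepting new work, lets queued pairs finish, and reports what ran.
    List<String> shutdownAndWait() {
        workers.shutdown();
        try {
            workers.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }
}
```

Because the worker threads run independently, the order in which pairs finish is not guaranteed in this sketch.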
Each plugin program must have a static method named "process" with the following signature:
public static void process(String dn, String guid, String epr, PostProcessMap queueProcessor, PostProcessParameters parameters)
The values for each of these parameters are as follows:
| dn | distinguished name of the user |
| guid | global ID of the file |
| epr | endpoint of the XMC Cat metadata catalog |
| queueProcessor | PostProcessMap interface used to share name/value pairs among post-processors working on the same file. This interface has three methods: String getValue(String key); boolean hasName(String name); void setValue(String key, String value) |
| parameters | PostProcessParameters interface used by a post-processor class to access parameters set in the configuration file for that class as well as parameters set in the configuration for all post-processors. This interface has the following methods: String getParameter(String name); boolean hasParameter(String name) |
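As a minimal sketch of a plugin with this signature, consider the following. The two interfaces are re-declared locally only so the example compiles on its own; in practice they come from the catalog post-processor jar. The class OwnerTagPostProcessor, the parameter name "ownerPrefix", the keys "owner" and "lastGuid", and the two in-memory helper classes are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Local stand-ins that mirror the two interfaces described in the table above.
interface PostProcessMap {
    String getValue(String key);
    boolean hasName(String name);
    void setValue(String key, String value);
}

interface PostProcessParameters {
    String getParameter(String name);
    boolean hasParameter(String name);
}

// Hypothetical plugin: records the file's owner in the shared name/value pairs.
final class OwnerTagPostProcessor {
    public static void process(String dn, String guid, String epr,
                               PostProcessMap queueProcessor,
                               PostProcessParameters parameters) {
        // "ownerPrefix" is an illustrative parameter name, not one defined by XMC Cat.
        String prefix = parameters.hasParameter("ownerPrefix")
                ? parameters.getParameter("ownerPrefix")
                : "";
        queueProcessor.setValue("owner", prefix + dn);
        queueProcessor.setValue("lastGuid", guid);
    }
}

// Hypothetical in-memory implementations for trying a plugin outside the server.
final class InMemoryMap implements PostProcessMap {
    private final Map<String, String> values = new HashMap<>();
    public String getValue(String key) { return values.get(key); }
    public boolean hasName(String name) { return values.containsKey(name); }
    public void setValue(String key, String value) { values.put(key, value); }
}

final class InMemoryParameters implements PostProcessParameters {
    private final Map<String, String> values = new HashMap<>();
    InMemoryParameters with(String name, String value) {
        values.put(name, value);
        return this;
    }
    public String getParameter(String name) { return values.get(name); }
    public boolean hasParameter(String name) { return values.containsKey(name); }
}
```

A plugin written this way can be exercised in a unit test by passing the in-memory helpers in place of the objects the service would supply.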
The PostProcessMap interface allows
the utility programs to set and get name/value pairs for the file being processed. For example, if one utility
queries for the title element of the file (title is an element in the metadata schema used in LEAD), it can store
the result as a name/value pair so that subsequent utilities called to process the same file can reuse it.
Name/value pairs are not shared between different files.
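The reuse pattern can be sketched as below, assuming a local copy of the PostProcessMap interface so the example stands alone. The helper class SharedValueSketch, the key "title", the stubbed catalog lookup, and the in-memory map are hypothetical; a real utility would call something like titleFor from its process method and would query the catalog through the XMC Cat client instead of the stub.

```java
import java.util.HashMap;
import java.util.Map;

// Local stand-in mirroring the PostProcessMap interface described above.
interface PostProcessMap {
    String getValue(String key);
    boolean hasName(String name);
    void setValue(String key, String value);
}

final class SharedValueSketch {
    // Counts simulated catalog queries so the saving is visible.
    static int catalogQueries = 0;

    // Hypothetical stub standing in for a real metadata query.
    static String queryTitleFromCatalog(String guid) {
        catalogQueries++;
        return "Title of " + guid;
    }

    // Returns the title, querying only if no earlier utility stored it
    // in the name/value pairs for this file.
    static String titleFor(String guid, PostProcessMap shared) {
        if (!shared.hasName("title")) {
            shared.setValue("title", queryTitleFromCatalog(guid));
        }
        return shared.getValue("title");
    }
}

// Hypothetical in-memory map; one instance represents one file's pairs.
final class InMemoryMap implements PostProcessMap {
    private final Map<String, String> values = new HashMap<>();
    public String getValue(String key) { return values.get(key); }
    public boolean hasName(String name) { return values.containsKey(name); }
    public void setValue(String key, String value) { values.put(key, value); }
}
```

Two utilities that call titleFor with the same map trigger a single lookup, while a different file gets its own map and therefore its own lookup.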
The PostProcessParameters interface gets parameters set in the post-process.xml configuration file in the CatalogService/lib
directory on the server. Each class that is to be run as a post-processor has a section such as the following in this configuration file:
<processorClass name="edu.indiana.dde.mylead.postprocessors.NamelistPostProcessor">
<parameter name="someLocalParam1">A</parameter>
<parameter name="someLocalParam2">B</parameter>
</processorClass>
In this example, the process method in the edu.indiana.dde.mylead.postprocessors.NamelistPostProcessor class
could access the values of the parameters named someLocalParam1 and someLocalParam2 when processing a file.
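To make the shape of such a section concrete, here is a small sketch that reads the parameters for one class out of a configuration string using the JDK's DOM parser. It is not how the service itself loads the file, which this page does not describe. The class ProcessorConfigSketch is hypothetical, and because only the processorClass section is shown above, any enclosing root element in the input is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; illustrates the section layout only.
final class ProcessorConfigSketch {
    // Collects name -> value for every <parameter> inside the
    // <processorClass> whose name attribute equals className.
    static Map<String, String> parametersFor(String xml, String className) {
        Map<String, String> result = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList classes = doc.getElementsByTagName("processorClass");
            for (int i = 0; i < classes.getLength(); i++) {
                Element cls = (Element) classes.item(i);
                if (!className.equals(cls.getAttribute("name"))) {
                    continue; // section belongs to a different post-processor
                }
                NodeList params = cls.getElementsByTagName("parameter");
                for (int j = 0; j < params.getLength(); j++) {
                    Element p = (Element) params.item(j);
                    result.put(p.getAttribute("name"), p.getTextContent().trim());
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not parse configuration", e);
        }
        return result;
    }
}
```

Inside the service, a post-processor does not parse anything itself; it reads these values through the PostProcessParameters object passed to its process method.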
The catalog-post-processor-1.0.2.jar contains both the PostProcessMap and PostProcessParameters
interfaces and also an abstract class named AbstractPostProcessor. This class contains a couple of static helper methods that
may be useful. Please see the javadocs for details. This jar requires all
of the jars needed for the XMC Cat client - please see the XMC Cat page.
The AbstractPostProcessor class also contains an empty static process method, but classes extending the AbstractPostProcessor
class will be hiding, not overriding, the process method. This is because static methods cannot be abstract in Java.
For our purposes this is not an issue, since the process method of the utility is always invoked using reflection
on the utility class itself (so the empty process method in the abstract class will always remain hidden). Since all of
the methods in the AbstractPostProcessor class are static, each utility can extend that class or just include the jar on
the build path and use the utilities as desired. The only requirement of each utility is that it implement the process method
with the signature described above.
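The hiding behavior and the reflective call can be illustrated with the sketch below. All names are stand-ins: BasePostProcessorSketch plays the role of AbstractPostProcessor, EchoPostProcessor is a hypothetical utility, and ReflectiveInvokerSketch is only a guess at the general shape of a reflective invocation, not the service's actual code. The interfaces are re-declared locally so the example compiles on its own.

```java
import java.lang.reflect.Method;

// Local stand-ins mirroring the interfaces from the post-processor jar.
interface PostProcessMap {
    String getValue(String key);
    boolean hasName(String name);
    void setValue(String key, String value);
}

interface PostProcessParameters {
    String getParameter(String name);
    boolean hasParameter(String name);
}

// Stand-in for AbstractPostProcessor: static methods cannot be abstract,
// so the base class can only offer an empty static process method.
abstract class BasePostProcessorSketch {
    public static void process(String dn, String guid, String epr,
                               PostProcessMap queueProcessor,
                               PostProcessParameters parameters) {
        // intentionally empty
    }
}

// Hypothetical utility: its static process method hides the base one.
final class EchoPostProcessor extends BasePostProcessorSketch {
    static String lastGuid;

    public static void process(String dn, String guid, String epr,
                               PostProcessMap queueProcessor,
                               PostProcessParameters parameters) {
        lastGuid = guid;
    }
}

final class ReflectiveInvokerSketch {
    // Looks up process on the named utility class itself and calls it.
    static void invoke(String className, String dn, String guid, String epr,
                       PostProcessMap map, PostProcessParameters params) {
        try {
            Class<?> cls = Class.forName(className);
            Method m = cls.getMethod("process", String.class, String.class,
                    String.class, PostProcessMap.class, PostProcessParameters.class);
            m.invoke(null, dn, guid, epr, map, params); // null target: static method
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot run " + className, e);
        }
    }
}
```

Because the lookup starts from the utility class, the method found is the one declared there, and the empty method in the base class is never reached.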
Once the utility is created, it is added to the post-processing by creating a jar for it and putting that jar in the lib
folder of the XMC Cat's CatalogService folder (in the Axis2 services folder on the Tomcat server). The utility is then
registered with the service by adding its full name (including the java package name) to the post-process.xml file
(also in the XMC Cat lib folder on the server). When the server is restarted, the plugin added to the post-process.xml file will
be considered for each file subsequently added.
The services.xml file contains parameters that determine how many queue processor threads are created when the service is started
and the size of the queue. Overflow is stored in the database, and any records remaining in the queue are likewise stored to the
database when the service is shut down. When the server is started, records that were not yet processed when the server was shut
down are put back into the queue.
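The overflow and restart behavior can be pictured with the following sketch. It is an illustration under assumptions, not the service's code: the class OverflowQueueSketch is hypothetical, a plain List stands in for the database table, and preserving first-in, first-out order across overflow and restart is an assumption this page does not confirm.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of a bounded in-memory queue with a persistent overflow store.
final class OverflowQueueSketch {
    private final int capacity;            // queue size, at least 1
    private final Deque<String> inMemory = new ArrayDeque<>();
    private final List<String> store;      // stands in for the database

    OverflowQueueSketch(int capacity, List<String> store) {
        this.capacity = capacity;
        this.store = store;
        refill(); // on startup, reload records left over from the last run
    }

    // New records go to memory while there is room, otherwise to the store.
    synchronized void add(String record) {
        if (inMemory.size() < capacity) {
            inMemory.addLast(record);
        } else {
            store.add(record);
        }
    }

    // Hands out the oldest record, then tops the queue up from the store.
    synchronized String poll() {
        String next = inMemory.pollFirst();
        refill();
        return next;
    }

    // On shutdown, unprocessed records are written ahead of any overflow.
    synchronized void shutdown() {
        store.addAll(0, inMemory);
        inMemory.clear();
    }

    private void refill() {
        while (inMemory.size() < capacity && !store.isEmpty()) {
            inMemory.addLast(store.remove(0));
        }
    }
}
```

Creating a new instance over the same store after shutdown models a server restart: the leftover records are loaded back into the queue before any new ones arrive.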