Automated Metadata Post-Processing in the XMC Cat
Current Version:    catalog-post-processor-1.0.2.jar     javadocs

Automated Metadata Post-Processing - What Is It?
Now that you have a catalog running, how do you populate it? One of the significant hurdles in building an archive is getting the metadata. Others have observed that if curating data is left to the user, it often does not get done (much like documentation in software development). Additionally, other researchers (Jim Gray et al.) have noted that metadata is ephemeral: if you do not capture it when the data is generated, you cannot go back later to get it when you need it. In the XMC Cat, we have tackled this issue by creating a hook for domain-specific plugins that are run on files added to the metadata catalog. The framework itself is domain independent and runs as a separate process, so it does not slow down the initial insert of the metadata.

How it Works:
The XMC Cat has a framework that runs utilities to process a file based on the distinguished name (DN) of the user adding the file and the globally unique ID (GUID) of the file. As each file is added to the XMC Cat, its GUID and the DN of its owner are added to a queue, and a separate set of threads runs each plugin for the GUID/DN pair. Each plugin class must have a static method named "process" with the following signature:
public static void process(String dn, String guid, String epr, PostProcessMap queueProcessor, PostProcessParameters parameters)

The values for these parameters are as follows:
dn: distinguished name of the user
guid: globally unique ID of the file
epr: endpoint of the XMC Cat metadata catalog
queueProcessor: PostProcessMap interface used to share name/value pairs among post-processors working on the same file. This interface has three methods:
     String getValue(String key)
     boolean hasName(String name)
     void setValue(String key, String value)
parameters: PostProcessParameters interface used by a post-processor class to access parameters set in the configuration file for that class as well as parameters set in the configuration for all post-processors. This interface has the following methods:
     String getParameter(String name)
     boolean hasParameter(String name)
The PostProcessMap interface allows the utility programs to set and get name/value pairs for the file being processed. For example, if one utility queries the catalog for the title element of the file (title is an element in the metadata schema used in LEAD), it can store the result as a name/value pair for use by subsequent utilities called to process the same file. Name/value pairs are not shared across files.
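The sharing described above can be sketched as follows. This is a minimal illustration, not code from the XMC Cat distribution: the two interfaces are local stand-ins that copy the method lists given above, and InMemoryMap, TitleLookup, TitleLength, queryTitle, and the "title"/"titleLength" keys are hypothetical names invented for the example. The catalog query is replaced by a placeholder that only counts calls.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedMapSketch {

    // Local stand-ins that mirror the method lists given above; the real
    // interfaces ship in the catalog-post-processor jar.
    interface PostProcessMap {
        String getValue(String key);
        boolean hasName(String name);
        void setValue(String key, String value);
    }

    interface PostProcessParameters {
        String getParameter(String name);
        boolean hasParameter(String name);
    }

    // Hypothetical in-memory map; the framework supplies its own implementation.
    static class InMemoryMap implements PostProcessMap {
        private final Map<String, String> values = new HashMap<>();
        public String getValue(String key) { return values.get(key); }
        public boolean hasName(String name) { return values.containsKey(name); }
        public void setValue(String key, String value) { values.put(key, value); }
    }

    // Counts simulated catalog round trips so the saving is visible.
    static int catalogQueries = 0;

    // Placeholder for a real metadata query against the catalog at epr.
    static String queryTitle(String epr, String guid) {
        catalogQueries++;
        return "title-of-" + guid;
    }

    // First utility: looks the title up once and publishes it in the shared map.
    static class TitleLookup {
        public static void process(String dn, String guid, String epr,
                PostProcessMap queueProcessor, PostProcessParameters parameters) {
            if (!queueProcessor.hasName("title")) {
                queueProcessor.setValue("title", queryTitle(epr, guid));
            }
        }
    }

    // Later utility: reuses the shared title and queries only if it is missing.
    static class TitleLength {
        public static void process(String dn, String guid, String epr,
                PostProcessMap queueProcessor, PostProcessParameters parameters) {
            String title = queueProcessor.hasName("title")
                    ? queueProcessor.getValue("title")
                    : queryTitle(epr, guid);
            queueProcessor.setValue("titleLength", String.valueOf(title.length()));
        }
    }

    public static void main(String[] args) {
        PostProcessMap shared = new InMemoryMap(); // one map per file being processed
        TitleLookup.process("CN=Example User", "guid-1", "epr-placeholder", shared, null);
        TitleLength.process("CN=Example User", "guid-1", "epr-placeholder", shared, null);
        System.out.println(shared.getValue("title") + " / " + shared.getValue("titleLength"));
    }
}
```

Because the second utility finds the title already in the map, the placeholder query runs only once for the file. A fresh map would be used for the next file, which matches the rule that pairs are not shared across files.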

The PostProcessParameters interface gets parameters set in the post-process.xml configuration file in the CatalogService/lib directory on the server. Each class that is to be run as a post-processor has a section such as the following in this configuration file:
<processorClass name="edu.indiana.dde.mylead.postprocessors.NamelistPostProcessor">
     <parameter name="someLocalParam1">A</parameter>
     <parameter name="someLocalParam2">B</parameter>
</processorClass>

In this example, the process method in the edu.indiana.dde.mylead.postprocessors.NamelistPostProcessor class could access the values of the parameters named someLocalParam1 and someLocalParam2 when processing a file.
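A process method might read these values roughly as shown below. This is a sketch under assumptions: the interfaces are local stand-ins copying the method lists above, MapBackedParameters and paramOrDefault are hypothetical helpers, and the class is a simplified stand-in for a real post-processor. The source does not say what getParameter returns for a missing name, so the sketch checks hasParameter first.

```java
import java.util.HashMap;
import java.util.Map;

public class ParameterSketch {

    // Local stand-ins mirroring the documented method lists.
    interface PostProcessMap {
        String getValue(String key);
        boolean hasName(String name);
        void setValue(String key, String value);
    }

    interface PostProcessParameters {
        String getParameter(String name);
        boolean hasParameter(String name);
    }

    // Hypothetical implementation backed by a map, standing in for values
    // that the framework would read from post-process.xml.
    static class MapBackedParameters implements PostProcessParameters {
        private final Map<String, String> values = new HashMap<>();
        MapBackedParameters with(String name, String value) {
            values.put(name, value);
            return this;
        }
        public String getParameter(String name) { return values.get(name); }
        public boolean hasParameter(String name) { return values.containsKey(name); }
    }

    // Hypothetical helper: read a parameter, falling back when it is not configured.
    static String paramOrDefault(PostProcessParameters parameters, String name, String fallback) {
        return parameters.hasParameter(name) ? parameters.getParameter(name) : fallback;
    }

    // Simplified stand-in for a post-processor such as NamelistPostProcessor.
    static class NamelistSketch {
        public static void process(String dn, String guid, String epr,
                PostProcessMap queueProcessor, PostProcessParameters parameters) {
            String p1 = paramOrDefault(parameters, "someLocalParam1", "unset");
            String p2 = paramOrDefault(parameters, "someLocalParam2", "unset");
            System.out.println("Processing " + guid + " with " + p1 + " and " + p2);
        }
    }

    public static void main(String[] args) {
        PostProcessParameters params = new MapBackedParameters()
                .with("someLocalParam1", "A")
                .with("someLocalParam2", "B");
        NamelistSketch.process("CN=Example User", "guid-1", "epr-placeholder", null, params);
    }
}
```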

The catalog-post-processor-1.0.2.jar contains both the PostProcessMap and PostProcessParameters interfaces and also an abstract class named AbstractPostProcessor. This class contains a couple of static helper methods that may be useful. Please see the javadocs for details. This jar requires all of the jars needed for the XMC Cat client - please see the XMC Cat page. The AbstractPostProcessor class also contains an empty static process method, but classes extending AbstractPostProcessor will be hiding, not overriding, that method, because static methods cannot be abstract in Java. For our purposes this is not an issue, since the process method of the utility is always invoked using reflection on the utility class itself (so the empty process method in the abstract class always remains hidden). Since all of the methods in the AbstractPostProcessor class are static, each utility can extend that class or just include the jar on the build path and use the helper methods as desired. The only requirement for each utility is that it implement the process method with the signature described above.
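The hiding behaviour can be illustrated with standard Java reflection. The sketch below is not the XMC Cat's own invocation code, which is not shown in this page: BaseProcessor is a stand-in for AbstractPostProcessor, GuidRecorder is a hypothetical utility, the interfaces are local stand-ins, and findProcess and invokeProcess are invented helper names.

```java
import java.lang.reflect.Method;

public class HidingSketch {

    // Local stand-ins mirroring the documented method lists.
    public interface PostProcessMap {
        String getValue(String key);
        boolean hasName(String name);
        void setValue(String key, String value);
    }

    public interface PostProcessParameters {
        String getParameter(String name);
        boolean hasParameter(String name);
    }

    // Stand-in for AbstractPostProcessor: an empty static process method.
    public abstract static class BaseProcessor {
        public static void process(String dn, String guid, String epr,
                PostProcessMap queueProcessor, PostProcessParameters parameters) {
            // intentionally empty
        }
    }

    // Hypothetical utility whose static process method hides the one above.
    public static class GuidRecorder extends BaseProcessor {
        public static String lastGuid;
        public static void process(String dn, String guid, String epr,
                PostProcessMap queueProcessor, PostProcessParameters parameters) {
            lastGuid = guid;
        }
    }

    // Look up the public static process method on the utility class itself.
    static Method findProcess(Class<?> utilityClass) {
        try {
            return utilityClass.getMethod("process", String.class, String.class,
                    String.class, PostProcessMap.class, PostProcessParameters.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(utilityClass + " has no process method", e);
        }
    }

    // Invoke it with a null receiver, as required for static methods.
    static void invokeProcess(Class<?> utilityClass, String dn, String guid, String epr,
            PostProcessMap map, PostProcessParameters params) {
        try {
            findProcess(utilityClass).invoke(null, dn, guid, epr, map, params);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("process failed for " + utilityClass, e);
        }
    }

    public static void main(String[] args) {
        invokeProcess(GuidRecorder.class, "CN=Example User", "guid-1", "epr-placeholder", null, null);
        System.out.println("declared in: " + findProcess(GuidRecorder.class).getDeclaringClass().getSimpleName());
        System.out.println("last guid: " + GuidRecorder.lastGuid);
    }
}
```

Because the lookup starts from the utility class, the method it finds is the one declared there, and the empty method in the base class is never reached.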

Once the utility is created, it is added to the post-processing chain by creating a jar for it and putting that jar in the lib folder of the XMC Cat's CatalogService folder (in the Axis2 services folder on the Tomcat server). The utility is then registered with the service by adding its fully qualified class name (including the Java package name) to the post-process.xml file (also in the XMC Cat lib folder on the server). When the server is restarted, the plugin added to the post-process.xml file will be considered for each file subsequently added.

The services.xml file contains parameters that determine how many queue processor threads are created when the service is started and the size of the queue (overflow is stored in the database and likewise any records remaining in the queue are stored to the database when the service is shut down). When the server is started, records that were not yet processed when the server was shut down are put back into the queue.
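The queue behaviour described above could look roughly like the following. This is only a sketch of the general pattern using java.util.concurrent, not the XMC Cat implementation: the class and method names are hypothetical, the database is replaced by an in-memory list, and the plugin call is reduced to recording the GUID. The restore step assumes that stored records are simply re-offered to the queue at startup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PostProcessQueueSketch {

    // One queued unit of work: the owner's DN and the file's GUID.
    static final class Entry {
        final String dn;
        final String guid;
        Entry(String dn, String guid) { this.dn = dn; this.guid = guid; }
    }

    private final BlockingQueue<Entry> queue;
    private final ExecutorService workers;
    private final int threadCount;
    // In-memory stand-in for the database that holds overflow and leftover records.
    final List<Entry> store = Collections.synchronizedList(new ArrayList<>());
    final List<String> processedGuids = Collections.synchronizedList(new ArrayList<>());

    // queueSize and threadCount play the role of the services.xml parameters.
    PostProcessQueueSketch(int queueSize, int threadCount) {
        this.queue = new ArrayBlockingQueue<>(queueSize);
        this.threadCount = threadCount;
        this.workers = Executors.newFixedThreadPool(threadCount);
    }

    // Add a GUID/DN pair; anything that does not fit goes to the store.
    void enqueue(String dn, String guid) {
        Entry entry = new Entry(dn, guid);
        if (!queue.offer(entry)) {
            store.add(entry);
        }
    }

    // Move stored records back into the queue, as far as capacity allows.
    void restore() {
        synchronized (store) {
            Iterator<Entry> it = store.iterator();
            while (it.hasNext()) {
                if (!queue.offer(it.next())) break;
                it.remove();
            }
        }
    }

    // Start the worker threads; each one polls the queue until interrupted.
    void start() {
        for (int i = 0; i < threadCount; i++) {
            workers.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        Entry entry = queue.poll(50, TimeUnit.MILLISECONDS);
                        if (entry != null) {
                            processedGuids.add(entry.guid); // plugins would run here
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    // Stop the workers and persist whatever is still queued.
    void shutdown() {
        workers.shutdownNow();
        try {
            workers.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        queue.drainTo(store);
    }
}
```

A bounded queue keeps memory use predictable during bulk inserts, while the store ensures that neither overflow nor a shutdown loses a GUID/DN pair.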