OAI-PMH Static Repository is a plugin for Omeka that converts a folder of files and/or metadata into a standard OAI-PMH static repository. This way, the folder can be fetched by any standard OAI-PMH harvester, in particular the one available for Omeka, OAI-PMH Harvester, through any standard OAI-PMH gateway, in particular another plugin for Omeka, OAI-PMH Gateway. In other words, you can self-harvest, or ingest, your own directories, files and metadata via a standard process.
Concretely, just install these three plugins and set a local or remote folder containing files and/or metadata: they will then be harvested automatically and made available directly in Omeka (see the included example below).
In the real world, this tool is designed for people and institutions who store folders of files on hard drives or servers, and who manage them with a simple file manager and with some metadata in various files (text, spreadsheet, xml...). Of course, if files and metadata are well managed, they can be ingested too.
If you just want to import folders with various files and metadata, it is recommended to use Archive Folder, which is simpler to use.
Examples are included in the directory "tests/suite/_files/", but you can build your own before importing real files and metadata.
A sample folder with some metadata is available in "tests/suite/_files/Folder_test". To test it, follow these steps:
- install the fixed and improved fork of OAI-PMH Harvester, OAI-PMH Gateway and this plugin;
- if you want to test the import of extra data, install [Geolocation] too;
- if you want to import all the metadata that are in the examples, allow the extensions of the metadata files (in particular "xml" and "json") and check that the default extension "ods" is allowed in the page "/admin/settings/edit-security", and do the same for the respective media types, "application/xml" and the default "application/vnd.oasis.opendocument.spreadsheet";
- copy the folder outside of the Omeka install, somewhere the server can access;
- click on "Add OAI-PMH Static Repository" in the "OAI-PMH Static Repositories" tab;
- fill the base URI with the address of the copied folder;
- don't care about the other parameters, the defaults will be used;
- click on the submit button;
- click on "Check" to check the folder (results are displayed when the page is refreshed);
- if there is no issue, click "Update" to process the folder.
After a few seconds, the static repository will be automatically available via the OAI-PMH Gateway, and the harvest will be launched.
It can take a few tens of seconds for the harvester to import documents in Omeka, depending on the server. Furthermore, in order to preserve memory and cpu, the harvest process can be done in multiple steps, so if the twenty items are not loaded at once, just re-launch the harvester (without updating the folder) or increase the values of the parameters of the harvester.
See the notes below for other issues.
You can try the update of this harvest too:
- replace the files "Dir_B/Subdir_B-A/document.xml" and/or "Dir_B/Subdir_B-A/document_external.xml" of your copied directory with the matching ones that are prepared in the directory "tests/suite/_files/Update_files/";
- click on "Update" in the "OAI-PMH Static Repositories" tab.
First install the plugins [Archive Document], OAI-PMH Harvester and OAI-PMH Gateway. Even if only the first one is required, the latter two are used to import data inside the Omeka database.
Note: the official OAI-PMH Harvester can only ingest standard metadata (elements). If you want to ingest other standards, files and metadata of files, extra data, and to manage records, you should use the fixed and improved fork of it.
The optional plugin [OAI-PMH Repository] can be used to expose data directly from Omeka. Note: the official [OAI-PMH Repository] has been completed in an improved fork, in particular with a human interface, and until merge of the commits, the latter is recommended.
The plugin [Ocr Element Set] can be installed too to import ocr data.
Then uncompress the files and rename the plugin folder to "OaiPmhStaticRepository".
Then install it like any other Omeka plugin and follow the config instructions.
Some points should be checked too.
- Server Access
The server should allow the indexing of the folder for the localhost. So, for [Apache httpd], the following commands may be added in a file ".htaccess" at the root of the folder (this is the default in the test folder, so it may be needed to change it):
Options Indexes FollowSymLinks
# AllowOverride all
Order Deny,Allow
Deny from all
Allow from ::1
- Local path
If the server doesn't allow indexing, you can use the equivalent path "/var/www/path/to/the/Folder_Test" or something similar. Nevertheless, for security reasons, the allowed base path or a parent should be defined beforehand in the file "security.ini" of the plugin.
- Characters
It is recommended to have filenames with characters whose representations are the same in metadata files, on the source file system, the transport layer (http) and the destination file system, in particular for uppercase/lowercase, for non-latin characters and even if they simply contain spaces. Furthermore, the behavior depends on the version of PHP.
Nevertheless, the plugin manages all unicode characters. A quick check can be done with the folders "Folder_Test_Characters_Http" and "Folder_Test_Characters_Local". These folders contain files with spaces and some non-alphanumeric characters, and metadata adapted for an ingest via http or via a local path.
In fact, currently, if the main uri is an url, all paths in metadata files should be raw url encoded, except the reserved characters: "$-_.+!*'()[],".
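The encoding rule above can be sketched in Python as follows. This is only an illustration, not the code of the plugin: the function name `encode_path` is hypothetical, and keeping "/" unencoded is an assumption made here so that directory separators survive.

```python
from urllib.parse import quote

# Characters listed above as reserved, so they are left untouched.
RESERVED = "$-_.+!*'()[],"


def encode_path(path: str) -> str:
    """Percent-encode a path for use in a metadata file.

    Assumption: "/" is kept as-is because it separates directories.
    Spaces become "%20" and non-ASCII characters are encoded as UTF-8.
    """
    return quote(path, safe=RESERVED + "/")


if __name__ == "__main__":
    print(encode_path("Dir B/image é (1).jpg"))
```

The same function can be applied to each path of a metadata file before the folder is exposed via http.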
- Files extensions
For security reasons, the plugin checks the extension of each ingested file. So, if you import specific files, in particular XML metadata files and json ones, they should be allowed in the page "/admin/settings/edit-security".
- XSLT processor
Xslt has two main versions: xslt 1.0 and xslt 2.0. The first is often installed with php via the extension "php-xsl" or the package "php5-xsl", depending on your system. It can be up to ten times slower than xslt 2.0 and sheets are more complex to write.
So it's recommended to install an xslt 2 processor, that can process xslt 1.0 and xslt 2.0 sheets. The command can be configured in the configuration page of the plugin. Use "%1$s", "%2$s", "%3$s", without escape, for the file input, the stylesheet, and the output.
Example for Debian 6, 7, 8 / Ubuntu / Mint (with the package "libsaxonb-java"):
saxonb-xslt -ext:on -versionmsg:off -warnings:silent -s:%1$s -xsl:%2$s -o:%3$s
Example for Debian 8 / Ubuntu / Mint (with the package "libsaxonhe-java"):
CLASSPATH=/usr/share/java/Saxon-HE.jar java net.sf.saxon.Transform -ext:on -versionmsg:off -warnings:silent -s:%1$s -xsl:%2$s -o:%3$s
Example for Fedora / RedHat / Centos / Mandriva / Mageia:
saxon -ext:on -versionmsg:off -warnings:silent -s:%1$s -xsl:%2$s -o:%3$s
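To see how the three placeholders of the commands above are filled, here is a minimal sketch in Python. The function `build_command` is hypothetical, and the quoting of the arguments with `shlex.quote` is an assumption about what "without escape" implies (the paths are escaped when the command is built, not in the configured template).

```python
import re
import shlex


def build_command(template: str, source: str, stylesheet: str, output: str) -> str:
    """Replace "%1$s", "%2$s" and "%3$s" by the shell-quoted paths."""
    args = (source, stylesheet, output)

    def fill(match: re.Match) -> str:
        # The digit in the placeholder is the 1-based position of the argument.
        return shlex.quote(args[int(match.group(1)) - 1])

    return re.sub(r"%([1-3])\$s", fill, template)


if __name__ == "__main__":
    template = "saxonb-xslt -ext:on -versionmsg:off -s:%1$s -xsl:%2$s -o:%3$s"
    print(build_command(template, "my input.xml", "sheet.xsl", "out.xml"))
```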
Note: Only saxon is currently supported as xslt 2 processor. Because Saxon is a Java tool, a JRE should be installed, for example "openjdk-8-jre-headless".
Note: Warnings are processed as errors. That's why the parameter "-warnings:silent" is important to be able to process an import with a bad xsl sheet. It can be removed with the default xsl sheets, which don't raise any warning.
Anyway, if there is no xslt2 processor installed, the command field should be cleared. The plugin will use the default xslt 1 processor of php, if installed.
The plugin creates an xml file that represents the content of a folder in a standard way and makes it available. So, the existing folder of files:
My Folder
├── image_n1.jpg
├── image_n2.jpg
├── image_n3.jpg
└── image_n4.jpg
will be available as a standard OAI-PMH Repository (simplified here):
<oai:repositoryName>My Folder</oai:repositoryName>
<ListRecords metadataPrefix="oai_dc">
The folder can contain sub-folders, so, if wanted, each folder can be imported as an item with multiple files.
My Nested Folder
├──┬─ Item_1
│ ├───┬─ Item_2
│ │ ├──── my_image_1.jpg
│ │ └──── my_image_2.jpg
│ ├──── my_image_1.jpg
│ └──── my_image_2.jpg
├──┬─ Item_3
│ ├──── my_image_3.jpg
│ └──── my_image_4.jpg
├──── my_image_5.jpg
└──── my_image_6.jpg
Here, there are 4 or 8 items, depending on the parameter "Unreferenced files".
The metadata of each item can be imported if they are available in files in a supported format. Currently, some formats are implemented: a simple text one (as raw text or as OpenDocument Text "odt", for testing purposes), a simple json format, a table one (as OpenDocument Spreadsheet "ods"), the Mets xml, if the profile is based on Dublin Core, and an internal xml one, "Documents", which makes it possible to manage all the specificities of Omeka. Other ones can be easily added via a simple class.
My Digitized Books
├──── Book_1.xml
├──── External.metadata.txt
├──┬─ Book 1
│ ├──── Page_1.tiff
│ ├──── Page_2.tiff
│ ├──── Page_3.tiff
│ └──── Page_4.tiff
└──┬─ Book 2
  ├──── Book_2.metadata.txt
  ├──── Page_1.tiff
  ├──── Page_2.tiff
  ├──── Page_3.tiff
  └──── Page_4.tiff
Notes for metadata files:
- A metadata file can contain one or multiple documents.
- Referenced files in metadata files can be external to the original folder.
- Metadata files can be anywhere in the folder, as long as the paths or urls of the referenced files are absolute or relative to the metadata file, and the server has access to them.
See below for more details on metadata files.
Books and serials are often digitized with a non-destructive compression via the format [Jpeg 2000], the metadata are saved in Mets and the content texts (OCR) are saved in Alto. All of them can be imported automatically.
My Digitized Books
├──┬─ Book_1
│ ├───── Book_1.mets.xml
│ ├───┬─ master
│ │ ├──── Book_1_001.jp2
│ │ ├──── Book_1_002.jp2
│ │ ├──── Book_1_003.jp2
│ │ └──── Book_1_004.jp2
│ └───┬─ ocr
│   ├──── Book_1_001.alto.xml
│   ├──── Book_1_002.alto.xml
│   ├──── Book_1_003.alto.xml
│   └──── Book_1_004.alto.xml
├──┬─ Book_2
│ ├───── Book_2.mets.xml
│ ├───┬─ master
│ │ ├──── Book_2_001.jp2
The xml Mets file contains the path to each subordinate file (master, ocr, etc.), so the structure may be different.
The plugin [Ocr Element Set] should be installed to import ocr data, if any.
If there are other files in folders, for example old xml [refNum] or any other old texts files that may have been used previously for example for a conversion from refNum to Mets via the tool [refNum2Mets], they need to be skipped via the option "Unreferenced files" and/or the option "File extensions to exclude", with "refnum.xml ods txt" for example.
See examples in the folder "tests/suite/_files" of the plugin.
- "Dublin Core" (simple or qualified if wanted) is the default element set. The element "Title" corresponds to "Dublin Core : Title".
- If the element set name is not set, the check is case insensitive: for example, "title" will be imported as "Dublin Core : Title". Else, the check is case sensitive: "Dublin Core : title" will not be identified.
- If the element name contains a colon ":", it will be interpreted as a standard element, else as an extra data, that will be harvested via the format "Documents" (see below). Extra data can be the "Item type", the "collection", the "tags", etc.
- A static repository does not allow sets, so there will be only one collection by folder. Nevertheless, the use of the harvesting format "Documents" makes it possible to ingest records in other collections (see below).
- Internal relative paths should be relative to the metadata file where they are referenced.
Xml files can be ingested as is in the static repository, as long as there is an associated class, which is needed to convert the metadata to the required Dublin Core format.
An internal and simple xml format is used as a pivot. It is designed to be used internally only, not to be exposed. Other xml formats should be converted to it to be imported.
It has only the five tags "record", "elementSet", "element", "extra" and "data" under the root tag "documents", which is enough to import all metadata and specific fields of Omeka. For compatibility purposes, it also supports the tags of Dublin Core (simple or qualified). Standard attributes of the record can be set as extra data too.
Here is the structure (see true examples for details):
<record xmlns="http://localhost/documents/" name="my-doc #1" recordType="Item">
    <elementSet name="Dublin Core">
        <element name="Title">
            <data>Foo Bar</data>
        </element>
        <element name="Creator">
            <data>John Smith</data>
        </element>
    </elementSet>
    <extra>
        <data name="collection">My collection</data>
        <data name="item type">Still Image</data>
        <data name="featured">1</data>
        <data name="public">1</data>
        <data name="tag">First tag</data>
        <data name="tag">Second tag</data>
        <data name="Another field">This field can be written in the static repository with a special format.</data>
    </extra>
    <record file="http://localhost/path/to/the/file" />
</record>
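As an illustration of how such a record can be read, here is a minimal sketch in Python with the standard library. It is not the code used by the plugin or the harvester. The sample is a shortened record, the "extra" wrapper around named "data" tags is an assumption deduced from the list of tags above, and `read_record` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sample record of this README.
NS = {"d": "http://localhost/documents/"}

SAMPLE = """<record xmlns="http://localhost/documents/" name="my-doc #1" recordType="Item">
  <elementSet name="Dublin Core">
    <element name="Title"><data>Foo Bar</data></element>
    <element name="Creator"><data>John Smith</data></element>
  </elementSet>
  <extra>
    <data name="tag">First tag</data>
    <data name="tag">Second tag</data>
  </extra>
</record>"""


def read_record(xml_text: str):
    """Return (elements, extra): two dicts mapping names to lists of values."""
    record = ET.fromstring(xml_text)
    elements = {}
    for element_set in record.findall("d:elementSet", NS):
        for element in element_set.findall("d:element", NS):
            key = f"{element_set.get('name')} : {element.get('name')}"
            elements[key] = [data.text for data in element.findall("d:data", NS)]
    extra = {}
    # Assumption: extra data are "data" tags with a name, inside "extra".
    for data in record.findall("d:extra/d:data", NS):
        extra.setdefault(data.get("name"), []).append(data.text)
    return elements, extra


if __name__ == "__main__":
    print(read_record(SAMPLE))
```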
Any METS file can be imported as long as the profile uses Dublin Core metadata. Else, the class should be extended.
The associated Alto file, an OCR format, can be ingested too as text. The plugin [Ocr Element Set] should be installed first to create fields for it, because texts are saved at file level. Else, a hook can be used to import data somewhere else.
The plugin [Ocr Element Set] saves ocr about each image at file level, so the option "File Metadata" should be set.
Two extra parameters can be managed:
- "mets_fileGrp_document": sets the main file, if wanted and if any ("document" by default).
- "mets_fileGrps": sets the groups of files to import. This makes it possible to import only the main files and not the thumbnails or other unwanted files. The default is "master, ocr, MASTER, OCR". If set to empty, only the first file group will be imported.
Note: The namespace of the xslt stylesheets may need to be changed according to your files. Alternatively, the class "OaiPmhStaticRepository_Mapping_Mets" can be extended.
See examples in tests/suite/_files/Folder_Test/External_Metadata.ods
The first row of each sheet represents the element to import. Order of columns is not important. Unknown headers should be managed by the formats.
One row can represent multiple records and multiple rows can represent one record. This is the case for example for a file without metadata that is set on a row for an item, or when there is a new collection on the same row, or when there are multiple files attached to an item.
To add multiple values to the same element, for example multiple authors, three ways can be used:
- Set them in one or multiple columns with the same header;
- Fill the cell with multiple values separated by an element delimiter, like "|" or an end-of-line character;
- Repeat the data in other rows, as long as the identifier is set and there is no column "action", in which case each row is processed separately.
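The second way can be illustrated with a short Python sketch. It only shows the principle: `split_values` is a hypothetical helper, and the delimiter "|" is the default one mentioned in this README.

```python
import re


def split_values(cell: str, delimiter: str = "|") -> list:
    """Split a cell on the element delimiter or on ends of line.

    Values are trimmed and empty values are dropped.
    """
    parts = re.split(r"\r?\n|" + re.escape(delimiter), cell)
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    print(split_values("John Smith | Mary Smith\nJane Doe"))
```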
Metadata for a file should be set after those of the item and require a column that indicates the item to which the file is attached.
Compared to the fork of Csv Import, some headers changed:
- The headers for the item type and the record type are now "Item Type" and "Record Type", and the header for files is now "Files";
- Extra data are now written with the array notation, so they are different from standard elements: for example, "geolocation : latitude" is replaced by "geolocation [ latitude ]";
- The standard delimiter ":" can still be used for extra data, but the value will be an array like other elements, not a string;
- As identifier, it's recommended to use a true value from an element field like the "Dublin Core:Identifier";
- To enter an empty value, which may be required by some extra data like the standard [Geolocation] plugin, enter the value "Empty value" (case sensitive) or the one you specified. This may be required when the action is "replace" too;
- If an extra data has only one value, it will be a string. To force an array, add an element delimiter ("|" by default).
- TODO Convert OpenDocument styles into html ones.
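The array notation of the headers can be recognized with a small regular expression. The following Python sketch is only an illustration of the notation: `parse_header` is a hypothetical helper, not a function of the plugin.

```python
import re

# Matches "name [ key ]" with optional spaces around the brackets.
HEADER = re.compile(r"^\s*(?P<name>[^\[\]]+?)\s*\[\s*(?P<key>[^\[\]]*?)\s*\]\s*$")


def parse_header(header: str):
    """Return (name, key) for an array header, else (header, None)."""
    match = HEADER.match(header)
    if match:
        return match.group("name"), match.group("key")
    return header.strip(), None


if __name__ == "__main__":
    print(parse_header("geolocation [ latitude ]"))
    print(parse_header("Dublin Core : Title"))
```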
Three modes make it possible to link collections, items and files between rows.
- The recommended format is the clearest: fill a column "Collection" for items and a column "Item" for files, where the value is the identifier, which is generally the "Dublin Core:Identifier".
- An index may be used with the column "name". All rows with the same name belong to the same document.
- Else, the documents are processed as ordered, so a file belongs to the previous item. This is always the case if there is a column "action".
The text format for the metadata is just an example to try the plugin. It's a simple tagged file format, very similar to ".ini" formats: a metadata is a line that starts with the name of the element set ("Dublin Core" by default), followed by a colon ":" and by the name of the element ("Title"). The value is separated from the field name with an equal sign "=". If this character is not present, the line is ignored, so it can be a comment. If an element has multiple lines, the next lines start with two non-significant spaces. Field names and values are trimmed. Because values are trimmed, an empty line between two fields is not taken into account. Fields are repeatable. All documents in a file are merged.
The "File" field, with the path to the file (absolute or relative to the metadata file), is needed only for a flat folder or when there are metadata for files. Metadata for each file are managed like those of the item. All lines next to a "File" line are attached to this file. If one file is referenced, all other files should be referenced too, even if they don't have metadata.
If there are multiple records in the file, each of them should be preceded by an "Item" field (the first one is optional).
There is no need to escape any character, unlike many other formats. The extension of the file should be a double one: ".metadata.txt".
Title = The Title
Creator = John Smith
Creator=Mary Smith
Description = This is the Dublin Core Description of this document.
  This is the second line of the description, after a line break.
This line is a comment, because the equal sign is absent and it doesn't begin with two spaces.
Date = 2015
File = Image_1.jpg
Title = The first Image
Rights = Creative Commons CC-BY-SA
File = Image_2.jpg
Title = The second Image
Rights = Public Domain
Item = Document 2
Title = Second Document
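The rules of this text format can be summed up by a small parser. The Python sketch below is only an illustration under some assumptions: it is not the parser of the plugin, the resolution of element set names is left out (field names are kept as written), a continuation line is recognized by its two leading spaces before the equal sign is looked for, and `parse_metadata_text` is a hypothetical name.

```python
def parse_metadata_text(text: str) -> list:
    """Parse a ".metadata.txt" content into a list of items.

    Each item is a dict with a "name", its "metadata" (field name to
    list of values) and its "files" (path and metadata of each file).
    """
    items = []
    item = None    # current item
    target = None  # metadata dict of the current item or file
    last = None    # list that holds the latest value, for continuations

    def new_item(name=None):
        nonlocal item, target, last
        item = {"name": name, "metadata": {}, "files": []}
        items.append(item)
        target, last = item["metadata"], None

    for line in text.splitlines():
        if not line.strip():
            continue  # empty lines are not taken into account
        if line.startswith("  ") and last is not None:
            last[-1] += "\n" + line.strip()  # next line of the same value
            continue
        if "=" not in line:
            continue  # no equal sign: the line is a comment
        name, value = (part.strip() for part in line.split("=", 1))
        if name == "Item":
            new_item(value)
            continue
        if item is None:
            new_item()  # the first "Item" field is optional
        if name == "File":
            file_record = {"path": value, "metadata": {}}
            item["files"].append(file_record)
            target, last = file_record["metadata"], None
            continue
        last = target.setdefault(name, [])  # fields are repeatable
        last.append(value)
    return items
```

Applied to the example above, such a parser would return two items, the first one with two creators, a description on two lines and two files with their own metadata.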
The OpenDocument Text ("odt") format is the same as the text one, except that the two spaces are replaced by two underscores "__". It is only added as an example.
Three filters and associated classes are available to create static repositories for custom formats.
The filter "oai_pmh_static_repository_mappings" makes the mapping between the metadata files and the elements that exist in Omeka. This filter is required to process metadata files. The mapping should be done at least for Dublin Core, because this is the base format of Omeka and OAI-PMH.
If the format is an xml one, it can be copied as is in the final static repository. In that case, the ingest should use this format and the mapping should be done with the filter "oai_pmh_harvester_maps".
Note: In Omeka, all metadata are flat by default, so the hierarchical structure of a complex XML file should be interpreted.
This filter specifies a class that defines a format that will be used as a metadata format in the static repository. It is not needed as long as all data are mapped into the Omeka format with the filter "oai_pmh_static_repository_mappings", so that the import can be done with the default formats. Conversely, it is required if the import is done with this format, in particular when the xml is copied raw.
This filter processes the import inside Omeka. It is required only if the format is designed to be ingested by an harvester.
The import into Omeka is done via the OAI-PMH Harvester plugin. Because this is a standard, only standard elements are imported via the standard formats (Dublin Core and METS). The format "Documents" makes it possible to import extra data, for example the collection, the item type, the featured and public status, the tags, and any other data that are managed by specific plugins.
With the OAI-PMH Harvester plugin and the "Documents" format, extra data are imported in two ways.
- Standard "Post": the name should be the same as the one used in the form of the original plugin, for example "geolocation[latitude]" for the latitude of an item in array notation with the plugin [Geolocation].
- If this is not possible, a hook should be set and managed in a plugin. This hook is called after the harvest of each record.
Furthermore, this hook can be used to ingest data that are contained inside original files, in particular for audio, photo and video files.
According to the specifications of OAI-PMH protocol, the static repository can be updated. The update is done for a whole record: it's not possible to add or remove a specific element.
In practice, there are two updates: the update of records inside a folder, which builds the static repository, and the update inside Omeka, carried out through the harvester.
The update is based on the oai identifier and the date stamp of each record.
This identifier is built with the original path of each document and files and their name. For records defined in metadata files, the path (for files) and the name (for document) are used too. When there is no path and no name, the order in the folder and in the metadata file is used.
So, when using metadata files, it's recommended to use a unique name for each document (unique across all the folder). Otherwise, new documents should be added at the end of the list of records. If this is not possible either, you shouldn't update metadata files in the static repository, or updates may be applied to the wrong records.
The update process updates all metadata of each item and its files. It uses the core functions of Omeka. When a metadata no longer exists in the source, Omeka keeps it by default. It is useful if you add other metadata inside Omeka, but it can cause synchronisation issues if you re-harvest the metadata.
So, when you set up a harvest, three choices are possible:
- keep all old metadata, updated or not (Omeka default);
- remove only metadata from elements that have been updated, so specific metadata that have been added in other elements are kept (plugin default);
- remove all old metadata, so the items will be strictly the same that in the static repository.
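The difference between these three choices can be shown on a simple example. The Python sketch below is only a model of the behaviors described above, not the code of the harvester: metadata are reduced to a mapping from element names to lists of values, and the policy names "keep_all", "updated_only" and "strict" are hypothetical.

```python
def merge_metadata(existing: dict, harvested: dict, policy: str = "updated_only") -> dict:
    """Return the metadata of a record after an update."""
    if policy == "strict":
        # Remove all old metadata: the record mirrors the repository.
        return {name: list(values) for name, values in harvested.items()}
    merged = {name: list(values) for name, values in existing.items()}
    if policy == "keep_all":
        # Keep all old values and add the harvested ones.
        for name, values in harvested.items():
            current = merged.setdefault(name, [])
            current.extend([value for value in values if value not in current])
        return merged
    if policy == "updated_only":
        # Replace only the elements that are present in the harvest.
        for name, values in harvested.items():
            merged[name] = list(values)
        return merged
    raise ValueError(f"Unknown policy: {policy}")


if __name__ == "__main__":
    old = {"Title": ["Old title"], "Subject": ["Added in Omeka"]}
    new = {"Title": ["New title"]}
    for policy in ("keep_all", "updated_only", "strict"):
        print(policy, merge_metadata(old, new, policy))
```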
Tip: The right process depends on the static repository, but, generally, when you want to add or to modify metadata, the best is to update records directly in the folder and in metadata files.
A static repository requires a date stamp without time, because the purpose of such a repository is to be static and stable.
Therefore, if the static repository is updated the same day but after a harvest has been done, the updated metadata will never be harvested. This issue applies too for badly formatted repositories.
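The effect of a date stamp without time can be illustrated with a few lines of Python. This is a simplified model, not the code of the harvester: it assumes that a record is processed again only when its date stamp is strictly later than the one stored at the previous harvest, and the function names are hypothetical.

```python
from datetime import datetime


def day_stamp(moment: datetime) -> str:
    """Date stamp without time, as used in a static repository."""
    return moment.strftime("%Y-%m-%d")


def is_seen_as_changed(stored_stamp: str, record_updated: datetime) -> bool:
    """Assumption: only a strictly later date stamp triggers an update."""
    return day_stamp(record_updated) > stored_stamp


if __name__ == "__main__":
    stored = day_stamp(datetime(2015, 6, 1, 10, 0))  # harvest in the morning
    # Update the same day, after the harvest: the stamp is unchanged.
    print(is_seen_as_changed(stored, datetime(2015, 6, 1, 15, 0)))
    # Update the next day: the stamp is later.
    print(is_seen_as_changed(stored, datetime(2015, 6, 2, 9, 0)))
```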
A checkbox makes it possible to bypass this limitation when the harvester uses the default "Documents" format, because it is designed for internal management only.
This constraint applies only to the harvest: the update of the static repository itself can be done at any time.
A static repository doesn't manage the deleted status. A record can be deleted and removed from the repository, but the harvester won't see it, so it won't be removed from Omeka.
This limitation can be bypassed only when the harvester uses the "Documents" format and when there is an extra data "action" with the value "delete".
If the request exceeds the threshold (5 seconds by default), the harvest is not processed. A shorter or longer delay can be set in the main "config.ini" file:
plugins.OaipmhHarvester.wait = 5
The official [Geolocation] requires five fields to be able to ingest a location. So, for all formats, you should set all of them.
For the OpenDocument Spreadsheet format (ods), a value should be set in all of these cells, because empty cells are skipped:
| geolocation[latitude] | geolocation[longitude] | geolocation[zoom_level] | geolocation[map_type] | geolocation[address] |
|---|---|---|---|---|
| 48.8583701 | 2.2944813 | 17 | Google Maps v3.x | - |
If you use a spreadsheet and don't want to set a false address or map type, you should use the fixed release of Geolocation.
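Before an import, a row can be checked for the five fields with a few lines of Python. This sketch only reflects the rule stated above: the row is modeled as a dict keyed by the headers, and `missing_geolocation_fields` is a hypothetical helper, not part of any plugin.

```python
REQUIRED_KEYS = ("latitude", "longitude", "zoom_level", "map_type", "address")


def missing_geolocation_fields(row: dict) -> list:
    """Return the geolocation keys that are absent or empty in a row."""
    return [
        key
        for key in REQUIRED_KEYS
        if not str(row.get(f"geolocation[{key}]", "")).strip()
    ]


if __name__ == "__main__":
    row = {
        "geolocation[latitude]": "48.8583701",
        "geolocation[longitude]": "2.2944813",
        "geolocation[zoom_level]": "17",
        "geolocation[map_type]": "Google Maps v3.x",
        "geolocation[address]": "-",
    }
    print(missing_geolocation_fields(row))
```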
For metadata files "odt" and "ods", double or multiple successive spaces are merged into one space.
The import of "ods" and "odt" files as regular files is not possible, because these files are ingested as metadata files. If wanted, currently, they should be disabled via the filter directly in the code of the plugin.
Files are checked with the white-list of extensions set in the page "/admin/settings/edit-security", but the media type is currently not checked when the static repository is built. Anyway, it is checked during the harvest.
If names inside metadata files are not unique across all the folder, the plugin can't determine which record may be updated. So, if using metadata files, it's recommended to use a unique name for each document (unique across all the folder).
The harvester may have issues if files are available through "https", but cached by a proxy. In that case, you will have to wait some minutes (or days) before re-harvest, or to check settings of the proxy and the server.
The namespace of the xslt stylesheets may need to be changed according to your files.
### TODO
- List element texts set by the harvester in the table harvest_records, for the item and attached files in order to keep other ones, manually changed in Omeka?
Use it at your own risk.
It’s always recommended to backup your files and your databases and to check your archives regularly so you can roll back if needed.
See online issues on the [plugin issues] page on GitHub.
This plugin is published under the [CeCILL v2.1] licence, compatible with [GNU/GPL] and approved by [FSF] and [OSI].
In consideration of access to the source code and the rights to copy, modify and redistribute granted by the license, users are provided only with a limited warranty and the software's author, the holder of the economic rights, and the successive licensors only have limited liability.
In this respect, the risks associated with loading, using, modifying and/or developing or reproducing the software by the user are brought to the user's attention, given its Free Software status, which may make it complicated to use, with the result that its use is reserved for developers and experienced professionals having in-depth computer knowledge. Users are therefore encouraged to load and test the suitability of the software as regards their requirements in conditions enabling the security of their systems and/or data to be ensured and, more generally, to use and operate it in the same conditions of security. This Agreement may be freely reproduced and published, provided it is not altered, and that no provisions are either added or removed herefrom.
Current maintainers:
- Daniel Berthereau (see [Daniel-KM] on GitHub)
- Copyright Daniel Berthereau, 2015
[Ocr Element Set]: https://github.com/Daniel-KM/Omeka-plugin-OcrElementSet
[Geolocation]: https://omeka.org/add-ons/plugins/geolocation
[Apache httpd]: https://httpd.apache.org
[Jpeg 2000]: http://www.jpeg.org/jpeg2000
[refNum]: http://bibnum.bnf.fr/refNum
[refNum2Mets]: https://github.com/Daniel-KM/refNum2Mets
[Archive Document]: https://github.com/Daniel-KM/Omeka-plugin-ArchiveDocument
[OAI-PMH Repository]: https://omeka.org/add-ons/plugins/oai-pmh-repository
[plugin issues]: https://github.com/Daniel-KM/Omeka-plugin-OaiPmhStaticRepository/issues
[CeCILL v2.1]: https://www.cecill.info/licences/Licence_CeCILL_V2.1-en.html
[GNU/GPL]: https://www.gnu.org/licenses/gpl-3.0.html
[FSF]: https://www.fsf.org
[OSI]: http://opensource.org
[Daniel-KM]: https://github.com/Daniel-KM "Daniel Berthereau"