Add more information about name cleanup
This is an attempt to explain why and how Archivematica changes file
and directory names during Transfer and Ingest. Feedback welcome!

Connected to archivematica/Issues#387
sallain committed Jan 18, 2019
1 parent de94160 commit 14fa691
Showing 5 changed files with 121 additions and 22 deletions.
1 change: 1 addition & 0 deletions user-manual/index.rst
@@ -15,6 +15,7 @@ links to each chapter's main sections.
transfer/forensic
transfer/dspace
transfer/scan-for-viruses
transfer/clean-up-names
transfer/dataverse
ingest/ingest
ingest/manual-normalization
6 changes: 6 additions & 0 deletions user-manual/transfer/_csv/file-name-edits.csv
@@ -0,0 +1,6 @@
Original name,Edited name,METS output
i & tem,i___tem,"Original name=""%transferDirectory%objects/i & tem.png""; cleaned up name=""%transferDirectory%objects/i___tem.png"""
Česká republika.png,Ceska_republika.png,"Original name=""%transferDirectory%objects/Česká republika.png""; cleaned up name=""%transferDirectory%objects/Ceska_republika.png"""
Éireann,Eireann,"Original name=""%transferDirectory%objects/Éireann.png""; cleaned up name=""%transferDirectory%objects/Eireann.png"""
España,Espana,"Original name=""%transferDirectory%objects/España.png""; cleaned up name=""%transferDirectory%objects/Espana.png"""
Росси́я,Rossiia,"Original name=""%transferDirectory%objects/Росси́я.png""; cleaned up name=""%transferDirectory%objects/Rossiia.png"""
91 changes: 91 additions & 0 deletions user-manual/transfer/clean-up-names.rst
@@ -0,0 +1,91 @@
.. _clean-up-names:

==============
Clean up names
==============

The clean up names microservice runs twice during processing - once on the
Transfer tab and once on the Ingest tab. The purpose of this microservice is to
ensure that Archivematica's tools and processes do not fail because of
characters that appear in file or directory names.

Archivematica groups a wide variety of tools together to create preservation
workflows. Some of the tools that Archivematica uses have a narrowly-defined
scope in terms of what the tool considers to be a valid character for a file or
directory name. Encountering a character that the tool considers to be invalid
can cause the tool to fail, which can halt processing and interfere with normal
operation of Archivematica.

To prevent these kinds of tool failures from happening, Archivematica implements
a script that changes file and directory names to conform to requirements of the
most restrictive tools. Valid characters are defined by `specific code`_ in
Archivematica:

``-_.()abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789``

The script tries to replace any character that is not included in the above list
with its nearest equivalent - for example, the character ``Č`` is replaced by
``C``. Where the script cannot find a close equivalent, the character is
replaced by an underscore. The change is documented in the METS file.

.. csv-table::
   :file: _csv/file-name-edits.csv
   :header-rows: 1
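
The replacement rule described above can be approximated in a few lines of Python. This is a minimal sketch, not Archivematica's own script: the function name ``clean_name`` is hypothetical, and it relies only on the standard library's ``unicodedata`` module, so it strips diacritics from Latin characters but does not transliterate other scripts.

```python
import unicodedata

# Allowed characters, copied from the list quoted in this section.
ALLOWED = set("-_.()abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
REPLACEMENT = "_"


def clean_name(name: str) -> str:
    """Approximate name cleanup (hypothetical helper, not Archivematica code)."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    cleaned = []
    for char in decomposed:
        if unicodedata.combining(char):
            # Drop the accent itself; the base letter was emitted separately.
            continue
        # Keep allowed characters and replace everything else with an underscore.
        cleaned.append(char if char in ALLOWED else REPLACEMENT)
    return "".join(cleaned)


print(clean_name("Česká republika.png"))
print(clean_name("i & tem"))
```

Under this sketch a Cyrillic name such as ``Росси́я`` would be reduced to underscores, whereas the table shows ``Rossiia``. Reproducing that result requires a transliteration step that the standard library does not provide.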

Additional information can be found in the complete PREMIS event:

.. code:: xml

   <mets:mdWrap MDTYPE="PREMIS:EVENT">
     <mets:xmlData>
       <premis:event xmlns:premis="info:lc/xmlns/premis-v2" xsi:schemaLocation="info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-2.xsd" version="2.2">
         <premis:eventIdentifier>
           <premis:eventIdentifierType>UUID</premis:eventIdentifierType>
           <premis:eventIdentifierValue>ef7ef8b8-c363-4dfe-9b67-4100d1bfa7f2</premis:eventIdentifierValue>
         </premis:eventIdentifier>
         <premis:eventType>name cleanup</premis:eventType>
         <premis:eventDateTime>2019-01-17T22:07:17+00:00</premis:eventDateTime>
         <premis:eventDetail>prohibited characters removed:program="sanitize_names"; version="1.10.d0ccb7d7661cf35c769dcc0846d8f087998af713"</premis:eventDetail>
         <premis:eventOutcomeInformation>
           <premis:eventOutcome></premis:eventOutcome>
           <premis:eventOutcomeDetail>
             <premis:eventOutcomeDetailNote>Original name="%transferDirectory%objects/Росси́я.png"; cleaned up name="%transferDirectory%objects/Rossiia.png"</premis:eventOutcomeDetailNote>
           </premis:eventOutcomeDetail>
         </premis:eventOutcomeInformation>
         <premis:linkingAgentIdentifier>
           <premis:linkingAgentIdentifierType>preservation system</premis:linkingAgentIdentifierType>
           <premis:linkingAgentIdentifierValue>Archivematica-1.7</premis:linkingAgentIdentifierValue>
         </premis:linkingAgentIdentifier>
         <premis:linkingAgentIdentifier>
           <premis:linkingAgentIdentifierType>repository code</premis:linkingAgentIdentifierType>
           <premis:linkingAgentIdentifierValue></premis:linkingAgentIdentifierValue>
         </premis:linkingAgentIdentifier>
       </premis:event>
     </mets:xmlData>
   </mets:mdWrap>

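
To recover the original and cleaned names from such an event programmatically, something like the following may work. It is a sketch under stated assumptions: the function name ``name_changes`` and the regular expression are illustrative, the note format is inferred only from the example above, and the input must be XML in which every namespace prefix is declared (for example the complete METS file rather than the excerpt shown here).

```python
import re
import xml.etree.ElementTree as ET

# PREMIS v2 namespace, as it appears in the event above.
PREMIS = "info:lc/xmlns/premis-v2"
NS = {"premis": PREMIS}

# Pattern inferred from the eventOutcomeDetailNote in the example (assumption).
NOTE_PATTERN = re.compile(
    r'Original name="(?P<original>.*?)"; cleaned up name="(?P<cleaned>.*)"'
)


def name_changes(xml_text: str):
    """Return (original, cleaned) pairs from 'name cleanup' events.

    Hypothetical helper for illustration; not part of Archivematica.
    """
    root = ET.fromstring(xml_text)
    changes = []
    # Element.iter() also yields the root itself when its tag matches.
    for event in root.iter(f"{{{PREMIS}}}event"):
        if event.findtext("premis:eventType", namespaces=NS) != "name cleanup":
            continue
        for note in event.iter(f"{{{PREMIS}}}eventOutcomeDetailNote"):
            match = NOTE_PATTERN.fullmatch(note.text or "")
            if match:
                changes.append((match["original"], match["cleaned"]))
    return changes
```

Events of other types are skipped, so the same function can be run over a whole METS document.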
This strategy for addressing character limitations within the tools bundled by
Archivematica can result in significant changes to file and directory names. To
provide easy access to a record of the changes, the Archivematica AIP contains
log files that show the original name alongside the changed name. There are two
log files:

* The log file for name changes that happen during Transfer is located at
``data/logs/transfers/my-transfer-name/logs/filenameCleanup.log``
* The log file for name changes that happen during Ingest is located at
``data/logs/filenameCleanup.log``

These logs contain a plaintext rendering of the information embedded within the
PREMIS event.

.. code::

   Sanitized name: %transferDirectory%objects/Росси́я.png -> %transferDirectory%objects/Rossiia.png
   Sanitized name: %transferDirectory%objects/Éireann.png -> %transferDirectory%objects/Eireann.png
   Sanitized name: %transferDirectory%objects/f & ile.png -> %transferDirectory%objects/f___ile.png
   Sanitized name: %transferDirectory%objects/Česká republika.ong -> %transferDirectory%objects/Ceska_republika.ong
   Sanitized name: %transferDirectory%objects/Ísland.png -> %transferDirectory%objects/Island.png
   Sanitized name: %transferDirectory%objects/España.png -> %transferDirectory%objects/Espana.png

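
Log lines in this form are simple to turn back into a mapping of original to cleaned names. The following is a minimal sketch: the function name ``parse_cleanup_log`` is hypothetical, and the line format is assumed from the sample above rather than from any formal specification.

```python
import re

# Line format assumed from the sample log: "Sanitized name: <old> -> <new>".
LOG_LINE = re.compile(r"^Sanitized name: (?P<original>.+) -> (?P<cleaned>.+)$")


def parse_cleanup_log(text: str):
    """Return (original, cleaned) pairs found in filenameCleanup.log text.

    Hypothetical helper for illustration; not part of Archivematica.
    """
    changes = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            # Lines that do not follow the assumed format are ignored.
            changes.append((match["original"], match["cleaned"]))
    return changes
```

A name that itself contains `` -> `` would make a line ambiguous under this pattern; in that case the PREMIS event, which quotes both names, is the safer source.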
.. _`specific code`: https://github.com/artefactual/archivematica/blob/b6dcfb07a6be5957a5085efd1fecd8462fdc3a91/src/MCPClient/lib/clientScripts/sanitizeNames.py#L34
42 changes: 21 additions & 21 deletions user-manual/transfer/scan-for-viruses.rst
@@ -1,44 +1,44 @@
.. _scan-for-viruses:

================
Scan for viruses
================

The scan for viruses microservice runs at multiple points in the transfer and
ingest workflows inside Archivematica. Archivematica uses the ClamAV antivirus
engine and its configuration is discussed :ref:`here <antivirus-admin>`. The
configuration of this service can affect whether PREMIS events are recorded for
files or not.

.. figure:: images/VirusPREMISPass.*
:align: center
:figwidth: 80%
:width: 100%
:alt: The event log for a successful virus check

The event log for a successful virus check

We look at the impact of various settings below.

Exploring ClamAV settings
-------------------------

``MaxFileSize`` If a file is passed to the scanner that is larger than this
then it will not be scanned. No event will be recorded.

``MaxScanSize`` limits the number of bytes that will be scanned. This might be
used in a standard operating environment where one might be confident a virus
or malware will only appear within that range. In Archivematica, because of the
possibility of a virus still existing outside of that range, a PREMIS event
cannot be recorded confidently.

``MaxStreamLength`` is a setting used in Clamdscan only. The maximum number of
bytes that can be sent to the ClamAV daemon. Files that are larger than this
limit cannot be sent in their entirety to the server and so a PREMIS event that
states the existence or non-existence of malware in the file cannot be recorded
confidently.

We can observe the impact of different combinations of configuration options in
the following two tables.

Clamscan
@@ -69,8 +69,8 @@ Clamdscan
| 84M | 42M | 84M | 100M | No | No |
+-----------+-------------+-------------+-----------------+----------+--------------+

In both tables you can see that if for any reason a file is not scanned,
or *'not-completely-scanned'*, a PREMIS event will not be recorded. A PREMIS
event in either instance of `PASS` or `FAIL` would be a false-positive.
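
The rule summarised above can be expressed as a small predicate. This is a sketch under stated assumptions, not Archivematica's code: the function name ``premis_event_expected`` is hypothetical, sizes are plain integers in a single consistent unit, and a file exactly equal to a limit is assumed to be scanned in full (the text only says that larger files are not).

```python
from typing import Optional


def premis_event_expected(
    file_size: int,
    max_file_size: int,
    max_scan_size: int,
    max_stream_length: Optional[int] = None,
) -> bool:
    """Return True if a virus-scan PREMIS event would be expected.

    Hypothetical helper: an event is expected only when the whole file can
    be scanned, i.e. the file does not exceed any limit that applies.
    """
    limits = [max_file_size, max_scan_size]
    if max_stream_length is not None:
        # MaxStreamLength applies to Clamdscan only; pass None for Clamscan.
        limits.append(max_stream_length)
    return all(file_size <= limit for limit in limits)
```

For example, a file that exceeds ``MaxFileSize`` yields ``False`` regardless of the other settings, matching the statement that no event is recorded for a file that is not scanned.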

:ref:`Back to the top <scan-for-viruses>`
3 changes: 2 additions & 1 deletion user-manual/transfer/transfer.rst
@@ -576,7 +576,8 @@ The microservices that run on the Transfer tab include:
original transfer and places as a text file in the AIP.

* **Clean up names**: removes prohibited characters from folder and filenames,
such as ampersands. For more information, see :ref:`Clean up names
<clean-up-names>`.

* **Identify file format**: allows the user to choose between various format
identification tools, or to skip format identification at this stage. See