From 14fa69180fa9508eade8d53404595f0b1408c115 Mon Sep 17 00:00:00 2001 From: sallain Date: Thu, 17 Jan 2019 14:34:35 -0800 Subject: [PATCH] Add more information about name cleanup This is an attempt to explain why and how Archivematica changes file and directory names during Transfer and Ingest. Feedback welcome! Connected to archivematica/Issues#387 --- user-manual/index.rst | 1 + user-manual/transfer/_csv/file-name-edits.csv | 6 ++ user-manual/transfer/clean-up-names.rst | 91 +++++++++++++++++++ user-manual/transfer/scan-for-viruses.rst | 42 ++++----- user-manual/transfer/transfer.rst | 3 +- 5 files changed, 121 insertions(+), 22 deletions(-) create mode 100644 user-manual/transfer/_csv/file-name-edits.csv create mode 100644 user-manual/transfer/clean-up-names.rst diff --git a/user-manual/index.rst b/user-manual/index.rst index a52dfd57..41348e94 100644 --- a/user-manual/index.rst +++ b/user-manual/index.rst @@ -15,6 +15,7 @@ links to each chapter's main sections. transfer/forensic transfer/dspace transfer/scan-for-viruses + transfer/clean-up-names transfer/dataverse ingest/ingest ingest/manual-normalization diff --git a/user-manual/transfer/_csv/file-name-edits.csv b/user-manual/transfer/_csv/file-name-edits.csv new file mode 100644 index 00000000..f7ff48c9 --- /dev/null +++ b/user-manual/transfer/_csv/file-name-edits.csv @@ -0,0 +1,6 @@ +Original name,Edited name,METS output +i & tem,i___tem,"Original name=""%transferDirectory%objects/i & tem.png""; cleaned up name=""%transferDirectory%objects/i___tem.png""" +Česká republika.png,Ceska_republika.png,"Original name=""%transferDirectory%objects/Česká republika.png""; cleaned up name=""%transferDirectory%objects/Ceska_republika.png""" +Éireann,Eireann,"Original name=""%transferDirectory%objects/Éireann.png""; cleaned up name=""%transferDirectory%objects/Eireann.png""" +España,Espana,"Original name=""%transferDirectory%objects/España.png""; cleaned up name=""%transferDirectory%objects/Espana.png""" 
+Росси́я,Rossiia,"Original name=""%transferDirectory%objects/Росси́я.png""; cleaned up name=""%transferDirectory%objects/Rossiia.png""" diff --git a/user-manual/transfer/clean-up-names.rst b/user-manual/transfer/clean-up-names.rst new file mode 100644 index 00000000..25161cc7 --- /dev/null +++ b/user-manual/transfer/clean-up-names.rst @@ -0,0 +1,91 @@ +.. _clean-up-names: + +============== +Clean up names +============== + +The clean up names microservice runs twice during processing - once on the +Transfer tab and once on the Ingest tab. The purpose of this microservice is to +ensure that Archivematica's tools and processes do not fail because of +characters that appear in file or directory names. + +Archivematica groups a wide variety of tools together to create preservation +workflows. Some of the tools that Archivematica uses have a narrowly-defined +scope in terms of what the tool considers to be a valid character for a file or +directory name. Encountering a character that the tool considers to be invalid +can cause the tool to fail, which can halt processing and interfere with normal +operation of Archivematica. + +To prevent these kinds of tool failures from happening, Archivematica implements +a script that changes file and directory names to conform to requirements of the +most restrictive tools. Valid characters are defined by `specific code`_ in +Archivematica: + +``-_.()abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789`` + +The script tries to replace any character that is not included in the above list +with its nearest equivalent - for example, the character ``Č`` is replaced by +``C``. Where the script cannot find a close equivalent, the character is +replaced by an underscore. The change is documented in the METS file. + +.. csv-table:: + :file: _csv/file-name-edits.csv + :header-rows: 1 + +Additional information can be found in the complete PREMIS event: + +.. 
code:: xml + + + + + + UUID + ef7ef8b8-c363-4dfe-9b67-4100d1bfa7f2 + + name cleanup + 2019-01-17T22:07:17+00:00 + prohibited characters removed:program="sanitize_names"; version="1.10.d0ccb7d7661cf35c769dcc0846d8f087998af713" + + + + Original name="%transferDirectory%objects/Росси́я.png"; cleaned up name="%transferDirectory%objects/Rossiia.png" + + + + preservation system + Archivematica-1.7 + + + repository code + + + + + + +This strategy for addressing character limitations within the tools bundled by +Archivematica can result in significant +changes to file and directory names. In order to provide easy access to a record +of the changes, the Archivematica AIP contains log files that show the original +name alongside the changed name. There are two log files: + +* The log file for name changes that happen during Transfer is located at + ``data/logs/transfers/my-transfer-name/logs/filenameCleanup.log`` +* The log file for name changes that happen during Ingest is located at + ``data/logs/filenameCleanup.log`` + +These logs contain a plaintext rendering of the information embedded within the +PREMIS event. + +.. code:: + + Sanitized name: %transferDirectory%objects/Росси́я.png -> %transferDirectory%objects/Rossiia.png + Sanitized name: %transferDirectory%objects/Éireann.png -> %transferDirectory%objects/Eireann.png + Sanitized name: %transferDirectory%objects/f & ile.png -> %transferDirectory%objects/f___ile.png + Sanitized name: %transferDirectory%objects/Česká republika.ong -> %transferDirectory%objects/Ceska_republika.ong + Sanitized name: %transferDirectory%objects/Ísland.png -> %transferDirectory%objects/Island.png + Sanitized name: %transferDirectory%objects/España.png -> %transferDirectory%objects/Espana.png + + +..
_`specific code`: https://github.com/artefactual/archivematica/blob/b6dcfb07a6be5957a5085efd1fecd8462fdc3a91/src/MCPClient/lib/clientScripts/sanitizeNames.py#L34 diff --git a/user-manual/transfer/scan-for-viruses.rst b/user-manual/transfer/scan-for-viruses.rst index 8898dfc5..bd2f2ffe 100644 --- a/user-manual/transfer/scan-for-viruses.rst +++ b/user-manual/transfer/scan-for-viruses.rst @@ -1,14 +1,14 @@ .. _scan-for-viruses: -================= - Scan for viruses -================= +================ +Scan for viruses +================ -The scan for viruses Microservice runs at multiple points in the transfer and -ingest workflows inside Archivematica. Archivematica uses the ClamAV antivirus -engine and its configuration is discussed :ref:`here `. The +The scan for viruses microservice runs at multiple points in the transfer and +ingest workflows inside Archivematica. Archivematica uses the ClamAV antivirus +engine and its configuration is discussed :ref:`here `. The configuration of this service can effect whether PREMIS events are recorded for -files or not. +files or not. .. figure:: images/VirusPREMISPass.* :align: center @@ -16,29 +16,29 @@ files or not. :width: 100% :alt: The event log for a successful virus check - The event log for a successful virus check + The event log for a successful virus check We look at the impact of various settings below. Exploring ClamAV settings ------------------------- -``MaxFileSize`` If a file is passed to the scanner that is larger than this +``MaxFileSize`` If a file is passed to the scanner that is larger than this then it will not be scanned. No event will be recorded. -``MaxScanSize`` limits the number of bytes that will be scanned. This might be -used in a standard operating environment where one might be confident a virus -or malware will only appear within that range. 
In Archivematica, because of the -possibility of a virus still existing outside of that range, a PREMIS event +``MaxScanSize`` limits the number of bytes that will be scanned. This might be +used in a standard operating environment where one might be confident a virus +or malware will only appear within that range. In Archivematica, because of the +possibility of a virus still existing outside of that range, a PREMIS event cannot be recorded confidently. -``MaxStreamLength`` is a setting used in Clamdscan only. The maximum number of -bytes that can be sent to the ClamAV daemon. Files that are larger than this -limit cannot be sent in their entirety to the server and so a PREMIS event that -states the existence or non-existence of malware in the file cannot be recorded +``MaxStreamLength`` is a setting used in Clamdscan only. The maximum number of +bytes that can be sent to the ClamAV daemon. Files that are larger than this +limit cannot be sent in their entirety to the server and so a PREMIS event that +states the existence or non-existence of malware in the file cannot be recorded confidently. -We can observe the impact of different combinations of configuration options in +We can observe the impact of different combinations of configuration options in the following two tables. Clamscan @@ -69,8 +69,8 @@ Clamdscan | 84M | 42M | 84M | 100M | No | No | +-----------+-------------+-------------+-----------------+----------+--------------+ -In both tables you can see that if for any reason a file is not scanned, -or *'not-completely-scanned'*, a PREMIS event will not be recorded. A PREMIS +In both tables you can see that if for any reason a file is not scanned, +or *'not-completely-scanned'*, a PREMIS event will not be recorded. A PREMIS event in either instance of `PASS` or `FAIL` would be a false-positive. 
-:ref:`Back to the top ` \ No newline at end of file +:ref:`Back to the top ` diff --git a/user-manual/transfer/transfer.rst b/user-manual/transfer/transfer.rst index 7e6ccbb5..541bf047 100644 --- a/user-manual/transfer/transfer.rst +++ b/user-manual/transfer/transfer.rst @@ -576,7 +576,8 @@ The microservices that run on the Transfer tab include: original transfer and places as a text file in the AIP. * **Clean up names**: removes prohibited characters from folder and filenames, - such as ampersands. + such as ampersands. For more information, see :ref:`Clean up names + `. * **Identify file format**: allows the user to choose between various format identification tools, or to skip format identification at this stage. See