This project contains very little code, but provides a set of fixture files under version control to facilitate testing of ingest workflows for newspaper_works, or for other digitized newspaper content workflows in general.
Add NewspaperWorksFixtures to your Gemfile, preferably only as a test or development dependency:
group :development, :test do
  gem 'newspaper_works_fixtures'
end
Then run bundle install.
Once the gem is installed, you will be able to access the paths to the fixtures by calling methods on the base NewspaperWorksFixtures class, as in:
2.4.1 :001 > NewspaperWorksFixtures.file_fixtures
=> "/path/to/newspaper_works_fixtures/spec/fixtures/files"
# /path/to/gem/spec/fixtures/files/ndnp/batch_local
NewspaperWorksFixtures.ndnp_local_batch
A small batch of newspaper objects that is intended to mock vendor-provided digitization deliverables (page-level objects, no article segmentation) conforming to Library of Congress NDNP specs. (The data here is from University of Utah.)
This batch includes 1 title, 1 reel, and 1 issue with 2 pages. Each scan has a TIFF, JP2, PDF, and ALTO XML file.
2 image scans; 74 MB
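To check that each scan in a batch has the expected derivatives, the files can be grouped by basename. The following is a minimal sketch, not part of the gem: `derivatives_by_scan` and `scans_missing` are hypothetical helpers, and the sketch assumes that all derivatives of a scan sit in the same directory, share a basename, and use the `.tif`, `.jp2`, `.pdf`, and `.xml` extensions.

```ruby
# Hypothetical helper (not part of the gem): group the files under a batch
# directory by directory plus basename, so that the derivatives of each scan
# end up together. Assumes derivatives of one scan share a basename.
def derivatives_by_scan(batch_dir)
  files = Dir.glob(File.join(batch_dir, '**', '*')).select { |path| File.file?(path) }
  grouped = files.group_by do |path|
    File.join(File.dirname(path), File.basename(path, '.*'))
  end
  grouped.each_with_object({}) do |(scan, paths), result|
    result[scan] = paths.map { |path| File.extname(path).downcase }.sort
  end
end

# Hypothetical helper: return only the groups that lack one or more of the
# expected extensions. The default list is an assumption about file naming.
def scans_missing(batch_dir, expected = %w[.jp2 .pdf .tif .xml])
  derivatives_by_scan(batch_dir).reject do |_scan, extensions|
    (expected - extensions).empty?
  end
end
```

With the gem loaded, the batch path returned by `NewspaperWorksFixtures.ndnp_local_batch` could be passed as `batch_dir`. Files that are not page scans (for example batch-level or issue-level XML) would also show up as incomplete groups, so the result is a starting point for inspection, not a strict validation.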
# /path/to/gem/spec/fixtures/files/ndnp/batch_test_ver01
NewspaperWorksFixtures.ndnp_chronam_batch
A small batch of newspaper objects that mimics the BagIt-formatted batches of scanned newspapers found on the Library of Congress Chronicling America data/batches site. (The data here is from batch_curiv_jojoba_ver01; page-level objects, no article segmentation.)
This batch includes multiple titles, reels, target files, issues, and pages. Each scan has a JP2, PDF, and ALTO XML file (no TIFF). All of the corresponding BagIt and METS files are included as well.
11 image scans; 149 MB
# /path/to/gem/spec/fixtures/files/article_segmented/batch_deseret_news
NewspaperWorksFixtures.pdf_batch
NewspaperWorksFixtures.tiff_batch
These are two variants of four-page issues of Chicopee Weekly, via Digital Commonwealth.
The PDF source materials are 400 ppi monochrome (CCITT Group 4 compressed), with each PDF representing a single four-page issue. The file naming convention is as follows:
- Publication directory named with Library of Congress Control Number (LCCN).
- Inside publication directory are PDF files using naming convention of YYYYMMDDEE.pdf, where:
  - YYYY is four digit year.
  - MM is month (zero padded).
  - DD is day of month (zero padded).
  - EE is edition number (zero padded).
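The naming convention above can be parsed mechanically. The following is a minimal sketch, not part of the gem: `parse_issue_filename` and `ISSUE_PDF_PATTERN` are hypothetical names introduced for illustration, and the example filename is made up.

```ruby
require 'date'

# Hypothetical pattern for the YYYYMMDDEE.pdf convention:
# four-digit year, two-digit month, two-digit day, two-digit edition.
ISSUE_PDF_PATTERN = /\A(\d{4})(\d{2})(\d{2})(\d{2})\.pdf\z/

# Hypothetical helper (not part of the gem): return the publication date and
# edition number encoded in a PDF filename, or nil if the name does not
# follow the convention or does not encode a real calendar date.
def parse_issue_filename(filename)
  match = ISSUE_PDF_PATTERN.match(File.basename(filename))
  return nil unless match

  year, month, day, edition = match.captures.map(&:to_i)
  return nil unless Date.valid_date?(year, month, day)

  { date: Date.new(year, month, day), edition: edition }
end

# Example with a made-up filename:
info = parse_issue_filename('1853060401.pdf')
puts "#{info[:date]} edition #{info[:edition]}" if info
```

Returning `nil` for names that do not match keeps the helper usable as a filter when a publication directory also holds unrelated files.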
The TIFF batch likewise consists of one-bit "Group 4" compressed monochrome images, and uses a similar YYYYMMDDEE naming convention:
- Publication directory named with Library of Congress Control Number (LCCN).
- Directly contained in publication directory are directories, one per issue, using the YYYYMMDDEE naming convention.
- Inside issue directories are TIFF files with lexically ordered filenames, corresponding to page sequence order of that issue.
- The JP2 batch is a copy of a two-page issue also included in NDNP source materials in this gem.
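The TIFF directory layout described above can be walked to recover issues and their page order. The following is a minimal sketch, not part of the gem: `tiff_pages_by_issue` is a hypothetical helper, and it assumes page files end in `.tif` or `.tiff`.

```ruby
# Hypothetical helper (not part of the gem): given a publication directory
# (named by LCCN), map each issue directory (named YYYYMMDDEE) to the list
# of its TIFF filenames in page sequence order.
def tiff_pages_by_issue(publication_dir)
  issue_dirs = Dir.glob(File.join(publication_dir, '*')).select do |path|
    File.directory?(path) && File.basename(path).match?(/\A\d{10}\z/)
  end

  issue_dirs.sort.each_with_object({}) do |issue_path, pages|
    tiffs = Dir.glob(File.join(issue_path, '*')).select do |path|
      File.file?(path) && path.match?(/\.tiff?\z/i)
    end
    # Lexical filename order corresponds to page sequence order.
    pages[File.basename(issue_path)] = tiffs.map { |path| File.basename(path) }.sort
  end
end
```

Directories whose names are not ten digits are skipped, so stray files or folders in the publication directory do not appear as issues.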
# /path/to/gem/spec/fixtures/files/article_segmented/batch_deseret_news
NewspaperWorksFixtures.article_segmented_batch_deseret_news
This batch includes one title, one issue, nine pages, and articles. Each page has a PDF and an ALTO XML file, and each article has a PDF and an ALTO XML file (no TIFF).
Article segmented files: 19 pdf, 19 xml/dtd; 3.9 MB
Page level files: 9 pdf; 5.6 MB; 9 xml/dtd; 3.8 MB
# /path/to/gem/spec/fixtures/files/article_segmented/batch_topaz_times
NewspaperWorksFixtures.article_segmented_batch_topaz_times
This batch includes an issue, pages, and articles. Each page has a PDF, TIF, ALTO XML file, and an articles XML file.
Article segmented files: 30 PDF, 30 TIF; 1.1 MB
Page level files: 4 PDF, 4 TIF; 876 KB
This gem is part of a project developed in a collaboration between the University of Utah J. Willard Marriott Library and Boston Public Library, as part of the "Newspapers in Samvera" project grant funded by the Institute of Museum and Library Services.