A home for awesome digital preservation resources that are now obsolete.
- EDRM Internationalization Data Set - As far as I can tell, this was not archived by the Internet Archive.
- Apache Tika's regression corpus - Millions of files collected largely from govdocs1 and Common Crawl, with oversampling of binary formats. - The Tika corpora are no longer openly accessible.
- Apache Tika's Bugtracker corpora - A dense set of problematic files: attachments from the bug trackers of open source parsers. - No longer openly accessible, as above.
- To improve our digital preservation capability, we need to be able to analyse and evaluate our work effectively. For example, we need to compare and contrast tools and approaches, and to see how changes over time affect performance. Practising what we preach in this field means sharing our data about digital preservation. Share your digital preservation data using the Linked Data Simple Storage Specification. Dave Tarrant explains why this is a good idea. NOTE: The links still work, but the service is offline/read-only.