http://www.cs.jhu.edu/~shuosun/clirmatrix/
CLIRMatrix is also available on Google Drive:
https://drive.google.com/drive/folders/1V-DcBwvAnlVAYJw_gsx0zXV5VXJcRGGc?usp=sharing
Script to extract untruncated documents from Wikipedia dumps:
Usage:
./extract.sh [wikipedia language code]
E.g.
./extract.sh en
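To process several Wikipedia editions in one run, the single-language usage above can be wrapped in a loop. This is a minimal sketch, not part of the released scripts: the `extract_all` function and the `EXTRACT` variable are hypothetical names introduced here, and the sketch assumes only that `extract.sh` takes one language code as its sole argument and exits non-zero on failure.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around extract.sh (not part of the original repository).
# EXTRACT points at the extraction script; override it if the script lives elsewhere.
EXTRACT="${EXTRACT:-./extract.sh}"

# Run the extraction script once per language code given as an argument.
# Stops at the first failure and returns a non-zero status.
extract_all() {
  local lang
  for lang in "$@"; do
    echo "Extracting: $lang"
    "$EXTRACT" "$lang" || { echo "extract failed for $lang" >&2; return 1; }
  done
}

# Example call (language codes are illustrative):
#   extract_all en de ja
```

Running the languages sequentially keeps each dump's output separate and makes it clear which language failed if the script aborts.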
[1] Shuo Sun and Kevin Duh. CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.