Federated datasets managed and available through Purdue Anvil

PurdueRCAC/DatasetDocs

DatasetDocs

This repository serves two purposes:

  1. It hosts a documentation generator for Lmod dataset module files
  2. It contains the generated documentation that is hosted on Read the Docs

Documentation

The documentation for all available datasets is published online at Read the Docs. It is generated automatically from Lmod module files and describes each dataset, its versions, and the environment variables its module sets.

The documentation is stored in the /docs directory of this repository and is continuously built and updated on Read the Docs.

Use of Datasets

The datasets documented in this repository are federated and support HPC-optimized workflows across research domains. Anvil's community dataset storage provides high-speed access to large-scale datasets. The hundreds of terabytes of meteorological and geospatial data available on Anvil have been essential to our tools and scientific work, letting researchers focus on scientific discovery rather than on data-related challenges.

For more information about Anvil and its capabilities, please visit the RCAC Anvil page.

Documentation Generator

The documentation generator tool automatically creates and maintains documentation for scientific datasets by parsing Lmod module files and creating structured documentation in reStructuredText (rst) format.

Features

  • Recursively scans directories containing Lmod (.lua) module files
  • Extracts and formats help text from module files
  • Captures environment variables set by the modules
  • Automatically detects version information from date-based filenames (YYYY-MM-DD format)
  • Generates structured documentation in reStructuredText format
  • Builds documentation using Sphinx and hosts it on Read the Docs
  • Maintains hierarchical documentation structure mirroring the dataset organization
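
The first four features above can be illustrated with a minimal sketch. This is not the code in generate_docs.py; it assumes module files call the standard Lmod functions help([[...]]) and setenv("NAME", "value") with literal string arguments, and the function names and regular expressions are hypothetical.

```python
import re
from pathlib import Path

# Hypothetical patterns (assumption): real module files may use other quoting
# styles or build values dynamically, which these regexes would not capture.
HELP_RE = re.compile(r"help\(\s*\[\[(.*?)\]\]\s*\)", re.DOTALL)
SETENV_RE = re.compile(
    r"""setenv\(\s*["']([^"']+)["']\s*,\s*["']([^"']+)["']\s*\)"""
)
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def find_module_files(root):
    """Recursively list .lua files under root, sorted for stable output."""
    return sorted(Path(root).rglob("*.lua"))


def parse_module_text(text):
    """Return (help_text, env_vars) extracted from the text of a module file."""
    match = HELP_RE.search(text)
    help_text = match.group(1).strip() if match else ""
    env_vars = dict(SETENV_RE.findall(text))
    return help_text, env_vars


def version_from_filename(path):
    """Return the file stem if it looks like YYYY-MM-DD, otherwise None."""
    stem = Path(path).stem
    return stem if DATE_RE.match(stem) else None
```

A regex is enough for module files written in this plain style. Files that compute values at load time would need a real Lua parser or Lmod itself.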

Installation

  1. Clone this repository:
git clone https://github.com/PurdueRCAC/DatasetDocs.git
cd DatasetDocs
  2. Install Python dependencies:
pip install -r docs/requirements.txt

Usage

The main script generate_docs.py can be run as follows:

python generate_docs.py \
  --datasets-dir /path/to/lmod/datasets \
  --output-dir /path/to/DatasetDocs/docs

Arguments

  • --datasets-dir: Directory containing the Lmod (.lua) module files
  • --output-dir: Directory where the generated documentation will be written
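
For readers extending the script, the two documented options could be declared with argparse roughly as follows. This is a sketch of the interface described above, not the script's actual parser; build_parser is a hypothetical name, and marking both options as required is an assumption.

```python
import argparse


def build_parser():
    """Build a parser exposing the two documented options."""
    parser = argparse.ArgumentParser(
        description="Generate reStructuredText docs from Lmod dataset module files."
    )
    # Assumption: both options are mandatory; the real script may define defaults.
    parser.add_argument(
        "--datasets-dir",
        required=True,
        help="Directory containing the Lmod (.lua) module files",
    )
    parser.add_argument(
        "--output-dir",
        required=True,
        help="Directory where the generated documentation will be written",
    )
    return parser
```

argparse converts the dashes in option names to underscores, so the parsed values are available as args.datasets_dir and args.output_dir.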

Documentation Structure

The generated documentation follows this structure:

docs/
├── index.rst               # Main documentation index
├── category1/              # Top-level dataset category
│   ├── dataset1/           # Dataset subdirectory
│   │   └── YYYY-MM-DD.rst  # Version-specific documentation
│   └── index.rst           # Category index
└── category2/
    └── ...
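
One way to produce this layout is to mirror each module file's path relative to the datasets directory. The sketch below assumes the module tree is organized as category/dataset/YYYY-MM-DD.lua, matching the structure above; rst_path_for is a hypothetical helper, not a function from generate_docs.py.

```python
from pathlib import Path


def rst_path_for(module_file, datasets_dir, output_dir):
    """Map <datasets_dir>/<category>/<dataset>/<version>.lua to the matching .rst path."""
    # Path of the module file relative to the root of the module tree.
    relative = Path(module_file).relative_to(datasets_dir)
    # Same relative location under the output directory, with an .rst suffix.
    return Path(output_dir) / relative.with_suffix(".rst")
```

Because the output path is derived purely from the input path, the documentation hierarchy follows the dataset organization without any separate configuration.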

Building Documentation Locally

While the documentation is automatically built on Read the Docs, you can also build it locally using Sphinx:

cd docs
make html

The built documentation will be available in docs/_build/html/.

Contributing

Contributions are welcome! You can contribute in several ways:

  • Improving the documentation generator
  • Fixing documentation errors
  • Enhancing the documentation structure
  • Adding new features

Please feel free to submit a Pull Request.

License

This project is licensed under an open source license; see the LICENSE file for details.
