inventory.data.gov

Tim Lowden edited this page Jan 16, 2025 · 26 revisions

inventory.data.gov, a.k.a. Inventory, is used by federal agencies to manage metadata for their datasets. Inventory generates the agency's data.json, which must be hosted on the agency's website (e.g. agency.gov/data.json). Inventory is a CKAN instance and can host datasets in addition to metadata.

Access

Access to Inventory is historically confusing. There are several mechanisms referring to public/private, and each means something different. Since Inventory contains only open data, all data within Inventory can be publicly accessible. However, for historical reasons, datasets are visible only to authenticated users, while resources are publicly visible.

  • CKAN private: true is a property on the dataset that is shown on the organization's dataset listing and affects visibility within CKAN. Only members of an organization can see private datasets within their own organization. Since Inventory only contains open data and any metadata is published on catalog.data.gov, there is no reason to mark a dataset as private. The CKAN private field is ignored by Inventory's data.json export.
  • DCAT-US “public access level” doesn’t mean anything to CKAN and does not affect how the dataset is displayed within Inventory. A dataset with "accessLevel": "non-public" will still appear in the data.json inventory.
  • Publishing status (draft or published) is an Inventory concept that determines whether the dataset is included in the data.json when exporting from Inventory. It does not affect visibility within CKAN. Any authenticated user will still be able to see draft datasets. Public users will be able to see resources of draft datasets. TODO: is this defined in ckanext-datajson?

Resources, the actual data files uploaded to individual datasets, do not have a concept of private and inherit visibility settings from the dataset. Any dataset that includes resource files hosted on Inventory must be marked private: false, otherwise the resource files will not be accessible to anonymous users. This includes some of GSA's hosted datasets that are available by download or the datastore API.

Why are datasets not visible to the public? 🤷 Possibly because of the confusion around publishing status. See GSA/data.gov#2095.

Environments

Instance    URL
Production  inventory.data.gov
Staging     inventory-stage-datagov.app.cloud.gov
Sandbox     inventory-dev-datagov.app.cloud.gov

Dependencies

Sub-components:

  • ckan

Services:

  • nginx
  • PostgreSQL
  • redis
  • S3
  • SOLR

Common tasks

Updating Publishers List

Inventory organizations have a pre-approved list of "publishers" that can be listed in the metadata for a dataset. This list is uploaded to the Inventory organizations and populates the publisher dropdown shown to the dataset creator. The list currently lives at https://github.com/GSA/inventory-app/blob/main/config/data/inventory_publishers.csv. To update it, either edit the file on GitHub using the pencil icon in the top right, or clone the repository and make the changes locally.

Without manually adding a publisher to an org, no publisher will appear in the dropdown when an inventory user creates a new dataset record.

  1. Open https://github.com/GSA/inventory-app/blob/main/config/data/inventory_publishers.csv and click Edit this file
  2. For a brand new org, add a row in the correct alphabetical spot
    1. The first column is the org-url-id, followed by the full text name of the main org, then (optionally) the name of the sub-org or any additional publishers desired in the org, all comma-separated. End the row with one trailing comma for each header field left blank. The header has 7 possible fields: organization,publisher,publisher_1,publisher_2,publisher_3,publisher_4,publisher_5
    2. Examples: nrel-gov,Department of Energy,National Renewable Energy Laboratory,,,, has 4 trailing commas because 3 fields are used; state-gov,Department of State,,,,, has 5 trailing commas because 2 fields are used; and department-of-energy,Department of Energy,Office of Nuclear Energy,Idaho National Laboratory,,, has 3 trailing commas because 4 fields are used.
  3. For an existing org, add a row that adds the new Publisher in the correct column of the hierarchy
  4. Create a Pull Request (similar to https://github.com/GSA/inventory-app/pull/469)
  5. Add a team member (or the whole team) as a reviewer. Once the PR is merged, an automated deployment of this CSV runs after the application deployment. You can verify the deployment in GitHub Actions.

Note that any org (even a sub-org) in Inventory needs its own row with a publisher. A sub-org of DOE cannot just add a line to the department-of-energy block; it must have its own row with its own org ID in the first column.
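A miscounted trailing comma is an easy mistake to make in this file, so a quick field-count check before opening the Pull Request may help. The sketch below is not part of inventory-app: check_publisher_rows is a hypothetical helper, and it assumes that no field contains a quoted comma, so a plain comma split is enough.

```shell
# Hypothetical helper, not part of inventory-app.
# Prints every row whose field count differs from the 7 header fields
# and returns non-zero if any such row exists.
# Assumption: no field contains a quoted comma.
check_publisher_rows() {
  awk -F',' '
    BEGIN { bad = 0 }
    NF != 7 {
      printf "line %d: %d fields (expected 7): %s\n", NR, NF, $0
      bad = 1
    }
    END { exit bad }
  ' "$1"
}
```

From a local clone, this could be run as check_publisher_rows config/data/inventory_publishers.csv. Blank lines are reported as well, since they have zero fields.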

Adding an Organization

  1. Log in to Inventory, click Organizations, then click the Add Organization button.
  2. Fill out the Name, Description, and Image fields
    1. If the org name is long, consider manually changing the URL to a shorter, logical string. Example: Agricultural Research Service, Department of Agriculture becomes inventory.data.gov/organization/ars-usda-gov
  3. Save the org
  4. If the new org is a sub-org of a parent agency/organization (meaning when the parent org exports a data.json file, the application will pull the data.json from the sub-org inventory and incorporate it into the final output), navigate to the parent org page and click Admin
  5. On the parent org's admin page, either add a custom field with the key sub-agencies and a value, or edit the value if the key already exists.
    1. The value should be a comma-separated list of org-url-ids, such as olm-doe,fossil-energy,cfo-energy-gov
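Before pasting a sub-agencies value into the custom field, it can be sanity-checked against the format shown in the example above. This is a minimal sketch: valid_sub_agencies is a hypothetical helper, and it assumes org-url-ids contain only lowercase letters, digits, hyphens, and underscores, with no spaces around the commas. Whether Inventory tolerates extra whitespace is not stated here, so the check is deliberately strict.

```shell
# Hypothetical helper: succeeds only if the value is a comma-separated list
# of org-url-ids with no spaces and no empty entries.
# Assumption: ids use lowercase letters, digits, hyphens, and underscores.
valid_sub_agencies() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9_-]+(,[a-z0-9_-]+)*$'
}
```

For example, valid_sub_agencies 'olm-doe,fossil-energy,cfo-energy-gov' succeeds, while a value with a space after a comma is rejected.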

Importing from data.json

ckanpyimport is used in onboarding new agencies to inventory.data.gov. This tool imports datasets from a data.json file.

The import script will happily create duplicates, so if there are any existing datasets in the organization, you probably should delete them all first.

Run this from the jumpbox using nohup or tmux so that disconnecting your session does not interrupt the script. The script can take a while depending on how many packages need to be imported (~2 hours for 1000 datasets). You should also test against staging before running against production.
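For the nohup route, a small wrapper can keep the long-running import alive after the session disconnects and capture its output for later review. This is a sketch only: run_detached is a hypothetical helper, and the import command itself is not shown because its exact invocation is not documented on this page.

```shell
# Hypothetical wrapper: run any command under nohup in the background,
# send its stdout and stderr to a log file, and print the background PID.
run_detached() {
  log_file="$1"
  shift
  nohup "$@" < /dev/null > "$log_file" 2>&1 &
  echo "$!"
}
```

Call it with a log file name followed by the import command and its arguments, then follow progress with tail -f on the log file. The printed PID can be used to check whether the import is still running.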

Adding a User

Once a user has registered via the Google form and the application has been approved by admins, a CKAN user can be added with an Editor role for the agency with the following steps:

Important Note: Despite the inventory.data.gov UI providing buttons for an admin to Add a User, due to the integration with login.gov, adding users via the UI is no longer supported. Adding a user via the UI will only confuse SAML and cause issues.

Non-developer instructions:

  1. Fill out an issue to add a user at https://github.com/GSA/datagov-account-management/issues using the New User Account template
  2. Make sure the dropdown lists inventory.data.gov as the application
  3. Choose Editor permissions for any non-datagov team member

Developer instructions:

  1. Given the user's name and email (lowercase), log in to the cloud.gov CLI and run this cf command:
cf run-task inventory --command "ckan user add firstname-lastname email=email@agency.gov password=\'\$(cat /proc/sys/kernel/random/uuid)\'" -k 1500M -m 2G
  2. Monitor the output and make sure the user is successfully created:
# task-name is printed from previous command
cf tail -f -t log inventory | grep some-task-name
  3. Log in to the Inventory web UI, go to the agency's organization page, and add the user as an Editor.

If step 2 fails with an error saying that the username is taken or that the email address is already used by another registered user, and you can't find the user in the UI, the user is in a deleted state. Use the steps under Updating a User to reactivate it. You need to know the username to do this. If you only know the email address, you will have to run DB queries to find the username.

Updating a User

If a user exists but is in a deleted state, you can use the API to reactivate the user. The commands assume the username is some-user-name.

Examine the user:

curl -H "Authorization: <your-token>" -s "https://inventory.data.gov/api/action/user_show?id=some-user-name" | jq

Reactivate the user:

curl -H "Authorization: <your-token>" -X POST https://inventory.data.gov/api/action/user_patch -d '{"id": "some-user-name", "state": "active"}'
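
The two calls above can be wrapped in a small function so that the same steps can be rehearsed against staging first. This is a sketch under stated assumptions: CKAN_API_TOKEN and INVENTORY_URL are hypothetical variable names, the function names are illustrative, and the username is assumed to contain no characters that need JSON escaping. The explicit Content-Type header is a cautious addition that the commands above do not include.

```shell
# Assumptions: CKAN_API_TOKEN holds your API token (hypothetical variable name);
# INVENTORY_URL defaults to production and can be pointed at staging instead.
INVENTORY_URL="${INVENTORY_URL:-https://inventory.data.gov}"

# Build the JSON body sent to user_patch to reactivate a user.
# Assumption: the username needs no JSON escaping.
reactivate_payload() {
  printf '{"id": "%s", "state": "active"}' "$1"
}

# Show the user's current record, then set its state to "active".
reactivate_user() {
  curl -sS -H "Authorization: ${CKAN_API_TOKEN}" \
    "${INVENTORY_URL}/api/action/user_show?id=$1"
  curl -sS -H "Authorization: ${CKAN_API_TOKEN}" \
    -H "Content-Type: application/json" \
    -X POST "${INVENTORY_URL}/api/action/user_patch" \
    -d "$(reactivate_payload "$1")"
}
```

With the token exported, reactivate_user some-user-name runs both steps. Setting INVENTORY_URL to https://inventory-stage-datagov.app.cloud.gov beforehand targets staging.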

Alert conditions

Health Checks

Health checks are configured with cloud.gov; see the current configuration as code here.

Database Initialization

This should be handled automatically: https://github.com/GSA/inventory-app/blob/main/.profile#L90-L95
