require_all_on fails with iterables #434

Closed
3 tasks done
aulemahal opened this issue Jan 18, 2022 · 0 comments · Fixed by #435

Comments

@aulemahal
Contributor

Here's a quick checklist of what to include:

  • Include a detailed description of the bug or suggestion
  • Output of intake_esm.show_versions()
  • Minimal, self-contained copy-pastable example that generates the issue if possible. Please be concise with code posted.

Description

When a search query involves a column that stores iterables and require_all_on is also specified (on a column other than the one with the iterables), the returned catalog is empty.

csv file:

simulation_id,ensemble_id,model_institution_id,model_id,experiment_id,timestep_id,domain_id,member_id,variable_id,file
A,CMIP6,CCCma,CanESM,historical,day,NAM,r1i1p1,"('tasmax', 'tasmin')",file1
A,CMIP6,CCCma,CanESM,historical,day,NAM,r1i1p1,"('pr', 'prsn')",file2
B,CMIP6,CCCmb,CanESM,historical,day,NAM,r1i1p1,"('tasmax', 'tasmin')",file3
C,CMIP6,CCCmc,CanESM,historical,day,NAM,r1i1p1,"('tasmax', 'tasmin')",file4
D,CMIP6,CCCmd,CanESM,historical,day,NAM,r1i1p1,"('pr', 'tasmin')",file5
D,CMIP6,CCCme,CanESM,historical,day,NAM,r1i1p1,"('tasmax', 'prsn')",file6

json file:

{
    "esmcat_version": "0.1.0",
    "assets": {
        "column_name": "file",
        "format": "netCDF"
    },
    "aggregation_control": {
        "variable_column_name": "variable_id",
        "groupby_attrs": ["simulation_id", "domain_id", "timestep_id"],
        "aggregations": [
            {"type": "join_new", "attribute_name": "member_id"},
            {"type": "union", "attribute_name": "variable_id"}
        ]
    },
    "attributes" : [],
    "catalog_file": "test.csv"
}

What I Did

import ast
import intake

cat = intake.open_esm_datastore('test.json', read_csv_kwargs={'converters': {'variable_id': ast.literal_eval}})
cat.search(variable_id=['tasmax', 'prsn'], require_all_on=['simulation_id'])
# got: < catalog with 0 dataset(s) from 0 asset(s)>
# expected : < catalog with 2 dataset(s) from 3 asset(s)>

The culprit is intake_esm._search.search_apply_require_all_on, which does not account for columns that store iterables. When it gathers the column values of each group into a set to compare against the set of requested values, the entries from iterable columns are still whole iterables (e.g. ('tasmax', 'tasmin')), whereas the requested set contains the individual elements (e.g. 'tasmax'). The comparison therefore never succeeds, and every group is dropped.
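The mismatch can be reproduced without intake_esm. The sketch below is a simplified stand-in for the comparison, not the library's actual code: the helper groups_with_all and its flatten flag are hypothetical names, and the rows are the simulation_id and variable_id columns from the CSV above.

```python
from collections import defaultdict

# (simulation_id, variable_id) pairs taken from the CSV in this issue.
rows = [
    ("A", ("tasmax", "tasmin")),
    ("A", ("pr", "prsn")),
    ("B", ("tasmax", "tasmin")),
    ("C", ("tasmax", "tasmin")),
    ("D", ("pr", "tasmin")),
    ("D", ("tasmax", "prsn")),
]
requested = {"tasmax", "prsn"}  # values passed as variable_id in the search


def groups_with_all(rows, flatten):
    """Return the simulation_ids whose collected values cover `requested`.

    Hypothetical helper for illustration only.
    """
    found = defaultdict(set)
    for sim, variables in rows:
        if flatten:
            found[sim].update(variables)  # add each element of the tuple
        else:
            found[sim].add(variables)  # add the whole tuple as one element
    return sorted(sim for sim, values in found.items() if requested <= values)


naive = groups_with_all(rows, flatten=False)     # sets of tuples
flattened = groups_with_all(rows, flatten=True)  # sets of strings
print(naive, flattened)  # prints: [] ['A', 'D']
```

Without flattening, no group matches because a string is never equal to a tuple. With flattening, simulations A and D satisfy the query, which agrees with the expected 2 datasets. On a DataFrame, the same flattening could presumably be done with DataFrame.explode on the iterable column before grouping.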

PR coming sooner than not.

Version information: output of intake_esm.show_versions()

INSTALLED VERSIONS

cftime: 1.5.1.1
dask: 2021.12.0
fastprogress: 0.2.7
fsspec: 2021.11.1
gcsfs: 2021.11.1
intake: 0.6.4
intake_esm: 2021.8.17.post43+dirty
netCDF4: 1.5.8
pandas: 1.3.5
requests: 2.26.0
s3fs: 2021.11.1
xarray: 0.20.2
zarr: 2.10.3
