Expected behavior of facets on array of string #617

Closed
bussec opened this issue Jun 8, 2022 · 6 comments · Fixed by #752

@bussec
Member

bussec commented Jun 8, 2022

What is the expected behavior when an ADC API query requests aggregation (via the facets request parameter) on a field that holds an array of strings, e.g., study.keywords_study? For example, if such a field holds the array ['A', 'B', 'D'], should the aggregation

  1. Increase the count for each of the strings independently {'A':1, 'B':1, 'D':1}, or
  2. Count the joint occurrence of the strings {'A,B,D':1}?

Note that the example provided in the docs does not match this case 1:1 as pcr_target is an array of objects, and pcr_target_locus contains only a single string.

@schristley
Member

> Note that the example provided in the docs does not match this case 1:1 as pcr_target is an array of objects, and pcr_target_locus contains only a single string.

Conceptually they are the same. In the example, because it is always the same field of pcr_target_locus, the rest of the object is irrelevant, and it essentially collapses to an array of strings. Another way to answer the question though is:

> 1. Increase the count for each of the strings independently {'A':1, 'B':1, 'D':1}, or

This one, because it is the most useful and common statistic and is easy for a client program to parse.

> 2. Count the joint occurrence of the strings {'A,B,D':1}?

This is interesting too but more specialized and not easy for a client to parse. This is better handled by adding a filter to the facet query, i.e. (A and B and D).
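To make the two options concrete, here is a minimal Python sketch of both counting strategies, plus the "(A and B and D)" filter. The records and the helper names (`facet_per_element`, `facet_joint`, `has_all`) are hypothetical and not part of the ADC API. Counting a string once per record, even if it is repeated within that record's array, is an assumption.

```python
from collections import Counter

# Hypothetical repertoire records; only the faceted field is shown.
repertoires = [
    {"study": {"keywords_study": ["A", "B", "D"]}},
    {"study": {"keywords_study": ["A", "C"]}},
    {"study": {"keywords_study": ["A", "B", "D"]}},
]


def facet_per_element(records):
    """Option 1: every string in a record's array increments its own count."""
    counts = Counter()
    for rec in records:
        # Assumption: a string repeated inside one record counts once for that record.
        counts.update(set(rec["study"]["keywords_study"]))
    return dict(counts)


def facet_joint(records):
    """Option 2: the whole array is treated as a single joint value."""
    return dict(Counter(",".join(rec["study"]["keywords_study"]) for rec in records))


def has_all(rec, wanted):
    """Filter equivalent to (A and B and D): all wanted strings are present."""
    return set(wanted) <= set(rec["study"]["keywords_study"])


per_element = facet_per_element(repertoires)
joint = facet_joint(repertoires)
filtered = [rec for rec in repertoires if has_all(rec, ["A", "B", "D"])]
```

With option 1 a client reads one count per keyword. The joint case can be recovered by counting the records that pass the filter.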

@bussec
Member Author

bussec commented Jun 28, 2022

> In the example, because it is always the same field of pcr_target_locus, the rest of the object is irrelevant, and it essentially collapses to an array of strings.

Where is the collapsing performed? What happens if there are multiple pcr_target_locus records? It is nice if Mongo handles these things automatically, but for sciReptor we need to recreate this behavior explicitly.

> This is interesting too but more specialized and not easy for a client to parse. This is better handled by adding a filter to the facet query, i.e. (A and B and D).

According to the GDC documentation, this is not possible (see limitation 2):

https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#facets

Otherwise I am fine with the behavior described in option 1.

@schristley
Member

> In the example, because it is always the same field of pcr_target_locus, the rest of the object is irrelevant, and it essentially collapses to an array of strings.

> Where is the collapsing performed? What happens if there are multiple pcr_target_locus records? It is nice if Mongo handles these things automatically, but for sciReptor we need to recreate this behavior explicitly.

Let me explain with an SQL example, if that helps. It depends on how you store pcr_target records, but let's assume the standard relational design of having them in their own table. Thus, the pcr_target table has some identifier fields, the pcr_target_locus field, and the forward/reverse primer location fields. A query asking whether there is a TRB locus:

SELECT * FROM samples s, pcr_target p WHERE s.id = p.id AND p.pcr_target_locus = 'TRB'

I'm ignoring the specifics of the id used for joining the tables and of restricting to a specific repertoire/sample. If this query returns no records for a specific repertoire/sample, then that sample has no TRB locus. If it returns one or more records, then the sample has a TRB locus.

Combining a query like that with GROUP BY, DISTINCT and COUNT can get you pretty close to the facets result.
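As a rough sketch of that combination, the following uses Python's built-in sqlite3 module with an in-memory database. The table layout, the `sample_id` join column, and the inserted rows are assumptions made for illustration; they are not the sciReptor schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE samples (id INTEGER PRIMARY KEY);
    CREATE TABLE pcr_target (sample_id INTEGER, pcr_target_locus TEXT);
    INSERT INTO samples (id) VALUES (1), (2), (3);
    -- Sample 1 has two TRB records and one TRA record.
    INSERT INTO pcr_target VALUES
        (1, 'TRB'), (1, 'TRB'), (1, 'TRA'), (2, 'TRB'), (3, 'IGH');
    """
)

# COUNT(DISTINCT s.id) counts each sample once per locus, however many
# pcr_target rows it has for that locus. This is the "collapsing" step.
facet_rows = conn.execute(
    """
    SELECT p.pcr_target_locus, COUNT(DISTINCT s.id) AS count
    FROM samples s
    JOIN pcr_target p ON s.id = p.sample_id
    GROUP BY p.pcr_target_locus
    ORDER BY p.pcr_target_locus
    """
).fetchall()

for locus, count in facet_rows:
    print(locus, count)
```

Here sample 1 contributes only one to the TRB count despite having two TRB rows, which mirrors counting each string of an array independently.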

> This is interesting too but more specialized and not easy for a client to parse. This is better handled by adding a filter to the facet query, i.e. (A and B and D).

> According to the GDC documentation, this is not possible (see limitation 2):

> https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#facets

I was never sure why the GDC had that limitation, but the ADC does not have it; you should be able to use any filter with a facets request. All the filter does is restrict the query to a subset of repertoire records, so it is essentially independent of the facets operation.
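For illustration, a request body that combines a filter with facets might be built as follows. This is a sketch based on my reading of the filters/facets layout in the ADC API docs. The helper `eq` is hypothetical, the values 'A', 'B', 'D' are the placeholders from this issue, and the assumption that "=" on an array field matches when any element equals the value should be checked against the array query docs.

```python
import json


def eq(field, value):
    """Hypothetical helper: one equality filter in the ADC filter layout."""
    return {"op": "=", "content": {"field": field, "value": value}}


# Filter (A and B and D), then facet on the same array-of-strings field.
request_body = {
    "filters": {
        "op": "and",
        "content": [eq("study.keywords_study", k) for k in ("A", "B", "D")],
    },
    "facets": "study.keywords_study",
}

# JSON text that would be sent as the body of a POST to a repertoire endpoint.
payload = json.dumps(request_body, indent=2)
print(payload)
```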

In the SQL world, that may mean you need to chain SELECT statements, i.e. one SELECT to do the filtering and another SELECT which operates on the first's results to do the facets.
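A minimal sketch of such chaining, again using Python's sqlite3 module: the first SELECT (written as a common table expression) applies the filter (A and B and D), and the second computes the facets over the surviving repertoires. The `study_keyword` table, its columns, and the rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per (repertoire, keyword) pair.
    CREATE TABLE study_keyword (repertoire_id INTEGER, keyword TEXT);
    INSERT INTO study_keyword VALUES
        (1, 'A'), (1, 'B'), (1, 'D'),
        (2, 'A'), (2, 'C'),
        (3, 'A'), (3, 'B'), (3, 'D'), (3, 'E');
    """
)

rows = conn.execute(
    """
    WITH filtered AS (
        -- First SELECT: repertoires that have all of A, B and D.
        SELECT repertoire_id
        FROM study_keyword
        WHERE keyword IN ('A', 'B', 'D')
        GROUP BY repertoire_id
        HAVING COUNT(DISTINCT keyword) = 3
    )
    -- Second SELECT: facets over the filtered repertoires only.
    SELECT k.keyword, COUNT(DISTINCT k.repertoire_id) AS count
    FROM study_keyword k
    JOIN filtered f ON f.repertoire_id = k.repertoire_id
    GROUP BY k.keyword
    ORDER BY k.keyword
    """
).fetchall()

for keyword, count in rows:
    print(keyword, count)
```

Repertoire 2 is dropped by the filter, so keyword C does not appear in the facets, while E still does because repertoire 3 passes the filter.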

@scharch
Contributor

scharch commented Jul 10, 2023

@bussec can this be closed?

@bussec
Member Author

bussec commented Jul 10, 2023

No, this information needs to be included in the docs (especially the difference from the GDC).

@scharch scharch modified the milestones: AIRR 1.4.2, ADC V2 Jul 10, 2023
bcorrie added a commit that referenced this issue Feb 16, 2024
@bcorrie
Contributor

bcorrie commented Feb 16, 2024

I have updated the Docs to reflect the GDC difference. Closing this issue.

@bcorrie bcorrie closed this as completed Feb 16, 2024
bcorrie added a commit that referenced this issue Feb 20, 2024
* Update facet docs

As per #617

* Removal/deprecation of is and not operators

* New release notes file for ADC API

Added deprecation of is and not.

* Error codes, repository loading changes

As per #431 and #487

* Add 408 and 413 errors

* Added 408 and 413 errors

* Add docs for AA/nt case discussion

As per #528

* Update data loading recommendation

* Remove docs about deprecated not operator

* Update to array query docs.

* Typo fix
@github-project-automation github-project-automation bot moved this to Done in ADC API Aug 28, 2024
4 participants