closest for m/z - RT pair #20

Closed
michaelwitting opened this issue May 31, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

@michaelwitting
Collaborator

One thing I have to do quite often is to search for a specific m/z - RT pair in a data set. A function similar to closest would be great, but returning not only the closest match but all matches. Maybe, to be fit for the future, it could already include the possibility of CCS values?

What do you think, @jorainer?

@jorainer
Member

hm, could be that I already implemented this somewhere (can't remember if it was in xcms or some other package). That's indeed an important function. Maybe named like closestPair or similar? But it's definitely a tricky one!

@michaelwitting
Collaborator Author

I don't know, could be that I missed it so far. But I think this is definitely something for MetaboCoreUtils. Can be used in MS1 annotation, alignment etc...

@jorainer
Member

jorainer commented Sep 6, 2023

Picking that issue up again: I would suggest the following definition:

mclosest <- function(x, table, ppm = 0, tolerance = Inf) {
...
}

where x and table can be two dimensional arrays (matrix or data.frame) with the same number of columns (doesn't have to be limited to 2). The function should then find for each row in x the row in table with the smallest distance considering each pair of columns (i.e. smallest difference between column 1 in both arrays, column 2 in both arrays etc). Other properties:

  • ppm and tolerance should be numeric of length 1 or equal to the number of columns of x.
  • the result should be an integer of length equal to the number of rows of x, each element being the index (row) in table with the closest values.
  • I would not use any similarity algorithm (like Euclidean distance or similar) to calculate the similarity, because the columns are expected to contain values with different units (e.g. if x and table are data frames with m/z and retention time values).

Implementation suggestion:

  • calculate the absolute difference between pairwise columns in x and table (i.e. absolute difference of values in column 1 of x and table, absolute difference of values in column 2 of x and table etc.) - might be that we will need to loop over rows in x - or alternatively do some matrix operation?
  • replace differences larger than allowed by ppm and tolerance with NA
  • rank differences (or replace with their order)
  • return for each row in x the index of the row in table with the lowest rank product
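
The implementation steps suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that was eventually merged: the name mclosest_sketch is hypothetical, the loop over rows of x is just one of the two options mentioned, and the way ppm is combined with tolerance (tolerance plus ppm of the value in x) is an assumption.

```r
# Hypothetical sketch of the suggested steps; not the MetaboCoreUtils code.
mclosest_sketch <- function(x, table, ppm = 0, tolerance = Inf) {
    x <- as.matrix(x)
    table <- as.matrix(table)
    stopifnot(ncol(x) == ncol(table))
    nc <- ncol(x)
    # ppm and tolerance: length 1 or one value per column.
    stopifnot(length(ppm) %in% c(1L, nc), length(tolerance) %in% c(1L, nc))
    ppm <- rep_len(ppm, nc)
    tolerance <- rep_len(tolerance, nc)
    res <- rep(NA_integer_, nrow(x))
    for (i in seq_len(nrow(x))) {
        # Step 1: absolute difference of row i of x to all rows of table,
        # calculated separately for each column.
        d <- abs(sweep(table, 2, x[i, ], "-"))
        # Step 2: differences larger than allowed become NA.
        # Assumption: allowed difference = tolerance + ppm of the x value.
        allowed <- tolerance + ppm * abs(x[i, ]) / 1e6
        d[sweep(d, 2, allowed, ">")] <- NA
        # Steps 3 and 4: rank the differences per column and multiply the
        # ranks; a row with an NA in any column gets an NA rank product.
        rp <- rep(1, nrow(table))
        for (j in seq_len(nc))
            rp <- rp * rank(d[, j], na.last = "keep")
        # which.min ignores NA; keep NA if no row of table is acceptable.
        if (!all(is.na(rp)))
            res[i] <- which.min(rp)
    }
    res
}
```

Because only ranks are combined, no distance is ever calculated across columns with different units, which is the point of the rank product. Ties in the rank product are resolved here by taking the first matching row of table.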

The name mclosest should tell that this is a multi closest calculation... not perfect name, so open for alternative suggestions.

Would that be something you would be OK with, @michaelwitting? I could let Philippine (@philouail) implement that.

@michaelwitting
Collaborator Author

Will this always match columns called mz and then the additional one?
I'm just thinking about how this could be used in a flexible manner to match retention times or collisional cross sections. Should the user be allowed to define the name of the column to be used for the additional matching? Of course it then has to be present in both input data frames.

@jorainer
Member

jorainer commented Sep 7, 2023

I would require that both x and table have the same number of columns. That would keep this function very generic and could be applied to many different use cases. The user has to ensure that these are provided in the correct order (i.e. first columns being m/z, second columns retention times, third columns ...).

Examples:

  • mclosest(a[, c("mzmed", "rtmed")], b[, c("mz", "rt")]) would return for each row in a the index of the row in b (the table) with the best match.
  • mclosest(a[, c("mzmed", "rtmed")], b[, c("mz", "rt")], ppm = 0, tolerance = c(0.01, 2)) would also return the best match, but only if the difference between the m/z values in a and b is below 0.01 and the difference in retention times is below 2.

does this make sense?
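
To make the column-order requirement concrete, here is a small sketch of how the inputs for such a call might be prepared. The data frames a and b and their contents are made up for illustration. The call itself is guarded and assumes that the function, once available in MetaboCoreUtils, has the signature proposed in this thread.

```r
# Hypothetical tables; only the column names follow the examples above.
a <- data.frame(mzmed = c(100.001, 250.2), rtmed = c(61, 300),
                into = c(1e4, 5e3))
b <- data.frame(name = c("A", "B", "C"), rt = c(60, 120, 301),
                mz = c(100.0, 100.002, 250.2))

# Select the columns to compare in the same order in both tables
# (m/z first, retention time second), whatever their original position.
qa <- as.matrix(a[, c("mzmed", "rtmed")])
qb <- as.matrix(b[, c("mz", "rt")])

# Only run if the package (and the function) is available.
if (requireNamespace("MetaboCoreUtils", quietly = TRUE) &&
    exists("mclosest", envir = asNamespace("MetaboCoreUtils"))) {
    # One tolerance per column: 0.01 for m/z, 2 for retention time.
    idx <- MetaboCoreUtils::mclosest(qa, qb, ppm = 0,
                                     tolerance = c(0.01, 2))
    # Map the returned row indices back to the annotation in b.
    print(b$name[idx])
}
```

Selecting columns with c("mz", "rt") reorders them as well, so b does not need to store m/z before retention time.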

@michaelwitting
Collaborator Author

Makes total sense to me.

@philouail philouail self-assigned this Sep 12, 2023
This was referenced Sep 29, 2023
@jorainer
Member

jorainer commented Oct 2, 2023

@philouail implemented this now (PR #71). It's in the main branch and I'll push to Bioconductor.

@jorainer jorainer closed this as completed Oct 2, 2023