Parquet needs some standarization #19

Open
ypriverol opened this issue Sep 21, 2018 · 2 comments
Labels
enhancement New feature or request

Comments

@ypriverol
Collaborator

We need to do some standardization for the Parquet format that enables other people to understand the file format.

@bgruening
Contributor

Yeah, that would be nice. We should also give it a proper name :)

@sorenwacker
sorenwacker commented May 6, 2021

I like 'parquet', as it is pretty clear what library to use to open it.

Regarding column names, I had a few thoughts:

  • The column name Mass or Masses is technically wrong, as the values are m/z values. Or do you convert the m/z values into masses internally?

  • Intensities could be Intensity even if it is an array.

  • RetentionTime was used in mzXML files. In mzML files I have seen it as ScanTime, which is a bit more general and may be more accurate, since it would not imply that a chromatographic step was used.

  • Things like TIC are maybe convenient, but also somewhat redundant. It could be calculated easily in one line of code if the data were in long format:

    df_long.groupby('scan_time_min')['intensity'].sum().plot()
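
To make the long-format idea concrete, here is a minimal sketch of going from a wide layout (one row per scan, array-valued columns) to a long layout (one row per peak) and computing the TIC from it. The wide layout and the column names `scan_time_min`, `mz` and `intensity` are assumptions for illustration only, not the project's actual schema, and the sketch assumes pandas 1.3 or newer for multi-column `explode`.

```python
import pandas as pd

# Hypothetical wide layout: one row per scan, with array-valued columns.
# Column names are illustrative assumptions, not the project's schema.
df_wide = pd.DataFrame({
    "scan_time_min": [0.1, 0.2],
    "mz": [[100.0, 200.0], [100.0, 150.0, 300.0]],
    "intensity": [[10.0, 5.0], [1.0, 2.0, 3.0]],
})

# Long layout: one row per (scan, peak). Both list columns are exploded
# together, so each m/z value stays paired with its intensity.
df_long = df_wide.explode(["mz", "intensity"], ignore_index=True)

# explode() leaves object dtype behind, so cast back to floats.
df_long = df_long.astype({"mz": "float64", "intensity": "float64"})

# TIC per scan: sum of all intensities recorded at the same scan time.
tic = df_long.groupby("scan_time_min")["intensity"].sum()
```

With the data in this shape, `tic.plot()` would draw the chromatogram, so a stored TIC column would only be a convenience.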

I am quite new to metabolomics/proteomics, though. I am looking at the problem more from a data science, Python-biased perspective.
