
a problem about dataset #2

Open

luckyouz opened this issue Jun 6, 2024 · 3 comments

@luckyouz

luckyouz commented Jun 6, 2024

Hello,
I noticed that your article reports datasets of 1143, 1055, and 963 molecules for the MED1, NPM1, and HP1α droplets, respectively, and that you provide the data file probe_screen-data.xlsx. However, the article does not seem to specify how these datasets are split or which data are held out. I have reviewed the data file, but the molecule counts for each droplet do not appear to match those reported.
Could you explain how the datasets were split, and how the training, validation, and test sets for the deep learning models described in the readme.md can be derived from this data file? Your help in clarifying these points would be greatly appreciated.
Thank you!

@pgmikhael
Owner

Hi,

Thanks for reaching out!

As mentioned in the methods, the molecules were assigned to the training set (80%), validation set (10%), or test set (10%) using a scaffold split. Specifically, the code in this README provides the exact parameters used to split the data and to define which molecules are held out.

Since not all molecules could be processed through RDKit and Chemprop, the counts of 1143, 1055, and 963 refer to the molecules that could be processed and were used to develop the models.
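
For anyone trying to reproduce the split from the spreadsheet alone, here is a minimal sketch of an RDKit-based scaffold split with the same 80/10/10 fractions. It illustrates the general technique, not the repository's exact code: the function name, column handling, and greedy group assignment are assumptions for illustration, and the README remains the reference for the actual split parameters.

```python
# Minimal sketch of an 80/10/10 scaffold split (illustration only; the
# repository's README defines the exact parameters used in the paper).
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_val=0.1):
    # Keep only molecules RDKit can parse, mirroring the filtering step
    # described above (which yields counts like 1143/1055/963).
    parsed = [(s, Chem.MolFromSmiles(s)) for s in smiles_list]
    parsed = [(s, m) for s, m in parsed if m is not None]

    # Group molecules by their Bemis-Murcko scaffold.
    groups = defaultdict(list)
    for s, m in parsed:
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=m, includeChirality=False)
        groups[scaffold].append(s)

    # Greedily fill train/val/test, largest scaffold groups first, so that
    # molecules sharing a scaffold never span two sets.
    n = len(parsed)
    n_train, n_val = int(frac_train * n), int(frac_val * n)
    train, val, test = [], [], []
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= n_train:
            train.extend(group)
        elif len(val) + len(group) <= n_val:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test
```

Chemprop also ships a scaffold-based split option that can be selected when training; the arguments listed in the README are the ones actually used here.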

@tkella47

I believe the question remains unresolved: the file provided with the paper, probe_screen-data.xlsx, does not appear to align with the documentation in this repository.

- The file format is .xlsx rather than the .csv mentioned in the instructions (a conversion sketch follows below).
- The file contains multiple columns, but their structure and content do not correspond to the expected format or to any clear interpretation based on the documentation.

Could you clarify how the data in this file map to the processing steps described, or provide additional guidance on how to interpret them? Any further details or examples would be greatly appreciated.

Thank you for your assistance!
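
In the meantime, a minimal sketch for dumping the workbook's sheets to CSV and inspecting their columns; it assumes nothing about the sheet or column names inside probe_screen-data.xlsx (pandas needs openpyxl installed to read .xlsx):

```python
# Export every sheet of the workbook to its own CSV and print the column
# names, to help map them to the processing steps in the documentation.
import pandas as pd

sheets = pd.read_excel("probe_screen-data.xlsx", sheet_name=None)  # dict of DataFrames
for name, df in sheets.items():
    df.to_csv(f"{name}.csv", index=False)
    print(name, df.columns.tolist())
```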

@pgmikhael
Owner

Hi,

I added a folder with the CSVs referenced in the notebook. Hopefully this helps with running the code.
