
a problem about dataset #2

Open

luckyouz opened this issue Jun 6, 2024 · 3 comments

@luckyouz

luckyouz commented Jun 6, 2024

Hello,
I noticed that your article reports datasets of 1143, 1055, and 963 molecules for the MED1, NPM1, and HP1α droplets, respectively, and that you provide the data file probe_screen-data.xlsx. However, the article does not seem to specify how these datasets are split or which data are held out. I have reviewed the data file, but the molecule counts for each droplet do not appear to match those reported.
Could you explain how the datasets were split, and how the training, validation, and test sets for the deep learning models described in the readme.md can be derived from this data file? Your help in clarifying these points would be greatly appreciated.
Thank you!

@pgmikhael
Owner

Hi,

Thanks for reaching out!

As mentioned in the methods, the molecules were assigned to the training set (80%), validation set (10%), or test set (10%) using a scaffold split. Specifically, the code in this README provides the exact parameters used to split the data and to define which molecules are held out.

Since not all molecules could be processed through RDKit and Chemprop, the counts of 1143, 1055, and 963 refer to the molecules that could be processed and were used to develop the models.
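
For anyone trying to reproduce the split from the spreadsheet alone, here is a minimal sketch of an RDKit-based scaffold split with the same 80/10/10 fractions. It illustrates the general technique, not the repository's exact code: the function name, column handling, and greedy group assignment are assumptions for illustration, and the README remains the reference for the actual split parameters.

```python
# Minimal sketch of an 80/10/10 scaffold split (illustration only; the
# repository's README defines the exact parameters used in the paper).
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_val=0.1):
    # Keep only molecules RDKit can parse, mirroring the filtering step
    # described above (which yields counts like 1143/1055/963).
    parsed = [(s, Chem.MolFromSmiles(s)) for s in smiles_list]
    parsed = [(s, m) for s, m in parsed if m is not None]

    # Group molecules by their Bemis-Murcko scaffold.
    groups = defaultdict(list)
    for s, m in parsed:
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=m, includeChirality=False)
        groups[scaffold].append(s)

    # Greedily fill train/val/test, largest scaffold groups first, so that
    # molecules sharing a scaffold never span two sets.
    n = len(parsed)
    n_train, n_val = int(frac_train * n), int(frac_val * n)
    train, val, test = [], [], []
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= n_train:
            train.extend(group)
        elif len(val) + len(group) <= n_val:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test
```

Chemprop also ships a scaffold-based split option that can be selected when training; the arguments listed in the README are the ones actually used here.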

@tkella47

I believe the question remains unresolved: the file provided with the paper, probe_screen-data.xlsx, does not appear to align with the documentation in this repository.

- The file format is .xlsx rather than the .csv mentioned in the instructions (a conversion sketch follows below).
- The file contains multiple columns, but their structure and content do not correspond to the expected format or to any clear interpretation based on the documentation.

Could you clarify how the data in this file map to the processing steps described, or provide additional guidance on how to interpret them? Any further details or examples would be greatly appreciated.

Thank you for your assistance!
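
In the meantime, a minimal sketch for dumping the workbook's sheets to CSV and inspecting their columns; it assumes nothing about the sheet or column names inside probe_screen-data.xlsx (pandas needs openpyxl installed to read .xlsx):

```python
# Export every sheet of the workbook to its own CSV and print the column
# names, to help map them to the processing steps in the documentation.
import pandas as pd

sheets = pd.read_excel("probe_screen-data.xlsx", sheet_name=None)  # dict of DataFrames
for name, df in sheets.items():
    df.to_csv(f"{name}.csv", index=False)
    print(name, df.columns.tolist())
```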

@pgmikhael
Owner

Hi,

I added a folder with the CSVs referenced in the notebook. Hopefully this helps with running the code.
