Failure to reproduce Table 3 of the e2efold paper #6
Comments
Thank you for reproducing the Table 3 performance! For your information, the detailed reproducing code for ArchiveII and Table 3 is in a dedicated folder of the repository; you can run the scripts there, and if you want to go into the details, you can check the code itself.
Can you give the list of ArchiveII targets (i.e., the RNA names, not just the types) on which Table 3 is calculated?
Thank you very much for your interest! Everything (name, sequence length, performance) is stored with the code at lines 228-244 of the reproducing script mentioned above. Could you please have a check?
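(The referenced snippet is not reproduced in this thread, so here is a rough, hypothetical sketch of the kind of bookkeeping it describes: per-target name, sequence length, and performance collected into one serialized structure. All names, scores, and file paths below are illustrative placeholders, not the repository's actual code or data.)

```python
import pickle

# Hypothetical bookkeeping (placeholder names and scores, not the
# repository's actual lines 228-244): collect per-target name, sequence
# length, and F1 into one structure and serialize it for inspection.
results = [
    {"name": "5s_Acanthamoeba-castellanii", "seq_len": 119, "f1": 0.74},
    {"name": "srp_Bacillus-subtilis", "seq_len": 271, "f1": 0.42},
]
with open("archiveii_results.pickle", "wb") as f:
    pickle.dump(results, f)

# Reload and inspect, e.g. the mean F1 over the stored subset.
with open("archiveii_results.pickle", "rb") as f:
    loaded = pickle.load(f)
print(sum(r["f1"] for r in loaded) / len(loaded))
```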
I see. So the paper actually only tests on a subset of 2877 (72%) RNAs from the full set of 3975 ArchiveII RNAs; this was not clear in the paper. On this subset, e2efold_productive_short.py (or e2efold_productive_long.py for sequences longer than 600 nucleotides) does indeed outperform LinearFold. However, the performance is still not as impressive as that reported in the paper: F1 is below 0.7 for e2efold_productive, even though the paper reports F1 = 0.821.
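(For reference, the F1 discussed throughout this thread is computed per target over sets of predicted versus true base pairs. A minimal self-contained sketch of that metric, not the repository's evaluation code, might look like this:)

```python
def pair_f1(pred_pairs, true_pairs):
    """F1 between two sets of base pairs, each pair an (i, j) tuple.

    Under-prediction hurts recall directly: with ~59 true pairs per
    target on average, emitting too few pairs caps F1 even when every
    predicted pair is correct.
    """
    pred, true = set(pred_pairs), set(true_pairs)
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)

# Toy example: 30 correct pairs recovered out of 59 true, none spurious.
true = [(i, i + 40) for i in range(59)]
pred = true[:30]
print(round(pair_f1(pred, true), 3))  # 0.674: perfect precision, low recall
```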
Thank you for running the code and reproducing the results!
Sorry, I mixed up Table 2 and Table 3. You are correct that e2efold should have F1 = 0.69 on this dataset. The main reason for my misunderstanding was that the original text said "We then test the model on sequences in ArchiveII that have overlapping RNA types (5SrRNA, 16SrRNA, etc) with the RNAStralign dataset", which should apparently have included the "SRP" RNA type, shared by both datasets and comprising hundreds of RNAs. In fact, "SRP" was excluded from Table 3, and this exclusion made e2efold appear better than the state of the art.
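(To make the effect of such an exclusion concrete, here is a hypothetical sketch of recomputing the mean F1 with and without one RNA type. The records and F1 values are made up; the type-prefix convention is an assumption based on ArchiveII names typically carrying the type as a prefix, e.g. 5s_, 16s_, srp_.)

```python
# Illustrative per-target records; the F1 values are invented.
records = [
    {"name": "5s_example", "f1": 0.80},
    {"name": "16s_example", "f1": 0.75},
    {"name": "srp_example", "f1": 0.30},
]

def mean_f1(recs):
    return sum(r["f1"] for r in recs) / len(recs)

# Dropping a poorly predicted type raises the reported average.
kept = [r for r in records if not r["name"].startswith("srp_")]
print(round(mean_f1(records), 3))  # 0.617 with SRP included
print(round(mean_f1(kept), 3))     # 0.775 with SRP excluded
```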
I tried to run https://github.com/ml4bio/e2efold/blob/master/e2efold_productive/e2efold_productive_short.py on the ArchiveII dataset used to benchmark e2efold in Table 3 of the e2efold paper. For time's sake, I only ran it on the subset of 3911 target RNAs with up to 600 nucleotides, rather than the full set of 3975 RNAs; nonetheless, I do not think the 64 excluded RNAs (1.6% of 3975) would alter the conclusion of the benchmark. Even though the result of the e2efold pretrained model on this dataset is much better than what was shown in #5, it is still worse than what was reported in the paper.
In particular, e2efold is found to be even worse than LinearFold, a thermodynamics-based RNA folding program, although e2efold is supposed to outperform all state-of-the-art algorithms on this dataset according to the original paper. Note that, on average, each target among these 3911 RNAs has 59.1386 base pairs; therefore, e2efold is certainly under-predicting many pairs.
I wonder whether such poor performance is caused by an inconsistency in the packaging details of e2efold_productive. To verify this, could the e2efold team kindly provide a detailed table of per-target F1, so that I can check the worst offenders? Thank you.
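(One way to produce such a table, assuming per-target results stored in the layout sketched earlier in this thread, which is an assumption since the repository's actual format may differ, would be:)

```python
import csv
import pickle

# Assumes the hypothetical per-target record layout sketched earlier;
# the e2efold repository's actual storage format may differ.
with open("archiveii_results.pickle", "rb") as f:
    results = pickle.load(f)

# Sort ascending by F1 so the worst offenders come first.
results.sort(key=lambda r: r["f1"])

with open("per_target_f1.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "seq_len", "f1"])
    writer.writeheader()
    writer.writerows(results)
```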