Failure to reproduce Table 3 of the e2efold paper #6

Closed
kad-ecoli opened this issue Jun 30, 2020 · 6 comments


kad-ecoli commented Jun 30, 2020

I tried to run https://github.com/ml4bio/e2efold/blob/master/e2efold_productive/e2efold_productive_short.py on the ArchiveII dataset used to benchmark e2efold in Table 3 of the e2efold paper. To save time, I only ran it on the subset of 3911 target RNAs with up to 600 nucleotides, rather than the full set of 3975 RNAs. Nonetheless, I do not think excluding the remaining 64 RNAs (1.6% of 3975) would alter the conclusion of the benchmark. Even though the result of the e2efold pretrained model on this dataset is much better than what was shown in #5, it is still worse than what was reported in the paper:

| Method | F1 | MCC | Predicted base pairs per RNA |
|---|---|---|---|
| e2efold | 0.5540 | 0.5595 | 42.2986 |
| LinearFold | 0.6060 | 0.6085 | 56.8372 |

In particular, e2efold turns out to be even worse than LinearFold, a thermodynamics-based RNA folding program, although according to the original paper e2efold is supposed to outperform all state-of-the-art algorithms on this dataset. Note that, on average, each target among these 3911 RNAs has 59.1386 base pairs; e2efold is therefore clearly under-predicting base pairs.
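
For reference, here is a minimal sketch of how I compute F1 and MCC, treating every candidate base pair as a binary classification; the function and its set-based inputs are my own convention, not code from the e2efold repository:

    from math import sqrt

    def pair_metrics(pred_pairs, true_pairs, seq_len):
        """Score one RNA; pred_pairs and true_pairs are sets of (i, j) tuples, i < j."""
        tp = len(pred_pairs & true_pairs)   # predicted pairs present in the reference
        fp = len(pred_pairs - true_pairs)   # predicted pairs absent from the reference
        fn = len(true_pairs - pred_pairs)   # reference pairs that were missed
        tn = seq_len * (seq_len - 1) // 2 - tp - fp - fn  # candidate pairs absent from both
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mcc = (tp * tn - fp * fn) / denom if denom else 0.0
        return f1, mcc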

I wonder whether this poor performance is caused by an inconsistency in the packaging details of e2efold_productive. To verify this, could the e2efold team kindly provide a detailed table of per-target F1 scores so that I can check the worst offenders? Thank you.

liyu95 (Member) commented Jul 1, 2020

Thank you for reproducing the Table 3 performance! Yes, e2efold_productive and experiment_archiveii are slightly different.

For your information, the detailed reproduction code for ArchiveII and Table 3 is in the folder experiment_archiveii.
As discussed in the paper, we reported the performance on those RNA types in ArchiveII that also have enough samples in the training dataset. More specifically, the performance of the following RNA types is reported:
['RNaseP', '5s', 'tmRNA', 'tRNA', 'telomerase', '16s']
We did not include grp1 because its sequence lengths in ArchiveII and RNAStralign are completely different. For SRP, we do not have enough training samples: the RNAStralign dataset contains only 468 SRP samples, while ArchiveII has many more (928).

You can run main.sh in the folder experiment_archiveii to reproduce the performance in Table 3.

If you want to go into the details, check e2e_learning_stage3.py and e2e_learning_stage3_rnastralign_all_long.py in the same folder.

kad-ecoli (Author) commented Jul 1, 2020

Can you give the list of ArchiveII targets (i.e., RNA names, not just the types) on which Table 3 was calculated?

liyu95 (Member) commented Jul 1, 2020

Thank you very much for your interest!

Everything (name, sequence length, performance) is stored by the following code (lines 228–244 of e2e_learning_stage3.py):

    # Collect per-target names, lengths, and precision/recall/F1
    # (exact and one-position-shift tolerant) into one dataframe.
    e2e_result_df = pd.DataFrame()
    e2e_result_df['name'] = [a.name for a in test_data.data]
    e2e_result_df['type'] = list(map(lambda x: x.split('_')[0], [a.name for a in test_data.data]))
    e2e_result_df['seq_lens'] = list(map(lambda x: x.numpy(), seq_lens_list))
    e2e_result_df['exact_p'] = pp_exact_p
    e2e_result_df['exact_r'] = pp_exact_r
    e2e_result_df['exact_f1'] = pp_exact_f1
    e2e_result_df['shift_p'] = pp_shift_p
    e2e_result_df['shift_r'] = pp_shift_r
    e2e_result_df['shift_f1'] = pp_shift_f1
    # Keep only the six RNA types reported in Table 3.
    final_result = e2e_result_df[e2e_result_df['type'].isin(
        ['RNaseP', '5s', 'tmRNA', 'tRNA', 'telomerase', '16s'])]
    # Average each metric over the retained targets and print.
    to_output = list(map(str,
        list(final_result[['exact_p', 'exact_r', 'exact_f1', 'shift_p', 'shift_r', 'shift_f1']].mean().values.round(3))))
    print('Number of sequences: ', len(final_result))
    print(to_output)
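
To export the per-target table requested above, you could add one line after the filtering step (this line is not in the repository, and the filename is just an example):

    # hypothetical addition: dump the per-target metrics for offline inspection
    final_result.to_csv('archiveii_per_target_f1.csv', index=False)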

Could you please take a look?

kad-ecoli (Author) commented Jul 1, 2020

I see. So the paper actually only tests on a subset of 2877 (72%) RNAs from the full set of 3975 ArchiveII RNAs. This was not clear in the paper. On this subset, e2efold_productive_short.py (or e2efold_productive_long.py for targets with more than 600 nucleotides; see the small dispatch sketch after the table) indeed outperforms LinearFold. However, the performance is still not as impressive as that reported in the paper: F1 is below 0.7 for e2efold_productive, even though the paper reports F1=0.821.

| Method | F1 | MCC | Predicted base pairs per RNA |
|---|---|---|---|
| e2efold | 0.6894 | 0.6943 | 36.5262 |
| LinearFold | 0.6186 | 0.6216 | 50.6879 |
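
For completeness, a tiny sketch of the length-based script selection mentioned above; the helper function is my own, and only the 600-nucleotide cutoff and the two script paths come from this thread:

    def productive_script(seq_len):
        # choose the productive script by the 600-nt cutoff used above
        if seq_len <= 600:
            return 'e2efold_productive/e2efold_productive_short.py'
        return 'e2efold_productive/e2efold_productive_long.py'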

liyu95 (Member) commented Jul 1, 2020

Thank you for running the code and reproducing the results!
In the section "Test On ArchiveII Without Re-training" of our paper, we made it clear that we tested on the overlapping RNA families in ArchiveII, not on the entire set. In Table 3, we also reported the performance as F1=0.686, which you have now reproduced. We have never claimed that we can reach F1>0.8 on the ArchiveII dataset.
Could you please clarify what your concern is?

kad-ecoli (Author) commented Jul 1, 2020

Sorry, I mixed up Table 2 and Table 3. You are correct that e2efold should have F1=0.69 on this dataset.

The main reason for my misunderstanding was that the original text says "We then test the model on sequences in ArchiveII that have overlapping RNA types (5SrRNA, 16SrRNA, etc) with the RNAStralign dataset", which apparently should have included the "SRP" RNA type, shared by both datasets with hundreds of RNAs.

In fact, "SRP" was excluded from Table 3. This exclusion made e2efold appear better than the state of the art.
