Reproducibility of detection #256

Open
sotaro-kanematsu opened this issue Oct 15, 2024 · 2 comments

@sotaro-kanematsu

I would appreciate your opinion on the reproducibility of fusion genes detected with Arriba. I conducted two types of analyses: 1. the same library, prepared from a single template, was sequenced twice and each run was analyzed; 2. independent libraries were prepared from the same template, and each was sequenced and analyzed. For the duplicate experiments in cases 1 and 2, I examined the concordance rate of the detected fusions. The concordance rate was about 50% when only the sequencing was repeated (n=8) and about 25% when the experiment was repeated from library preparation (n=16). Could the fusions that did not match between the repeated analyses be false positives? I was particularly surprised that the concordance rate was only around 50% even when the same library was sequenced twice.
By the way, there is no significant difference in sequencing amount or quality between the two runs.

Thank you for your kind help.

Sota

@suhrig
Owner

suhrig commented Oct 16, 2024

The "n"s you mention - are they sequencing run counts or fusion counts?

In any case, the concordance you observe doesn't seem unusual. Library creation and sequencing are both stochastic processes. Between two runs, you will not amplify/sequence the exact same molecules twice. A fusion that is clearly detectable in one run may be underrepresented in another, so the first run can yield evidence for different fusions than the second. Hence, discordance does not necessarily indicate an artifact; the fusion may simply not be detectable in one of the runs. Of course, some of the discordant calls will be artifacts, though.
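
To illustrate why sampling noise alone can produce low concordance, here is a minimal sketch. It assumes that the number of supporting reads for a fusion follows a Poisson distribution and that a fusion counts as detected once it reaches a hypothetical threshold of `min_reads` supporting reads. Both are simplifying assumptions for illustration and not a description of Arriba's actual filters.

```python
import math


def p_detect(expected_reads, min_reads=3):
    """Probability of seeing at least `min_reads` supporting reads.

    Assumes a Poisson model with mean `expected_reads`; `min_reads` is a
    hypothetical detection threshold, not an Arriba parameter.
    """
    p_below = sum(
        math.exp(-expected_reads) * expected_reads**k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - p_below


def expected_concordance(expected_reads, min_reads=3):
    """P(detected in both runs | detected in at least one run).

    Assumes the two runs are independent draws with the same detection
    probability p, which gives p^2 / (2p - p^2) = p / (2 - p).
    """
    p = p_detect(expected_reads, min_reads)
    return p / (2.0 - p)


# Print the model's values for a few assumed expression levels.
for mean_reads in (1, 3, 5, 10, 20):
    print(mean_reads, p_detect(mean_reads), expected_concordance(mean_reads))
```

Under this toy model, a real fusion whose expected support sits near the threshold is often found in only one of two replicates, whereas a well-supported fusion is found in both almost every time.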

You can reduce the discordance from the sequencing step by increasing the sequencing depth. At a certain depth, you should reach detection saturation. North of 50 million reads should suffice to reliably detect the high-confidence and medium-confidence fusions (provided that the duplication rate isn't too high and you use >=75nt paired-end sequencing). If you're unsure whether you have reached saturation, you can downsample the BAM file in silico to various depths and rerun Arriba. At some point, the saturation curve should flatten.
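
The in silico downsampling could be prepared along these lines. This is a sketch only: it builds `samtools view -s` commands, where the integer part of the `-s` value is the random seed and the fractional part is the fraction of reads to keep. The helper names, the output file naming and the example read counts are assumptions made for illustration. The commands are returned rather than executed, and each downsampled BAM would then be passed to Arriba in the same way as the full file.

```python
from pathlib import Path


def subsample_arg(total_reads, target_reads, seed=42):
    """Build the SEED.FRACTION value for `samtools view -s`.

    Returns None when the target is not below the available depth,
    because no downsampling is possible in that case.
    """
    if total_reads <= 0 or target_reads <= 0:
        raise ValueError("read counts must be positive")
    frac_text = f"{target_reads / total_reads:.4f}"
    if not frac_text.startswith("0."):
        return None  # fraction rounds to 1 or more: keep the full file
    if frac_text == "0.0000":
        raise ValueError("target depth is too small relative to the input")
    return f"{seed}{frac_text[1:]}"  # e.g. seed 42, fraction 0.5 -> "42.5000"


def downsample_commands(bam_path, total_reads, targets, seed=42):
    """Return one samtools command (as an argument list) per reachable target."""
    bam = Path(bam_path)
    commands = []
    for target in targets:
        arg = subsample_arg(total_reads, target, seed)
        if arg is None:
            continue  # cannot downsample to a depth above the input depth
        out = bam.with_name(f"{bam.stem}.{target // 1_000_000}M.bam")
        commands.append(
            ["samtools", "view", "-b", "-s", arg, "-o", str(out), str(bam)]
        )
    return commands


# Hypothetical example: an 80 million read BAM downsampled to three depths.
for cmd in downsample_commands(
    "sample.bam", 80_000_000, [30_000_000, 40_000_000, 50_000_000]
):
    print(" ".join(cmd))
```

Counting the high- and medium-confidence fusions at each depth and plotting them against depth then gives the saturation curve mentioned above.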

When comparing the concordance between two samples, I recommend ignoring low-confidence fusions. They have a high false-positive rate. Their purpose is to provide fusion calls in situations where high sensitivity is more important than high specificity. Without external knowledge (e.g., structural variant calls from whole-genome sequencing or an expectation to find a certain fusion that is characteristic of a given cancer type), these fusions should be treated with caution. You should find that the concordance of the high-/medium-confidence fusions is better and that most of the discordance in your samples comes from the low-confidence fusions.
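
A comparison restricted to high- and medium-confidence calls might look like the following sketch. It assumes the tab-separated Arriba output with the columns `#gene1`, `gene2` and `confidence`. It identifies a fusion by its gene pair only and measures concordance as shared calls divided by all calls (Jaccard index). Both choices are assumptions; the concordance rate in the original question may have been defined differently.

```python
import csv


def load_fusions(lines, confidences=("high", "medium")):
    """Collect (gene1, gene2) pairs whose confidence is in `confidences`.

    `lines` is any iterable of text lines, for example an open fusions.tsv.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {
        (row["#gene1"], row["gene2"])
        for row in reader
        if row["confidence"] in confidences
    }


def concordance(calls_a, calls_b):
    """Jaccard index of two call sets; NaN when both sets are empty."""
    union = calls_a | calls_b
    if not union:
        return float("nan")
    return len(calls_a & calls_b) / len(union)
```

With two result files, usage would be along the lines of `concordance(load_fusions(open("run1/fusions.tsv")), load_fusions(open("run2/fusions.tsv")))`, where the paths are placeholders. Passing `confidences=("high", "medium", "low")` shows how much of the discordance the low-confidence calls contribute.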

Happy to answer any follow-up questions you may have.

@sotaro-kanematsu
Author

sotaro-kanematsu commented Oct 22, 2024

Apologies for the lack of clarity. The sample size (n) refers to the number of samples analyzed. I strongly agree that increasing the sequencing depth is effective in reducing discordance at the sequencing stage. At the same time, I am uncertain whether 50 million reads will be sufficient for stable detection, because the samples I am analyzing are fresh frozen tumor tissue. The cell populations that make up the tissue are diverse, so the tumor cell proportion and the heterogeneity within each sample make it difficult to judge what depth is sufficient. I plan to investigate how the concordance rate changes when the depth is varied between 30 million, 40 million, and 50 million reads.

Thanks Suhrig!!
