Reproducibility of detection #256
Comments
Are the "n"s you mention sequencing run counts or fusion counts? In any case, the concordance you observe doesn't seem unusual. Library creation and sequencing are both stochastic processes. Between two runs, you will not amplify/sequence the exact same molecules twice. A fusion that is clearly detectable in one run may be underrepresented in another, so each run yields evidence for a somewhat different set of fusions. Hence, discordance does not necessarily mean artifact: the fusion may simply not have been detectable in the other run. Of course, some of the discordant calls will be artifacts, though.

You can reduce the discordance from the sequencing step by increasing the sequencing depth. At a certain depth, you should reach detection saturation. More than 50 million reads should suffice to reliably detect the high-confidence and medium-confidence fusions (provided that the duplication rate isn't too high and you use >=75nt paired-end sequencing). If you're unsure whether you have reached saturation, you can downsample the BAM file in silico to various depths and rerun Arriba. At some point, the saturation curve should flatten.

When comparing the concordance between two samples, I recommend ignoring low-confidence fusions. They have a high false-positive rate. Their purpose is to provide fusion calls in situations where high sensitivity is more important than high specificity. Without external knowledge (e.g., structural variant calls from whole-genome sequencing or an expectation to find a certain fusion that is characteristic of a given cancer type), these fusions should be treated with caution. You should find that the concordance of the high-/medium-confidence fusions is better and that most of the discordance in your samples comes from the low-confidence fusions.

Happy to answer any follow-up questions you may have.
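To illustrate the suggestion to ignore low-confidence fusions when comparing runs, here is a minimal Python sketch. It rests on several assumptions: that the fusion table is tab-separated with columns named `#gene1`, `gene2`, and `confidence`; that a fusion is identified by its gene pair alone (breakpoints are ignored); and that concordance means shared calls divided by all calls seen in either run. The function names and the two tiny tables are made up for illustration and are not real results.

```python
import csv
import io

def load_fusions(tsv_text, keep_confidence=("high", "medium")):
    """Return the set of (gene1, gene2) pairs whose confidence is kept."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fusions = set()
    for row in reader:
        if row["confidence"] in keep_confidence:
            # Assumption: the gene pair alone identifies a fusion.
            fusions.add((row["#gene1"], row["gene2"]))
    return fusions

def concordance(calls_a, calls_b):
    """Shared calls divided by all calls seen in either run (Jaccard index)."""
    union = calls_a | calls_b
    return len(calls_a & calls_b) / len(union) if union else 1.0

# Hypothetical toy tables standing in for two repeat runs of one sample.
run1 = (
    "#gene1\tgene2\tconfidence\n"
    "GENE_A\tGENE_B\thigh\n"
    "GENE_C\tGENE_D\tmedium\n"
    "GENE_E\tGENE_F\tlow\n"
)
run2 = (
    "#gene1\tgene2\tconfidence\n"
    "GENE_A\tGENE_B\thigh\n"
    "GENE_C\tGENE_D\tmedium\n"
    "GENE_G\tGENE_H\tlow\n"
)

all_levels = ("high", "medium", "low")
# Concordance over every call, including low-confidence ones.
print(concordance(load_fusions(run1, all_levels), load_fusions(run2, all_levels)))
# Concordance restricted to high- and medium-confidence calls.
print(concordance(load_fusions(run1), load_fusions(run2)))
```

In this toy case the only discordant calls are low-confidence ones, so restricting the comparison raises the concordance. Whether real samples behave the same way has to be checked on the actual data.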
Apologies for the lack of clarity. The sample size (n) refers to the number of samples analyzed. I strongly agree with your point that increasing the sequencing depth is effective in reducing discordance at the sequencing stage. At the same time, I am uncertain whether 50 million reads will be sufficient for stable detection, because the samples I am analyzing are fresh frozen tissue (tumor samples). The cell populations constituting the tissue are diverse, so tumor cell proportion and heterogeneity within the samples likely make it difficult to determine what depth is sufficient. I plan to investigate how the concordance rate changes when the depth is varied between 30 million, 40 million, and 50 million reads. Thanks Suhrig!!
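For the planned comparison at 30, 40, and 50 million reads, the in silico downsampling could be prepared along these lines. This is only a sketch: the helper name `downsample_command`, the file names, and the total of 60 million reads are hypothetical. It builds (but does not run) `samtools view -s SEED.FRACTION` commands, where the integer part of the `-s` value is the random seed and the decimal part is the fraction of reads to keep. The resulting depth is approximate, and each downsampled BAM file would then be passed to Arriba as usual.

```python
import shlex

def downsample_command(in_bam, out_bam, total_reads, target_reads, seed=42):
    """Build a samtools command keeping roughly target_reads of total_reads.

    Returns None when the target is not below the total (nothing to do).
    """
    fraction = round(target_reads / total_reads, 4)
    if fraction >= 1.0:
        return None
    # samtools expects SEED.FRACTION, e.g. 42.5000 = seed 42, keep 50%.
    frac_digits = f"{fraction:.4f}".split(".")[1]
    cmd = ["samtools", "view", "-b", "-s", f"{seed}.{frac_digits}",
           "-o", out_bam, in_bam]
    return shlex.join(cmd)

# Hypothetical input: a BAM file containing 60 million reads.
total = 60_000_000
for target in (30_000_000, 40_000_000, 50_000_000):
    out_name = f"sample.{target // 1_000_000}M.bam"
    print(downsample_command("sample.bam", out_name, total, target))
```

Using a different seed per replicate gives independent subsamples at the same depth, which helps separate the effect of depth from sampling noise.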
Please provide your opinion on the reproducibility of fusion genes detected with Arriba. I conducted two types of analyses:

1. A library was created once from a template, then sequenced twice and analyzed.
2. Independent libraries were created from the same template, and each was sequenced and analyzed.

In the duplicate experiments for cases 1 and 2, I examined the concordance rate of detected fusions. The concordance rate when only sequencing was repeated was about 50% (n=8), while the concordance rate when the experiment was repeated from library creation (n=16) was about 25%. Could the fusions that did not match between the two repeated analyses be false positives? I was particularly surprised that even when sequencing the same library twice, the concordance rate was only around 50%.
By the way, there is no significant difference in sequencing amount or quality between the two.
Thank you for your kind help, as always.
Sota