-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversion from standard VCF to Local-alleles VCF #2105
Comments
Ah sorry, I just found It's still not clear if this is the only way to convert from normal to local-allele though. |
Currently we have no command to convert between PL and LPL. The intention was to provide the functionality where it is actually needed and avoid producing such monster sites, which happens at the Is the purpose of your experimentation just testing or does it come from a real world scenario? In any case, a dedicated program, or maybe an option in |
Well it's chicken and egg. I don't know of any publically available many-sample data files using LPL. I do know of ones using normal PL. I agree the answer is not to start from here, but here is where I'm at! If you know of public data with LPL then feel free to point me towards it. Thanks. I'm testing the split method anyway as this is really just an experiment to see how this works rather than something for a pipeline. One feature request is +split needs a --threads option. It's dominated by bgzf compression time. |
This is a request for improved documentation.
Doing a hunt through the documentation I find local alleles mentioned in one section only:
This is the -L option of
bcftools merge
. The usage for this claims:It's unclear how we can use this as a conversion tool for an existing file. I can see two possible routes.
Take an existing many-sample VCF and split into potentially thousands of per-sample files, and then merge them back together again. This seems excessive and probably extremely time consuming. Plus unlike samtools, there's no
bcftools split
subcommand, so I assume this would be many passes through the data withbcftools view -S
which feels like it'll end up be O(N^2) complexity in the number of samples. Or am I missing a split command under a different name? [Edit: yes - see below. I found a plugin later on]Merging the existing VCF with an empty VCF. I'm unsure of how to go about this.
I tried method 2 initially as it seemed the most viable, but immediately hit an error.
Yet the usage statement claims 0 means unlimited. (And confusingly implies it's the default via
[0]
, although it has no default as it's a required argument.)Not deterred, I soldiered on and took it at its word and went with 1.
[Incidentally - a feature request. Please permit tools to work on uncompressed data when we're reading the entire file. It's tedious to have to compress and index everything when I know everything is in the same order already.]
I also hit problems with not being sure what a valid empty VCF file looks like. Is there a command to take the header from an existing VCF and strip off all the samples, so we get an empty one? I did it with egrep and echoes in the end.
After compressing and indexing that, my merge now produces LAA in the header, but as far as I can see it's totally absent in the output file.
So am I correct in assuming now that it's not possible by my guess number 2 above, and the only route to producing LAA is a full sample by sample split and then a remerge back together again?
The back story to all of this is I just want to be able to produce a baseline from an existing merged file that's not using LAA to see how much better it can get, before looking into other formats and other tools. Help would be appreciated, and I'm sure it's not just beneficial to me (hence why putting the issue here rather than direct email).
The text was updated successfully, but these errors were encountered: