Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

GPos getindex #25

Open
Marlin-Na opened this issue Nov 15, 2020 · 4 comments
Open

GPos getindex #25

Marlin-Na opened this issue Nov 15, 2020 · 4 comments

Comments

@Marlin-Na
Copy link

Hi Pete, thanks for your work on this. Currently getindex on GPos returns an Integer instead of GPos. I wonder if this is intentional or not? For example

julia> gp
"4-element GenomicPositions{Int64,3}"
│ Loc │ Chromosome │ Position │
│     │ String     │ Int64    │
├─────┼────────────┼──────────┤
│ 1   │ chr1       │ 3        │
│ 2   │ chrX       │ 3        │
│ 3   │ chr2       │ 3        │
│ 4   │ chr2       │ 3        │

julia> gp[2]
500003

I believe it can be fixed here

Base.getindex(x::GenomicPositions, i::Int) = x.genopos[i]

@Marlin-Na
Copy link
Author

Okay, I see there is no such scalar type implemented for GenomicPosistions like GenomicFeatures.Interval for GenomicRanges.

@phaverty
Copy link
Owner

Hi @Marlin-Na , thank you for your interest in this project. I return an integer here on purpose. This is the nucleotide position in the concatenated genome. I use such values often. I'm open to design changes, though. Returning an integer here is a bit inconsistent because a slice of a GPos should probably return a shorter GPos. Slice and index should return the same type. Probably I need syntax to do either. I don't know which (returning an index or a GPos) would use the standard syntax (get index and friends) and which would use an alternative syntax.

@Marlin-Na
Copy link
Author

Hi @phaverty , thanks for your explanation. I am mostly interested in putting GenomicVectors into a DataFrame. I see there is GenomicDataFrame, but there are cases that we want to put more than one GPos/GRanges to a DataFrame (e.g. when working with structural variants), plus it is more generalized. Currently, GenomicPositions in a DataFrame is shown as Int64, which is kind of weird. And getting one row of the DataFrame it will become a real Int64, which is inconsistent with getting multiple rows.

julia> gp = GenomicPositions(Int64[3e4,4.2e3,1.9e5,1e4],["chr2","chr1","chr2","chrX"],genomeinfo)
julia> df = DataFrame(gp = gp, metacol = collect(5:8))
4×2 DataFrame
│ Row │ gp     │ metacol │
│     │ Int64  │ Int64   │
├─────┼────────┼─────────┤
│ 1   │ 330000 │ 5       │
│ 2   │ 4200   │ 6       │
│ 3   │ 490000 │ 7       │
│ 4   │ 510000 │ 8       │

julia> df[[1,2],:].gp  # still a GenomicPositions
"2-element GenomicPositions{Int64,3}"
│ Loc │ Chromosome │ Position │
│     │ String     │ Int64    │
├─────┼────────────┼──────────┤
│ 1   │ chr2       │ 30000    │
│ 2   │ chr1       │ 4200     │

julia> df[1,:].gp  # becomes an integer
330000

I believe index should return eltype(gp) (currently, this is Int64 for both GPos and GRanges), and slice should return the same type of gp.

I see your point on returning an integer. But as I said above, I am hoping to have an Integer-like type as eltype(gp). Probably something like this so that the GenomeInfo is not lost.

# as eltype of GenomicPositions
struct GenomicPosition{N} <: AbstractInteger
    pos::Int64
    chr_info::GenomeInfo{Int64,N}
end

But I don't know how much performance degradation would this introduce. I don't know if Julia will store a copy of chr_info or just store a reference to the immutable chr_info. A reference might be fine from my point of view but copying chr_info probably would be expensive. Also I don't know if Julia is smart enough to optimize convert(Int64, gp[i]) directly as gp.genopos[i] within a loop.

Similarly, index on GenomicRanges currently returns GenomicFeatures.Interval, which I don't quite like. First, it doesn't match with eltype(GenomicRanges). Second, the chromosome information is also lost (the seqname is "hg19" instead of the actual chromosome). Similar to the GPos case above, I am hoping to have a scalar type implemented for GRanges as well (just two integers plus chr_info) and if needed, we can convert them to GenomicFeatures.Interval using chromosome names and chr_pos.

I am actually quite new to Julia, so these are just my thoughts. Let me know what's your opinion on this.

@phaverty
Copy link
Owner

I think I'll have chromosomes output Vector{String} and keep the Symbol part as an internal optimization.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants