Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Suggestion to simplify implemenation of scitype #155

Open
ablaom opened this issue Aug 29, 2021 · 5 comments
Open

Suggestion to simplify implemenation of scitype #155

ablaom opened this issue Aug 29, 2021 · 5 comments

Comments

@ablaom
Copy link
Member

ablaom commented Aug 29, 2021

I have often lamented the fact that scitype cannot be a map of machine type to type, instead of object to type, because of the infamous CategoricalValue fly in the ointment. As a workaround to performance problems with arrays, we introduced Scitype, which is a map from type to type. Wouldn't if be simpler if implementing Scitype is the "fallback" responsibility of a convention, and that we only overload scitype for problematic cases like CategoricalValue? What am I missing here?

So, something like (ignoring convention distinctions):

# fallback:
scitype(X) = Scitype(typeof(X))

Scitype(::Type) = Unknown
Scitype(::Type{<:Integer}) = Count
# and so forth

# exceptions:
function scitype(X::CategoricalValue)
    N = length(pool(X))
    if isordered(X)
       return OrderedFactor{N}
    end
    return Multiclass{N}
end

To be clear, I'm not suggesting a change in the definition of scitype, only how it is implemented, although Scitype is something we may want to make part of the public interface.

What got me thinking about this is the case of parametric types like Sampleable{S} and a type I'd like to introduce, called Iterator{S} for lazy loaded data structures. Here S is the scitype of the objects sampled, or the scitype of the objects iterated, respectively. How to we implement scitypes for these? This is tricky because we may not have an object from which to extract the parameter S, only its machine type. So in this case we are limited to using Scitype.

Thoughts @OkonSamuel @tlienart

@ablaom
Copy link
Member Author

ablaom commented Aug 30, 2021

Of course, this comment does not address the other "fly in the ointment" which is tables, requiring trait dispatch.

@tlienart
Copy link
Collaborator

I'm probably not getting this suggestion 100% in terms of how it is different from what is there now sorry, could you maybe show what it would give in the case of Sampleable{S} before and after your suggestion?

@OkonSamuel
Copy link
Member

I have also been thinking down this line and I like this idea. It makes the codebase simpler.

@OkonSamuel
Copy link
Member

So I guess this would imply that scitype(Any[1,2,3]) should return Array{Unknown, 1} since Scitype would operate on the type information.
This is would be more efficient than the current implementation which peeks into the type of each element in the array at runtime.

@ablaom
Copy link
Member Author

ablaom commented Sep 1, 2021

Well, no, I'm not suggesting we change the definition of scitype for arrays (see Property 3) - only the implementation. According to the definition, we will need to look inside to determine the scitype in the case of Any eltype, and any other eltype for which Scitype(T) returns Unknown.

We could change the definition, but this would pose some problems. For example, if scitype(Any[1.0, 2.0, 3.0]) = AbstractVector{Unknown}, then it forces users to tighten their element types when this is not strictly necessary to use the algorithms (an objection raised by @tlienart, if I remember correctly). Also, what should the new definition be that still allows us to recover scitype(categorical(1, 2, 3)) = AbstractVector{Multiclass{3}}?

Perhaps you have a simple replacement for Property 3 that allows us to get everything we want?

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants