Methods for large corpora? #9

Open
gnewton opened this issue Dec 27, 2018 · 1 comment


gnewton commented Dec 27, 2018

Sort of related to #8...

You have methods in the API, like the one in your example, that take a slice of strings (docs):

matrix, _ := vectoriser.FitTransform(testCorpus...)

I'd like to use this for very large corpora, with tens or hundreds of millions of (not tiny) documents. Holding all of these in memory as a single slice of strings does not sound optimal.
Any chance the methods that currently take a string slice for the documents could be altered to accept a function or interface that iterates over the docs instead? (Or could new methods be added that support this?)

Thanks,
Glen

@james-bowman
Owner

Thanks, this is on the agenda. I was thinking of something like a FitPartial() method and/or adding support for more generalisable input streams.
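As a rough illustration of the FitPartial() idea, the sketch below fits a vocabulary incrementally, one batch at a time. It is not the library's implementation: `StreamingVectoriser` is a hypothetical toy type, and the `FitPartial` signature shown is an assumption, since no such method is specified here.

```go
package main

import (
	"fmt"
	"strings"
)

// StreamingVectoriser is a hypothetical toy type, not the library's
// vectoriser. It only tracks a term -> column index vocabulary.
type StreamingVectoriser struct {
	Vocabulary map[string]int
}

func NewStreamingVectoriser() *StreamingVectoriser {
	return &StreamingVectoriser{Vocabulary: make(map[string]int)}
}

// FitPartial (assumed signature) extends the vocabulary with the terms of
// one batch of documents. It can be called repeatedly, so the full corpus
// never needs to be in memory at once.
func (v *StreamingVectoriser) FitPartial(docs ...string) *StreamingVectoriser {
	for _, doc := range docs {
		for _, tok := range strings.Fields(strings.ToLower(doc)) {
			if _, seen := v.Vocabulary[tok]; !seen {
				v.Vocabulary[tok] = len(v.Vocabulary)
			}
		}
	}
	return v
}

// Transform returns one sparse row (column index -> term count) per
// document. Terms that were never seen during fitting are ignored.
func (v *StreamingVectoriser) Transform(docs ...string) []map[int]int {
	rows := make([]map[int]int, len(docs))
	for i, doc := range docs {
		row := make(map[int]int)
		for _, tok := range strings.Fields(strings.ToLower(doc)) {
			if col, ok := v.Vocabulary[tok]; ok {
				row[col]++
			}
		}
		rows[i] = row
	}
	return rows
}

func main() {
	v := NewStreamingVectoriser()
	// Two batches, as they might arrive from disk or a stream.
	v.FitPartial("the quick brown fox", "the lazy dog")
	v.FitPartial("the dog barks")

	rows := v.Transform("the dog and the fox")
	fmt.Println(len(v.Vocabulary), rows[0])
}
```

One consequence of this design is that the vocabulary, and therefore the matrix width, is only final after the last batch, so transforming would have to wait until fitting is complete or tolerate a growing column count.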
