"RuntimeError: Either words or rawWords must be filled" using add_doc sometimes #161

Closed
batmanscode opened this issue Feb 26, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@batmanscode

I have text in a dataframe and was adding it in like this:

for text in df['text']:
    mdl.add_doc(text.strip().split())

This works fine

However, when I try to remove stopwords before calling add_doc, I get the error in the title.

I'm doing the preprocessing using texthero like this:

import texthero as hero
from texthero import preprocessing

custom_pipeline = [preprocessing.remove_stopwords,
                   preprocessing.remove_digits,
                   preprocessing.remove_punctuation,
                   preprocessing.remove_whitespace]

df['clean_text'] = hero.clean(df['tweet'], custom_pipeline)

for text in df['clean_text']:
    mdl.add_doc(text.strip().split())
RuntimeError: Either `words` or `rawWords` must be filled.

Side note: maybe this could be built into tomotopy using texthero

@batmanscode batmanscode changed the title RuntimeError: Either words or rawWords must be filled using add_doc sometimes RuntimeError: Either words or rawWords must be filled using add_doc sometimes Feb 26, 2022
@batmanscode batmanscode changed the title RuntimeError: Either words or rawWords must be filled using add_doc sometimes "RuntimeError: Either words or rawWords must be filled" using add_doc sometimes Feb 26, 2022
@bab2min
Owner

bab2min commented Feb 26, 2022

Hi @batmanscode ,
It seems that there is an empty document in your df['clean_text']. Could you check the value of df['clean_text'] to make sure there are no blank documents?
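To illustrate how a blank document can arise here, the following is a minimal sketch with made-up strings (not the reporter's data). It assumes the cleaning step leaves an empty or whitespace-only string when a text consists only of stopwords, digits, or punctuation; splitting such a string yields an empty token list, which is what add_doc would then receive.

```python
# Hypothetical cleaned texts: the last two stand in for rows that
# contained nothing but stopwords/digits/punctuation before cleaning.
cleaned = ["topic models", "", "   "]

# Same tokenization as in the issue: strip, then split on whitespace.
token_lists = [text.strip().split() for text in cleaned]

for text, tokens in zip(cleaned, token_lists):
    # An empty list here is a document with no words at all.
    print(repr(text), "->", tokens)
```

Any row whose token list is empty is a candidate for the error in the title.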

@batmanscode
Author

@bab2min df['clean_text'].isnull().value_counts() showed no empty values

@bab2min
Owner

bab2min commented Mar 2, 2022

@batmanscode
df.isnull() only tests whether a value is NA. Because an empty str '' is not NA, it does not flag empty strings. Try the following:

df['clean_text'].apply(lambda x:bool(x.strip())).value_counts()
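As a side-by-side illustration of the difference, here is a small sketch on a made-up Series (the values and the variable names are assumptions, not data from this issue). It compares the NA check with a blank-string check and then keeps only the non-blank rows.

```python
import pandas as pd

# Hypothetical stand-in for df['clean_text']; two entries are blank strings.
clean_text = pd.Series(["topic models rock", "", "   "])

# isnull() flags only NA values (None/NaN), so blank strings pass unnoticed.
na_mask = clean_text.isnull()

# Stripping whitespace and comparing with '' flags the blank documents.
blank_mask = clean_text.str.strip().eq("")

print("NA rows:", int(na_mask.sum()))
print("blank rows:", int(blank_mask.sum()))

# Keep only rows that still contain text.
non_blank = clean_text[~blank_mask]
```

Filtering with the blank mask before the add_doc loop is one way to avoid passing empty documents to the model.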

@batmanscode
Author

batmanscode commented Mar 8, 2022

Ah, this makes sense, thank you. There are indeed empty values here. Is there a way to get tomotopy to skip these? It's not really a problem to remove them, I'm just curious.

@bab2min
Owner

bab2min commented Mar 8, 2022

@batmanscode Currently, add_doc has no such feature. But I think it's a good idea to add the option to ignore empty docs.
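Until such an option exists, skipping empty documents can be done on the caller's side. The following is a rough sketch, not part of tomotopy: `add_nonempty_docs` is a hypothetical helper, and `_FakeModel` is a stand-in that only mimics the error from this issue so the example runs without the library installed.

```python
def add_nonempty_docs(mdl, texts):
    """Tokenize each text on whitespace and add it to `mdl`,
    skipping texts that yield no tokens. Returns the number skipped."""
    skipped = 0
    for text in texts:
        words = text.strip().split()
        if not words:
            # Nothing left after cleaning; do not pass it to add_doc.
            skipped += 1
            continue
        mdl.add_doc(words)
    return skipped


class _FakeModel:
    """Hypothetical stand-in exposing only add_doc (not the real tomotopy model)."""

    def __init__(self):
        self.docs = []

    def add_doc(self, words):
        if not words:
            raise RuntimeError("Either `words` or `rawWords` must be filled.")
        self.docs.append(list(words))


mdl = _FakeModel()
skipped = add_nonempty_docs(mdl, ["topic models", "", "   ", "empty docs skipped"])
print("skipped:", skipped, "added:", len(mdl.docs))
```

With a real model, the same helper would be called as `add_nonempty_docs(mdl, df['clean_text'])`. Returning the skipped count makes it easy to see how many rows were dropped.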

@bab2min bab2min added the enhancement New feature or request label Mar 8, 2022
@batmanscode
Author

@bab2min Agreed. It would be a nice quality-of-life feature to have.
