index 18509 is out of bounds for axis 1 with size 13293 - error while creating lda model using gensim #160

lonewolf06 opened this issue Jan 17, 2020 · 0 comments


Hello,

I am getting the error in the title while trying to create an LDA model for topic modeling of customer comments. I am new to Python, so I wasn't able to debug the issue. Any help is much appreciated! Below is the code:

# function to clean up the data
def clean_text(text):
    tokenized_text = word_tokenize(text.lower())
    cleaned_text = [t for t in tokenized_text if t not in stopwords_hotel and re.match('[a-zA-Z-][a-zA-Z-]{2,}', t)]
    return cleaned_text

# data tokenization
tokenized_data_hotel = []
for text in df_hotel.customer_comments_lem:
    tokenized_data_hotel.append(clean_text(text))

# Build a Dictionary - association word to numeric id

dictionary_hotel = corpora.Dictionary(tokenized_data_hotel)

# Transform the collection of texts to a numerical form

corpus = [dictionary.doc2bow(text) for text in tokenized_data_hotel]

#creating bag of words corpus
corpus_bow_hotel = [dictionary.doc2bow(doc) for doc in tokenized_data_hotel]

# topic modeling using bag of words

lda_model_bow_hotel = gensim.models.ldamodel.LdaModel(corpus=corpus_bow_hotel,
                                                      id2word=dictionary_hotel,
                                                      num_topics=4, per_word_topics='TRUE')
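
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the corpus above is built with `dictionary.doc2bow(...)`, while the model is given `id2word=dictionary_hotel`. If `dictionary` is a different, larger dictionary left over from earlier in the session, the corpus could contain token ids (such as 18509) that exceed the size of `dictionary_hotel` (13293). The sketch below reproduces that kind of mismatch in plain Python, without gensim; `build_vocab`, `to_bow`, `max_id`, and the toy documents are hypothetical stand-ins, not part of the original code.

```python
def build_vocab(docs):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens known to vocab."""
    counts = {}
    for token in doc:
        if token in vocab:
            token_id = vocab[token]
            counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


def max_id(corpus):
    """Largest token id appearing anywhere in a bag-of-words corpus."""
    return max((token_id for doc in corpus for token_id, _ in doc), default=-1)


# Toy data (hypothetical): the hotel comments and some unrelated earlier corpus.
hotel_docs = [["room", "clean", "staff"], ["staff", "friendly"]]
other_docs = [["flight", "delay", "luggage", "seat", "crew", "staff"]]

vocab_hotel = build_vocab(hotel_docs)  # plays the role of dictionary_hotel
vocab_other = build_vocab(other_docs)  # plays the role of a stale `dictionary`

# Encoding hotel documents with the other vocabulary yields ids that the
# hotel vocabulary does not have.
mismatched = [to_bow(doc, vocab_other) for doc in hotel_docs]
# Encoding with the same vocabulary the model will use keeps ids in range.
consistent = [to_bow(doc, vocab_hotel) for doc in hotel_docs]

print("vocabulary size:", len(vocab_hotel))
print("max id, mismatched corpus:", max_id(mismatched))
print("max id, consistent corpus:", max_id(consistent))
```

If this assumption holds, building the corpus with `dictionary_hotel.doc2bow(doc)` would keep every id within the range the model expects. Separately, `per_word_topics='TRUE'` passes a string; the boolean `True` is probably what was intended.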
