KeyError: "word '...' not in vocabulary" 20-newsgroups #2856

Closed
gocen opened this issue Jun 12, 2020 · 11 comments
Labels
question Discussions that are generally off-topic for the github issue tracker

Comments

@gocen

gocen commented Jun 12, 2020

I want to use pre-trained 20-newsgroups model. My code is this:
from gensim.models import Word2Vec
import gensim.downloader as api
corpus = api.load('20-newsgroups')
print (model.similarity("jesus", "game"))

But it gives the error
KeyError: "word 'jesus' not in vocabulary"

@piskvorky
Owner

And is the word jesus in the vocabulary? What's the actual issue?

@piskvorky piskvorky added the need info Not enough information for reproduce an issue, need more info from author label Jun 12, 2020
@gocen
Author

gocen commented Jun 12, 2020 via email

@gojomo
Collaborator

gojomo commented Jun 12, 2020

@gocen Your code doesn't show how model was created. (Was it loaded? Trained? With what code?)

You're likely making an error in preparing the model. If it reports that a word isn't present, it wasn't there during training, or not there in sufficient quantity.
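
To illustrate the "not there in sufficient quantity" case: Word2Vec discards words that occur fewer than `min_count` times (5 by default) before training. The sketch below only mimics that pruning in plain Python; `surviving_vocabulary` is a hypothetical helper for illustration, not gensim's implementation.

```python
from collections import Counter


def surviving_vocabulary(sentences, min_count=5):
    """Return the words that occur at least `min_count` times.

    Hypothetical helper that imitates frequency-based vocabulary pruning;
    it is not part of gensim.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}


# Toy corpus: "jesus" and "game" appear 5 times each, "rare" only once.
toy_sentences = [["jesus", "game"]] * 5 + [["rare"]]
print(surviving_vocabulary(toy_sentences))
```

On a trained model, `"jesus" in model.wv` is a direct way to check whether a word made it into the vocabulary.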

A better place to ask for usage help is the discussion list, https://groups.google.com/forum/#!forum/gensim, reserving this issue-tracker for bugs & feature-requests.

@gocen
Author

gocen commented Jun 13, 2020

My code works with text8.
corpus = api.load('text8')
When I changed it to this:
corpus = api.load('20-newsgroups')
it also loads 20-newsgroups, but then gives that error.

@piskvorky
Owner

Your code is not using the corpus variable at all. You're using model, which we don't know what it is.

As @gojomo said, likely a user error, not a library bug – please use the mailing list.

@gocen
Author

gocen commented Jun 13, 2020

Sorry, the code I posted was incomplete. My full code is this:
import gensim
from gensim.models import Word2Vec
import gensim.downloader as api
corpus = api.load('20-newsgroups')
model = Word2Vec(corpus)
print (model.similarity("jesus", "game"))

@piskvorky
Owner

piskvorky commented Jun 13, 2020

Word2Vec expects a sequence of sentences (lists of strings) on input.

But each item in your corpus is a dict:

>>> list(corpus)[0]

{'topic': 'soc.religion.christian',
 'set': 'train',
 'data': 'From: db7n+@andrew.cmu.edu (D. Andrew Byler)\nSubject: Re: Serbian genocide Work of God?\nOrganization: Freshman, Civil Engineering, Carnegie Mellon, Pittsburgh, PA\nLines: 61\n\nVera Shanti Noyes writes;\n\n>this is what indicates to me that you may believe in predestination.\n>am i correct?  i do not believe in predestination -- i believe we all\n>choose whether or not we will accept God\'s gift of salvation to us.\n>again, fundamental difference which can\'t really be resolved.\n\nOf course I believe in Predestination.  It\'s a very biblical doctrine as\nRomans 8.28-30 shows (among other passages).  Furthermore, the Church\nhas always taught predestination, from the very beginning.  But to say\nthat I believe in Predestination does not mean I do not believe in free\nwill.  Men freely choose the course of their life, which is also\naffected by the grace of God.  However, unlike the Calvinists and\nJansenists, I hold that grace is resistable, otherwise you end up with\nthe idiocy of denying the universal saving will of God (1 Timothy 2.4). \nFor God must give enough grace to all to be saved.  But only the elect,\nwho he foreknew, are predestined and receive the grace of final\nperserverance, which guarantees heaven.  This does not mean that those\nwithout that grace can\'t be saved, it just means that god foreknew their\nobstinacy and chose not to give it to them, knowing they would not need\nit, as they had freely chosen hell.\n\t\t\t\t\t\t\t  ^^^^^^^^^^^\nPeople who are saved are saved by the grace of God, and not by their own\neffort, for it was God who disposed them to Himself, and predestined\nthem to become saints.  But those who perish in everlasting fire perish\nbecause they hardened their heart and chose to perish.  Thus, they were\ndeserving of God;s punishment, as they had rejected their Creator, and\nsinned against the working of the Holy Spirit.\n\n>yes, it is up to God to judge.  
but he will only mete out that\n>punishment at the last judgement. \n\nWell, I would hold that as God most certainly gives everybody some\nblessing for what good they have done (even if it was only a little),\nfor those He can\'t bless in the next life, He blesses in this one.  And\nthose He will not punish in the next life, will be chastised in this one\nor in Purgatory for their sins.  Every sin incurs some temporal\npunishment, thus, God will punish it unless satisfaction is made for it\n(cf. 2 Samuel 12.13-14, David\'s sin of Adultery and Murder were\nforgiven, but he was still punished with the death of his child.)  And I\nneed not point out the idea of punishment because of God\'s judgement is\nquite prevelant in the Bible.  Sodom and Gommorrah, Moses barred from\nthe Holy Land, the slaughter of the Cannanites, Annias and Saphira,\nJerusalem in 70 AD, etc.\n\n> if jesus stopped the stoning of an adulterous woman (perhaps this is\nnot a >good parallel, but i\'m going to go with it anyway), why should we\nnot >stop the murder and violation of people who may (or may not) be more\n>innocent?\n\nWe should stop the slaughter of the innocent (cf Proverbs 24.11-12), but\ndoes that mean that Christians should support a war in Bosnia with the\nU.S. or even the U.N. involved?  I do not think so, but I am an\nisolationist, and disagree with foreign adventures in general.  But in\nthe case of Bosnia, I frankly see no excuse for us getting militarily\ninvolved, it would not be a "just war."  "Blessed" after all, "are the\npeacemakers" was what Our Lord said, not the interventionists.  Our\nactions in Bosnia must be for peace, and not for a war which is\nunrelated to anything to justify it for us.\n\nAndy Byler\n',
 'id': '21408'}

So your sentences, the input to word2vec, are just the 4 words ['topic', 'set', 'data', 'id'] repeated ~20k times.
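
A quick way to see this in plain Python: iterating over a dict yields its keys, so each record looks like a four-token "sentence". The record below is a shortened stand-in for the one above, with the `data` text elided.

```python
# Shortened stand-in for one corpus record; the 'data' text is elided.
record = {
    "topic": "soc.religion.christian",
    "set": "train",
    "data": "From: ...",
    "id": "21408",
}

# Iterating over a dict gives its keys, never its values,
# so these keys are the only "words" the model would see.
tokens_seen = list(record)
print(tokens_seen)
```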

You'll want to tokenize the data field and pass that to word2vec as input.
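
A minimal sketch of that step, under some assumptions: the regex tokenizer, the `TokenizedDocs` wrapper and `train_on_newsgroups` are hypothetical names introduced here for illustration, and the corpus is copied into a list so it can be iterated more than once. `gensim.utils.simple_preprocess` could be used in place of the regex tokenizer.

```python
import re

# Assumption: lowercase alphabetic runs are good enough as tokens.
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Split raw message text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


class TokenizedDocs:
    """Restartable iterable that yields one token list per record.

    Word2Vec passes over its input several times, so a one-shot
    generator would not be enough.
    """

    def __init__(self, records):
        self.records = records

    def __iter__(self):
        for record in self.records:
            yield tokenize(record["data"])


def train_on_newsgroups():
    """Download the corpus and train a model (needs gensim and network access)."""
    import gensim.downloader as api
    from gensim.models import Word2Vec

    # Materialise the records so they can be iterated repeatedly.
    records = list(api.load("20-newsgroups"))
    return Word2Vec(TokenizedDocs(records))
```

With a model trained this way, `model.wv.similarity("jesus", "game")` should work as long as both words occur often enough to pass the `min_count` threshold.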

@gocen
Author

gocen commented Jun 13, 2020

OK, who prepared this corpus? Didn't you?

@piskvorky
Owner

@gocen
Author

gocen commented Jun 13, 2020

Excuse me, I am not very familiar with this.
Would it be possible for you to arrange the 20-newsgroups corpus so that it can be used like text8?
I plan to use it to classify the 20-newsgroups dataset.

@piskvorky
Owner

Please continue discussion on the mailing list.

Repository owner locked as resolved and limited conversation to collaborators Jun 13, 2020
@mpenkov mpenkov added question Discussions that are generally off-topic for the github issue tracker and removed need info Not enough information for reproduce an issue, need more info from author labels Oct 28, 2020