FastText segfaults for some ngram ranges #2377
Thank you for the report. Sounds like a bug as opposed to a data problem.
Not seeing ANY output is very odd. Did the Python subprocess segfault? Was there a core dump or a non-zero exit code? We changed the ngram implementation recently. Does the bug persist with Gensim 3.6.0? Can you truncate the dataset and still reproduce the problem? If yes, please post your truncated data here. Also, what OS and Python version are you using?
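One way to answer the segfault question from the Python side (a sketch, not part of the original report; the two-sentence corpus below is a placeholder for the real data) is to enable the standard-library `faulthandler`, which prints a traceback to stderr when the interpreter receives a fatal signal such as SIGSEGV instead of exiting silently:

```python
# Sketch: detect a segfault instead of a silent exit.
# Assumptions: Python 3; the toy corpus stands in for the reporter's data.
import faulthandler
faulthandler.enable()  # dump a traceback to stderr on SIGSEGV and similar signals

from gensim.models.fasttext import FastText

sentences = [["hello", "world"], ["another", "tiny", "sentence"]]
model = FastText(sentences=sentences, sg=1, hs=1, min_n=4, max_n=6, min_count=1)
print(model)
```

If the process really dies from a segfault, the shell's exit status will also be non-zero (128 plus the signal number on Unix-like systems).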
Thank you, @mpenkov!
If that were the case, would it be shown in the console? Because there was really no message; it just quits.
That sounds good, I will try that version! I use Windows 10 (4-core, 64-bit i7) with Python 3.6. I would rather not post my data, but I will test it on an open data set. Where could I send the data to you personally?
I’m not familiar with Windows, so I can’t answer that, sorry.
Please try reducing the data to a smaller subset first. It’s highly likely that you can reproduce the bug with a few sentences as opposed to an entire corpus.
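A sketch of that kind of reduction (assuming the corpus is a plain-text file with one sentence per line; `corpus.txt` is a placeholder name):

```python
# Sketch: try to reproduce the crash with only the first few lines of the corpus.
from itertools import islice

from gensim.models.fasttext import FastText
from gensim.utils import tokenize

with open("corpus.txt", encoding="utf-8") as fin:
    # Keep just the first 10 sentences; grow this number until the crash appears.
    sentences = [list(tokenize(line)) for line in islice(fin, 10)]

model = FastText(sentences=sentences, sg=1, hs=1, min_n=4, max_n=6, min_count=1)
print(model)
```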
@ctrado18 are you the same person as on the mailing list?
Hey guys, I tested with v3.6. Still the same. I also used the test data https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/alldata-id-10.txt and removed my preprocessor to exclude it as a problem, but the result is still the same. So it is not a problem with my data, my iterator, or my preprocessor. Does that point to some deeper issue with my hardware? But my machine is very new. So what can I do to find out what is going on here? Here is my code, together with the test data above:
It just quits without printing anything. For min_n=2 and max_n=4 it works! Strange. Also, when I leave out sg=1, hs=1 and use just the defaults, it works. So I think it is also related to the skip-gram mode?
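For reference, a sketch of the parameter combinations described above, reconstructed from this thread (the two-sentence corpus is a placeholder for the real data):

```python
# Sketch reconstructed from the descriptions in this thread; `sentences` is a placeholder corpus.
from gensim.models.fasttext import FastText

sentences = [["some", "tokenized", "text"], ["more", "tokens", "here"]]

# Reported to quit silently: skip-gram + hierarchical softmax, char n-grams 4..6.
# model = FastText(sentences=sentences, sg=1, hs=1, min_n=4, max_n=6, min_count=1)

# Reported to work: same mode, narrower n-gram range.
model_narrow = FastText(sentences=sentences, sg=1, hs=1, min_n=2, max_n=4, min_count=1)

# Also reported to work: leaving out sg=1, hs=1, i.e. the default CBOW setup.
model_default = FastText(sentences=sentences, min_n=4, max_n=6, min_count=1)
```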
I am sorry for the drawn-out discussion, but I felt misunderstood at first because the problem was clear to me. I hope everything is clear now.
Thank you for providing detailed information. I've reproduced the bug.

bug.py:

```python
from gensim.models.fasttext import FastText
from gensim.utils import tokenize
from gensim.test.utils import datapath
import smart_open
import logging

logging.basicConfig(level=logging.INFO)

# Read the bundled test corpus, one tokenized sentence per line.
path = datapath('alldata-id-10.txt')
with smart_open.smart_open(path, 'r', encoding='utf-8') as fin:
    sentences = [list(tokenize(l)) for l in fin]

# Skip-gram + hierarchical softmax with char n-grams 4..6 triggers the crash.
model = FastText(sentences=sentences, sg=1, hs=1, min_n=4, max_n=6)
print(model)
```

Reproduced example:
gdb session:
Looks like we're trying to access memory that we shouldn't be touching. We'll need to debug the Cython code to work out what the problem is.
@mpenkov WOW, I am so happy. Thanks! That is really great of you! I had so many headaches because of this and went a bit crazy, since it seemed like such an obvious bug that I checked my whole text data sentence by sentence... But why am I the first one to notice this? Are there really so few people working with it? Since it also depends on sg=1, maybe nobody uses that combination... That is bad, because all the research I can do is restricted to the ngram range 2-4, which is a bit too small for my case. So I should have used logging to catch the segfault. What is gdb? Anyway, thanks! 😄 I look forward to the solution! 😄
* avoid division by zero
* get rid of stray newline
* Fix #2377
Hey,
I feel something is bad with my text data. I used Common Crawl text, and even without any preprocessing (to rule that out as an error source) the FastText model stops during training and just exits without any message. This happens only for some ngram ranges, like:
In this case no model summary is printed and the model was not built, but there is no message at all.
Giving as parameters instead just
min_n=2, max_n=4
the model is built and works fine. I used my text data and a sentence iterator like the one linked here, with the above FastText statement.
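A minimal sketch of the kind of sentence iterator and FastText call described here (assuming a plain-text corpus with one sentence per line; `corpus.txt` is a placeholder file name, and `LineSentence` stands in for whatever iterator the link pointed to):

```python
# Sketch: stream sentences from a text file and train with the range reported to work.
from gensim.models.fasttext import FastText
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus.txt")  # yields one tokenized sentence per line
model = FastText(sentences=sentences, sg=1, hs=1, min_n=2, max_n=4)
print(model)
```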
Might this come from bad encoding, like \xad? How can I find out what is bad in my data?