FastText segfaults for some ngram ranges #2377

Closed
ctrado18 opened this issue Feb 7, 2019 · 8 comments · Fixed by #2382
Labels: bug (Issue described a bug), fasttext (Issues related to the FastText model)

Comments

@ctrado18

ctrado18 commented Feb 7, 2019

hey,

I feel something is wrong with my text data. I used Common Crawl text, and even without any preprocessing (to rule that out as a source of error) the FastText model stops during training and just exits without any message. This happens only for some ngram ranges, such as:

model=FastText(sentences,min_n=4)
print(model)

In this case no model summary is printed and the model is not built, but there is no message at all.

When I pass min_n=2, max_n=4 as parameters instead, the model is built and works fine.

I used my text data with a sentence iterator like this one:

>>> from gensim.test.utils import datapath
>>> from gensim.utils import tokenize
>>> import smart_open
>>>
>>>
>>> class MyIter(object):
...     def __iter__(self):
...         path = datapath('crime-and-punishment.txt')
...         with smart_open.smart_open(path, 'r', encoding='utf-8') as fin:
...             for line in fin:
...                 yield list(tokenize(line))

together with the FastText statement above.

Might this come from a bad encoding like \xad? How can I find out what is bad in my data?
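One way to check the encoding hypothesis is to count invisible or control characters in the corpus. The following is a minimal sketch, not part of gensim; the function name `suspicious_characters` and the choice of Unicode categories are assumptions made for illustration. The soft hyphen `\xad` falls in the "Cf" (format) category, so it would be reported.

```python
import unicodedata
from collections import Counter

# Unicode general categories treated as suspicious here (an assumption):
# Cc = control, Cf = format (includes the soft hyphen \xad),
# Co = private use, Cn = unassigned.
SUSPICIOUS_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}


def suspicious_characters(lines):
    """Count characters in `lines` whose Unicode category looks suspicious."""
    counts = Counter()
    for line in lines:
        for ch in line:
            if ch in "\n\r\t":
                # Ordinary whitespace controls are expected in text files.
                continue
            if unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
                counts[ch] += 1
    return counts


if __name__ == "__main__":
    sample = ["co\xadoperate\n", "plain text\n"]
    for ch, n in suspicious_characters(sample).items():
        print("U+%04X seen %d time(s)" % (ord(ch), n))
```

An empty result does not prove the data is clean, but it would make an encoding problem less likely as the cause.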

@mpenkov mpenkov self-assigned this Feb 7, 2019
@mpenkov mpenkov added the fasttext Issues related to the FastText model label Feb 7, 2019
@mpenkov
Collaborator

mpenkov commented Feb 7, 2019

Thank you for the report. Sounds like a bug as opposed to a data problem.

> In this case no model summary is printed and the model was not built but there is no message at all.

Not seeing ANY output is very odd. Did the Python subprocess segfault? Was there a core dump or non-zero exit code?

We changed the ngram implementation recently. Does the bug persist with Gensim 3.6.0?

Can you truncate the dataset and still reproduce the problem? If yes, please post your truncated data here.

Also, what OS and Python version are you using?
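To answer the exit-code question, the training script can be run as a child process and its return code inspected. This is a minimal sketch using only the standard library; `describe_exit` is a hypothetical helper name, and the child command below is a stand-in for the real script. On POSIX, `subprocess` reports a negative return code when the child was killed by a signal. On Windows a crash is expected to show up as a large non-zero exit code instead, though that has not been verified here.

```python
import subprocess
import sys


def describe_exit(argv):
    """Run argv as a child process and describe how it ended."""
    code = subprocess.run(argv).returncode
    if code == 0:
        return "exited normally"
    if code < 0:
        # POSIX only: a negative return code means the child was
        # terminated by the signal with that number (11 is SIGSEGV on Linux).
        return "killed by signal %d" % -code
    return "exited with non-zero code %d" % code


if __name__ == "__main__":
    # Stand-in child that exits with status 3; replace the argument list
    # with [sys.executable, "your_script.py"] to check a real script.
    print(describe_exit([sys.executable, "-c", "import sys; sys.exit(3)"]))
```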

@mpenkov mpenkov added the need info Not enough information for reproduce an issue, need more info from author label Feb 8, 2019
@piskvorky
Owner

Seems related to this and this thread on the mailing list.

@ctrado18
Author

ctrado18 commented Feb 8, 2019

Thank you @mpenkov !

> Did the Python subprocess segfault? Was there a core dump or non-zero exit code?

If that were the case, would it be shown in the console? There was really no message; it just quits.

> We changed the ngram implementation recently. Does the bug persist with Gensim 3.6.0?

That sounds good. I will try this version!

I use Windows 10 on a 4-core, 64-bit i7 with Python 3.6.

I would rather not post my data, but I will test it on an open data set. Where could I send the data to you personally?

@mpenkov
Collaborator

mpenkov commented Feb 8, 2019

> If this would be the case is this shown inside the console?

I’m not familiar with Windows, so I can’t answer that, sorry.

> Where can I post the data to you personally maybe?

Please try reducing the data to a smaller subset first. It’s highly likely that you can reproduce the bug with a few sentences as opposed to an entire corpus.
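Reducing the data can be done mechanically by repeated halving. The sketch below is only an illustration: `shrink` and the `fails` predicate are hypothetical names, and in practice `fails` would have to train the model in a separate process, because a segfault kills the interpreter that runs it.

```python
def shrink(items, fails):
    """Halve `items` repeatedly while one half still makes `fails` return True.

    `fails` is a caller-supplied predicate (hypothetical here) that reports
    whether the problem reproduces on the given subset.
    """
    items = list(items)
    while len(items) > 1:
        half = len(items) // 2
        first, second = items[:half], items[half:]
        if fails(first):
            items = first
        elif fails(second):
            items = second
        else:
            # The problem needs items from both halves; stop shrinking.
            break
    return items


if __name__ == "__main__":
    # Toy predicate: the "bug" reproduces whenever the subset contains "bad".
    print(shrink(["a", "b", "bad", "c", "d"], lambda subset: "bad" in subset))
```

This simple halving stops early when the failure depends on a combination of sentences from both halves, so the result is a smaller reproducer, not necessarily the smallest one.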

@piskvorky
Owner

@ctrado18 are you the same person as on the mailing list?

@ctrado18
Author

ctrado18 commented Feb 8, 2019

Hey guys,

I tested with v3.6. Still the same. I also used the test data https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/alldata-id-10.txt

I removed my preprocessor to exclude it as a cause, but the result is still the same.

So it is not a problem with my data, iterator, or preprocessor. Does that point to a deeper issue with my hardware? My machine is very new, though.

So, what can I do to find out what is going on here?

Here is my code together with above test data:


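from gensim.models.fasttext import FastText
from gensim.test.utils import datapath
from gensim.utils import tokenize
import smart_open

# sentence_detector is a sentence splitter defined elsewhere (not shown in this report).
class MyIter(object):
    def __iter__(self):
        path = datapath('dat.txt')
        with smart_open.smart_open(path, 'r', encoding='utf-8') as fin:
            for line in fin:
                # Split each line into sentences, then tokenize each non-empty one.
                for k in sentence_detector.tokenize(line):
                    if k:
                        yield list(tokenize(k))

model = FastText(sentences=MyIter(), sg=1, hs=1, min_n=4, max_n=6)
print(model)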

It just quits without printing anything. For min_n=2 and max_n=4 it works! Strange.

Also, without sg=1 and hs=1, using just:

model = FastText(sentences=MyIter(), min_n=4, max_n=6)
print(model)

it works. So I think it is also related to skip-gram mode?

> @ctrado18 are you the same person as on the mailing list?

I am sorry for the redundant discussion, but I felt misunderstood at first because the problem was clear to me. I hope everything is clear now.

@mpenkov
Collaborator

mpenkov commented Feb 8, 2019

Thank you for providing detailed information. I've reproduced the bug:

bug.py:

from gensim.models.fasttext import FastText
from gensim.utils import tokenize
from gensim.test.utils import datapath
import smart_open

import logging
logging.basicConfig(level=logging.INFO)

path = datapath('alldata-id-10.txt')
with smart_open.smart_open(path, 'r', encoding='utf-8') as fin:
    sentences = [list(tokenize(l)) for l in fin]
model = FastText(sentences=sentences, sg=1, hs=1, min_n=4, max_n=6)
print(model)

reproduced example:

(gensim) misha@cabron:~/git/gensim$ time python bug.py 
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:collecting all words and their counts
WARNING:gensim.models.word2vec:Each 'sentences' item should be a list of words (usually unicode strings). First item here is instead plain <class 'str'>.
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 53 word types from a corpus of 14201 raw words and 10 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 39 unique words (73% of original 53, drops 14)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 14173 word corpus (99% of original 14201, drops 28)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 53 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 26 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 2588 word corpus (18.3% of prior 14173)
INFO:gensim.models.word2vec:constructing a huffman tree from 39 words
INFO:gensim.models.word2vec:built huffman tree with maximum node depth 11
INFO:gensim.models.fasttext:estimated required memory for 39 words, 0 buckets and 100 dimensions: 75972 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.base_any2vec:training model with 3 workers on 39 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
Segmentation fault (core dumped)

real    0m9.275s
user    0m7.132s
sys     0m2.907s

gdb session:

(gdb) r bug.py
Starting program: /home/misha/envs/gensim/bin/python bug.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff1eb6700 (LWP 10740)]
[New Thread 0x7ffff16b5700 (LWP 10741)]
[New Thread 0x7fffeeeb4700 (LWP 10742)]
[New Thread 0x7fffea6b3700 (LWP 10743)]
[New Thread 0x7fffe7eb2700 (LWP 10744)]
[New Thread 0x7fffe56b1700 (LWP 10745)]
[New Thread 0x7fffe4eb0700 (LWP 10746)]
[Thread 0x7fffe7eb2700 (LWP 10744) exited]
[Thread 0x7fffe4eb0700 (LWP 10746) exited]
[Thread 0x7fffe56b1700 (LWP 10745) exited]
[Thread 0x7fffea6b3700 (LWP 10743) exited]
[Thread 0x7fffeeeb4700 (LWP 10742) exited]
[Thread 0x7ffff16b5700 (LWP 10741) exited]
[Thread 0x7ffff1eb6700 (LWP 10740) exited]
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:collecting all words and their counts
WARNING:gensim.models.word2vec:Each 'sentences' item should be a list of words (usually unicode strings). First item here is instead plain <class 'str'>.
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 53 word types from a corpus of 14201 raw words and 10 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 39 unique words (73% of original 53, drops 14)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 14173 word corpus (99% of original 14201, drops 28)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 53 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 26 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 2588 word corpus (18.3% of prior 14173)
INFO:gensim.models.word2vec:constructing a huffman tree from 39 words
INFO:gensim.models.word2vec:built huffman tree with maximum node depth 11
INFO:gensim.models.fasttext:estimated required memory for 39 words, 0 buckets and 100 dimensions: 75972 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.base_any2vec:training model with 3 workers on 39 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
[New Thread 0x7fffe4eb0700 (LWP 11098)]
[New Thread 0x7fffe56b1700 (LWP 11099)]
[New Thread 0x7fffe7eb2700 (LWP 11100)]
[New Thread 0x7fffea6b3700 (LWP 11101)]
[Thread 0x7fffea6b3700 (LWP 11101) exited]
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 2 more threads
[Thread 0x7fffe4eb0700 (LWP 11098) exited]

Thread 10 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe56b1700 (LWP 11099)]
__pyx_f_6gensim_6models_14fasttext_inner_fasttext_fast_sentence_sg_hs (__pyx_v_word_point=0xb53900, __pyx_v_word_code=0x121c3b0 "\001\001", __pyx_v_codelen=<optimized out>, 
    __pyx_v_syn0_vocab=<optimized out>, __pyx_v_syn0_ngrams=0x7ffee671c010, __pyx_v_syn1=0x1808b70, __pyx_v_size=<optimized out>, __pyx_v_word2_index=33, __pyx_v_subwords_index=0x1428670, 
    __pyx_v_subwords_len=0, __pyx_v_alpha=0.0250000004, __pyx_v_work=0x7fffc8001000, __pyx_v_l1=0x7fffc8001200, __pyx_v_word_locks_vocab=0x1810570, __pyx_v_word_locks_ngrams=0x7fff757ef010)
    at ./gensim/models/fasttext_inner.c:2593
2593        __pyx_v_g = (((1 - (__pyx_v_word_code[__pyx_v_b])) - __pyx_v_f) * __pyx_v_alpha);
(gdb) p __pyx_v_word_code
$1 = (const __pyx_t_5numpy_uint8_t *) 0x121c3b0 "\001\001"
(gdb) p __pyx_v_f
Cannot access memory at address 0x7ffdd5dd87c0
(gdb) p __pyx_v_alpha
$2 = 0.0250000004
(gdb) p __pyx_v_b
$3 = 0

Looks like we're trying to access memory that we shouldn't be touching.

We'll need to debug the Cython code to work out what the problem is.
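One detail in the gdb frame above is `__pyx_v_subwords_len=0`, meaning the word being processed had no character n-grams. The sketch below is a plain-Python approximation of fastText-style n-gram extraction (the word wrapped in `<` and `>`, then sliced), not gensim's actual implementation; `char_ngrams` is a name made up for illustration. It shows that a short word has no n-grams with min_n=4 but does have some with min_n=2, which would be consistent with the crash depending on the ngram range. Whether this is the actual root cause is not established here.

```python
def char_ngrams(word, min_n, max_n):
    """Return character n-grams of `word` wrapped in '<' and '>' markers."""
    extended = "<" + word + ">"
    return [
        extended[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(extended) - n + 1)
    ]


if __name__ == "__main__":
    # A one-letter word becomes "<a>", which is only three characters long,
    # so no n-gram of length 4 or more fits inside it.
    print(char_ngrams("a", 4, 6))
    print(char_ngrams("a", 2, 4))
```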

@mpenkov mpenkov added bug Issue described a bug and removed need info Not enough information for reproduce an issue, need more info from author labels Feb 8, 2019
@mpenkov mpenkov changed the title from "Fasttext is not working for specific ngram ranges - problem with textdata (encoding?)" to "FastText segfaults for some ngram ranges" Feb 8, 2019
@ctrado18
Author

ctrado18 commented Feb 8, 2019

@mpenkov WOW. I am so happy. Thanks! That is really great of you! I had so many headaches because of this and went crazy, since it was such an obvious bug, so much so that I checked my whole text data sentence by sentence...

But why am I the first one to observe this? Are there so few people working with this? Since it also depends on sg=1, it seems no one is using that...

That is bad, because all the research I can do is limited to the ngram range 2-4, which is a bit too small for my case. So I should have used logging to see the segfault. What is gdb?

Anyway, thanks! 😄

I look forward to the solution! 😄
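gdb is the GNU debugger, which was used above to inspect the crashed process at the C level. Python's logging cannot report a segfault, because the interpreter dies before any handler runs. The standard-library `faulthandler` module can at least dump the Python-level traceback when a fatal signal arrives. A minimal sketch follows; the log file name is an assumption, and the training code itself is left out.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; any writable path works.
log_path = os.path.join(tempfile.gettempdir(), "fasttext_crash.log")

# The file object must stay open for as long as faulthandler is enabled.
crash_log = open(log_path, "w")

# From here on, a fatal signal such as SIGSEGV makes Python write the
# traceback of every thread to crash_log before the process dies.
faulthandler.enable(file=crash_log, all_threads=True)

# ... build the sentence iterator and train the model here ...
```

The same handler can be switched on without code changes by starting the interpreter with `python -X faulthandler script.py`, in which case the traceback goes to stderr. It shows only Python frames, so a debugger such as gdb is still needed to see where the compiled code failed.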

mpenkov added a commit that referenced this issue Mar 7, 2019
* avoid division by zero
* get rid of stray newline
* Fix #2377