FastText segfaults for some ngram ranges #2377

ctrado18 · 2019-02-07T19:45:51Z

hey,

I suspect something is wrong with my text data. I used Common Crawl text, and even without any preprocessing (to rule it out as the source of the error) the FastText model stops during training and just exits without any message. This happens only for some ngram ranges, for example:

model=FastText(sentences,min_n=4)
print(model)

In this case no model summary is printed and the model is not built, but there is no message at all.

With the parameters min_n=2, max_n=4 instead, the model is built and works fine.

I used my text data and a sentence iterator like this one:

>>> from gensim.test.utils import datapath
>>> from gensim.utils import tokenize
>>> import smart_open
>>>
>>>
>>> class MyIter(object):
...     def __iter__(self):
...         path = datapath('crime-and-punishment.txt')
...         with smart_open.smart_open(path, 'r', encoding='utf-8') as fin:
...             for line in fin:
...                 yield list(tokenize(line))

together with the FastText statement above.

Could this come from a bad encoding, e.g. characters like \xad? How can I find out what is wrong with my data?
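
One way to check a corpus for invisible characters such as \xad (the soft hyphen) is to scan it for Unicode control and format characters. The sketch below uses only the standard library. The helper name find_suspicious_chars is made up for this illustration, and the choice of which categories count as suspicious is an assumption, not something Gensim prescribes.

```python
import unicodedata

# Categories treated as suspicious here (an assumption for this sketch):
# Cc = control characters, Cf = format characters such as U+00AD SOFT HYPHEN.
SUSPICIOUS_CATEGORIES = {"Cc", "Cf"}
# Ordinary whitespace control characters that should not be reported.
ALLOWED = {"\n", "\r", "\t"}


def find_suspicious_chars(lines):
    """Return (line_no, column, code_point, name) for each suspicious character."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line):
            if ch in ALLOWED:
                continue
            if unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
                # Control characters have no Unicode name, hence the default.
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((line_no, col, "U+%04X" % ord(ch), name))
    return hits


print(find_suspicious_chars(["co\xadoperate\n", "plain text\n"]))
```

Passing an open text file instead of the list works the same way, since the function only iterates over lines.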

mpenkov · 2019-02-07T22:28:38Z

Thank you for the report. Sounds like a bug as opposed to a data problem.

> In this case no model summary is printed and the model was not built but there is no message at all.

Not seeing ANY output is very odd. Did the Python subprocess segfault? Was there a core dump or non-zero exit code?

We changed the ngram implementation recently. Does the bug persist with Gensim 3.6.0?

Can you truncate the dataset and still reproduce the problem? If yes, please post your truncated data here.

Also, what OS and Python version are you using?
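
Whether a script crashed can be read from its exit status even when nothing is printed. A minimal sketch using only the standard library follows. run_and_report is a name invented for this illustration, and the demo child command is a stand-in for the real training script.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd as a child process and describe how it ended."""
    code = subprocess.run(cmd).returncode
    if code == 0:
        return code, "exited normally"
    if code < 0:
        # POSIX only: a negative return code means the child was
        # terminated by signal number -code (11 is SIGSEGV on Linux).
        return code, "killed by signal %d" % -code
    return code, "exited with non-zero status %d" % code


# Stand-in child process that simply exits with status 3.
print(run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

To check the training script itself, the command would be something like [sys.executable, "bug.py"]. On Windows there are no negative codes; a crash usually shows up as a large positive exit status instead (for an access violation, typically 0xC0000005).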

piskvorky · 2019-02-08T08:53:50Z

Seems related to this and this thread on the mailing list.

ctrado18 · 2019-02-08T09:23:42Z

Thank you @mpenkov !

> Did the Python subprocess segfault? Was there a core dump or non-zero exit code?

If that were the case, would it be shown in the console? There was really no message; it just quits.

> We changed the ngram implementation recently. Does the bug persist with Gensim 3.6.0?

That sounds good. I will try this version!

I use Windows 10 on a 4-core, 64-bit i7 with Python 3.6.

I would rather not post my data publicly, but I will test with an open data set. Is there a way to send the data to you privately?

mpenkov · 2019-02-08T10:57:21Z

> If that were the case, would it be shown in the console?

I’m not familiar with Windows, so I can’t answer that, sorry.

> Is there a way to send the data to you privately?

Please try reducing the data to a smaller subset first. It’s highly likely that you can reproduce the bug with a few sentences as opposed to an entire corpus.
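
Reducing the corpus can be done mechanically by bisection. The sketch below is only an illustration: smallest_failing_prefix and the fails callback are hypothetical names, and in practice fails would train on the given lines in a child process and report whether that process crashed. It assumes that once a prefix triggers the failure, every longer prefix does too, which may not hold if the crash depends on thread timing.

```python
def smallest_failing_prefix(lines, fails):
    """Return the shortest prefix of lines for which fails(prefix) is true.

    Returns None if even the full list does not fail.
    """
    if not fails(lines):
        return None
    # Invariant: lines[:hi] fails, lines[:lo] is assumed not to fail.
    lo, hi = 0, len(lines)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(lines[:mid]):
            hi = mid
        else:
            lo = mid
    return lines[:hi]


# Toy stand-in for a real crash check: "fails" once line 6 is included.
demo_lines = list(range(10))
print(smallest_failing_prefix(demo_lines, lambda prefix: 6 in prefix))
```

Each call to fails costs one training run, so the number of runs grows only with the logarithm of the corpus size.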

piskvorky · 2019-02-08T11:35:01Z

@ctrado18 are you the same person as on the mailing list?

ctrado18 · 2019-02-08T13:15:18Z

Hey guys,

I tested with v3.6. Still the same. I also used the test data https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/alldata-id-10.txt

I removed my preprocessor to exclude it as a cause, but the result is still the same.

So it is not a problem with my data, iterator, or preprocessor. Does that point to some deeper issue with my hardware? But my machine is very new.

So, what can I do to find out what is going on here?

Here is my code, used together with the test data above:


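from gensim.models.fasttext import FastText
from gensim.test.utils import datapath
from gensim.utils import tokenize
import smart_open

# Note: sentence_detector (a sentence splitter) is not defined in this post.
class MyIter(object):
    def __iter__(self):
        path = datapath('dat.txt')
        with smart_open.smart_open(path, 'r', encoding='utf-8') as fin:
            for line in fin:
                s = sentence_detector.tokenize(line)
                for k in s:
                    if k:
                        yield list(tokenize(k))

model = FastText(sentences=MyIter(), sg=1, hs=1, min_n=4, max_n=6)
print(model)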

It just quits without printing anything. With min_n=2 and max_n=4 it works! Strange.

Also, without sg=1 and hs=1, using just:

model = FastText(sentences=MyIter(), min_n=4, max_n=6)
print(model)

it works. So I think it also has to do with skip-gram mode?

> @ctrado18 are you the same person as on the mailing list?

I am sorry for the drawn-out discussion, but I felt misunderstood at first because the problem was clear to me. I hope everything is clear now.

mpenkov · 2019-02-08T14:28:32Z

Thank you for providing detailed information. I've reproduced the bug:

bug.py:

from gensim.models.fasttext import FastText
from gensim.utils import tokenize
from gensim.test.utils import datapath
import smart_open

import logging
logging.basicConfig(level=logging.INFO)

path = datapath('alldata-id-10.txt')
with smart_open.smart_open(path, 'r', encoding='utf-8') as fin:
    sentences = [list(tokenize(l)) for l in fin]
model = FastText(sentences=sentences, sg=1, hs=1, min_n=4, max_n=6)
print(model)

reproduced example:

(gensim) misha@cabron:~/git/gensim$ time python bug.py 
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:collecting all words and their counts
WARNING:gensim.models.word2vec:Each 'sentences' item should be a list of words (usually unicode strings). First item here is instead plain <class 'str'>.
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 53 word types from a corpus of 14201 raw words and 10 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 39 unique words (73% of original 53, drops 14)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 14173 word corpus (99% of original 14201, drops 28)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 53 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 26 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 2588 word corpus (18.3% of prior 14173)
INFO:gensim.models.word2vec:constructing a huffman tree from 39 words
INFO:gensim.models.word2vec:built huffman tree with maximum node depth 11
INFO:gensim.models.fasttext:estimated required memory for 39 words, 0 buckets and 100 dimensions: 75972 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.base_any2vec:training model with 3 workers on 39 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
Segmentation fault (core dumped)

real    0m9.275s
user    0m7.132s
sys     0m2.907s

gdb session:

(gdb) r bug.py
Starting program: /home/misha/envs/gensim/bin/python bug.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff1eb6700 (LWP 10740)]
[New Thread 0x7ffff16b5700 (LWP 10741)]
[New Thread 0x7fffeeeb4700 (LWP 10742)]
[New Thread 0x7fffea6b3700 (LWP 10743)]
[New Thread 0x7fffe7eb2700 (LWP 10744)]
[New Thread 0x7fffe56b1700 (LWP 10745)]
[New Thread 0x7fffe4eb0700 (LWP 10746)]
[Thread 0x7fffe7eb2700 (LWP 10744) exited]
[Thread 0x7fffe4eb0700 (LWP 10746) exited]
[Thread 0x7fffe56b1700 (LWP 10745) exited]
[Thread 0x7fffea6b3700 (LWP 10743) exited]
[Thread 0x7fffeeeb4700 (LWP 10742) exited]
[Thread 0x7ffff16b5700 (LWP 10741) exited]
[Thread 0x7ffff1eb6700 (LWP 10740) exited]
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:collecting all words and their counts
WARNING:gensim.models.word2vec:Each 'sentences' item should be a list of words (usually unicode strings). First item here is instead plain <class 'str'>.
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 53 word types from a corpus of 14201 raw words and 10 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=5 retains 39 unique words (73% of original 53, drops 14)
INFO:gensim.models.word2vec:effective_min_count=5 leaves 14173 word corpus (99% of original 14201, drops 28)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 53 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 26 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 2588 word corpus (18.3% of prior 14173)
INFO:gensim.models.word2vec:constructing a huffman tree from 39 words
INFO:gensim.models.word2vec:built huffman tree with maximum node depth 11
INFO:gensim.models.fasttext:estimated required memory for 39 words, 0 buckets and 100 dimensions: 75972 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.base_any2vec:training model with 3 workers on 39 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
[New Thread 0x7fffe4eb0700 (LWP 11098)]
[New Thread 0x7fffe56b1700 (LWP 11099)]
[New Thread 0x7fffe7eb2700 (LWP 11100)]
[New Thread 0x7fffea6b3700 (LWP 11101)]
[Thread 0x7fffea6b3700 (LWP 11101) exited]
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 2 more threads
[Thread 0x7fffe4eb0700 (LWP 11098) exited]

Thread 10 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe56b1700 (LWP 11099)]
__pyx_f_6gensim_6models_14fasttext_inner_fasttext_fast_sentence_sg_hs (__pyx_v_word_point=0xb53900, __pyx_v_word_code=0x121c3b0 "\001\001", __pyx_v_codelen=<optimized out>, 
    __pyx_v_syn0_vocab=<optimized out>, __pyx_v_syn0_ngrams=0x7ffee671c010, __pyx_v_syn1=0x1808b70, __pyx_v_size=<optimized out>, __pyx_v_word2_index=33, __pyx_v_subwords_index=0x1428670, 
    __pyx_v_subwords_len=0, __pyx_v_alpha=0.0250000004, __pyx_v_work=0x7fffc8001000, __pyx_v_l1=0x7fffc8001200, __pyx_v_word_locks_vocab=0x1810570, __pyx_v_word_locks_ngrams=0x7fff757ef010)
    at ./gensim/models/fasttext_inner.c:2593
2593        __pyx_v_g = (((1 - (__pyx_v_word_code[__pyx_v_b])) - __pyx_v_f) * __pyx_v_alpha);
(gdb) p __pyx_v_word_code
$1 = (const __pyx_t_5numpy_uint8_t *) 0x121c3b0 "\001\001"
(gdb) p __pyx_v_f
Cannot access memory at address 0x7ffdd5dd87c0
(gdb) p __pyx_v_alpha
$2 = 0.0250000004
(gdb) p __pyx_v_b
$3 = 0

Looks like we're trying to access memory that we shouldn't be touching.

We'll need to debug the Cython code to work out what the problem is.
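
The gdb frame above shows __pyx_v_subwords_len=0, i.e. the word being processed had no subwords. One plausible reading (an assumption, not confirmed in this thread) is that short words cannot produce any character n-gram of length min_n=4 or more, while they still produce some for min_n=2. The pure-Python sketch below illustrates that effect using the FastText convention of wrapping a word in < and >. char_ngrams is a name invented here; it is not Gensim's implementation.

```python
def char_ngrams(word, min_n, max_n):
    """List the character n-grams of '<word>' with lengths min_n..max_n."""
    extended = "<" + word + ">"
    ngrams = []
    # n-grams longer than the wrapped word cannot exist, so cap the length.
    for n in range(min_n, min(len(extended), max_n) + 1):
        for start in range(len(extended) - n + 1):
            ngrams.append(extended[start:start + n])
    return ngrams


for word in ["a", "of", "crime"]:
    print(word, len(char_ngrams(word, 2, 4)), len(char_ngrams(word, 4, 6)))
```

For a one-letter word the second count is zero, so any code that divides by the number of subwords needs a guard for that case.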

ctrado18 · 2019-02-08T14:37:26Z

@mpenkov WOW. I am so happy. Thanks! That is really great of you! I had so many headaches because of this and went crazy, since it was such an obvious bug that I checked my whole text data sentence by sentence...

But why am I the first one to observe this? Are there so few people working with this? Since it is also tied to sg=1, it seems no one is using that...

That is bad, because all my research can only be done with the ngram range 2-4, which is a bit too small for my case. So I should have used logging to see the segfault. What is gdb?

Anyway Thanks! 😄

I look forward to the solution! 😄
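
For reference: gdb is the GNU Debugger, used above to inspect the crashed process at the C level. A lighter option that needs no extra tools is Python's standard faulthandler module, which prints the Python traceback when the interpreter receives a fatal signal such as SIGSEGV. A minimal sketch, assuming it is placed at the top of the training script:

```python
import faulthandler
import sys

# Dump the Python traceback of all threads if the process receives a
# fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
# sys.__stderr__ is the original stderr stream, which keeps a real file
# descriptor even if sys.stderr has been replaced by another object.
faulthandler.enable(file=sys.__stderr__, all_threads=True)

print("faulthandler enabled:", faulthandler.is_enabled())
```

The same effect is available without editing the script by running it as python -X faulthandler bug.py. The traceback only names the Python-level call that was active; locating the faulty line inside the compiled extension still needs a debugger such as gdb.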

* avoid division by zero * get rid of stray newline * Fix #2377

mpenkov self-assigned this Feb 7, 2019

mpenkov added the fasttext Issues related to the FastText model label Feb 7, 2019

mpenkov added the need info Not enough information for reproduce an issue, need more info from author label Feb 8, 2019

mpenkov added bug Issue described a bug and removed need info Not enough information for reproduce an issue, need more info from author labels Feb 8, 2019

mpenkov changed the title ~~Fasttext is not working for specific ngram ranges - problem with textdata (encoding?)~~ FastText segfaults for some ngram ranges Feb 8, 2019

mpenkov mentioned this issue Feb 14, 2019

Clean up FastText Cython code, fix division by zero #2382

Merged

mpenkov mentioned this issue Mar 7, 2019

avoid division by zero #2404

Merged

mpenkov closed this as completed in #2404 Mar 7, 2019

mpenkov added a commit that referenced this issue Mar 7, 2019

avoid division by zero in fasttext_inner.pyx (#2404)

d8bad9d

* avoid division by zero * get rid of stray newline * Fix #2377
