Multiprocessing error #21

Closed
dimitarsh1 opened this issue May 10, 2020 · 8 comments
Labels: bug (Something isn't working)

@dimitarsh1

When running spacy_udpipe with n_process=X enabled, it gives an error.
The code I run is:

import spacy_udpipe

nlpD = spacy_udpipe.load(lang)
nlps = list(nlpD.pipe(sentences, n_process=4))
for doc in nlps:
    for token in doc:
        lemma = token.lemma_

The error is:

File "token.pyx", line 871, in spacy.tokens.token.Token.lemma_.__get__
 File "strings.pyx", line 136, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '14027581762467160941'. This usually refers to an issue with the `Vocab` or `StringStore`."

When I run the same code without the n_process argument, everything is fine: no errors, and the text is processed as expected.

It seems to be related to a spaCy issue, but I couldn't find a solution.
https://stackoverflow.com/questions/60152152/spacy-issue-with-vocab-or-stringstore

spaCy version: 2.2.4
Python version: 3.8.3
spacy-udpipe version: 0.3.0
OS: Debian 10

Thanks.
Cheers,
Dimitar

@asajatovic asajatovic self-assigned this May 11, 2020
@asajatovic asajatovic added the bug Something isn't working label May 11, 2020
@asajatovic asajatovic pinned this issue May 11, 2020
@asajatovic
Collaborator

Seems to be related to explosion/spaCy#5220.

A quick workaround is to change only the first line:

nlpD = spacy_udpipe.load(lang).tokenizer

This should do the trick, as it will call UDPipeTokenizer.pipe (creating a Doc with the same attributes as Language.pipe produces, just bug-free). If you want to use custom pipes afterward, you could call them on the resulting Doc objects (once created, these are modified in-place anyway), for now.
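
Applied to the snippet from the original report, the workaround would look roughly like this (a sketch only: lang and sentences are the placeholders from the report, and it assumes UDPipeTokenizer.pipe accepts or ignores the n_process keyword, as "change only the first line" implies):

import spacy_udpipe

nlpD = spacy_udpipe.load(lang).tokenizer  # only this line changes
nlps = list(nlpD.pipe(sentences, n_process=4))
for doc in nlps:
    for token in doc:
        lemma = token.lemma_  # no longer raises E018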

I will look into a proper fix soon, hopefully.

@dimitarsh1
Author

dimitarsh1 commented May 11, 2020 via email

@asajatovic
Collaborator

The issue happens with StringStore.add.
When Language.__call__ is called, a Doc is created using UDPipeTokenizer.__call__, in which token attributes are added to the StringStore object (see these lines). The same code runs when Language.pipe is called, but via spaCy's custom multiprocessing code. This is where the havoc begins: the Python multiprocessing machinery interacts with the SWIG wrapper code for the UDPipe model (i.e., the UDPipe Python bindings that expose the underlying C++ NLP model), and somehow the StringStore object ends up missing some of the string values it should contain (lemmas, tags, dependencies, etc.).

Interestingly, when the same UDPipeLanguage object processes the same texts via __call__ first and then via pipe, everything works fine, as the StringStore object is already prepopulated with all string values.
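
For illustration only, a sketch of that observation (not a recommended fix, since every text is parsed twice and that defeats the point of multiprocessing; lang and sentences are placeholders):

import spacy_udpipe

nlp = spacy_udpipe.load(lang)

# First pass via __call__ prepopulates the StringStore in the parent process.
for text in sentences:
    nlp(text)

# The multiprocessing path now finds all hashes and no longer raises E018.
docs = list(nlp.pipe(sentences, n_process=4))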

Unfortunately, neither multithreading nor multiprocessing at the UDPipeTokenizer level speeds up execution.

@dimitarsh1
Author

Thanks.
Could that relate to the other issue "[E190] Token head out of range"?

@asajatovic
Collaborator

@dimitarsh1 you are welcome.
[E190] should not be happening when using __call__ since version 0.2.1 or with pipe since version 0.3.1.
Unfortunately, without the exact input that causes this, it is very difficult to conclude anything with certainty.

@BramVanroy
Contributor

BramVanroy commented Jun 29, 2021

Related: on the most recent version of spacy_udpipe, pipe does not work with n_process > 1 on Windows because of TypeError: cannot pickle 'ufal.udpipe.Model' object. It works fine on Linux by default. Evidently, the multiprocessing start method is crucial here: Windows only has "spawn", which is more restrictive in terms of pickling than "fork", while Linux has both but defaults to "fork". If you call multiprocessing.set_start_method("spawn") on Linux, the code fails there as well.

Should I make a separate issue for this? Might be difficult to solve this one, though, and perhaps impossible if you have no control over the UDPipe model directly.

Traceback (most recent call last):
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\Scripts\parse-as-conll-script.py", line 33, in <module>
    sys.exit(load_entry_point('spacy-conll', 'console_scripts', 'parse-as-conll')())
  File "c:\dev\python\spacy_conll\spacy_conll\cli\parse.py", line 179, in main
    parse(cargs)
  File "c:\dev\python\spacy_conll\spacy_conll\cli\parse.py", line 35, in parse
    conll_str = parser.parse_file_as_conll(
  File "c:\dev\python\spacy_conll\spacy_conll\parser.py", line 81, in parse_file_as_conll
    return self.parse_text_as_conll(text, **kwargs)
  File "c:\dev\python\spacy_conll\spacy_conll\parser.py", line 135, in parse_text_as_conll
    for doc_idx, doc in enumerate(self.nlp.pipe(text, n_process=n_process)):
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\lib\site-packages\spacy\language.py", line 1484, in pipe
    for doc in docs:
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\lib\site-packages\spacy\language.py", line 1520, in _multiprocessing_pipe
    proc.start()
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'ufal.udpipe.Model' object
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
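
A minimal sketch that should reproduce the start-method dependence on Linux (the model name "en" and the test sentence are arbitrary placeholders):

import multiprocessing

import spacy_udpipe

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")  # force the Windows default
    spacy_udpipe.download("en")
    nlp = spacy_udpipe.load("en")
    # Expected: TypeError: cannot pickle 'ufal.udpipe.Model' object
    docs = list(nlp.pipe(["A short test sentence."], n_process=2))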

@mariosasko
Contributor

Hi @BramVanroy, thanks for reporting. This should be fixed by #39 soon.

@BramVanroy
Contributor

Awesome! Thanks.
