Multiprocessing error #21

Closed
dimitarsh1 opened this issue May 10, 2020 · 8 comments
Labels: bug (Something isn't working)

@dimitarsh1

When running spacy_udpipe with n_process=X enabled, it gives an error.
The code I run is:

import spacy_udpipe

nlpD = spacy_udpipe.load(lang)
nlps = list(nlpD.pipe(sentences, n_process=4))
for doc in nlps:
    for token in doc:
        lemma = token.lemma_

The error is:

File "token.pyx", line 871, in spacy.tokens.token.Token.lemma_.__get__
 File "strings.pyx", line 136, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '14027581762467160941'. This usually refers to an issue with the `Vocab` or `StringStore`."

When I run the same code without the n_process argument, everything is fine: no errors, and the text is processed as expected.

It seems to be related to a spaCy issue, but I couldn't find a solution.
https://stackoverflow.com/questions/60152152/spacy-issue-with-vocab-or-stringstore

spaCy version: 2.2.4
Python version: 3.8.3
spacy-udpipe version: 0.3.0
OS: Debian 10

Thanks.
Cheers,
Dimitar

@asajatovic asajatovic self-assigned this May 11, 2020
@asajatovic asajatovic added the bug Something isn't working label May 11, 2020
@asajatovic asajatovic pinned this issue May 11, 2020
@asajatovic
Collaborator

Seems to be related to explosion/spaCy#5220.

A quick workaround is to change only the first line:

nlpD = spacy_udpipe.load(lang).tokenizer

This should do the trick, as it will call UDPipeTokenizer.pipe (creating a Doc with the same attributes as Language.pipe produces, just bug-free). If you want to use custom pipes afterward, you could call them on the resulting Doc objects (once created, these are modified in-place anyway), for now.
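
Applied to the snippet from the original report, the workaround would look roughly like this (a sketch only: lang and sentences are the placeholders from the report, and it assumes UDPipeTokenizer.pipe accepts or ignores the n_process keyword, as "change only the first line" implies):

import spacy_udpipe

nlpD = spacy_udpipe.load(lang).tokenizer  # only this line changes
nlps = list(nlpD.pipe(sentences, n_process=4))
for doc in nlps:
    for token in doc:
        lemma = token.lemma_  # no longer raises E018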

I will look into a proper fix soon, hopefully.

@dimitarsh1
Author

dimitarsh1 commented May 11, 2020 via email

@asajatovic
Collaborator

The issue happens with StringStore.add.
When Language.__call__ is called, a Doc is created using UDPipeTokenizer.__call__, in which token attributes are added to the StringStore object (see these lines). The same code runs when Language.pipe is called, but via spaCy's custom multiprocessing code. This is where the havoc begins: the Python multiprocessing machinery interacts with the SWIG wrapper code for the UDPipe model (i.e., the UDPipe Python bindings that expose the underlying C++ NLP model), and somehow the StringStore object ends up missing some of the string values it should contain (lemmas, tags, dependencies, etc.).

Interestingly, when the same UDPipeLanguage object processes the same texts via __call__ first and then via pipe, everything works fine, as the StringStore object is already prepopulated with all string values.
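
For illustration only, a sketch of that observation (not a recommended fix, since every text is parsed twice and that defeats the point of multiprocessing; lang and sentences are placeholders):

import spacy_udpipe

nlp = spacy_udpipe.load(lang)

# First pass via __call__ prepopulates the StringStore in the parent process.
for text in sentences:
    nlp(text)

# The multiprocessing path now finds all hashes and no longer raises E018.
docs = list(nlp.pipe(sentences, n_process=4))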

Unfortunately, neither multithreading nor multiprocessing at the UDPipeTokenizer level speeds up execution.

@dimitarsh1
Author

Thanks.
Could that relate to the other issue "[E190] Token head out of range"?

@asajatovic
Collaborator

@dimitarsh1 you are welcome.
[E190] should not be happening when using __call__ since version 0.2.1 or with pipe since version 0.3.1.
Unfortunately, without the exact input that causes this, it is very difficult to conclude anything with certainty.

@BramVanroy
Contributor

BramVanroy commented Jun 29, 2021

Related: on the most recent version of spacy_udpipe, pipe does not work with n_process > 1 on Windows because of TypeError: cannot pickle 'ufal.udpipe.Model' object. It works fine on Linux by default. Evidently, the multiprocessing start method is crucial here: Windows only has "spawn", which is more restrictive in terms of pickling than "fork", while Linux has both but defaults to "fork". If you call multiprocessing.set_start_method("spawn") on Linux, the code fails there as well.

Should I make a separate issue for this? Might be difficult to solve this one, though, and perhaps impossible if you have no control over the UDPipe model directly.

Traceback (most recent call last):
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\Scripts\parse-as-conll-script.py", line 33, in <module>
    sys.exit(load_entry_point('spacy-conll', 'console_scripts', 'parse-as-conll')())
  File "c:\dev\python\spacy_conll\spacy_conll\cli\parse.py", line 179, in main
    parse(cargs)
  File "c:\dev\python\spacy_conll\spacy_conll\cli\parse.py", line 35, in parse
    conll_str = parser.parse_file_as_conll(
  File "c:\dev\python\spacy_conll\spacy_conll\parser.py", line 81, in parse_file_as_conll
    return self.parse_text_as_conll(text, **kwargs)
  File "c:\dev\python\spacy_conll\spacy_conll\parser.py", line 135, in parse_text_as_conll
    for doc_idx, doc in enumerate(self.nlp.pipe(text, n_process=n_process)):
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\lib\site-packages\spacy\language.py", line 1484, in pipe
    for doc in docs:
  File "C:\Users\bramv\.virtualenvs\spacy_conll-AkYdeqDT\lib\site-packages\spacy\language.py", line 1520, in _multiprocessing_pipe
    proc.start()
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'ufal.udpipe.Model' object
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\bramv\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
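
A minimal sketch that should reproduce the start-method dependence on Linux (the model name "en" and the test sentence are arbitrary placeholders):

import multiprocessing

import spacy_udpipe

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")  # force the Windows default
    spacy_udpipe.download("en")
    nlp = spacy_udpipe.load("en")
    # Expected: TypeError: cannot pickle 'ufal.udpipe.Model' object
    docs = list(nlp.pipe(["A short test sentence."], n_process=2))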

@mariosasko
Contributor

Hi @BramVanroy, thanks for reporting. This should be fixed by #39 soon.

@BramVanroy
Contributor

Awesome! Thanks.
