
word2vec doesn't scale linearly with multi-cpu configuration ? #3376

Open
mglowacki100 opened this issue Aug 9, 2022 · 7 comments

Comments

@mglowacki100

Problem description

I've tried to run the script from
https://github.com/RaRe-Technologies/gensim/releases/3.6.0
with a varying number of cores (num_cores) and obtained the following times: 8 -> 26 sec, 16 -> 17 sec, 24 -> 14.4 sec, 32 -> 15.9 sec, 48 -> 16 sec. So it doesn't scale linearly with the number of cores, and the peak seems to be at 24 cores.
My machine reports 48 cores via cpu_count(); by lscpu: CPUs: 48, Threads per core: 2, Cores per socket: 12, Sockets: 2, NUMA nodes: 2, Model name: Intel Xeon E5-2650 v4 2.2 GHz. Note that the same behaviour occurs for Doc2Vec and FastText.
Is it possible that only one socket is used, or am I missing something?

Steps/code/corpus to reproduce

import gensim.downloader as api
from multiprocessing import cpu_count
from gensim.utils import save_as_line_sentence
from gensim.test.utils import get_tmpfile
from gensim.models import Word2Vec, Doc2Vec, FastText
from linetimer import CodeTimer #pip install linetimer

# Convert any corpus to the needed format: 1 document per line, words delimited by " "
corpus = api.load("text8")
corpus_fname = get_tmpfile("text8-file-sentence.txt")
save_as_line_sentence(corpus, corpus_fname)

# Choose num of cores that you want to use (let's use all, models scale linearly now!)
num_cores = 8  # also tried: 16, 24, 32, 48, cpu_count()

# Train models using all cores
with CodeTimer(unit="s"):
    w2v_model = Word2Vec(corpus_file=corpus_fname, workers=num_cores)
#d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores)
#ft_model = FastText(corpus_file=corpus_fname, workers=num_cores)

Versions

Linux-4.15...generic_x86_64_with_debian_buster_sid
64
numpy: 1.21.4
scipy: 1.7.3
gensim : 4.2.0
FAST_VERSION: the same behavior with 0 and 1

@piskvorky
Owner

piskvorky commented Aug 9, 2022

You have 24 physical cores; hyperthreading is not as efficient as real physical cores. So peak performance around 24 workers is probably expected.

@mglowacki100
Author

@piskvorky Thank you for the fast reply :) Maybe, to avoid confusion, instead of cpu_count from multiprocessing it would be better to use cpu_count(logical=False) from psutil, which detects physical cores?
An additional reason: for Doc2Vec with the same configuration, more cores give worse performance: 24 cores -> 36 seconds, 48 cores -> 45 seconds.
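For reference, a minimal sketch of that suggestion (assumes psutil is installed; it falls back to the stdlib logical count otherwise):

```python
# Sketch: compare physical vs. logical core counts. psutil's
# cpu_count(logical=False) counts real cores only, while os.cpu_count()
# includes hyperthreaded siblings.
import os

try:
    import psutil
    physical = psutil.cpu_count(logical=False)
except ImportError:
    physical = None  # psutil not installed; physical count unknown

logical = os.cpu_count()
print(f"logical={logical}, physical={physical}")
```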

@gojomo
Collaborator

gojomo commented Aug 11, 2022

Note that only the train() step makes use of the workers value when attempting multithreading, so a more vivid test of throughput would time only that step. E.g., change your test code to:

w2v_model = Word2Vec(workers=num_cores)
w2v_model.build_vocab(corpus_file=corpus_fname)
with CodeTimer(unit="s"):
    w2v_model.train(
        corpus_file=corpus_fname,
        epochs=w2v_model.epochs,
        total_examples=w2v_model.corpus_count,
        total_words=w2v_model.corpus_total_words,
    )

That said, despite the design intent of the corpus_file mode to approach near-linear speedup with cores, there will always be deviations from that ideal:

  • to the extent any virtual 'cores' beyond the true physical count rely on chip-level 'hyperthreading', their use may not deliver the same incremental benefit as enabling a truly separate physical core
  • there will always be sources of contention where more active threads might slow others; in particular, on-chip CPU cache might be less usefully 'hot' with needed memory ranges when more threads are all demanding different accesses
  • whatever volume the corpus_file data is coming from might add bottlenecks of its own: a physical HD only has a set number of heads; an SSD may have other interface-to-memory bandwidth limits

So it is to be expected that the optimal workers choice for maximum training throughput reaches a point of diminishing returns, somewhere close to (but not necessarily equal to) the number of true physical cores.
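One way to find that point empirically is a plain stdlib sweep harness like the sketch below; `sweep_workers` and `train_fn` are illustrative names, not gensim API, and the caller supplies a closure that runs one full train() pass for a given worker count:

```python
# Sketch: time one full training pass per worker count and report the results.
# train_fn is any callable taking a worker count, e.g. a closure around
# Word2Vec(...).train(...); here a toy workload stands in for it.
import time

def sweep_workers(train_fn, worker_counts):
    """Return a {workers: seconds} dict for each worker count tried."""
    timings = {}
    for n in worker_counts:
        start = time.perf_counter()
        train_fn(n)  # one complete training run with n workers
        timings[n] = time.perf_counter() - start
    return timings

# Toy usage with a stand-in workload (replace with real gensim training):
results = sweep_workers(lambda n: sum(range(200_000)), [1, 2, 4])
best = min(results, key=results.get)  # worker count with the lowest time
```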

If you think any examples/docs in the latest Gensim should be updated to give better guidance, please point out the areas where info could be better, & suggested improvements – ideally as a PR for easiest review/merge.

(In using corpus_file mode in Doc2Vec, also keep in mind open issues like #2757.)

@mglowacki100
Author

@gojomo Thanks for the detailed reply.
I'm aware that the build_vocab step doesn't scale with the number of cores (#400), and this is a non-negligible step.
Here are updated timings with your snippet:

Cores (logical)   Time (sec)
     8               19
    16               11
    24                8
    32                7.9
    40                7.2
    48                6.8

I've noticed one more thing that could be even more important (it is on proprietary data, so to be sure I need to replicate it on synthetic data). When the number of tokens per line in the corpus is low (in my case 20), the peak performance occurs even earlier, at around 14 cores, but adding more cores beyond that slows training significantly, so you get a "parabolic" performance curve.

@gojomo
Collaborator

gojomo commented Aug 15, 2022

Your observation makes intuitive sense to me: the code around reading/demarcating one text might have more chances of cross-thread/cross-core contention than the bulk calculations done once one text is chosen & all-in-cache. So the idea that shorter texts wouldn't achieve the same per-word throughput rates isn't surprising.

That effect is even larger, I think, in the non-corpus_file code paths, where one master thread must do all the creation of texts & fanning-out of text batches to worker threads – with commensurate synchronization overhead & the risk of a worker thread stalling if batch assignment falls behind. Also in the non-corpus_file path, the optimal number of worker threads is often far lower than the number of logical cores, and further varies based on other parameters like negative and window – which change the relative balance of highly-parallelizable vs more-contentious execution spans. (Some options that logically should increase runtime linearly – like window, directly increasing the volume of calculations – instead do so sub-linearly, because they manage to win back some lost contention time.)

@mglowacki100
Author

@gojomo , @piskvorky
As I'm working on a PR regarding performance guidance, I've encountered one thing that may require clarification, namely FAST_VERSION handling. I've found some information here:
https://radimrehurek.com/gensim/models/fasttext_inner.html but I have a few questions/remarks:

  1. maybe it'd be worth adding that -1 means pure Python, so slow
  2. it is not obvious how to "control" the FAST_VERSION value. I've made some experiments with Python 3.7 on Linux:
  • pip install gensim gives 0, and np.show_config() reports openblas
  • conda install gensim gives 1, and np.show_config() reports mkl_rt; on my hardware this seems slightly faster
  • the dev version compiled from source with cython for Python 3.8 gives 0
  • I don't know how to obtain mode 2 (maybe by removing BLAS?)
  • is cython also used in modes 0 and 1, or is it pure BLAS?
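For what it's worth, a small sketch of how the value can be checked at runtime (assumes gensim is installed; the guarded import fallback is my addition, not gensim behaviour):

```python
# Sketch: report which compiled code path gensim's word2vec uses.
# FAST_VERSION is a module-level int in gensim.models.word2vec_inner.
try:
    from gensim.models.word2vec_inner import FAST_VERSION
except ImportError:
    FAST_VERSION = -1  # extension module missing; treat as the slow path

uses_cython = FAST_VERSION != -1
print("FAST_VERSION:", FAST_VERSION, "| cython routines:", uses_cython)
```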

@piskvorky
Owner

piskvorky commented Aug 23, 2022

FAST_VERSION is essentially to be interpreted as FAST_VERSION != -1. The individual values don't have much meaning IIRC, they were discovered by trial-and-error when I was trying to figure out how to "plug" into raw BLAS from Python.

Maybe its user-facing interface should have been a True/False bool (FAST_VERSION != -1) from the start, but there's no point changing it now.

Gensim's *2vec models use cython, yes. Historically there was also a pure-Python mode using numpy only, but that has been removed (too slow). So FAST_VERSION == -1 shouldn't happen any more, unless I'm misremembering.
