
word2vec doesn't scale linearly with multi-cpu configuration ? #3376

Open
mglowacki100 opened this issue Aug 9, 2022 · 7 comments

Comments

@mglowacki100

Problem description

I've tried to run the script from
https://github.com/RaRe-Technologies/gensim/releases/3.6.0
with a varying number of cores (num_cores) and obtained the following times: 8 -> 26 sec, 16 -> 17 sec, 24 -> 14.4 sec, 32 -> 15.9 sec, 48 -> 16 sec. So it doesn't scale linearly with the number of cores, and the peak seems to be at 24 cores.
My machine reports 48 cores via cpu_count(); by lscpu: CPUs: 48, Threads per core: 2, Cores per socket: 12, Sockets: 2, NUMA nodes: 2, Model name: Intel Xeon E5-2650 v4 2.2 GHz. Note that the same behaviour occurs for Doc2Vec and FastText.
Is it possible that only one socket is used, or am I missing something?

Steps/code/corpus to reproduce

import gensim.downloader as api
from multiprocessing import cpu_count
from gensim.utils import save_as_line_sentence
from gensim.test.utils import get_tmpfile
from gensim.models import Word2Vec, Doc2Vec, FastText
from linetimer import CodeTimer #pip install linetimer

# Convert any corpus to the needed format: 1 document per line, words delimited by " "
corpus = api.load("text8")
corpus_fname = get_tmpfile("text8-file-sentence.txt")
save_as_line_sentence(corpus, corpus_fname)

# Choose num of cores that you want to use (let's use all, models scale linearly now!)
num_cores = 8  # also tried: 16, 24, 32, 48, cpu_count()

# Train models using all cores
with CodeTimer(unit="s"):
    w2v_model = Word2Vec(corpus_file=corpus_fname, workers=num_cores)
#d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores)
#ft_model = FastText(corpus_file=corpus_fname, workers=num_cores)

Versions

Linux-4.15...generic_x86_64_with_debian_buster_sid
64
numpy: 1.21.4
scipy: 1.7.3
gensim : 4.2.0
FAST_VERSION: the same behavior with 0 and 1

@piskvorky
Owner

piskvorky commented Aug 9, 2022

You have 24 physical cores; hyperthreading is not as efficient as real physical cores. So peak performance around 24 workers is probably expected.

@mglowacki100
Author

@piskvorky Thank you for the fast reply :) Maybe, to avoid confusion, instead of cpu_count from multiprocessing it would be better to use cpu_count(logical=False) from psutil, which detects physical cores?
An additional reason: for Doc2Vec with the same configuration, more cores give worse performance: 24 cores -> 36 seconds, 48 cores -> 45 seconds.
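For reference, a minimal sketch of that suggestion (assumes psutil is installed; it falls back to the stdlib logical count otherwise):

```python
# Sketch: compare physical vs. logical core counts. psutil's
# cpu_count(logical=False) counts real cores only, while os.cpu_count()
# includes hyperthreaded siblings.
import os

try:
    import psutil
    physical = psutil.cpu_count(logical=False)
except ImportError:
    physical = None  # psutil not installed; physical count unknown

logical = os.cpu_count()
print(f"logical={logical}, physical={physical}")
```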

@gojomo
Collaborator

gojomo commented Aug 11, 2022

Note that only the train() step makes use of the workers value when attempting multithreading, so a more vivid test of throughput would time only that step. E.g., change your test code to:

w2v_model = Word2Vec(workers=num_cores)
w2v_model.build_vocab(corpus_file=corpus_fname)
with CodeTimer(unit="s"):
    w2v_model.train(
        corpus_file=corpus_fname,
        epochs=w2v_model.epochs,
        total_examples=w2v_model.corpus_count,
        total_words=w2v_model.corpus_total_words,
    )

That said, despite the design intent of the corpus_file mode to approach near-linear speedup with cores, there will always be deviations from that ideal:

  • to the extent any virtual 'cores' beyond the true physical count rely on chip-level 'hyperthreading', their use may not deliver the same incremental benefit as enabling a truly separate physical core
  • there will always be sources of contention where more active threads might slow others; in particular, on-chip CPU cache might be less usefully 'hot' with needed memory ranges when more threads are all demanding different accesses
  • whatever volume the corpus_file data is coming from might add bottlenecks of its own: a physical HD only has a set number of heads; an SSD may have other interface-to-memory bandwidth limits

So it is to be expected that the optimal workers choice for maximum training throughput reaches a point of diminishing returns, somewhere close to (but not necessarily equal to) the number of true physical cores.
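One way to find that point empirically is a plain stdlib sweep harness like the sketch below; `sweep_workers` and `train_fn` are illustrative names, not gensim API, and the caller supplies a closure that runs one full train() pass for a given worker count:

```python
# Sketch: time one full training pass per worker count and report the results.
# train_fn is any callable taking a worker count, e.g. a closure around
# Word2Vec(...).train(...); here a toy workload stands in for it.
import time

def sweep_workers(train_fn, worker_counts):
    """Return a {workers: seconds} dict for each worker count tried."""
    timings = {}
    for n in worker_counts:
        start = time.perf_counter()
        train_fn(n)  # one complete training run with n workers
        timings[n] = time.perf_counter() - start
    return timings

# Toy usage with a stand-in workload (replace with real gensim training):
results = sweep_workers(lambda n: sum(range(200_000)), [1, 2, 4])
best = min(results, key=results.get)  # worker count with the lowest time
```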

If you think any examples/docs in the latest Gensim should be updated to give better guidance, please point out the areas where info could be better, & suggested improvements – ideally as a PR for easiest review/merge.

(In using corpus_file mode in Doc2Vec, also keep in mind open issues like #2757.)

@mglowacki100
Author

@gojomo Thanks for the detailed reply.
I'm aware that the build_vocab step doesn't scale with the number of cores (#400), and this is a non-negligible step.
Here are updated timings with your snippet:

Cores (logical)   Time (sec)
     8               19
    16               11
    24                8
    32                7.9
    40                7.2
    48                6.8

I've noticed one more thing that could be even more important (it is on proprietary data, so to be sure I need to replicate it on synthetic data). When the number of tokens per line in the corpus is low (in my case 20), the peak performance occurs even earlier, at around 14 cores, but adding more cores beyond that slows training significantly, so you get a "parabolic" performance curve.

@gojomo
Collaborator

gojomo commented Aug 15, 2022

Your observation makes intuitive sense to me: the code around reading/demarcating one text might have more chances of cross-thread/cross-core contention than the bulk calculations done once one text is chosen & all-in-cache. So the idea that shorter texts wouldn't achieve the same per-word throughput rates isn't surprising.

That effect is even larger, I think, in the non-corpus_file code paths, where one master thread must do all the creation of texts & fanning-out of text batches to worker threads – with commensurate synchronization overhead & the risk of a worker thread stalling if batch assignment falls behind. Also in the non-corpus_file path, the optimal number of worker threads is often far lower than the number of logical cores, and further varies based on other parameters like negative and window – which change the relative balance of highly-parallelizable vs more-contentious execution spans. (Some options that logically should increase runtime linearly – like window, directly increasing the volume of calculations – instead do so sub-linearly, because they manage to win back some lost contention time.)

@mglowacki100
Author

@gojomo , @piskvorky
As I'm working on a PR regarding performance guidance, I've encountered one thing that may require clarification, namely FAST_VERSION handling. I've found some information here:
https://radimrehurek.com/gensim/models/fasttext_inner.html but I have a few questions/remarks:

  1. maybe it'd be worth adding that -1 means pure Python, so slow
  2. it is not obvious how to "control" the FAST_VERSION value. I've made some experiments with Python 3.7 on Linux:
  • pip install gensim gives 0, and np.show_config() reports openblas
  • conda install gensim gives 1, and np.show_config() reports mkl_rt; on my hardware this seems slightly faster
  • the dev version compiled from source with cython for Python 3.8 gives 0
  • I don't know how to obtain mode 2 (maybe by removing BLAS?)
  • is cython also used in modes 0 and 1, or is it pure BLAS?
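For what it's worth, a small sketch of how the value can be checked at runtime (assumes gensim is installed; the guarded import fallback is my addition, not gensim behaviour):

```python
# Sketch: report which compiled code path gensim's word2vec uses.
# FAST_VERSION is a module-level int in gensim.models.word2vec_inner.
try:
    from gensim.models.word2vec_inner import FAST_VERSION
except ImportError:
    FAST_VERSION = -1  # extension module missing; treat as the slow path

uses_cython = FAST_VERSION != -1
print("FAST_VERSION:", FAST_VERSION, "| cython routines:", uses_cython)
```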

@piskvorky
Owner

piskvorky commented Aug 23, 2022

FAST_VERSION is essentially to be interpreted as FAST_VERSION != -1. The individual values don't have much meaning IIRC, they were discovered by trial-and-error when I was trying to figure out how to "plug" into raw BLAS from Python.

Maybe its user-facing interface should have been a True/False bool (FAST_VERSION != -1) from the start, but there's no point changing it now.

Gensim's *2vec models use cython, yes. Historically there was also a pure-Python mode using numpy only, but that has been removed (too slow). So FAST_VERSION == -1 shouldn't happen any more, unless I'm misremembering.
