Using corpus_file does not speed up while the CPU utilization seems full. #3089
Hmm, do you know what changed between the new and old servers, specifically? Do the two systems link a different BLAS library? At runtime, is the setting for BLAS threading different (e.g. MKL_NUM_THREADS)? Also, can you share …?
@piskvorky Thanks for your quick response!!
From the new system:
From the old system:
I found the two similar...
I hope this further clarifies my issue.
This is an environment variable. You set it before you launch your process (python interpreter, script). You can check its value with print(os.environ.get("MKL_NUM_THREADS", "not set")). Or else the two CPUs (their caches, architectures) are sufficiently different that the slowdown is "real", IDK. But I'd first try to rule out the BLAS threading difference. What numbers do you get for the following, on either machine?
$ ipython
import numpy
x = numpy.random.rand(1000, 1000)
%timeit numpy.dot(x, x.T)
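(As an aside, a minimal sketch of pinning the BLAS thread count from inside a script, assuming an MKL-backed numpy as discussed here; exporting the variable in the shell or job script before launching Python works the same way:)

import os

# Assumption: MKL-backed numpy. The variable must be set before numpy (and MKL)
# is first imported, otherwise MKL's default thread count may already be in effect.
os.environ["MKL_NUM_THREADS"] = "1"

import numpy
numpy.show_config()   # confirm which BLAS backend numpy is linked against
print(os.environ.get("MKL_NUM_THREADS", "not set"))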
1. The output from …
No, it means the BLAS lib was using default threading. I don't know what that is for MKL – probably "no thread limit, let MKL decide".
How did you ensure that?
As for the thread/core number, I'm using the slurm scheduler (or sbatch), so I set the number of nodes/tasks/CPUs explicitly. I set it as …
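(As a side note, not part of the original report: one way to double-check what the scheduler actually granted the process. SLURM_CPUS_PER_TASK is only present when the job requests --cpus-per-task explicitly.)

import os

# Logical CPUs this process is actually allowed to run on (respects cgroups/affinity).
print("usable CPUs:", len(os.sched_getaffinity(0)))
# What Slurm says it allocated per task, if the job set it explicitly.
print("SLURM_CPUS_PER_TASK:", os.environ.get("SLURM_CPUS_PER_TASK", "not set"))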
Instead, can you try running the …?
Okay... I wrote a quick script as below and ran it by submitting it on the two systems, using …
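(The script itself was not preserved in this extract; a minimal sketch of a numpy BLAS benchmark along the lines discussed above, with the matrix size and repeat count assumed:)

import time
import numpy

x = numpy.random.rand(1000, 1000)

# Time repeated matrix multiplications; the work is dominated by the BLAS dgemm call,
# so the per-call time reflects how well BLAS threading behaves on the node.
start = time.perf_counter()
for _ in range(100):
    numpy.dot(x, x.T)
print("seconds per dot:", (time.perf_counter() - start) / 100)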
The result from the old system was
The result from the new system was
And this (…)
Okay, I find it as …
which might have been an issue. So, I tried the same script with …
But the result was the same, as I'm pasting here. (Again, the CPU utilization was 2400%.)
For reference, I'm pasting the log from the previous system:
Is the …?
Hello, I'm writing to report that I faced the same issue on another machine, independent of the two systems I've been using. The input data and the code are the same as before.
1. The environment is as below
2. The result from numpy.show_config()
3. Training performance
3.1 workers = 1
3.2 workers = 5
3.3 workers = 20
The CPU utilization appears as it is supposed to. Plus, I tested with …
Thank you for the detailed reports. I'm not sure what the problem is – this will have to wait until someone has the time to reproduce & dig deeper.
@piskvorky Thanks for all the responses. I will post further here if I find something worth sharing.
@piskvorky I'm sharing this to ask if this might be related to the current issue. Thanks.
Well, that's exactly what we were checking above – whether your BLAS is parallelized. I'm not familiar with …
That is, the doc2vec training is slower even when restricted to a single core. So it cannot be an issue of multithreading / multiprocessing.
@piskvorky I see. (Just as a reminder, the third experiment was conducted on a server machine without using any …)
The third experiment is strange in that both …
Are these systems true machines, or VMs under some virtualization? Normally an HDD-to-SSD upgrade would be a big boost if IO was any part of the bottleneck.
I'm unsure what "28 x Intel E5-2680v4" means in terms of true cores. 28 CPUs with 14 cores each, for 392 (!) cores? If so, that sounds like more actual cores than "2x Intel Xeon Gold 6248R", 2 CPUs of 24 cores each, for 48 total cores. Might the answer be as simple as that? (What does … ?)
Despite the unset …
@gojomo Thanks for your detailed response. I'm sharing with you what I rechecked.
1. Rechecking the running time
1.1 1 worker
1.2 5 workers
1.3 20 workers
I think the running times are not different from what I shared last time.
2. Rechecking the running time with window = 10
I think I have the same issue with a smaller window size.
2.1 1 worker
2.2 5 workers
2.3 20 workers
3. Checking the running times for vector_size=100, min_count=10, window=10, seed=111, dm=0, dbow_words=1
For example, the scripts are like …
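(The exact script was not preserved in this extract; a minimal sketch using the parameters listed above, where the corpus path, epoch count, and worker count are assumptions:)

from gensim.models.doc2vec import Doc2Vec

# Hyperparameters as listed in point 3 above; corpus path and epochs are hypothetical.
model = Doc2Vec(
    corpus_file="test.txt",
    vector_size=100, min_count=10, window=10, seed=111,
    dm=0, dbow_words=1,
    workers=1,   # varied as 1 / 4 / 8 in the runs below
    epochs=1,
)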
I did this with the third machine.
3.1 1 worker
3.2 4 workers
3.3 8 workers
It seems like a similar issue occurred.
4. Others
I'm attaching results from …
From the old system (…):
From the new system (…):
The third system (…):
Thank you very much. Please let me know if you have other suggestions that I can check out. (I will test the systems with other input files just to be sure.)
Very belatedly, this behavior is still mysterious, especially where even a modest number of workers (5) with a typical window (10) goes slower than the 1-worker case.
Two low-confidence theories:
(1) some odd stall at some point; if logging had included timestamps per line, any single step taking a weird amount of time might have stuck out;
(2) as previously mentioned, something weird with the BLAS library's own attempts to parallelize via *_NUM_THREADS-like behavior, where the workers=1 case achieves effective parallelism but workers=5 somehow ends up excessively contending/stalling.
Separately, I note that the …
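(For reference, per-line timestamps in the training log can be enabled with the standard Python logging setup that gensim's documentation suggests; a minimal sketch:)

import logging

# Put a timestamp on every gensim log line, so any single step that takes an
# unusually long time stands out.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)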
Problem description
I'm struggling with the issue of speeding up doc2vec training using corpus_file after my institution introduced a new computing server system. With the previous system I had (and still have) no problem, but I found that the same script takes drastically different times, and I'm no longer getting the quasi-linear speed-up with the number of threads/cores. I have not been able to find a solution for this issue, so I decided to post it here.
Steps/code/corpus to reproduce
The script is simple, as below (a rough sketch is included at the end of this section). The test.txt file is in LineSentence format, as this page suggests. (The wide window=240 is chosen to check the CPU usage.)
Running this with the previous server system, the 1-epoch time was
While with the new system, the 1-epoch time was
I checked out the CPU utilization from the new system (with the current issue) and it seemed that it was using 2400%.
When I tried to use 48 cores, it again used the full 48 cores, as below.
But the training time was identical to the 24-core training.
While I'm not pasting it here, this happens not only with Gensim 4.0 but also with 3.8.3. I understand a wrong hardware configuration might cause this, so I will close this issue if I can confirm that. But I wanted to know if any developers/other users have encountered a case similar to mine...
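(For reference, a minimal sketch of the kind of corpus_file training script described above — not the author's exact code; window=240 and workers=24 come from the description, the remaining values are assumptions:)

import time
from gensim.models.doc2vec import Doc2Vec

start = time.perf_counter()
# Train directly from the LineSentence-format file; the constructor builds the
# vocabulary and runs the (single) training epoch, so this timing covers both.
model = Doc2Vec(
    corpus_file="test.txt",
    window=240,   # deliberately wide, to make per-core CPU usage easy to observe
    workers=24,   # also tried with 48 on the new system
    epochs=1,
)
print("elapsed (s):", time.perf_counter() - start)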
Versions
From the new system (with the training speed issue)
From the previous system (where the corpus_file method worked).