Slice has duration of "Did not end." #311

Open
dwchang79 opened this issue Oct 2, 2023 · 15 comments

@dwchang79

Some slices in my Omnitrace proto files "did not end" according to Perfetto. For my work, I am trying to find the end time for certain kernels, and when they do not end, the result is a run time of -1. I heard this is a known issue, but I was told to formally submit this bug (and my other 2) so that they can be properly tracked.

I have attached a screenshot of the behavior. In it you should see the "samples [omnitrace]" slice continue and become white at the end/right and also near the bottom left you can see that the "Duration" says "Did not end."

Thank you.
[Screenshot: DidNotEnd]
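
For anyone triaging a trace like this, here is a minimal sketch of how unfinished slices could be listed. It assumes Perfetto's `slice` table marks a slice that never ended with `dur = -1`, which matches the -1 run time described above. The in-memory SQLite table and its rows are made-up stand-ins so the query can run without a real trace; `find_unfinished_slices` and `make_toy_trace` are hypothetical helper names.

```python
import sqlite3

# Query shape one might run against Perfetto's `slice` table.
# Assumption: a slice with a start but no end is stored with dur = -1.
UNFINISHED_SLICES_SQL = """
SELECT name, ts, dur
FROM slice
WHERE dur = -1
ORDER BY ts
"""


def find_unfinished_slices(conn):
    """Return (name, ts, dur) rows for slices that have no end time."""
    return conn.execute(UNFINISHED_SLICES_SQL).fetchall()


def make_toy_trace():
    """Build an in-memory stand-in for the slice table (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE slice (name TEXT, ts INTEGER, dur INTEGER)")
    conn.executemany(
        "INSERT INTO slice VALUES (?, ?, ?)",
        [
            ("kernel_a", 1000, 250),            # finished slice
            ("kernel_b", 2000, -1),             # started, never ended
            ("samples [omnitrace]", 500, -1),   # started, never ended
        ],
    )
    return conn


if __name__ == "__main__":
    for name, ts, dur in find_unfinished_slices(make_toy_trace()):
        print(f"{name} started at {ts} but did not end (dur={dur})")
```

The same `SELECT` could be pasted into the Perfetto UI's query box or run through a trace processor against a real trace. Only the toy table here is invented.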

@jrmadsen
Collaborator

Likely fixed by the buffer-flushing fix in #317.

@jrmadsen
Collaborator

Please re-open if #317 does not fix this issue in the upcoming release

@dwchang79
Author

Unfortunately, it seems this bug is still in Omnitrace. When I load the Perfetto file for LLaMa-2, none of the kernels end.

@dwchang79
Author

Also, I can't seem to reopen this issue. Usually the re-open button is at the bottom, but I do not see it.

@jrmadsen reopened this Jan 23, 2024
@jrmadsen
Collaborator

How big is the Perfetto file? Could you be hitting the data limit? It's strange that the samples stop showing up. Samples are not inserted into Perfetto until finalization, whereas GPU kernels are inserted earlier, so it would be strange (but maybe not impossible) for samples to cause the data limit to be hit and cause Perfetto to drop the rest of the records.

@jrmadsen
Collaborator

Wait, do the GPU kernels not end, or do the samples not end? Because if it's just the samples, then that really seems like a data limit issue.

@jrmadsen
Collaborator

jrmadsen commented Jan 23, 2024

If the size of the Perfetto buffer is the issue, you can either increase it (I think the default is maybe 2 GB) or disable Perfetto annotations (which will reduce the amount of data sent to Perfetto, sometimes very significantly).

@dwchang79
Author

To answer the questions:

  1. The Perfetto file is ~900 MB so it does not open in the UI and I have to open it in the Desktop version.

  2. The GPU kernels also do not end. The top samples slice (or sometimes the main function) is white at the end, but when I go down into the actual kernels being launched, it is that shade of white because the last kernel(s) do not end. They have a start time but no end time.

  3. Is there a way to increase the size (as you suggested) using the Perfetto web UI?

Thank you.

@jrmadsen
Collaborator

I just checked, and it looks like the default buffer limit is ~1 GB, so it sounds like you may be hitting it.
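
As a rough illustration of that comparison, here is a sketch that relates a trace file's size to the configured buffer size. The 0.85 threshold is an arbitrary assumption for illustration, not an Omnitrace or Perfetto rule, and the helper names are hypothetical. It also assumes 1 KB = 1024 bytes.

```python
def buffer_usage_fraction(trace_size_bytes, buffer_size_kb):
    """Fraction of the configured buffer that the trace file would occupy."""
    return trace_size_bytes / (buffer_size_kb * 1024)


def may_be_hitting_limit(trace_size_bytes, buffer_size_kb, threshold=0.85):
    """Heuristic only: flag traces whose size is close to the buffer size.

    The threshold is an assumed value chosen for illustration.
    """
    return buffer_usage_fraction(trace_size_bytes, buffer_size_kb) >= threshold


if __name__ == "__main__":
    MB = 1024 * 1024
    trace_size = 900 * MB                 # ~900 MB file, as reported above
    one_gb_in_kb = 1024 * 1024            # ~1 GB default limit, expressed in KB
    print(may_be_hitting_limit(trace_size, one_gb_in_kb))
```

For a real file, `os.path.getsize(path)` would supply `trace_size_bytes`. A trace well under the limit does not rule out other causes of dropped records.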

No, the buffer size has nothing to do with the web UI. There is nothing you can do about any existing Perfetto files. You need to recollect data with OMNITRACE_PERFETTO_BUFFER_SIZE_KB set to a larger value and/or OMNITRACE_PERFETTO_ANNOTATIONS set to OFF.
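
A minimal sketch of preparing such an environment for a re-collection run. The two variable names come from this thread. The helper name `perfetto_recollect_env`, the `"ON"` counterpart value, and the 1 GB = 1024 * 1024 KB conversion are assumptions. The profiling command itself is deliberately left out, since it depends on your setup.

```python
import os


def perfetto_recollect_env(buffer_size_gb, annotations=False, base_env=None):
    """Return a copy of the environment with the two Perfetto settings applied.

    buffer_size_gb is converted to KB assuming 1 GB = 1024 * 1024 KB.
    "OFF" is the value suggested in the thread; "ON" is an assumed counterpart.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMNITRACE_PERFETTO_BUFFER_SIZE_KB"] = str(int(buffer_size_gb * 1024 * 1024))
    env["OMNITRACE_PERFETTO_ANNOTATIONS"] = "ON" if annotations else "OFF"
    return env


if __name__ == "__main__":
    env = perfetto_recollect_env(4)
    print(env["OMNITRACE_PERFETTO_BUFFER_SIZE_KB"])
    print(env["OMNITRACE_PERFETTO_ANNOTATIONS"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` together with whatever command you normally use to collect the trace.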

@dwchang79
Author

dwchang79 commented Jan 24, 2024

We increased the buffer size to ~4 GB and, unfortunately, the problem persists. I've attached a screenshot showing the problem.
[Screenshot: perfetto_fp16_4GBbuff]

@ppanchad-amd

@dwchang79 An internal ticket has been created to further investigate your issue. Thanks!

@schung-amd

Hi @dwchang79, are you still experiencing this issue? If so, do you have a simple way to reproduce it?

@dwchang79
Author

> Hi @dwchang79, are you still experiencing this issue? If so, do you have a simple way to reproduce it?

I am no longer at AMD (I was on sabbatical there as a Visiting Scholar), but I believe it is still an issue.

@schung-amd

Thanks for the reply! Do you recall any details about when these issues occurred? Did they only occur for a specific workload? Did you see this consistently?

@dwchang79
Author

> Thanks for the reply! Do you recall any details about when these issues occurred? Did they only occur for a specific workload? Did you see this consistently?

When I first reported it, I was running CoralGEMM (I don't remember whether it was DGEMM or SGEMM), but later on it was LLaMa-2. And yes, I would see it consistently on every run.

Thank you.
