
Optimizing Speed of Hull Algorithm #903

Open
wants to merge 13 commits into master

Conversation

Kushal-Shah-03
Contributor

I've added Tracy headers in some functions to observe the number of function calls and the time taken, in order to identify possible places to parallelize. I was wondering if there is a way to run just one specific test, or would I have to comment out the others?


codecov bot commented Aug 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.46%. Comparing base (d437097) to head (1f02448).
Report is 87 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #903      +/-   ##
==========================================
- Coverage   91.84%   88.46%   -3.38%     
==========================================
  Files          37       62      +25     
  Lines        4976     8685    +3709     
  Branches        0     1056    +1056     
==========================================
+ Hits         4570     7683    +3113     
- Misses        406     1002     +596     


@pca006132
Collaborator

e.g. ./test/manifold_test --gtest_filter=Hull.Tictac.

@Kushal-Shah-03
Contributor Author

[image: Tracy trace of Hull.Tictac, without TBB]
This was the trace on Hull.Tictac. I think we can look into parallelizing the addPointToFace calls; that should improve the time.

@Kushal-Shah-03
Contributor Author

So, for addPointToFace, I checked it for three cases (the Tictac, Sphere, and MengerSponge tests). The maximum value of horizonEdgeCount was 17, of disabledFacePointVectors it was 11, and of disabledPoints it was 281249.

So we could focus on parallelizing the disabledPoints part. Looking back at the implementation, the loop essentially works like this: each point modifies exactly one face (the first face that happens to have the point on its positive side). Done naively, that would mean one atomic operation on a face per point. I'm not entirely sure about this, but apparently tbb vectors support concurrent push_back (whether that adds more overhead, I don't know). If it doesn't, the only value we would need to modify with an atomic set is the most distant point of a face, which could lower the number of atomic operations on average. So we could then think of switching the inner and outer loops and parallelizing over the points (sketched below).

This is just food for thought, and it's possible what I've explained isn't very clear, so let's discuss it in the next meeting.
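
Roughly the shape I have in mind, as a minimal sketch only: visibleFrom, facePoints, and faceLocks are placeholder names rather than the real QuickHull/MeshBuilder API, and a per-face lock stands in for whatever synchronization we end up choosing.

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <mutex>
#include <vector>

// Loop inversion: parallelize over the (large) disabled-point list and let
// each point claim the first face it is visible from. Each point touches at
// most one face, so the critical section is short and threads only contend
// when two points land on the same face at the same time.
template <typename VisibleFrom>
void assignDisabledPoints(const std::vector<size_t>& disabledPoints,
                          size_t numFaces, VisibleFrom visibleFrom,
                          std::vector<std::vector<size_t>>& facePoints,
                          std::vector<std::mutex>& faceLocks) {
  std::for_each(std::execution::par, disabledPoints.begin(),
                disabledPoints.end(), [&](size_t point) {
    for (size_t face = 0; face < numFaces; ++face) {
      if (visibleFrom(face, point)) {
        std::lock_guard<std::mutex> lock(faceLocks[face]);
        facePoints[face].push_back(point);
        break;  // first visible face wins, same as the serial loop
      }
    }
  });
}
```

A tbb::concurrent_vector per face would remove the explicit lock, at the cost of a heavier container, which is the trade-off mentioned above.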

@Kushal-Shah-03
Contributor Author

[image: Tracy trace after removing O(1) zones]
I have removed the Tracy header from all functions that were O(1). I have pushed the changes so you get an idea of which functions were removed. This is the Tracy output after that.

Also @elalish, I know we discussed that we wanted to clean up the code a little, but I just wanted to verify whether I should do it in this PR itself; we could then rename it to "refactoring and optimizing". That won't be an issue, right?

@elalish
Owner

elalish commented Aug 22, 2024

@pca006132 knows more about atomics than I do, but my impression is that you mostly pay the atomic price when they collide (you're actually trying to modify the same address at the same time). I think if the number of faces is >= the number of disabled points, then atomics might not be too bad.
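
To put the collision point in concrete terms, a toy sketch (not manifold code): one atomic counter per face, where increments on different faces rarely collide and are therefore cheap, while points hitting the same face serialize on that one counter.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Imagine this loop being split across threads over disjoint point ranges:
// fetch_add on distinct counters proceeds mostly in parallel; the atomic
// price is paid mainly when two threads hit the same face's counter.
void countPointsPerFace(const std::vector<size_t>& faceOfPoint,
                        std::vector<std::atomic<int>>& counts) {
  for (size_t point = 0; point < faceOfPoint.size(); ++point) {
    counts[faceOfPoint[point]].fetch_add(1, std::memory_order_relaxed);
  }
}
```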

It might be easier to open a second PR for readability improvements. What do you think?

@Kushal-Shah-03
Contributor Author

Yeah, I was thinking the same, but the number of faces is much lower, so I was not sure what we could do to optimize in that situation; I just thought I'd share my thoughts so we can discuss it.

Yeah, I was thinking of opening a new PR as well; we can sync this one later once we are satisfied with it. So I'll go ahead, start making changes in a new branch, and send a PR soon.

@pca006132
Collaborator

I think we can try to parallelize over the disabled points. I don't think we want to use the tbb vector; it is not as lightweight as a typical vector. Also, considering we have more than several dozen faces, it may be feasible to use a lock per face. If there is little contention, that keeps things simple while still giving a nice performance improvement.

@Kushal-Shah-03
Contributor Author

#904 (comment) @pca006132 Should I start breaking the Face struct into three structs according to this, or is there a better way to approach it?

@pca006132
Collaborator

Parallelizing things will likely be more beneficial.

@Kushal-Shah-03
Contributor Author

Parallelizing things will likely be more beneficial.

I've already started work on it. I had some doubts I was hoping to ask you about in today's meeting, and I'll continue based on that.

@Kushal-Shah-03
Contributor Author

@pca006132 I was wondering if we should use the Par policy for iterating over the faces despite the number of faces being small, since with the way I have tried to implement it, that could speed things up. Also, I was getting random errors when I used Par, and I was wondering whether I used AtomicAdd correctly. I have added, in comments, the change needed to make Par work, but it was causing errors; could you go through it once?

@elalish It seems to me that the bug where some verts are excluded is causing the change I made here to fail as well. I'll try to investigate the error again, but in the meanwhile, is it alright if I add EXPECT_NEAR for it as well?

@pca006132
Collaborator

I was wondering if we should use the Par policy for iterating over the faces despite the number of faces being small, since with the way I have tried to implement it, that could speed things up.

Just evaluate the two and see how much better it gets. We just care about the actual performance (and correctness); the opinions I gave earlier are heuristics, not rules.

@Kushal-Shah-03
Contributor Author

Yeah, that's why I tried to implement it, but I was getting errors despite the code seeming correct to me. Could you go through it once? I've added, in comments, the changes needed for Par.

// return true;

// For ExecutionPolicy::Seq
pointMutex = 1;
Collaborator

what is this pointMutex thing?

Contributor Author

It's just a variable I'm using to check whether that particular point has been assigned to a face yet. I initially planned to have a lock for it and forgot to change the name; I'll rename it.

pointMutex = 1;

// Ensures atomic addition of point to face
f.faceMutex->lock();
if (!f.pointsOnPositiveSide) {
f.pointsOnPositiveSide = getIndexVectorFromPool();
Collaborator

this call is not thread-safe.

Contributor Author

Lock call isn't thread-safe?

Collaborator

I mean the getIndexVectorFromPool.

Contributor Author

Oh right, since the Pool is shared by all faces, I should use a lock specifically for the pool, roughly as in the sketch below. Thanks for pointing that out, that helps a lot!
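
For reference, a minimal sketch of what I mean, with assumed names (IndexPool, getIndexVector, reclaim) rather than the actual pool implementation: every take/return goes through one pool-wide mutex, independent of the per-face locks.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// A pool shared by all faces: a single dedicated mutex guards the free list,
// so faces can take and return index vectors concurrently.
class IndexPool {
 public:
  std::unique_ptr<std::vector<size_t>> getIndexVector() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (free_.empty()) return std::make_unique<std::vector<size_t>>();
    auto out = std::move(free_.back());
    free_.pop_back();
    return out;
  }

  void reclaim(std::unique_ptr<std::vector<size_t>> v) {
    v->clear();
    std::lock_guard<std::mutex> lock(mutex_);
    free_.push_back(std::move(v));
  }

 private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<std::vector<size_t>>> free_;
};
```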

Collaborator

Do note that for every lock you are adding now, you will likely be thinking about how to remove them later...

Collaborator

Anyway, for now, the important thing is to make sure it works. We can gradually improve on performance.

Contributor Author

Yes, I will keep that in mind. Also, the function call happens once per face, so it should not add much overhead.

@@ -126,6 +126,7 @@ class MeshBuilder {
size_t visibilityCheckedOnIteration = 0;
std::uint8_t isVisibleFaceOnCurrentIteration : 1;
std::uint8_t inFaceStack : 1;
std::recursive_mutex* faceMutex = new std::recursive_mutex();
Collaborator

And this probably should not be a recursive mutex.

Contributor Author

I saw recursive_mutex being used in a lot of places, so I figured I could use it. Should I use std::mutex instead?

Collaborator

Recursive mutex is needed only when you may lock the same thing several times in the same thread. Yeah, you can probably just use std::mutex.
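
For example, a sketch of the member (assuming Face needs to stay movable, which is why the mutex sits behind a smart pointer here rather than living in the struct by value; the raw new in the diff above would also never be freed):

```cpp
#include <memory>
#include <mutex>

struct Face {
  // ... existing members ...
  // std::mutex is neither copyable nor movable, so holding it behind a
  // unique_ptr keeps Face movable and avoids an unmanaged `new`.
  std::unique_ptr<std::mutex> faceMutex = std::make_unique<std::mutex>();
};
```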

@pca006132
Collaborator

Btw, if you use autoPolicy with disabledPoints->size(), it is almost guaranteed you will not get ExecutionPolicy::Par. IIRC the number of disabled points is typically pretty small, much less than 1e4 or so.
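
To illustrate the point, a sketch of the pattern only; the threshold and the autoPolicy-like helper below are assumptions based on this comment, not the actual implementation.

```cpp
#include <cstddef>

enum class ExecutionPolicy { Par, Seq };

// Assumed behaviour of an autoPolicy-style helper: parallel execution only
// above some fixed element-count threshold (the real cutoff may differ).
inline ExecutionPolicy autoPolicyLike(size_t n, size_t threshold = 10000) {
  return n > threshold ? ExecutionPolicy::Par : ExecutionPolicy::Seq;
}

// With disabledPoints->size() typically far below the threshold, a call like
//   auto policy = autoPolicyLike(disabledPoints->size());
// will essentially always return Seq, so the call site either needs a size
// that reflects the total work or an explicit ExecutionPolicy::Par.
```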

@Kushal-Shah-03
Contributor Author

Btw, if you use autoPolicy with disabledPoints->size(), it is almost guaranteed you will not get ExecutionPolicy::Par. IIRC the number of disabled points is typically pretty small, much less than 1e4 or so.

Oh, thanks for pointing that out!

@Kushal-Shah-03
Contributor Author

@pca006132, I think I've got it working, and now I want to try to improve the performance. I will try it both with and without parallelizing over the faces to see which performs better. Apart from that, I had a couple of questions:

  1. What function can I use for an atomic set?
  2. How would I go about removing the lock for each face? It appears to me that if I want to parallelize over the points, I need it.

Also, to get an idea of how a change affects average performance, I was thinking I could run it on the Thingi10k dataset, while for the larger cases where it should show an improvement, I'll try the high-quality sphere and MengerSponge. Does that sound good to you?

@elalish
Owner

elalish commented Sep 10, 2024

Usually you either need a lock or an atomic, but ideally not both. I think of an atomic as a short hardware lock, and prefer them where possible.

@pca006132
Collaborator

  1. std::atomic<int>; see the sketch after this list.
  2. There are many possible ways. For some inspiration, you can look at our collider implementation, which uses some index computation to avoid the use of many locks.
  3. Sure, that sounds good.
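
For question 1, a small standalone example of the two usual patterns with std::atomic<int> (generic code, not manifold-specific): an unconditional set via store, and a claim-once set via compare_exchange_strong, which is what a per-point "assigned" flag (the pointMutex variable discussed above) maps to.

```cpp
#include <atomic>

// Unconditional atomic set: any thread may overwrite the value.
void setFlag(std::atomic<int>& flag, int value) {
  flag.store(value, std::memory_order_relaxed);
}

// Claim-once set: only the first thread to see 0 wins; later threads observe
// that the point was already assigned and back off.
bool claimOnce(std::atomic<int>& flag) {
  int expected = 0;
  return flag.compare_exchange_strong(expected, 1, std::memory_order_acq_rel);
}
```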

@pca006132
Collaborator

It seems that the number of vertices in the tictac hull output is somewhat dependent on evaluation order?

 [ RUN      ] Hull.Tictac
/home/runner/work/manifold/manifold/test/hull_test.cpp:80: Failure
The difference between sphere.NumVert() + tictacSeg and tictac.NumVert() is 2, which exceeds 1, where
sphere.NumVert() + tictacSeg evaluates to 63002,
tictac.NumVert() evaluates to 63004, and
1 evaluates to 1.
