
Prefetching in copying GC #584

Open

Description

@wks

NOTE: I am not proposing to fix it immediately. This is a research topic, and needs to be validated by experiments.

TL;DR: After reading https://users.cecs.anu.edu.au/~steveb/pubs/papers/pf-ismm-2007.pdf again, I feel that the current ProcessEdgesWork does not have the same opportunity for prefetching as described in that paper, unless we make some minor adjustments.

In the paper, the edge-enqueuing tracing algorithm still enqueues ObjectReference, not the address of the slot (Address). Because the ObjectReference is in the queue, we know the addresses of the objects, so we can prefetch the object header and the object fields (i.e. the header, the oopmap and the fields are accessed "contemporaneously").
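
To illustrate the opportunity (this is only a sketch of the idea, not code from the paper or from mmtk-core; the trace_object helper, the prefetch distance and the locality hint are placeholders, and std::intrinsics::prefetch_read_data requires the nightly core_intrinsics feature):

// Sketch only: assumes mmtk-core's ObjectReference/Address types and some
// trace_object function.  Because the queue holds object references, we can
// prefetch the object a few entries ahead before we reach it.
const PREFETCH_DISTANCE: usize = 8; // made-up tuning value

fn trace_queue(queue: &[ObjectReference]) {
    for i in 0..queue.len() {
        if i + PREFETCH_DISTANCE < queue.len() {
            // Prefetch the header (and nearby fields) of a future object.
            let ahead = queue[i + PREFETCH_DISTANCE].to_address();
            unsafe { std::intrinsics::prefetch_read_data(ahead.to_ptr::<u8>(), 3) };
        }
        // By the time we trace this object, its header, oopmap and fields are
        // hopefully already in cache.
        trace_object(queue[i]);
    }
}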

However, in our current mmtk-core, each ProcessEdgesWork work packet contains a list of slot addresses (Address). The ObjectReference of the object is still one load away (unsafe { slot.load::<ObjectReference>() }). So in the following snippet we can see several places where cache misses can occur:

    #[inline]
    fn process_edge(&mut self, slot: Address) {
        let object = unsafe { slot.load::<ObjectReference>() };  // CACHE MISS 1

        let new_object = { // self.trace_object(object);  manually inlined
            if copyspace1.contains(object) {  // copyspace1.trace_object(object), manually inlined
                //...
                let forwarding_status = object_forwarding::attempt_to_forward::<VM>(object);  // CACHE MISS 2
                //...
                new_object
            } else if mallocspace.contains(object) { // mallocspace.trace_object(object), manually inlined
                //...
                if !is_marked::<VM>(object, None) { // CACHE MISS 3
                    // ...
                }
                // ...
                object
            } else {
                // ... other spaces
                object
            }
        };

        if Self::OVERWRITE_REFERENCE {
            unsafe { slot.store(new_object) };
        }
    }

If we prefetch the elements of ProcessEdgesBase::edges, we can eliminate CACHE MISS 1. But CACHE MISS 2 and CACHE MISS 3 still remain, because we prefetched the slot, not the object itself or its side metadata. And we can't prefetch the object unless we know its address in the first place. In other words, the information held in ProcessEdgesBase::edges alone is insufficient for prefetching the objects.
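
For concreteness, a slot-only prefetching loop would look roughly like this (a sketch only; PREFETCH_DISTANCE and the locality hint are assumptions, and std::intrinsics::prefetch_read_data is a nightly intrinsic). It can only hide CACHE MISS 1:

    #[inline]
    fn process_edges(&mut self) {
        for i in 0..self.edges.len() {
            if i + PREFETCH_DISTANCE < self.edges.len() {
                // Prefetch a future slot.  This hides CACHE MISS 1 only; the
                // object header and side metadata (CACHE MISS 2 and 3) cannot
                // be prefetched because the object address is not known yet.
                unsafe {
                    std::intrinsics::prefetch_read_data(
                        self.edges[i + PREFETCH_DISTANCE].to_ptr::<u8>(),
                        3, // locality hint: prefetch into all cache levels
                    );
                }
            }
            self.process_edge(self.edges[i]);
        }
    }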

Solution

One solution is to load from the edges immediately when we scan an object. When scanning an object, the slots are local to that object, so loading from them should be fast (likely cache hits).

pub struct ProcessEdgesBase<VM: VMBinding> {
    pub edges: Vec<Address>,
    pub edges_preloaded: Vec<ObjectReference>, // Add this field
    pub nodes: Vec<ObjectReference>,
    // ...
}

impl<'a, E: ProcessEdgesWork> EdgeVisitor for ObjectsClosure<'a, E> {
    #[inline(always)]
    fn visit_edge(&mut self, slot: Address) {
        if self.buffer.is_empty() {
            self.buffer.reserve(E::CAPACITY);
        }
        self.buffer.push(slot);
        self.buffer_preloaded.push(unsafe { slot.load::<ObjectReference>() });  // Pre-load here.  This should hit the cache.
        if self.buffer.len() >= E::CAPACITY {
            let mut new_edges = Vec::new();
            mem::swap(&mut new_edges, &mut self.buffer);
            let mut new_edges_preloaded = Vec::new();  // Added
            mem::swap(&mut new_edges_preloaded, &mut self.buffer_preloaded);  // Added
            self.worker.add_work(
                WorkBucketStage::Closure,
                E::new(new_edges, new_edges_preloaded, false, self.worker.mmtk),  // Added new_edges_preloaded
            );
        }
    }
}

If ProcessEdgesBase holds the last-known values of the edges, we can prefetch both the slot and the last-known object reference, so that both the load from the slot and the load from the object can be fast. And in the case where objects never move, we don't need to load from the slot at all.

    #[inline]
    fn process_edges(&mut self) {
        // Note: std::intrinsics::prefetch_read_data is unstable, and it takes a
        // raw pointer plus a locality hint (3 = prefetch into all cache levels).
        for i in 0..PREFETCH_DISTANCE.min(self.edges.len()) {
            unsafe {
                std::intrinsics::prefetch_read_data(self.edges[i].to_ptr::<u8>(), 3);  // Prefetch slot
                std::intrinsics::prefetch_read_data(self.edges_preloaded[i].to_address().to_ptr::<u8>(), 3);  // Prefetch object
            }
        }
        for i in 0..self.edges.len() {
            self.process_edge(self.edges[i]);
            if i + PREFETCH_DISTANCE < self.edges.len() {
                unsafe {
                    std::intrinsics::prefetch_read_data(self.edges[i + PREFETCH_DISTANCE].to_ptr::<u8>(), 3);
                    std::intrinsics::prefetch_read_data(self.edges_preloaded[i + PREFETCH_DISTANCE].to_address().to_ptr::<u8>(), 3);
                }
            }
        }
    }

Of course, the benefit needs to be measured experimentally.

Alternative process_edge for MarkSweep

Because mark-sweep never moves objects, we can eliminate the load in process_edge entirely if we pre-load the edge when scanning objects. This could be a special case of #574.

    #[inline]
    fn process_edge(&mut self, slot: Address, preloaded: ObjectReference) { // Added the `preloaded` parameter
        let object = if NEVER_MOVES_OBJECT {
            preloaded
        } else {
            unsafe { slot.load::<ObjectReference>() }
        };
        let new_object = self.trace_object(object);
        if !NEVER_MOVES_OBJECT && Self::OVERWRITE_REFERENCE {
            unsafe { slot.store(new_object) };
        }
    }


    Labels

    A-gc-algorithm (Area: GC algorithm), C-enhancement (Category: Enhancement), F-project (Call For Participation: Student projects), G-performance (Goal: Performance), P-normal (Priority: Normal)
