Granite 3.0 and 3.1 models are Llama-architecture models with different scaling terms in several places. This commit adds Liger kernel patching for decoder-only Granite 3 models (not multimodal) and the corresponding tests.
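To make "different scaling terms" concrete, here is a minimal pure-Python sketch of where such multipliers could enter a Llama-style forward pass. The field names (`embedding_multiplier`, `residual_multiplier`, `logits_scaling`) are modeled on the Hugging Face Granite config and are an assumption, as are the example values; none of this is taken from the PR's code.

```python
from dataclasses import dataclass


@dataclass
class ScalingTerms:
    # Hypothetical container; a value of 1.0 reproduces plain Llama behaviour.
    embedding_multiplier: float = 1.0
    residual_multiplier: float = 1.0
    logits_scaling: float = 1.0


def scale_embeddings(embeds, terms):
    # Token embeddings are multiplied before entering the first block.
    return [x * terms.embedding_multiplier for x in embeds]


def residual_add(residual, block_out, terms):
    # The block output is scaled before being added back to the residual stream.
    return [r + terms.residual_multiplier * h for r, h in zip(residual, block_out)]


def scale_logits(logits, terms):
    # Final logits are divided by a constant before the loss is computed.
    return [z / terms.logits_scaling for z in logits]


llama_like = ScalingTerms()
# Illustrative values only, not a real Granite checkpoint's configuration.
granite_like = ScalingTerms(embedding_multiplier=12.0,
                            residual_multiplier=0.5,
                            logits_scaling=8.0)
```

A kernel that fuses any of these steps (for example, the final projection with the loss) has to apply the same scaling, or its outputs will differ from the unpatched model.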
Summary
This change enables patching Granite 3.0 and 3.1 models with Liger kernels. We would like to use Liger kernels in our training implementation, but our codebase is Granite-first for the moment.
Testing Done
Convergence tests confirm that loss and model parameters are equivalent with and without Liger kernels. Logits, however, are not equivalent, even when only the SwiGLUMLP layer is swapped. The atol and rtol may need to be tuned for Granite vs. Llama; I'm going to continue investigating this before this PR is merged.
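For context on what tuning atol and rtol changes, the sketch below reimplements the elementwise closeness criterion used by `torch.allclose`, `|actual - expected| <= atol + rtol * |expected|`, in plain Python. The helper names and the sample values are hypothetical and are not taken from the convergence tests.

```python
def is_close(actual, expected, atol=1e-8, rtol=1e-5):
    # Same inequality as torch.allclose: the allowed error grows with |expected|.
    return abs(actual - expected) <= atol + rtol * abs(expected)


def all_close(actual, expected, atol=1e-8, rtol=1e-5):
    # Two sequences match only if they have equal length and every pair is close.
    return len(actual) == len(expected) and all(
        is_close(a, e, atol, rtol) for a, e in zip(actual, expected)
    )


def worst_excess(actual, expected, atol, rtol):
    # Largest amount by which any element exceeds its tolerance (<= 0 means pass).
    return max(
        abs(a - e) - (atol + rtol * abs(e)) for a, e in zip(actual, expected)
    )


# Made-up logits standing in for a reference run and a patched run.
reference = [10.0, -4.0, 0.5]
patched = [10.02, -4.001, 0.5]

strict = all_close(patched, reference, atol=1e-3, rtol=1e-3)
loose = all_close(patched, reference, atol=1e-3, rtol=1e-2)
```

Because the bound scales with the magnitude of the reference value, a model whose logits are rescaled by a constant can need a different atol than one whose logits are not, even when the relative error is unchanged.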
make test to ensure correctness
make checkstyle to ensure code style
make test-convergence to ensure convergence