-
-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Word alignment try 2 #267
Word alignment try 2 #267
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #267 +/- ##
==========================================
- Coverage 70.28% 70.22% -0.06%
==========================================
Files 385 385
Lines 32019 32056 +37
Branches 4504 4511 +7
==========================================
+ Hits 22503 22512 +9
- Misses 8471 8499 +28
Partials 1045 1045 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this change for?
Reviewable status: 0 of 3 files reviewed, all discussions resolved
This is needed for adding the word alignment engine to Serval. Just exposing the alignment endpoints to the interactive engine. |
This needs to be merged and released before the Serval changes will be able to compile. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure I understand what this is for. There are already interfaces for word alignment models. Also, phrase alignment isn't word alignment. That is specific to the Thot SMT engine.
Reviewable status: 0 of 3 files reviewed, all discussions resolved
The ThotSmtModel appears to be the best place to add the alignment routines onto - as the "phrase alignment" just means that the tokenizer can be configured. If I don't use ThotSmtModel, what specific things would I use? IWordAligner assumes that the source and target are already tokenized. Also, how would it interact with loading models built by machine.py? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For word alignment, you should use one of the classes that inherits from ThotWordAlignmentModel
. For SMT and word alignment models, you will need to tokenize the text. We should just use the LatinWordTokenizer
like we do for the SMT engine.
Reviewable status: 0 of 3 files reviewed, all discussions resolved
Hmm. It wold be quite a bit of reworking. I would have to use a different wording than Otherwise, I think I would have to create base class of ThotSmtModel called ThotSymmetrizedWordAlignmentModelWithTokenizer? in which 1/2 of the functionality of ThotSmtModel is implemented. And even then, all the configurations and trainers and everything else would need to be torn apart and rewritten. I think this minimal change is the best solution - it looks like a word aligner on Serval but is just an SMT model underneath. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ThotSmtModel
is a full phrased-based SMT system and takes a lot more computation and time to train. The phrase alignment from the SMT model uses a different algorithm than the word alignment models and is much more expensive. Unfortunately, it is not a replacement for the word alignment models. We should meet to discuss how best to proceed. I'm sure if I had a better understanding of what you are trying to achieve, we can come up with a good solution.
Reviewable status: 0 of 3 files reviewed, all discussions resolved
3c8ddd6
to
1f29ecd
Compare
1f29ecd
to
f905c2d
Compare
I'll need to update the branch and otherwise might have some small TODOs floating around, but I think this PR is generally ready for review, @johnml1135 @ddaspit 🥳 |
Why were these helpers moved from the generic "ToolHelpers" to the specific ones for symmetrization and thot? |
I am thinking (though not remembering fully) - are any tests needed? It doesn't really seem to be changing anything core, just rearranging where the content is and exposing it in a way more friendly to Serval. @ddaspit - do you know if anyone will be affected by the changes we are making to these libraries? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 20 of 20 files at r2, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @johnml1135)
src/SIL.Machine/Translation/WordAlignmentResult.cs
line 7 at r2 (raw file):
namespace SIL.Machine.Translation { public class WordAlignmentResult
What is this class used for?
src/SIL.Machine/Translation/SymmetrizationHeuristic.cs
line 46 at r2 (raw file):
} public static class SymmetrizationHelpers
If this is only used in SIL.Machine.Tool, then you can just leave it there.
src/SIL.Machine.Translation.Thot/ThotWordAlignmentModelType.cs
line 16 at r2 (raw file):
} public static class ThotWordAlignmentHelpers
If this is only used in SIL.Machine.Tool, I would just leave it there.
src/SIL.Machine.Translation.Thot/ThotWordAlignmentModel.cs
line 124 at r2 (raw file):
} public ITrainer CreateTrainer(IParallelTextCorpus corpus, ITokenizer<string, int, string> tokenizer = null)
We should handle this the same way that we handle it in ThotSmtModel
. Add new properties for the source and target tokenizers to this class.
src/SIL.Machine/Translation/IWordAlignmentEngine.cs
line 8 at r2 (raw file):
namespace SIL.Machine.Translation { public interface IWordAlignmentEngine : IWordAligner, IDisposable
What is the purpose of this interface?
src/SIL.Machine/Corpora/AlignedWordPair.cs
line 29 at r2 (raw file):
{ alignedWordPairs = null; try
You should only use exceptions for exceptional cases and not normal logic like this. Is this to catch the case where integers cannot be parsed? If so, you should just check the result from int.TryParse
.
@johnml1135 Would you like to take a first pass at addressing Damien's concerns since you architected a lot of those changes? I'm happy to code the fixes, but you are probably better positioned to answer some of these since I'm not really aware of the rationale for some of the changes. I, in general, didn't change work you had already done that was already working properly. |
Previously, ddaspit (Damien Daspit) wrote…
It was more logical (when I was writing it) to make the |
Previously, ddaspit (Damien Daspit) wrote…
Makes sense. |
Previously, ddaspit (Damien Daspit) wrote…
Makes sense. |
Previously, johnml1135 (John Lambert) wrote…
Correct that - it's used in Serval. A lot of these changes are to better expose WordAlignment to Serval for use. |
Previously, ddaspit (Damien Daspit) wrote…
It's used a handful of places in Serval - though it should likely be removed. It was used more when the SMT engine and the statistical engine were sharing more infrastructure. This should be removed and replaced in all locations with WordAlignmentResult from Serval.WordAlignment.Models. |
Previously, ddaspit (Damien Daspit) wrote…
sounds good to me. |
Previously, ddaspit (Damien Daspit) wrote…
Ok by me. |
I took the first pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 4 files at r3, all commit messages.
Reviewable status: 18 of 20 files reviewed, 7 unresolved discussions (waiting on @johnml1135)
src/SIL.Machine/Translation/IWordAlignmentEngine.cs
line 8 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Correct that - it's used in Serval. A lot of these changes are to better expose WordAlignment to Serval for use.
Can we use one of the existing interfaces (IWordAlignmentModel
or IWordAligner
)?
Previously, ddaspit (Damien Daspit) wrote…
At the Serval level, we needed a distinction between the Engine and the Model, especially to support the Statistical engine service and the WordAlignmentEngineState. Both don't use the Trainer routines and make more sense to just load the engine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 14 of 20 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Corpora/AlignedWordPair.cs
line 29 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Makes sense.
Is this better? It's a little odd, but it'd be nice to have a TryParse
function for use in the serializer.
src/SIL.Machine/Translation/IWordAlignmentEngine.cs
line 8 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Can we use one of the existing interfaces (
IWordAlignmentModel
orIWordAligner
)?
@johnml1135, would you like to give your rationale as to why we couldn't use one of the existing interfaces like Damien mentioned since you're the one who architected the change?
src/SIL.Machine.Tool/ToolHelpers.cs
line 17 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Why were these helpers moved from the generic "ToolHelpers" to the specific ones for symmetrization and thot?
I don't know. I think you did that, right? Sounds like you're OK moving them back.
src/SIL.Machine.Translation.Thot/ThotWordAlignmentModel.cs
line 124 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
sounds good to me.
Done.
src/SIL.Machine.Translation.Thot/ThotWordAlignmentModelType.cs
line 16 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Ok by me.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 6 of 20 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Translation/SymmetrizationHeuristic.cs
line 46 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Makes sense.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 6 of 20 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Translation/IWordAlignmentEngine.cs
line 8 at r2 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
@johnml1135, would you like to give your rationale as to why we couldn't use one of the existing interfaces like Damien mentioned since you're the one who architected the change?
Sorry, I had this drafted before I saw your message
Previously, johnml1135 (John Lambert) wrote…
OK, It will create a dependency that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 6 of 20 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Translation/WordAlignmentResult.cs
line 7 at r2 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
OK, It will create a dependency that
Serval.Machine.Shared
has onServal.WordAlignment
. As long as that's alright with everyone, I can do it.
I guess that's how it already is for Serval.Translation
, so that shouldn't be an issue. I'll just not use a global import.
Previously, Enkidu93 (Eli C. Lowry) wrote…
Sure - move them back. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ddaspit, @johnml1135, I believe I've responded to all comments and that this PR is ready for review.
Reviewable status: 5 of 20 files reviewed, 6 unresolved discussions (waiting on @ddaspit)
src/SIL.Machine.Tool/ToolHelpers.cs
line 17 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Sure - move them back.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 20 of 20 files at r2, 4 of 4 files at r3, 4 of 4 files at r4, 9 of 9 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @ddaspit)
0d6fee1
to
d46b467
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 17 of 17 files at r8, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine/Corpora/AlignedWordPair.cs
line 29 at r2 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Is this better? It's a little odd, but it'd be nice to have a
TryParse
function for use in the serializer.
Looks good. I would return as soon as a TryParseIndex
fails.
src/SIL.Machine/Translation/IWordAlignmentEngine.cs
line 8 at r2 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Sorry, I had this drafted before I saw your message
I took a look at the Serval code to better understand how the interface is being used. It looks like StatisticalEngineService
needs to call GetBestAlignedWordPairs
. Rather than add a new interface, we should add GetBestAlignedWordPairs
to the IWordAligner
interface. It makes sense there anyway. Then, we can use IWordAligner
instead. It should be easy to implement for most classes that implement the IWordAligner
interface. You can just add a default implementation in WordAlignerBase
that looks something like this:
WordAlignmentMatrix matrix = Align(sourceSegment, targetSegment);
return matrix.ToAlignedWordPairs();
src/SIL.Machine/Translation/IWordAlignmentModel.cs
line 10 at r8 (raw file):
{ ITrainer CreateTrainer(IParallelTextCorpus corpus); Task SaveAsync(CancellationToken cancellationToken = default);
We shouldn't need the Save
methods. The IWordAlignmentModel
interface provides no methods that update the model.
src/SIL.Machine/Corpora/AlignedWordPair.cs
line 100 at r8 (raw file):
return true; } if (indexString == "NULL")
I would perform this check first.
Previously, ddaspit (Damien Daspit) wrote…
I agree - see other comments. |
Previously, ddaspit (Damien Daspit) wrote…
The IWordAligner interface does not handle Tokenization. So either we add tokenization to the IWordAligner interface, or we have a different interface. The lack of tokenization is why I didn't use it before. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine/Translation/IWordAlignmentEngine.cs
line 8 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
The IWordAligner interface does not handle Tokenization. So either we add tokenization to the IWordAligner interface, or we have a different interface. The lack of tokenization is why I didn't use it before.
A while back I did a refactor of the translation interfaces to support passing strings as well as lists of tokens. Do you want something similar for the word alignment interfaces? I never got around to doing the same thing for the word alignment interfaces. I always planned to, so if we need it, then we should add it. It does look like the IWordAlignmentEngine
interface does not have any methods like that though.
Is everything ready? There are still a few comments unaddressed. I am ok with the removing of the engines and just squashing the behavior into the higher level (the model level). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - I missed a round of Damien's comments somehow. I think everything's done now.
Reviewable status: 4 of 20 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Corpora/AlignedWordPair.cs
line 29 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Looks good. I would return as soon as a
TryParseIndex
fails.
Done.
src/SIL.Machine/Corpora/AlignedWordPair.cs
line 100 at r8 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I would perform this check first.
Done.
src/SIL.Machine/Translation/IWordAlignmentEngine.cs
line 8 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
A while back I did a refactor of the translation interfaces to support passing strings as well as lists of tokens. Do you want something similar for the word alignment interfaces? I never got around to doing the same thing for the word alignment interfaces. I always planned to, so if we need it, then we should add it. It does look like the
IWordAlignmentEngine
interface does not have any methods like that though.
(I'm sorry - I think I missed your earlier comment, Damien. I think this fine now, right?)
src/SIL.Machine/Translation/IWordAlignmentModel.cs
line 10 at r8 (raw file):
Previously, johnml1135 (John Lambert) wrote…
I agree - see other comments.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 13 of 13 files at r10, 4 of 4 files at r11, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)
src/SIL.Machine/Translation/SymmetrizedWordAlignmentModel.cs
line 31 at r11 (raw file):
} public IWordAlignmentModel DirectWordAlignmentEngine
This property was renamed. It should be reverted back to the original name.
src/SIL.Machine/Translation/SymmetrizedWordAlignmentModel.cs
line 41 at r11 (raw file):
} public IWordAlignmentModel InverseWordAlignmentEngine
This property was renamed. It should be reverted back to the original name.
Done. Sorry I missed that! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 17 of 17 files at r8, 4 of 4 files at r9, 13 of 13 files at r10, 4 of 4 files at r11, 13 of 13 files at r12, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 13 of 13 files at r12, all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @johnml1135)
Small bug in NParallelTextCorpus Extend parsing of aligned word pairs to accommodate NULLs Add save functionality for WA; small bug in NPTC Edits to AlignedWordPair functionality PR review fixes Get rid of WordAlignmentResult
0a34724
to
cd92a37
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @johnml1135)
Add word alignment engine to IInteractiveTranslationEngine.
This change isdata:image/s3,"s3://crabby-images/d0bb7/d0bb7f7625ca5bf5c3cf7a2b7a514cf841ab8395" alt="Reviewable"