Fix IndexError when abbrv is longer than original; Close #12 (#14)
* Fix IndexError when abbrv is longer than original

In some cases the abbreviation and the original word are mismatched
because a dot is appended to a word that is left unabbreviated, e.g.,
"Control". When this happens, the dot is removed and the abbreviation is
truncated to the length of the original word.

* Fix missed detection of single letter part names

Fix the misclassification of single-letter part names when the single
letter is also a stopword.
klb2 authored Jul 7, 2024
1 parent 8cc47b5 commit 045bcd1
Showing 3 changed files with 14 additions and 2 deletions.
3 changes: 2 additions & 1 deletion pyiso4/lexer.py
@@ -132,7 +132,7 @@ def yield_hyphenated(word: str, base_pos: int) -> Iterable[Token]:
 yield Token(TokenType.PART, word, self.start_word)
 was_part = self.count
 # check if ordinal (preceded by PART)
-elif IS_ORDINAL.match(word) and self.count == was_part + 1:
+elif IS_ORDINAL.fullmatch(word) and self.count == was_part + 1:
 yield Token(TokenType.ORDINAL, word, self.start_word)
 # check if article (after ordinal, so "a" is detected as ordinal if preceded by PART)
 elif lower_word in ARTICLES:
@@ -155,6 +155,7 @@ def yield_hyphenated(word: str, base_pos: int) -> Iterable[Token]:
 # yield the remaining symbols, if any
 if len(end_symbols) > 0:
 yield Token(TokenType.SYMBOLS, end_symbols, self.pos - len(end_symbols))
+was_part = self.count

 self.next()
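
The switch from `match` to `fullmatch` in the hunk above matters because `re.Pattern.match` only anchors at the start of the string, so a longer word can be accepted on the strength of its first character. A minimal sketch of the difference follows; the `IS_ORDINAL` pattern here is a hypothetical stand-in, and the real pattern in pyiso4 may differ.

```python
import re

# Hypothetical stand-in for pyiso4's IS_ORDINAL; the real pattern may differ.
IS_ORDINAL = re.compile(r"[A-Z]|[0-9]+|[IVXLCDM]+")


def is_ordinal_prefix(word: str) -> bool:
    """Old behaviour: succeeds if any prefix of `word` matches the pattern."""
    return IS_ORDINAL.match(word) is not None


def is_ordinal_full(word: str) -> bool:
    """New behaviour: succeeds only if the whole `word` matches the pattern."""
    return IS_ORDINAL.fullmatch(word) is not None


if __name__ == "__main__":
    for word in ("A", "12", "XII", "Advances"):
        print(word, is_ordinal_prefix(word), is_ordinal_full(word))
```

With this stand-in pattern, "Advances" passes the prefix check because of its leading "A" but fails the full check, while "A", "12" and "XII" pass both.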

3 changes: 3 additions & 0 deletions pyiso4/ltwa.py
@@ -177,6 +177,9 @@ def match_capitalization_and_diacritic(abbrv: str, original: str) -> str:
 """Matches the capitalization and diacritics of the `original` word, as long as they are similar
 """

+if len(abbrv) > len(original):
+abbrv = abbrv[:len(original)]
+
 normalized_abbrv = list(normalize(abbrv, Level.SOFT))
 for i, c in enumerate(normalized_abbrv):
 unided = unidecode(original[i])
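
The guard added in the hunk above can be illustrated in isolation. The sketch below is a simplified stand-in, not the library's function: `match_capitalization` and its `guard` flag are hypothetical names, and diacritic handling is left out. It only shows why indexing `original[i]` fails when the abbreviation carries an extra trailing dot, and how truncating first avoids that.

```python
def match_capitalization(abbrv: str, original: str, guard: bool = True) -> str:
    """Copy the capitalization of `original` onto `abbrv` (simplified stand-in)."""
    # Guard from the commit: never iterate past the end of `original`.
    if guard and len(abbrv) > len(original):
        abbrv = abbrv[: len(original)]

    chars = []
    for i, c in enumerate(abbrv):
        # original[i] raises IndexError if abbrv is longer than original.
        chars.append(c.upper() if original[i].isupper() else c)
    return "".join(chars)


if __name__ == "__main__":
    print(match_capitalization("control.", "Control"))
    try:
        match_capitalization("control.", "Control", guard=False)
    except IndexError as exc:
        print("without the guard:", type(exc).__name__)
```

Truncation also drops the stray dot, which matches the expected test output "IEEE Trans. Autom. Control" with no dot after "Control".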
10 changes: 9 additions & 1 deletion tests/tests.tsv
@@ -40,4 +40,12 @@ Zeitschrift des Deutschen Palästina-Vereins Z. Dtsch. Paläst.-Ver.
 International Journal of e-Collaboration	Int. J. e-Collab.
 Proceedings of A. Razmadze Mathematical Institute	Proc. A. Razmadze Math. Inst.
 Norsk Militært Tidsskrift	Nor. Mil. Tidsskr.
-Proceedings of the 2024 Conference on Science	Proc. 2024 Conf. Sci.
+Proceedings of the 2024 Conference on Science	Proc. 2024 Conf. Sci.
+IEEE Power and Energy Magazine	IEEE Power Energy Mag.
+IEEE Transactions on Automatic Control	IEEE Trans. Autom. Control
+E.S.A. bulletin	E.S.A. bull.
+Acta Universitatis Carolinae. Iuridica	Acta Univ. Carol., Iurid.
+Physical Review. A	Phys. Rev., A
+Physical Review. D	Phys. Rev., D
+Physical Review. E	Phys. Rev., E
+Physical Review. I	Phys. Rev., I
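
Rows like the ones above pair a full title with its expected abbreviation. A minimal sketch of how such a file could be checked is shown below, assuming two tab-separated columns per row. The helper `failing_cases` and the `abbreviate` callable are hypothetical and are not part of pyiso4's test suite.

```python
import csv
import io
from typing import Callable


def failing_cases(
    tsv_text: str, abbreviate: Callable[[str], str]
) -> list[tuple[str, str, str]]:
    """Return (title, expected, actual) for every row whose abbreviation differs."""
    failures = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed rows
        title, expected = row
        actual = abbreviate(title)
        if actual != expected:
            failures.append((title, expected, actual))
    return failures


if __name__ == "__main__":
    sample = "Physical Review. A\tPhys. Rev., A\nE.S.A. bulletin\tE.S.A. bull.\n"
    # Fake abbreviator for demonstration; a real run would call the library instead.
    fake = {"Physical Review. A": "Phys. Rev., A", "E.S.A. bulletin": "E.S.A. bulletin"}
    for title, expected, actual in failing_cases(sample, fake.__getitem__):
        print(f"{title!r}: expected {expected!r}, got {actual!r}")
```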
