
lemma with # # Finnish language #70

Open
mrgransky opened this issue Apr 4, 2023 · 2 comments

Comments

@mrgransky

mrgransky commented Apr 4, 2023

Given the following code snippet:

import json
from trankit import Pipeline

p = Pipeline('auto', embedding='xlm-roberta-large')

doc = '''Naton päämajassa Brysselissä järjestettiin iltapäivällä Suomen virallinen liittymisseremonia.'''

tokens = p(doc, is_sent=True)
print(json.dumps(tokens, indent=2, ensure_ascii=False))

For some reason, I get # in my lemma as seen in this sample doc:

{
  "text": "Naton päämajassa Brysselissä järjestettiin iltapäivällä Suomen virallinen liittymisseremonia.",
  "tokens": [
    {
      "id": 1,
      "text": "Naton",
      "upos": "PROPN",
      "xpos": "N",
      "feats": "Case=Gen|Number=Sing",
      "head": 2,
      "deprel": "nmod:poss",
      "span": [
        0,
        5
      ],
      "lemma": "Nato"
    },
    {
      "id": 2,
      "text": "päämajassa",
      "upos": "NOUN",
      "xpos": "N",
      "feats": "Case=Ine|Number=Sing",
      "head": 4,
      "deprel": "obl",
      "span": [
        6,
        16
      ],
      "lemma": "pää#maja"  <<<<<<<<<<<<<<<<<<<<< HERE <<<<<<<<<<<<<<<<<<<<<
    },
    {
      "id": 3,
      "text": "Brysselissä",
      "upos": "PROPN",
      "xpos": "N",
      "feats": "Case=Ine|Number=Sing",
      "head": 2,
      "deprel": "appos",
      "span": [
        17,
        28
      ],
      "lemma": "Bryssel"
    },
    {
      "id": 4,
      "text": "järjestettiin",
      "upos": "VERB",
      "xpos": "V",
      "feats": "Mood=Ind|Tense=Past|VerbForm=Fin|Voice=Pass",
      "head": 0,
      "deprel": "root",
      "span": [
        29,
        42
      ],
      "lemma": "järjestää"
    },
    {
      "id": 5,
      "text": "iltapäivällä",
      "upos": "NOUN",
      "xpos": "N",
      "feats": "Case=Ade|Number=Sing",
      "head": 4,
      "deprel": "obl",
      "span": [
        43,
        55
      ],
      "lemma": "ilta#päivä" <<<<<<<<<<<<<<<<<<<<< HERE <<<<<<<<<<<<<<<<<<<<<
    },
    {
      "id": 6,
      "text": "Suomen",
      "upos": "PROPN",
      "xpos": "N",
      "feats": "Case=Gen|Number=Sing",
      "head": 8,
      "deprel": "nmod:poss",
      "span": [
        56,
        62
      ],
      "lemma": "Suomi"
    },
    {
      "id": 7,
      "text": "virallinen",
      "upos": "ADJ",
      "xpos": "A",
      "feats": "Case=Nom|Degree=Pos|Derivation=Llinen|Number=Sing",
      "head": 8,
      "deprel": "amod",
      "span": [
        63,
        73
      ],
      "lemma": "virallinen"
    },
    {
      "id": 8,
      "text": "liittymisseremonia",
      "upos": "NOUN",
      "xpos": "N",
      "feats": "Case=Nom|Number=Sing",
      "head": 4,
      "deprel": "obj",
      "span": [
        74,
        92
      ],
      "lemma": "liittyä#seremoni" <<<<<<<<<<<<<<<<<<<<< HERE <<<<<<<<<<<<<<<<<<<<<
    },
    {
      "id": 9,
      "text": ".",
      "upos": "PUNCT",
      "xpos": "Punct",
      "head": 4,
      "deprel": "punct",
      "span": [
        92,
        93
      ],
      "lemma": "."
    }
  ],
  "lang": "finnish"
}

I tried it both in Colab and in a terminal, with the same results!

What am I doing wrong?

PS: I do not get the same result on the demo website:
(image attachment)

Cheers,

@OttoTarkka

Not an error, the component words of compound words (Finnish: yhdyssana) are separated by the '#' sign by design.
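If plain lemmas without the boundary marker are wanted, one option is to post-process the output. The following is a minimal sketch, not part of trankit: `strip_compound_marker` is a hypothetical helper, and it assumes the sentence-level dict layout shown in the output above (a `"tokens"` list whose items carry a `"lemma"` key).

```python
import copy


def strip_compound_marker(sentence, marker="#"):
    """Return a copy of a sentence dict with `marker` removed from each lemma.

    Hypothetical helper; assumes the dict layout printed in the issue above.
    """
    cleaned = copy.deepcopy(sentence)
    for token in cleaned.get("tokens", []):
        lemma = token.get("lemma")
        if not lemma:
            continue
        stripped = lemma.replace(marker, "")
        # Keep the original lemma if it consists only of the marker,
        # e.g. when the token is a literal "#" character.
        if stripped:
            token["lemma"] = stripped
    return cleaned


# Hand-written sample mirroring part of the output in the issue.
sample = {
    "tokens": [
        {"id": 1, "text": "Naton", "lemma": "Nato"},
        {"id": 2, "text": "päämajassa", "lemma": "pää#maja"},
        {"id": 5, "text": "iltapäivällä", "lemma": "ilta#päivä"},
    ]
}

cleaned = strip_compound_marker(sample)
print([t["lemma"] for t in cleaned["tokens"]])
# prints ['Nato', 'päämaja', 'iltapäivä']
```

Working on a copy leaves the original output intact, so the compound boundaries stay available (for example via `lemma.split("#")`) if the component words are needed later.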

@mrgransky
Author

But this only occurs when the standard TDT package is used;
FTB does not lead to the same issue.
