Text extraction from File: wikilinks has an issue #87

Open
ikuyamada opened this issue Nov 30, 2014 · 5 comments

Comments

@ikuyamada

mwparserfromhell seems to have trouble extracting text from "File:" wikilinks that carry additional attributes.

In [1]: import mwparserfromhell
In [2]: w = "[[File:test.jpg|thumb|Label text]]"
In [3]: mwparserfromhell.parse(w).nodes[0].text
Out[3]: u'thumb|Label text'

I think the desired output is not "thumb|Label text" but "Label text".

@Technical-13

@ikuyamada I would actually expect it to spit out an array containing ("thumb", "Label text"). I'm guessing that it just hasn't evolved to that yet, and lacking that kind of support, "thumb|Label text" seems correct to me.

@earwig earwig self-assigned this Nov 30, 2014
@earwig earwig added this to the version 0.4 milestone Nov 30, 2014
@earwig
Owner

earwig commented Nov 30, 2014

"thumb|Label text" is correct, since the parser treats all wikilink-like things the same way. Ideally, we would understand what a file is and treat its caption specially (so you could do node.caption instead of node.text, which would give the entire chunk), but this is problematic since we don't have a reliable way to determine what is a file link and what isn't, due to site- and language-specific namespace aliases. I suppose we could just have .caption exist for all links, but this would entail new parsing rules. I'm willing to add this since it's been requested before.
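
Until something like .caption exists, a rough workaround is to post-process the string that node.text already returns. The sketch below is only a heuristic and not part of mwparserfromhell: the helper names split_top_level and guess_caption are made up for illustration, and it assumes the caption is always the last pipe-separated chunk (which is wrong for a link like [[File:test.jpg|thumb]] that has no caption at all).

```python
from __future__ import annotations


def split_top_level(text: str) -> list[str]:
    """Split on '|' characters that are not nested inside [[...]] or {{...}}."""
    parts: list[str] = []
    current: list[str] = []
    depth = 0
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("[[", "{{"):
            # Entering a nested link or template: pipes inside it are not separators.
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}") and depth > 0:
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def guess_caption(link_text: str) -> str:
    """Hypothetical helper: treat the last top-level chunk as the caption."""
    return split_top_level(link_text)[-1].strip()


# The string below is what node.text gives for the example in this issue.
print(guess_caption("thumb|Label text"))
```

Tracking bracket depth keeps a caption such as "See [[Foo|bar]] here" in one piece instead of cutting it at the inner pipe.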

@Technical-13

Feel free to 🐟 me if it is already in there, but does this mean that you are going to have it parse the whole string and output node.height, node.width, node.align, node.valign, node.mode (thumb, frameless, etc.), and node.link? If you are going to parse out each chunk, then you might as well put them in their own places.
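
To make the idea of per-attribute fields concrete, here is a minimal sketch of sorting already-split chunks into separate places. Everything in it is an assumption for illustration: FileLinkOptions and classify_options are hypothetical names, not mwparserfromhell API, and only the English MediaWiki keywords are recognized, so localized forms would fall through to the caption.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional

# English keywords only; localized aliases are deliberately not handled here.
MODES = {"thumb", "thumbnail", "frame", "framed", "frameless", "border"}
ALIGNS = {"left", "right", "center", "none"}
VALIGNS = {"baseline", "sub", "super", "top", "text-top",
           "middle", "bottom", "text-bottom"}
# Matches "200px", "x100px" and "200x100px".
SIZE_RE = re.compile(r"^(\d+)?(?:x(\d+))?px$")


@dataclass
class FileLinkOptions:
    """Hypothetical container mirroring the attributes suggested above."""
    mode: Optional[str] = None
    align: Optional[str] = None
    valign: Optional[str] = None
    width: Optional[int] = None
    height: Optional[int] = None
    link: Optional[str] = None
    caption: Optional[str] = None


def classify_options(parts: list[str]) -> FileLinkOptions:
    """Assign each pipe-separated chunk to a field; unknown chunks become the caption."""
    opts = FileLinkOptions()
    for part in parts:
        token = part.strip()
        size = SIZE_RE.match(token)
        if token in MODES:
            opts.mode = token
        elif token in ALIGNS:
            opts.align = token
        elif token in VALIGNS:
            opts.valign = token
        elif size and (size.group(1) or size.group(2)):
            opts.width = int(size.group(1)) if size.group(1) else None
            opts.height = int(size.group(2)) if size.group(2) else None
        elif token.startswith("link="):
            opts.link = token[len("link="):]
        else:
            # Anything unrecognized is treated as the caption; the last one wins.
            opts.caption = token
    return opts


print(classify_options(["thumb", "left", "200px", "Label text"]))
```

Falling back to the caption for unrecognized chunks keeps the sketch simple, but it also means a misspelled or localized keyword silently ends up as caption text.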

@earwig
Owner

earwig commented Jan 14, 2015

Hm... that's a bit clunky, but I suppose it's better than having a dictionary or some other alternative I can't think of right now.

@ricordisamoa
Contributor

Many arguments for file links can also have localized forms...
