Text extraction from File: wikilinks has an issue #87

ikuyamada · 2014-11-30T09:17:08Z

mwparserfromhell seems to have an issue extracting text from "File:" wikilinks that carry additional attributes.

In [1]: import mwparserfromhell
In [2]: w = "[[File:test.jpg|thumb|Label text]]"
In [3]: mwparserfromhell.parse(w).nodes[0].text
Out[3]: u'thumb|Label text'

I think the desired output is not "thumb|Label text" but "Label text".

Technical-13 · 2014-11-30T14:35:15Z

@ikuyamada I would actually expect it to spit out an array containing ("thumb", "Label text"). I'm guessing that it just hasn't evolved to that yet, and lacking that kind of support, "thumb|Label text" seems correct to me.

earwig · 2014-11-30T15:26:51Z

"thumb|Label text" is correct, since the parser treats all wikilink-like things the same way. Ideally, we would understand what a file is and treat its caption specially (so you could do node.caption instead of node.text, which would give the entire chunk), but this is problematic since we don't have a reliable way to determine what is a file link and what isn't, due to site- and language-specific namespace aliases. I suppose we could just have .caption exist for all links, but this would entail new parsing rules. I'm willing to add this since it's been requested before.
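Until something like .caption exists, a caller can approximate it outside the parser. The following is only a minimal sketch, assuming that the caption is the last pipe-separated segment of node.text and that pipes nested inside [[...]] or {{...}} belong to the caption. The helper names split_top_level_pipes and guess_caption are hypothetical and not part of mwparserfromhell; they operate on the plain string form of node.text.

```python
def split_top_level_pipes(text):
    """Split on '|' characters that are not nested inside [[...]] or {{...}}."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("[[", "{{"):
            # Entering a nested link or template: pipes inside are not separators.
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}") and depth > 0:
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def guess_caption(link_text):
    """Assumption: the last top-level segment is the caption."""
    return split_top_level_pipes(link_text)[-1].strip()


print(guess_caption("thumb|Label text"))
```

This heuristic is wrong for links with no caption at all (for "[[File:test.jpg|thumb]]" it would return "thumb"), and it cannot tell file links from ordinary wikilinks, which is the namespace-alias problem described above.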

Technical-13 · 2014-11-30T15:46:51Z

Feel free to 🐟 me if it is already in there, but does this mean that you are going to have it parse the whole string to have it output node.height, node.width, node.align, node.valign, node.mode (thumb, frameless, etc), node.link? If you are going to parse out each chunk, then you might as well put them in their own places.

earwig · 2015-01-14T06:10:33Z

Hm... that's a bit clunky, but I suppose it's better than having a dictionary or some other alternative I can't think of right now.

ricordisamoa · 2015-01-15T01:21:35Z

Many arguments for file links can also have localized forms...
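To make the trade-off concrete, here is a rough sketch of sorting already-split file link options into the fields suggested earlier in this thread. Everything in it is an assumption for illustration: the function name classify_file_options, the English-only keyword sets, and the caller-supplied aliases mapping for localized forms are hypothetical and do not reflect mwparserfromhell's API or MediaWiki's full option list.

```python
import re

# Assumed English keywords only; a real wiki may define localized aliases.
MODES = {"thumb", "thumbnail", "frame", "framed", "frameless"}
ALIGNS = {"left", "right", "center", "none"}
VALIGNS = {"baseline", "sub", "super", "top", "text-top",
           "middle", "bottom", "text-bottom"}
# Matches "200px", "x100px", and "200x100px".
SIZE_RE = re.compile(r"^(\d+)?(?:x(\d+))?px$")


def classify_file_options(options, aliases=None):
    """Sort file link options into named fields; unknown ones become the caption."""
    aliases = aliases or {}
    result = {"mode": None, "align": None, "valign": None,
              "width": None, "height": None, "link": None, "caption": None}
    for raw in options:
        opt = raw.strip()
        # Map a localized keyword to its canonical form if the caller knows it.
        key = aliases.get(opt, opt)
        size = SIZE_RE.match(key)
        if key in MODES:
            result["mode"] = key
        elif key in ALIGNS:
            result["align"] = key
        elif key in VALIGNS:
            result["valign"] = key
        elif size and (size.group(1) or size.group(2)):
            if size.group(1):
                result["width"] = int(size.group(1))
            if size.group(2):
                result["height"] = int(size.group(2))
        elif key.startswith("link="):
            result["link"] = key[len("link="):]
        else:
            # Anything unrecognized is treated as the caption (last one wins).
            result["caption"] = opt
    return result


print(classify_file_options(["thumb", "200px", "Label text"]))
```

Without the right alias table for the wiki in question, a localized keyword falls through to the caption branch, which is why this kind of parsing is hard to do reliably inside the parser itself.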

earwig self-assigned this Nov 30, 2014

earwig added this to the version 0.4 milestone Nov 30, 2014

earwig added the "aspect: parser", "aspect: tree", and "priority: mid" labels and removed the "aspect: parser" label Nov 30, 2014

earwig modified the milestones: version 1.0, version 0.4 May 23, 2015

lahwaacz mentioned this issue Aug 22, 2016

Convert Wikicode to XML #161

Closed

lahwaacz mentioned this issue Jun 4, 2017

rewrite and extend Caveats #180

Merged
