Migrate to parsoid api #267

cimurah · 2020-10-27T19:41:53Z

No description provided.

tests/Book/RefreshTest.php

MusikAnimal · 2020-10-27T19:49:46Z

src/Util/Api.php

+		$url = 'https://' . $this->domainName . '/api/rest_v1/page/html/' . urlencode( $title );
+		return $this->getAsync( $url )
+					->then( function ( string $result ) {
+						if ( $result != '' ) {
+							$result = preg_replace( '#<\!--(.+)-->#isU', '', $result );
+						}
+						return '<?xml version="1.0" encoding="UTF-8" ?>' . $result;
+					}
+		);


Suggested change

$url = 'https://' . $this->domainName . '/api/rest_v1/page/html/' . urlencode( $title );

return $this->getAsync( $url )

->then( function ( string $result ) {

if ( $result != '' ) {

$result = preg_replace( '#<\!--(.+)-->#isU', '', $result );

}

return '<?xml version="1.0" encoding="UTF-8" ?>' . $result;

}

);

$url = 'https://' . $this->domainName . '/api/rest_v1/page/html/' . urlencode( $title );

return $this->getAsync( $url )

->then( function ( string $result ) {

if ( $result != '' ) {

$result = preg_replace( '#<\!--(.+)-->#isU', '', $result );

}

return '<?xml version="1.0" encoding="UTF-8" ?>' . $result;

} );

samwilson

Looking good.

The big part of this is to make sure that the new HTML we're getting is correctly parsed and turned into what's required for epub. I suspect there are things that we were doing that are no longer required, and things that now need to be done that didn't before.

README.md

samwilson · 2020-10-28T01:55:18Z

src/Util/Api.php

-			throw new HttpException( 500, 'No page information found in response' );
-		}
-		throw new NotFoundHttpException( "Page revision not found for: $title" );
+		$url = 'https://' . $this->domainName . '/api/rest_v1/page/html/' . urlencode( $title );


The structure of the HTML returned by this API is different to that of the old API, and so the PageParser has to be updated. For example, this is looking for /wiki/ at the beginning of href values, but these are now relative and don't contain that string.

For example, the old HTML of enwikisource:Emma is:

<h3><span class="mw-headline" id="Volume_1">Volume 1</span></h3> <ul><li><a href="/wiki/Emma/Volume_1/Chapter_1" title="Emma/Volume 1/Chapter 1">Chapter 1</a></li> <li><a href="/wiki/Emma/Volume_1/Chapter_2" title="Emma/Volume 1/Chapter 2">Chapter 2</a></li> <li><a href="/wiki/Emma/Volume_1/Chapter_3" title="Emma/Volume 1/Chapter 3">Chapter 3</a></li> <li><a href="/wiki/Emma/Volume_1/Chapter_4" title="Emma/Volume 1/Chapter 4">Chapter 4</a></li>

while the new HTML is:

<section data-mw-section-id="-1" id="mwDQ"><h3 id="Volume_1">Volume 1</h3> <ul><li><a rel="mw:WikiLink" href="./Emma/Volume_1/Chapter_1" title="Emma/Volume 1/Chapter 1">Chapter 1</a></li> <li><a rel="mw:WikiLink" href="./Emma/Volume_1/Chapter_2" title="Emma/Volume 1/Chapter 2">Chapter 2</a></li> <li><a rel="mw:WikiLink" href="./Emma/Volume_1/Chapter_3" title="Emma/Volume 1/Chapter 3">Chapter 3</a></li> <li><a rel="mw:WikiLink" href="./Emma/Volume_1/Chapter_4" title="Emma/Volume 1/Chapter 4">Chapter 4</a></li>

There are probably other changes too; PageParser does lots of things.

Good observation! Taking a look at it. Thank you!

The new format is not consistent. For enwikisource:The_Kiss_and_its_History, the HTML returned is:

<table style="border-collapse:collapse; border-spacing:0px 0px; empty-cells:hide; width:100%;" class="ws-summary"> <tbody><tr> <td style="width:2.5em; max-width:2.5em; padding:0.0em 0.5em 0.0em 0.0em; vertical-align:top; text-align:right;">I. </td> <td><div style="position:relative; width:100%;"><div style="max-width:80%; text-align:left; text-indent:-1.0em; margin-left:1.0em;"><div style="display:inline; position:relative; text-align:left; padding:0.0em 0.5em 0.0em 0.0em; background:white; z-index:2;"><a href="/wiki/The_Kiss_and_its_History/Chapter_1" title="The Kiss and its History/Chapter 1"><span style="font-variant:small-caps">What is a Kiss?</span></a></div></div><div style="position:absolute; left:0px; bottom:0px; width:1.0em; height:1.00em; background:white; z-index:1;"></div><div style="position:absolute; right:0px; bottom:0px; width:100%; overflow:hidden; white-space:nowrap; text-align:right; z-index:0;"><div style="display:inline; float:right;">. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .</div></div></div> </td> <td style="text-align:right; vertical-align:bottom; padding:0.0em 0.0em 0.0em 0.5em; width:2.0em;">1 </td></tr></tbody></table>

I've added a check for both cases. Does anyone know why there are different formats?

It looks like the difference is relative vs absolute wikilinks:

Page:The Kiss and its History.djvu/7: [[The Kiss and its History/Chapter 1|{{sc|What is a Kiss?}}]]

Emma: [[/Volume 1/Chapter 1|Chapter 1]]

samwilson · 2020-10-28T02:13:36Z

src/Util/Api.php

-			$title = $page['title'];
-			if ( isset( $page['revisions'] ) ) {
-				foreach ( $page['revisions'] as $revision ) {
-					return Util::getXhtmlFromContent( $this->getLang(), $revision['*'], $title );


Why remove Util::getXhtmlFromContent()? This method is used to strip comments and turn the HTML into XHTML (as required by epub). It also adds a reference to main.css.

The new API responds with a page that includes the HTML already as follows:

<!DOCTYPE html> <html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="https://en.wikisource.org/wiki/Special:Redirect/revision/9909225"><head prefix="mwr: https://en.wikisource.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="b4fbd150-fa6e-11ea-aa5a-214d3c313213"/><meta charset="utf-8"/><meta property="mw:pageId" content="53313"/><meta property="mw:pageNamespace" content="0"/><link rel="dc:replaces" resource="mwr:revision/9909159"/><meta property="mw:revisionSHA1" content="bbc6e24f9721b0ec0b075a7c31391b4cab85ea96"/><meta property="dc:modified" content="2020-02-03T21:20:39.000Z"/><meta property="mw:html:version" content="2.1.0"/><link rel="dc:isVersionOf" href="//en.wikisource.org/wiki/Poems%2C_Consisting_Chiefly_of_Translations_from_the_Asiatick_Languages"/><title>Poems, Consisting Chiefly of Translations from the Asiatick Languages</title><base href="//en.wikisource.org/wiki/"/><link rel="stylesheet" href="/w/load.php?lang=en&modules=mediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Csite.styles%7Cmediawiki.page.gallery.styles%7Cext.cite.style%7Cext.cite.styles&only=styles&skin=vector"/><meta http-equiv="content-language" content="en"/><meta http-equiv="vary" content="Accept"/></head><body id="mwAA" lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><section...

So we wouldn't need to call the Util::getXhtmlFromContent() entirely since we now just need to remove comments and add XML so I added the logic in the getPageAsync() function.

I didn't want to modify the getXhtmlFromContent() because it is being called in multiple places and the $content there might be retrieved from the old API.

The new API has a reference to different CSS. Do you think it needs to include the main.css reference?

We need to produce XHTML for epubs, and there's also lots of other stuff in the head of the new content that we don't need. For example, we want something like this:

<?xml version="1.0" encoding="UTF-8" ?> <!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" dir="ltr"> <head> <meta content="application/xhtml+xml;charset=UTF-8" http-equiv="default-style"/> <link type="text/css" rel="stylesheet" href="main.css"/> <title>Emma</title> </head>

but are getting:

<!DOCTYPE html> <html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="https://en.wikisource.org/wiki/Special:Redirect/revision/9471884"> <head prefix="mwr: https://en.wikisource.org/wiki/Special:Redirect/"> <meta property="mw:TimeUuid" content="2793b590-fa54-11ea-973d-6144e17084ac"/> <meta charset="utf-8"/> <meta property="mw:pageId" content="3208"/> <meta property="mw:pageNamespace" content="0"/> <link rel="dc:replaces" resource="mwr:revision/9467089"/> <meta property="mw:revisionSHA1" content="164f861d4d643038c58877d4826a0085c9141d4f"/> <meta property="dc:modified" content="2019-07-23T15:52:33.000Z"/> <meta property="mw:html:version" content="2.1.0"/> <link rel="dc:isVersionOf" href="//en.wikisource.org/wiki/Emma"/> <title>Emma</title> <base href="//en.wikisource.org/wiki/"/> <link rel="stylesheet" href="/w/load.php?lang=en&modules=mediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Csite.styles%7Cmediawiki.page.gallery.styles%7Cext.cite.style%7Cext.cite.styles&only=styles&skin=vector"/> <meta http-equiv="content-language" content="en"/> <meta http-equiv="vary" content="Accept"/> </head>

The other two uses of Util::getXhtmlFromContent() are fixable: one is just wanting an empty document, and one is inserting a simple translation of the word "About" (which as far as I can see never contains any HTML, but I haven't checked all Wikisources).

Maybe it's worth encapsulating the changing of this HTML content here though, because it shouldn't ever be needed elsewhere. So it's fine if you want to keep it in getPageAsync(), but it should still use getXhtmlFromContent() because we still want the simplified <head>. Or refactor both to make it simpler, whatever you reckon is best.

The link to main.css is also still required, as that is the main CSS file for the epub (it's a concatenation of the font-family stuff if a font is requested, and the MediaWiki:Epub.css from the Wikisource).

Thank you for the thorough explanation. I made the changes to Util::getXhtmlFromContent() so that it strips out the stuff before <body in the content retrieved from the new API endpoint.

An other maybe cleaner option might be to drop completely this getXhtmlFromContent function and parse the content as HTML in PageParser (maybe using Remex) and to the DOM changes to make it valid XHTML 5 inside of PageParser.

Clearing the Parsoid header removes also, I believe, the TemplateStyles data.

That's a good idea. It'd be nice to have it all centralized in one place. Although, maybe it can be done as a second step; I'm not sure.

The template styles are moved to the head after this, by PageParser::moveStyleToHead(), I think. At least, it looks like it works with stripping the entire head at this point. :)

src/Cleaner/BookCleanerEpub.php

samwilson · 2020-11-16T02:41:55Z

src/Util/Api.php

-			$title = $page['title'];
-			if ( isset( $page['revisions'] ) ) {
-				foreach ( $page['revisions'] as $revision ) {
-					return Util::getXhtmlFromContent( $this->getLang(), $revision['*'], $title );


That's a good idea. It'd be nice to have it all centralized in one place. Although, maybe it can be done as a second step; I'm not sure.

The template styles are moved to the head after this, by PageParser::moveStyleToHead(), I think. At least, it looks like it works with stripping the entire head at this point. :)

samwilson

Footnotes are not working; they link to the full Wikisource domain, rather than the page-local anchor. The old markup looks something like <sup id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> whereas the new looks like <sup id="cite_ref-1"><a href="https://en.wikisource.org/wiki/Page_name#cite_note-1"><span>[1]</span></a></sup> (I've stripped the irrelevant attrs in both cases). The links back from the footnote up to the ref is similarly fully-qualified.
Other fragment anchors are the same (i.e. [[#foo]] doesn't link to another location within the same page, but to the wiki).
A bunch of elements have a data-mw="{"parts" … "i":0}}]}" attribute which is often pretty large and could be nuked. This could be a follow-up. There are lots of other attributes that will probably want to be clean up later too (e.g. the category link clean-up leaves a preceding <span about="#mwt3"> element). I mainly mention the data-mw one because it might increase file sizes a bit (although probably not much).

Everything else looks great. There will be a bit of iterating to do later, but that's fine I think.

cimurah

Footnotes are not working; they link to the full Wikisource domain, rather than the page-local anchor. The old markup looks something like <sup id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> whereas the new looks like <sup id="cite_ref-1"><a href="https://en.wikisource.org/wiki/Page_name#cite_note-1"><span>[1]</span></a></sup> (I've stripped the irrelevant attrs in both cases). The links back from the footnote up to the ref is similarly fully-qualified.

Other fragment anchors are the same (i.e. [[#foo]] doesn't link to another location within the same page, but to the wiki).

A bunch of elements have a data-mw="{"parts" … "i":0}}]}" attribute which is often pretty large and could be nuked. This could be a follow-up. There are lots of other attributes that will probably want to be clean up later too (e.g. the category link clean-up leaves a preceding <span about="#mwt3"> element). I mainly mention the data-mw one because it might increase file sizes a bit (although probably not much).

Everything else looks great. There will be a bit of iterating to do later, but that's fine I think.

I've cleaned up the reference links and removed the data-mw attributes. I see the <span about="#mwt3"> in a few places. I'll remove them.

src/Cleaner/BookCleanerEpub.php

Point getPageAsync to new Parsoid API 'api/rest_v1' Bug: T264788

samwilson

Looks good to me. Let's merge, and iterate on any remaining DOM improvements.

MusikAnimal reviewed Oct 27, 2020

View reviewed changes

cimurah added the WIP Work in progress label Oct 27, 2020

samwilson requested changes Oct 28, 2020

View reviewed changes

cimurah force-pushed the migrate-to-parsoid-api branch 5 times, most recently from acfb2a3 to 1572b6e Compare October 30, 2020 20:33

cimurah requested a review from dayllanmaza October 30, 2020 20:36

cimurah force-pushed the migrate-to-parsoid-api branch 4 times, most recently from fb6695f to 349f00e Compare November 4, 2020 20:14

cimurah force-pushed the migrate-to-parsoid-api branch 4 times, most recently from 167b829 to 2439c82 Compare November 10, 2020 04:51

cimurah requested review from samwilson, Tpt and MusikAnimal November 10, 2020 05:04

samwilson requested changes Nov 16, 2020

View reviewed changes

Base automatically changed from dev to main November 16, 2020 23:22

cimurah force-pushed the migrate-to-parsoid-api branch 6 times, most recently from e30dcf6 to 7e331b4 Compare November 18, 2020 19:53

cimurah removed the WIP Work in progress label Nov 18, 2020

cimurah requested a review from samwilson November 18, 2020 19:54

cimurah force-pushed the migrate-to-parsoid-api branch 3 times, most recently from 5fa2940 to aafd18f Compare November 18, 2020 23:48

cimurah added the Ready for review label Nov 19, 2020

cimurah force-pushed the migrate-to-parsoid-api branch from aafd18f to a7ded38 Compare November 20, 2020 01:17

samwilson requested changes Nov 24, 2020

View reviewed changes

cimurah force-pushed the migrate-to-parsoid-api branch from a7ded38 to 903c70f Compare December 9, 2020 04:42

cimurah commented Dec 9, 2020

View reviewed changes

src/Cleaner/BookCleanerEpub.php Show resolved Hide resolved

cimurah requested a review from samwilson December 9, 2020 04:46

cimurah force-pushed the migrate-to-parsoid-api branch 12 times, most recently from 79b618c to 7dd6ad7 Compare December 11, 2020 21:07

Migrate page content retrieval to Parsoid's API

f6759ca

Point getPageAsync to new Parsoid API 'api/rest_v1' Bug: T264788

cimurah force-pushed the migrate-to-parsoid-api branch from 7dd6ad7 to f6759ca Compare December 11, 2020 21:11

samwilson approved these changes Dec 14, 2020

View reviewed changes

samwilson merged commit 526a5c0 into main Dec 14, 2020

samwilson deleted the migrate-to-parsoid-api branch December 14, 2020 02:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migrate to parsoid api #267

Migrate to parsoid api #267

cimurah commented Oct 27, 2020

MusikAnimal Oct 27, 2020

samwilson left a comment

samwilson Oct 28, 2020

cimurah Oct 29, 2020

cimurah Nov 2, 2020

cimurah Nov 2, 2020

samwilson Nov 2, 2020

samwilson Oct 28, 2020

cimurah Oct 29, 2020

samwilson Oct 29, 2020

cimurah Oct 29, 2020

Tpt Oct 29, 2020

samwilson Nov 16, 2020

samwilson Nov 16, 2020

samwilson left a comment

cimurah left a comment

samwilson left a comment

Migrate to parsoid api #267

Migrate to parsoid api #267

Conversation

cimurah commented Oct 27, 2020

Choose a reason for hiding this comment

samwilson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

samwilson left a comment

Choose a reason for hiding this comment

cimurah left a comment

Choose a reason for hiding this comment

samwilson left a comment

Choose a reason for hiding this comment