Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Migrate to parsoid api #267

Merged
merged 1 commit into from
Dec 14, 2020
Merged

Migrate to parsoid api #267

merged 1 commit into from
Dec 14, 2020

Conversation

cimurah
Copy link
Contributor

@cimurah cimurah commented Oct 27, 2020

No description provided.

tests/Book/RefreshTest.php Show resolved Hide resolved
src/Util/Api.php Outdated
Comment on lines 190 to 195
$url = 'https://' . $this->domainName . '/api/rest_v1/page/html/' . urlencode( $title );
return $this->getAsync( $url )
->then( function ( string $result ) {
if ( $result != '' ) {
$result = preg_replace( '#<\!--(.+)-->#isU', '', $result );
}
return '<?xml version="1.0" encoding="UTF-8" ?>' . $result;
}
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
$url = 'https://' . $this->domainName . '/api/rest_v1/page/html/' . urlencode( $title );
return $this->getAsync( $url )
->then( function ( string $result ) {
if ( $result != '' ) {
$result = preg_replace( '#<\!--(.+)-->#isU', '', $result );
}
return '<?xml version="1.0" encoding="UTF-8" ?>' . $result;
}
);
$url = 'https://' . $this->domainName . '/api/rest_v1/page/html/' . urlencode( $title );
return $this->getAsync( $url )
->then( function ( string $result ) {
if ( $result != '' ) {
$result = preg_replace( '#<\!--(.+)-->#isU', '', $result );
}
return '<?xml version="1.0" encoding="UTF-8" ?>' . $result;
} );

@cimurah cimurah added the WIP Work in progress label Oct 27, 2020
Copy link
Member

@samwilson samwilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good.

The big part of this is to make sure that the new HTML we're getting is correctly parsed and turned into what's required for epub. I suspect there are things that we were doing that are no longer required, and things that now need to be done that didn't before.

README.md Outdated Show resolved Hide resolved
src/Util/Api.php Outdated
throw new HttpException( 500, 'No page information found in response' );
}
throw new NotFoundHttpException( "Page revision not found for: $title" );
$url = 'https://' . $this->domainName . '/api/rest_v1/page/html/' . urlencode( $title );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structure of the HTML returned by this API is different to that of the old API, and so the PageParser has to be updated. For example, this is looking for /wiki/ at the beginning of href values, but these are now relative and don't contain that string.

For example, the old HTML of enwikisource:Emma is:

<h3><span class="mw-headline" id="Volume_1">Volume 1</span></h3>
<ul><li><a href="/wiki/Emma/Volume_1/Chapter_1" title="Emma/Volume 1/Chapter 1">Chapter 1</a></li>
<li><a href="/wiki/Emma/Volume_1/Chapter_2" title="Emma/Volume 1/Chapter 2">Chapter 2</a></li>
<li><a href="/wiki/Emma/Volume_1/Chapter_3" title="Emma/Volume 1/Chapter 3">Chapter 3</a></li>
<li><a href="/wiki/Emma/Volume_1/Chapter_4" title="Emma/Volume 1/Chapter 4">Chapter 4</a></li>

while the new HTML is:

<section data-mw-section-id="-1" id="mwDQ"><h3 id="Volume_1">Volume 1</h3>
<ul><li><a rel="mw:WikiLink" href="./Emma/Volume_1/Chapter_1" title="Emma/Volume 1/Chapter 1">Chapter 1</a></li>
<li><a rel="mw:WikiLink" href="./Emma/Volume_1/Chapter_2" title="Emma/Volume 1/Chapter 2">Chapter 2</a></li>
<li><a rel="mw:WikiLink" href="./Emma/Volume_1/Chapter_3" title="Emma/Volume 1/Chapter 3">Chapter 3</a></li>
<li><a rel="mw:WikiLink" href="./Emma/Volume_1/Chapter_4" title="Emma/Volume 1/Chapter 4">Chapter 4</a></li>

There are probably other changes too; PageParser does lots of things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good observation! Taking a look at it. Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new format is not consistent. For enwikisource:The_Kiss_and_its_History, the HTML returned is:

<table style="border-collapse:collapse; border-spacing:0px 0px; empty-cells:hide; width:100%;" class="ws-summary">
<tbody><tr>
<td style="width:2.5em; max-width:2.5em; padding:0.0em 0.5em 0.0em 0.0em; vertical-align:top; text-align:right;">I.
</td>
<td><div style="position:relative; width:100%;"><div style="max-width:80%; text-align:left; text-indent:-1.0em; margin-left:1.0em;"><div style="display:inline; position:relative; text-align:left; padding:0.0em 0.5em 0.0em 0.0em; background:white; z-index:2;"><a href="/wiki/The_Kiss_and_its_History/Chapter_1" title="The Kiss and its History/Chapter 1"><span style="font-variant:small-caps">What is a Kiss?</span></a></div></div><div style="position:absolute; left:0px; bottom:0px; width:1.0em; height:1.00em; background:white; z-index:1;"></div><div style="position:absolute; right:0px; bottom:0px; width:100%; overflow:hidden; white-space:nowrap; text-align:right; z-index:0;"><div style="display:inline; float:right;">. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .</div></div></div>
</td>
<td style="text-align:right; vertical-align:bottom; padding:0.0em 0.0em 0.0em 0.5em; width:2.0em;">1
</td></tr></tbody></table>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a check for both cases. Does anyone know why there are different formats?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the difference is relative vs absolute wikilinks:

$title = $page['title'];
if ( isset( $page['revisions'] ) ) {
foreach ( $page['revisions'] as $revision ) {
return Util::getXhtmlFromContent( $this->getLang(), $revision['*'], $title );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove Util::getXhtmlFromContent()? This method is used to strip comments and turn the HTML into XHTML (as required by epub). It also adds a reference to main.css.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new API responds with a page that includes the HTML already as follows:

 <!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="https://en.wikisource.org/wiki/Special:Redirect/revision/9909225"><head prefix="mwr: https://en.wikisource.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="b4fbd150-fa6e-11ea-aa5a-214d3c313213"/><meta charset="utf-8"/><meta property="mw:pageId" content="53313"/><meta property="mw:pageNamespace" content="0"/><link rel="dc:replaces" resource="mwr:revision/9909159"/><meta property="mw:revisionSHA1" content="bbc6e24f9721b0ec0b075a7c31391b4cab85ea96"/><meta property="dc:modified" content="2020-02-03T21:20:39.000Z"/><meta property="mw:html:version" content="2.1.0"/><link rel="dc:isVersionOf" href="//en.wikisource.org/wiki/Poems%2C_Consisting_Chiefly_of_Translations_from_the_Asiatick_Languages"/><title>Poems, Consisting Chiefly of Translations from the Asiatick Languages</title><base href="//en.wikisource.org/wiki/"/><link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=mediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Csite.styles%7Cmediawiki.page.gallery.styles%7Cext.cite.style%7Cext.cite.styles&amp;only=styles&amp;skin=vector"/><meta http-equiv="content-language" content="en"/><meta http-equiv="vary" content="Accept"/></head><body id="mwAA" lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><section...

So we wouldn't need to call the Util::getXhtmlFromContent() entirely since we now just need to remove comments and add XML so I added the logic in the getPageAsync() function.

I didn't want to modify the getXhtmlFromContent() because it is being called in multiple places and the $content there might be retrieved from the old API.

The new API has a reference to different CSS. Do you think it needs to include the main.css reference?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to produce XHTML for epubs, and there's also lots of other stuff in the head of the new content that we don't need. For example, we want something like this:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" dir="ltr">
<head>
	<meta content="application/xhtml+xml;charset=UTF-8" http-equiv="default-style"/>
	<link type="text/css" rel="stylesheet" href="main.css"/>
	<title>Emma</title>
</head>

but are getting:

<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/"
	  about="https://en.wikisource.org/wiki/Special:Redirect/revision/9471884">
<head prefix="mwr: https://en.wikisource.org/wiki/Special:Redirect/">
	<meta property="mw:TimeUuid" content="2793b590-fa54-11ea-973d-6144e17084ac"/>
	<meta charset="utf-8"/>
	<meta property="mw:pageId" content="3208"/>
	<meta property="mw:pageNamespace" content="0"/>
	<link rel="dc:replaces" resource="mwr:revision/9467089"/>
	<meta property="mw:revisionSHA1" content="164f861d4d643038c58877d4826a0085c9141d4f"/>
	<meta property="dc:modified" content="2019-07-23T15:52:33.000Z"/>
	<meta property="mw:html:version" content="2.1.0"/>
	<link rel="dc:isVersionOf" href="//en.wikisource.org/wiki/Emma"/>
	<title>Emma</title>
	<base href="//en.wikisource.org/wiki/"/>
	<link rel="stylesheet"
		  href="/w/load.php?lang=en&amp;modules=mediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Csite.styles%7Cmediawiki.page.gallery.styles%7Cext.cite.style%7Cext.cite.styles&amp;only=styles&amp;skin=vector"/>
	<meta http-equiv="content-language" content="en"/>
	<meta http-equiv="vary" content="Accept"/>
</head>

The other two uses of Util::getXhtmlFromContent() are fixable: one is just wanting an empty document, and one is inserting a simple translation of the word "About" (which as far as I can see never contains any HTML, but I haven't checked all Wikisources).

Maybe it's worth encapsulating the changing of this HTML content here though, because it shouldn't ever be needed elsewhere. So it's fine if you want to keep it in getPageAsync(), but it should still use getXhtmlFromContent() because we still want the simplified <head>. Or refactor both to make it simpler, whatever you reckon is best.

The link to main.css is also still required, as that is the main CSS file for the epub (it's a concatenation of the font-family stuff if a font is requested, and the MediaWiki:Epub.css from the Wikisource).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the thorough explanation. I made the changes to Util::getXhtmlFromContent() so that it strips out the stuff before <body in the content retrieved from the new API endpoint.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An other maybe cleaner option might be to drop completely this getXhtmlFromContent function and parse the content as HTML in PageParser (maybe using Remex) and to the DOM changes to make it valid XHTML 5 inside of PageParser.

Clearing the Parsoid header removes also, I believe, the TemplateStyles data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. It'd be nice to have it all centralized in one place. Although, maybe it can be done as a second step; I'm not sure.

The template styles are moved to the head after this, by PageParser::moveStyleToHead(), I think. At least, it looks like it works with stripping the entire head at this point. :)

@cimurah cimurah force-pushed the migrate-to-parsoid-api branch 5 times, most recently from acfb2a3 to 1572b6e Compare October 30, 2020 20:33
@cimurah cimurah requested a review from dayllanmaza October 30, 2020 20:36
@cimurah cimurah force-pushed the migrate-to-parsoid-api branch 4 times, most recently from fb6695f to 349f00e Compare November 4, 2020 20:14
@cimurah cimurah force-pushed the migrate-to-parsoid-api branch 4 times, most recently from 167b829 to 2439c82 Compare November 10, 2020 04:51
src/Cleaner/BookCleanerEpub.php Show resolved Hide resolved
$title = $page['title'];
if ( isset( $page['revisions'] ) ) {
foreach ( $page['revisions'] as $revision ) {
return Util::getXhtmlFromContent( $this->getLang(), $revision['*'], $title );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. It'd be nice to have it all centralized in one place. Although, maybe it can be done as a second step; I'm not sure.

The template styles are moved to the head after this, by PageParser::moveStyleToHead(), I think. At least, it looks like it works with stripping the entire head at this point. :)

Base automatically changed from dev to main November 16, 2020 23:22
@cimurah cimurah force-pushed the migrate-to-parsoid-api branch 6 times, most recently from e30dcf6 to 7e331b4 Compare November 18, 2020 19:53
@cimurah cimurah removed the WIP Work in progress label Nov 18, 2020
@cimurah cimurah requested a review from samwilson November 18, 2020 19:54
@cimurah cimurah force-pushed the migrate-to-parsoid-api branch 3 times, most recently from 5fa2940 to aafd18f Compare November 18, 2020 23:48
@cimurah cimurah force-pushed the migrate-to-parsoid-api branch from aafd18f to a7ded38 Compare November 20, 2020 01:17
Copy link
Member

@samwilson samwilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Footnotes are not working; they link to the full Wikisource domain, rather than the page-local anchor. The old markup looks something like <sup id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> whereas the new looks like <sup id="cite_ref-1"><a href="https://en.wikisource.org/wiki/Page_name#cite_note-1"><span>[1]</span></a></sup> (I've stripped the irrelevant attrs in both cases). The links back from the footnote up to the ref is similarly fully-qualified.
  2. Other fragment anchors are the same (i.e. [[#foo]] doesn't link to another location within the same page, but to the wiki).
  3. A bunch of elements have a data-mw="{&quot;parts&quot; … &quot;i&quot;:0}}]}" attribute which is often pretty large and could be nuked. This could be a follow-up. There are lots of other attributes that will probably want to be clean up later too (e.g. the category link clean-up leaves a preceding <span about="#mwt3"> element). I mainly mention the data-mw one because it might increase file sizes a bit (although probably not much).

Everything else looks great. There will be a bit of iterating to do later, but that's fine I think.

@cimurah cimurah force-pushed the migrate-to-parsoid-api branch from a7ded38 to 903c70f Compare December 9, 2020 04:42
Copy link
Contributor Author

@cimurah cimurah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Footnotes are not working; they link to the full Wikisource domain, rather than the page-local anchor. The old markup looks something like <sup id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> whereas the new looks like <sup id="cite_ref-1"><a href="https://en.wikisource.org/wiki/Page_name#cite_note-1"><span>[1]</span></a></sup> (I've stripped the irrelevant attrs in both cases). The links back from the footnote up to the ref is similarly fully-qualified.
  2. Other fragment anchors are the same (i.e. [[#foo]] doesn't link to another location within the same page, but to the wiki).
  3. A bunch of elements have a data-mw="{&quot;parts&quot; … &quot;i&quot;:0}}]}" attribute which is often pretty large and could be nuked. This could be a follow-up. There are lots of other attributes that will probably want to be clean up later too (e.g. the category link clean-up leaves a preceding <span about="#mwt3"> element). I mainly mention the data-mw one because it might increase file sizes a bit (although probably not much).

Everything else looks great. There will be a bit of iterating to do later, but that's fine I think.

I've cleaned up the reference links and removed the data-mw attributes. I see the <span about="#mwt3"> in a few places. I'll remove them.

src/Cleaner/BookCleanerEpub.php Show resolved Hide resolved
@cimurah cimurah requested a review from samwilson December 9, 2020 04:46
@cimurah cimurah force-pushed the migrate-to-parsoid-api branch 12 times, most recently from 79b618c to 7dd6ad7 Compare December 11, 2020 21:07
Point getPageAsync to new Parsoid API 'api/rest_v1'

Bug: T264788
@cimurah cimurah force-pushed the migrate-to-parsoid-api branch from 7dd6ad7 to f6759ca Compare December 11, 2020 21:11
Copy link
Member

@samwilson samwilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Let's merge, and iterate on any remaining DOM improvements.

@samwilson samwilson merged commit 526a5c0 into main Dec 14, 2020
@samwilson samwilson deleted the migrate-to-parsoid-api branch December 14, 2020 02:16
# for free to join this conversation on GitHub. Already have an account? # to comment
Development

Successfully merging this pull request may close these issues.

4 participants