[Data Liberation] WP_WXR_Reader (#1972)
This PR introduces the `WP_WXR_Reader` class for parsing WordPress
eXtended RSS (WXR) files, along with supporting improvements to the XML
processing infrastructure.

**Note: `WP_WXR_Reader` is just a reader. It won't actually import the
data into WordPress** – that part is coming soon.

A part of #1894

## Motivation

No existing WordPress importer checks all of these boxes:

* Supports 100GB+ WXR files without running out of memory
* Can pause and resume along the way 
* Can resume even after a fatal error
* Can run without libxml and mbstring
* Is really fast

`WP_WXR_Reader` is a step in that direction. It cannot pause and resume
yet, but the next few PRs will add that feature.

## Implementation

`WP_WXR_Reader` uses the `WP_XML_Processor` to find XML tags
representing meaningful WordPress entities. The reader knows the WXR
schema and only looks for relevant elements. For example, it knows that
posts are stored in `rss > channel > item` and comments are stored in
`rss > channel > item > wp:comment`.

The `$wxr->next_entity()` method stream-parses the next entity from the
WXR document and exposes it to the API consumer via
`$wxr->get_entity_type()` and `$wxr->get_entity_data()`. The reader
remembers where parsing stopped, so the next call to
`$wxr->next_entity()` parses the next entity after that point.

```php
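$fp = fopen('my-wxr-file.xml', 'r');

$wxr_reader = WP_WXR_Reader::from_stream();
$input_finished = false;
while(true) {
    if($wxr_reader->next_entity()) {
        switch ( $wxr_reader->get_entity_type() ) {
            case 'post':
                // ... process post ...
                break;

            case 'comment':
                // ... process comment ...
                break;

            case 'site_option':
                // ... process site option ...
                break;

            // ... process other entity types ...
        }
        continue;
    }

    // Next entity not found. If the reader already knows the input is
    // finished, there is nothing left to parse.
    if($input_finished) {
        break;
    }

    // Otherwise the reader ran out of data to process.
    // Let's feed another chunk of bytes to the reader.
    $chunk = feof($fp) ? false : fread($fp, 8192);
    if(false === $chunk || '' === $chunk) {
        // No more bytes to read. Let the reader know, then give it
        // one more chance to emit an entity before leaving the loop.
        $wxr_reader->input_finished();
        $input_finished = true;
        continue;
    }
    $wxr_reader->append_bytes($chunk);
}
fclose($fp);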
```

Similarly to `WP_XML_Processor`, the `WP_WXR_Reader` enters a paused
state when it doesn't have enough XML bytes to parse the entire entity.

The _next_entity() -> fread -> break_ usage pattern may seem a bit
tedious. This is expected. Even though the WXR parsing part of
`WP_WXR_Reader` offers a high-level API, working with byte streams
requires reasoning at a much lower level. The `StreamChain` class
shipped in this repository will make the reader easier to consume with
its transformation-oriented API for chaining data processors.

### Supported WordPress entities

* posts – sourced from `<item>` tags
* comments – sourced from `<wp:comment>` tags
* comment meta – sourced from `<wp:commentmeta>` tags
* users – sourced from `<wp:author>` tags
* post meta – sourced from `<wp:postmeta>` tags
* terms – sourced from `<wp:term>` tags
* tags – sourced from `<wp:tag>` tags
* categories – sourced from `<wp:category>` tags

## Caveats

### Extensibility

`WP_WXR_Reader` ignores any XML elements it doesn't recognize. The WXR
format is extensible, so in the future the reader may support
registering custom handlers for unknown tags.

### Nested entities intertwined with data

`WP_WXR_Reader` flushes the current entity whenever another entity
starts. The upside is simplicity and a tiny memory footprint. The
downside is that it's possible to craft a WXR document where some
information would be lost. For example:

```xml
<rss>
	<channel>
		<item>
		  <title>Page with comments</title>
		  <link>http://wpthemetestdata.wordpress.com/about/page-with-comments/</link>
		  <wp:postmeta>
		    <wp:meta_key>_wp_page_template</wp:meta_key>
		    <wp:meta_value><![CDATA[default]]></wp:meta_value>
		  </wp:postmeta>
		  <wp:post_id>146</wp:post_id>
		</item>
	</channel>
</rss>
```
`WP_WXR_Reader` would accumulate post data until the `<wp:postmeta>`
opener. Then it would emit a `post` entity and accumulate the meta
information until the `</wp:postmeta>` closer. Then it would advance to
`<wp:post_id>` and **ignore it**.

This is not a problem in any of the `.wxr` files I have seen. Still, it
is important to note this limitation. It is possible there is a `.wxr`
generator somewhere out there that intertwines post fields with post
meta and comments. If this ever comes up, we could:

* Emit the `post` entity first, then all the nested entities, and then
emit a special `post_update` entity.
* Do multiple passes over the WXR file, one for each level of nesting,
e.g. 1. insert posts, 2. insert comments, 3. insert comment meta.

Buffering all the post meta and comments seems like a bad idea – there
might be gigabytes of data.
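
The information loss described in this section can be reproduced with a
small toy model. The sketch below is not the `WP_WXR_Reader`
implementation: `toy_emit_entities()` and its event tuples are
hypothetical names made up for illustration. It only assumes the
behavior described above, where a new entity flushes the one in
progress and fields that arrive while no entity is open are dropped.

```php
/**
 * Toy model of "flush the current entity when another entity starts".
 * NOT the actual WP_WXR_Reader code; all names here are hypothetical.
 *
 * @param array $events Tuples of [kind, name, value] in document order,
 *                      where kind is 'open', 'field', or 'close'.
 * @return array Emitted entities, each with 'type' and 'data' keys.
 */
function toy_emit_entities( array $events ): array {
	$emitted = array();
	$current = null; // The entity being accumulated, if any.

	foreach ( $events as list( $kind, $name, $value ) ) {
		if ( 'open' === $kind ) {
			// A new entity starts: flush whatever was in progress.
			if ( null !== $current ) {
				$emitted[] = $current;
			}
			$current = array( 'type' => $name, 'data' => array() );
		} elseif ( 'field' === $kind ) {
			// Fields are only kept while an entity is being accumulated.
			if ( null !== $current ) {
				$current['data'][ $name ] = $value;
			}
		} elseif ( 'close' === $kind && null !== $current && $current['type'] === $name ) {
			// The closer of the entity in progress: flush it.
			$emitted[] = $current;
			$current   = null;
		}
	}

	return $emitted;
}

// The <item> from the XML example above (abridged), flattened into events.
$entities = toy_emit_entities(
	array(
		array( 'open', 'post', null ),
		array( 'field', 'title', 'Page with comments' ),
		array( 'open', 'post_meta', null ),
		array( 'field', 'meta_key', '_wp_page_template' ),
		array( 'field', 'meta_value', 'default' ),
		array( 'close', 'post_meta', null ),
		array( 'field', 'post_id', '146' ), // Arrives after the post was flushed.
		array( 'close', 'post', null ),
	)
);

foreach ( $entities as $entity ) {
	echo $entity['type'], ': ', implode( ', ', array_keys( $entity['data'] ) ), "\n";
}
```

In this model the emitted `post` entity carries only `title`, and
`post_id` never reaches any entity. A `post_update` entity or a
multi-pass import, as suggested above, would be two ways to recover it
without buffering the nested entities.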

## Future Plans

The next phase will add pause/resume functionality to handle timeout
scenarios:

- Save parser state after each entity or every `n` entities to speed it
up. Then also save the `n` for a quick rewind after resuming.
- Resume parsing from saved state.

## Testing Instructions

Read the tests and ponder whether they make sense. Confirm the PHPUnit
test suite passed on CI. The test suite includes coverage for various
WXR formats and streaming behaviors.
adamziel authored Nov 2, 2024
1 parent d03263e commit 2b1f0b6
Showing 24 changed files with 227,290 additions and 92 deletions.
40 changes: 38 additions & 2 deletions packages/playground/data-liberation/bootstrap.php
@@ -32,12 +32,14 @@
require_once __DIR__ . '/src/xml-api/WP_XML_Decoder.php';
require_once __DIR__ . '/src/xml-api/WP_XML_Processor.php';
require_once __DIR__ . '/src/WP_WXR_URL_Rewrite_Processor.php';

require_once __DIR__ . '/src/WP_WXR_Reader.php';
require_once __DIR__ . '/src/utf8_decoder.php';
require_once __DIR__ . '/vendor/autoload.php';


// Polyfill WordPress core functions
$GLOBALS['_doing_it_wrong_messages'] = [];
function _doing_it_wrong($method, $message, $version) {
	$GLOBALS['_doing_it_wrong_messages'][] = $message;
}

function __($input) {
@@ -77,3 +79,37 @@ function wp_kses_uri_attributes() {
		'xmlns',
	);
}

function mbstring_binary_safe_encoding( $reset = false ) {
	static $encodings = array();
	static $overloaded = null;

	if ( is_null( $overloaded ) ) {
		if ( function_exists( 'mb_internal_encoding' )
			&& ( (int) ini_get( 'mbstring.func_overload' ) & 2 ) // phpcs:ignore PHPCompatibility.IniDirectives.RemovedIniDirectives.mbstring_func_overloadDeprecated
		) {
			$overloaded = true;
		} else {
			$overloaded = false;
		}
	}

	if ( false === $overloaded ) {
		return;
	}

	if ( ! $reset ) {
		$encoding = mb_internal_encoding();
		array_push( $encodings, $encoding );
		mb_internal_encoding( 'ISO-8859-1' );
	}

	if ( $reset && $encodings ) {
		$encoding = array_pop( $encodings );
		mb_internal_encoding( $encoding );
	}
}

function reset_mbstring_encoding() {
	mbstring_binary_safe_encoding( true );
}
1 change: 1 addition & 0 deletions packages/playground/data-liberation/phpunit.xml
@@ -2,6 +2,7 @@
<phpunit xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" bootstrap="bootstrap.php" colors="true" xsi:noNamespaceSchemaLocation="https://schema.phpunit.de/10.0/phpunit.xsd" cacheDirectory=".phpunit.cache">
<testsuites>
<testsuite name="Application Test Suite">
<file>tests/WPWXRReaderTests.php</file>
<file>tests/WPWXRURLRewriterTests.php</file>
<file>tests/WPRewriteUrlsTests.php</file>
<file>tests/WPURLInTextProcessorTests.php</file>