Tracking Issue: Next-gen PHP Importers for Data Liberation #1894

Open
36 of 91 tasks
adamziel opened this issue Oct 14, 2024 · 2 comments · Fixed by #1960
Labels
[Aspect] Data Liberation [Type] Project [Type] Tracking Tactical breakdown of efforts across the codebase and/or tied to Overview issues.

Comments

@adamziel
Collaborator

adamziel commented Oct 14, 2024

Next Gen importers

This issue tracks the work related to Data Liberation Phase 2: Importing and Exporting Structured Data, that is:

  • Parsers
  • Importers
  • User and developer tools.

WordPress needs parsers. Not just any parsers, but parsers that are streaming, re-entrant, fast, standards-compliant, and tested against a large body of possible inputs. A seemingly simple task such as moving a post to another website requires rewriting the URLs in that post, downloading the assets, and handling network failures. More complex tasks, such as importing a WXR file or transferring an entire site, are even more demanding.

WordPress also needs importers. Not just any importers, but importers that can handle large quantities of data in a multitude of data formats, are extensible, and can proceed even when they encounter an error in the middle of the process. The WP_Stream_Importer class explored in this project is designed to fulfill these goals – see specific PRs below.

Finally, WordPress needs user and developer tools to use these importers. Not just any tools, but tools that work on the web, in CLI, in the Playground, guide the user with useful progress updates, and provide useful recovery paths when the inevitable errors occur. The work tracked here focuses on a wp-admin page, but the PHP software components are designed for easy reuse outside of wp-admin.

Tracking – ongoing Issues and PRs

Parsing

Exporting

Importing

Data formats

Reliability

UI

Other

Related resources

Next phases: Future Data Liberation roadmap

Note

The ideas below are the next phases of the project. They stretch far beyond the medium-term importers work tracked in this issue and only live here to paint the big picture.

  • WXR imports
    • Fork https://github.com/humanmade/WordPress-Importer. Give attribution to the original team, ping them and start a conversation
    • Port it to WP_XML_Tag_Processor
    • Start using that fork for importing WXR files in Playground
    • Rewrite the imported site URLs
    • Use AsyncHTTP\Client for fetching assets
    • Make it resumable if it fails halfway through
    • Report progress information to the user
    • Surface errors to the user, ask how to handle them
    • Use in Blueprints
    • Sort the imported entities in topological order
    • Test with tricky inputs
    • Create WP CLI command
    • Create a good looking wp-admin page
    • Publish it as a standalone plugin to start gathering feedback and bug reports
  • Extensibility
  • Markdown workflow for editing existing documentation sites from GitHub
    • Markdown importer
    • Markdown exporter – migrate @dmsnell's Markdown <-> Block markup TypeScript converter from https://github.com/dmsnell/blocky-formats to PHP
    • Discuss using Playground to edit Playground docs, Gutenberg docs, and potentially all WordPress docs
    • Discuss using it as a drop-in static site generator replacement (e.g. Jekyll)
  • Static block markup editor
  • Reliable Playground ZIP export / import
    • Fork the Sandbox Site plugin
    • Improve the SQL export to make it streamable and ensure there are absolutely no issues with escaping
    • Rewrite the exported and imported site URLs
    • Include extension points to enable custom treatment of any block attribute, database row etc. See one of the GitHub discussions referenced in Kickoff Data Liberation: Let's Build WordPress-first Data Migration Tools #1888
    • Consider shipping .sql files with the export to potentially enable importing the resulting .zip in a regular MySQL-based server environment
    • ...anything else actually?
  • "Duplicate Playground" feature
    • Iteration 1: Pipe the ZIP export to ZIP import
    • Iteration 2: Mount /wordpress-new in the duplicated Playground instance, run the PHP export/import code to migrate the site from /wordpress there
    • Iteration 3: Keep track of progress, make it resumable regardless of when the process is interrupted. This would enable exporting really big sites
  • Direct WordPress <-> WordPress transfer
    • Conceptually, this is like running Duplicate Playground over the internet
    • Important to keep track of progress and resource versions using a vector clock
    • Export / Import UI with scope (users? posts? etc.), error info (image.jpg couldn't be fetched after 3 retries), and error resolution mechanism (specify a different url? upload that image? retry 4th time?)
  • Live WordPress <-> WordPress data sync
    • Run the WordPress <-> WordPress transfer in a continuous way.
    • This is not about collaborative editing in the block editor, although there is likely an overlap around data synchronization.
  • Importers version 2 and beyond
    • Subtasks outlined in [Data Liberation] Entity Stream Importer
    • Import one post at a time, not "all static assets" and then "all posts". Identify each post's dependency graph and frontload that post's dependent data first.
    • Resume downloading .partial assets when a paused import is resumed.
    • Resource quotas
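
A minimal sketch of the "topological order" idea from the roadmap above, assuming each entity can list the IDs of the entities it depends on. The function name `sketch_topological_order()` and the ID format are hypothetical and not part of the Data Liberation library; the sketch uses Kahn's algorithm, which may or may not be what the importer ends up using.

```php
<?php
/**
 * Hypothetical helper (not part of the library): orders entity IDs so that
 * every entity comes after the entities it depends on (Kahn's algorithm).
 *
 * @param array $depends_on Map of entity ID => list of IDs it depends on.
 * @return array Entity IDs in a dependency-safe order.
 */
function sketch_topological_order( array $depends_on ): array {
    $pending_count = array(); // How many unmet dependencies each entity has.
    $dependents    = array(); // Reverse edges: dependency => entities waiting on it.
    foreach ( $depends_on as $id => $deps ) {
        $pending_count[ $id ] = $pending_count[ $id ] ?? 0;
        foreach ( $deps as $dep ) {
            $pending_count[ $dep ] = $pending_count[ $dep ] ?? 0;
            $pending_count[ $id ]++;
            $dependents[ $dep ][] = $id;
        }
    }

    // Entities without dependencies can be imported right away.
    $ready = array();
    foreach ( $pending_count as $id => $count ) {
        if ( 0 === $count ) {
            $ready[] = $id;
        }
    }

    $ordered = array();
    while ( $ready ) {
        $id        = array_shift( $ready );
        $ordered[] = $id;
        foreach ( $dependents[ $id ] ?? array() as $dependent ) {
            if ( 0 === --$pending_count[ $dependent ] ) {
                $ready[] = $dependent;
            }
        }
    }

    if ( count( $ordered ) !== count( $pending_count ) ) {
        throw new RuntimeException( 'Dependency cycle detected.' );
    }
    return $ordered;
}

// Illustrative IDs only: a comment needs its post, the post needs its author and image.
$order = sketch_topological_order( array(
    'comment:7'    => array( 'post:2' ),
    'post:2'       => array( 'attachment:5', 'user:1' ),
    'attachment:5' => array( 'user:1' ),
) );
// $order is: user:1, attachment:5, post:2, comment:7
```

Importing one post at a time would apply the same ordering to a single post's dependency graph instead of the whole site.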
@adamziel adamziel added [Aspect] Data Liberation [Type] Project [Type] Tracking Tactical breakdown of efforts across the codebase and/or tied to Overview issues. labels Oct 14, 2024
@bgrgicak bgrgicak moved this from Inbox to In progress in Playground Board Oct 15, 2024
@adamziel adamziel moved this from In progress to Project: In Progress in Playground Board Oct 16, 2024
adamziel added a commit that referenced this issue Oct 28, 2024
A part of #1894.
Follows up on
#1893.

This PR brings in a few more PHP APIs that were initially explored
outside of Playground so that they can be incubated in Playground. See
the linked descriptions for more details about each API:

* XML Processor from
WordPress/wordpress-develop#6713
* Stream chain from adamziel/wxr-normalize#1
* A draft of a WXR URL Rewriter class capable of rewriting URLs in WXR
files

## Testing instructions

* Confirm the PHPUnit tests pass in CI
* Confirm the test suite looks reasonable
* That's it for now! It's all new code that's not actually used anywhere
in Playground yet. I just want to merge it to keep iterating and
improving.
adamziel added a commit that referenced this issue Oct 30, 2024
A part of #1894.

Adds https://github.com/WordPress/blueprints-library as a git submodule
to the data-liberation package to enable easy code reuse between the
projects. I'm not yet sure, but perhaps moving all the PHP libraries to
the blueprints-library would make sense? TBD

No testing instructions. This is just a new submodule. No code changes
are involved.
adamziel added a commit that referenced this issue Oct 31, 2024
…essor (#1960)

Merge `WP_XML_Tag_Processor` and `WP_XML_Processor` into a single
`WP_XML_Processor` class. This reduces abstractions, enables keeping
more properties as private, and simplifies the code.

Related to #1894
and WordPress/wordpress-develop#6713

 ## Testing instructions

Confirm the CI tests pass.
@github-project-automation github-project-automation bot moved this from Project: In Progress to Done in Playground Board Oct 31, 2024
@brandonpayton
Member

@adamziel I think this may have been accidentally closed when #1960 was merged because it was "Related to" this one. There are a good number of tasks left unfinished, and this closing looks automated rather than intentional.

I'll reopen, and you can close again if it was intentional.

@adamziel
Collaborator Author

adamziel commented Nov 2, 2024

Let's also review Automattic's VIP WXR importer for going from WXR reading to importing:

https://github.com/search?q=repo%3AAutomattic%2Fvip-go-mu-plugins%20wxr&type=code

adamziel added a commit that referenced this issue Nov 2, 2024
This PR introduces the `WP_WXR_Reader` class for parsing WordPress
eXtended RSS (WXR) files, along with supporting improvements to the XML
processing infrastructure.

**Note: `WP_WXR_Reader` is just a reader. It won't actually import the
data into WordPress** – that part is coming soon.

A part of #1894

## Motivation

There is no WordPress importer that would check all these boxes:

* Supports 100GB+ WXR files without running out of memory
* Can pause and resume along the way 
* Can resume even after a fatal error
* Can run without libxml and mbstring
* Is really fast

`WP_WXR_Reader` is a step in that direction. It cannot pause and resume
yet, but the next few PRs will add that feature.

## Implementation

`WP_WXR_Reader` uses the `WP_XML_Processor` to find XML tags
representing meaningful WordPress entities. The reader knows the WXR
schema and only looks for relevant elements. For example, it knows that
posts are stored in `rss > channel > item` and comments are stored in
`rss > channel > item > wp:comment`.

The `$wxr->next_entity()` method stream-parses the next entity from the
WXR document and exposes it to the API consumer via
`$wxr->get_entity_type()` and `$wxr->get_entity_date()`. The next call
to `$wxr->next_entity()` remembers where the parsing has stopped and
parses the next entity after that point.

```php
$fp = fopen('my-wxr-file.xml', 'r');

$wxr_reader = WP_WXR_Reader::from_stream();
while(true) {
    if($wxr_reader->next_entity()) {
        switch ( $wxr_reader->get_entity_type() ) {
            case 'post':
                // ... process post ...
                break;

            case 'comment':
                // ... process comment ...
                break;

            case 'site_option':
                // ... process site option ...
                break;

            // ... process other entity types ...
        }
        continue;
    }

    // Next entity not found – we ran out of data to process.
    // Let's feed another chunk of bytes to the reader.

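    // Stop once the entire file has been read.
    if(feof($fp)) {
        break;
    }

    $chunk = fread($fp, 8192);
    if(false === $chunk) {
        // Reading failed; tell the reader that no more bytes are coming.
        $wxr_reader->input_finished();
        continue;
    }
    $wxr_reader->append_bytes($chunk);
}
fclose($fp);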
```

Similarly to `WP_XML_Processor`, the `WP_WXR_Reader` enters a paused
state when it doesn't have enough XML bytes to parse the entire entity.
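
To illustrate the paused state in isolation, here is a minimal sketch that swaps XML for newline-delimited records. The `Sketch_Chunked_Line_Reader` class is hypothetical and only mirrors the method names used above (`append_bytes()`, `input_finished()`); it says nothing about how `WP_WXR_Reader` is implemented internally.

```php
<?php
/**
 * Hypothetical illustration only; not part of the Data Liberation library.
 * Emits newline-delimited records from bytes that arrive in arbitrary chunks.
 */
class Sketch_Chunked_Line_Reader {
    private $buffer   = '';
    private $finished = false;
    private $record   = null;

    public function append_bytes( string $bytes ): void {
        $this->buffer .= $bytes;
    }

    public function input_finished(): void {
        $this->finished = true;
    }

    /** Returns true when a full record is available, false when paused or done. */
    public function next_record(): bool {
        $newline_at = strpos( $this->buffer, "\n" );
        if ( false !== $newline_at ) {
            $this->record = substr( $this->buffer, 0, $newline_at );
            $this->buffer = (string) substr( $this->buffer, $newline_at + 1 );
            return true;
        }
        // Without a delimiter, the tail is only a record once the input has ended.
        if ( $this->finished && '' !== $this->buffer ) {
            $this->record = $this->buffer;
            $this->buffer = '';
            return true;
        }
        return false; // Paused: more bytes are needed.
    }

    public function get_record(): ?string {
        return $this->record;
    }
}

$reader  = new Sketch_Chunked_Line_Reader();
$records = array();
// Chunk boundaries deliberately fall in the middle of records.
foreach ( array( "first\nsec", "ond\nthi", "rd" ) as $chunk ) {
    $reader->append_bytes( $chunk );
    while ( $reader->next_record() ) {
        $records[] = $reader->get_record();
    }
}
$reader->input_finished();
while ( $reader->next_record() ) {
    $records[] = $reader->get_record();
}
// $records is: first, second, third
```

The key property is that a record split across two chunks is never emitted partially: the reader returns `false` and waits for more bytes.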

The _next_entity() -> fread -> break_ usage pattern may seem a bit
tedious. This is expected. Even if the WXR parsing part of the
`WP_WXR_Reader` offers a high-level API, working with byte streams
requires reasoning on a much lower level. The `StreamChain` class
shipped in this repository will make the API consumption easier with its
transformation–oriented API for chaining data processors.

### Supported WordPress entities

* posts – sourced from `<item>` tags
* comments – sourced from `<wp:comment>` tags
* comment meta – sourced from `<wp:commentmeta>` tags
* users – sourced from `<wp:author>` tags
* post meta – sourced from `<wp:postmeta>` tags
* terms – sourced from `<wp:term>` tags
* tags – sourced from `<wp:tag>` tags
* categories – sourced from `<wp:category>` tags

## Caveats

### Extensibility

`WP_WXR_Reader` ignores any XML elements it doesn't recognize. The WXR
format is extensible, so in the future the reader may start supporting
registration of custom handlers for unknown tags.

### Nested entities intertwined with data

`WP_WXR_Reader` flushes the current entity whenever another entity
starts. The upside is simplicity and a tiny memory footprint. The
downside is that it's possible to craft a WXR document where some
information would be lost. For example:

```xml
<rss>
	<channel>
		<item>
		  <title>Page with comments</title>
		  <link>http://wpthemetestdata.wordpress.com/about/page-with-comments/</link>
		  <wp:postmeta>
		    <wp:meta_key>_wp_page_template</wp:meta_key>
		    <wp:meta_value><![CDATA[default]]></wp:meta_value>
		  </wp:postmeta>
		  <wp:post_id>146</wp:post_id>
		</item>
	</channel>
</rss>
```
`WP_WXR_Reader` would accumulate post data until the `wp:postmeta` tag.
Then it would emit a `post` entity and accumulate the meta information
until the `</wp:postmeta>` closer. Then it would advance to
`<wp:post_id>` and **ignore it**.

This is not a problem in any of the `.wxr` files I have seen. Still, it is
important to note this limitation. It is possible there is a `.wxr`
generator somewhere out there that intertwines post fields with post
meta and comments. If this ever comes up, we could:

* Emit the `post` entity first, then all the nested entities, and then
emit a special `post_update` entity.
* Do multiple passes over the WXR file – one for each level of nesting,
e.g. 1. Insert posts, 2. Insert Comments, 3. Insert comment meta

Buffering all the post meta and comments seems like a bad idea – there
might be gigabytes of data.

## Future Plans

The next phase will add pause/resume functionality to handle timeout
scenarios:

- Save parser state after each entity or every `n` entities to speed it
up. Then also save the `n` for a quick rewind after resuming.
- Resume parsing from saved state.
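
A minimal sketch of the save-and-resume idea, assuming the saved state can be reduced to a byte offset into the input. It uses newline-delimited records instead of XML, and `sketch_process_batch()` is a hypothetical name; the actual parser state is likely to hold more than an offset.

```php
<?php
/**
 * Hypothetical sketch: processes at most $limit newline-delimited records
 * starting at $offset and returns the byte offset to resume from.
 */
function sketch_process_batch( string $path, int $offset, int $limit, array &$seen ): int {
    $fp = fopen( $path, 'r' );
    fseek( $fp, $offset );
    for ( $i = 0; $i < $limit; $i++ ) {
        $line = fgets( $fp );
        if ( false === $line ) {
            break; // End of input.
        }
        $seen[] = rtrim( $line, "\n" );
    }
    $cursor = ftell( $fp );
    fclose( $fp );
    return $cursor;
}

// Stand-in input file with three records.
$path = tempnam( sys_get_temp_dir(), 'wxr' );
file_put_contents( $path, "post 1\npost 2\npost 3\n" );

$seen   = array();
$cursor = sketch_process_batch( $path, 0, 2, $seen );      // First request: two records.
$saved  = json_encode( array( 'cursor' => $cursor ) );     // Persist the state somewhere durable.
$state  = json_decode( $saved, true );                     // A later request restores it...
$cursor = sketch_process_batch( $path, $state['cursor'], 2, $seen ); // ...and resumes.
unlink( $path );
```

Saving the cursor only every `n` records trades fewer writes for having to re-process up to `n` records after an interruption.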

## Testing Instructions

Read the tests and ponder whether they make sense. Confirm the PHPUnit
test suite passes on CI. The test suite includes coverage for various
WXR formats and streaming behaviors.
@adamziel adamziel changed the title Tracking Issue: Next Gen Importers for Data Liberation Tracking Issue: Next-gen PHP Importers for Data Liberation Dec 4, 2024
adamziel added a commit that referenced this issue Dec 11, 2024
…2058)

## Description

Adds the Data Liberation WXR importer as an option in the `importWxr`
step. The new importer is turned on by including the `"importer":
"data-liberation"` option:

```json
{
  "steps": [
    {
      "step": "importWxr",
      "file": {
        "resource": "url",
        "url": "https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml"
      },
      "importer": "data-liberation"
    }
  ]
}
```

When the `importer` option is missing or set to "default," nothing
changes in the behavior of the step and it continues using the
https://github.com/humanmade/WordPress-Importer importer.

The new importer:

* Rewrites links in the imported content
* Downloads assets through Playground's CORS proxy
* Parallelizes the downloads
* Communicates progress

This PR is a part of
#1894

## Implementation details

This `importWxr` step fetches and includes the
`data-liberation-core.phar` file. The phar file is built with
[Box](https://box-project.github.io/box/configuration/) and contains the
importer library with its dependencies, which is a subset of the Data
Liberation library, a subset of the Blueprints library, and a few vendor
libraries.

This, unfortunately, means that any changes in the PHP files require
rebuilding the .phar file. Here's how you can do it:

```bash
nx build:phar playground-data-liberation
```

You can also build the entire Data Liberation package as a WordPress
plugin complete with a wp-admin page:

```bash
nx build:plugin playground-data-liberation
```

Both commands will output the built files to
`packages/playground/data-liberation/dist`

Progress updates are a first-class feature of the new importer. The
updated `importWxr` step receives them in real time via a
`post_message_to_js()` call running after every import step. Then, it
passes them on to the progress bar UI.

### Other changes

* **TLS traffic now goes through the CORS proxy.** Since the new
importer uses `AsyncHTTP\Client` which deals with raw sockets,
Playground's [TLS-based network
bridge](#1926)
runs the outbound traffic through a CORS proxy. Technically,
`TCPOverFetchWebsocket` gets the `corsProxy` URL passed to the
`playground.boot()` call.
* A few composer dependencies were forked, downgraded to PHP 7.2 using
Rector, and bundled with this PR to keep the Data Liberation importer
working.

## Remaining work

- [x] PHP 7.2 compatibility. Done by forking and Rector-downgrading
dependencies that were incompatible with PHP 7.2.
- [x] Report the importer's progress on the overall Blueprint progress
bar
- [x] Enqueue the data liberation plugin files for downloading at the
blueprint compilation stage
- [x] Don't eagerly rewrite attachments URLs in `WP_Stream_Importer`.
Exposing this information to the API consumer requires an explicit
decision. Do we rewrite it? Or do we ignore it?
- [x] Fix the TLS errors at the intersection of Playground network
transport and the async HTTP client library
- [x] Separate the markdown importer and its dependencies (md parser,
frontmatter parser, Symfony libraries) from the core plugin
- [x] Ship the importer and its tree-shaken deps (URL parser) as a
minified zip/phar

## Follow-up work

- [ ] Reconsider the `WP_Import_Session` API – do we need so many
verbosely named methods? Can we achieve the same outcomes with fewer
methods?
- [ ] Investigate why there's a significant delay before media downloads
start on PHP 7.2 – 7.4. It's likely a PHP.wasm issue.

## Testing instructions

* Default importer – [Open this
link](http://localhost:5400/website-server/#{%20%22plugins%22:%20[],%20%22steps%22:%20[%20{%20%22step%22:%20%22importWxr%22,%20%22file%22:%20{%20%22resource%22:%20%22url%22,%20%22url%22:%20%22https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml%22%20}%20}%20],%20%22preferredVersions%22:%20{%20%22php%22:%20%228.3%22,%20%22wp%22:%20%226.7%22%20},%20%22features%22:%20{%20%22networking%22:%20true%20},%20%22login%22:%20true%20})
and confirm it does what the current `importWxr` step does, that is, it
stays at "Importing content" for a moment, fails to fetch media files
(CORS issues in network tools), but inserts posts and pages.
* Data Liberation – [Open this
link](http://localhost:5400/website-server/#{%20%22plugins%22:%20[],%20%22steps%22:%20[%20{%20%22step%22:%20%22importWxr%22,%20%22importer%22:%20%22data-liberation%22,%20%22file%22:%20{%20%22resource%22:%20%22url%22,%20%22url%22:%20%22https://raw.githubusercontent.com/wpaccessibility/a11y-theme-unit-test/master/a11y-theme-unit-test-data.xml%22%20}%20}%20],%20%22preferredVersions%22:%20{%20%22php%22:%20%228.3%22,%20%22wp%22:%20%226.7%22%20},%20%22features%22:%20{%20%22networking%22:%20true%20},%20%22login%22:%20true%20}),
confirm the import progress is visible and that the content and media
indeed get imported:

![CleanShot 2024-12-08 at 14 54
49@2x](https://github.com/user-attachments/assets/a7da3244-a10f-43d2-8e94-43d305220a7e)

## Related issues

* #1211 
* #2012 
* #1477 
* #1250 
* #1780
adamziel added a commit that referenced this issue Dec 17, 2024
Adds a forked version of the markdown parsing libraries required by the
upcoming Markdown importer. We need our own fork for PHP 7.2
compatibility. The downgrade process was performed semi-automatically
via Rector.

This PR adds the following libraries:

* `league/commonmark`
* `webuni/front-matter`

There are no testing steps here. This PR only adds new code without
modifying the existing one.

A part of:

* #2080
* #1894
adamziel added a commit that referenced this issue Dec 17, 2024
Moves the Markdown importer to a `data-liberation-markdown` package so
that it can be shipped as a separate `.phar` file and downloaded only
when needed.

 ## Testing instructions

This only moves code around. To test, confirm the CI PHP unit tests keep
working.

A part of:

* #2080
* #1894
adamziel added a commit that referenced this issue Dec 17, 2024
Builds data-liberation-markdown.phar.gz (200KB) to enable downloading the
Markdown importer only when needed instead of on every page load.

A part of:

* #2080
* #1894

 ## Testing instructions

Run `nx build playground-data-liberation-markdown`, confirm it finished
without errors. A smoke test of the built phar file is included in the
build command.
adamziel added a commit that referenced this issue Dec 17, 2024
Adds a basic WP_HTML_To_Blocks class that accepts HTML and outputs block markup.

It only considers the markup and won't consider any visual changes introduced via CSS or JavaScript.

A part of #1894

 ## Example

```php
$html = <<<HTML
<meta name="post_title" content="My first post">
<p>Hello <b>world</b>!</p>
HTML;

$converter = new WP_HTML_To_Blocks( $html );
$converter->convert();

var_dump( $converter->get_all_metadata() );
/*
 * array( 'post_title' => array( 'My first post' ) )
 */

var_dump( $converter->get_block_markup() );
/*
 * <!-- wp:paragraph -->
 * <p>Hello <b>world</b>!</p>
 * <!-- /wp:paragraph -->
 */
```

 ## Testing instructions

This PR mostly adds new code. Just confirm the unit tests pass in CI.
adamziel added a commit that referenced this issue Dec 19, 2024
Adds a basic `WP_HTML_To_Blocks` class that accepts HTML and outputs
block markup.

It's a very basic converter. It only considers the markup and won't
consider any visual changes introduced via CSS or JavaScript. Only a few
core blocks are supported in this initial PR. The API can easily support
more HTML elements and blocks.

To preserve visual fidelity between the original HTML page and the
produced block markup, we'll need an annotated HTML input produced by
the [Try WordPress](https://github.com/WordPress/try-wordpress/) browser
extension. It would contain each element's colors, sizes, etc. We cannot
possibly get all from just analyzing the HTML on the server without
building a full-blown, browser-like HTML renderer in PHP, and I know I'm
not building one.

A part of #1894

 ## Example

```php
$html = <<<HTML
<meta name="post_title" content="My first post">
<p>Hello <b>world</b>!</p>
HTML;

$converter = new WP_HTML_To_Blocks( $html );
$converter->convert();

var_dump( $converter->get_all_metadata() );
/*
 * array( 'post_title' => array( 'My first post' ) )
 */

var_dump( $converter->get_block_markup() );
/*
 * <!-- wp:paragraph -->
 * <p>Hello <b>world</b>!</p>
 * <!-- /wp:paragraph -->
 */
```

 ## Caveats

I had to patch `WP_HTML_Processor` to stop bailing out on `<meta>` tags
referencing the document charset. Ideally we'd patch WordPress core to
stop bailing out when the charset is UTF-8.

 ## Testing instructions

This PR mostly adds new code. Just confirm the unit tests pass in CI.

cc @brandonpayton @zaerl @sirreal @dmsnell @ellatrix
adamziel added a commit that referenced this issue Jan 9, 2025
…ocessor (#2120)

Adds an `is_self_closing_block()` method to `WP_Block_Markup_Processor` to
enable detection and rewriting of block comments such as

```html
<!-- wp:core/separator /-->
```

This will be needed in the markdown processor.

A part of #1894.

 ## Testing instructions

CI. See the unit tests updated in this PR.
adamziel added a commit that referenced this issue Jan 10, 2025
A part of #1894

Introduces a standardized API for converting between static data formats
and blocks+metadata.

* The `data format -> blocks+metadata` operation is represented by the
`WP_Data_Format_Consumer` interface
* The `blocks+metadata -> data format` operation is represented by the
`WP_Data_Format_Producer` interface

This PR also ships a few initial consumers and producers:

* `WP_Annotated_Block_Markup_Consumer` – for consuming static block
markup with `<meta>` tags.
* `WP_Markup_Processor_Consumer` – for consuming an HTML/XHTML markup
processor instance. It handles just the regular HTML/XHTML markup, not
block markup.
* `WP_Annotated_Block_Markup_Producer` – for serializing block markup +
metadata array as block markup with `<meta>` tags

## Example

The two-way conversion pipeline shipped in this PR goes between this:

```php
$block_markup = <<<BLOCKS
<!-- wp:paragraph -->
<p>Hello <b>world</b>!</p>
<!-- /wp:paragraph -->
BLOCKS;

$metadata =  array(
     'post_title' => array( 'My first post' ),
);
```

And this:

```html
<meta name="post_title" content="My first post">
<!-- wp:paragraph -->
<p>Hello <b>world</b>!</p>
<!-- /wp:paragraph -->
```
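
As a rough illustration of the `blocks+metadata -> data format` direction, the sketch below serializes a metadata array as `<meta>` tags in front of the block markup. The `sketch_annotate_block_markup()` function is hypothetical; it does not reproduce the interface of `WP_Annotated_Block_Markup_Producer`, and the escaping choice is an assumption.

```php
<?php
/**
 * Hypothetical sketch (not the library's producer): prefixes block markup
 * with one <meta> tag per metadata value.
 *
 * @param string $block_markup Serialized block markup.
 * @param array  $metadata     Map of name => list of values.
 */
function sketch_annotate_block_markup( string $block_markup, array $metadata ): string {
    $meta_tags = '';
    foreach ( $metadata as $name => $values ) {
        foreach ( (array) $values as $value ) {
            $meta_tags .= sprintf(
                "<meta name=\"%s\" content=\"%s\">\n",
                htmlspecialchars( (string) $name, ENT_QUOTES ),
                htmlspecialchars( (string) $value, ENT_QUOTES )
            );
        }
    }
    return $meta_tags . $block_markup;
}

$block_markup = "<!-- wp:paragraph -->\n<p>Hello <b>world</b>!</p>\n<!-- /wp:paragraph -->";
$metadata     = array(
    'post_title' => array( 'My first post' ),
);

$annotated = sketch_annotate_block_markup( $block_markup, $metadata );
```

Escaping the attribute values keeps quotes or angle brackets in a title from breaking the `<meta>` tag when the annotated markup is parsed again.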

## Other changes

This PR also ships the block parser from WordPress core to enable
running unit tests – we need to call `parse_blocks()` now.

 ## Testing

The code isn't used anywhere yet – just rely on the CI.
adamziel added a commit that referenced this issue Jan 10, 2025
A part of #1894

Adds a new API for loading content from a WP_Filesystem instance:

* `WP_Filesystem_To_Post_Tree` for traversing a directory tree and
  mapping the structure to a hierarchical WordPress post/meta entity stream
* `WP_Filesystem_Entity_Reader` for converting the raw file content into
  WordPress blocks

To convert a set of zipped files into WordPress entities:

```php
// Any Filesystem instance works here. Could be WP_Local_Filesystem,
// WP_Git_Filesystem, or anything else. Let's read from a zip file here:
$fs = new WP_Zip_Filesystem(
    WP_File_Reader::create('./docs.zip')
);

$reader = new WP_Filesystem_Entity_Reader( $fs );

foreach($reader as $entity) {
    var_dump($entity);
}
```
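
The directory-to-hierarchy mapping can be pictured with a small sketch that works on a plain list of relative paths. `sketch_paths_to_post_tree()` and the entity array shape are hypothetical; how `WP_Filesystem_To_Post_Tree` actually names slugs or treats index files is not specified here.

```php
<?php
/**
 * Hypothetical sketch: turns relative file paths into post-like entities
 * whose `parent` points at the entity for the enclosing directory.
 */
function sketch_paths_to_post_tree( array $paths ): array {
    sort( $paths );
    $entities = array();
    foreach ( $paths as $path ) {
        $parent = null;
        $prefix = '';
        $parts  = explode( '/', $path );
        foreach ( $parts as $index => $part ) {
            $prefix  = '' === $prefix ? $part : $prefix . '/' . $part;
            $is_file = count( $parts ) - 1 === $index;
            if ( ! isset( $entities[ $prefix ] ) ) {
                $entities[ $prefix ] = array(
                    'type'   => 'post',
                    // Files drop their extension; directories keep their name.
                    'slug'   => $is_file ? pathinfo( $part, PATHINFO_FILENAME ) : $part,
                    'parent' => $parent,
                );
            }
            $parent = $prefix;
        }
    }
    return $entities;
}

$tree = sketch_paths_to_post_tree(
    array( 'guides/install.md', 'index.md', 'guides/advanced/cli.md' )
);
// Keys: guides, guides/advanced, guides/advanced/cli.md, guides/install.md, index.md
```

Because each directory entity is created before the files inside it, a consumer can insert parents first and resolve `parent` references as it goes.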

 ## Testing

The code isn't used anywhere yet – just rely on the CI checks.
Projects
Status: Inbox
2 participants