Arrays type and rels should not contain duplicate items. #30

Closed
Zegnat opened this issue Mar 20, 2018 · 4 comments

Comments

@Zegnat
Member

Zegnat commented Mar 20, 2018

The div element in the following example really only specifies two classes on itself, even though the class attribute contains three terms. Likewise, the a element creates only two distinct relations between the source document and the URL in href, even though the rel attribute contains three terms.

<div class="h-entry h-cite h-entry">
  <a href="#" rel="me bookmark me"></a>
</div>

Comparing the development version of the Python parser with the Go parser makes the issue clear. The Python parser only shows unique values for ["items"][0].type and ["rel-urls"]["#"].rels, while the Go parser shows duplicate h-entry and me values there.

Of the HTML attributes that microformats parsing depends on, class and rel are the only ones defined as sets, so duplicate terms in the source HTML have no effect. They are mapped to the type and rels arrays respectively.
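
To illustrate the set semantics, here is a minimal Python sketch. The helper name `unique_tokens` is hypothetical and not taken from any of the parsers discussed; it splits an attribute value on ASCII whitespace and drops repeated terms, keeping first-seen order only as a convenience:

```python
import re

# ASCII whitespace as used for HTML token lists: tab, LF, FF, CR, space.
_ASCII_WS = re.compile(r"[\t\n\f\r ]+")

def unique_tokens(attr_value):
    """Split an attribute value into tokens and drop repeated terms.

    Hypothetical helper for illustration only.
    """
    seen = []
    for token in _ASCII_WS.split(attr_value.strip("\t\n\f\r ")):
        if token and token not in seen:
            seen.append(token)
    return seen

print(unique_tokens("h-entry h-cite h-entry"))  # ['h-entry', 'h-cite']
print(unique_tokens("me bookmark me"))          # ['me', 'bookmark']
```

Both attributes from the example collapse to two terms each, which is what the parsed output should reflect.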

The proposed solution is to:

  1. define that only unique items should be added to type and rels.

This is actually already the case for rels:

set the value of that "rels" key to an array of all unique items in the set of rel values unioned with the current array value of the "rels" key
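
The quoted step could be sketched roughly as follows in Python. The function name `merge_rels` is made up for this illustration, and the alphabetical sort is an assumption added to match the Python parser output shown below; the quoted spec text itself only asks for unique items:

```python
def merge_rels(current, new_values):
    """Union the existing "rels" array with newly seen rel values.

    Hypothetical helper; sorting is an assumption, not part of the quoted text.
    """
    return sorted(set(current) | set(new_values))

rel_urls = {}
entry = rel_urls.setdefault("#", {"rels": []})
entry["rels"] = merge_rels(entry["rels"], "me bookmark me".split())
print(rel_urls)  # {'#': {'rels': ['bookmark', 'me']}}
```

Because the union is taken over sets, a repeated `me` in the rel attribute cannot produce a duplicate entry.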

Parser output

Python

{
  "items": [{
      "type": [ "h-cite", "h-entry" ],
      "properties": {
        "url": [ "#" ],
        "name": [ "" ]
      }
  }],
  "rels": {
    "bookmark": [ "#" ],
    "me": [ "#" ]
  },
  "rel-urls": {
    "#": {
      "rels": [ "bookmark", "me" ],
      "text": ""
    }
  }
}

Go

{
  "items": [{
    "type": [ "h-entry", "h-cite", "h-entry" ],
    "properties": {
      "url": [ "#" ]
    }
  }],
  "rels": {
    "bookmark": [ "#" ],
    "me": [ "#", "#" ]
  },
  "rel-urls": {
    "#": {
      "rels": [ "me", "bookmark", "me" ]
    }
  }
}
@kartikprabhu
Member

Also, from http://pin13.net/mf2-dev/:

PHP

{
    "items": [
        {
            "type": [
                "h-cite",
                "h-entry",
                "h-entry"
            ],
            "properties": {
                "name": [
                    ""
                ],
                "url": [
                    "#"
                ]
            }
        }
    ],
    "rels": {
        "me": [
            "#",
            "#"
        ],
        "bookmark": [
            "#"
        ]
    },
    "rel-urls": {
        "#": {
            "rels": [
                "me",
                "bookmark",
                "me"
            ]
        }
    }
}

@tantek
Member

tantek commented Mar 20, 2018

Since the HTML 'class' and 'rel' attributes are defined as unordered sets, we must preserve that semantic across any parsing transformation. Otherwise we introduce meaningless noise, such as artificial ordering, which a consuming application could erroneously infer meaning from and come to depend on.

The parsing spec must require uniqueness in the 'type' array accordingly, and since JSON arrays do have a defined ordering (whether you want it or not), the best we can do is to define a canonical ordering that does not imply anything about the unordered source, such as alphabetical ordering of unique h-* classnames.
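
As a rough sketch of such a canonical ordering in Python: the function name `canonical_types` is hypothetical, and the plain `h-` prefix check is a simplification of how root class names are actually recognised.

```python
def canonical_types(class_value):
    """Return the unique h-* class names in alphabetical order.

    Hypothetical helper; the prefix test is a simplifying assumption.
    """
    return sorted({t for t in class_value.split() if t.startswith("h-")})

print(canonical_types("h-entry h-cite h-entry"))  # ['h-cite', 'h-entry']
print(canonical_types("h-cite h-entry"))          # ['h-cite', 'h-entry']
```

Any permutation or repetition of the same class names gives the same array, so the output says nothing about source order.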

For 'rel' attributes, the spec http://microformats.org/wiki/microformats2-parsing#parse_a_hyperlink_element_for_rel_microformats already says the right things for treating them as sets, and notably does preserve source order in the URL sub-arrays for each rel key, which is intentional.
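
A small Python sketch of that behaviour, under stated assumptions: `add_link` is a made-up helper, the `/about` URL is invented for the example, URL normalisation is skipped, and skipping a URL already listed under a rel key is an assumption based on the Python parser output earlier in this thread.

```python
def add_link(rels, rel_value, url):
    """Record url under each distinct rel value, keeping source order of URLs.

    Hypothetical helper for illustration only.
    """
    for rel in dict.fromkeys(rel_value.split()):  # distinct rel values
        urls = rels.setdefault(rel, [])
        if url not in urls:  # assumption: no repeated URL per rel key
            urls.append(url)

rels = {}
add_link(rels, "me bookmark me", "#")
add_link(rels, "me", "/about")
print(rels)  # {'me': ['#', '/about'], 'bookmark': ['#']}
```

The keys behave as a set of rel values, while each URL array keeps the order in which links appeared in the source.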

(Originally published at: http://tantek.com/2018/079/t2/)

@kartikprabhu
Member

+1
cc: @tantek

@Zegnat
Member Author

Zegnat commented Mar 21, 2018

  • I initially misread the logic for the rels parser, and indeed, that is already deduped! (Through unions of sets.)
  • Spec has been updated so type only contains unique names.

@Zegnat Zegnat closed this as completed Mar 21, 2018
Zegnat added a commit to Zegnat/php-mf2 that referenced this issue Mar 22, 2018
* Parse the rel attribute in accordance with the WHATWG spec:
  https://infra.spec.whatwg.org/#split-on-ascii-whitespace
* Only list unique rel values in the rel-urls output, fixes microformats#159:
  microformats/microformats2-parsing#30
* Sort the unique rel values alphabetically:
  microformats/microformats2-parsing#29
* Correctly merge attribute values into the resulting object.
Zegnat added a commit to Zegnat/php-mf2 that referenced this issue Mar 24, 2018
* Parse the rel attribute in accordance with the WHATWG spec:
  https://infra.spec.whatwg.org/#split-on-ascii-whitespace
* Only list unique rel values in the rel-urls output, fixes microformats#159:
  microformats/microformats2-parsing#30
* Sort the unique rel values alphabetically:
  microformats/microformats2-parsing#29
* Correctly merge attribute values into the resulting object.