How to use "&" and other HTML entities as text content? #44

Closed
spun opened this issue Oct 12, 2020 · 7 comments
Labels: bug (Something isn't working), indev (The issue is fixed/implemented in the dev branch)

Comments

spun commented Oct 12, 2020

xmlutil version: v0.80.0
platform: Android

Hi, I was decoding some XML from a third party and noticed that decodeFromString fails when the XML response has "&amp;" or other HTML entities (such as the entity for a single quote) as text content inside an element/tag.

After finding that, I tried to create a simple example and I would like some help to understand how to deal with "&".

Example:

If I use encodeToString to encode the following content:

@Serializable
data class Results(
    val itemList: List<Item>
)

@Serializable
data class Item(
    @XmlValue(true)
    val text: String
)

[...]

val content =
    Results(
        listOf(
            Item("Item ' 1"),
            Item("Item & 2")
        )
    )

I get this XML as a result:

<Results>
  <Item>Item ' 1</Item>
  <Item>Item &amp; 2</Item>
</Results>

and now, if I use that same result output with decodeFromString I get an error

nl.adaptivity.xmlutil.XmlException: Found unexpected child tag: ENTITY_REF

I also noticed that, if I remove @XmlValue(true) and use "&" inside an attribute

<Results>
  <Item text="Item ' 1" />
  <Item text="Item &amp; 2" />
</Results>

the decoding goes perfectly.

What's the correct way of decoding XML that has "&amp;" as text inside an element?

@pdvrieze
Owner

This is a bug. I'll add a test to fix this. It should "just work".

pdvrieze added a commit that referenced this issue Oct 16, 2020:
… The Android parser is relatively stupid and exposes these. This fixes bug #44
pdvrieze added the bug and indev labels Oct 16, 2020
@pdvrieze
Owner

Fixed it in dev. I'll release something tomorrow.

@pdvrieze
Owner

Now fixed in 0.80.1

DavidJRobertson commented Dec 17, 2020

I am seeing this issue with 0.80.1. Here is a snippet of the XML I am trying to use:

<StringWithMarkup>
    <String>Chloroacetic acid, &gt;=99%</String>
</StringWithMarkup>

(It throws at the &gt; part)


The corresponding class is:

@XmlSerialName("StringWithMarkup", "http://pubchem.ncbi.nlm.nih.gov/pug_view", "")
@Serializable
data class StringWithMarkup(
    @XmlElement(true) @SerialName("String") val string: String = "",
    val markup: List<Markup> = emptyList()
)

And the format in use:

private val xmlFormat = XML {
    repairNamespaces = true
    policy = DefaultXmlSerializationPolicy(pedantic = false, autoPolymorphic = false) { input, inputKind, name, candidates ->
        if (name?.namespaceURI == "http://pubchem.ncbi.nlm.nih.gov/pug_view") {
            throw UnknownXmlFieldException(input.locationInfo, name.toString(), candidates)
        }
        return@DefaultXmlSerializationPolicy
    }
}

@micwallace

Just a note for anyone who still has the same issue. I was able to solve it by adding a normalize function and calling it on the string before calling xml.decodeFromString:

fun String.normalize() = replace(" />", "/>")
    .replace("\r\n", "\n")
    .replace("&gt;", ">")

@micwallace

Actually this is still not working for other XML entities on Android, such as &lt;. This behavior is very strange. These entities shouldn't be decoded until the text is required. &lt; can't be replaced beforehand because it would make the XML invalid.
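
To illustrate why pre-replacing entities only goes so far, here is a rough sketch that checks well-formedness with the JDK's DOM parser (javax.xml.parsers) rather than xmlutil. The helper name isWellFormed is made up for this example, and the sample string is loosely based on the PubChem snippet above. It suggests that a literal ">" is tolerated in text content, while a literal "<" is not:

```kotlin
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource
import org.xml.sax.SAXException
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper (not part of xmlutil): returns true when the JDK's
// DOM parser accepts the string as a well-formed XML document.
fun isWellFormed(xml: String): Boolean = try {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    // DefaultHandler rethrows fatal errors without printing them to stderr.
    builder.setErrorHandler(DefaultHandler())
    builder.parse(InputSource(StringReader(xml)))
    true
} catch (e: SAXException) {
    false
}

// Sample input, assumed for illustration only.
val sample = "<String>a &lt; b, &gt;=99%</String>"

// Replacing &gt; leaves a literal '>' in text content, which parsers accept.
val gtReplaced = sample.replace("&gt;", ">")

// Replacing &lt; leaves a literal '<', which the parser reads as the start of a tag.
val ltReplaced = sample.replace("&lt;", "<")
```

Under these assumptions, isWellFormed(sample) and isWellFormed(gtReplaced) should both be true, while isWellFormed(ltReplaced) should be false. The same reasoning applies to &amp;, so a string-level workaround cannot cover every entity.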

@pdvrieze
Owner

@micwallace I'm not sure what you mean by this. Do you mean that it stores the text in memory/Kotlin without entities (unescaped) and only encodes on serialization? That is something that cannot really be changed.
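
A minimal sketch of that model, using only the Kotlin standard library: strings are held unescaped in memory, escaped when written out as text content, and unescaped again when read back. The function names escapeXmlText and unescapeXmlText are hypothetical and are not xmlutil's API. The exact set of characters the library escapes is an assumption here, chosen so the output matches the encoded example at the top of the thread:

```kotlin
// Hypothetical helpers for illustration; not part of xmlutil's API.

// Escape a plain Kotlin string for use as XML text content.
// Assumption: only '&', '<' and '>' are escaped; quotes are left as-is,
// which is valid inside text content (though not inside attribute values).
fun escapeXmlText(text: String): String = buildString {
    for (c in text) {
        when (c) {
            '&' -> append("&amp;")
            '<' -> append("&lt;")
            '>' -> append("&gt;")
            else -> append(c)
        }
    }
}

// Reverse the five predefined XML entities.
// "&amp;" is handled last so that "&amp;lt;" becomes "&lt;" and not "<".
fun unescapeXmlText(xml: String): String =
    xml.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&apos;", "'")
        .replace("&quot;", "\"")
        .replace("&amp;", "&")
```

With these helpers, escapeXmlText("Item & 2") gives "Item &amp; 2" and unescapeXmlText turns it back into "Item & 2", so the in-memory value never contains entities.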
