Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

fix: Extra content at the end of the document #161

Merged
merged 2 commits into from
Jul 7, 2024

Conversation

naitoh
Copy link
Contributor

@naitoh naitoh commented Jul 4, 2024

Why?

XML with additional content at the end of the document is invalid.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-Misc

[27]    Misc       ::=          Comment | PI | S

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PI

[16]    PI         ::=          '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget

[17]    PITarget           ::=          Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

## Why?

XML with multiple root elements is invalid.

See: ruby#160 (comment)
@naitoh naitoh force-pushed the fix_multiple_root_elements branch from 641d9d1 to 4e9de51 Compare July 5, 2024 04:19
@naitoh naitoh requested a review from kou July 5, 2024 04:30
naitoh added a commit to naitoh/rexml that referenced this pull request Jul 6, 2024
## Why?
XML declaration must be the first item.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

```
[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-prolog

```
[22]   prolog   ::=   	XMLDecl Misc* (doctypedecl Misc*)?
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-XMLDecl

```
[23]   XMLDecl  ::=   '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
```

See: ruby#161 (comment)
return [ :start_element, tag, attributes ]
end
else
text = @source.read_until("<")
if text.chomp!("<")
@source.position -= "<".bytesize
end
if @tags.empty? and @have_root
if text.strip != ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strip allocates a new string. Can we avoid it?
For example: /\A\s*\z/.match?(text)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.
I see.

## Why?

XML with additional content at the end of the document is invalid.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

```
[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-Misc

```
[27]   	Misc	   ::=   	Comment | PI | S
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PI

```
[16]   	PI	   ::=   	'<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget

```
[17]   	PITarget	   ::=   	Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
```
@naitoh naitoh force-pushed the fix_multiple_root_elements branch from 4e9de51 to c094825 Compare July 7, 2024 00:40
@naitoh naitoh requested a review from kou July 7, 2024 00:45
naitoh added a commit to naitoh/rexml that referenced this pull request Jul 7, 2024
## Why?
XML declaration must be the first item.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

```
[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-prolog

```
[22]   prolog   ::=   	XMLDecl Misc* (doctypedecl Misc*)?
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-XMLDecl

```
[23]   XMLDecl  ::=   '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
```

See: ruby#161 (comment)
@kou kou merged commit eb45c8d into ruby:master Jul 7, 2024
61 checks passed
@kou
Copy link
Member

kou commented Jul 7, 2024

Thanks.

kou pushed a commit that referenced this pull request Jul 7, 2024
## Why?
XML declaration must be the first item.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

```
[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-prolog

```
[22]   prolog   ::=   	XMLDecl Misc* (doctypedecl Misc*)?
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-XMLDecl

```
[23]   XMLDecl  ::=   '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
```

See: #161 (comment)
@naitoh naitoh deleted the fix_multiple_root_elements branch July 7, 2024 21:02
@DmitryPogrebnoy
Copy link
Contributor

DmitryPogrebnoy commented Oct 15, 2024

@naitoh @kou After this change, parsing of '<threads><thread id="1"/></threads><threads><thread id="1"/></threads>' using REXML::Parsers::PullParser.new(socket).pull throw an illegal seek exception. Should I create a separate issue for this problem?

The main cause is this if statement at lib/rexml/parsers/baseparser.rb:498

if @tags.empty? and @have_root
  raise ParseException.new("Malformed XML: Extra tag at the end of the document (got '<#{tag}')", @source)
end

@kou
Copy link
Member

kou commented Oct 16, 2024

<threads><thread id="1"/></threads><threads><thread id="1"/></threads> is an invalid XML.
Are you suggesting that we should support an invalid XML?

@DmitryPogrebnoy
Copy link
Contributor

DmitryPogrebnoy commented Oct 16, 2024

@kou Well, I use socket to get xml messages from the server and parse them using PullParser. Each message is complete and valid. Before change it worked like a charm. Now it doesn't work anymore.
If the current way with REXML::Parsers::PullParser.new(socket).pull does not work, what is the right approach for such a use case?

@kou
Copy link
Member

kou commented Oct 16, 2024

OK. Could you open a new issue for it?

@DmitryPogrebnoy
Copy link
Contributor

@kou Yep, here it is - #214

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants