Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[HTML, XML] Don't interpret <? and <% inside CDATA section #252

Closed
zufuliu opened this issue May 30, 2024 · 14 comments
Closed

[HTML, XML] Don't interpret <? and <% inside CDATA section #252

zufuliu opened this issue May 30, 2024 · 14 comments
Labels
asp Caused by ASP or ASP.NET for the html lexer committed Issue fixed in repository but not in release html Caused by the hypertext lexer php Caused by the hypertext lexer for PHP xml Caused by the xml lexer

Comments

@zufuliu
Copy link
Contributor

zufuliu commented May 30, 2024

Created for https://sourceforge.net/p/scintilla/bugs/1078/.

Per https://html.spec.whatwg.org/multipage/syntax.html#cdata-sections and https://www.w3.org/TR/xml11/#sec-cdata-sect, CDATA section is a literal block and only the closing ]]> is recognized. I think here are two fixes:

  1. Only interpret <? inside CDATA section for SCLEX_PHPSCRIPT.
  2. Don't interpret <? inside XML CDATA section (original bug), let SCLEX_HTML as is.
zufuliu added a commit to zufuliu/notepad4 that referenced this issue May 30, 2024
@mpheath
Copy link

mpheath commented May 30, 2024

There is also <% in XML CDATA causing issues.

diff --git a/lexers/LexHTML.cxx b/lexers/LexHTML.cxx
index 4a5e092d..8a97d90e 100644
--- a/lexers/LexHTML.cxx
+++ b/lexers/LexHTML.cxx
@@ -1390,6 +1390,10 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 		else if (isMako && makoComment) {
 		}
 
+		// Ignore everything in XML CDATA until ]]>
+		else if (isXml && state == SCE_H_CDATA) {
+		}
+
 		// generic end of script processing
 		else if ((inScriptType == eNonHtmlScript) && (ch == '<') && (chNext == '/')) {
 			// Check if it's the end of the script tag (or any other HTML tag)

Check for XML CDATA before processing the other conditions seems to solve both <? and <% issues. No PHP stying in XML CDATA. If OK, then:

Issue252ab.zip

@nyamatongwe nyamatongwe added the html Caused by the hypertext lexer label May 30, 2024
@zufuliu
Copy link
Contributor Author

zufuliu commented May 31, 2024

Above change only fix the bug for XML (SCLEX_XML), not client HTML (SCLEX_HTML and SCLEX_PHPSCRIPT). I think we can add two properties to fix the bug for HTML:

  1. lexer.html.php_tag (default on, only used for SCLEX_HTML), when enabled will interpret <?.
  2. lexer.html.asp_tag (default on, not used for SCLEX_XML), when enabled will interpret <%. ASP tag was removed from PHP 7, see https://wiki.php.net/rfc/remove_alternative_php_tags and https://wiki.php.net/rfc/deprecate_php_short_tags

@zufuliu zufuliu changed the title [HTML, XML] Don't interpret <? inside CDATA section [HTML, XML] Don't interpret <? and <% inside CDATA section May 31, 2024
@nyamatongwe
Copy link
Member

@zufuliu Only interpret <? inside CDATA section for SCLEX_PHPSCRIPT

A quick search for PHP examples shows PHP embedded in web pages, not stand-alone PHP (SCLEX_PHPSCRIPT) which is the unusual case. For web pages, it is a text preprocessor over HTML so takes precedence for SCLEX_HTML. It is less commonly used for XML but may be used in the same way to produce data so is active for SCLEX_XML.

@zufuliu Don't interpret <? inside XML CDATA section (original bug), let SCLEX_HTML as is.

PHP generated XML is also useful.

There could be options to turn off pre-processors.

@zufuliu
Copy link
Contributor Author

zufuliu commented May 31, 2024

There could be options to turn off pre-processors.

I still think these options should default off for XML, as using PHP (or server-side technology) to preprocess XML is really rare (MIME configurations), otherwise most if not all use of SCLEX_XML will need to turn them off.

@nyamatongwe
Copy link
Member

I still think these options should default off for XML

It is not reasonable to break current applications.

@mpheath
Copy link

mpheath commented Jun 1, 2024

Please try again.

Issue252abc.zip

Both options default to 1 so is not a breaking change. If set to 0 then CDATA content is a single style until ]]> is reached.

diff --git a/lexers/LexHTML.cxx b/lexers/LexHTML.cxx
index 4a5e092d..2089bfe7 100644
--- a/lexers/LexHTML.cxx
+++ b/lexers/LexHTML.cxx
@@ -724,6 +724,8 @@ struct OptionsHTML {
 	bool allowScripts = true;
 	bool isMako = false;
 	bool isDjango = false;
+	bool styleHTMLCDATA = true;
+	bool styleXMLCDATA = true;
 	bool fold = false;
 	bool foldHTML = false;
 	bool foldHTMLPreprocessor = true;
@@ -767,6 +769,12 @@ struct OptionSetHTML : public OptionSet<OptionsHTML> {
 		DefineProperty("lexer.xml.allow.scripts", &OptionsHTML::allowScripts,
 			"Set to 0 to disable scripts in XML.");
 
+		DefineProperty("lexer.html.cdata.tag", &OptionsHTML::styleHTMLCDATA,
+			"Set to 0 to disable styling within CDATA tags.");
+
+		DefineProperty("lexer.xml.cdata.tag", &OptionsHTML::styleXMLCDATA,
+			"Set to 0 to disable styling within CDATA tags.");
+
 		DefineProperty("lexer.html.mako", &OptionsHTML::isMako,
 			"Set to 1 to enable the mako template language.");
 
@@ -1221,6 +1229,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 	const bool allowScripts = options.allowScripts;
 	const bool isMako = options.isMako;
 	const bool isDjango = options.isDjango;
+	const bool styleCDATA = (isXml ? options.styleXMLCDATA : options.styleHTMLCDATA);
 	const CharacterSet setHTMLWord(CharacterSet::setAlphaNum, ".-_:!#", true);
 	const CharacterSet setTagContinue(CharacterSet::setAlphaNum, ".-_:!#[", true);
 	const CharacterSet setAttributeContinue(CharacterSet::setAlphaNum, ".-_:!#/", true);
@@ -1390,6 +1399,9 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
 		else if (isMako && makoComment) {
 		}
 
+		else if (state == SCE_H_CDATA && !styleCDATA) {
+		}
+
 		// generic end of script processing
 		else if ((inScriptType == eNonHtmlScript) && (ch == '<') && (chNext == '/')) {
 			// Check if it's the end of the script tag (or any other HTML tag)

Here is a lua on double click function that can be added to SciTEStartup.lua to help see the difference when double clicking on the edit pane.

function OnDoubleClick()
	if editor:GetPropertyInt('lexer.html.cdata.tag', 1) == 1 then
		editor.Property['lexer.html.cdata.tag'] = '0'
		editor.Property['lexer.xml.cdata.tag'] = '0'
	else
		editor.Property['lexer.html.cdata.tag'] = '1'
		editor.Property['lexer.xml.cdata.tag'] = '1'
	end
end

Note that the div tags in the html file display as red. If div is later added to the keyword list in SciTE.properties then that may cause TestLexers to create new styled files.

@zufuliu
Copy link
Contributor Author

zufuliu commented Jun 3, 2024

  1. lexer.html.php_tag (default on, only used for SCLEX_HTML), when enabled will interpret <?.

This is complex than I thought, <? in HTML is treated as comment by browser, see https://html.spec.whatwg.org/multipage/parsing.html#parse-error-unexpected-question-mark-instead-of-tag-name

There could be options to turn off pre-processors.

Above patch only fixed CDATA (original bug), I think we can add options to turn off pre-processors globally:

  1. For XML, <? can be PHP start tag, XML instruction or error. <% is ASP (or template) start tag or error.
  2. For HTML, <? can be PHP start tag or comment start. <% is ASP (or template) start tag or element text.

@nyamatongwe
Copy link
Member

lexer.html.cdata.tag / lexer.xml.cdata.tag is confusing to me: its treating the file as ASP/PHP apart from a small exception of CDATA. ASP and PHP may have some understanding of the HTML/XML elements and so may special-case CDATA but that should be checked in the ASP and PHP documentation or with examples.

However, the motivating example in notepad-plus-plus/notepad-plus-plus#14576 does not appear to be intended for ASP or PHP processing and would be better treated as basic unprocessed XML.

Perhaps the desire here is to avoid splitting file types into with/without preprocessing so the user doesn't have to deal with this choice.

@zufuliu
Copy link
Contributor Author

zufuliu commented Jun 4, 2024

PHP does not understand HTML/XML elements:
https://www.php.net/manual/en/language.basic-syntax.phptags.php

When PHP parses a file, it looks for opening and closing tags, which are <?php and ?> which tell PHP to start and stop interpreting the code between them. Parsing in this manner allows PHP to be embedded in all sorts of different documents, as everything outside of a pair of opening and closing tags is ignored by the PHP parser.

ASP may (not test) understand HTML elements as it need to parse asp: prefixed tag and runas attribute on HTML element.

@nyamatongwe
Copy link
Member

PHP control could have 3 states since <?php is much less likely to mean something different than <? although it does in this case.

  1. disabled
  2. <?php enabled
  3. <?php and <? enabled

For ASP, the character just after <% could be checked as that decides between different ASP modes. From https://learn.microsoft.com/en-us/troubleshoot/developer/webapps/aspnet/development/inline-expressions it is unclear whether whitespace is required for embedded code blocks. There are other languages like JSP (and derivatives) that also use <% and currently work because of that similarity so could affect whether this is reasonable.

@nyamatongwe nyamatongwe added asp Caused by ASP or ASP.NET for the html lexer php Caused by the hypertext lexer for PHP labels Jun 4, 2024
zufuliu added a commit to zufuliu/notepad4 that referenced this issue Jun 10, 2024
@nyamatongwe
Copy link
Member

Here is a patch that implements lexer.xml.allow.php and lexer.html.allow.php with three states:

Set to 0 to disable PHP in HTML, 1 to accept <?php, 2 to also accept <?. The default is 2.

allowPHP.patch.txt

@zufuliu
Copy link
Contributor Author

zufuliu commented Jun 14, 2024

if (allowPHP == AllowPHP::PHP) should also allow short echo tag: <?= ?>.
https://www.php.net/manual/en/language.basic-syntax.phptags.php

As short tags can be disabled it is recommended to only use the normal tags (<?php ?> and <?= ?>) to maximise compatibility.

nyamatongwe added a commit that referenced this issue Jun 15, 2024
@nyamatongwe nyamatongwe added the xml Caused by the xml lexer label Jun 16, 2024
@nyamatongwe
Copy link
Member

@nyamatongwe For ASP, the character just after <% could be checked ... it is unclear whether whitespace is required for embedded code blocks.

Whitespace is not required and ASP examples with '<%` followed immediately by a call is common.

@zufuliu
Copy link
Contributor Author

zufuliu commented Jun 16, 2024

Needs update segIsScriptingIndicator() for following code:

lexilla/lexers/LexHTML.cxx

Lines 104 to 105 in a6f1998

if (Contains(s, "php"))
return eScriptPHP;

PHP script tag was removed in PHP 7, https://wiki.php.net/rfc/remove_alternative_php_tags

This RFC proposes the removal of ASP tags (<%) and script tags (<script language=php>) as a means of entering or leaving PHP mode.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
asp Caused by ASP or ASP.NET for the html lexer committed Issue fixed in repository but not in release html Caused by the hypertext lexer php Caused by the hypertext lexer for PHP xml Caused by the xml lexer
Projects
None yet
Development

No branches or pull requests

3 participants