Page MenuHomePhabricator

Parsoid generates invalid HTML5 when a list is nested inside a <small> tag
Closed, DeclinedPublic

Description

While importing old LQT threads to Flow in preparation for disabling LQT on mediawiki.org I have run into some old wikitext that generates invalid HTML5.

The simplest version of this error is:

<small>
* i broke it
</small>

This results in the following HTML:

<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/"><head prefix="mwr: http://en.wikipedia.org/wiki/Special:Redirect/"><meta property="mw:articleNamespace" content="0"/><meta property="mw:parsoidVersion" content="0"/><link rel="dc:isVersionOf" href="//en.wikipedia.org/wiki/"/><title></title><base href="//en.wikipedia.org/wiki/"/><link rel="stylesheet" href="//en.wikipedia.org/w/load.php?modules=mediawiki.legacy.commonPrint,shared|mediawiki.skinning.elements|mediawiki.skinning.content|mediawiki.skinning.interface|skins.vector.styles|site|mediawiki.skinning.content.parsoid&amp;only=styles&amp;skin=vector"/></head><body data-parsoid='{"dsr":[0,30,0,0]}' lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body mw-body-content mediawiki" dir="ltr"><small data-parsoid='{"stx":"html","dsr":[0,30,7,8]}'>
<ul data-parsoid='{"dsr":[8,21,0,0]}'><li data-parsoid='{"dsr":[8,21,1,0]}'> i broke it!</li></ul>
</small></body></html>

Which gets the following error from w3c validator:

Element ul not allowed as child of element small in this context. (Suppressing further errors from this subtree.)

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description. (Show Details)
EBernhardson subscribed.

This is related to T70395 . It is enforced by the parsoid tests at https://github.com/wikimedia/parsoid/blob/master/tests/parserTests.txt#L8350-L8410

This varies from the php parser, which puts a small tag inside each <li>

We actually explicit spec out <small>\n*a\n</small> in our parser tests to generate that a list nested inside a formatting tag to emulate what the php parser generates. See https://github.com/wikimedia/parsoid/blob/f997f0bd21abbe6ccf1d5656407f74a3efa2ec70/tests/parserTests.txt#L8350-L8410

So, we'll have to figure out what the nesting rules are, find out whether the tree building algorithm allows, and if not, find out whether we are circumventing the tree builder, or if the tree builder lib (domino) is not compliant here ...

Looks like the PHP parser is generating small tags wrapping list element content which is what is expected.

ssastry set Security to None.
[subbu@earth lib] echo "<small>\n*a\n</small>" | node parse --trace html --dump dom:post-builder
[HTML]         | {"type":"TagTk","name":"body","attribs":[],"dataAttribs":{}}
0-[HTML]       | {"type":"TagTk","name":"body","attribs":[],"dataAttribs":{}}
0-[HTML]       | {"type":"TagTk","name":"small","attribs":[],"dataAttribs":{"tsr":[0,7],"stx":"html","tagId":1}}
0-[HTML]       | {"type":"NlTk","dataAttribs":{"tsr":[7,8]}}
0-[HTML]       | {"type":"TagTk","name":"ul","attribs":[],"dataAttribs":{"tsr":[8,8],"tagId":2}}
0-[HTML]       | {"type":"TagTk","name":"li","attribs":[],"dataAttribs":{"tsr":[8,9],"tagId":3}}
0-[HTML]       | "a"
0-[HTML]       | {"type":"EndTagTk","name":"li","attribs":[],"dataAttribs":{}}
0-[HTML]       | {"type":"EndTagTk","name":"ul","attribs":[],"dataAttribs":{}}
0-[HTML]       | {"type":"NlTk","dataAttribs":{"tsr":[10,11]}}
0-[HTML]       | {"type":"EndTagTk","name":"small","attribs":[],"dataAttribs":{"tsr":[11,19],"stx":"html"}}
0-[HTML]       | {"type":"NlTk","dataAttribs":{"tsr":[19,20]}}
0-[HTML]       | {"type":"EOFTk"}
---- DOM: after tree builder ----
<body data-parsoid="{}"><!--{"@type":"mw:shadow","attrs":[{"nodeName":"typeof","nodeValue":"mw:StartTag"},{"nodeName":"data-stag","nodeValue":"body:undefined"},{"nodeName":"data-parsoid","nodeValue":"{}"}]}--><small data-parsoid="{&quot;tsr&quot;:[0,7],&quot;stx&quot;:&quot;html&quot;,&quot;tagId&quot;:1}"><!--{"@type":"mw:shadow","attrs":[{"nodeName":"typeof","nodeValue":"mw:StartTag"},{"nodeName":"data-stag","nodeValue":"small:1"},{"nodeName":"data-parsoid","nodeValue":"{\"tsr\":[0,7],\"stx\":\"html\",\"tagId\":1}"}]}-->
<ul data-parsoid="{&quot;tsr&quot;:[8,8],&quot;tagId&quot;:2}"><!--{"@type":"mw:shadow","attrs":[{"nodeName":"typeof","nodeValue":"mw:StartTag"},{"nodeName":"data-stag","nodeValue":"ul:2"},{"nodeName":"data-parsoid","nodeValue":"{\"tsr\":[8,8],\"tagId\":2}"}]}--><li data-parsoid="{&quot;tsr&quot;:[8,9],&quot;tagId&quot;:3}"><!--{"@type":"mw:shadow","attrs":[{"nodeName":"typeof","nodeValue":"mw:StartTag"},{"nodeName":"data-stag","nodeValue":"li:3"},{"nodeName":"data-parsoid","nodeValue":"{\"tsr\":[8,9],\"tagId\":3}"}]}-->a</li><!--{"@type":"mw:shadow","attrs":[{"nodeName":"data-parsoid","nodeValue":"{}"},{"nodeName":"typeof","nodeValue":"mw:EndTag"},{"nodeName":"data-etag","nodeValue":"li"},{"nodeName":"data-parsoid","nodeValue":"{}"}]}--></ul><!--{"@type":"mw:shadow","attrs":[{"nodeName":"data-parsoid","nodeValue":"{}"},{"nodeName":"typeof","nodeValue":"mw:EndTag"},{"nodeName":"data-etag","nodeValue":"ul"},{"nodeName":"data-parsoid","nodeValue":"{}"}]}-->
</small><!--{"@type":"mw:shadow","attrs":[{"nodeName":"data-parsoid","nodeValue":"{\"tsr\":[11,19],\"stx\":\"html\"}"},{"nodeName":"typeof","nodeValue":"mw:EndTag"},{"nodeName":"data-etag","nodeValue":"small"},{"nodeName":"data-parsoid","nodeValue":"{\"tsr\":[11,19],\"stx\":\"html\"}"}]}-->
</body>

So, the tree builder seems to accept that just fine ... to be continued.

Could this be related to the bug which puts <small> inside <figure> elements? T65642

Also, if we are emitting the same markup as PHP+tidy is, then I don't think this is a bug in Parsoid, even if the w3c's HTML validator doesn't like it. We emit well-formed HTML, I don't think we make any guarantees about the HTML content model (which is a stricter validation format).

It doesn't look like parsoid is outputting the same html as php+tidy (unless tidy isn't used in prod? i really dont know how that works), if you visit https://www.mediawiki.org/wiki/User:EBernhardson_(WMF)/Sandbox and view source, there

<small>
* foo
</small>

results in

<ul>
<li><small>foo</small></li>
</ul>

while parsoid outputs

<small data-parsoid='{"stx":"html","dsr":[0,22,7,8]}'>
<ul data-parsoid='{"dsr":[8,13,0,0]}'><li data-parsoid='{"dsr":[8,13,1,0]}'> foo</li></ul>
</small>

This is a result of:

  1. doBlockLevels regexp-based paragraph wrapping bugs (T134469)
  2. Tidy doing more aggressive tree fixups from HTML4 era which works out fine in HTML5 land.
  3. HTML5 tree builder being lenient in what it accepts and HTML5 output compliance requirements being much stricter in terms of content model constraints.

RemexHTML has the same issue. However, in order to mimic Tidy, we added a p-wrapping Tidy compatibility pass to RemexHTML which effectively gives a result similar to Tidy in this case.

RemexHTML runs: HTML5 tree-builder + p-wrap compatibility tree rewriting
Parsoid runs: p-wrap (rendering) compatibility wrapping + HTML5 tree-builder

So, RemexHTML will generate HTML5 output for the above wikitext. But, this works only for situations when the p-wrap compatibility pass will kick in. It won't for wikitext of the form <small><ul><li>a</li></small>.

So, the underlying core reason here is that HTML5 treebuilder is more lenient in what it accepts. So, if we want truly HTML5 compliant output, we need to add an explicit pass to do that. See this HTML5 compliance discussion for how far we should go there.

ssastry lowered the priority of this task from Medium to Low.Apr 17 2017, 9:56 PM
  1. Parsoid and core parser output matches on this.
  2. We accept the bad markup and flag it as a lint error (misnested-tag) so editors can fix the markup.
  3. We aren't yet shooting for the fully compliant HTML5 output goal ( https://www.mediawiki.org/wiki/Parsing/Notes/HTML5_Compliance ).

So for now, nothing to do here.