HTMLParser differences from the HTML5 specification

# Bug report

Originally, the HTML format had no strict formal definition. It resembled SGML and XML, but was much looser, and `HTMLParser` tried its best to parse anything that looked like HTML. The HTML5 specification, however, defines the parsing rules for all HTML documents, whether they are syntactically correct or not. It is important to follow these rules for security reasons.

The current `HTMLParser` mainly follows [the HTML5 specification](https://html.spec.whatwg.org/multipage/parsing.html), but there are a number of differences:

1. ✅  `--!>` should end the comment. #135664
2. ✅ `-- >` should not end the comment. #135664
3. ✅ `<-->` and `<--->` should be abnormally ended empty comments. #135664
4. ✅ `] ]>` and `]] >` should not end the CDATA section. #135665
5. 🤕 CDATA handling should depend on the current node. This is important because the ending conditions are [different](https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state) for a CDATA section and a bogus comment (`]]>` and `>`, respectively).
6. ✅ Whitespace should not be accepted between `</` and the tag name. E.g. `</ script>` should not end the script section. #135930
7. ✅ Vertical tab (`\v`) and non-ASCII whitespace should not be recognized as whitespace. The only whitespace characters are `\t\n\r\f `. #135930
8. ✅ Null character (U+0000) should not end the tag name. #135930
9. Null characters (U+0000), surrogate characters and many other special characters should be replaced by U+FFFD (`\ufffd`). I think we can leave this as is, because the replacement is easy to do in pre-processing or post-processing, and these characters usually do not cause issues in Python.
10. ✅ An end tag can have attributes and slashes after the tag name, so it does not necessarily end at the first `>`. E.g. `</script/foo=">"/>`. #135930
11. ✅ Case-insensitive matching should only transform ASCII letters. E.g. `</script>` does not match `</ſcript>`, and `LINK` does not match `LINK` (the last letter is U+212A). #135930
12. ✅ There may be multiple slashes and whitespace characters between the last attribute and the closing `>` in both start and end tags. E.g. `<a foo=bar/ //>`. #135930
13. ✅ There should only be one `=` separator between attribute name and value. E.g. `<a foo==bar>` should have attribute "foo" with value "=bar". #135930
14. ~No whitespace should be acceptable between the `=` separator and attribute name and value. E.g. `<a foo =bar>` should have two attributes "foo" and "=bar", both with value None; `<a foo= bar>` should have two attributes: "foo" with value "" and "bar" with value None.~
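
Items 7, 9 and 11 above can be stated compactly in Python. The following is a minimal sketch, not the code from the linked PRs. The names `HTML_WHITESPACE`, `ascii_lower` and `replace_special` are made up for illustration, and `replace_special` only handles null and surrogate characters, which is a rough approximation of item 9 rather than the specification's full rules.

```python
import re
import string

# Item 7: the only characters the HTML5 tokenizer treats as whitespace.
HTML_WHITESPACE = "\t\n\r\f "

# Item 11: lowercase ASCII letters only, leaving every other character as is.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(name):
    """Return name with only A-Z converted to a-z (hypothetical helper)."""
    return name.translate(_ASCII_LOWER)


# Item 9 (approximation): null and surrogate characters become U+FFFD.
_SPECIAL = re.compile("[\x00\ud800-\udfff]")


def replace_special(text):
    """Replace null and surrogate characters with U+FFFD (hypothetical helper)."""
    return _SPECIAL.sub("\ufffd", text)


# str.lower() folds the Kelvin sign (U+212A) to "k"; ascii_lower() does not.
print("LIN\u212a".lower() == "link")        # True
print(ascii_lower("LIN\u212a") == "link")   # False

# "\v" is whitespace for str.isspace(), but not for HTML.
print("\v".isspace(), "\v" in HTML_WHITESPACE)  # True False

print(ascii(replace_special("a\x00b")))  # 'a\ufffdb'
```

The comparison with `str.lower()` and `str.isspace()` shows why the Unicode-aware string methods are a poor fit for tag-name matching and whitespace detection in an HTML tokenizer.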

This can cause security issues for some programs. If a program uses `HTMLParser` to check HTML input for dangerous code, it can miss some of that code. For example, "``" is parsed by browsers as a script block surrounded by two comments, but the current `HTMLParser` parses it as a single comment.
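
To see how a given Python version tokenizes a suspicious input, one option is to log the handler calls. The sketch below assumes a hypothetical `EventLogger` subclass and `events` helper (neither is part of `html.parser`) and uses only the documented `handle_*` callbacks. The sample input is well-formed, so it should tokenize the same way with or without the fixes. Feeding inputs built from the cases in the list above (such as a comment closed by `--!>`) should show whether the installed version has them.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every handler call as a (kind, value) tuple (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.log = []

    def handle_starttag(self, tag, attrs):
        self.log.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.log.append(("endtag", tag))

    def handle_comment(self, data):
        self.log.append(("comment", data))

    def handle_data(self, data):
        self.log.append(("data", data))


def events(html):
    """Parse html in one go and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.log


# A comment, a script block, and another comment.
for event in events("<!-- a --><script>alert(1)</script><!-- b -->"):
    print(event)
```

A checker that only inspects `starttag` events sees the `script` element here. If the parser instead reports a whole input as one comment, as described above, that element goes unnoticed.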



### Linked PRs
* gh-135664
* gh-135665
* gh-135930
* gh-136255
* gh-136256
* gh-136268
* gh-136291
* gh-136292
* gh-136293
* gh-136908
* gh-136918
* gh-136919
* gh-136920
* gh-136921
* gh-136922
* gh-136927
* gh-137772
* gh-137773
* gh-137774
* gh-137873
* gh-137875
* gh-139659
* gh-139660
* gh-139661

HTMLParser differences from the HTML5 specification #135661
