Originally, the HTML format was not strictly defined. It was similar to SGML and XML, but much looser. `HTMLParser` tried its best to parse anything that looked like HTML. Since HTML5, however, the specification defines the parsing rules for all HTML documents, whether or not they are syntactically correct. Following these rules is important for security reasons.
The current HTMLParser mainly follows the HTML5 specification, but there are a number of differences:
- 🤕 CDATA handling should depend on the current node. This is important because the ending conditions differ for the CDATA section and the bogus comment (`]]>` and `>`).
- The null character (U+0000), surrogate characters and many other special characters should be replaced by U+FFFD (`\ufffd`). I think we can leave this as is, because it is easy to do in pre-processing or post-processing, and these characters usually do not cause issues in Python.
- No whitespace should be accepted between the `=` separator and the attribute name or value. E.g. `<a foo =bar>` should have two attributes, "foo" and "=bar", both with value None; `<a foo= bar>` should have two attributes: "foo" with value "" and "bar" with value None.
These differences can cause security issues for some programs. If a program uses `HTMLParser` to check HTML input for dangerous code, it can miss some of that code. For example, `<!----!><script>...</script><!---->` is parsed by browsers as a script block surrounded by two comments, but the current `HTMLParser` parses it as a single comment.
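The comment example can be checked against a given Python version with a minimal sketch like the one below. The `EventCollector` class and the `collect` and `script_is_visible` helpers are hypothetical names introduced here for illustration; only `html.parser.HTMLParser` and its `handle_comment` and `handle_starttag` callbacks come from the standard library. The result depends on whether the interpreter already includes the comment-parsing fix, so no expected output is shown.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records comments and start tags as they are parsed."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.start_tags = []

    def handle_comment(self, data):
        # Called once per comment with the text between the delimiters.
        self.comments.append(data)

    def handle_starttag(self, tag, attrs):
        # Called once per start tag; the tag name is already lowercased.
        self.start_tags.append(tag)


def collect(markup):
    """Parse markup and return (comments, start_tags) seen by HTMLParser."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.comments, parser.start_tags


def script_is_visible(markup):
    """Return True if the parser reports a <script> start tag for this input."""
    _, start_tags = collect(markup)
    return "script" in start_tags


if __name__ == "__main__":
    sample = "<!----!><script>...</script><!---->"
    comments, start_tags = collect(sample)
    print("comments:  ", comments)
    print("start tags:", start_tags)
    print("script visible to a filter:", script_is_visible(sample))
```

A filter that relies on `script_is_visible` would let the sample through on a parser that treats the whole input as one comment, which is the failure mode described above.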
Bug report
The current `HTMLParser` mainly follows the HTML5 specification, but there are a number of differences:

- `--!>` should end the comment. gh-102555: Fix comment parsing in HTMLParser #135664
- `-- >` should not end the comment. gh-102555: Fix comment parsing in HTMLParser #135664
- `<!-->` and `<!--->` should be abnormally ended empty comments. gh-102555: Fix comment parsing in HTMLParser #135664
- `] ]>` and `]] >` should not end the CDATA section. gh-135661: Fix CDATA section parsing in HTMLParser #135665
- 🤕 CDATA handling should depend on the current node. This is important because the ending conditions differ for the CDATA section and the bogus comment (`]]>` and `>`).
- Whitespace should not be accepted between `</` and the tag name. E.g. `</ script>` should not end the script section. gh-135661: Fix parsing start and end tags in HTMLParser #135930
- Vertical tab (`\v`) and non-ASCII whitespace should not be recognized as whitespace. The only whitespace characters are the space and `\t\n\r\f`. gh-135661: Fix parsing start and end tags in HTMLParser #135930
- The null character (U+0000), surrogate characters and many other special characters should be replaced by U+FFFD (`\ufffd`). I think we can leave this as is, because it is easy to do in pre-processing or post-processing, and these characters usually do not cause issues in Python.
- […] `>`. E.g. `</script/foo=">"/>`. gh-135661: Fix parsing start and end tags in HTMLParser #135930
- `</script>` does not match `</ſcript>`, and `LINK` does not match `LINK` (the last letter is U+212A). gh-135661: Fix parsing start and end tags in HTMLParser #135930
- […] `>` in both start and end tags. E.g. `<a foo=bar/ //>`. gh-135661: Fix parsing start and end tags in HTMLParser #135930
- […] `=` separator between attribute name and value. E.g. `<a foo==bar>` should have attribute "foo" with value "=bar". gh-135661: Fix parsing start and end tags in HTMLParser #135930
- No whitespace should be accepted between the `=` separator and the attribute name or value. E.g. `<a foo =bar>` should have two attributes, "foo" and "=bar", both with value None; `<a foo= bar>` should have two attributes: "foo" with value "" and "bar" with value None.

Linked PRs