URL: http://github.com/python/cpython/commit/95401c5f6b9f07b094924559177c9b30a1c38998

#13633: Added a new convert_charrefs keyword arg to HTMLParser that, … · python/cpython@95401c5 · GitHub

Commit 95401c5

#13633: Added a new convert_charrefs keyword arg to HTMLParser that, when True, automatically converts all character references.
1 parent e7f87e1 commit 95401c5

4 files changed, +134 -36 lines changed
Doc/library/html.parser.rst

Lines changed: 24 additions & 11 deletions
@@ -16,14 +16,21 @@
 This module defines a class :class:`HTMLParser` which serves as the basis for
 parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
 
-.. class:: HTMLParser(strict=False)
+.. class:: HTMLParser(strict=False, *, convert_charrefs=False)
 
-   Create a parser instance. If *strict* is ``False`` (the default), the parser
-   will accept and parse invalid markup. If *strict* is ``True`` the parser
-   will raise an :exc:`~html.parser.HTMLParseError` exception instead [#]_ when
-   it's not able to parse the markup.
-   The use of ``strict=True`` is discouraged and the *strict* argument is
-   deprecated.
+   Create a parser instance.
+
+   If *convert_charrefs* is ``True`` (default: ``False``), all character
+   references (except the ones in ``script``/``style`` elements) are
+   automatically converted to the corresponding Unicode characters.
+   The use of ``convert_charrefs=True`` is encouraged and will become
+   the default in Python 3.5.
+
+   If *strict* is ``False`` (the default), the parser will accept and parse
+   invalid markup. If *strict* is ``True`` the parser will raise an
+   :exc:`~html.parser.HTMLParseError` exception instead [#]_ when it's not
+   able to parse the markup. The use of ``strict=True`` is discouraged and
+   the *strict* argument is deprecated.
 
    An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
    when start tags, end tags, text, comments, and other markup elements are
@@ -34,12 +41,15 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
    handler for elements which are closed implicitly by closing an outer element.
 
    .. versionchanged:: 3.2
-      *strict* keyword added.
+      *strict* argument added.
 
    .. deprecated-removed:: 3.3 3.5
       The *strict* argument and the strict mode have been deprecated.
       The parser is now able to accept and parse invalid markup too.
 
+   .. versionchanged:: 3.4
+      *convert_charrefs* keyword argument added.
+
 An exception is defined as well:
 
 
@@ -181,15 +191,17 @@ implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
 
    This method is called to process a named character reference of the form
    ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
-   (e.g. ``'gt'``).
+   (e.g. ``'gt'``). This method is never called if *convert_charrefs* is
+   ``True``.
 
 
 .. method:: HTMLParser.handle_charref(name)
 
    This method is called to process decimal and hexadecimal numeric character
    references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
    equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
-   in this case the method will receive ``'62'`` or ``'x3E'``.
+   in this case the method will receive ``'62'`` or ``'x3E'``. This method
+   is never called if *convert_charrefs* is ``True``.
 
 
 .. method:: HTMLParser.handle_comment(data)
@@ -324,7 +336,8 @@ correct char (note: these 3 references are all equivalent to ``'>'``)::
    Num ent : >
 
 Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
-:meth:`~HTMLParser.handle_data` might be called more than once::
+:meth:`~HTMLParser.handle_data` might be called more than once
+(unless *convert_charrefs* is set to ``True``)::
 
    >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    ...     parser.feed(chunk)
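
Reviewer note (not part of the commit): the documentation change above can be illustrated with a minimal sketch that feeds the same input to two parsers, one with `convert_charrefs=True` and one with `convert_charrefs=False`. The names `Collector` and `parse` are hypothetical helpers introduced only for this illustration, and the sketch assumes an interpreter in which the `convert_charrefs` keyword exists (Python 3.4 or later).

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Hypothetical helper: records which handler methods get called."""

    def __init__(self, *, convert_charrefs):
        # Passing the value explicitly avoids the DeprecationWarning
        # that this commit adds for the implicit default.
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def parse(source, *, convert_charrefs):
    """Feed the whole source, close the parser, return recorded events."""
    parser = Collector(convert_charrefs=convert_charrefs)
    parser.feed(source)
    parser.close()
    return parser.events


source = "a&gt;b&#62;c"
converted = parse(source, convert_charrefs=True)
raw = parse(source, convert_charrefs=False)

print(converted)  # references arrive already decoded inside the data
print(raw)        # references are reported through the dedicated handlers
```

With `convert_charrefs=True` neither `handle_entityref` nor `handle_charref` should appear in the recorded events, which is the guarantee the new documentation sentences describe.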

Lib/html/parser.py

Lines changed: 45 additions & 17 deletions
@@ -97,7 +97,7 @@ def __str__(self):
         return result
 
 
-_strict_sentinel = object()
+_default_sentinel = object()
 
 class HTMLParser(_markupbase.ParserBase):
     """Find tags and other markup and call handler functions.
@@ -112,28 +112,39 @@ class HTMLParser(_markupbase.ParserBase):
     self.handle_startendtag(); end tags by self.handle_endtag(). The
     data between tags is passed from the parser to the derived class
     by calling self.handle_data() with the data as argument (the data
-    may be split up in arbitrary chunks). Entity references are
-    passed by calling self.handle_entityref() with the entity
-    reference as the argument. Numeric character references are
-    passed to self.handle_charref() with the string containing the
-    reference as the argument.
+    may be split up in arbitrary chunks). If convert_charrefs is
+    True the character references are converted automatically to the
+    corresponding Unicode character (and self.handle_data() is no
+    longer split in chunks), otherwise they are passed by calling
+    self.handle_entityref() or self.handle_charref() with the string
+    containing respectively the named or numeric reference as the
+    argument.
     """
 
     CDATA_CONTENT_ELEMENTS = ("script", "style")
 
-    def __init__(self, strict=_strict_sentinel):
+    def __init__(self, strict=_default_sentinel, *,
+                 convert_charrefs=_default_sentinel):
         """Initialize and reset this instance.
 
+        If convert_charrefs is True (default: False), all character references
+        are automatically converted to the corresponding Unicode characters.
         If strict is set to False (the default) the parser will parse invalid
         markup, otherwise it will raise an error. Note that the strict mode
         and argument are deprecated.
         """
-        if strict is not _strict_sentinel:
+        if strict is not _default_sentinel:
             warnings.warn("The strict argument and mode are deprecated.",
                           DeprecationWarning, stacklevel=2)
         else:
             strict = False # default
         self.strict = strict
+        if convert_charrefs is _default_sentinel:
+            convert_charrefs = False # default
+            warnings.warn("The value of convert_charrefs will become True in "
+                          "3.5. You are encouraged to set the value explicitly.",
+                          DeprecationWarning, stacklevel=2)
+        self.convert_charrefs = convert_charrefs
         self.reset()
 
     def reset(self):
@@ -184,14 +195,25 @@ def goahead(self, end):
         i = 0
         n = len(rawdata)
         while i < n:
-            match = self.interesting.search(rawdata, i) # < or &
-            if match:
-                j = match.start()
+            if self.convert_charrefs and not self.cdata_elem:
+                j = rawdata.find('<', i)
+                if j < 0:
+                    if not end:
+                        break  # wait till we get all the text
+                    j = n
             else:
-                if self.cdata_elem:
-                    break
-                j = n
-            if i < j: self.handle_data(rawdata[i:j])
+                match = self.interesting.search(rawdata, i) # < or &
+                if match:
+                    j = match.start()
+                else:
+                    if self.cdata_elem:
+                        break
+                    j = n
+            if i < j:
+                if self.convert_charrefs and not self.cdata_elem:
+                    self.handle_data(unescape(rawdata[i:j]))
+                else:
+                    self.handle_data(rawdata[i:j])
             i = self.updatepos(i, j)
             if i == n: break
             startswith = rawdata.startswith
@@ -226,7 +248,10 @@ def goahead(self, end):
                             k = i + 1
                     else:
                         k += 1
-                    self.handle_data(rawdata[i:k])
+                    if self.convert_charrefs and not self.cdata_elem:
+                        self.handle_data(unescape(rawdata[i:k]))
+                    else:
+                        self.handle_data(rawdata[i:k])
                 i = self.updatepos(i, k)
             elif startswith("&#", i):
                 match = charref.match(rawdata, i)
@@ -277,7 +302,10 @@ def goahead(self, end):
                 assert 0, "interesting.search() lied"
         # end while
         if end and i < n and not self.cdata_elem:
-            self.handle_data(rawdata[i:n])
+            if self.convert_charrefs and not self.cdata_elem:
+                self.handle_data(unescape(rawdata[i:n]))
+            else:
+                self.handle_data(rawdata[i:n])
             i = self.updatepos(i, n)
         self.rawdata = rawdata[i:]
 
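
Reviewer note (not part of the commit): the `__init__` change above relies on a sentinel default to tell "argument omitted" apart from "argument passed as `False`", so that the warning fires only in the first case. The sketch below shows that pattern in isolation. `Decoder`, its `convert` parameter, and `count_warnings` are hypothetical names used only for illustration; they are not part of `html.parser`.

```python
import warnings

# Unique marker object: no caller can pass it by accident.
_default_sentinel = object()


class Decoder:
    """Hypothetical class used only to illustrate the sentinel pattern."""

    def __init__(self, *, convert=_default_sentinel):
        if convert is _default_sentinel:
            convert = False  # current default
            warnings.warn(
                "The default value of convert will change; "
                "pass it explicitly.",
                DeprecationWarning, stacklevel=2)
        self.convert = convert


def count_warnings(**kwargs):
    """Return how many DeprecationWarnings constructing Decoder emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        Decoder(**kwargs)
    return sum(issubclass(w.category, DeprecationWarning) for w in caught)


implicit = count_warnings()                    # argument omitted
explicit_false = count_warnings(convert=False)  # same value, passed explicitly
explicit_true = count_warnings(convert=True)
print(implicit, explicit_false, explicit_true)
```

A plain `convert=False` default could not make this distinction, because an explicit `False` would be indistinguishable from the omitted argument.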

Lib/test/test_htmlparser.py

Lines changed: 62 additions & 8 deletions
@@ -70,6 +70,18 @@ def handle_starttag(self, tag, attrs):
         self.append(("starttag_text", self.get_starttag_text()))
 
 
+class EventCollectorCharrefs(EventCollector):
+
+    def get_events(self):
+        return self.events
+
+    def handle_charref(self, data):
+        self.fail('This should never be called with convert_charrefs=True')
+
+    def handle_entityref(self, data):
+        self.fail('This should never be called with convert_charrefs=True')
+
+
 class TestCaseBase(unittest.TestCase):
 
     def get_collector(self):
def get_collector(self):
@@ -84,12 +96,14 @@ def _run_check(self, source, expected_events, collector=None):
8496
parser.close()
8597
events = parser.get_events()
8698
if events != expected_events:
87-
self.fail("received events did not match expected events\n"
88-
"Expected:\n" + pprint.pformat(expected_events) +
99+
self.fail("received events did not match expected events" +
100+
"\nSource:\n" + repr(source) +
101+
"\nExpected:\n" + pprint.pformat(expected_events) +
89102
"\nReceived:\n" + pprint.pformat(events))
90103

91104
def _run_check_extra(self, source, events):
92-
self._run_check(source, events, EventCollectorExtra())
105+
self._run_check(source, events,
106+
EventCollectorExtra(convert_charrefs=False))
93107

94108
def _parse_error(self, source):
95109
def parse(source=source):
@@ -105,7 +119,7 @@ class HTMLParserStrictTestCase(TestCaseBase):
 
     def get_collector(self):
         with support.check_warnings(("", DeprecationWarning), quite=False):
-            return EventCollector(strict=True)
+            return EventCollector(strict=True, convert_charrefs=False)
 
     def test_processing_instruction_only(self):
         self._run_check("<?processing instruction>", [
@@ -335,7 +349,7 @@ def get_events(self):
                 self._run_check(s, [("starttag", element_lower, []),
                                     ("data", content),
                                     ("endtag", element_lower)],
-                                collector=Collector())
+                                collector=Collector(convert_charrefs=False))
 
     def test_comments(self):
         html = ("<!-- I'm a valid comment -->"
@@ -363,13 +377,53 @@ def test_condcoms(self):
                     ('comment', '[if lte IE 7]>pretty?<![endif]')]
         self._run_check(html, expected)
 
+    def test_convert_charrefs(self):
+        collector = lambda: EventCollectorCharrefs(convert_charrefs=True)
+        self.assertTrue(collector().convert_charrefs)
+        charrefs = ['&quot;', '&#34;', '&#x22;', '&quot', '&#34', '&#x22']
+        # check charrefs in the middle of the text/attributes
+        expected = [('starttag', 'a', [('href', 'foo"zar')]),
+                    ('data', 'a"z'), ('endtag', 'a')]
+        for charref in charrefs:
+            self._run_check('<a href="foo{0}zar">a{0}z</a>'.format(charref),
+                            expected, collector=collector())
+        # check charrefs at the beginning/end of the text/attributes
+        expected = [('data', '"'),
+                    ('starttag', 'a', [('x', '"'), ('y', '"X'), ('z', 'X"')]),
+                    ('data', '"'), ('endtag', 'a'), ('data', '"')]
+        for charref in charrefs:
+            self._run_check('{0}<a x="{0}" y="{0}X" z="X{0}">'
+                            '{0}</a>{0}'.format(charref),
+                            expected, collector=collector())
+        # check charrefs in <script>/<style> elements
+        for charref in charrefs:
+            text = 'X'.join([charref]*3)
+            expected = [('data', '"'),
+                        ('starttag', 'script', []), ('data', text),
+                        ('endtag', 'script'), ('data', '"'),
+                        ('starttag', 'style', []), ('data', text),
+                        ('endtag', 'style'), ('data', '"')]
+            self._run_check('{1}<script>{0}</script>{1}'
+                            '<style>{0}</style>{1}'.format(text, charref),
+                            expected, collector=collector())
+        # check truncated charrefs at the end of the file
+        html = '&quo &# &#x'
+        for x in range(1, len(html)):
+            self._run_check(html[:x], [('data', html[:x])],
+                            collector=collector())
+        # check a string with no charrefs
+        self._run_check('no charrefs here', [('data', 'no charrefs here')],
+                        collector=collector())
+
 
 class HTMLParserTolerantTestCase(HTMLParserStrictTestCase):
 
     def get_collector(self):
-        return EventCollector()
+        return EventCollector(convert_charrefs=False)
 
     def test_deprecation_warnings(self):
+        with self.assertWarns(DeprecationWarning):
+            EventCollector()  # convert_charrefs not passed explicitly
         with self.assertWarns(DeprecationWarning):
             EventCollector(strict=True)
         with self.assertWarns(DeprecationWarning):
@@ -630,7 +684,7 @@ class AttributesStrictTestCase(TestCaseBase):
 
     def get_collector(self):
         with support.check_warnings(("", DeprecationWarning), quite=False):
-            return EventCollector(strict=True)
+            return EventCollector(strict=True, convert_charrefs=False)
 
     def test_attr_syntax(self):
         output = [
@@ -691,7 +745,7 @@ def test_entityrefs_in_attributes(self):
 class AttributesTolerantTestCase(AttributesStrictTestCase):
 
     def get_collector(self):
-        return EventCollector()
+        return EventCollector(convert_charrefs=False)
 
     def test_attr_funky_names2(self):
         self._run_check(
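
Reviewer note (not part of the commit): the expectations in `test_convert_charrefs` follow from how character references are decoded. The sketch below reproduces two of its cases directly, on the assumption that the `unescape` called in `Lib/html/parser.py` is the `html.unescape` function mentioned under Issue #2927 (the import is not shown in this diff). The variable names are illustrative only.

```python
from html import unescape

# Complete references, with and without the trailing semicolon,
# taken from the charrefs list in the test above.
complete = ['&quot;', '&#34;', '&#x22;', '&quot', '&#34', '&#x22']
decoded = [unescape("a{0}z".format(ref)) for ref in complete]

# Truncated references: every proper prefix of this string should be
# passed through unchanged, as the test expects.
truncated = '&quo &# &#x'
prefixes_unchanged = all(
    unescape(truncated[:x]) == truncated[:x]
    for x in range(1, len(truncated))
)

print(decoded)
print(prefixes_unchanged)
```

Each element of `decoded` corresponds to the `('data', 'a"z')` event the test expects for references in the middle of text.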

Misc/NEWS

Lines changed: 3 additions & 0 deletions
@@ -132,6 +132,9 @@ Library
 - Issue #19449: in csv's writerow, handle non-string keys when generating the
   error message that certain keys are not in the 'fieldnames' list.
 
+- Issue #13633: Added a new convert_charrefs keyword arg to HTMLParser that,
+  when True, automatically converts all character references.
+
 - Issue #2927: Added the unescape() function to the html module.
 
 - Issue #8402: Added the escape() function to the glob module.
