
URL: http://github.com/python/cpython/commit/bd4bd3e76a684969022c00aafb8acf18006ac89b


Commit bd4bd3e

gh-152100: Support set operations in character classes (GH-152153)
Implement set difference [A--B], intersection [A&&B] and union [A||B] in regular expression character classes (Unicode Technical Standard #18), including nested, complemented and compound set operands. Symmetric difference [A~~B] remains reserved.

Also use the new syntax in the standard library (_strptime, textwrap, doctest, pkgutil).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent a6c2d4a commit bd4bd3e

9 files changed

Lines changed: 324 additions & 162 deletions


Doc/library/re.rst

Lines changed: 34 additions & 12 deletions
@@ -279,25 +279,47 @@ The special characters are:
      ``[]()[{}]`` will match a right bracket, as well as left bracket, braces,
      and parentheses.
 
-   .. .. index:: single: --; in regular expressions
-   .. .. index:: single: &&; in regular expressions
-   .. .. index:: single: ~~; in regular expressions
-   .. .. index:: single: ||; in regular expressions
-
-   * Support of nested sets and set operations as in `Unicode Technical
-     Standard #18`_ might be added in the future. This would change the
-     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
-     in ambiguous cases for the time being.
-     That includes sets starting with a literal ``'['`` or containing literal
-     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
-     avoid a warning escape them with a backslash.
+   .. index::
+      single: --; in regular expressions
+      single: &&; in regular expressions
+      single: ||; in regular expressions
+
+   * A character set may contain a nested set written in square brackets, and
+     two sets may be combined with a set operator, as in `Unicode Technical
+     Standard #18`_:
+
+     * ``[A--B]`` (*difference*) matches a character that is in *A* but not
+       in *B*; for example ``[a-z--[aeiou]]`` matches an ASCII lowercase
+       consonant.
+     * ``[A&&B]`` (*intersection*) matches a character that is in both *A*
+       and *B*; for example ``[\w&&[a-z]]`` matches an ASCII lowercase letter.
+     * ``[A||B]`` (*union*) matches a character that is in *A* or in *B*; this
+       is the same as listing the members of both sets in a single set, but
+       allows combining nested sets.
+
+     Operators have no precedence and are applied from left to right. To
+     group, write a nested set as the operand after an operator, as in
+     ``[a-z--[aeiou]]``. A leading ``'^'`` complements the whole result.
+     A ``'['`` begins a nested set only immediately after a set operator;
+     anywhere else -- including at the start of a character set -- it is an
+     ordinary character, so existing patterns keep their meaning. Escape it
+     as ``'\['`` to include a literal ``'['`` right after an operator.
 
    .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/
 
+   .. note::
+
+      Symmetric difference (``A~~B``) is not yet supported; a literal ``'~~'``
+      in a character set still raises a :exc:`FutureWarning`.
+
    .. versionchanged:: 3.7
       :exc:`FutureWarning` is raised if a character set contains constructs
       that will change semantically in the future.
 
+   .. versionchanged:: next
+      Added support for nested sets and the set operators ``--``, ``&&``
+      and ``||``.
+
 .. index:: single: | (vertical bar); in regular expressions
 
 ``|``
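
The documented semantics can be approximated on interpreters that do not include this commit. The sketch below is an illustration, not code from the commit: it uses lookaheads and plain character sets as stand-ins for the three operators, and models the left-to-right rule with Python sets. All names (`DIFFERENCE`, `INTERSECTION`, `UNION`, `left_to_right`) are hypothetical, and the combined pattern in the last comment is an example constructed from the rules stated in the diff.

```python
import re
import string

# Classic-syntax stand-ins; the new spelling from the diff is in each comment.
DIFFERENCE = re.compile(r'(?![aeiou])[a-z]')   # new: [a-z--[aeiou]]
INTERSECTION = re.compile(r'(?=[a-z])\w')      # new: [\w&&[a-z]]
UNION = re.compile(r'[a-z0-9]')                # new: [a-z||[0-9]]

consonants = ''.join(DIFFERENCE.findall('hello world'))
lowercase = INTERSECTION.findall('aB3_')
alnum = ''.join(UNION.findall('Ab1-c2'))

# Left-to-right evaluation modelled with Python sets (illustration only):
# [a-z--[aeiou]||[0-9]] would read as (A - B) | C.
A = set(string.ascii_lowercase)
B = set('aeiou')
C = set(string.digits)
left_to_right = (A - B) | C
```

The lookahead forms only match what the set operators describe when each operand is a single-character class, which is the case throughout this commit.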

Doc/whatsnew/3.16.rst

Lines changed: 12 additions & 0 deletions
@@ -181,6 +181,18 @@ os
   (Contributed by Maurycy Pawłowski-Wieroński in :gh:`149464`.)
 
 
+re
+--
+
+* :mod:`re` now supports set operations and nested sets in character classes,
+  as described in `Unicode Technical Standard #18
+  <https://unicode.org/reports/tr18/>`__: set difference (``[A--B]``),
+  intersection (``[A&&B]``) and union (``[A||B]``), where an operand may be a
+  nested set written in square brackets. For example, ``[a-z--[aeiou]]``
+  matches an ASCII lowercase consonant.
+  (Contributed by Serhiy Storchaka in :gh:`152100`.)
+
+
 shlex
 -----
 

Lib/_strptime.py

Lines changed: 1 addition & 1 deletion
@@ -238,7 +238,7 @@ def __calc_date_time(self):
             current_format = current_format.replace(tz, "%Z")
             # Transform all non-ASCII digits to digits in range U+0660 to U+0669.
             if not current_format.isascii() and self.LC_alt_digits is None:
-                current_format = re_sub(r'\d(?<![0-9])',
+                current_format = re_sub(r'[\d--0-9]',
                                         lambda m: chr(0x0660 + int(m[0])),
                                         current_format)
             for old, new in replacement_pairs:
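
To see what this substitution does, the removed pattern can be run on its own. The sketch below uses the pre-commit spelling `\d(?<![0-9])`, which works on current interpreters; the helper name `to_arabic_indic` and the sample string are hypothetical, and the new spelling `[\d--0-9]` is expected to select the same characters only where this commit is present.

```python
import re

def to_arabic_indic(text):
    # Match any Unicode decimal digit, then look behind to reject ASCII 0-9,
    # so only non-ASCII digits are rewritten into U+0660..U+0669.
    return re.sub(r'\d(?<![0-9])',
                  lambda m: chr(0x0660 + int(m[0])),
                  text)

# ASCII "12" is left alone; Devanagari 2 and 0 become Arabic-Indic 2 and 0.
converted = to_arabic_indic('12 \u0968\u0966')
```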

Lib/doctest.py

Lines changed: 1 addition & 1 deletion
@@ -1768,7 +1768,7 @@ def check_output(self, want, got, optionflags):
                           '', want)
             # If a line in got contains only spaces, then remove the
             # spaces.
-            got = re.sub(r'(?m)^[^\S\n]+$', '', got)
+            got = re.sub(r'(?m)^[\s--\n]+$', '', got)
             if got == want:
                 return True
 
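
Both spellings describe "whitespace other than a newline". A minimal sketch of the effect, using the removed double-negative form `[^\S\n]` so that it runs on current interpreters (the helper name `blank_whitespace_only_lines` and the sample text are hypothetical):

```python
import re

def blank_whitespace_only_lines(text):
    # [^\S\n] is "not non-whitespace and not newline", i.e. whitespace minus
    # '\n'; the commit writes the same set as [\s--\n].
    return re.sub(r'(?m)^[^\S\n]+$', '', text)

cleaned = blank_whitespace_only_lines('a\n   \nb  \n\t \n')
```

Lines that contain only spaces or tabs are emptied, while trailing whitespace on a non-blank line such as `'b  '` is left untouched.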

Lib/pkgutil.py

Lines changed: 1 addition & 1 deletion
@@ -443,7 +443,7 @@ def resolve_name(name, *, strict=False):
     within the imported package to get to the desired object.
     """
     global _LENIENT_PATTERN, _STRICT_PATTERN
-    dotted_words = r'(?!\d)(\w+)(\.(?!\d)(\w+))*'
+    dotted_words = r'([\w--\d]\w*)(\.([\w--\d]\w*))*'
     if strict:
         if _STRICT_PATTERN is None:
             _STRICT_PATTERN = re.compile(
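
The rewrite moves the "must not start with a digit" check from a lookahead into the first character of each word. A rough way to check that the two forms agree, assuming `[^\W\d]` as a classic-syntax stand-in for the new `[\w--\d]` (the names `OLD`, `STAND_IN` and `samples` are hypothetical):

```python
import re

OLD = r'(?!\d)(\w+)(\.(?!\d)(\w+))*'
# [^\W\d] is the classic spelling of "word character that is not a digit",
# used here in place of the new [\w--\d].
STAND_IN = r'([^\W\d]\w*)(\.([^\W\d]\w*))*'

samples = ['os.path', 'a.b1.c', '1a.b', 'a.1b', 'a..b', '_x.y_2']
old_results = [bool(re.fullmatch(OLD, s)) for s in samples]
new_results = [bool(re.fullmatch(STAND_IN, s)) for s in samples]
```

Both patterns keep the same group numbering, so code that reads the match groups should not need to change.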
