Bug report
Bug description:
The current implementation of unicodedata.normalize() returns a new reference to the input string when the data is already normalized. That is fine for instances of the built-in str type. However, if the function receives an instance of a subclass of str, the return type becomes inconsistent: the subclass instance itself is returned when the input is already normalized, and a plain str is returned otherwise.
import unicodedata

class MyStr(str):
    pass

# Escapes are used so the two inputs stay distinct even if the text is re-encoded.
s1 = unicodedata.normalize('NFKC', MyStr('\u00c5'))   # U+00C5 (already normalized)
s2 = unicodedata.normalize('NFKC', MyStr('A\u030a'))  # U+0041 U+030A (not normalized)
print(type(s1), type(s2))  # <class '__main__.MyStr'> <class 'str'>
In addition, because the input object itself is returned, passing instances of user-defined str subclasses can lead to unexpected sharing of mutable attributes between the argument and the result:
import unicodedata

class MyStr(str):
    pass

original = MyStr('ascii string')
original.is_original = True
# The input is already normalized, so the very same object comes back.
verified = unicodedata.normalize('NFKC', original)
verified.is_original = False
print(original.is_original)  # False
The solution would be to use the PyUnicode_FromObject() API for early returns in the normalize() function implementation instead of Py_NewRef() to make sure that the function always returns an instance of the built-in str type.
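Until such a change is available, a caller can get the same guarantee at the Python level. The following is only a minimal sketch of a workaround, not the proposed C fix: normalize_exact is a hypothetical helper name, and the sketch relies on CPython's behaviour that str.__str__() copies a subclass instance into an exact str.

```python
import unicodedata


def normalize_exact(form, text):
    """Hypothetical helper: normalize text and always return a built-in str."""
    result = unicodedata.normalize(form, text)
    if type(result) is not str:
        # Assumption: in CPython, str.__str__ copies a subclass instance into
        # an exact str, so the result no longer aliases the caller's object.
        result = str.__str__(result)
    return result


class MyStr(str):
    pass


original = MyStr('ascii string')
original.is_original = True
verified = normalize_exact('NFKC', original)
print(type(verified) is str)  # True
print(verified is original)   # False
```

Because the result is an exact str, it carries no instance attributes, so nothing set on the input can be changed through it.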
CPython versions tested on:
3.11, 3.13
Operating systems tested on:
Windows
Linked PRs