starlark interpreter: Fix string.splitlines() incorrectly splitting UTF8 characters#28909
starlark interpreter: Fix string.splitlines() incorrectly splitting UTF8 characters#28909H5-O5 wants to merge 1 commit intobazelbuild:masterfrom
Conversation
| # "包" is U+5305. UTF-8 is E5 8C 85. | ||
| # If Starlark treats each byte as a Latin1 char, then 85 is \u0085 (NEL). | ||
| # Java's Pattern.compile(".*") stops at \u0085. | ||
| assert_eq("abc包def".splitlines(), ["abc包def"]) |
There was a problem hiding this comment.
Could you add a guarded test that verifies the proper splitting behavior with UTF-16 strings (see
for the pattern)? In that modeNEL should probably still be split at.
There was a problem hiding this comment.
I don't understand.
if _utf8_byte_strings, then we got [\u0061 (a), (b) (c) \u00E5, \u008C, \u0085, (d), (e), (f)], and we should not split on \u0085 because it is part of a UTF-8 code
if not _utf8_byte_strings, then we have [(a) (b) (c) (包) (d) (e) (f)] and still we should not split on any codepoint.
Currently both bazel test //src/test/java/net/starlark/java/eval:ScriptTest_Latin1 and bazel test //src/test/java/net/starlark/java/eval:ScriptTest pass. I don't think we should add _utf8_byte_strings here.
There was a problem hiding this comment.
Sorry, I missed that splitlines is explicitly specced to split on \n, \r, \r\n only. I thought that we would need to continue to split on an actual \u0085 character if not _utf8_byte_strings, but that's not true.
|
@bazel-io fork 9.1.0 |
|
@bazel-io fork 8.7.0 |
tetromino
left a comment
There was a problem hiding this comment.
So the core problem is that . in Java regex doesn't match NEL?
Thank you for the fix!
|
I've imported the patch for internal review |
|
This PR appears to break tests for some non-Bazel users of Starlark (e.g. Copybara specifically). Looking into it... |
|
It looks like some Copybara-based scripts internally at Google have come to depend on the incorrect old |
Description
string.splitlines() should not split on u+0085 (NEL) in UTF-8 as byte array mode.
Motivation
Fixes #28885
Build API Changes
Does this PR affect the Build API? (e.g. Starlark API, providers, command-line flags, native rules)
Is the change backward compatible?
If it's a breaking change, what is the migration plan?
Checklist
Release Notes
RELNOTES[INC]: string.splitlines() no longer treat u+0085 as a newline character when internal_starlark_utf_8_byte_strings is set (which defaults to true)