pFad - Phone/Frame/Anonymizer/Declutterfier! Saves Data!


--- a PPN by Garber Painting Akron. With Image Size Reduction included!

URL: http://github.com/bazelbuild/bazel/pull/28909

ets/global-94620c216484da1f.css" /> starlark interpreter: Fix string.splitlines() incorrectly splitting UTF8 characters by H5-O5 · Pull Request #28909 · bazelbuild/bazel · GitHub
Skip to content

starlark interpreter: Fix string.splitlines() incorrectly splitting UTF8 characters#28909

Open
H5-O5 wants to merge 1 commit intobazelbuild:masterfrom
H5-O5:master
Open

starlark interpreter: Fix string.splitlines() incorrectly splitting UTF8 characters#28909
H5-O5 wants to merge 1 commit intobazelbuild:masterfrom
H5-O5:master

Conversation

@H5-O5
Copy link

@H5-O5 H5-O5 commented Mar 5, 2026

Description

string.splitlines() should not split on u+0085 (NEL) in UTF-8 as byte array mode.

Motivation

Fixes #28885

Build API Changes

  • Does this PR affect the Build API? (e.g. Starlark API, providers, command-line flags, native rules)

  • Is the change backward compatible?

    • No.
  • If it's a breaking change, what is the migration plan?

    • This specific case could be an exception where no migration plan is needed. Since:
        1. previous behavior on a UTF-8 file is clearly a bug.
        1. it should be pretty rare for this to affect other encodings as well.

Checklist

  • I have added tests for the new use cases (if any).
  • I have updated the documentation (if applicable).

Release Notes

RELNOTES[INC]: string.splitlines() no longer treat u+0085 as a newline character when internal_starlark_utf_8_byte_strings is set (which defaults to true)

@H5-O5 H5-O5 requested review from brandjon and tetromino as code owners March 5, 2026 23:47
@github-actions github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label Mar 5, 2026
# "包" is U+5305. UTF-8 is E5 8C 85.
# If Starlark treats each byte as a Latin1 char, then 85 is \u0085 (NEL).
# Java's Pattern.compile(".*") stops at \u0085.
assert_eq("abc包def".splitlines(), ["abc包def"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a guarded test that verifies the proper splitting behavior with UTF-16 strings (see

for the pattern)? In that mode NEL should probably still be split at.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand.

if _utf8_byte_strings, then we got [\u0061 (a), (b) (c) \u00E5, \u008C, \u0085, (d), (e), (f)], and we should not split on \u0085 because it is part of a UTF-8 code
if not _utf8_byte_strings, then we have [(a) (b) (c) (包) (d) (e) (f)] and still we should not split on any codepoint.

Currently both bazel test //src/test/java/net/starlark/java/eval:ScriptTest_Latin1 and bazel test //src/test/java/net/starlark/java/eval:ScriptTest pass. I don't think we should add _utf8_byte_strings here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed that splitlines is explicitly specced to split on \n, \r, \r\n only. I thought that we would need to continue to split on an actual \u0085 character if not _utf8_byte_strings, but that's not true.

@fmeum
Copy link
Collaborator

fmeum commented Mar 6, 2026

@bazel-io fork 9.1.0

@fmeum
Copy link
Collaborator

fmeum commented Mar 6, 2026

@bazel-io fork 8.7.0

Copy link
Contributor

@tetromino tetromino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the core problem is that . in Java regex doesn't match NEL?

Thank you for the fix!

@tetromino tetromino added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Mar 6, 2026
@tetromino
Copy link
Contributor

I've imported the patch for internal review

@tetromino
Copy link
Contributor

This PR appears to break tests for some non-Bazel users of Starlark (e.g. Copybara specifically). Looking into it...

@iancha1992 iancha1992 added team-Starlark-Interpreter Issues involving the Starlark interpreter used by Bazel and removed awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally labels Mar 6, 2026
@tetromino
Copy link
Contributor

It looks like some Copybara-based scripts internally at Google have come to depend on the incorrect old splitlines() behavior; please wait a bit, I'll need to update them

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team-Starlark-Interpreter Issues involving the Starlark interpreter used by Bazel

Projects

None yet

Development

Successfully merging this pull request may close these issues.

splitlines() incorrectly splitting UTF-8 characters under Latin1 byte-string mode

4 participants

pFad - Phonifier reborn

Pfad - The Proxy pFad © 2024 Your Company Name. All rights reserved.





Check this box to remove all script contents from the fetched content.



Check this box to remove all images from the fetched content.


Check this box to remove all CSS styles from the fetched content.


Check this box to keep images inefficiently compressed and original size.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy