pFad - Phone/Frame/Anonymizer/Declutterfier! Saves Data!

seanthegeek · 2026-03-11T13:59:35Z

ASM Lexer Fixes

Note: This patch was written by Claude Sonnet 4.6 via the Claude Code CLI. I know the Pygments project may be cautious about LLM-generated contributions, and I'd genuinely welcome feedback on the quality of this work — both the code itself and how well it follows Pygments conventions. I'm using this as a real-world test of how well Claude handles a non-trivial open-source contribution task given a detailed prompt. Any review comments, even harsh ones, are appreciated.

Issues Fixed

Issue	Fix
#1231 — `sp` matched inside `sprintf@plt`	Added negative lookahead `(?![a-zA-Z0-9_])` on register patterns; `<symbol@plt>` now tokenized as `Comment.Special`
#728 — `%define` not recognized when indented	Preprocessor directive rule now uses `^\s*` anchor with explicit directive list
Missing registers	Added `rip`, `eip`, `ip`, `sil`, `dil`, `bpl`, `spl`, `rflags`, `eflags`, `flags`, `mxcsr`, `gdtr`, `ldtr`, `idtr`, `bnd0–bnd3`, `cr5–cr15`, `k0–k7` (AVX-512 opmask)
Missing directives	Added `ALIGNB`, `FLOAT`, `INCBIN`, `ISTRUC`, `IEND`, `AT`

New Test Snippets

tests/snippets/nasm/registers_extended.txt — extended register tests
tests/snippets/nasm/preproc_indented.txt — indented preprocessor directives
tests/snippets/nasm/objdump_plt.txt — <symbol@plt> disassembly annotations
tests/snippets/nasm/directives_extended.txt — extended assembler directives

Results

Error token count: 0
pytest: 5209 passed, 16 skipped (4 new tests added)
ruff: clean

Original prompt (verbatim)

Pygments Lexer: NASM (Netwide Assembler)

Task

Fix the existing NASM (Netwide Assembler) lexer in Pygments. Work inside my local fork of the pygments/pygments repo on a separate branch.

Official references

MANDATORY: Before writing or modifying the lexer, you MUST fetch and read every
URL in this list. This is not background reading — it is a required prerequisite
step. Fetch each page, extract the keywords or function names, and verify them
against the lexer before declaring any work complete.

NASM documentation: https://www.nasm.us/doc/
NASM instruction reference: https://www.nasm.us/doc/nasmdocb.html
NASM preprocessor: https://www.nasm.us/doc/nasmdoc4.html
NASM directives: https://www.nasm.us/doc/nasmdoc7.html
NASM expressions: https://www.nasm.us/doc/nasmdoc3.html
x86/x86-64 instruction reference: https://www.felixcloutier.com/x86/
Pygments issue Fix NasmLexer to support syntax like <sprintf@plt> #1231 (sprintf@plt bug): Fix NasmLexer to support syntax like <sprintf@plt> #1231
Pygments issue NASM lexer: Macros with whitespace before it are not recognized #728 (macro whitespace bug): NASM lexer: Macros with whitespace before it are not recognized #728

Pygments references

Write your own lexer: https://pygments.org/docs/lexerdevelopment/
Contributing to Pygments: https://pygments.org/docs/contributing/
Builtin tokens: https://pygments.org/docs/tokens/
Available lexers: https://pygments.org/docs/lexers/
Pygments GitHub repo: https://github.com/pygments/pygments
Existing ASM lexers (structural reference): pygments/lexers/asm.py
SQL lexer (structural reference for query languages): pygments/lexers/sql.py

Phase 1: Setup and audit

Confirm you're in the root of a Pygments repo checkout (look for pygments/lexers/, tests/, setup.py).
Run git checkout -b fix/nasm main to create a dedicated branch.
Set up a venv: python -m venv venv && source venv/bin/activate && pip install -e ".[dev]".
Run tox -e py to confirm the existing test suite passes.

Establish a baseline — run the existing lexer against a sample and count Error tokens:

echo '<sample code>' | python -m pygments -l nasm -f html | grep -o 'class="err"' | wc -l

Read the existing lexer end-to-end. Understand the current states, token patterns, and keyword sets.

Known issues to fix

Issue Fix NasmLexer to support syntax like <sprintf@plt> #1231: <sprintf@plt> causes sp inside sprintf to be tokenized as a register.
Issue NASM lexer: Macros with whitespace before it are not recognized #728: %define preceded by whitespace produces Error tokens.
Missing registers: x86-64 extended registers, AVX-512, mask registers may have gaps.
Missing preprocessor directives: Some % directives may not be covered.

Phase 1: Research

Before writing any code, fetch and read the official references listed above.

Do not invent or assume any syntax elements. If something is ambiguous in the docs, web-search to verify before including it.

Phase 2: Fix the lexer

Apply fixes to the existing lexer file.

Review the existing lexer at pygments/lexers/asm.py (the NasmLexer class) and fix:

Register matching greediness: The lexer matches register names like sp inside longer words (e.g., sprintf). Fix by using word boundary anchors or negative lookahead.
Macro whitespace: %define and other preprocessor directives must be recognized even when preceded by whitespace, not just at column 0.
Missing registers: Audit and add any missing x86-64 extended registers, AVX-512 registers, mask registers.
Missing directives: Ensure all NASM preprocessor and assembler directives are covered.
Disassembly compatibility: Consider gracefully handling <symbol@plt> patterns and hex address prefixes.

After each fix, run the tests to confirm no regressions:

tox -e py -- tests/snippets/nasm/

Phase 3: Expand tests

Review and expand the existing test snippets in tests/snippets/nasm/. Add snippets that cover the syntax that was previously broken.

Each snippet file is a .txt file containing source code. Run:

tox -- --update-goldens tests/snippets/nasm/new_test.txt

This auto-populates expected tokens. Review them for correctness, then check them in.

Phase 4: Test and iterate

This is the critical phase. Use pygmentize as the feedback loop.

Run tox -e py. Fix any failures.

Test your lexer on the example file and count Error tokens:

python -m pygments -l nasm -f html tests/examplefiles/nasm/* | grep -o 'class="err"' | wc -l

If there are Error tokens, identify the unmatched text:

python -m pygments -l nasm -f testcase tests/examplefiles/nasm/* | grep "Token.Error"

For each Error token:
a. Identify what syntax element the unmatched text represents.
b. Web-search the official docs to confirm the syntax is valid.
c. Fix the lexer rule.
d. Re-run tox -e py -- tests/snippets/nasm/ to confirm no regressions.
e. Re-test with pygmentize to verify the Error is gone.
Repeat until the Error token count is zero.
Run the full test suite one more time: tox -e py.

Visually inspect the HTML output for sanity:

python -m pygments -l nasm -f html -O full,style=monokai tests/examplefiles/nasm/* > /tmp/preview.html
open /tmp/preview.html  # or xdg-open on Linux

Confirm that keywords, functions, operators, strings, numbers, and comments are each highlighted distinctly.

Phase 5: Finalize

Run tox -e py one final time — full pass, zero failures.
Review the diff: git diff --stat. You should have these files:
- pygments/lexers/asm.py (the fixes)
- tests/snippets/nasm/ (new or updated test snippets)
- Possibly tests/examplefiles/nasm/ (expanded example)
Commit: git add -A && git commit -m "Fix NASM (Netwide Assembler) lexer: <summarize fixes>".
Report what you've done: list the keyword count, function count, token types used, and confirm zero Error tokens.

Constraints (applies to all phases)

No hallucinated syntax. Every keyword, function, operator, and language construct must come from the official documentation listed above. If you're unsure, web-search the docs before adding it.
Follow Pygments conventions exactly. Read existing lexers (especially sql.py and the lexer development guide) for patterns. Use words(), bygroups(), include(), and default() helpers appropriately.
Python code must include type hints and pass ruff linter checks.
The Error token count is the ground truth. tox passing is necessary but not sufficient — you must also have zero Token.Error in both test snippets and example files.
Iterate until clean. Do not declare the task complete until both tox -e py passes AND the Error token count is zero.

…ew test cases for preprocessor and instruction parsing

birkenfeld · 2026-03-11T19:30:45Z

Hi, and thanks for being upfront about the origen of this code.

You're right about people being cautious about AI contribution, especially if - as it appears from your note - you're submitting the AI output without own review.

One think I see at a glance is that both the old and new code include regexes that should nowadays be generated by feeding a word list to words().

…andling; improve regex patterns for clarity

seanthegeek · 2026-03-11T19:51:26Z

Done :)

I guess my point is with all of this is that AI can be very useful for improving code, if you give the right model the right prompt.

Anteru · 2026-03-30T10:56:57Z

Would be also good (besides self-review) to teach Claude how to run the tests/read the contribution guidelines, because tox -e check locally would have captured the CI failure before submission. Not sure why that failed though. How did Claude run the tests but not run the checks?

Enhance NasmLexer with extended register and directive support; add n…

3ad8e2b

…ew test cases for preprocessor and instruction parsing

seanthegeek changed the title ~~NASM lexer fixes~~ ASM lexer fixes Mar 11, 2026

seanthegeek changed the title ~~ASM lexer fixes~~ NASM lexer fixes Mar 11, 2026

Refactor NasmLexer: streamline preprocessor and assembler directive h…

8808abc

…andling; improve regex patterns for clarity

pFad - Phone/Frame/Anonymizer/Declutterfier! Saves Data!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NASM lexer fixes#3059

NASM lexer fixes#3059
seanthegeek wants to merge 2 commits intopygments:masterfrom
seanthegeek:fix/nasm

seanthegeek commented Mar 11, 2026 •

edited

Loading

Uh oh!

birkenfeld commented Mar 11, 2026

Uh oh!

seanthegeek commented Mar 11, 2026

Uh oh!

Anteru commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Pfad - The Proxy pFad © 2024 Your Company Name. All rights reserved.

pFad - Phone/Frame/Anonymizer/Declutterfier! Saves Data!

Conversation

seanthegeek commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

ASM Lexer Fixes

Issues Fixed

New Test Snippets

Results

Pygments Lexer: NASM (Netwide Assembler)

Task

Official references

Pygments references

Phase 1: Setup and audit

Known issues to fix

Phase 1: Research

Phase 2: Fix the lexer

Phase 3: Expand tests

Phase 4: Test and iterate

Phase 5: Finalize

Constraints (applies to all phases)

Uh oh!

birkenfeld commented Mar 11, 2026

Uh oh!

seanthegeek commented Mar 11, 2026

Uh oh!

Anteru commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Pfad - The Proxy pFad © 2024 Your Company Name. All rights reserved.

seanthegeek commented Mar 11, 2026 •

edited

Loading