name: js-regex-cjk-word-boundary
description: |
  Fix JavaScript regex failures when matching CJK (Chinese/Japanese/Korean) text using \b
  word boundaries. Use when: (1) Regex pattern with \b silently fails to match Japanese,
  Chinese, or Korean text, (2) Pattern works for Latin/ASCII text but not CJK,
  (3) hasDriverLicenceCue or similar text-detection function returns false for CJK input
  despite correct characters, (4) inferIssuerFromTitleText or pattern-matching functions
  fail on non-Latin scripts. Root cause: JavaScript \b only recognizes [a-zA-Z0-9_] as
  "word characters". CJK characters are classified as \W (non-word), so a position between
  a CJK character and whitespace, punctuation, the string edge, or another CJK character
  is never a \b boundary and the pattern fails to match.
author: Claude Code
version: 1.0.0
date: 2026-03-09
JavaScript Regex: \b Word Boundary Fails with CJK Characters
Problem
JavaScript's \b (word boundary) assertion silently fails when used with CJK
(Chinese/Japanese/Korean) characters. The regex compiles without error and runs without
throwing, but it simply never matches CJK text that should match. This is particularly
insidious because:
- No error is thrown; .test() just returns false
- The same pattern works perfectly for Latin/ASCII text
- The CJK characters in the pattern are correct (verified by direct string comparison)
Context / Trigger Conditions
- Regex pattern like /\b\u904B\u8EE2\u514D\u8A31\u8A3C\b/ (Japanese: 運転免許証) returns false on text containing the exact characters
- Pattern like /\b\uC6B4\uC804\uBA74\uD5C8\uC99D\b/ (Korean: 운전면허증) fails similarly
- Text-detection functions (e.g., hasDriverLicenceCue(), inferIssuerFromTitleText()) return incorrect results for CJK input while working correctly for all Latin-script patterns
- Any regex using \b boundaries around non-ASCII Unicode characters (also affects Cyrillic, Arabic, Thai, etc.)
Root Cause
JavaScript's \b matches only at a position where exactly one side is a "word character"
(\w = [a-zA-Z0-9_]) and the other side is a "non-word character" (\W) or the edge of the
string. CJK characters are classified as \W. When \b appears before a CJK character, it
therefore needs a \w character immediately to the left. If the left side is whitespace,
punctuation, the start of the string, or another CJK character, both sides count as
non-word, and \b does NOT match there. The same reasoning applies to a \b placed after
the last CJK character of the pattern.
// This FAILS — \b doesn't work with CJK
/\b\u904B\u8EE2\u514D\u8A31\u8A3C\b/.test("運転免許証") // false!
// This WORKS — no word boundaries
/\u904B\u8EE2\u514D\u8A31\u8A3C/.test("運転免許証") // true
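The classification described above can be checked directly. The following is a minimal sketch; the helper name isAsciiWordChar and the sample strings are illustrative assumptions, not part of the original code. It also shows the counterintuitive case where the bounded pattern does match: when the CJK term is glued to ASCII letters.

```javascript
// Hypothetical helper: reports whether a single character counts as \w.
function isAsciiWordChar(ch) {
  return /^\w$/.test(ch);
}

const term = "\u904B\u8EE2\u514D\u8A31\u8A3C"; // 運転免許証

console.log(isAsciiWordChar("A"));      // true
console.log(isAsciiWordChar("\u904B")); // false: 運 is \W

// \b needs a \w character on exactly one side of the position.
const bounded = /\b\u904B\u8EE2\u514D\u8A31\u8A3C\b/;

console.log(bounded.test(term));             // false: string edge and CJK are both non-word
console.log(bounded.test(" " + term + " ")); // false: space and CJK are both \W
console.log(bounded.test("A" + term + "B")); // true: ASCII letters create a \w/\W transition
```

The last line is the reverse of what a word boundary is normally meant to express, which is another reason to avoid \b around CJK text.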
Solution
Quick Fix: Remove \b from CJK patterns
Simply remove \b assertions from any regex pattern that matches CJK characters:
// Before (broken):
[/\b\u904B\u8EE2\u514D\u8A31\u8A3C\b/, "JAPAN"], // 運転免許証
[/\b\uC6B4\uC804\uBA74\uD5C8\uC99D\b/, "KOREA"], // 운전면허증
// After (working):
[/\u904B\u8EE2\u514D\u8A31\u8A3C/, "JAPAN"], // 運転免許証
[/\uC6B4\uC804\uBA74\uD5C8\uC99D/, "KOREA"], // 운전면허증
Better Fix: Use Unicode-aware boundaries (if available)
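Unicode property escapes such as \p{P} require the u flag (ES2018) or the v flag (ES2024), and lookbehind also requires ES2018. Where these are available, you can require whitespace, punctuation, or a string edge on both sides of the term:
// Unicode-aware approach (requires the u or v flag and lookbehind support):
/(?<=^|[\s\p{P}])\u904B\u8EE2\u514D\u8A31\u8A3C(?=$|[\s\p{P}])/u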
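To illustrate how this delimiter-based pattern behaves, here is a small sketch run against a few sample strings. The samples are invented for illustration, and the sketch uses the u flag, which is sufficient for \p{P}. Note the last case: Japanese and Chinese are usually written without spaces, so a term embedded in running text is not matched by this approach.

```javascript
// Term must be delimited by a string edge, whitespace, or punctuation.
const delimited = /(?<=^|[\s\p{P}])\u904B\u8EE2\u514D\u8A31\u8A3C(?=$|[\s\p{P}])/u;

// Illustrative inputs (assumptions, not taken from real documents).
const samples = [
  "運転免許証",            // the whole string is the term
  "ID: 運転免許証 (JP)",   // delimited by spaces
  "「運転免許証」",        // delimited by CJK corner brackets (punctuation)
  "私の運転免許証です",    // embedded in running text, flanked by letters
];

const results = samples.map((s) => delimited.test(s));
console.log(results); // [ true, true, true, false ]
```

Whether the final false is acceptable depends on the input: for titles and labels it may be, while for free-running sentences the plain substring pattern is likely the safer choice.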
Alternative: Manual boundary with lookbehind/lookahead
// Manual boundary that works with CJK:
/(?<!\p{L})\u904B\u8EE2\u514D\u8A31\u8A3C(?!\p{L})/u
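If several terms need the same treatment, the lookaround can be built programmatically. The sketch below assumes two hypothetical helpers, escapeRegExp and wholeTerm, which are not part of the original code. It rejects a match only when a Unicode letter is directly adjacent, so it suits space-separated scripts such as Korean or Cyrillic better than unsegmented Japanese or Chinese text.

```javascript
// Hypothetical helper: escape regex syntax characters in a literal term.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: match `term` only when no Unicode letter touches it.
function wholeTerm(term) {
  return new RegExp(`(?<!\\p{L})${escapeRegExp(term)}(?!\\p{L})`, "u");
}

const korean = wholeTerm("운전면허증");
console.log(korean.test("운전면허증"));     // true
console.log(korean.test("국제운전면허증")); // false: preceded by the letter 제

const russian = wholeTerm("права");
console.log(russian.test("Мои права.")); // true: space before, period after
console.log(russian.test("правами"));    // false: followed by the letter м
```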
For mixed Latin/CJK pattern arrays
When you have an array of patterns where some are Latin and some are CJK, use \b only for
Latin patterns:
const patterns = [
  [/\bDRIVER'?S?\s+LICEN[CS]E\b/, "match"], // Latin: \b works
  [/\bFÜHRERSCHEIN\b/, "match"], // Latin+diacritics: \b works because the pattern starts and ends with ASCII letters (Ü is \W, but it is not next to a \b)
  [/\u904B\u8EE2\u514D\u8A31\u8A3C/, "match"], // CJK: no \b needed
  [/\uC6B4\uC804\uBA74\uD5C8\uC99D/, "match"], // CJK: no \b needed
];
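A mixed array like this is typically consumed by a small lookup loop. The following sketch shows one possible shape; the function name inferIssuer, the constant ISSUER_PATTERNS, and the "GERMANY" label are assumptions made for illustration and do not reproduce the original inferIssuerFromTitleText.

```javascript
// Hypothetical pattern table: \b only where the pattern edge is an ASCII letter.
const ISSUER_PATTERNS = [
  [/\bF\u00DCHRERSCHEIN\b/, "GERMANY"],            // FÜHRERSCHEIN, edges are F and N
  [/\u904B\u8EE2\u514D\u8A31\u8A3C/, "JAPAN"],     // 運転免許証, no \b
  [/\uC6B4\uC804\uBA74\uD5C8\uC99D/, "KOREA"],     // 운전면허증, no \b
];

// Hypothetical lookup: returns the first matching label, or null.
function inferIssuer(text) {
  // Normalize so precomposed and decomposed accents compare equal, then
  // uppercase so the Latin patterns can stay case-sensitive.
  const upper = text.normalize("NFC").toUpperCase();
  for (const [pattern, issuer] of ISSUER_PATTERNS) {
    if (pattern.test(upper)) return issuer;
  }
  return null;
}

console.log(inferIssuer("F\u00FChrerschein Klasse B")); // "GERMANY"
console.log(inferIssuer("日本国 運転免許証"));           // "JAPAN"
console.log(inferIssuer("Passport"));                    // null
```

Uppercasing leaves CJK text unchanged, so the same normalized string can be tested against both kinds of pattern.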
Verification
// Test that CJK matching works:
console.log(/\u904B\u8EE2\u514D\u8A31\u8A3C/.test("運転免許証")); // true
console.log(/\uC6B4\uC804\uBA74\uD5C8\uC99D/.test("운전면허증")); // true
// Without \b the pattern also matches inside a longer CJK run (usually fine for CJK):
console.log(/\u904B\u8EE2\u514D\u8A31\u8A3C/.test("別の運転免許証テスト")); // true
Example
Real-world case from a document-type detection function:
function hasDriverLicenceCue(text) {
  const combined = text.toUpperCase();
  return (
    /\bDRIVER'?S?\s+LICEN[CS]E\b/.test(combined) || // English
    /\bFÜHRERSCHEIN\b/.test(combined) || // German
    /\bPERMIS DE CONDUIRE\b/.test(combined) || // French
    /\u904B\u8EE2\u514D\u8A31\u8A3C/.test(combined) || // Japanese (no \b!)
    /\uC6B4\uC804\uBA74\uD5C8\uC99D/.test(combined) // Korean (no \b!)
  );
}
Notes
- This affects every character outside [a-zA-Z0-9_], not just CJK: Cyrillic, Arabic, Thai, Devanagari, and also accented Latin letters such as Ü when they sit directly next to the \b.
- The u (unicode) flag does NOT fix this: \b behavior with \w/\W is unchanged.
- Removing \b from CJK patterns is generally safe because CJK characters are unlikely to appear as substrings within other words in unrelated contexts.
- The regexp-cjk npm package provides CJK-aware regex utilities if you need more sophisticated matching.
- When debugging: if a regex works for "DRIVER'S LICENCE" but not "運転免許証", suspect \b first.
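The debugging tip can be partly automated. The sketch below is a rough heuristic, not a full regex parser: the helper name hasSuspectWordBoundary is an assumption, and it only inspects the pattern source for a \b placed directly next to a non-ASCII literal or a \uXXXX escape above U+007F.

```javascript
// A non-ASCII literal, or a \uXXXX escape whose value is above U+007F.
const NON_ASCII = String.raw`(?:[^\x00-\x7F]|\\u(?!00[0-7])[0-9A-Fa-f]{4})`;

// \b immediately before or immediately after such a character.
const SUSPECT = new RegExp(String.raw`\\b${NON_ASCII}|${NON_ASCII}\\b`);

// Hypothetical lint helper: true when a \b in the pattern is likely ineffective.
function hasSuspectWordBoundary(regex) {
  return SUSPECT.test(regex.source);
}

console.log(hasSuspectWordBoundary(/\b\u904B\u8EE2\u514D\u8A31\u8A3C\b/)); // true
console.log(hasSuspectWordBoundary(/\b\u00DCBER\b/));                      // true: Ü touches \b
console.log(hasSuspectWordBoundary(/\bF\u00DCHRERSCHEIN\b/));              // false: edges are ASCII
console.log(hasSuspectWordBoundary(/\bDRIVER'?S?\s+LICEN[CS]E\b/));        // false
console.log(hasSuspectWordBoundary(/\u904B\u8EE2\u514D\u8A31\u8A3C/));     // false: no \b at all
```

Because it works on the source text, the check can miss cases such as a \b next to a character class or group, so treat a false result as "nothing obvious" rather than proof that the pattern is safe.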