Build a Simple Diacritics Remover in JavaScript (Step-by-Step)

Removing diacritics (accents, cedillas, tildes, etc.) from text is a common task when normalizing input for search, matching, sorting, URL slugs, or simple ASCII-only storage. This tutorial walks through several practical approaches in JavaScript: built-in Unicode normalization, a mapping table, and a small npm-friendly utility. Each approach includes code, trade-offs, and usage suggestions so you can pick what fits your needs.
Why remove diacritics?
- Improves search and matching by making “résumé” match “resume”.
- Simplifies generation of slugs and filenames.
- Helps systems that only support ASCII characters.
Approach 1 — Use String.prototype.normalize() + regex (recommended for most cases)
JavaScript’s Unicode normalization can decompose characters into base letters plus combining marks. Removing the combining marks leaves the base ASCII (or non-accented) characters.
Example:
function removeDiacriticsNormalize(input) {
  // NFD decomposes combined letters into base letter + combining diacritic marks,
  // which can then be stripped with a regex over the combining-marks range.
  return input.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Usage
console.log(removeDiacriticsNormalize('résumé — São Paulo — Voilà'));
// "resume — Sao Paulo — Voila"
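To make the decomposition step concrete, here is what NFD does on its own (assuming the 'é' below is stored as the single precomposed code point U+00E9):

console.log('é'.length);                                            // 1 (precomposed U+00E9)
console.log('é'.normalize('NFD').length);                           // 2 ('e' + combining acute U+0301)
console.log('é'.normalize('NFD').replace(/[\u0300-\u036f]/g, ''));  // "e"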
Pros:
- Very short and fast for most Latin-script use-cases.
- No external dependencies.
Cons:
- Doesn’t convert letters that are distinct characters rather than letter + combining mark: Polish ł, German ß, Danish/Norwegian ø, and ligatures like œ have no canonical decomposition, so normalization leaves them as they are (a quick check after this list illustrates this); they need special handling.
- For full ASCII-only conversion you may want additional substitutions (e.g., “œ” → “oe”, “ß” → “ss”).
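A quick check with the removeDiacriticsNormalize function from above shows the limitation:

// None of these have a canonical decomposition, so NFD leaves them untouched
console.log(removeDiacriticsNormalize('ß')); // "ß"
console.log(removeDiacriticsNormalize('ł')); // "ł"
console.log(removeDiacriticsNormalize('œ')); // "œ"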
Approach 2 — Normalize + small post-processing map (balanced coverage)
Combine normalization with a small mapping table for characters that normalization doesn’t split into base + combining marks (ligatures, special letters).
Example:
const EXTRA_MAP = {
  'ß': 'ss',
  'Æ': 'AE', 'æ': 'ae',
  'Œ': 'OE', 'œ': 'oe',
  'Ø': 'O',  'ø': 'o',
  'Ł': 'L',  'ł': 'l'
  // add other special cases you need
};

function removeDiacriticsWithMap(input) {
  const normalized = input.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
  // Sweep the extended Latin range (U+00C0..U+024F) and substitute anything in the extras map
  return normalized.replace(/[\u00C0-\u024F]/g, (ch) => EXTRA_MAP[ch] || ch);
}

// Usage
console.log(removeDiacriticsWithMap('straße, Œuvre, Łódź'));
// "strasse, OEuvre, Lodz"
Pros:
- Handles common special-cases while keeping code small.
- Gives predictable ASCII outputs for commonly problematic characters.
Cons:
- You must maintain the map for any additional characters you want to convert.
- Map-based replacements may miss rare characters.
Approach 3 — Full mapping table (highest control)
If you need exact conversion for many languages, build or use a comprehensive mapping table covering the Latin Extended ranges. This method is fully deterministic: the output depends only on your table, not on the engine's Unicode decomposition behavior.
Example (truncated):
const FULL_MAP = {
  'À':'A','Á':'A','Â':'A','Ã':'A','Ä':'A','Å':'A','Ā':'A','Ă':'A','Ą':'A',
  'à':'a','á':'a','â':'a','ã':'a','ä':'a','å':'a','ā':'a','ă':'a','ą':'a',
  'Ç':'C','ç':'c','Ć':'C','ć':'c','Č':'C','č':'c',
  // ... many more entries
};

function removeDiacriticsFullMap(input) {
  return input.split('').map(ch => FULL_MAP[ch] || ch).join('');
}
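Usage has the same shape as the earlier helpers. A hypothetical example using only the entries shown above (any character missing from the table passes through unchanged):

// Only characters present in FULL_MAP are converted; everything else is kept as-is
console.log(removeDiacriticsFullMap('Ćevapčići à la carte'));
// "Cevapcici a la carte"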
Pros:
- Total control over every mapped character.
- Useful for critical systems where deterministic mapping is required.
Cons:
- Large data structure (increases bundle size).
- Time-consuming to build and maintain.
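If you do go this route, you do not have to hand-type every entry. A small one-off script (a sketch, not a complete solution) can bootstrap the table from NFD normalization, leaving only the special cases for manual review:

// One-off script: bootstrap a mapping table for the Latin-1 Supplement through
// Latin Extended-B range (U+00C0..U+024F). Review the output by hand and add
// the entries normalization cannot produce (ß, œ, ø, ł, ...).
function buildCandidateMap() {
  const map = {};
  for (let code = 0x00C0; code <= 0x024F; code++) {
    const ch = String.fromCodePoint(code);
    const stripped = ch.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
    // Keep only characters that actually lost their marks and became plain ASCII letters
    if (stripped !== ch && /^[A-Za-z]+$/.test(stripped)) {
      map[ch] = stripped;
    }
  }
  return map;
}

console.log(buildCandidateMap()); // e.g. { 'À': 'A', 'Á': 'A', ..., 'ž': 'z' }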
Approach 4 — Use a tiny library (quickest for production)
If you prefer not to write and maintain mapping data, use a small, well-tested library such as diacritics or remove-accents from npm. Example (adjust the import style to your module system):
npm install remove-accents
import removeAccents from 'remove-accents';

console.log(removeAccents('résumé — São Paulo'));
// "resume — Sao Paulo"
Pros:
- Saves development time.
- Libraries usually cover many edge cases.
Cons:
- Adds a dependency and slightly increases bundle size.
- Verify maintenance and licensing before using.
Performance notes
- normalize('NFD').replace(...) is very fast in modern engines for typical strings.
- Full mapping via split/map/join is slightly slower but predictable.
- For large-scale processing (millions of strings), benchmark options in your environment and consider server-side batch normalization.
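If you want to measure this yourself, here is a rough micro-benchmark sketch (assuming a recent Node.js or browser where performance.now() is available globally, and that the three functions from the earlier approaches are in scope):

// Crude micro-benchmark: run each candidate over the same inputs many times.
// Results vary by engine, string length, and character mix; treat them as indicative only.
const samples = ['résumé', 'São Paulo', 'Łódź', 'straße', 'plain ascii text'];

function bench(label, fn, iterations = 100000) {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    for (const s of samples) fn(s);
  }
  console.log(`${label}: ${(performance.now() - start).toFixed(1)} ms`);
}

bench('normalize + regex', removeDiacriticsNormalize);
bench('normalize + map', removeDiacriticsWithMap);
bench('full map', removeDiacriticsFullMap);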
Tests and edge cases to consider
- Ligatures: œ → oe, æ → ae.
- Language-specific letters: ß → ss, ł → l.
- Characters outside Latin script: Cyrillic, Greek, and Arabic base letters pass through untouched, but note that NFD stripping also removes combining marks in those scripts (e.g., Greek ά → α, Cyrillic й → и); decide whether that is acceptable or restrict stripping to Latin input.
- Combining marks beyond U+036F (rare): Unicode has further combining-mark blocks (e.g., U+1AB0–U+1AFF, U+1DC0–U+1DFF); consider extending the regex if you encounter them.
- Unicode normalization availability: modern browsers and Node.js support it; very old environments might lack it.
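Several of these cases translate directly into assertions you can keep next to the utility. A sketch using Node's built-in assert module, assuming removeDiacriticsWithMap from Approach 2 is in scope:

import assert from 'node:assert';

// Ligatures and special letters handled by the extras map
assert.strictEqual(removeDiacriticsWithMap('œuvre'), 'oeuvre');
assert.strictEqual(removeDiacriticsWithMap('straße'), 'strasse');
assert.strictEqual(removeDiacriticsWithMap('Łódź'), 'Lodz');

// Cyrillic without combining marks passes through unchanged...
assert.strictEqual(removeDiacriticsWithMap('Москва'), 'Москва');
// ...but NFD stripping does remove combining marks in non-Latin scripts too (see the note above)
assert.strictEqual(removeDiacriticsWithMap('й'), 'и');

console.log('all diacritics tests passed');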
Putting it together — a practical utility
A compact utility that uses normalization plus a small extras map, suitable for most web apps:
const EXTRA_MAP = {
  'ß': 'ss',
  'Æ': 'AE', 'æ': 'ae',
  'Œ': 'OE', 'œ': 'oe',
  'Ø': 'O',  'ø': 'o',
  'Ł': 'L',  'ł': 'l'
};

export function removeDiacritics(input) {
  if (!input) return input;
  const normalized = input.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
  // Map remaining special letters in the Latin-1 Supplement..Latin Extended-B range
  return normalized.replace(/[\u00C0-\u024F]/g, ch => EXTRA_MAP[ch] || ch);
}
Use this in forms, slug generators, search normalization, or anywhere you need consistent ASCII-like text.
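For example, a small slug generator built on top of it (a sketch; the module path is hypothetical and the allowed-character rules are up to you):

import { removeDiacritics } from './remove-diacritics.js'; // hypothetical module path

// Turn arbitrary titles into lowercase, ASCII-only, hyphen-separated slugs
export function slugify(title) {
  return removeDiacritics(title)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')  // collapse runs of non-alphanumerics into one hyphen
    .replace(/^-+|-+$/g, '');     // trim leading/trailing hyphens
}

console.log(slugify('Crème brûlée: São Paulo edition'));
// "creme-brulee-sao-paulo-edition"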
Final recommendations
- For most cases: use normalize('NFD') + regex and add a tiny map for special characters.
- If you need broad, maintained coverage and don’t mind a dependency: use a lightweight npm package.
- If you must control every mapping (legal/localization constraints): build a full mapping table and include tests.