Accurate Word Counter for non-Latin characters in JavaScript regex

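There is a problem with how JavaScript regular expressions handle Unicode. For example, the regular expression /\b\S+\b/g will not count words written in many national alphabets and scripts, such as Cyrillic, Greek and Devanagari (used for Hindi). The culprit is not \S, which matches any non-whitespace character, but \b: a word boundary is defined in terms of \w, which is restricted to ASCII letters (the Latin letters of the English alphabet), digits and the underscore. A word made up entirely of other characters contains no position that \b can match, so the whole pattern fails on it.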
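The failure is easy to reproduce. The snippet below is a minimal sketch; the helper name countMatches and the sample strings are assumptions chosen for illustration, not part of the original test.

```javascript
// The naive pattern: non-whitespace runs delimited by word boundaries.
const naive = /\b\S+\b/g;

// Hypothetical helper: number of matches of a global regex in a string.
// String.prototype.match returns null when nothing matches, hence the fallback.
const countMatches = (text, re) => (text.match(re) || []).length;

console.log(countMatches("hello world", naive)); // 2
console.log(countMatches("привет мир", naive));  // 0, because \b never matches next to Cyrillic letters
```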

To solve this problem we must explicitly include the non-ASCII characters. My solution is to use the regular expression /([\u0080-\uFFFF\w]\u0027?)+/g instead. It covers a wide range of Unicode characters (from U+0080 to U+FFFF), which includes the national alphabets of the Basic Multilingual Plane, plus the apostrophe (U+0027). Keep in mind that this range also contains non-ASCII punctuation, such as « » and the em dash, so those characters are treated as word characters; characters above U+FFFF are not covered. I tested this regex at https://regexr.com with a sample text that includes words from several alphabets, and it counted all 55 words accurately, ignoring special characters and punctuation.
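Wrapped in a function, the regex above can be used as a word counter. This is a sketch under assumptions: the names WORD_RE and countWords and the sample strings are hypothetical, and the strings are not the 55-word text from the original test.

```javascript
// The regex from the post: any character in U+0080..U+FFFF or \w,
// each optionally followed by an apostrophe (U+0027), repeated.
const WORD_RE = /([\u0080-\uFFFF\w]\u0027?)+/g;

// Hypothetical helper: count the words in a string.
function countWords(text) {
  const matches = text.match(WORD_RE); // null when there are no matches
  return matches ? matches.length : 0;
}

console.log(countWords("Привет, мир! It's καλημέρα.")); // 4
console.log(countWords(""));                             // 0

// Non-ASCII punctuation falls inside the range, so a standalone
// em dash (U+2014) is counted as a word of its own.
console.log(countWords("да \u2014 нет"));                // 3
```

Because the apostrophe is only allowed directly after a word character, "It's" is counted as one word, while a leading quote mark is skipped.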

Artem Nagornyi's SDET recipes
