There is a problem that involves JavaScript and regular expressions. By default, the JS implementation of regexp does not handle Unicode properly: for example, the regular expression /\b\S+\b/g will not count words written in many national alphabets and scripts, such as Cyrillic, Greek and Devanagari (Hindi). The culprit is not \S, which matches any non-whitespace character, but \b: a word boundary is defined in terms of \w, and \w is restricted to the ASCII characters [A-Za-z0-9_].
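The failure is easy to reproduce. The snippet below is a minimal sketch; `naiveCount` and `NAIVE_PATTERN` are hypothetical names chosen for this illustration. Because \b only exists next to an ASCII word character, a purely Cyrillic string has no boundary positions at all and yields no matches.

```javascript
// The original pattern: a run of non-whitespace between two word boundaries.
const NAIVE_PATTERN = /\b\S+\b/g;

// naiveCount is a hypothetical helper used only for this illustration.
function naiveCount(text) {
  // String.prototype.match returns null when a global regex finds nothing.
  const matches = text.match(NAIVE_PATTERN);
  return matches === null ? 0 : matches.length;
}

console.log(naiveCount("one two three")); // 3
console.log(naiveCount("два три"));       // 0, no \b position exists
console.log(naiveCount("one два"));       // 1, only "one" is found
```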
To solve this problem we must explicitly include the non-ASCII characters. My solution is to use the regular expression /([\u0080-\uFFFF\w]\u0027?)+/g instead. It covers a wide range of Unicode characters (from U+0080 to U+FFFF), which includes the letters of most national alphabets, plus the apostrophe (U+0027). Note that this range also contains non-ASCII punctuation such as ’ (U+2019) and ′ (U+2032), so those characters are treated as part of a word. I tested this regex at https://regexr.com with the following sample text, which includes words from several alphabets: it counts all 55 words accurately, ignoring the ASCII special characters and punctuation.
One Jack's two "three" 'happy' Кав′ярня два три нав'язливо ένα δύο τρία âêîôû,O’Brien/ëïü,är en häst बर्ताव करना चाहिए।.!@#$%^&*()-+=<>?~`|\{}[],.?
Jess' pencils are sharp.
Direct speech in British English - 'That,' he said, 'is nonsense.'
Direct speech in American English - "What time will he arrive?" she asked.
What does 'integrated circuit' mean?
Rock 'n' Roll
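
The counting itself can be sketched in a few lines of JavaScript. This is only an illustration of the approach described above: the regular expression is the one from the text, while `countWords` and `WORD_PATTERN` are hypothetical names, not part of any library.

```javascript
// Pattern from the text: a code unit in U+0080..U+FFFF or an ASCII word
// character, each optionally followed by an apostrophe (U+0027), repeated.
const WORD_PATTERN = /([\u0080-\uFFFF\w]\u0027?)+/g;

// countWords is a hypothetical helper name used for this sketch.
function countWords(text) {
  // With the g flag, match returns every match, or null if there is none.
  const matches = text.match(WORD_PATTERN);
  return matches === null ? 0 : matches.length;
}

console.log(countWords("Jess' pencils are sharp.")); // 4
console.log(countWords("Rock 'n' Roll"));            // 3
console.log(countWords("два три"));                  // 2
console.log(countWords("!@#$%"));                    // 0
```

A trailing apostrophe stays attached to its word (`Jess'`, `n'`), while a leading one is skipped, so quoted words are still counted once.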