2019-10-09
Using grep and regex to extract words from a file that contain only one kind of vowel
stackoverflow
Question

I have a large dictionary file that contains one word per line.

I want to extract all lines that contain only one kind of vowel. For the vowel E, for example, "see", "best", "levee", and "whenever" would be extracted, but "like", "house", and "and" would not. I don't mind making several passes over the file, changing the vowel I'm looking for each time.

This command: grep -io '\b[eqwrtzpsdfghjklyxcvbnm]*\b' dictionary.txt

returns no words containing vowels other than E, but it also gives me words with no vowel at all, like BBC or BMW. How can I require the vowel to be present?

Answer

Here is an Awk attempt which collects all the hits in a single pass over the input file, then prints each bucket.

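awk 'BEGIN { split("a:e:i:o:u", vowel, ":")
    # c matches any single lowercase consonant (y counts as a consonant here)
    c = "[b-df-hj-np-tv-z]"
    # Build one alternative per vowel: consonants, the vowel, then any mix of
    # consonants and further copies of that same vowel
    for (v in vowel)
      regex = (regex ? regex "|" : "") "^" c "*" vowel[v] c "*(" vowel[v] c "*)*$" }
    # File each matching word under the one vowel it contains
    $0 ~ regex { for (v in vowel) if ($0 ~ vowel[v]) {
        hit[v] = ( hit[v] ? hit[v] ORS : "") $0
        next } }
    END { for (v in vowel) {
        printf "=== %s ===\n", vowel[v]
        print hit[v] } }' /usr/share/dict/words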

You'll notice that it prints words with a syllabic y, like jolly and cycle, because y is treated as a consonant. A better regex should fix that, though the really thorny cases (like rhyme) need a more sophisticated model of English orthography.

The regex is clumsy because Awk does not support backreferences; an earlier version of this answer contained a simpler regex which would work with grep -E or similar, but it would collect all matches in the same bucket.
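For illustration, here is a rough sketch of what a backreference-based version might look like. It is not the regex from the earlier version of this answer, which is not shown here. It uses plain grep (basic regular expressions), where POSIX specifies backreferences; GNU grep also accepts backreferences with -E as an extension. The function name one_vowel_words and the sample words are made up for the example.

```shell
# Hypothetical helper: reads one word per line on stdin and prints the
# lowercase words that contain exactly one kind of vowel.
one_vowel_words() {
    # \([aeiou]\) captures the first vowel; \1 only lets that same vowel recur.
    # LC_ALL=C keeps the bracket ranges limited to plain ASCII letters.
    LC_ALL=C grep '^[b-df-hj-np-tv-z]*\([aeiou]\)\([b-df-hj-np-tv-z]*\1\)*[b-df-hj-np-tv-z]*$'
}

# Made-up sample input standing in for the real dictionary.
printf '%s\n' see best levee whenever like house and bbc | one_vowel_words
```

With that sample input, this prints see, best, levee, whenever and and, one per line, all in a single stream. That is the trade-off mentioned above: the words are not separated by vowel.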

Demo: https://ideone.com/wNrvPu
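If one pass per vowel is acceptable, as the question suggests, a minimal sketch along the lines of the original grep command could look like the following. It assumes one word per line, so it uses -x to match whole lines instead of -o with \b. The function name only_vowel and the sample words are hypothetical.

```shell
# Hypothetical helper: list the words on stdin whose only vowel is $1.
only_vowel() {
    v=$1
    # Consonant ranges from the Awk answer; y counts as a consonant here.
    c='b-df-hj-np-tv-z'
    # -x anchors the pattern to the whole line, -i ignores case.
    # Requiring one literal vowel in the middle rules out words like BBC.
    LC_ALL=C grep -ix "[$c]*$v[$c$v]*"
}

# Made-up sample input standing in for the real dictionary.
printf '%s\n' see best levee whenever like house and BBC BMW | only_vowel e
```

With that sample input, this prints see, best, levee and whenever. Running the same function once for each of a, e, i, o and u against dictionary.txt would cover every vowel, at the cost of five passes over the file.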
