I'm trying to remove punctuation tokens from tokenized text using regular expressions. Can anyone explain the following behaviour?
$ STRING='hey , you ! what " are you doing ? say ... ," what '
$ echo $STRING | sed -r 's/ [^[:alnum:][:space:]-]+ / /g;'
hey you what are you doing say ," what
$ echo $STRING | sed -r 's/ [[:punct:]]+ / /g;'
hey you what are you doing say ," what
$ echo $STRING | perl -pe 's/ [^[:alnum:][:space:]-]+ / /g;'
hey you what are you doing say ," what
$ echo $STRING | perl -pe 's/ [[:punct:]]+ / /g;'
hey you what are you doing say ," what
In all four cases the ," token is preserved in the output, which I don't want. The token itself can be matched when I target it directly:
$ echo $STRING | perl -pe 's/ [",]+ / /g;'
hey you ! what are you doing ? say ... what