Sunday, 28 June 2015

Regex to remove punctuation from tokenized text

I'm trying to remove punctuation tokens from tokenized text using regular expressions. Can anyone explain the following behaviour?

$ STRING='hey , you ! what " are you doing ? say ... ," what '
$ echo $STRING | sed -r 's/ [^[:alnum:][:space:]-]+ / /g;'
hey you what are you doing say ," what
$ echo $STRING | sed -r 's/ [[:punct:]]+ / /g;'
hey you what are you doing say ," what
$ echo $STRING | perl -pe 's/ [^[:alnum:][:space:]-]+ / /g;'
hey you what are you doing say ," what
$ echo $STRING | perl -pe 's/ [[:punct:]]+ / /g;'
hey you what are you doing say ," what

In all four cases the ," token is preserved in the output, which I don't want. The token itself can be matched with:

$ echo $STRING | perl -pe 's/ [",]+ / /g;'
hey you ! what are you doing ? say ... what
