Why does tokenize leave some punctuation attached?

User 2594 | 12/5/2015, 8:47:10 PM

I noticed that a comma in the text gets left attached to the word it follows, and the same goes for periods, exclamation points, etc., rather than those being treated as tokens in their own right. Can anyone comment on the rationale behind that decision?
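Here's a minimal illustration of the behavior I mean (plain Python string splitting, just an analogy for what I'm seeing, not the library's actual tokenizer):

```
text = "Hello, world! How are you?"
print( text.split() )
# ['Hello,', 'world!', 'How', 'are', 'you?']
```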

Comments

User 2594 | 12/7/2015, 3:38:19 AM

I decided to make my own tokenizer. If anybody cares, here is what I created:

```
import re

IsA = lambda s: '[' + s + ']'
IsNotA = lambda s: '[^' + s + ']'

Upper = IsA( 'A-Z' )
Lower = IsA( 'a-z' )
Letter = IsA( 'a-zA-Z' )
Digit = IsA( '0-9' )
AlphaNumeric = IsA( 'a-zA-Z0-9' )
NotAlphaNumeric = IsNotA( 'a-zA-Z0-9' )

EndOfString = '$'
OR = '|'

ZeroOrMore = lambda s: s + '*'
ZeroOrMoreNonGreedy = lambda s: s + '*?'
OneOrMore = lambda s: s + '+'
OneOrMoreNonGreedy = lambda s: s + '+?'

StartsWith = lambda s: '^' + s
Capture = lambda s: '(' + s + ')'
PreceededBy = lambda s: '(?<=' + s + ')'
FollowedBy = lambda s: '(?=' + s + ')'
NotFollowedBy = lambda s: '(?!' + s + ')'
StopWhen = lambda s: s
CaseInsensitive = lambda s: '(?i:' + s + ')'

# Ordinal suffixes (1st, 2nd, 3rd, 4th, ...)
ST = '(?:st|ST)'
ND = '(?:nd|ND)'
RD = '(?:rd|RD)'
TH = '(?:th|TH)'

Whitespace = r'\s'
NonWhitespace = r'\S'

# Punctuation (anything outside [-.\sa-zA-Z0-9]) is split out as a standalone token
CharacterToken = r'[^-.\sa-zA-Z0-9]'
NonCharacterToken = r'[-.\sa-zA-Z0-9]'

def OneOf( *args ):
    return '(?:' + '|'.join( args ) + ')'

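# Match the shortest possible run of characters ( '(.+?)' ) that ends at one of
# the zero-width boundaries listed below; findall() then yields one token per match.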
pattern = '(.+?)' + \
OneOf(
# Treat consecutive whitespace as a single token
PreceededBy( NonWhitespace ) + FollowedBy( Whitespace ),
PreceededBy( Whitespace ) + FollowedBy( NonWhitespace ),

# Treat a period or an ellipsis (technically one or more dots) as its own token
PreceededBy( IsNotA( '.' ) ) + FollowedBy( IsA( '.' ) ),
PreceededBy( IsA( '.' ) ) + FollowedBy( IsNotA( '.' ) ),

# Treat one or more dashes as one token
PreceededBy( IsNotA( '-' ) ) + FollowedBy( IsA( '-' ) ),
PreceededBy( IsA( '-' ) ) + FollowedBy( IsNotA( '-' ) ),

# Treat all other non-alpha-numerics as their own tokens individually
FollowedBy( CharacterToken ),
PreceededBy( CharacterToken ) + FollowedBy( NonCharacterToken ),

# ABC | !!! - break at whitespace or non-alpha-numeric boundary
PreceededBy( AlphaNumeric ) + FollowedBy( NotAlphaNumeric ),
PreceededBy( NotAlphaNumeric ) + FollowedBy( AlphaNumeric ),

# ABC | Abc - break at what looks like the start of a word or sentence
FollowedBy( Upper + Lower ),

# abc | ABC - break when a lower-case letter is followed by an upper case
PreceededBy( Lower )  + FollowedBy( Upper ),

# abc | 123 - break between words and digits
PreceededBy( Letter ) + FollowedBy( Digit ),

# 1st | oak - recognize when the string starts with an ordinal
PreceededBy( StartsWith( '1' + ST ) ),
PreceededBy( StartsWith( '2' + ND ) ),
PreceededBy( StartsWith( '3' + RD ) ),

# 1st | abc - contains an ordinal
PreceededBy( IsNotA( '1' ) + '1' + ST ),
PreceededBy( IsNotA( '1' ) + '2' + ND ),
PreceededBy( IsNotA( '1' ) + '3' + RD ),
PreceededBy( '1' + IsA( '123' )  + TH ),
PreceededBy( IsA( '04-9' )       + TH ),

# 1 | abcde - recognize when it starts with or contains a non-ordinal digit/letter boundary
PreceededBy( StartsWith( '1' ) ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
PreceededBy( StartsWith( '2' ) ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
PreceededBy( StartsWith( '3' ) ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
PreceededBy( IsNotA( '1' ) + '1' ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
PreceededBy( IsNotA( '1' ) + '2' ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
PreceededBy( IsNotA( '1' ) + '3' ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
PreceededBy( '1' + IsA( '123' ) )  + FollowedBy( Letter ) + NotFollowedBy( TH ),
PreceededBy( IsA( '04-9' ) )       + FollowedBy( Letter ) + NotFollowedBy( TH ),

# abcde | $ - end of the string
FollowedBy( EndOfString )

)

matcher = re.compile( pattern )

def tokenize( s ):
    return matcher.findall( s )
```

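A quick example of the output, with punctuation marks, whitespace runs, and words each coming out as separate tokens:

```
print( tokenize( "Hello, world!" ) )
# ['Hello', ',', ' ', 'world', '!']
```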


User 19 | 12/9/2015, 8:01:02 PM

Thanks for posting, John! This is great and will likely be of use to others.