Tokenization: splitting a string into segments and storing them in a list.
Tokenization can be based on whitespace alone, or also on hyphens or any other punctuation.
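As a quick illustration (a minimal sketch using Python's re module, with a made-up sample string): splitting on whitespace alone keeps hyphenated words together, while splitting on whitespace plus hyphens breaks them apart.

import re

s = "PETRO-VEND FUEL"
print(s.split())                # whitespace only: ['PETRO-VEND', 'FUEL']
print(re.split(r'[\s\-]+', s))  # whitespace or hyphen: ['PETRO', 'VEND', 'FUEL']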
Bag of words representation
Bag of words: count the number of times a particular token appears in a document.
The name comes from the analogy of counting how many times each word was pulled out of a bag.
This approach discards information about word order.
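To see the order-discarding in action, here is a minimal sketch on a toy corpus (the strings are made up for illustration): two documents containing the same words in different orders get identical count vectors.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["red red blue", "blue red red"]
vec = CountVectorizer()
counts = vec.fit_transform(corpus)
print(vec.get_feature_names_out())  # ['blue' 'red']
print(counts.toarray())             # [[1 2]
                                    #  [1 2]]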
A better approach is to use n-grams, which count sequences of adjacent tokens and so preserve some local word order; a sketch follows.
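A minimal sketch of n-grams with CountVectorizer (the toy strings are made up): setting ngram_range=(1, 2) counts both single tokens and adjacent word pairs, so features like 'red blue' and 'blue red' stay distinct.

from sklearn.feature_extraction.text import CountVectorizer

vec_ngrams = CountVectorizer(ngram_range=(1, 2))
vec_ngrams.fit(["red blue fish", "blue red fish"])
print(vec_ngrams.get_feature_names_out())
# ['blue' 'blue fish' 'blue red' 'fish' 'red' 'red blue' 'red fish']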
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the token pattern: match alphanumeric runs followed by whitespace
TOKENS_ALPHANUMERIC = r'[A-Za-z0-9]+(?=\s+)'

# Fill missing values in df.Position_Extra
df['Position_Extra'] = df['Position_Extra'].fillna('')

# Instantiate the CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Fit to the data
vec_alphanumeric.fit(df.Position_Extra)

# Print the number of tokens and first 15 tokens
# (get_feature_names_out replaces get_feature_names, removed in scikit-learn 1.2)
msg = "There are {} tokens in Position_Extra if we split on non-alphanumeric"
print(msg.format(len(vec_alphanumeric.get_feature_names_out())))
print(vec_alphanumeric.get_feature_names_out()[:15])
In this exercise, you'll complete the function definition combine_text_columns(). When completed, this function will convert all training text data in your DataFrame to a single string per row that can be passed to the vectorizer object and made into a bag-of-words using the .fit_transform() method.
# Define combine_text_columns()
def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Converts all text in each row of data_frame to a single string """
    # Drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)

    # Replace nans with blanks
    text_data.fillna('', inplace=True)

    # Join all text items in a row, separated by spaces
    return text_data.apply(lambda x: " ".join(x), axis=1)
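As a quick check, here is a hypothetical toy DataFrame (the column names and the NUMERIC_COLUMNS/LABELS constants below are stand-ins for the exercise's setup): the numeric and label columns are dropped, the NaN becomes a blank, and each row collapses to one space-joined string.

import pandas as pd

NUMERIC_COLUMNS = ['FTE']  # assumed setup constants, for illustration only
LABELS = ['label']
toy = pd.DataFrame({'Position_Extra': ['KINDERGARTEN', None],
                    'Job_Title_Description': ['TEACHER', 'BUS DRIVER'],
                    'FTE': [1.0, 0.5],
                    'label': ['Instruction', 'Transportation']})
print(combine_text_columns(toy).tolist())
# ['KINDERGARTEN TEACHER', ' BUS DRIVER']  (note the leading space from the blank)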
Now you will use combine_text_columns to convert all training text data in your DataFrame to a single string per row that can be passed to the vectorizer object and made into a bag-of-words using the .fit_transform() method.
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the basic token pattern
TOKENS_BASIC = r'\S+(?=\s+)'

# Create the alphanumeric token pattern
TOKENS_ALPHANUMERIC = r'[A-Za-z0-9]+(?=\s+)'

# Instantiate basic CountVectorizer: vec_basic
vec_basic = CountVectorizer(token_pattern=TOKENS_BASIC)

# Instantiate alphanumeric CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Create the text vector
text_vector = combine_text_columns(df)

# Fit and transform vec_basic
vec_basic.fit_transform(text_vector)

# Print number of tokens of vec_basic
print("There are {} tokens in the dataset".format(len(vec_basic.get_feature_names_out())))

# Fit and transform vec_alphanumeric
vec_alphanumeric.fit_transform(text_vector)

# Print number of tokens of vec_alphanumeric
print("There are {} alpha-numeric tokens in the dataset".format(len(vec_alphanumeric.get_feature_names_out())))
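To make the two token patterns concrete, here is a minimal sketch with re.findall (the sample string is made up). Both patterns use a lookahead for trailing whitespace, so only tokens followed by whitespace are kept; the alphanumeric pattern additionally rejects any run that bumps into punctuation, which is why 'PETRO' disappears entirely rather than being split at the hyphen.

import re

s = 'PETRO-VEND FUEL AND FLUIDS '
print(re.findall(r'\S+(?=\s+)', s))           # ['PETRO-VEND', 'FUEL', 'AND', 'FLUIDS']
print(re.findall(r'[A-Za-z0-9]+(?=\s+)', s))  # ['VEND', 'FUEL', 'AND', 'FLUIDS']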