NLP

Tokenization

  • Tokenization splits a string into smaller segments (tokens) and stores them in a list.

  • Tokens can be split on white space, hyphens, or any punctuation.
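A minimal sketch of both splitting strategies, using only the standard library (the sample sentence and the regex are illustrative, not from the exercise):

```python
import re

text = "state-of-the-art NLP, done simply!"

# Tokenize on white space only: punctuation stays attached to words
whitespace_tokens = text.split()

# Tokenize on white space, hyphens, or any punctuation character,
# dropping the empty strings that re.split can produce
tokens = [t for t in re.split(r"[\s\-]+|[^\w\s]", text) if t]

print(whitespace_tokens)  # ['state-of-the-art', 'NLP,', 'done', 'simply!']
print(tokens)             # ['state', 'of', 'the', 'art', 'NLP', 'done', 'simply']
```

Note how the choice of split pattern changes the tokens: the hyphenated compound becomes one token or four.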

Bag of words representation

  • Counting the number of times a particular token appears is called a bag of words.

  • It counts the number of times a word was pulled out of the bag.

  • This approach discards information about word order.
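The bullets above can be sketched with a plain token counter (standard library only; the documents are made up for illustration):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# One bag of words per document: token -> count, word order discarded
bags = [Counter(doc.split()) for doc in docs]

print(bags[0]["the"])  # 2
print(bags[1]["sat"])  # 1
```

Because only counts survive, "the cat sat on the mat" and "the mat sat on the cat" produce identical bags.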

A better approach is to use n-grams: sequences of n adjacent tokens, which preserve some local word order.
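A small sketch of how n-grams are built by sliding a window over the token list (the helper name and example phrase are illustrative):

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "red or blue pill".split()
print(ngrams(tokens, 2))  # [('red', 'or'), ('or', 'blue'), ('blue', 'pill')]
```

In scikit-learn, the same idea is available through `CountVectorizer`'s `ngram_range` parameter.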

In this exercise, you'll complete the function definition combine_text_columns(). When completed, this function will convert all training text data in your DataFrame to a single string per row that can be passed to the vectorizer object and made into a bag-of-words using the .fit_transform() method.

Now you will use combine_text_columns() to convert all training text data in your DataFrame to a single string per row that can be passed to the vectorizer object and made into a bag-of-words using the .fit_transform() method.
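A minimal sketch of what combine_text_columns() does, assuming pandas; the column names here are hypothetical (the actual exercise selects the text columns from the DataFrame's dtypes):

```python
import pandas as pd

# Hypothetical text columns, for illustration only
TEXT_COLS = ["Program_Description", "Job_Title"]

def combine_text_columns(df, text_cols=TEXT_COLS):
    """Join the text columns of each row into one space-separated string."""
    text_data = df[text_cols].fillna("")  # missing text becomes ""
    return text_data.apply(lambda row: " ".join(row).strip(), axis=1)

df = pd.DataFrame({
    "Program_Description": ["after school program", None],
    "Job_Title": ["teacher", "nurse"],
})
print(combine_text_columns(df).tolist())
# ['after school program teacher', 'nurse']
```

The resulting Series of strings is what gets passed to the vectorizer's .fit_transform() method.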
