Apply CountVectorizer to column with list of words in rows in Python
I built a preprocessing step for text analysis. After removing stopwords and stemming like this:
test[col] = test[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x)
               if ps.stem(item) not in stop_words])
train[col] = train[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x)
               if ps.stem(item) not in stop_words])
I've got a column where each row is a list of "cleaned words". Here are 3 rows from the column:
['size']
['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers', 'lined', 'bubble', 'wrap', 'protection', 'self', 'sealing', 'peelandseal', 'adhesive', 'keeps', 'contents', 'secure', 'tamper', 'proof', 'durable', 'lightweight', 'kraft', 'material', 'helps', 'save', 'postage', 'approved', 'ups', 'fedex', 'usps']
['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']
I now want to apply CountVectorizer to this column:
from sklearn.feature_extraction.text import CountVectorizer

# max_features keeps only the 1500 most frequent words
cv = CountVectorizer(max_features=1500, analyzer='word', lowercase=False)
X_train = cv.fit_transform(train[col])
But I got an error:
TypeError: expected string or bytes-like object
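The failure is reproducible on a toy column of token lists: CountVectorizer's default word analyzer runs a regex over each document, and `re.findall` on a list raises exactly this TypeError. A minimal sketch (the rows below are invented stand-ins for the preprocessed column):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Rows that are lists of tokens, mimicking the preprocessed column.
rows = [['size'], ['brand', 'new', 'coach', 'bag']]

cv = CountVectorizer(analyzer='word', lowercase=False)
try:
    cv.fit_transform(rows)  # the analyzer expects each row to be one string
except TypeError as e:
    print(e)  # -> expected string or bytes-like object ...
```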
It would be a bit strange to build a string from each list only to have CountVectorizer split it apart again.
As I found no other way to avoid the error, I joined the lists in the column:
train[col] = train[col].apply(lambda x: " ".join(x))
test[col] = test[col].apply(lambda x: " ".join(x))
Only after that did I start getting the result:
X_train = cv.fit_transform(train[col])
# note: get_feature_names() was renamed get_feature_names_out() in scikit-learn 1.0
X_train = pd.DataFrame(X_train.toarray(), columns=cv.get_feature_names())
When you use fit_transform, the parameter passed in has to be an iterable of strings or bytes-like objects. It looks like you should be applying it over your column instead:
X_train = train[col].apply(lambda x: cv.fit_transform(x))
You can read the scikit-learn docs for fit_transform for more details.
Your input should be a list of strings or bytes-like objects, but in this case you seem to be providing a list of lists.
It looks like you already tokenized your strings into tokens, held in separate lists. What you can do is a hack like the one below:
inp = [['size'],
       ['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers', 'lined', 'bubble', 'wrap',
        'protection', 'self', 'sealing', 'peelandseal', 'adhesive', 'keeps', 'contents',
        'secure', 'tamper', 'proof', 'durable', 'lightweight', 'kraft', 'material',
        'helps', 'save', 'postage', 'approved', 'ups', 'fedex', 'usps'],
       ['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']]
inp = ["<some_space>".join(x) for x in inp]
vectorizer = CountVectorizer(tokenizer=lambda x: x.split("<some_space>"), analyzer="word")
vectorizer.fit_transform(inp)
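An alternative that skips the join-and-split round trip entirely: since the rows are already tokenized, you can pass a callable as the analyzer so CountVectorizer consumes each token list as-is. A minimal sketch (the docs below are invented; this relies on CountVectorizer accepting a callable analyzer, which bypasses its own tokenization):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [['size'],
        ['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']]

# analyzer=lambda x: x treats each row as a ready-made token sequence,
# so no joining, preprocessing, or re-tokenizing happens.
cv = CountVectorizer(analyzer=lambda x: x)
X = cv.fit_transform(docs)
print(X.shape)  # 2 documents x 8 unique tokens; 'coach' counted twice in row 2
```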