Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string


I am starting out with some Python tasks and am facing a problem while using gensim. I am trying to load files from my disk and process them (split them and lowercase them).

The code I have is below:

import glob
import os

from gensim import corpora

dictionary_arr = []
for file_path in glob.glob(os.path.join(path, '*.txt')):
    with open(file_path, "r") as myfile:
        text = myfile.read()
        for words in text.lower().split():
            dictionary_arr.append(words)
dictionary = corpora.Dictionary(dictionary_arr)

The list (dictionary_arr) contains all the words across all the files; I then use gensim's corpora.Dictionary to process the list. However, I get an error:

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

I can't understand what the problem is. A little guidance would be appreciated.

In dictionary.py, the initializer is:

def __init__(self, documents=None):
    self.token2id = {} # token -> tokenId
    self.id2token = {} # reverse mapping for token2id; only formed on request, to save memory
    self.dfs = {} # document frequencies: tokenId -> in how many documents this token appeared

    self.num_docs = 0 # number of documents processed
    self.num_pos = 0 # total number of corpus positions
    self.num_nnz = 0 # total number of non-zeroes in the BOW matrix

    if documents is not None:
        self.add_documents(documents)

The add_documents function builds the dictionary from a collection of documents, where each document is a list of tokens:

def add_documents(self, documents):

    for docno, document in enumerate(documents):
        if docno % 10000 == 0:
            logger.info("adding document #%i to %s" % (docno, self))
        _ = self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids
    logger.info("built %s from %i documents (total %i corpus positions)" %
                 (self, self.num_docs, self.num_pos))

So, if you initialize Dictionary this way, you must pass a collection of documents, not a single document. For example,

dic = corpora.Dictionary([a.split()])

is OK.


Dictionary needs tokenized strings (lists of tokens) as its input:

dataset = ['driving car ',
           'drive car carefully',
           'student and university']

# be sure to split sentence before feed into Dictionary
dataset = [d.split() for d in dataset]

vocab = Dictionary(dataset)


Hello everyone, I ran into the same problem. This is what worked for me:

    # Tokenize the sentence into words
    tokens = sentence.split()

    # Create the dictionary (note the list wrapper: one document)
    dictionary = corpora.Dictionary([tokens])
    print(dictionary)


Comments
  • Hi wyq10, I tried the approach and it seems to work; however, there is a small problem: the count (frequency) of every token in the dictionary stays at 1, even though many tokens occur more than once.