Appending lists of words or characters from all rows in a dataframe

Is there a way to append the lists stored in different rows of a dataframe without using a 'for' loop?

I can do it with a 'for' loop, but I would like a more efficient approach, ideally one that avoids the loop.

import pandas as pd

d = {'col1': [1,2,3,4,5], 'col2': [['a'],['a','b','c'],['d'],['e'],['a','e','d']]}
df = pd.DataFrame(data=d)
word_list = []
for i in df['col2']:
  word_list = word_list + i

I want to get an output list like this: ['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']

One way to do it is with pandas' sum function:

In [1]: import pandas as pd
   ...: d = {'col1': [1,2,3,4,5], 'col2': [['a'],['a','b','c'],['d'],['e'],['a','e','d']]}
   ...: df = pd.DataFrame(data=d)

In [2]: df['col2'].sum()
Out[2]: ['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']

However, itertools.chain.from_iterable is much faster:

In [3]: import itertools
   ...: list(itertools.chain.from_iterable(df['col2']))
Out[3]: ['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']

In [4]: %timeit df['col2'].sum()
92.7 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit list(itertools.chain.from_iterable(df['col2']))
20.4 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In my testing, itertools.chain.from_iterable can be up to 30x faster for larger dataframes (~1000 rows). Another option is

import functools
import operator

functools.reduce(operator.iadd, df['col2'], [])

which is about as fast as itertools.chain.from_iterable. I made a graph for all of the answers that were posted:

(The x-axis is the length of the dataframe)

As you can see, everything using sum or functools.reduce with operator.add is unusable, with np.concatenate being slightly better. However, the three winners by far are itertools.chain, itertools.chain.from_iterable, and functools.reduce with operator.iadd. They take almost no time. Here is the code used to produce the plot:

import functools
import itertools
import operator
import random
import string

import numpy as np
import pandas as pd
import perfplot # see https://github.com/nschloe/perfplot for this awesome library


def gen_data(n):
    return pd.DataFrame(data={0: [
        [random.choice(string.ascii_lowercase) for _ in range(random.randint(10, 20))]
        for _ in range(n)
    ]})

def pd_sum(df):
    return df[0].sum()

def np_sum(df):
    return np.sum(df[0].values)

def np_concat(df):
    return np.concatenate(df[0]).tolist()

def functools_reduce_add(df):
    return functools.reduce(operator.add, df[0].values)

def functools_reduce_iadd(df):
    return functools.reduce(operator.iadd, df[0], [])

def itertools_chain(df):
    return list(itertools.chain(*(df[0])))

def itertools_chain_from_iterable(df):
    return list(itertools.chain.from_iterable(df[0]))

perfplot.show(
    setup=gen_data,
    kernels=[
        pd_sum,
        np_sum,
        np_concat,
        functools_reduce_add,
        functools_reduce_iadd,
        itertools_chain,
        itertools_chain_from_iterable
    ],
    n_range=[10, 50, 100, 500, 1000, 1500, 2000, 2500, 3000, 4000, 5000],
    equality_check=None
)
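
One caveat about the operator.iadd variant: the empty-list initializer matters. Without it, reduce uses the first row's list as the accumulator, and iadd extends that list in place, so the dataframe itself gets modified. A minimal sketch of the difference, assuming a small toy frame like the one in the question (the variable names are only for illustration):

```python
import functools
import operator

import pandas as pd

df = pd.DataFrame({'col2': [['a'], ['a', 'b', 'c'], ['d']]})

# With an initial [], the accumulator is a fresh list, so the rows stay untouched.
flat = functools.reduce(operator.iadd, df['col2'], [])
rows_after_safe = [list(row) for row in df['col2']]

# Without the initializer, the first row's list becomes the accumulator
# and is extended in place by every following row.
leaky = functools.reduce(operator.iadd, df['col2'])
first_row_after = df['col2'].iloc[0]

print(flat)
print(rows_after_safe)
print(first_row_after)
```

After the second call, the first row of the dataframe holds the whole flattened list, which is rarely what you want. Keeping the [] initializer avoids this.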

I can't find a duplicate. Summing the lists returns a single combined list:

df.col2.sum()

['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']

Or use NumPy:

np.sum(df.col2.values)

Or use numpy.concatenate:

print(np.concatenate(df['col2']).tolist())

Output:

['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']

Another way (just FYI):

from functools import reduce
reduce(lambda x,y: x+y,df.col2.values)

or:

from functools import reduce
import operator
reduce(operator.add,df.col2.values)

#['a', 'a', 'b', 'c', 'd', 'e', 'a', 'e', 'd']
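
Note that reduce without an initial value raises a TypeError when the column is empty. Passing [] as the third argument covers that case. A small sketch, where the `empty` frame is made up purely for illustration:

```python
from functools import reduce
import operator

import pandas as pd

df = pd.DataFrame({'col2': [['a'], ['a', 'b', 'c'], ['d']]})
empty = pd.DataFrame({'col2': []})  # hypothetical frame with no rows

# Without an initial value, reduce has nothing to return for an empty input.
try:
    reduce(operator.add, empty.col2.values)
    raised = False
except TypeError:
    raised = True

# With [] as the initial value, both the empty and the normal case work.
safe_empty = reduce(operator.add, empty.col2.values, [])
safe_full = reduce(operator.add, df.col2.values, [])

print(raised, safe_empty, safe_full)
```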

Comments
  • For all possible answers, have a look at this post by jezrael: stackoverflow.com/a/54089037/10734525. That question is a little different, but the approach is worth learning.
  • sum has O(n^2) complexity so that 30x will become much higher for larger dataframes. It is practically not usable for anything that has 100k+ rows.
  • This isn't working for me. I get [['a'], ['a', 'b', 'c'], ['d'], ['e'], ['a', 'e', 'd']]
  • great going genius :)
  • Thank you @meW. :) glad you liked it.
  • Sorry to one-up you again, but this is 30-40x slower than my functools.reduce solution above for a dataframe with 1000 rows.
  • @anky_91 I tested both the first and second, I was talking about the second.
  • Our answers were somewhat similar, if you are interested I added a full performance comparison for all of the answers.
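
The quadratic cost mentioned in the comments can be illustrated without any timing, by counting how many list elements get copied. This is a simplified cost model, not a measurement: it assumes that sum boils down to repeated `acc = acc + row` (each step builds a brand-new list), and it ignores the occasional reallocation when a list is extended in place. The helper names are hypothetical.

```python
def copies_with_add(lists):
    """Elements copied when the result is built with acc = acc + row."""
    copied = 0
    acc_len = 0
    for row in lists:
        acc_len += len(row)
        copied += acc_len  # acc + row allocates a new list of this length
    return copied


def copies_with_extend(lists):
    """Elements copied when one accumulator is extended in place."""
    return sum(len(row) for row in lists)


rows_1k = [['x'] * 10 for _ in range(1000)]
rows_10k = [['x'] * 10 for _ in range(10000)]

for name, rows in [('1k rows', rows_1k), ('10k rows', rows_10k)]:
    print(name, copies_with_add(rows), copies_with_extend(rows))
```

Under this model, ten times more rows means roughly a hundred times more copying for the `+` approach, but only ten times more for in-place extension, which is why the gap keeps widening as the dataframe grows.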