Count consecutive zeros over pandas rows

I have the following pd.DataFrame:

pd.DataFrame({'2010':[0, 45, 5], '2011': [12, 56, 0], '2012': [11, 22, 0], '2013': [0, 5, 0], '2014': [0, 0, 0]})

   2010  2011  2012  2013  2014
0     0    12    11     0     0
1    45    56    22     5     0
2     5     0     0     0     0

I would like to count the runs of consecutive zeros in each row:

0    [1, 2]
1       [1]
2       [4]

I'm looking for different efficient ways to do this.

For efficiency, I would suggest going the pure NumPy way -

import numpy as np

def islandlen_perrow(df, trigger_val=0):
    a = df.values == trigger_val
    # Pad each row with False so every island of trigger values is closed
    pad = np.zeros((a.shape[0], 1), dtype=bool)
    mask = np.hstack((pad, a, pad))
    # True at every island start and island end
    mask_step = mask[:, 1:] != mask[:, :-1]
    idx = np.flatnonzero(mask_step)
    # Starts and ends alternate in the flattened indices
    island_lens = idx[1::2] - idx[::2]
    # Split the flat list of lengths back into one array per row
    n_islands_perrow = mask_step.sum(1) // 2
    out = np.split(island_lens, n_islands_perrow[:-1].cumsum())
    return out
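To see how the start/end pairing works, here is a minimal trace of the intermediates for the first sample row (a sketch; names mirror the function above):

import numpy as np

row = np.array([0, 12, 11, 0, 0]) == 0          # [ True, False, False,  True,  True]
mask = np.concatenate(([False], row, [False]))  # pad so every island is closed
mask_step = mask[1:] != mask[:-1]               # True at each island start and end
idx = np.flatnonzero(mask_step)                 # [0, 1, 3, 5]
print(idx[1::2] - idx[::2])                     # [1, 2] -> island lengths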

Sample run -

In [69]: df
Out[69]: 
   2010  2011  2012  2013  2014
0     0    12    11     0     0
1    45    56    22     5     0
2     5     0     0     0     0

In [70]: islandlen_perrow(df, trigger_val=0)
Out[70]: [array([1, 2], dtype=int64), array([1], dtype=int64), array([4], dtype=int64)]

In [76]: pd.Series(islandlen_perrow(df, trigger_val=0))
Out[76]: 
0    [1, 2]
1       [1]
2       [4]
dtype: object

Timings on a larger array -

In [77]: df = pd.DataFrame(np.random.randint(0,4,(1000,1000)))

In [78]: from itertools import groupby

# @Daniel Mesejo's soln
In [79]: def count_zeros(x):
    ...:     return [sum(1 for _ in group) for key, group in groupby(x, key=lambda i: i == 0) if key]

In [80]: %timeit df.apply(count_zeros, axis=1)
1 loop, best of 3: 228 ms per loop

# @coldspeed's soln-1
In [84]: %%timeit
    ...: v = df.stack()
    ...: m = v.eq(0)
    ...: 
    ...: (m.ne(m.shift())
    ...:   .cumsum()
    ...:   .where(m)
    ...:   .dropna()
    ...:   .groupby(level=0)
    ...:   .apply(lambda x: x.value_counts(sort=False).tolist()))
1 loop, best of 3: 516 ms per loop

# @coldspeed's soln-2
In [88]: %%timeit
    ...: v = df.stack()
    ...: m = v.eq(0)
    ...: (m.ne(m.shift())
    ...:   .cumsum()
    ...:   .where(m)
    ...:   .dropna()
    ...:   .groupby(level=0)
    ...:   .value_counts(sort=False)
    ...:   .groupby(level=0)
    ...:   .apply(list))
1 loop, best of 3: 343 ms per loop

# @jpp's soln
In [90]: %timeit [[len(list(grp)) for flag, grp in groupby(row, key=bool) if not flag] \
    ...:                 for row in df.values]
1 loop, best of 3: 334 ms per loop

# @J. Doe's soln
In [94]: %%timeit
    ...: data = df
    ...: data_transformed = np.equal(data.astype(int).values.tolist(), 0).astype(str)
    ...: pd.DataFrame(data_transformed).apply(lambda x: [i.count('True') for i in ''.join(list(x)).split('False') if i], axis=1)
1 loop, best of 3: 519 ms per loop

# From this post
In [89]: %timeit pd.Series(islandlen_perrow(df, trigger_val=0))
100 loops, best of 3: 9.8 ms per loop
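The timings above are from an IPython session; to reproduce them outside IPython, a minimal sketch with timeit (absolute numbers will differ by machine and pandas version):

import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 4, (1000, 1000)))
# islandlen_perrow as defined above
print(timeit.timeit(lambda: pd.Series(islandlen_perrow(df, trigger_val=0)), number=10) / 10)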

Using itertools.groupby with a list comprehension:

from itertools import groupby

df['counts'] = [[len(list(grp)) for flag, grp in groupby(row, key=bool) if not flag] \
                for row in df.values]

print(df)

   2010  2011  2012  2013  2014  counts
0     0    12    11     0     0  [1, 2]
1    45    56    22     5     0     [1]
2     5     0     0     0     0     [4]
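The trick here is key=bool: zeros map to False, so each run of zeros forms a single False group. A quick sketch on the first row:

from itertools import groupby

row = [0, 12, 11, 0, 0]
print([(flag, list(grp)) for flag, grp in groupby(row, key=bool)])
# [(False, [0]), (True, [12, 11]), (False, [0, 0])]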

If you're interested in a pure pandas/numpy solution... you can do this with groupby and value_counts:

v = df.stack()
m = v.eq(0)

(m.ne(m.shift())
  .cumsum()
  .where(m)
  .dropna()
  .groupby(level=0)
  .apply(lambda x: x.value_counts(sort=False).tolist()))

0    [1, 2]
1       [1]
2       [4]
dtype: object

Or, avoiding the lambda,

(m.ne(m.shift())
  .cumsum()
  .where(m)
  .dropna()
  .groupby(level=0)
  .value_counts(sort=False)
  .groupby(level=0)
  .apply(list))

0    [1, 2]
1       [1]
2       [4]
dtype: object
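The core idiom here is m.ne(m.shift()).cumsum(), which gives every run of equal values its own label; counting the labels kept by .where(m) yields the zero-run lengths. A minimal trace on the first row (a sketch):

import pandas as pd

m = pd.Series([True, False, False, True, True])  # v.eq(0) for row 0
labels = m.ne(m.shift()).cumsum()
print(labels.tolist())                           # [1, 2, 2, 3, 3]
print(labels.where(m).dropna().value_counts(sort=False).tolist())  # [1, 2]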

You could use itertools.groupby:

import pandas as pd

from itertools import groupby


def count_zeros(x):
    return [sum(1 for _ in group) for key, group in groupby(x, key=lambda i: i == 0) if key]


df = pd.DataFrame({'2010':[0, 45, 5], '2011': [12, 56, 0], '2012': [11, 22, 0], '2013': [0, 5, 0], '2014': [0, 0, 0]})

result = df.apply(count_zeros, axis=1)
print(result)

Output

0    [1, 2]
1       [1]
2       [4]
dtype: object
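One detail worth noting: groupby yields iterators, which have no len(), so the answer counts lazily with sum(1 for _ in group). A quick sketch:

from itertools import groupby

for key, group in groupby([0, 0, 5], key=lambda i: i == 0):
    # len(group) would raise TypeError; count the iterator lazily instead
    print(key, sum(1 for _ in group))
# True 2
# False 1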

One method is to transform the values to booleans, join each row into a string, and split that string on the False values:

import numpy as np
import pandas as pd

# 'True' marks a zero; joining a row and splitting on 'False' isolates the zero runs
data_transformed = np.equal(df.astype(int).values.tolist(), 0).astype(str)
pd.DataFrame(data_transformed).apply(
    lambda x: [i.count('True') for i in ''.join(list(x)).split('False') if i],
    axis=1)
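To see why the split works, trace row 0: its booleans join to 'TrueFalseFalseTrueTrue', and splitting on 'False' isolates the zero runs (a sketch):

joined = 'TrueFalseFalseTrueTrue'              # row 0 after np.equal(...).astype(str)
pieces = joined.split('False')                 # ['True', '', 'TrueTrue']
print([p.count('True') for p in pieces if p])  # [1, 2]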
