How should I get the shape of a dask dataframe?

Performing .shape is giving me the following error.

AttributeError: 'DataFrame' object has no attribute 'shape'

How should I get the shape instead?

You can get the number of columns directly

len(df.columns)  # this is fast

You can also call len on the dataframe itself, though beware that this will trigger a computation.

len(df)  # this requires a full scan of the data

Dask.dataframe doesn't know how many records are in your data without first reading through all of it.
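
Putting both together, a minimal sketch (the small pandas frame here is only illustrative; any dask DataFrame behaves the same way):

    import pandas as pd
    import dask.dataframe as dd

    # illustrative data
    pdf = pd.DataFrame({"a": range(10), "b": range(10, 20)})
    df = dd.from_pandas(pdf, npartitions=2)

    ncols = len(df.columns)  # cheap: only metadata is consulted
    nrows = len(df)          # expensive: scans every partition
    print((nrows, ncols))    # (10, 2)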

With shape you can do the following:

a = df.shape
(a[0].compute(), a[1])

This will show the shape just as it is shown with pandas.
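
If you need this often, it can be wrapped in a small helper (just a sketch; dask_shape is an illustrative name, not part of the Dask API):

    def dask_shape(df):
        # df.shape on a dask DataFrame returns (lazy row count, plain int column count);
        # computing the row count triggers a full pass over the data
        nrows, ncols = df.shape
        return (int(nrows.compute()), ncols)

    dask_shape(df)  # e.g. (10, 2) for the example above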

To get the shape we can also try this:

dask_dataframe.describe().compute()

The "count" row of the result gives the number of rows (strictly, the number of non-null values per numeric column), and

len(dask_dataframe.columns)

gives the number of columns in the dataframe.
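
A sketch of that approach end to end (note the caveat: describe() only covers numeric columns, and the count excludes missing values, so this can underestimate the true row total):

    stats = dask_dataframe.describe().compute()

    # "count" is the number of non-null values per numeric column;
    # take the maximum as the best available estimate of the row count
    nrows = int(stats.loc["count"].max())
    shape = (nrows, len(dask_dataframe.columns))
    print(shape)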

Well, I know this is quite an old question, but I had the same issue and found an out-of-the-box solution which I just want to register here.

Considering your data, I assume it was originally saved in a CSV-like file; so, for my situation, I just count the lines of that file (minus one, the header line). Inspired by this answer here, this is the solution I'm using:

    import dask.dataframe as dd
    from itertools import takewhile, repeat

    def rawincount(filename):
        # count newline characters by reading the file in raw 1 MiB chunks
        with open(filename, 'rb') as f:
            bufgen = takewhile(lambda x: x, (f.raw.read(1024 * 1024) for _ in repeat(None)))
            return sum(buf.count(b'\n') for buf in bufgen)

    filename = 'myHugeDataframe.csv'
    df = dd.read_csv(filename)
    df_shape = (rawincount(filename) - 1, len(df.columns))  # minus one for the header
    print(f"Shape: {df_shape}")

Hope this could help someone else as well.

Comments
  • len(df) loads all of the records and, in my case, finding len(df) for a table with 144M rows took more than a few minutes (Windows 10, 16 GB RAM, Intel i7). Any other way?
  • It probably has to load all of the data to find out the length. No, there is no other way. You could consider using something like a database, which tracks this sort of information in metadata.
  • I've been doing df.index.size.compute(), which is faster than running len(df)... but my data is stored in columnar Parquet, so it depends on what your underlying data architecture is.
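
A sketch of that last comment's suggestion (assuming the data is stored so the index is cheap to materialize, e.g. columnar Parquet as the commenter mentions):

    # count rows via the index only, without materializing the other columns
    nrows = df.index.size.compute()
    shape = (nrows, len(df.columns))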