Find unique elements of a column with pandas chunksize

Given a sample(!) data frame:

test = 

time  clock
1     1
1     1
2     2
2     2
3     3
3     3

I was trying to do some operations with pandas chunksize:

import pandas as pd

for df in pd.read_csv("...path...", chunksize=10):
    time_spam = df.time.unique()
    detector_list = df.clock.unique()

But this only applies the operation to one chunk at a time, i.e. to chunksize rows. If it is 10, I only get the unique values of 10 rows.

P.S. It is sample data

Please try:

for df in pd.read_csv("...path...", chunksize=10, iterator=True):
    time_spam = df.time.unique()
    detector_list = df.clock.unique()

You need to use the iterator flag as described here:

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking
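
For context, here is a minimal sketch of the two chunking interfaces the linked docs describe, keeping the placeholder path from above:

import pandas as pd

# chunksize=n yields an iterator of DataFrames with up to n rows each
for chunk in pd.read_csv("...path...", chunksize=10):
    print(len(chunk))  # at most 10 rows per chunk

# iterator=True instead returns a reader from which chunks are pulled manually
reader = pd.read_csv("...path...", iterator=True)
first_rows = reader.get_chunk(20)  # DataFrame with the next 20 rows

Either way, each iteration only sees one chunk, so the per-chunk unique values still have to be combined across chunks, as the answers below do.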

Here's how you can build lists of unique elements while parsing the chunks:

import pandas as pd

# Initialize lists
time_spam = []
detector_list = []

# Cycle over each chunk
for df in pd.read_csv("...path...", chunksize=10):

    # Add elements if not already in the list
    time_spam += [t for t in df['time'].unique() if t not in time_spam]
    detector_list += [c for c in df['clock'].unique() if c not in detector_list]
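
If the not in membership checks become slow as the lists grow, the same idea can be sketched with sets, which avoid rescanning the lists for every element (same placeholder path as above):

import pandas as pd

time_spam = set()
detector_list = set()

# Each chunk contributes its unique values; set.update() handles deduplication
for df in pd.read_csv("...path...", chunksize=10):
    time_spam.update(df['time'].unique())
    detector_list.update(df['clock'].unique())

# Convert back to sorted lists if list output is needed
time_spam = sorted(time_spam)
detector_list = sorted(detector_list)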

File test.csv:

col1,col2
1,1
1,2
1,3
1,4
2,1
2,2
2,3
2,4

Code:

import pandas as pd

col1, col2 = [], []
for df in pd.read_csv('test.csv', chunksize=3):
    col1.append(df.col1)
    col2.append(df.col2)

Results:

print(pd.concat(col1).unique())

[1 2]

print(pd.concat(col2).unique())

[1 2 3 4]
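
Note that this keeps every value from every chunk in memory (via pd.concat) before deduplicating at the end; with very large files, a per-chunk set as sketched above only ever stores the unique values themselves.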

Comments
  • I'm not too sure I understand the point of the question; what are you expecting as output?
  • The unique values of the columns time and clock
  • What is the rationale for chunking the df? Not enough memory?
  • Yes, as I wrote, it is sample data. The real data has 90 million rows :(
  • let me check speed
  • This depends on your chunk size
  • A faster alternative could be to add them in a single list, then use list(set(...))
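
For reference, a rough sketch of that last suggestion (collect into plain lists per column, then deduplicate once with list(set(...)) at the end), using the placeholder path from above:

import pandas as pd

time_spam, detector_list = [], []

# Collect everything first, deduplicate once at the end
for df in pd.read_csv("...path...", chunksize=10):
    time_spam.extend(df['time'].unique())
    detector_list.extend(df['clock'].unique())

time_spam = list(set(time_spam))
detector_list = list(set(detector_list))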