Use numpy.random.seed() when selecting subset of rows from large csv w/o knowing exact length

train_df = pd.read_csv(train_file, header=0, skiprows=lambda i: i>0 and random.random() > 0.3)

I had this, but realized it won't be reproducible. Is there a way to randomly select a subset of rows from a large csv, in a reproducible manner, without knowing the length of the file? This seems like something read_csv would support.

I know there is a function

df.sample(random_state=123) 

However, I'd need this functionality while reading in the csv, because of the size of the file.

I know for certain that the number of rows is more than 900k, so I can do...

np.random.seed(42)
skip = np.random.randint(0,900000,200000)
train_df = pd.read_csv(train_file, header=0, skiprows=skip)

But this doesn't give every row an equal chance of making it into the sample, so not ideal. Can read_csv scan a csv and return the length of the file?


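read_csv cannot return the length of the file, so the file has to be read twice: once to count the lines, and then again by read_csv:

import numpy as np
import pandas as pd

np.random.seed(1245)

def file_len(fname):
    # Count physical lines (header included) without loading the file into memory
    with open(fname) as f:
        return sum(1 for _ in f)

train_file = 'file.csv'
num = file_len(train_file)
print(num)

# Pick 200000 distinct data rows to skip; line 0 is the header, so start at 1
skip = np.random.choice(np.arange(1, num), size=200000, replace=False)
# more dynamic - skip 20% of the data rows
# skip = np.random.choice(np.arange(1, num), size=int((num - 1) * 0.2), replace=False)
train_df = pd.read_csv(train_file, header=0, skiprows=skip)
print(train_df)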
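If reading the file twice is too expensive, a different technique, reservoir sampling, draws a fixed-size sample in a single pass without knowing the length in advance. This is a minimal sketch using only the standard library, not something read_csv offers; the function name reservoir_sample_csv and its parameters are made up for illustration. It assumes the first line is the header and that k rows fit in memory.

```python
import csv
import random

def reservoir_sample_csv(lines, k, seed):
    """Return (header, rows) with up to k data rows chosen uniformly at random.

    `lines` is an open text file or any iterable of CSV lines (hypothetical
    helper; the first line is assumed to be the header).
    """
    rng = random.Random(seed)      # local generator, so the result is reproducible
    reader = csv.reader(lines)
    header = next(reader)
    reservoir = []
    for n, row in enumerate(reader):
        if n < k:
            # Fill the reservoir with the first k data rows
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (n + 1)
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return header, reservoir
```

With an open file this would be called as reservoir_sample_csv(f, 200000, seed=42), and the result can be turned into a frame with pd.DataFrame(rows, columns=header). All values come back as strings, so the columns would still need converting to the right dtypes.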



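You could try the following, which keeps roughly one data row in five:

import pandas as pd
import numpy as np
np.random.seed(4)
# np.random.choice(5) draws from 0..4; the row is skipped unless the draw is 0
pd.read_csv(file, header=0,
            skiprows=lambda i: i > 0 and np.random.choice(5) > 0)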



np.random.seed(42)
p = 0.3  # fraction of rows to read in
train_df = pd.read_csv(train_file, header=0, skiprows=lambda x: x > 0 and np.random.random() > p)
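The seeded lambda is only reproducible if the callable is invoked the same number of times, in the same order, on every run, and if nothing else touches the global NumPy generator in between. A sketch that avoids both assumptions makes each decision a pure function of the seed and the row index by hashing them. The helper name make_row_skipper and its parameters are hypothetical, not part of pandas.

```python
import hashlib

def make_row_skipper(keep_fraction, seed):
    """Build a skiprows-style callable that keeps about keep_fraction of the rows.

    Hypothetical helper: the decision for row i depends only on (seed, i),
    so it does not matter how often or in what order the callable is called.
    """
    def skip(i):
        if i == 0:
            return False  # never skip the header line
        digest = hashlib.sha256(f"{seed}:{i}".encode()).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1)
        u = int.from_bytes(digest[:8], "big") / 2**64
        return u >= keep_fraction
    return skip
```

It would be passed as skiprows=make_row_skipper(0.3, seed=42) in the read_csv call. Like the answers above, this keeps each row independently with probability p, so the sample size varies around p times the number of rows rather than being exact.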
