## Use numpy.random.seed() when selecting a subset of rows from a large CSV without knowing its exact length

```
train_df = pd.read_csv(train_file, header=0, skiprows=lambda i: i>0 and random.random() > 0.3)
```

I had this, but realized it isn't reproducible. Is there a reproducible way to randomly select a subset of rows from a large CSV without knowing the length of the file? It seems like something `read_csv` would support.

I know there is a function:

```
df.sample(random_state=123)
```

However, I'd need this functionality while reading in the CSV, because of the size of the file.

I know for certain that the number of rows is more than 900k, so I can do...

```
np.random.seed(42)
skip = np.random.randint(0,900000,200000)
```

But this doesn't give every row an equal chance of making it into the sample, so not ideal. Can read_csv scan a csv and return the length of the file?
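To make that concern concrete, here is a minimal sketch of two limits of the `randint` approach. It uses a local `RandomState` instead of the global seed, which is a substitution made here for illustration. The indices are drawn with replacement, so duplicates mean fewer than 200,000 distinct rows are skipped, and no row at or beyond index 900,000 can ever be skipped.

```python
import numpy as np

# Local generator instead of np.random.seed, so global state is untouched
rng = np.random.RandomState(42)
skip = rng.randint(0, 900000, 200000)  # drawn WITH replacement

n_distinct = np.unique(skip).size
print(n_distinct)   # fewer than 200000 because of repeated draws
print(skip.max())   # always below 900000; later rows are never skipped
```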

Reading the file twice is necessary here: once to get its length and once with `read_csv`, because `read_csv` cannot return the length of the file:

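```
import numpy as np
import pandas as pd

np.random.seed(1245)

def file_len(fname):
    # Count lines by iterating over the file once
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

train_file = 'file.csv'
num = file_len(train_file)
print(num)

# Line numbers to skip; starting at 1 keeps the header (line 0)
skip = np.random.randint(1, num, 200000)
# more dynamic - 20% of length
# skip = np.random.randint(1, num, int(num * 0.2))

train_df = pd.read_csv(train_file, skiprows=skip)
print(train_df)
```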
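Since `randint` draws with replacement, the number of distinct skipped rows is only approximate. As a hedged variation on the two-pass idea, the sketch below picks the rows to skip without replacement, so every data row has the same chance of being kept and the sample size is exact. The helper name `sample_csv`, the demo file, and the use of a local `RandomState` are assumptions made for illustration. It also assumes one record per physical line and a single header line.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def sample_csv(path, frac, seed):
    """Read a fraction `frac` of the data rows of a CSV, reproducibly.

    Hypothetical helper; assumes one record per physical line and a
    single header line.
    """
    # First pass: count data rows (total lines minus the header).
    with open(path) as f:
        n_rows = sum(1 for _ in f) - 1

    n_keep = int(n_rows * frac)
    rng = np.random.RandomState(seed)
    # Choose the line numbers to skip WITHOUT replacement; data rows are
    # lines 1..n_rows, so the header on line 0 is never skipped.
    skip = rng.choice(np.arange(1, n_rows + 1), size=n_rows - n_keep, replace=False)
    # Second pass: let read_csv drop the chosen lines.
    return pd.read_csv(path, header=0, skiprows=skip.tolist())


# Small demonstration on a temporary file with 1000 data rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("value\n")
    tmp.writelines(f"{i}\n" for i in range(1000))
    demo_path = tmp.name

first = sample_csv(demo_path, 0.2, seed=1245)
second = sample_csv(demo_path, 0.2, seed=1245)
os.remove(demo_path)
print(len(first), first.equals(second))
```

The list of skipped line numbers is held in memory, which should be manageable for files with a few million rows but may be worth reconsidering for much larger ones.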


You could try

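```
import pandas as pd
import numpy as np

np.random.seed(4)
# np.random.choice(5) returns 0..4; any nonzero value means "skip",
# so about 80% of the data rows are skipped and the header (i == 0) is kept.
# train_file is the CSV path from the question.
train_df = pd.read_csv(train_file, header=0,
                       skiprows=lambda i: i > 0 and np.random.choice(5) != 0)
```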
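A single-pass variant along the same lines can be sketched with a dedicated generator instead of the global seed, so other code that uses `np.random` cannot disturb the sample. The helper name `read_sample` and the in-memory demo data are assumptions for illustration. The sketch also relies on pandas calling the `skiprows` callable once per line, in order, which is worth verifying on your pandas version.

```python
import io

import numpy as np
import pandas as pd


def read_sample(source, p, seed):
    """Keep each data row with probability p (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # private generator, seeded per call
    # Row 0 is the header and is never skipped; every other row is skipped
    # when its random draw exceeds p.
    return pd.read_csv(
        source,
        header=0,
        skiprows=lambda i: i > 0 and bool(rng.random() > p),
    )


# Demo on in-memory CSV text with 10000 data rows.
csv_text = "x\n" + "\n".join(str(i) for i in range(10000)) + "\n"
a = read_sample(io.StringIO(csv_text), 0.3, seed=42)
b = read_sample(io.StringIO(csv_text), 0.3, seed=42)
print(len(a), a.equals(b))
```

With this approach the sample size is random (about `p` times the number of rows) rather than exact, but the file length never needs to be known.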


```
np.random.seed(42)
p = 0.3  # fraction of rows to read in