## Use numpy.random.seed() when selecting a subset of rows from a large CSV without knowing its exact length

```
train_df = pd.read_csv(train_file, header=0, skiprows=lambda i: i > 0 and random.random() > 0.3)
```

I had this, but realized it won't be reproducible. Is there a reproducible way to randomly select a subset of rows from a large CSV without knowing the length of the file? It seems like something `read_csv` would support.

I know there is a function

```
df.sample(random_state=123)
```

However, I'd need this functionality while reading in the CSV, because the file is too large to load in full first.

I know for certain that the number of rows is more than 900k, so I can do...

```
np.random.seed(42)
skip = np.random.randint(0, 900000, 200000)
```

But this doesn't give every row an equal chance of making it into the sample, so it's not ideal. Can `read_csv` scan a CSV and return the length of the file?
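
To make the concern concrete, here is a small sketch (the names `upper_bound` and `n_draws` are just illustrative) showing two reasons the draw above is not uniform over rows: `np.random.randint` samples with replacement, so some line numbers repeat, and no line number at or beyond the assumed bound is ever drawn.

```python
import numpy as np

upper_bound = 900_000  # assumed lower bound on the row count, taken from the question
n_draws = 200_000

np.random.seed(42)
skip = np.random.randint(0, upper_bound, n_draws)

# Drawing with replacement produces repeated line numbers, so fewer than
# n_draws distinct rows end up being skipped.
n_unique = np.unique(skip).size
print(n_unique, n_draws - n_unique)

# All drawn values lie in [0, upper_bound), so rows at or beyond the bound
# are never skipped and always land in the sample. Line 0 (the header) can
# also be drawn.
print(skip.min() >= 0, skip.max() < upper_bound)
```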

You need to read the file twice: first to count its lines, and then with `read_csv`, because `read_csv` cannot return the length of the file:

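```
import numpy as np
import pandas as pd

np.random.seed(1245)

def file_len(fname):
    # Count the lines in the file (first pass over the data)
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

train_file = 'file.csv'
num = file_len(train_file)
print(num)

# Line numbers to skip; start at 1 so the header (line 0) is never skipped
skip = np.random.randint(1, num, 200000)
# more dynamic - 20% of length
# skip = np.random.randint(1, num, int(num * 0.2))

train_df = pd.read_csv(train_file, header=0, skiprows=skip)
print(train_df)
```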
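
If the sample should contain an exact number of rows, with every row equally likely to be chosen, one option is to draw the skipped line numbers without replacement. The following is a minimal sketch rather than the answer's own code: `count_lines` and `sample_csv_exact` are hypothetical helpers, and the sketch assumes a single header line and one record per physical line (no newlines embedded in quoted fields).

```python
import numpy as np
import pandas as pd


def count_lines(path):
    """Count physical lines in the file (first pass over the data)."""
    with open(path) as f:
        return sum(1 for _ in f)


def sample_csv_exact(path, n_keep, seed=42):
    """Read exactly n_keep data rows, chosen uniformly without replacement.

    Hypothetical helper; assumes one header line and one record per line.
    """
    n_data = count_lines(path) - 1      # exclude the header line
    n_keep = min(n_keep, n_data)
    rng = np.random.default_rng(seed)   # local generator, independent of np.random.seed
    # Data rows occupy line numbers 1..n_data; pick the ones to skip.
    skip = rng.choice(np.arange(1, n_data + 1), size=n_data - n_keep, replace=False)
    return pd.read_csv(path, header=0, skiprows=skip.tolist())
```

Calling `sample_csv_exact('file.csv', 200000)` twice with the same `seed` should return the same rows, at the cost of reading the file twice.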

You could try

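```
import pandas as pd
import numpy as np

np.random.seed(4)
# np.random.choice(5) returns 0..4; any nonzero value skips the row,
# so roughly 20% of the data rows are kept
train_df = pd.read_csv(train_file, header=0,
                       skiprows=lambda i: i > 0 and np.random.choice(5) != 0)
```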
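
A variation on the same idea keeps the random state in a dedicated generator instead of the global `np.random` state, so unrelated calls to `np.random` elsewhere in the program do not change which rows are selected. This is a sketch under assumptions: `sample_csv_fraction` is a hypothetical helper, and reproducibility relies on pandas invoking the `skiprows` callable in the same order each time the same file is read.

```python
import numpy as np
import pandas as pd


def sample_csv_fraction(path_or_buffer, p=0.3, seed=42):
    """Keep each data row independently with probability p, in a single pass.

    Hypothetical helper; the number of rows returned varies around p * n_rows.
    """
    # A fresh generator per call, so every call replays the same draws.
    rng = np.random.default_rng(seed)

    def skip(i):
        # Line 0 is the header and is always kept; no draw is consumed for it.
        return i > 0 and rng.random() > p

    return pd.read_csv(path_or_buffer, header=0, skiprows=skip)
```

Unlike the two-pass approach, this does not need the file length, but the sample size is only approximately `p` times the number of rows.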


```
np.random.seed(42)
p = 0.3  # fraction of rows to read in