## Random sampling only a subset of data in R

I have a dataset (N = 2794) from which I want to take a random subset of rows, randomly reallocate their class, and keep the result in the same data frame.

Example

| Index | B | C | Class |
|-------|---|---|-------|
| 1     | 3 | 4 | Dog   |
| 2     | 1 | 9 | Cat   |
| 3     | 9 | 1 | Dog   |
| 4     | 1 | 1 | Cat   |

From the above example, I want to randomly take N observations from the column `Class` and mix them up, so that you get something like this:

| Index | B | C | Class |            |
|-------|---|---|-------|------------|
| 1     | 3 | 4 | Cat   | Re-sampled |
| 2     | 1 | 9 | Dog   | Re-sampled |
| 3     | 9 | 1 | Dog   |            |
| 4     | 1 | 1 | Dog   | Re-sampled |

This code randomly *extracts* rows and resamples them, but I don't want to extract the rows. I want to keep them in the data frame.

    sample(Class[sample(nrow(Class),N),])

Suppose `df` is your data frame:

    df <- data.frame(index = 1:4, B = c(3, 1, 9, 1), C = c(4, 9, 1, 1),
                     Class = c("Dog", "Cat", "Dog", "Cat"))

Would this do what you want?

    dfSamp <- sample(1:nrow(df), N)
    df$Class[dfSamp] <- sample(df$Class[dfSamp])
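A self-contained sketch of the same idea follows. It assumes `N = 3`, an arbitrary seed, and a hypothetical `resampled` flag column (none of which appear in the answer above), so that you can see which rows were touched while every row stays in the data frame:

```r
# Toy data from the question
df <- data.frame(index = 1:4, B = c(3, 1, 9, 1), C = c(4, 9, 1, 1),
                 Class = c("Dog", "Cat", "Dog", "Cat"))
original <- df$Class   # copy of the labels, kept only for comparison

set.seed(1)            # arbitrary seed, only for reproducibility
N <- 3                 # assumed number of rows to reshuffle

rows <- sample(nrow(df), N)                   # row positions to touch
df$Class[rows] <- sample(df$Class[rows])      # permute labels among those rows only
df$resampled <- seq_len(nrow(df)) %in% rows   # hypothetical flag column

df
```

Because the labels are only permuted among the chosen rows, the overall class counts stay the same, and a chosen row can end up with its original label again.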

I simulated the data frame and did an example:

    df <- data.frame(
      ID = 1:4,
      Class = c('Dog', 'Cat', 'Dog', 'Cat')
    )
    N <- 2
    sample_ids <- sample(nrow(df), N)
    df$Class[sample_ids] <- sample(df$Class, length(sample_ids))

Assuming `Class` is the name of your data frame, you could do this:

    library(dplyr)
    # sample_n() draws rows; base sample() on a data frame would draw columns
    bind_rows(
      Class %>% mutate(origin = 'not_sampled'),
      Class %>% sample_n(100, replace = TRUE) %>% mutate(origin = 'sampled')
    )

Sample 100 observations of the original dataframe and stack them to the bottom of it. I am also adding a column so that you know if the observation was sampled or present in the dataframe from the beginning.

What you want to do is replace some classes in place, but not others.

So, if we start with a data frame, `df`:

    set.seed(100)
    df = data.frame(index = 1:100,
                    B = sample(1:10, 100, replace = T),
                    C = sample(1:10, 100, replace = T),
                    Class = sample(c('Cat', 'Dog', 'Bunny'), 100, replace = T))

If you want to update 5 random rows, you need to pick which rows to update and which new classes to put in those rows. By sampling from `unique(df$Class)` you don't weight the classes by their current occurrence. You could adjust this with the `prob` argument of `sample()`, or remove `unique` to use occurrence as the weight.

    n_rows = 5
    rows_to_update = sample(1:100, n_rows, replace = F)
    new_classes = sample(unique(df$Class), n_rows, replace = T)
    rows_to_update
    #> [1] 85 65 94 60 48
    new_classes
    #> [1] "Bunny" "Dog"   "Dog"   "Dog"   "Bunny"

We can inspect what the original data looked like:

    df[rows_to_update,]
    #>    index B  C Class
    #> 85    85 1  2   Dog
    #> 65    65 5  1 Bunny
    #> 94    94 5 10   Dog
    #> 60    60 3  7 Bunny
    #> 48    48 9  1   Cat

We can update this in place with a reference to the column and the rows to update.

    df$Class[rows_to_update] = new_classes
    df[rows_to_update,]
    #>    index B  C Class
    #> 85    85 1  2 Bunny
    #> 65    65 5  1   Dog
    #> 94    94 5 10   Dog
    #> 60    60 3  7   Dog
    #> 48    48 9  1 Bunny
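To make the weighting point concrete, here is a small sketch that compares the three options on a vector with assumed counts (Dog 10, Cat 20, Bunny 5). The vector `classes`, the seed, the number of draws, and the 50/25/25 probabilities are illustrative assumptions; `prob` is the weighting argument of base `sample()`.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
classes <- rep(c("Dog", "Cat", "Bunny"), times = c(10, 20, 5))  # assumed counts
n_draws <- 10000

# 1. Uniform over the distinct classes
uniform <- sample(unique(classes), n_draws, replace = TRUE)

# 2. Weighted by current occurrence (no unique())
by_occurrence <- sample(classes, n_draws, replace = TRUE)

# 3. Explicit weights via prob, given in the order of unique(classes):
#    "Dog", "Cat", "Bunny"
lv <- unique(classes)
custom <- sample(lv, n_draws, replace = TRUE, prob = c(0.5, 0.25, 0.25))

# Observed shares of each class under the three schemes
round(prop.table(table(uniform)), 2)
round(prop.table(table(by_occurrence)), 2)
round(prop.table(table(custom)), 2)
```

With enough draws, the first table should be close to one third each, the second should follow the 10:20:5 counts, and the third should follow the supplied probabilities.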

##### Comments

- Thanks! Does that mean that if you remove the `unique` from `df$Class`, you *are* taking account of the class weights?
- Right, if `df$Class` was Dog: 10, Cat: 20, Bunny: 5, then without `unique` you'd expect Cat to be selected twice as often as Dog, and Dog to be selected twice as often as Bunny.