Random sampling only a subset of data in R

I have a dataset (N of 2794) of which I want to extract a subset, randomly reallocate the class and put it back into the dataframe.

Example

| Index | B | C | Class|
| 1     | 3 | 4 | Dog  |
| 2     | 1 | 9 | Cat  |
| 3     | 9 | 1 | Dog  |
| 4     | 1 | 1 | Cat  |

From the above example, I want to randomly take N observations from the column 'Class' and shuffle them, so that you get something like this:

| Index | B | C | Class|
| 1     | 3 | 4 | Cat  | Re-sampled 
| 2     | 1 | 9 | Dog  | Re-sampled 
| 3     | 9 | 1 | Dog  |
| 4     | 1 | 1 | Dog  | Re-sampled 

This code randomly extracts rows and resamples them, but I don't want to extract the rows. I want to keep them in the data frame.

 sample(Class[sample(nrow(Class),N),])  

Suppose df is your data frame:

df <- data.frame(index=1:4, B=c(3,1,9,1), C=c(4,9,1,1), Class=c("Dog", "Cat", "Dog", "Cat"))

Would this do what you want?

dfSamp <- sample(1:nrow(df), N)
df$Class[dfSamp] <- sample(df$Class[dfSamp])
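
The two lines above can be checked with a small self-contained sketch. The seed and the value of N are assumptions added only to make the run repeatable; they are not part of the original answer. Because the labels are permuted only among the chosen rows, the other rows stay as they were and the overall class counts do not change.

```r
# Sketch: shuffle Class among N randomly chosen rows, in place.
set.seed(42)  # assumption: seed added only for reproducibility
N <- 3        # assumption: illustrative subset size

df <- data.frame(index = 1:4, B = c(3, 1, 9, 1), C = c(4, 9, 1, 1),
                 Class = c("Dog", "Cat", "Dog", "Cat"),
                 stringsAsFactors = FALSE)

original <- df$Class                       # keep a copy for comparison
rows <- sample(nrow(df), N)                # which rows get reshuffled
df$Class[rows] <- sample(df$Class[rows])   # permute labels within those rows

untouched <- setdiff(seq_len(nrow(df)), rows)
same_outside <- identical(df$Class[untouched], original[untouched])
same_counts  <- identical(sort(df$Class), sort(original))
```

Both `same_outside` and `same_counts` should be TRUE for any seed, since a permutation only rearranges the existing labels.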

I simulated the data frame and did an example:

df <- data.frame(
  ID=1:4,
  Class=c('Dog', 'Cat', 'Dog', 'Cat')
)

N <- 2
sample_ids <- sample(nrow(df), N)

df$Class[sample_ids] <- sample(df$Class, length(sample_ids))
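
One thing worth noting about the line above: the replacement labels are drawn from the whole Class column, not only from the selected rows, so the class totals can change. The sketch below is a hedged illustration of that; the seed is an assumption added for repeatability.

```r
# Sketch: replacement labels drawn from the entire column.
set.seed(1)  # assumption: seed added only for reproducibility

df <- data.frame(ID = 1:4,
                 Class = c("Dog", "Cat", "Dog", "Cat"),
                 stringsAsFactors = FALSE)
N <- 2
sample_ids <- sample(nrow(df), N)

before <- table(df$Class)
# Draw N labels from all rows and write them into the selected rows
df$Class[sample_ids] <- sample(df$Class, length(sample_ids))
after <- table(df$Class)

# The total number of rows is unchanged, but the per-class counts in
# `after` are not guaranteed to match `before`.
```

If the class counts must stay fixed, permuting only within the selected rows, as in sample(df$Class[sample_ids]), avoids this.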

Assuming Class is the name of your data frame, you could do this:

library(dplyr)

bind_rows(
  Class %>% 
    mutate(origin = 'not_sampled'),
  Class %>% 
    # slice_sample() (dplyr >= 1.0.0) samples rows; base sample() on a data frame would sample columns
    slice_sample(n = 100, replace = TRUE) %>% 
    mutate(origin = 'sampled'))

This samples 100 rows of the original data frame (with replacement) and stacks them at the bottom of it. The added origin column records whether each row was sampled or was present in the data frame from the beginning.

What you want to do is replace some classes in place, but not others.

So, if we start with a data frame, df

set.seed(100)
df = data.frame(index = 1:100,
                B = sample(1:10,100,replace = T),
                C = sample(1:10,100,replace = T),
                Class = sample(c('Cat','Dog','Bunny'),100,replace = T))

And you want to update 5 random rows, then we need to pick which rows to update and what new classes to put in those rows. By referencing unique(df$Class) you don't weight the classes by their current occurrence. You could adjust this with the prob argument of sample(), or remove unique() to use the current occurrence as the weight.

n_rows = 5
rows_to_update = sample(1:100,n_rows,replace = F)
new_classes = sample(unique(df$Class),n_rows,replace = T)
rows_to_update
#> [1] 85 65 94 60 48
new_classes
#> [1] "Bunny" "Dog"   "Dog"   "Dog"   "Bunny"

We can inspect what the original data looked like

df[rows_to_update,]
#>    index B  C Class
#> 85    85 1  2   Dog
#> 65    65 5  1 Bunny
#> 94    94 5 10   Dog
#> 60    60 3  7 Bunny
#> 48    48 9  1   Cat

We can update this in place with a reference to the column and the rows to update.

df$Class[rows_to_update] = new_classes
df[rows_to_update,]
#>    index B  C Class
#> 85    85 1  2 Bunny
#> 65    65 5  1   Dog
#> 94    94 5 10   Dog
#> 60    60 3  7   Dog
#> 48    48 9  1 Bunny
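
The weighting options mentioned earlier can be sketched as follows. This is a minimal illustration rather than part of the original answer: the custom weights in `w` are hypothetical, and the B and C columns are omitted for brevity.

```r
# Sketch: three ways to choose the new classes for the updated rows.
set.seed(100)  # same seed style as the example above
classes <- c("Cat", "Dog", "Bunny")
df <- data.frame(index = 1:100,
                 Class = sample(classes, 100, replace = TRUE),
                 stringsAsFactors = FALSE)

n_rows <- 5
rows_to_update <- sample(nrow(df), n_rows)

# A: every class equally likely, regardless of how common it is
new_uniform <- sample(unique(df$Class), n_rows, replace = TRUE)

# B: weight each class by its current frequency via the prob argument
freq <- table(df$Class)
new_weighted <- sample(names(freq), n_rows, replace = TRUE,
                       prob = as.numeric(freq) / sum(freq))

# C: explicit weights (hypothetical values chosen for illustration)
w <- c(Cat = 0.5, Dog = 0.3, Bunny = 0.2)
new_custom <- sample(names(w), n_rows, replace = TRUE, prob = w)

# Write one of the candidate vectors back in place
df$Class[rows_to_update] <- new_weighted
```

Option B draws from the same distribution as sample(df$Class, n_rows, replace = TRUE), which is the "remove unique" variant described above.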

Comments
  • Thanks! Does that mean that if you remove unique from df$Class, you are taking the class weights into account?
  • Right, if df$Class was Dog: 10, Cat: 20, Bunny: 5, then without unique you'd expect Cat to be selected twice as often as Dog, and Dog to be selected twice as often as Bunny.