Growing a data.frame in a memory-efficient manner

According to Creating an R dataframe row-by-row, it's not ideal to append to a data.frame using rbind, as it creates a copy of the whole data.frame each time. How do I accumulate data in R resulting in a data.frame without incurring this penalty? The intermediate format doesn't need to be a data.frame.

First approach

I tried accessing each element of a pre-allocated data.frame:

res <- data.frame(x=rep(NA,1000), y=rep(NA,1000))
tracemem(res)
for(i in 1:1000) {
  res[i,"x"] <- runif(1)
  res[i,"y"] <- rnorm(1)
}

But tracemem goes crazy (i.e. the data.frame is copied to a new address on each assignment).

Alternative approach (doesn't work either)

One approach (not sure it's faster as I haven't benchmarked yet) is to create a list of data.frames, then stack them all together:

makeRow <- function() data.frame(x=runif(1),y=rnorm(1))
res <- replicate(1000, makeRow(), simplify=FALSE ) # returns a list of data.frames
library(taRifx)
res.df <- stack(res)
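
If taRifx isn't available, the same stacking can be done with base R or data.table (these alternatives aren't from the original answer, just equivalents I'm fairly sure of):

res.df <- do.call(rbind, res)          # base R
res.df <- data.table::rbindlist(res)   # data.table; returns a data.table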

Unfortunately in creating the list I think you will be hard-pressed to pre-allocate. For instance:

> tracemem(res)
[1] "<0x79b98b0>"
> res[[2]] <- data.frame()
tracemem[0x79b98b0 -> 0x71da500]: 

In other words, replacing an element of the list causes the list to be copied. I assume the whole list, but it's possible it's only that element of the list. I'm not intimately familiar with the details of R's memory management.
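
For what it's worth, here is a minimal sketch (mine, not from the original answer) of pre-allocating the list with vector("list", n) and filling it by index; as far as I understand, even if tracemem reports a copy of the list, it is the vector of pointers that is copied, not the data.frames it already holds:

res <- vector("list", 1000)                      # pre-allocated list of NULLs
for (i in 1:1000) {
  res[[i]] <- data.frame(x=runif(1), y=rnorm(1))
}
res.df <- do.call(rbind, res)                    # stack at the end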

Probably the best approach

As with many speed or memory-limited processes these days, the best approach may well be to use data.table instead of a data.frame. Since data.table has the := assign by reference operator, it can update without re-copying:

library(data.table)
dt <- data.table(x=rep(0,1000), y=rep(0,1000))
tracemem(dt)
for(i in 1:1000) {
  dt[i,x := runif(1)]
  dt[i,y := rnorm(1)]
}
# note no message from tracemem

But as @MatthewDowle points out, set() is the appropriate way to do this inside a loop. Doing so makes it faster still:

library(data.table)
n <- 10^6
dt <- data.table(x=rep(0,n), y=rep(0,n))

dt.colon <- function(dt) {
  for(i in 1:n) {
    dt[i,x := runif(1)]
    dt[i,y := rnorm(1)]
  }
}

dt.set <- function(dt) {
  for(i in 1:n) {
    set(dt,i,1L, runif(1) )
    set(dt,i,2L, rnorm(1) )
  }
}

library(microbenchmark)
m <- microbenchmark(dt.colon(dt), dt.set(dt),times=2)

(Results shown below)

Benchmarking

With the loop run 10,000 times, data.table is almost a full order of magnitude faster:

Unit: seconds
          expr        min         lq     median         uq        max
1    test.df()  523.49057  523.49057  524.52408  525.55759  525.55759
2    test.dt()   62.06398   62.06398   62.98622   63.90845   63.90845
3 test.stack() 1196.30135 1196.30135 1258.79879 1321.29622 1321.29622

And comparison of := with set():

> m
Unit: milliseconds
          expr       min        lq    median       uq      max
1 dt.colon(dt) 654.54996 654.54996 656.43429 658.3186 658.3186
2   dt.set(dt)  13.29612  13.29612  15.02891  16.7617  16.7617

Note that n here is 10^6, not 10^5 as in the benchmarks above. So there's an order of magnitude more work, and yet the result is measured in milliseconds rather than seconds. Impressive indeed.


You could also have an empty list object where elements are filled with dataframes; then collect the results at the end with sapply or similar. An example can be found here. This will not incur the penalties of growing an object.
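
Since the linked example isn't reproduced here, a minimal sketch of the idea (my own wording; it uses do.call(rbind, ...) rather than sapply to collect) might look like this:

res <- list()                                      # empty list object
for (i in 1:1000) {
  res[[i]] <- data.frame(x=runif(1), y=rnorm(1))   # each element is a one-row data.frame
}
res.df <- do.call(rbind, res)                      # collect the results at the end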



I like RSQLite for this: use dbWriteTable(..., append=TRUE) statements while collecting, and a single dbReadTable statement at the end.

If the data is small enough, one can use the ":memory:" database; if it is big, the hard disk.

Of course, it cannot compete in terms of speed:

makeRow <- function() data.frame(x=runif(1),y=rnorm(1))

library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

collect1 <- function(n) {
  for (i in 1:n) dbWriteTable(con, "test", makeRow(), append=TRUE)
  dbReadTable(con, "test", row.names=NULL)
}

collect2 <- function(n) {
  res <- data.frame(x=rep(NA, n), y=rep(NA, n))
  for(i in 1:n) res[i,] <- makeRow()[1,]
  res
}

> system.time(collect1(1000))
   user  system elapsed 
   7.01    0.00    7.05  
> system.time(collect2(1000))
   user  system elapsed 
   0.80    0.01    0.81 

But it might look better if the data.frames have more than one row. And you do not need to know the number of rows in advance.
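
For example, a disk-backed database with multi-row chunks would look something like this (a sketch with an arbitrary chunk size; the file name and sizes are made up, not benchmarked):

con <- dbConnect(RSQLite::SQLite(), "accumulate.sqlite")   # file on disk instead of ":memory:"
for (i in 1:10) {
  chunk <- data.frame(x=runif(100), y=rnorm(100))          # 100 rows per append
  dbWriteTable(con, "test", chunk, append=TRUE)
}
res <- dbReadTable(con, "test")
dbDisconnect(con)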


This post suggests stripping off the data.frame / tibble class attributes using as.list, assigning list elements in place the usual way, and then converting the result back to a data.frame / tibble at the end. The run time of this method grows linearly with n, but with a very small constant (less than 10e-6 per iteration).

library(tibble)    # for tibble() / as_tibble()
library(magrittr)  # for %>%

in_place_list_bm <- function(n) {
    res <- tibble(x = rep(NA_real_, n))
    tracemem(res)
    res <- as.list(res)           # drop the tibble class; work on a plain list
    for (i in 1:n) {
        res[['x']][[i]] <- i      # in-place assignment, no per-iteration copy
    }
    return(res %>% as_tibble())   # restore the tibble class at the end
}

> system.time(in_place_list_bm(10000))[[3]]
tracemem[0xd87aa08 -> 0xd87aaf8]: as.list.data.frame as.list in_place_list_bm system.time 
tracemem[0xd87aaf8 -> 0xd87abb8]: in_place_list_bm system.time 
tracemem[0xd87abb8 -> 0xe045928]: check_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
tracemem[0xe045928 -> 0xe043488]: new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
tracemem[0xe043488 -> 0xe043728]: set_tibble_class new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
[1] 0.006
> system.time(in_place_list_bm(100000))[[3]]
tracemem[0xdf89f78 -> 0xdf891b8]: as.list.data.frame as.list in_place_list_bm system.time 
tracemem[0xdf891b8 -> 0xdf89278]: in_place_list_bm system.time 
tracemem[0xdf89278 -> 0x5e00fb8]: check_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
tracemem[0x5e00fb8 -> 0x5dd46b8]: new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
tracemem[0x5dd46b8 -> 0x5dcec98]: set_tibble_class new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
[1] 0.045

(The original article includes an image of these benchmark results; it is not reproduced here.)


Comments
  • Edited to make clear what I'm pretty sure you meant. Please revert if I messed up.
  • If you are still interested, here is another benchmark of other set of different way to grow data.frame when you don't know the size in advance.
  • As far as I can tell, your last example doesn't grow the data.table. You simply overwrite the first row 1,000 times.
  • That's good, but have you seen the speed example at the bottom of ?":=" comparing := within a loop to set() within a loop? := has overhead (e.g. checking the existence and type of arguments passed to [.data.table), which is why set() is provided for use inside loops.
  • And overhead would explain why it wasn't the fastest on small data. Change to set() and it should be the fastest again.
  • @MatthewDowle Neat tip, thanks. I couldn't find anything about set() documented in ?":=", though, and even ?set has only a comment that it "should be documented in ?":=", perhaps."
  • Ah yes, set is now (recently) better documented and lives in ?":=". It's thanks to discussions with @JoshuaUlrich here on S.O. that set() got added to data.table. Search NEWS for string "set(" for further info.
  • The idea is cool, but it is far from efficient. I put it to the test in another thread.