Growing a data.frame in a memory-efficient manner
According to Creating an R dataframe row-by-row, it's not ideal to append to a data.frame using rbind, as it creates a copy of the whole data.frame each time. How do I accumulate data in R resulting in a data.frame without incurring this penalty? The intermediate format doesn't need to be a data.frame.
First approach
I tried accessing each element of a pre-allocated data.frame:
res <- data.frame(x=rep(NA,1000), y=rep(NA,1000))
tracemem(res)
for(i in 1:1000) {
  res[i,"x"] <- runif(1)
  res[i,"y"] <- rnorm(1)
}
But tracemem goes crazy: the data.frame is copied to a new address on each assignment.
Alternative approach (doesn't work either)
One approach (not sure it's faster as I haven't benchmarked yet) is to create a list of data.frames, then stack them all together:
makeRow <- function() data.frame(x=runif(1), y=rnorm(1))
res <- replicate(1000, makeRow(), simplify=FALSE)  # returns a list of data.frames
library(taRifx)
res.df <- stack(res)
Unfortunately in creating the list I think you will be hard-pressed to pre-allocate. For instance:
> tracemem(res)
[1] "<0x79b98b0>"
> res[[2]] <- data.frame()
tracemem[0x79b98b0 -> 0x71da500]:
In other words, replacing an element of the list causes the list to be copied. I assume the whole list, but it's possible it's only that element of the list. I'm not intimately familiar with the details of R's memory management.
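One way to probe this yourself (a minimal sketch, not from the original post; whether tracemem reports copies here depends on your R version and its reference counting) is to pre-allocate the list with vector("list", n) and fill it while tracemem is active:

# Sketch (assumption, not from the original question): pre-allocate the list
# and fill one slot per iteration, watching tracemem for reported copies.
res <- vector("list", 1000)   # pre-allocated list of NULLs
tracemem(res)
for (i in 1:1000) {
  res[[i]] <- data.frame(x = runif(1), y = rnorm(1))
}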
Probably the best approach
As with many speed or memory-limited processes these days, the best approach may well be to use data.table instead of a data.frame. Since data.table has the := assign-by-reference operator, it can update without re-copying:
library(data.table)
dt <- data.table(x=rep(0,1000), y=rep(0,1000))
tracemem(dt)
for(i in 1:1000) {
  dt[i, x := runif(1)]
  dt[i, y := rnorm(1)]
}
# note no message from tracemem
But as @MatthewDowle points out, set() is the appropriate way to do this inside a loop. Doing so makes it faster still:
library(data.table)
n <- 10^6
dt <- data.table(x=rep(0,n), y=rep(0,n))

dt.colon <- function(dt) {
  for(i in 1:n) {
    dt[i, x := runif(1)]
    dt[i, y := rnorm(1)]
  }
}

dt.set <- function(dt) {
  for(i in 1:n) {
    set(dt, i, 1L, runif(1))
    set(dt, i, 2L, rnorm(1))
  }
}

library(microbenchmark)
m <- microbenchmark(dt.colon(dt), dt.set(dt), times=2)
(Results shown below)
Benchmarking
With the loop run 10,000 times, data.table is almost a full order of magnitude faster:
Unit: seconds
          expr        min         lq     median         uq        max
1    test.df()  523.49057  523.49057  524.52408  525.55759  525.55759
2    test.dt()   62.06398   62.06398   62.98622   63.90845   63.90845
3 test.stack() 1196.30135 1196.30135 1258.79879 1321.29622 1321.29622
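The test.df(), test.dt() and test.stack() functions behind these numbers aren't reproduced above. A plausible sketch of them, reconstructed from the approaches described earlier (the function bodies and n = 10^4 are assumptions, not the original benchmark code):

# Sketch (assumption): benchmark harness for the three approaches above.
library(data.table)
library(microbenchmark)
library(taRifx)   # for the stack() used in the earlier snippet

n <- 10^4

test.df <- function() {
  # pre-allocated data.frame filled by element assignment
  res <- data.frame(x = rep(NA_real_, n), y = rep(NA_real_, n))
  for (i in 1:n) {
    res[i, "x"] <- runif(1)
    res[i, "y"] <- rnorm(1)
  }
  res
}

test.dt <- function() {
  # data.table updated in place with :=
  dt <- data.table(x = rep(0, n), y = rep(0, n))
  for (i in 1:n) {
    dt[i, x := runif(1)]
    dt[i, y := rnorm(1)]
  }
  dt
}

test.stack <- function() {
  # list of one-row data.frames, stacked at the end (as in the earlier snippet)
  res <- replicate(n, data.frame(x = runif(1), y = rnorm(1)), simplify = FALSE)
  stack(res)
}

m0 <- microbenchmark(test.df(), test.dt(), test.stack(), times = 2)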
And a comparison of := with set():
> m
Unit: milliseconds
          expr       min        lq    median       uq      max
1 dt.colon(dt) 654.54996 654.54996 656.43429 658.3186 658.3186
2   dt.set(dt)  13.29612  13.29612  15.02891  16.7617  16.7617
Note that n here is 10^6, not 10^5 as in the benchmarks plotted above. So there's an order of magnitude more work, and the result is measured in milliseconds rather than seconds. Impressive indeed.
You could also have an empty list object where elements are filled with dataframes; then collect the results at the end with sapply or similar. An example can be found here. This will not incur the penalties of growing an object.
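A minimal sketch of that pattern (the makeRow helper and the do.call(rbind, ...) collection step are my assumptions, not necessarily what the linked example shows):

# Sketch (assumption): fill a pre-allocated list with one-row data.frames,
# then combine once at the end instead of growing a data.frame row by row.
makeRow <- function() data.frame(x = runif(1), y = rnorm(1))

n <- 1000
rows <- vector("list", n)        # pre-allocated list; assigning slots is cheap
for (i in 1:n) {
  rows[[i]] <- makeRow()
}
res <- do.call(rbind, rows)      # single combine at the end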
From "Efficient accumulation in R" (Win Vector LLC): adding rows to a data frame one by one is "a bit less efficient than the right way for R", since R copies the data frame as it goes. If you can, allocate your entire data.frame up front and then insert one row at a time during your operations. That should work for an arbitrary data.frame and be much more efficient. If you overshot N, you can always shrink the empty rows out at the end.
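A minimal sketch of that pre-allocate-and-shrink idea (the overshoot size N and the stopping condition are illustrative assumptions, not from the Win Vector post):

# Sketch (assumption): pre-allocate more rows than expected, fill as you go,
# then drop the unused rows at the end.
N <- 2000                                        # deliberate overshoot
res <- data.frame(x = rep(NA_real_, N), y = rep(NA_real_, N))

n_used <- 0
while (n_used < 1500) {                          # illustrative stopping condition
  n_used <- n_used + 1
  res[n_used, ] <- c(runif(1), rnorm(1))
}

res <- res[seq_len(n_used), ]                    # shrink the empty rows out at the end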
I like RSQLite for that matter: dbWriteTable(..., append=TRUE) statements while collecting, and a dbReadTable statement at the end.
If the data is small enough, one can use the ":memory:" database; if it is big, the hard disk. Of course, it cannot compete in terms of speed:
makeRow <- function() data.frame(x=runif(1), y=rnorm(1))

library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

collect1 <- function(n) {
  for (i in 1:n) dbWriteTable(con, "test", makeRow(), append=TRUE)
  dbReadTable(con, "test", row.names=NULL)
}

collect2 <- function(n) {
  res <- data.frame(x=rep(NA, n), y=rep(NA, n))
  for(i in 1:n) res[i,] <- makeRow()[1,]
  res
}

> system.time(collect1(1000))
   user  system elapsed
   7.01    0.00    7.05
> system.time(collect2(1000))
   user  system elapsed
   0.80    0.01    0.81
But it might look better if the data.frames have more than one row. And you do not need to know the number of rows in advance.
This post suggests stripping off the data.frame/tibble's class attributes using as.list, assigning list elements in place the usual way, and then converting the result back to a data.frame/tibble again. The run time of this method grows linearly with n, but with a very small constant (less than 10e-6 seconds per element in the timings below).
library(tibble)    # for tibble() and as_tibble()
library(magrittr)  # for %>%

in_place_list_bm <- function(n) {
  res <- tibble(x = rep(NA_real_, n))
  tracemem(res)
  res <- as.list(res)
  for (i in 1:n) {
    res[['x']][[i]] <- i
  }
  return(res %>% as_tibble())
}

> system.time(in_place_list_bm(10000))[[3]]
tracemem[0xd87aa08 -> 0xd87aaf8]: as.list.data.frame as.list in_place_list_bm system.time
tracemem[0xd87aaf8 -> 0xd87abb8]: in_place_list_bm system.time
tracemem[0xd87abb8 -> 0xe045928]: check_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time
tracemem[0xe045928 -> 0xe043488]: new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time
tracemem[0xe043488 -> 0xe043728]: set_tibble_class new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time
[1] 0.006
> system.time(in_place_list_bm(100000))[[3]]
tracemem[0xdf89f78 -> 0xdf891b8]: as.list.data.frame as.list in_place_list_bm system.time
tracemem[0xdf891b8 -> 0xdf89278]: in_place_list_bm system.time
tracemem[0xdf89278 -> 0x5e00fb8]: check_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time
tracemem[0x5e00fb8 -> 0x5dd46b8]: new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time
tracemem[0x5dd46b8 -> 0x5dcec98]: set_tibble_class new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time
[1] 0.045
The original article also includes a plot of these timings, not reproduced here.
Comments
- Edited to make clear what I'm pretty sure you meant. Please revert if I messed up.
- If you are still interested, here is another benchmark of a different set of ways to grow a data.frame when you don't know the size in advance.
- As far as I can tell, your last example doesn't grow the data.table. You simply overwrite the first row 1,000 times.
- That's good, but have you seen the speed example at the bottom of ?":=" comparing := within a loop to set() within a loop? := has overhead (e.g. checking the existence and type of arguments passed to [.data.table), which is why set() is provided for use inside loops.
- And overhead would explain why it wasn't the fastest on small data. Change to set() and it should be the fastest again.
- @MatthewDowle Neat tip, thanks. I couldn't find anything about set() documented in ?":=", though, and even ?set has only a comment that it "should be documented in ?":=", perhaps."
- Ah yes, set is now (recently) better documented and lives in ?":=". It's thanks to discussions with @JoshuaUlrich here on S.O. that set() got added to data.table. Search NEWS for the string "set(" for further info.
- The idea is cool, but it is far from efficient. I put it to the test on another thread.