Quickly reading very large tables as dataframes
I have very large tables (30 million rows) that I would like to load as data frames in R.
read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.
I know that reading in a table as a list using
scan() can be quite fast, e.g.:
datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))
But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:
df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))
Is there a better way of doing this? Or quite possibly a completely different approach to the problem?
An update, several years later
This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:

- Using fread in the data.table package, as benchmarked in the answer below.
- Using readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than fread).
- Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) The sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also the RODBC package, and the reverse-depends section of the DBI package page.
- MonetDB.R gives you a data type that pretends to be a data frame but is really MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function.
- dplyr allows you to work directly with data stored in several types of database.
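For the file layout described in the question (tab-separated, no header), the readr route might look like the sketch below. The column names and the "cddd" type string are assumptions taken from the question's scan() call, not from any answer here; a temporary toy file stands in for the real data so the snippet runs on its own.

```r
library(readr)

# Toy file in the question's layout: four tab-separated columns, no header.
tmp <- tempfile()
writeLines(c("a\t10\t1\t5", "b\t20\t2\t6"), tmp)

# "cddd" = character, double, double, double
df <- read_tsv(tmp,
               col_names = c("url", "popularity", "mintime", "maxtime"),
               col_types = "cddd")
```

Because the types are declared up front, readr skips its type-guessing pass, which is part of where the speed comes from.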
The original answer
There are a couple of simple things to try, whether you use read.table or scan.
- Set nrows = the number of records in your data (nmax in scan()).
- Make sure that comment.char = "" to turn off interpretation of comments.
- Explicitly define the classes of each column using colClasses in read.table().
- Setting multi.line = FALSE may also improve performance in scan().
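Put together for the question's file, those tweaks might look like this sketch; the column names and types are assumptions carried over from the question, and a toy temp file stands in for the real one so the call is runnable:

```r
# Toy file in the question's layout (tab-separated, no header)
tmp <- tempfile()
writeLines(c("a\t10\t1\t5", "b\t20\t2\t6"), tmp)

# nrows, comment.char and colClasses together skip most of read.table's guesswork
df <- read.table(tmp, sep = "\t", header = FALSE,
                 col.names = c("url", "popularity", "mintime", "maxtime"),
                 colClasses = c("character", "numeric", "numeric", "numeric"),
                 nrows = 2, comment.char = "")
```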
If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with saveRDS; next time you can retrieve it faster with readRDS.
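A minimal sketch of that round trip, with a toy data frame standing in for the result of the one-off slow import:

```r
# Pretend this came from the one-off slow text import
df <- data.frame(url = c("a", "b"), popularity = c(10, 20))

path <- tempfile(fileext = ".rds")
saveRDS(df, path)       # write the data frame as a binary blob
df2 <- readRDS(path)    # later sessions reload this much faster than re-parsing text
```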
You can use the fread function from the data.table package to import large tables in a very short time. Here is an example that utilizes fread, along with several alternatives for comparison. The examples come from the help page for fread, with the timings on my Windows XP Core 2 Duo E8400.
library(data.table)

# Demo speedup
n = 1e6
DT = data.table(
  a = sample(1:1000, n, replace = TRUE),
  b = sample(1:1000, n, replace = TRUE),
  c = rnorm(n),
  d = sample(c("foo", "bar", "baz", "qux", "quux"), n, replace = TRUE),
  e = rnorm(n),
  f = sample(1:1000, n, replace = TRUE)
)
DT[2, b := NA_integer_]
DT[4, c := NA_real_]
DT[3, d := NA_character_]
DT[5, d := ""]
DT[2, e := +Inf]
DT[3, e := -Inf]
write.table(DT, "test.csv", sep = ",", row.names = FALSE, quote = FALSE)
cat("File size (MB):", round(file.info("test.csv")$size / 1024^2), "\n")
## File size (MB): 51

system.time(DF1 <- read.csv("test.csv", stringsAsFactors = FALSE))
##    user  system elapsed
##   24.71    0.15   25.42

# second run will be faster
system.time(DF1 <- read.csv("test.csv", stringsAsFactors = FALSE))
##    user  system elapsed
##   17.85    0.07   17.98
system.time(DF2 <- read.table("test.csv", header = TRUE, sep = ",", quote = "",
                              stringsAsFactors = FALSE, comment.char = "", nrows = n,
                              colClasses = c("integer", "integer", "numeric",
                                             "character", "numeric", "integer")))
##    user  system elapsed
##   10.20    0.03   10.32
require(data.table)
system.time(DT <- fread("test.csv"))
##    user  system elapsed
##    3.12    0.01    3.22
require(sqldf)
system.time(SQLDF <- read.csv.sql("test.csv", dbname = NULL))
##    user  system elapsed
##   12.49    0.09   12.69

# sqldf as on SO
f <- file("test.csv")
system.time(SQLf <- sqldf("select * from f", dbname = tempfile(),
                          file.format = list(header = T, row.names = F)))
##    user  system elapsed
##   10.21    0.47   10.73
ff / ffdf
require(ff)
system.time(FFDF <- read.csv.ffdf(file = "test.csv", nrows = n))
##    user  system elapsed
##   10.85    0.10   10.99
##    user  system elapsed  Method
##   24.71    0.15   25.42  read.csv (first time)
##   17.85    0.07   17.98  read.csv (second time)
##   10.20    0.03   10.32  Optimized read.table
##    3.12    0.01    3.22  fread
##   12.49    0.09   12.69  sqldf
##   10.21    0.47   10.73  sqldf on SO
##   10.85    0.10   10.99  ffdf
I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used
sqldf() to do this.
There's been a little bit of discussion as to the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using
sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40mm rows) of data in < 5 minutes. By contrast, the
read.csv command ran all night and never completed.
Here's my test code:
Set up the test data:
bigdf <- data.frame(dim = sample(letters, replace = T, 4e7),
                    fact1 = rnorm(4e7),
                    fact2 = rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)
I restarted R before running the following import routine:
library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(),
                           file.format = list(header = T, row.names = F)))
I let the following line run all night but it never completed:
system.time(big.df <- read.csv('bigdf.csv'))
Strangely, no one answered the bottom part of the question for years even though this is an important one --
data.frames are simply lists with the right attributes, so if you have large data you don't want to use
as.data.frame or similar for a list. It's much faster to simply "turn" a list into a data frame in-place:
# number of rows = length of the first column, not the number of columns
attr(df, "row.names") <- .set_row_names(length(df[[1]]))
class(df) <- "data.frame"
This makes no copy of the data so it's immediate (unlike all other methods). It assumes that you have already set
names() on the list accordingly.
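Applied to a list shaped like the one scan() returns for the question's file (the names and values here are invented for illustration), the in-place conversion is just those two assignments:

```r
# A named list, as scan() would produce for the question's columns
lst <- list(url = c("a", "b", "c"), popularity = c(10, 20, 30))

# Turn it into a data frame in place: no copy, unlike as.data.frame()
attr(lst, "row.names") <- .set_row_names(length(lst[[1]]))
class(lst) <- "data.frame"
```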
[As for loading large data into R -- personally, I dump them by column into binary files and use
readBin() - that is by far the fastest method (other than mmapping) and is only limited by the disk speed. Parsing ASCII files is inherently slow (even in C) compared to binary data.]
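A sketch of that per-column binary dump, assuming a single numeric column; writeBin stands in here for whatever process produced the binary files originally:

```r
x <- rnorm(1e5)

# dump one column as raw doubles
path <- tempfile(fileext = ".bin")
con <- file(path, "wb")
writeBin(x, con)
close(con)

# read it straight back: no text parsing, limited mainly by disk speed
con <- file(path, "rb")
y <- readBin(con, what = "double", n = length(x))
close(con)
```

Doubles round-trip exactly through writeBin/readBin, so no precision is lost the way it can be with decimal text.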
Handling large datasets in R, Very Large files - ( > 10 GB) that needs distributed large scale computing. We will go Following table shows optimization steps while reading the file and relative List of 3 ## $ virtual: 'data.frame': 5 obs. of 7 variables: ## . Also, dplyr creates deep copies of the entire data frame where as data.table does a shallow copy of the data frame. Shallow copy means that the data is not physically copied in system’s memory. It’s just a copy of column pointers (names). Deep copy copies the entire data to another location in the memory.
This was previously asked on R-Help, so that's worth reviewing.
One suggestion there was to use
readChar() and then do string manipulation on the result with
substr(). You can see that the logic involved in readChar is much less than in read.table.
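A toy version of that idea, assuming fixed-width records; the 6-character layout (3-char key plus 3-digit value) is invented purely for illustration:

```r
# two fixed-width records: 3-char key + 3-digit value, no separators
path <- tempfile()
writeChar("aaa001bbb002", path, eos = NULL)

s <- readChar(path, file.info(path)$size)   # slurp the whole file in one call
starts <- seq(1, nchar(s), by = 6)
keys <- substring(s, starts, starts + 2)    # "aaa", "bbb"
vals <- as.integer(substring(s, starts + 3, starts + 5))
```

The vectorised substring() calls carve every record in one pass, which is the string-manipulation step the R-Help suggestion refers to.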
I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, which is a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. This is an example (but it has a learning curve to learn Hadoop):
str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\n"
cat(str)
cols = list(key = '', val = 0)
con <- textConnection(str, open = "r")
hsTableReader(con, cols, chunkSize = 6, FUN = print, ignoreKey = TRUE)
close(con)
The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the data import in parallel by segmenting the file, but most likely for large data sets that won't help since you will run into memory constraints, which is why map-reduce is a better approach.