Read all files in a nested folder in Spark

If we have a folder folder containing all .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if I have a folder folder containing even more folders named by date, like 03, 04, ..., which in turn contain some .log files? How do I read these in Spark?

In my case, the structure is even more nested & complex, so a general answer is preferred.

If the directory structure is regular, let's say something like this:

folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt

you can use a * wildcard for each level of nesting, as shown below:

>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()

[u'file:/folder/a/a/aa.txt',
 u'file:/folder/a/b/ab.txt',
 u'file:/folder/b/a/ba.txt',
 u'file:/folder/b/b/bb.txt']

Spark: this method (wholeTextFiles) reads the whole file at once; try the textFile method instead. Second, if you need to get all files recursively under one directory, you can read data from all files in a data folder: val df = spark.read.format("parquet").load("/tmp/data1/")
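
For comparison, here is a minimal PySpark sketch of the same idea. The paths are illustrative, and the recursiveFileLookup option is an assumption that requires Spark 3.0 or later; on older versions fall back to one * wildcard per nesting level.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-nested-parquet").getOrCreate()

# Flat directory: load() picks up every Parquet file directly inside it.
df_flat = spark.read.format("parquet").load("/tmp/data1/")

# Nested directories (Spark 3.0+): recursiveFileLookup descends into sub-folders.
df_nested = spark.read.option("recursiveFileLookup", "true").parquet("/tmp/data1/")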

If you want to use only files whose names start with "a", you can use

sc.wholeTextFiles("/folder/a*/*/*.txt") or sc.wholeTextFiles("/folder/a*/a*/*.txt")

as well. * can be used as a wildcard at any level.

Solved: How can I read all files in a directory using Scala? I have 1 CSV (comma separated) and 1 PSV (pipe separated) file in the same dir /data/dev/spark; how can I read each file and convert them? 5/10/30 14:57:47 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 6, hadoop1): java.io.FileNotFoundException: Path is not a file: /folder/subfolder. Is there any way I can read all the avros (even in subdirectories) into an RDD? All the avros have the same schema and I am on Spark 1.3.0.

sc.wholeTextFiles("/directory/201910*/part-*.lzo") get all match files name, not files content.

If you want to load the contents of all the matched files in a directory, you should use

sc.textFile("/directory/201910*/part-*.lzo")

and enable recursive directory reading:

sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

Tip: the Scala API differs from Python; in Scala, use:

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
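
Putting the two steps together, a minimal PySpark sketch (the path is illustrative, and reading .lzo files assumes an LZO codec is installed on the cluster):

# Enable recursive directory listing, then load the contents of every matching
# file into a single RDD, one record per line.
sc._jsc.hadoopConfiguration().set(
    "mapreduce.input.fileinputformat.input.dir.recursive", "true")
rdd = sc.textFile("/directory/201910*/part-*.lzo")
print(rdd.count())  # total number of lines across all matched files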

How can I use Spark to read a whole directory instead of a single file? Spark provides different ways of reading different file formats, including Avro, Parquet, text, TSV, etc. For the text format, rdd = sc.textFile("inputPath") reads files such as text01.txt and text02.txt and outputs their content, e.g. One,1 Two,2. To read all text files matching a pattern into a single RDD: the textFile() method also accepts pattern matching and wildcard characters, so a single call can read every file whose name starts with "text" and ends in ".txt" into one RDD.
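
A rough illustration of that pattern-matching behaviour (the file names are made up for the example):

# Glob pattern: read every file starting with "text" and ending in ".txt"
# under inputPath into a single RDD, one record per line.
rdd_all = sc.textFile("inputPath/text*.txt")
print(rdd_all.collect())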

Spark: make sure you do not have a nested directory; if Spark finds one, the read can fail. To read multiple text files into a single RDD in Spark, use the SparkContext.textFile() method. Typical scenarios are reading an explicit list of text files into a single RDD and reading all text files in a directory into a single RDD (with Java and Python examples for each).
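
A small PySpark sketch of the explicit-list variant (paths are illustrative); textFile() accepts a comma-separated list of paths as well as a directory:

# Explicit list of files merged into a single RDD.
rdd = sc.textFile("inputPath/text01.txt,inputPath/text02.txt")

# Whole (non-nested) directory merged into a single RDD.
rdd_dir = sc.textFile("inputPath/")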

Spark Scala: how to list all folders in a directory. Related: what is the HDFS command to list all files in HDFS by timestamp? Spark Streaming uses readStream to monitor a folder and process files that arrive in the directory in real time, and writeStream to write out a DataFrame or Dataset; it is a scalable, high-throughput, fault-tolerant stream processing system that supports both batch and streaming workloads.
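
If you only need the list of sub-folders rather than their contents, one option from PySpark is to go through the JVM Hadoop FileSystem API, roughly like this (a sketch assuming the default filesystem; the path is illustrative):

# List the immediate sub-directories of /folder via the Hadoop FileSystem API.
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
statuses = fs.listStatus(hadoop.fs.Path("/folder"))
dirs = [s.getPath().toString() for s in statuses if s.isDirectory()]
print(dirs)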

Read Avro from multiple nested directories: I confirmed that the Avro files do exist, but the read fails when I try to create a DataFrame using sourceDf = spark.read.… Find below the description from the Spark docs: SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which returns one record per line in each file.
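
For the Avro case, a hedged sketch (it assumes the external spark-avro package has been added to the session and uses illustrative paths; recursiveFileLookup again requires Spark 3.0+):

# One * wildcard per nesting level:
sourceDf = spark.read.format("avro").load("/folder/*/*/*.avro")

# Or, on Spark 3.0+, descend arbitrarily deep:
sourceDf = spark.read.format("avro").option("recursiveFileLookup", "true").load("/folder")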

Comments
  • This solved my particular issue. Btw, what if the directory structure is not regular?
  • Then things start getting messy :) The idea is more or less the same, but it is unlikely you can prepare patterns that can be easily reused. You can always use normal tools to traverse the filesystem and collect paths instead of hardcoding them; see the sketch after these comments.
  • Why does this not work with /folder/**/*.txt? I have basically the exact same directory structure and I'd like to open all with sc.wholeTextFiles('data/**/*.json') but that does not seem to work ..?
  • @zero323, I couldn't use the wildcard in wholeTextFiles as I'm getting the error IllegalArgumentException: 'java.net.URISyntaxException: Expected scheme-specific part at index.
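
A rough sketch of that traversal idea for an irregular local layout (names are illustrative; for HDFS you would list paths through the Hadoop FileSystem API instead):

import os

# Walk the tree, keep only the files of interest, and hand the comma-joined
# list of paths to textFile (or wholeTextFiles).
paths = [os.path.join(root, f)
         for root, _, files in os.walk("/folder")
         for f in files if f.endswith(".txt")]
rdd = sc.textFile(",".join(paths))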