If we have a folder named folder containing only .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if folder contains more folders named by date, such as 03, 04, ..., which in turn contain some .log files? How do I read these in Spark?

In my case, the structure is even more nested & complex, so a general answer is preferred.

If the directory structure is regular, let's say something like this:

├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt

you can use * wildcard for each level of nesting as shown below:

>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()


If you want to use only files under directories whose names start with "a", you can use

sc.wholeTextFiles("/folder/a*/*/*.txt") or sc.wholeTextFiles("/folder/a*/a*/*.txt")

as well. The * wildcard matches within a single path component, so each level of nesting needs its own pattern.
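To see which files each of the patterns above selects, here is a minimal sketch that rebuilds the sample tree in a temporary directory and matches it with Python's standard glob module. This is only a local stand-in: it assumes that * behaves the same way in Spark's path patterns for these simple cases, and the helper names build_sample_tree and match are made up for illustration.

```python
import glob
import os
import tempfile


def build_sample_tree(root):
    """Create the a/a/aa.txt ... b/b/bb.txt layout from the answer under root."""
    for outer in ("a", "b"):
        for inner in ("a", "b"):
            directory = os.path.join(root, outer, inner)
            os.makedirs(directory)
            with open(os.path.join(directory, outer + inner + ".txt"), "w") as handle:
                handle.write("contents of " + outer + inner + "\n")


def match(root, pattern):
    """Return sorted paths, relative to root, that match the glob pattern."""
    hits = glob.glob(os.path.join(root, pattern))
    return sorted(os.path.relpath(hit, root).replace(os.sep, "/") for hit in hits)


with tempfile.TemporaryDirectory() as folder:
    build_sample_tree(folder)
    all_files = match(folder, "*/*/*.txt")    # one * per level of nesting
    outer_a = match(folder, "a*/*/*.txt")     # top-level directory starts with "a"
    both_a = match(folder, "a*/a*/*.txt")     # both directory levels start with "a"

print(all_files)
print(outer_a)
print(both_a)
```

On the sample tree, the first pattern matches all four files, the second only the two under the top-level a directory, and the third only a/a/aa.txt.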

sc.wholeTextFiles("/directory/201910*/part-*.lzo") returns one (file path, file content) pair per matched file, not the individual lines of the files.

If you want to load the contents of all matched files in a directory line by line, you should use sc.textFile with the same pattern

sc.textFile("/directory/201910*/part-*.lzo")

and enable recursive directory reading:

sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

Tip: the Scala syntax differs from Python. In Scala, set it like this:

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

  • This solved my particular issue. Btw, what if the directory structure is not regular?
  • Then things start getting messy :) The idea is more or less the same, but it is unlikely you can prepare patterns that can be easily reused. You can always use normal tools to traverse the filesystem and collect paths instead of hardcoding them.
  • Why does this not work with /folder/**/*.txt? I have basically the exact same directory structure and I'd like to open all with sc.wholeTextFiles('data/**/*.json') but that does not seem to work ..?
  • @zero323, I couldn't use the wildcard in wholeTextFiles, as I'm getting the error IllegalArgumentException: ' Expected scheme-specific part at index.
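For the irregular case raised in the comments, the suggestion to traverse the filesystem with normal tools could be sketched roughly as follows. This assumes the files sit on a filesystem that os.walk can see from the driver (for HDFS or object stores a different listing API would be needed); the helper name collect_paths and the sample layout are hypothetical.

```python
import os
import tempfile


def collect_paths(root, suffix):
    """Walk root recursively and return sorted paths of files ending with suffix."""
    found = []
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                found.append(os.path.join(directory, name))
    return sorted(found)


# Hypothetical irregular layout: .log files at different depths.
with tempfile.TemporaryDirectory() as root:
    for relative in ("03/x.log", "04/deep/er/y.log", "04/notes.txt"):
        path = os.path.join(root, *relative.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as handle:
            handle.write("line\n")

    paths = collect_paths(root, ".log")
    relative_paths = [os.path.relpath(p, root).replace(os.sep, "/") for p in paths]
    # textFile accepts a comma-separated list of paths.
    joined = ",".join(paths)

print(relative_paths)
# With an existing SparkContext named sc, the result could then be read as:
#   rdd = sc.textFile(joined)
```

Collecting the paths first keeps the Spark call independent of how deep or uneven the nesting is, at the cost of listing the files on the driver.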