Loading millions of small files from Azure Data Lake Store to Databricks

I've got a partitioned folder structure in the Azure Data Lake Store containing roughly 6 million JSON files (ranging from a couple of KB to 2 MB each). I'm trying to extract some fields from these files using Python code in Databricks.

Currently I'm trying the following:

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx")
spark.conf.set("dfs.adls.oauth2.credential", "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx/oauth2/token")

df = spark.read.json("adl://xxxxxxx.azuredatalakestore.net/staging/filetype/category/2017/*/")

This example reads only part of the files, since it points to "staging/filetype/category/2017/". It seems to work, and some jobs start when I run these commands. It's just very slow.

Job 40 indexes all of the subfolders and is relatively fast

Job 41 checks a set of the files and seems a bit too fast to be true

Then comes job 42, and that's where the slowness starts. It seems to do the same activities as job 41, just... slow

I have a feeling that I have a similar problem to this thread. But the speed of job 41 makes me doubtful. Are there faster ways to do this?

To add to Jason's answer:

We have run some test jobs in Azure Data Lake operating on about 1.7m files with U-SQL and were able to complete the processing in about 20 hours with 10 AUs. The job was generating several thousand extract vertices, so with a larger number of AUs, it could have finished in a fraction of the time.

We have not tested 6m files, but if you are willing to try, please let us know.
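For a rough feel of what those numbers imply, here is a back-of-the-envelope sketch. It assumes throughput scales linearly with both the file count and the number of AUs, which the test above does not establish (the answer only says more AUs would take "a fraction of the time"), and it ignores differences in file size. Treat the result as an order-of-magnitude guess, not a measurement.

```python
# Figures taken from the test run described above.
files_tested = 1_700_000
hours_taken = 20
aus_used = 10

# Files processed per AU per hour, ASSUMING perfectly linear scaling.
files_per_au_hour = files_tested / (hours_taken * aus_used)


def estimated_hours(n_files: int, n_aus: int) -> float:
    """Naive linear extrapolation of run time; not a benchmark."""
    return n_files / (files_per_au_hour * n_aus)


if __name__ == "__main__":
    print(f"{files_per_au_hour:.0f} files per AU-hour")
    print(f"6m files, 10 AUs:  ~{estimated_hours(6_000_000, 10):.1f} h")
    print(f"6m files, 100 AUs: ~{estimated_hours(6_000_000, 100):.1f} h")
```

Even under these optimistic assumptions the per-file overhead dominates, which is another argument for fewer, larger files.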

In any case, I do concur with Jason's suggestion to reduce the number of files and make them larger.

We combine files on an hourly basis using an Azure Function, and that brings down the file processing time significantly. So try combining the files before you send them to the ADB cluster for processing. If you don't, you will need a very high number of worker nodes, and that might increase your cost.
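The hourly combining step could look roughly like the sketch below. This is not the Azure Function mentioned above: it works on a local folder as a stand-in for the lake, and the `<yyyy>/<mm>/<dd>/<hh>/` layout, the `hour_key` helper and the `combine_hourly` function are all hypothetical names chosen for illustration. It writes one JSON Lines file per hour, which is the layout `spark.read.json` expects by default.

```python
import json
from collections import defaultdict
from pathlib import Path


def hour_key(path: Path) -> str:
    """ASSUMED layout: .../<yyyy>/<mm>/<dd>/<hh>/<name>.json -> 'yyyy-mm-dd-hh'."""
    return "-".join(path.parts[-5:-1])


def combine_hourly(src_root: Path, dst_root: Path) -> dict:
    """Merge the small JSON files under src_root into one JSON Lines file per hour.

    Returns a mapping of hour key -> number of source files merged.
    """
    groups = defaultdict(list)
    for p in sorted(src_root.rglob("*.json")):
        groups[hour_key(p)].append(p)

    dst_root.mkdir(parents=True, exist_ok=True)
    counts = {}
    for key, paths in groups.items():
        out = dst_root / f"{key}.jsonl"
        with out.open("w", encoding="utf-8") as fh:
            for p in paths:
                # Re-serialize each document so it occupies exactly one line.
                doc = json.loads(p.read_text(encoding="utf-8"))
                fh.write(json.dumps(doc) + "\n")
        counts[key] = len(paths)
    return counts
```

Against the real store you would swap the `pathlib` calls for whatever storage client your function uses; the grouping and one-document-per-line output are the parts that matter.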

Load files from Azure Blob storage or Azure Data Lake Storage: As a best practice, batch your data into larger files rather than writing thousands or millions of small files to Data Lake Storage Gen1. Azure Data Factory offers a scale-out, managed data movement solution. Because of its scale-out architecture, ADF can ingest data at high throughput. For details, see Copy activity performance.

I think you will need to look at combining the files before processing, both to increase their size and to reduce the number of files. The optimal file size is about 250 MB. There are a number of ways to do this. Perhaps the easiest would be to use Azure Data Lake Analytics jobs, or even to use Spark to iterate over a subset of the files at a time.
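One way to "iterate over a subset of the files" is to plan batches of roughly the target size first and then compact one batch at a time. The sketch below is only an illustration: `plan_batches` and `compact_batch` are hypothetical helpers, the input is a list of (path, size in bytes) pairs that you would get from a directory listing, and `spark` is assumed to be the SparkSession already available in the notebook. Input size is only a proxy for output size, so the resulting files will land near 250 MB rather than exactly on it.

```python
from typing import Iterable, List, Tuple

TARGET_BYTES = 250 * 1024 * 1024  # ~250 MB, the size suggested above


def plan_batches(
    files: Iterable[Tuple[str, int]], target_bytes: int = TARGET_BYTES
) -> List[List[str]]:
    """Greedily group (path, size_in_bytes) pairs into batches of about target_bytes."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def compact_batch(spark, paths: List[str], out_dir: str) -> None:
    """Read one batch of small JSON files and rewrite it as a single output file.

    `spark` is an existing SparkSession; `out_dir` is a hypothetical target folder.
    """
    spark.read.json(paths).coalesce(1).write.mode("overwrite").json(out_dir)
```

For example, `plan_batches([("a", 100), ("b", 100), ("c", 100)], target_bytes=250)` groups the first two files together and puts the third in its own batch. Each batch can then be passed to `compact_batch` with its own output folder.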

Azure Databricks: Auto Loader can ingest data from Azure Blob storage or Azure Data Lake Storage Gen2 in Azure Databricks, and this approach is scalable even with millions of files in a directory. Delta Lake is an open source storage layer that brings reliability to data lakes. It provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
