Spark very slow performance with wide dataset

I have a small parquet file (7.67 MB) in HDFS, compressed with snappy. The file has 1300 rows and 10500 columns, all double values. When I create a data frame from the parquet file and perform a simple operation like count, it takes 18 seconds.

scala> val df = spark.read.format("parquet").load("/path/to/parquet/file")
df: org.apache.spark.sql.DataFrame = [column0_0: double, column1_1: double ... 10498 more fields]

scala> df.registerTempTable("table")

scala> spark.time(sql("select count(1) from table").show)
+--------+
|count(1)|
+--------+
|    1300|
+--------+

Time taken: 18402 ms

Can anything be done to improve performance with files this wide?


Hey, glad you are here on the community!

Count and show are costly operations in Spark because they run over each and every record, so on a wide dataset they will always take time. Instead of calling them repeatedly, write the results back to a file or database. If you just want to inspect the structure of the result, use df.printSchema(). A simple way to check whether a DataFrame has any rows is Try(df.head): if it returns Success, there is at least one row; if it returns Failure, the DataFrame is empty.
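
A minimal sketch of that check, using the df from the question (only the import is added):

import scala.util.{Try, Success, Failure}

// Inspect the structure without scanning any rows
df.printSchema()

// Check for at least one row without counting everything;
// df.head throws on an empty DataFrame, so wrap it in Try
val hasRows = Try(df.head) match {
  case Success(_) => true   // at least one row exists
  case Failure(_) => false  // DataFrame is empty
}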

When operating on the data frame, you may want to consider selecting only those columns that are of interest to you (i.e. df.select(columns...)) before performing any aggregation. This may trim down the size of your set considerably. Also, if any filtering needs to be done, do that first as well.
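
As a rough sketch, using two of the column names from the question's schema (the filter threshold here is made up):

import org.apache.spark.sql.functions.col

val narrow = df
  .select("column0_0", "column1_1")   // keep only the columns of interest
  .filter(col("column0_0") > 0.0)     // filter early, before any aggregation
narrow.count()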

I found this answer, which may be helpful to you.

Spark SQL is not well suited to processing wide data (more than about 1K columns). If possible, you can use a vector or map column to work around this.
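
For illustration only, one way to get a vector column is Spark ML's VectorAssembler, which packs all the double columns into a single vector column so the schema the optimizer has to handle stays narrow (whether this pays off depends on what you do with the data afterwards):

import org.apache.spark.ml.feature.VectorAssembler

// Combine the 10500 double columns into one vector column
val assembler = new VectorAssembler()
  .setInputCols(df.columns)
  .setOutputCol("features")

val vectorDf = assembler.transform(df).select("features")
vectorDf.count()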
