How can I read from S3 in pyspark running in local mode?

I am using PyCharm 2018.1 with Python 3.4 and Spark 2.3 installed via pip in a virtualenv. There is no Hadoop installation on the local host, and no standalone Spark installation beyond the pip package (thus no SPARK_HOME, HADOOP_HOME, etc.)

When I try this:

from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()\
    .setMaster("local")\
    .setAppName("pyspark-unittests")\
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf = conf)
inputFile = sc.textFile("s3://somebucket/file.csv")

I get:

py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
: java.io.IOException: No FileSystem for scheme: s3

How can I read from s3 while running pyspark in local mode without a complete Hadoop install locally?

FWIW - this works great when I execute it on an EMR node in non-local mode.

The following does not work (same error, although it does resolve and download the dependencies):

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:3.1.0" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()\
    .setMaster("local")\
    .setAppName("pyspark-unittests")\
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf = conf)
inputFile = sc.textFile("s3://somebucket/file.csv")

Same (bad) results with:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars "/path/to/hadoop-aws-3.1.0.jar" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()\
    .setMaster("local")\
    .setAppName("pyspark-unittests")\
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf = conf)
inputFile = sc.textFile("s3://somebucket/file.csv")

So Glennie's answer was close, but it is not what works in your case. The key thing is to select the right version of the dependencies. If you look at the jars bundled in the virtual environment, everything points to one version, 2.7.3, which is also what you need to use:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'

You should verify the version that your installation is using by checking the path venv/Lib/site-packages/pyspark/jars inside your project's virtual environment.
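
For example, a quick way to check is a small sketch like the one below, assuming the usual pip/virtualenv layout where PySpark keeps its bundled jars in a jars subdirectory of the package:

import glob
import os
import pyspark

# List the bundled Hadoop jars; the version suffix (e.g. hadoop-common-2.7.3.jar)
# tells you which hadoop-aws version to request with --packages.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
for jar in sorted(glob.glob(os.path.join(jars_dir, "hadoop-*.jar"))):
    print(os.path.basename(jar))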

After that you can use s3a:// by default, or s3:// by defining the handler class for it:

# Only needed if you use s3://
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
s3File = sc.textFile("s3a://myrepo/test.csv")

print(s3File.count())
print(s3File.id())

This prints the line count and the id of the RDD.

You should use the s3a protocol when accessing S3 locally. Make sure you add your key and secret to the SparkContext first. Like this:

sc = SparkContext(conf = conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')

inputFile = sc.textFile("s3a://somebucket/file.csv")
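
Note that this snippet alone still needs the hadoop-aws jar on the classpath. A minimal end-to-end sketch, assuming the bundled Hadoop version is 2.7.3 (check your jars directory as described in the other answer) and using the question's placeholder bucket and credentials:

import os

# Must be set before the SparkContext is created so the jars are resolved.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'

from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
    .setMaster("local") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)

# Placeholder credentials for the S3A filesystem.
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')

inputFile = sc.textFile("s3a://somebucket/file.csv")
print(inputFile.count())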

Preparation:

Add the following lines to your Spark config file (for my local PySpark it is /usr/local/spark/conf/spark-defaults.conf):

spark.hadoop.fs.s3a.access.key=<your access key>
spark.hadoop.fs.s3a.secret.key=<your secret key>
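
If you prefer not to touch the config file, the same keys can also be set on the SparkConf programmatically, since Spark copies spark.hadoop.* properties into the Hadoop configuration. A sketch, with the same placeholder credentials:

from pyspark import SparkConf

# Equivalent of the spark-defaults.conf entries above; replace the
# placeholders with your own credentials.
conf = SparkConf() \
    .setAppName("read_s3") \
    .setMaster("local[2]") \
    .set("spark.hadoop.fs.s3a.access.key", "<your access key>") \
    .set("spark.hadoop.fs.s3a.secret.key", "<your secret key>")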

Python file content:

from __future__ import print_function
import os

from pyspark import SparkConf
from pyspark import SparkContext

os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"


if __name__ == "__main__":

    conf = SparkConf().setAppName("read_s3").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    my_s3_file3 = sc.textFile("s3a://store-test-1/test-file")
    print("file count:", my_s3_file3.count())

Submit:

spark-submit --master local \
--packages org.apache.hadoop:hadoop-aws:2.7.3,\
com.amazonaws:aws-java-sdk:1.7.4,\
org.apache.hadoop:hadoop-common:2.7.3 \
<path to the py file above>
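
If you want to run the same script from an IDE rather than through spark-submit (as in the original question), one option is to pass the same package list through PYSPARK_SUBMIT_ARGS before creating the SparkContext. A sketch, with the versions taken from the command above:

import os

# Same --packages list as the spark-submit command above; must be set
# before the SparkContext is constructed.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.hadoop:hadoop-aws:2.7.3,"
    "com.amazonaws:aws-java-sdk:1.7.4,"
    "org.apache.hadoop:hadoop-common:2.7.3 pyspark-shell"
)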

Comments
  • Possible duplicate of How can I access S3/S3n from a local Hadoop 2.6 installation?
  • There is no local Hadoop installation in this case - just Spark installed in the virtualenv via pip.
  • Try to use s3a protocol: inputFile = sparkContext.textFile("s3a://somebucket/file.csv")
  • What's the hierarchy in the Project pane to get to that "hadoop-common-2.7.3"? (So that I can check and make sure I have the same one).
  • On my laptop the relative path in environment is venv/Lib/site-packages/pyspark/jars
  • Very close. I'm now getting "com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain." We don't support permanent awsKey/awsSecret pairs, so I need to figure out how to get this to work with either a sessionToken or com.amazonaws.auth.profile.ProfileCredentialsProvider, where I'll have a session token in my creds file. Any hints? Or want me to open another question on that one?
  • I think that is a new problem, so best sorted out on a new question. You post the link here and will try and help
  • Here's the new question - thanks for any help you can give! stackoverflow.com/questions/50242843/…
  • With the --packages in the question, this gets closer..... I now get "java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities". Seems like I'm still missing some dependencies.
  • If I also add "org.apache.hadoop:hadoop-common:3.1.0" to the packages, I get "java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()"....so probably just need to find the correct version of hadoop-common (I think)