PySpark in iPython notebook raises Py4JJavaError when using count() and first()

py4jjavaerror databricks
py4j.protocol.py4jjavaerror pyspark
pyspark org apache$spark sparkexception job aborted due to stage failure
py4jjavaerror import
py4j/protocol py4jjavaerror: an error occurred while calling o57 count
check pyspark version
pyspark rdd count error
catch py4jjavaerror

I am using PySpark(v.2.1.0) in iPython notebook (python v.3.6) over virtualenv in my Mac(Sierra 10.12.3 Beta).

1.I launched iPython notebook by shooting this in Terminal -

 PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" /Applications/spark-2.1.0-bin-hadoop2.7/bin/pyspark

2.Loaded my file to Spark Context and ensured its loaded-

>>>lines = sc.textFile("/Users/PanchusMac/Dropbox/Learn_py/Virtual_Env/pyspark/README.md") 

>>>for i in lines.collect(): 
    print(i)

And it worked fine and printed the result over my console as shown:

# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
MLlib for machine learning, GraphX for graph processing,
and Spark Streaming for stream processing.

<http://spark.apache.org/>


## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the [project web page](http://spark.apache.org/documentation.html).
This README file only contains basic setup instructions. 

Also checked the sc -

>>>print(sc)

<pyspark.context.SparkContext object at 0x101ce4cc0>
  1. Now when I am trying to execute lines.count() or lines.first() functions over my RDD, I ended up with following error -


    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-33-44aeefde846d> in <module>()
    ----> 1 lines.count()
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in count(self)
       1039         3
       1040         """
    -> 1041         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
       1042 
       1043     def stats(self):
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in sum(self)
       1030         6.0
       1031         """
    -> 1032         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
       1033 
       1034     def count(self):
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in fold(self, zeroValue, op)
        904         # zeroValue provided to each partition is unique from the one provided
        905         # to the final reduce call
    --> 906         vals = self.mapPartitions(func).collect()
        907         return reduce(op, vals, zeroValue)
        908 
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in collect(self)
        807         """
        808         with SCCallSiteSync(self.context) as css:
    --> 809             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
        810         return list(_load_from_socket(port, self._jrdd_deserializer))
        811 
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
       1131         answer = self.gateway_client.send_command(command)
       1132         return_value = get_return_value(
    -> 1133             answer, self.gateway_client, self.target_id, self.name)
       1134 
       1135         for temp_arg in temp_args:
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
         61     def deco(*a, **kw):
         62         try:
    ---> 63             return f(*a, **kw)
         64         except py4j.protocol.Py4JJavaError as e:
         65             s = e.java_exception.toString()
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        317                 raise Py4JJavaError(
        318                     "An error occurred while calling {0}{1}{2}.\n".
    --> 319                     format(target_id, ".", name), value)
        320             else:
        321                 raise Py4JError(
    
    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 22, localhost, executor driver): org.apache.spark.SparkException: 
    Error from python worker:
      Traceback (most recent call last):
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 183, in _run_module_as_main
          mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 109, in _get_module_details
          __import__(pkg_name)
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/__init__.py", line 44, in <module>
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 36, in <module>
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/java_gateway.py", line 25, in <module>
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/platform.py", line 886, in <module>
          "system node release version machine processor")
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 393, in namedtuple
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
    PYTHONPATH was:
      /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/jars/spark-core_2.11-2.1.0.jar:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/:
    java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:166)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    
    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.spark.SparkException: 
    Error from python worker:
      Traceback (most recent call last):
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 183, in _run_module_as_main
          mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 109, in _get_module_details
          __import__(pkg_name)
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/__init__.py", line 44, in <module>
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 36, in <module>
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/java_gateway.py", line 25, in <module>
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/platform.py", line 886, in <module>
          "system node release version machine processor")
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 393, in namedtuple
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
    PYTHONPATH was:
      /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/jars/spark-core_2.11-2.1.0.jar:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/:
    java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:166)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more
    

Could someone explain me where does it went wrong ? Note: When I performed the same operations in my Mac Terminal, they worked just as expected.

Pyspark 2.1.0 is not compatible with python 3.6, see https://issues.apache.org/jira/browse/SPARK-19019.

You need to use earlier python version or you can try building master or 2.1 branch from github and it should work.

PySpark, Use the Jupyter PySpark notebook Docker container: Basically, here are the first few lines of a standalone notebook: seed) 477 return [] 478 --> 479 initialCount = self.count() 480 if initialCount == 0: 481 return gateway_client, target_id, name) 317 raise Py4JJavaError( 318 "An error occurred while calling {​0}{1}{2}.\n". I am trying to use IPython notebook with Apache Spark 1.4.0. I have followed the 2 tutorial below to set my configuration. Installing Ipython notebook with pyspark 1.4 on AWS. and. Configuring IPython notebook support for Pyspark. After fisnish the configuration, following is several code in the related files: 1.ipython_notebook_config.py

Yeah I had the same problem long time ago in Pyspark in Anaconda I tried several ways to rectify this finally I found on my own by installing Java for anaconda separately afterwards there is no Py4jerror.

https://anaconda.org/cyclus/java-jdk

I am running simple count and I am getting an error, Py4JJavaError: An error occurred while calling o358.showString. count = Data.​map(lambda x: len(x)).distinct().collect(). print count. (1) Spark  So, I am trying to initialize SparkSession and SparkContext in python 3.6 using the following code: from pyspark.sql import SparkSession from pyspark import SparkContext #Create a Spark Session

If you are using Anaconda, try to install java-jdk for Anaconda:

conda install -c cyclus java-jdk

Unable to create array literal in spark/pyspark, PySpark in iPython notebook raises Py4JJavaError when using count() and first() · stackoverflow.com · 10 · Pyspark Dataframe Apply function to two columns. ipython notebook spark example spark,ipython notebook,Use IPython Notebook with Apache Spark,Configure IPython Notebook for PySpark harpreet varma 5,635 views. 7:08. Word Count using

pyspark.sql module, The entry point to programming Spark with the Dataset and DataFrame API. createDataFrame(l).collect() [Row(_1='Alice', _2=1)] >>> spark. To register a nondeterministic Python function, users need to first build a df.count() 2 this DataFrame; attempting to add a column from some other dataframe will raise an error. This is the first time I have been using it and have been playing around with it for 3 weeks. I am able to successfully run Hive queries. I also use Jupyter Notebook for data analysis and it is run from cloud. I see that I am able to use PySpark directly without having to download Spark in desktop.

How to Get Started with PySpark, PySpark is a Python API to using Spark, which is a parallel and such as the following when I tried to run collect() or count() in my Spark cluster: py4j.protocol.​Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python. You can install jupyter notebook using pip install jupyter notebook  Configure PySpark driver to use Jupyter Notebook: running pyspark will automatically open a Jupyter Notebook. Load a regular Jupyter Notebook and load PySpark using findSpark package. First option is quicker but specific to Jupyter Notebook, second option is a broader approach to get PySpark available in your favorite IDE.

Unable to launch pyspark shell [duplicate] - apache-spark - html, I am using PySpark(v.2.1.0) in iPython notebook (python v.3.6) over virtualenv in my to execute lines.count() or lines.first() functions over my RDD, I ended up with gateway_client, target_id, name) 317 raise Py4JJavaError( 318 "An error  I'm new to Spark and I'm using Pyspark 2.3.1 to read in a csv file into a dataframe. I'm able to read in the file and print values in a Jupyter notebook running within an anaconda environment. This is the code I'm using:

Comments
  • Thank you Mariusz. Issue sorted when I pointed the iPython kernel to Python 2.6.
  • i am surprised your answer got a -1. i tried all other solutions and non of them worked (it's prob my own fault) until i did what you suggested. thanks!
  • This answer should considered as the right solution, imo.
  • is this thing still maintained? "Last upload: 4 years and 6 months ago" as of end of 2019