I am trying to use pyspark to preprocess data for a prediction model. I get an error when I call spark.createDataFrame on the output of my preprocessing. Is there a way to check what processedRDD looks like before turning it into a DataFrame?

    import findspark
    import pyspark
    from pyspark.sql import SQLContext
    import os
    import pandas as pd
    import geohash2

    sc = pyspark.SparkContext('local', 'sentinel')
    spark = pyspark.SQLContext(sc)
    sql = SQLContext(sc)
    working_dir = os.getcwd()

    df = sql.createDataFrame(data)

    df =['starttime', 'latstart','lonstart', 'latfinish', 'lonfinish', 'trip_type']), False)
    processedRDD = df.rdd
    processedRDD = processedRDD \
                    .map(lambda row: (row, g, b, minutes_per_bin)) \
                    .map(data_cleaner) \
                    .filter(lambda row: row is not None)
    featuredDf = spark.createDataFrame(processedRDD, ['year', 'month', 'day', 'time_cat', 'time_num', 'time_cos', \
                                              'time_sin', 'day_cat', 'day_num', 'day_cos', 'day_sin', 'weekend', \
                                              'x_start', 'y_start', 'z_start','location_start', 'location_end', 'trip_type'])

I am getting this error:

[Stage 1:>                                                          (0 + 1) / 1]2019-10-24 15:37:56 ERROR Executor:91 - Exception in task 0.0 in stage 1.0 (TID 1)

raise AppRegistryNotReady("Apps aren't loaded yet.") django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.

I do not understand what this has to do with importing an app.

Basically, you need to load your settings and populate Django’s application registry before doing anything else. You have all the information required in the Django docs.

I don't know what this script has to do with Django exactly, but adding the following lines at the top of the script will probably fix this issue:

    import os
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
    import django
    django.setup()  # populate Django's application registry

Instead of manually running Hadoop, I am building a Python server that imports pyspark and runs heavy AI algorithms about 10 times faster on a Django server. The problem I had came from SPARK_LOCAL_IP: a different IP was being used (the one I use to connect to a remote database via sshtunnel). I had to rename a file and add the correct IP.

 cd /usr/local/spark/conf
 mv -i
 paste: SPARK_LOCAL_IP=""
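
As a possible alternative to editing the conf file, the same variable can probably be set from Python before the SparkContext is created, since the JVM that pyspark launches inherits the process environment. This is only a sketch: the helper name set_spark_local_ip is made up for illustration, and 127.0.0.1 is a placeholder, not the address from my setup.

```python
import ipaddress
import os


def set_spark_local_ip(address):
    """Validate the address and export it as SPARK_LOCAL_IP for this process."""
    ipaddress.ip_address(address)  # raises ValueError if this is not a valid IP
    os.environ["SPARK_LOCAL_IP"] = address
    return address


# Placeholder address (assumption); use the interface Spark should bind to.
set_spark_local_ip("127.0.0.1")

# Create the SparkContext only after this point, for example:
# sc = pyspark.SparkContext('local', 'sentinel')
```

Setting the variable after the context already exists would have no effect on that context, which is why the call comes first.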

Then I had to add sc.setLogLevel("ERROR") to my script to see what the real problem was. Debugging Java from Python can be problematic sometimes. One column was a datetime instead of a string, and I fixed it.
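
To answer the original question about looking at processedRDD before building the DataFrame: one option is to pull a few rows with processedRDD.take(n) and check the Python type at each position. The sketch below uses a hand-written list as a stand-in for the rows that take would return, so it runs without Spark. The helper names field_types and stringify_datetimes and the sample values are made up for illustration, not taken from my actual data_cleaner.

```python
from collections import Counter
from datetime import date, datetime


def field_types(rows):
    """Count the Python type names seen at each tuple position."""
    counts = {}
    for row in rows:
        for i, value in enumerate(row):
            counts.setdefault(i, Counter())[type(value).__name__] += 1
    return counts


def stringify_datetimes(row):
    """Return a copy of the row with date/datetime values as ISO strings."""
    return tuple(
        v.isoformat() if isinstance(v, (date, datetime)) else v
        for v in row
    )


# Stand-in for processedRDD.take(2); the values are invented for the example.
sample = [
    (2019, 10, datetime(2019, 10, 24, 15, 37, 56), "a"),
    (2019, 10, "2019-10-24T15:40:00", "b"),
]

print(field_types(sample))   # position 2 mixes datetime and str
cleaned = [stringify_datetimes(r) for r in sample]
print(field_types(cleaned))  # position 2 is now str only
```

With Spark running, the same helpers could be applied as field_types(processedRDD.take(5)) to inspect the rows and processedRDD.map(stringify_datetimes) to convert them before calling createDataFrame. A position that reports more than one type is a likely place for schema inference to fail.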

  • I don't even see where django is referred to in this script... What is findspark?
  • It is something like a handle for using Spark in my Django server, so calculations can be done 10 times faster and cheaper.