Hot questions for using neural networks in Apache Spark MLlib


I have the following model that I would like to estimate using SparkML MultilayerPerceptronClassifier().

val formula = new RFormula()
  .setFormula("vtplus15predict~ vhisttplus15 + vhistt + vt + vtminus15 + Time + Length + Day")

Note: features is a vector and label is a Double.

 |-- features: vector (nullable = true)
 |-- label: double (nullable = false)

I define my MLP estimator as follows:

val layers = Array[Int](6, 5, 8, 1) //I suspect this is where it went wrong

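val mlp = new MultilayerPerceptronClassifier()
  .setLayers(layers)

// train the model (trainingData is an assumed name; the original call was lost)
val model = mlp.fit(trainingData)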

Unfortunately, I got the following error:

Using Spark's default log4j profile: org/apache/spark/

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 11
    at$.encodeLabeledPoint(MultilayerPerceptronClassifier.scala:121)
    at$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
    at$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
    at scala.collection.Iterator$$anon$
    at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:935)
    at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:950)
    ...


This tells us that an array index is out of bounds in MultilayerPerceptronClassifier.scala. Let's look at the code there:

def encodeLabeledPoint(labeledPoint: LabeledPoint, labelCount: Int): (Vector, Vector) = {
  val output = Array.fill(labelCount)(0.0)
  output(labeledPoint.label.toInt) = 1.0
  (labeledPoint.features, Vectors.dense(output))
}

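It performs a one-hot encoding of the labels in the dataset. The ArrayIndexOutOfBoundsException occurs because the output array is too short: it has labelCount entries, but the label value (11 in the error above) is used as an index into it.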
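The failure can be reproduced without Spark. The following is a minimal plain-Java sketch of the same one-hot step, not Spark's actual code; the class and method names (OneHotSketch, encodeLabel) are made up for illustration.

```java
import java.util.Arrays;

// Plain-Java stand-in for the one-hot step shown above; not Spark's class.
final class OneHotSketch {

    // Builds a vector of length labelCount with 1.0 at index (int) label.
    static double[] encodeLabel(double label, int labelCount) {
        double[] output = new double[labelCount]; // all zeros
        output[(int) label] = 1.0;                // throws if the index is outside 0..labelCount-1
        return output;
    }

    public static void main(String[] args) {
        // Three output nodes and label 2.0: the index fits.
        System.out.println(Arrays.toString(encodeLabel(2.0, 3)));

        // One output node (as in layers 6, 5, 8, 1) and label 11.0: index 11 does not fit.
        try {
            encodeLabel(11.0, 1);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

With a single output node, any label other than 0 is out of range, which matches the exception in the question.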

Tracing back through the code shows that labelCount is the same as the number of output nodes in the layers array. In other words, the number of output nodes should be the same as the number of classes. The documentation for MLP contains the following line:

The number of nodes N in the output layer corresponds to the number of classes.

The solution is therefore to either:

  1. Change the number of nodes in the final layer of the network (output nodes) to match the number of classes

  2. Reconstruct the data to have the same number of classes as your network output nodes.

Note: The final output layer should always be 2 or more, never 1, since there should be one node per class and a problem with a single class does not make sense.
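As a rough illustration of the first option, the sketch below derives the required number of output nodes from the labels, assuming they are zero-based class indices as the one-hot code expects. It is plain Java with made-up names (OutputLayerSketch, requiredOutputNodes, withOutputNodes) and made-up sample labels, not part of Spark.

```java
import java.util.Arrays;

// Hypothetical helper: sizes the output layer from the labels actually present.
final class OutputLayerSketch {

    // Labels are used as zero-based indices, so the output layer needs max(label) + 1 nodes.
    static int requiredOutputNodes(double[] labels) {
        int maxIndex = -1;
        for (double label : labels) {
            if (label < 0 || label != Math.floor(label)) {
                throw new IllegalArgumentException("not a class index: " + label);
            }
            maxIndex = Math.max(maxIndex, (int) label);
        }
        return maxIndex + 1;
    }

    // Returns a copy of layers with the last entry replaced by outputNodes.
    static int[] withOutputNodes(int[] layers, int outputNodes) {
        int[] fixed = layers.clone();
        fixed[fixed.length - 1] = outputNodes;
        return fixed;
    }

    public static void main(String[] args) {
        double[] labels = {0.0, 3.0, 11.0, 7.0};           // sample values for illustration only
        int outputNodes = requiredOutputNodes(labels);      // 12
        int[] layers = withOutputNodes(new int[] {6, 5, 8, 1}, outputNodes);
        System.out.println(Arrays.toString(layers));        // [6, 5, 8, 12]
    }
}
```

The helper rejects fractional or negative labels, since those cannot serve as class indices.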


I'm trying to decide on the best architecture for a multilayerPerceptron in Apache Spark and am wondering whether I can use cross-validation for that.

Some code:

// define layers
int[] layers = new int[] {784, 78, 35, 10};
int[] layers2 = new int[] {784, 28, 28, 10};
int[] layers3 = new int[] {784, 84, 10};
int[] layers4 = new int[] {784, 392, 171, 78, 10};

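MultilayerPerceptronClassifier mlp = new MultilayerPerceptronClassifier()
        .setLayers(layers); // swapped by hand between runs

ParamMap[] paramGrid = new ParamGridBuilder()
        .addGrid(mlp.seed(), new long[] {895L, 12345L})
        //.addGrid(mlp.layers(), new int[][] {layers, layers2, layers3})
        .build();

CrossValidator cv = new CrossValidator()
        .setEstimator(mlp)
        .setEvaluator(new MulticlassClassificationEvaluator())
        .setEstimatorParamMaps(paramGrid);

// train is an assumed name for the training data; the original call was lost
CrossValidatorModel model = cv.fit(train);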

As you can see, I've defined some architectures in integer arrays (layers through layers4).

As is, I have to fit the model multiple times, manually changing the layers parameter for the learning algorithm.

What I want is to provide the different architectures in a ParamMap that I pass to a CrossValidator (the commented out line in the ParamMap).

I suspect this is possible, since the layers() method seems to be known to the ParamGridBuilder, but it doesn't accept the provided arguments.

If I am correct in this assumption, what am I doing wrong and how can I get this to work as intended?


The code looks syntactically correct. That it doesn't work may be a bug, or it may be intended, since it would be rather expensive computationally. So I guess the answer is no, you can't use cross-validation for that.

I ended up using the following formula:

Number of units in hidden-layer = ceil((Number of inputs + outputs) * (2/3))
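As a worked example of this rule of thumb, here is a small plain-Java sketch applied to the 784 inputs and 10 outputs from the question. The class and method names are made up for illustration.

```java
// Hypothetical helper for the rule of thumb quoted above.
final class HiddenUnitsSketch {

    // ceil((inputs + outputs) * 2/3), computed in floating point.
    // Writing 2 / 3 with int literals would evaluate to 0 in Java.
    static int hiddenUnits(int inputs, int outputs) {
        return (int) Math.ceil(2.0 * (inputs + outputs) / 3.0);
    }

    public static void main(String[] args) {
        // (784 + 10) * 2/3 = 529.33..., rounded up to 530
        System.out.println(hiddenUnits(784, 10)); // 530
    }
}
```

This would give a single hidden layer, for example {784, 530, 10}; the rule says nothing about how many hidden layers to use.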



Looking at this code:

for (i <- (L - 2) to (0, -1)) {
    layerModels(i + 1).computePrevDelta(deltas(i + 1), outputs(i + 1), deltas(i))
}

I want to understand why we pass outputs(i + 1) instead of outputs(i) in the snippet above. As far as I understand, the output is only needed for a sigmoid activation layer, whose derivative is f'(x) = f(x) * (1 - f(x)) = outputs(i) * (1 - outputs(i)).

Which means in order to find prevDelta we should be using outputs(i).


I figured out why. I'm answering here in case someone like me stumbles on this by chance.

Note that we are calculating the delta for layer i, which depends only on the delta and gradient of the next, (i+1)-th, layer. Note also that the call is made on layerModels(i + 1), not layerModels(i), so the output it needs is that layer's own output, outputs(i + 1).
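To make the indexing concrete, the sketch below shows a single sigmoid layer in plain Java. It is not Spark's implementation; it only assumes, as in the question, that the sigmoid derivative is written in terms of the layer's output, and all names are made up. The backward step of a layer consumes that layer's own output, which is why the call on layer i + 1 receives outputs(i + 1).

```java
import java.util.Arrays;

// Illustrative sigmoid layer; names and structure are assumptions, not Spark's code.
final class SigmoidDeltaSketch {

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Forward pass: this layer's own output y = sigmoid(input).
    static double[] forward(double[] input) {
        double[] output = new double[input.length];
        for (int k = 0; k < input.length; k++) {
            output[k] = sigmoid(input[k]);
        }
        return output;
    }

    // Backward pass: delta with respect to this layer's input.
    // f'(x) = f(x) * (1 - f(x)), and f(x) is this layer's own output.
    static double[] prevDelta(double[] nextDelta, double[] ownOutput) {
        if (nextDelta.length != ownOutput.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double[] delta = new double[ownOutput.length];
        for (int k = 0; k < ownOutput.length; k++) {
            delta[k] = nextDelta[k] * ownOutput[k] * (1.0 - ownOutput[k]);
        }
        return delta;
    }

    public static void main(String[] args) {
        double[] input = {0.3, -1.2};
        double[] output = forward(input);       // plays the role of outputs(i + 1)
        double[] nextDelta = {1.0, 1.0};        // plays the role of deltas(i + 1)
        System.out.println(Arrays.toString(prevDelta(nextDelta, output))); // role of deltas(i)
    }
}
```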