DBSCAN on Spark: which implementation?

I would like to run DBSCAN on Spark. I have currently found two implementations:

  • DBSCAN on Spark: https://github.com/irvingc/dbscan-on-spark

  • Spark DBSCAN: https://github.com/alitouka/spark_dbscan

I have tested the first one with the sbt configuration given in its GitHub readme, but:

  • the functions in the jar are not the same as those in the docs or in the source on GitHub. For example, I cannot find the train function in the jar.

  • I managed to run a test with the fit function (found in the jar), but a bad configuration of epsilon (slightly too big) put the code into an infinite loop.

Code:

val model = DBSCAN.fit(eps, minPoints, values, parallelism)

Has someone managed to do something with the first library?

Has someone tested the second one?

Please try ELKI. As this is Java, it should be easy to call from Scala.

ELKI is very well optimized, and with indexes it will scale to quite large data sets.

We tried to include one of these Spark implementations in our benchmarking study, but it ran out of memory, and it was the only implementation to do so (the k-means implementations of Spark and Mahout were also among the slowest):

Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek. The (black) art of runtime evaluation: Are we comparing algorithms or implementations? Knowledge and Information Systems (KAIS), 2016, pp. 1–38.

Professor Neukirchen benchmarked parallel implementations of DBSCAN in this technical report:

Helmut Neukirchen. Survey and Performance Evaluation of DBSCAN Spatial Clustering Implementations for Big Data and High-Performance Computing Paradigms.

Apparently he got some of the Spark implementations working, but noted that:

The result is devastating: none of the implementations for Apache Spark is anywhere near to the HPC implementations. In particular on bigger (but still rather small) data sets, most of them fail completely and do not even deliver correct results.

and earlier:

When running any of the "Spark DBSCAN" implementations while making use of all available cores of our cluster, we experienced out-of-memory exceptions.

(also, "Spark DBSCAN" took 2406 seconds on 928 cores, ELKI took 997 seconds on 1 core for the smaller benchmark - the other Spark implementation didn't fare too well either, in particular it did not return the correct result...)

"DBSCAN on Spark" did not crash, but returned completely wrong clusters.

While "DBSCAN on Spark" finishes faster, it delivered completely wrong clustering results. Due to the hopelessly long run-times of the DBSCAN implementations for Spark already with the maximum number of cores, we did not perform measurements with a lower number of cores.

You can wrap a double[][] array as an ELKI database:

// Required imports (package names as in ELKI 0.7.x):
import de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN;
import de.lmu.ifi.dbs.elki.data.Cluster;
import de.lmu.ifi.dbs.elki.data.Clustering;
import de.lmu.ifi.dbs.elki.data.NumberVector;
import de.lmu.ifi.dbs.elki.data.model.Model;
import de.lmu.ifi.dbs.elki.database.Database;
import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase;
import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection;
import de.lmu.ifi.dbs.elki.datasource.DatabaseConnection;
import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.SquaredEuclideanDistanceFunction;

// Adapter to load data from an existing array.
DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(data);
// Create a database (which may contain multiple relations!)
Database db = new StaticArrayDatabase(dbc, null);
// Load the data into the database (do NOT forget to initialize...)
db.initialize();

// Squared Euclidean is faster than Euclidean, so square eps as well.
Clustering<Model> c = new DBSCAN<NumberVector>(
    SquaredEuclideanDistanceFunction.STATIC, eps * eps, minpts).run(db);

// getAllClusters() yields Cluster<Model>, not Cluster<KMeansModel>.
for(Cluster<Model> clu : c.getAllClusters()) {
  // Process each cluster here.
}

See also: the ELKI Java API example (in particular, how to map DBIDs back to row indexes). For better performance, pass an index factory (such as new CoverTree.Factory(...)) as the second parameter to the StaticArrayDatabase constructor.
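
A rough sketch of what that index setup could look like (in Scala, since the answer suggests calling ELKI from Scala; the CoverTree.Factory constructor arguments below, expansion rate 1.3 and truncation threshold 10, are assumed defaults, so verify them against the Javadoc of your ELKI version):

import java.util.Collections
import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase
import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection
import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.SquaredEuclideanDistanceFunction
import de.lmu.ifi.dbs.elki.index.tree.metrical.covertree.CoverTree

// Cover tree factory over the same distance function DBSCAN will use,
// so its range queries hit the index instead of doing a linear scan.
val coverTree = new CoverTree.Factory(SquaredEuclideanDistanceFunction.STATIC, 1.3, 10)
val dbc = new ArrayAdapterDatabaseConnection(data)
// Pass the index factory instead of null as the second constructor argument.
val db = new StaticArrayDatabase(dbc, Collections.singletonList(coverTree))
db.initialize()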


I successfully use the second library (https://github.com/alitouka/spark_dbscan) in my project. However, I could not use it via the published artifact:

libraryDependencies += "org.alitouka" % "spark_dbscan_2.10" % "0.0.4"

resolvers += "Aliaksei Litouka's repository" at "http://alitouka-public.s3-website-us-east-1.amazonaws.com/"

Instead, I downloaded the code and updated it to Spark 2.2.1. Some of the dependencies also had to be added (see the sketch below). Finally, I added the code to my project, and it works!
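
A hypothetical build.sbt fragment for such a fork (the Scala and Spark versions below are illustrative assumptions, not the author's actual build):

// Spark 2.2.x builds against Scala 2.11.
scalaVersion := "2.11.12"

// Spark itself is "provided" by the cluster at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.2.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.2.1" % "provided"
)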


I tested https://github.com/irvingc/dbscan-on-spark and can say that it consumes a lot of memory. For a 400K point dataset with a smooth distribution I used -Xmx12084m, and even then it ran for too long (>20 min). In addition, it only supports 2D data. I used the project with Maven, not sbt.

I also tested the second implementation. It is still the best that I have found. Unfortunately, the author has not maintained it since 2015. It really took some time to raise the Spark version and resolve the version conflicts; I needed it to deploy on AWS.


You can also consider using Smile, which provides an implementation of DBSCAN. The most direct way is to use groupBy combined with either mapGroups or flatMapGroups and run DBSCAN inside each group. Here's an example:

  import smile.clustering._

  val dataset: Array[Array[Double]] = Array(
    Array(100, 100),
    Array(101, 100),
    Array(100, 101),
    Array(100, 100),
    Array(101, 100),
    Array(100, 101),

    Array(0, 0),
    Array(1, 0),
    Array(1, 2),
    Array(1, 1)
  )

  val dbscanResult = dbscan(dataset, minPts = 3, radius = 5)
  println(dbscanResult)

  // Output:
  // DBSCAN clusters of 10 data points:
  // 0         6 (60.0%)
  // 1         4 (40.0%)
  // Noise     0 ( 0.0%)

You can also write a User Defined Aggregate Function (UDAF) if you need to eke out more performance.

I use this approach at work to cluster time-series data: grouping with Spark's time window function and then executing DBSCAN within each window allows us to parallelize the implementation.

I was inspired to do this by the O'Reilly article "Clustering geolocated data using Spark and DBSCAN". A sketch of the grouping approach follows.
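
Here is a rough sketch of that pattern (the Event case class, the grouping key, the input path, and the DBSCAN parameters are illustrative assumptions; note also that the name of Smile's label field has changed between versions):

  import org.apache.spark.sql.SparkSession
  import smile.clustering.dbscan

  // Hypothetical input record: a 2D point with a grouping key.
  case class Event(key: String, x: Double, y: Double)

  val spark = SparkSession.builder().appName("dbscan-per-group").getOrCreate()
  import spark.implicits._

  val events = spark.read.parquet("events.parquet").as[Event]

  // Groups are clustered independently and in parallel across the executors;
  // Smile runs single-threaded within each group.
  val labeled = events
    .groupByKey(_.key)
    .flatMapGroups { (key, rows) =>
      val pts = rows.toArray
      val model = dbscan(pts.map(e => Array(e.x, e.y)), minPts = 3, radius = 5.0)
      // model.y(i) is the cluster label of point i in Smile 2.x
      // (noise points get the label PartitionClustering.OUTLIER).
      pts.zip(model.y).toSeq.map { case (e, label) => (key, e.x, e.y, label) }
    }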


Comments
  • Use ELKI with the cover tree index. It's much faster, despite being single-node only. I tried one of the Spark versions and it went out of memory, but ELKI still worked fine and was fast.
  • Another incomplete implementation of DBSCAN on Spark is github.com/mraad/dbscan-spark
  • Can you put your code for running the first package here? I tested the one-line code and it does not work for me. I use IntelliJ, but the package cannot be automatically downloaded using sbt, so I manually downloaded the jar file from here: dl.bintray.com/irvingc/maven/com/irvingc/spark/dbscan_2.10/…
  • For this example, could you show what exactly the new CoverTree.Factory(...) call looks like?
  • @PaulZWu You can find a full example with index in the regular ELKI tutorial sources: github.com/elki-project/elki/blob/master/addons/tutorial/src/…
  • Thank you. I figured it out about a year ago. Your work is awesome -- we are using this package for an interesting project and its performance is impressive (with the indexing).
  • Some imports need to be updated; for example, change "import org.apache.spark.Logging" to "import org.apache.spark.internal.Logging".
  • Do you have a link to a build file that you used for Smile?