Spark - sort by value with a JavaPairRDD
I am working with Apache Spark using Java. I have a JavaPairRDD<String,Long> and I want to sort this dataset by its value. However, it seems there is only a sortByKey method in it. How could I sort it by value?
'Secondary sort' is not supported by Spark yet (See SPARK-3655 for details).
As a workaround, you can sort by value by swapping key <-> value and sorting by key as usual.
In Scala, it would be something like:
val kv: RDD[(String, Long)] = ???
// swap key and value
val vk = kv.map(_.swap)
val vkSorted = vk.sortByKey()
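In Java, the same workaround (swap, sort by key in descending order, swap back, then take the first 100) looks like this:
dataset.mapToPair(x -> x.swap()).sortByKey(false).mapToPair(x -> x.swap()).take(100)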
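To make the swap, sort, swap-back steps easier to follow, here is a minimal plain-Java sketch that mirrors the chain without needing a Spark cluster. It is only an illustration under stated assumptions: Map.Entry stands in for Tuple2, a List stands in for the RDD, and the names SwapSortSketch, swap and topByValue are hypothetical helpers that are not part of Spark.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java stand-in for the Spark chain: Map.Entry plays the role of Tuple2,
// and a List plays the role of the JavaPairRDD. Not Spark code.
class SwapSortSketch {

    // Hypothetical helper: swaps key and value, analogous to Tuple2.swap().
    static <K, V> Map.Entry<V, K> swap(Map.Entry<K, V> e) {
        return new SimpleEntry<>(e.getValue(), e.getKey());
    }

    // Swap, sort by the new key in descending order, swap back, keep the first n.
    static List<Map.Entry<String, Long>> topByValue(List<Map.Entry<String, Long>> pairs, int n) {
        return pairs.stream()
                .map(SwapSortSketch::swap)                                    // (value, key)
                .sorted(Map.Entry.<Long, String>comparingByKey().reversed())  // analogous to sortByKey(false)
                .map(SwapSortSketch::swap)                                    // back to (key, value)
                .limit(n)                                                     // analogous to take(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> counts = new ArrayList<>();
        counts.add(new SimpleEntry<>("spark", 3L));
        counts.add(new SimpleEntry<>("java", 7L));
        counts.add(new SimpleEntry<>("rdd", 5L));
        System.out.println(topByValue(counts, 2));
    }
}
```

The second swap matters only if the result should keep the original (key, value) shape. If (value, key) pairs are acceptable downstream, it can be dropped.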
The JavaPairRDD class has a sortByKey() method, but there is no sortByValue() method. An RDD of key/value pairs can be sorted provided that an ordering is defined on the key. Once the data is sorted, any subsequent call to collect() or save() on it returns ordered data.
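I did this using a List, which now has a sort(Comparator c) method:
List<Tuple2<String,Long>> touples = new ArrayList<>();
// assumption: rdd is the JavaPairRDD<String, Long> to sort; copy its collected contents into the list
touples.addAll(rdd.collect());
// sort by value (_2), highest first
touples.sort((Tuple2<String, Long> o1, Tuple2<String, Long> o2) -> o2._2.compareTo(o1._2));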
It is longer than @Atul's solution, and I don't know whether it performs better. On an RDD with 500 items it shows no difference; I wonder how it would behave on an RDD with a million records.
You can also use Collections.sort and pass in the list provided by collect and the lambda-based comparator.
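As a rough sketch of the Collections.sort variant, the snippet below sorts an already collected list by value in descending order. It uses only the JDK so it can run on its own: Map.Entry stands in for Tuple2, the collected parameter stands in for whatever collect() returned, and DriverSideSortSketch and sortByValueDescending are hypothetical names. The list is copied first in case the original is not modifiable.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in: Map.Entry plays the role of Tuple2, and the input list
// plays the role of the result of collect(). Not Spark code.
class DriverSideSortSketch {

    // Returns a new list sorted by value, highest first; the input is left untouched.
    static List<Map.Entry<String, Long>> sortByValueDescending(List<Map.Entry<String, Long>> collected) {
        List<Map.Entry<String, Long>> copy = new ArrayList<>(collected); // defensive, mutable copy
        Collections.sort(copy, (a, b) -> b.getValue().compareTo(a.getValue()));
        return copy;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> collected = new ArrayList<>();
        collected.add(new SimpleEntry<>("a", 2L));
        collected.add(new SimpleEntry<>("b", 9L));
        collected.add(new SimpleEntry<>("c", 4L));
        System.out.println(sortByValueDescending(collected));
    }
}
```

Keep in mind that collect() brings the whole dataset to the driver, so this approach is only reasonable when the data fits in the driver's memory. For larger RDDs, the swap and sortByKey approach keeps the sort distributed.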
Spark provides special operations on RDDs that contain key/value pairs (pair RDDs). Pair RDDs are a useful building block in many programs, as they expose operations that let us act on each key in parallel or regroup data across the network.
To sort by value, we have to reverse our tuples so that values become keys. Since a JavaPairRDD does not impose unique keys, we can have redundant values. We reverse tuples with mapToPair():
.mapToPair(t -> new Tuple2<Long, String>(t._2, t._1))
We can then sort the RDD in descending order (highest values first) and save the first 10.