Get all records from the nth bucket in Hive SQL

How do I get all records from the nth bucket of a Hive table? For example, something like:

Select * from bucketTable from bucket 9;

You can achieve this in different ways:

Approach-1: Get the table's storage location from desc formatted <db>.<tab_name>, then read the 9th bucket file directly from the HDFS filesystem.

(or)

Approach-2: Add a file-name column with input_file_name(), then keep only the 9th bucket's rows by filtering on the file name.

Bucket files are numbered from zero, so the 9th bucket is the file whose name starts with 000008_0.

Example:

Approach-1:

Scala:

val df = spark.sql("desc formatted <db>.<tab_name>")

// Get the table's location (an HDFS path) from the "Location" row
val loc_hdfs = df.filter('col_name === "Location").select("data_type").collect.map(x => x(0)).mkString

// Change the read format to match your table's storage format (ORC here)
val ninth_buk = spark.read.orc(s"${loc_hdfs}/000008_0*")

// Display the data
ninth_buk.show()

PySpark:

from pyspark.sql.functions import col

df = spark.sql("desc formatted <db>.<tab_name>")

# Get the table's location (an HDFS path) from the "Location" row
loc_hdfs = df.filter(col("col_name") == "Location").select("data_type").collect()[0]["data_type"]

# Change the read format to match your table's storage format (ORC here)
ninth_buk = spark.read.orc(loc_hdfs + "/000008_0*")

ninth_buk.show()
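
The examples above hard-code 000008_0 for the ninth bucket. The same pattern can be derived for any n with a small helper. This is only a sketch: it assumes the zero-based, six-digit file naming seen above (000008_0 for bucket 9), and the names bucket_file_prefix and bucket_glob are hypothetical, not part of Hive or Spark.

```python
def bucket_file_prefix(n: int) -> str:
    """Return the file-name prefix of the n-th bucket (n counted from 1).

    Assumes bucket files are numbered from zero and zero-padded to six
    digits, so bucket 9 maps to '000008_0'.
    """
    if n < 1:
        raise ValueError("bucket numbers start at 1")
    return f"{n - 1:06d}_0"


def bucket_glob(table_location: str, n: int) -> str:
    """Build the glob to pass to spark.read.<format>() for the n-th bucket."""
    return f"{table_location.rstrip('/')}/{bucket_file_prefix(n)}*"
```

With these helpers, the read in Approach-1 could become spark.read.orc(bucket_glob(loc_hdfs, 9)).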

Approach-2:

Scala:

import org.apache.spark.sql.functions.input_file_name

val df = spark.read.table("<db>.<tab_name>")

// Add the source file name of each row as a column
val df1 = df.withColumn("filename", input_file_name())

// Keep only rows from the 9th bucket file, then select only the original columns
val ninth_buk = df1.filter('filename.contains("000008_0")).select(df.columns.head, df.columns.tail:_*)

ninth_buk.show()

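PySpark:

from pyspark.sql.functions import col, input_file_name

df = spark.read.table("<db>.<tab_name>")

# Add the source file name of each row as a column
df1 = df.withColumn("filename", input_file_name())

# Keep only rows from the 9th bucket file, then select only the original columns
ninth_buk = df1.filter(col("filename").contains("000008_0")).select(*df.columns)

ninth_buk.show()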

Approach-2 is not recommended for large tables, because it has to scan and filter the whole DataFrame.


In Hive:

-- Treat backquoted names as regular expressions, so `(fn)?+.+` selects every column except fn
set hive.support.quoted.identifiers=none;
select `(fn)?+.+` from (
    select *, input__file__name fn from table_name) e
where e.fn like '%000008_0%';
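
One caveat with matching on the file name: in a LIKE pattern the underscore is itself a single-character wildcard, and both LIKE and contains() look at the whole path, so a directory name containing the same digits would also match. A stricter check compares only the base name of the file. The sketch below shows that logic in plain Python; is_bucket_file is a hypothetical helper, and the six-digit naming is assumed from the examples above.

```python
import posixpath


def is_bucket_file(path: str, n: int) -> bool:
    """Return True if the file at `path` belongs to the n-th bucket (n from 1).

    Only the last path component is inspected, so matching digits in a
    parent directory name are ignored.
    """
    prefix = f"{n - 1:06d}_0"  # assumed naming: bucket 9 -> '000008_0'
    return posixpath.basename(path).startswith(prefix)
```

The same idea could be applied to the filename column, for example by testing only the part of the path after the last slash.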

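If it is an ORC table, Spark SQL can query the bucket file directly by its path (note the backticks around the path):

SELECT * FROM orc.`<bucket_HDFS_path>`;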

select * from bucketing_table tablesample(bucket n out of y on clustered_criteria_column);

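where bucketing_table is the name of your bucketed table, clustered_criteria_column is the column the table is clustered (bucketed) on, and

n => the bucket to read (numbered from 1)
y => total number of buckets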
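
As a rough illustration, the query can be generated for any bucket with a small helper that also checks that n lies between 1 and y. This is a sketch only: tablesample_query is a hypothetical name, and the table and column names in the usage line are placeholders.

```python
def tablesample_query(table: str, n: int, total_buckets: int, column: str) -> str:
    """Build a Hive TABLESAMPLE query that reads bucket n out of total_buckets."""
    if not 1 <= n <= total_buckets:
        raise ValueError("n must be between 1 and the total number of buckets")
    return (
        f"SELECT * FROM {table} "
        f"TABLESAMPLE(BUCKET {n} OUT OF {total_buckets} ON {column})"
    )


# Example: 9th bucket of a table assumed to be bucketed into 32 buckets on id
query = tablesample_query("bucketing_table", 9, 32, "id")
```

The resulting string can then be run from the Hive CLI, Beeline, or spark.sql(query).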

Comments
  • Will this approach hold good if the table is stored as: df.write.bucketBy(50, "id").saveAsTable("myHiveTable")?
  • @baidyas, I think when we write data from spark -> hive bucketed table there are still some issues (issues.apache.org/jira/browse/SPARK-17729), it's better to write data using hive job to bucketed tables :)