Can I cluster by/bucket a table created via "CREATE TABLE AS SELECT....." in Hive?

how hive distributes the rows into buckets
how to decide number of buckets in hive
hive bucketing
how to query bucketed table in hive
how to choose bucketing column in hive
hive bucketing multiple columns
bucket table
hive 3 bucketing

I am trying to create a table in Hive

CREATE TABLE BUCKET_TABLE AS 
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll
CLUSTERED BY (key) INTO 1000 BUCKETS;

This syntax is failing - but I am not sure if it is even possible to do this combined statement. Any ideas?

Came across this question and saw there was no answer provided. I looked further and found the answer in the Hive documentation.

This will never work, because of the following restrictions on CTAS:

  1. The target table cannot be a partitioned table.
  2. The target table cannot be an external table.
  3. The target table cannot be a list bucketing table.

Source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect%28CTAS

Furthermore https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name ... [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] ... [AS select_statement];

Clustering requires the column to be defined and then the cfg goes to the As select_statement Therefore at this time it is not possible.

Optionally, you can ALTER the table and add buckets, but this does not change existing data.

CREATE TABLE BUCKET_TABLE 
STORED AS ORC AS 
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll limit 0;
ALTER TABLE BUCKET_TABLE CLUSTERED BY (key) INTO 1000 BUCKETS;
ALTER TABLE BUCKET_TABLE SET TBLPROPERTIES ('transactional'='true');
INSERT INTO BUCKET_TABLE 
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll;

What is Bucketing and Clustering in Hive?, Bucketing and Clustering is the process in Hive, to decompose table data sets into No. of buckets is mentioned while creating bucket table. Bucketed tables will create almost equally distributed data file parts. Unlike in partitioning, where tables are divided into partitions via creating a directory for� We use CLUSTERED BY clause to divide the table into buckets. Physically, each bucket is just a file in the table directory, and Bucket numbering is 1-based. Bucketing can be done along with Partitioning on Hive tables and even without partitioning. Bucketed tables will create almost equally distributed data file parts. Advantages

Looks like it is impossible. When trying to submit such query to ambari:

CREATE TABLE ready_requests 
CLUSTERED BY (device) INTO 64 BUCKETS
as
SELECT ...;

I got something like: "CREATE-TABLE-AS-SELECT does not support partitioning in the target table" Even it is not correct message, looks like the bucketing is also not supported

There is documentation that partitions are not supported, please see link, for buckets there is not such information, but looks like we have the same problem

CLUSTER BY and CLUSTERED BY in Spark SQL, Lets load data using CLUSTER BY in this table. DDL: DROP TABLE IF Each file created will be mapped to each bucket 1:1. 2. CLUSTER BY� ) PARTITIONED BY (col4 date) CLUSTERED BY (col1) INTO 32 BUCKETS STORED AS TEXTFILE; You can create buckets on only one column, you cannot specify more than one column. You can create Hive buckets on Hive managed tables or hive external tables based on your requirement and type of select statement that you run on those tables. The high

Bucketing in Hive : Querying from a particular bucket, Buckets can be created on a table even without the table being in the CLUSTERED BY clause, TABLESAMPLE will only scan through the� Unlike in partitioning, where tables are divided into partitions via creating a directory for each partition, whereas, in Bucketing, buckets are divided as files. In partitioning, tables are not equally partitioned, whereas in Bucketing, buckets are almost equally divided, and the no. of buckets can be specified while creation of bucketed table.

You have to use a non-null column in your clustered by clause.

Bucketing in Hive | What is Hive Bucketing?, Here, CLUSTERED BY clause is used to divide the table into buckets. In Hive But in Hive Buckets, each bucket will be created as file. the data into more manageable parts or equal parts using Bucketing technique in Hive. Now we will create a custom cluster table and will try to INSERT/UPDATE a data cluster into same as well as try to perform DELETE operation and see how this all works. 1. First, we will create a cluster table that will hold our data clusters, I have created our custom table with same field and technical properties as PCL4 cluster table.

HIVE TABLE USING PARTITION BUCKETING, Hive Partition can be further subdivided into Clusters or Buckets. Hive Buckets is Step-3: Create a table in hive with partition and bucketing. create table� So, what can go wrong? As long as you use the syntax above and set hive.enforce.bucketing = true (for Hive 0.x and 1.x), the tables should be populated properly. Things can go wrong if the bucketing column type is different during the insert and on read, or if you manually cluster by a value that's different from the table definition.

Bucketing In Hive, Physically, each bucket is just a file in the table directory, and Bucket numbering We can create bucketed tables with the help of CLUSTERED BY case, by using set mapred.reduce.tasks=32) and CLUSTER BY (state) and� You had created view cluster using SE54, this view cluster contain table to be maintained by user. then you must create Z transaction or tcode to attach these cluster view. 1.Create Transaction code using SE93 then select Transaction with parameters.

Handling bucketed tables, You can divide tables or partitions into buckets, which are stored in the following key int, value int) CLUSTERED BY(key) INTO 3 BUCKETS STORED AS ORC Performance of ACID V2 tables is on a par with non-ACID tables using buckets. or two creates problems if you do not promptly tune the buckets and partitions. Using grouping. To use grouping, select two or more elements on a visual by using Ctrl+click to select multiple elements. Then right-click one of the multiple selection elements and choose Group from the context menu. Once it's created, the group is added to the Legend bucket for the visual. The group also appears in the Fields list.

Comments
  • try CREATE TABLE BUCKET_TABLE CLUSTERED BY (key) INTO 1000 BUCKETS AS SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll ;
  • @patrick it throws error FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does not support partitioning in the target table ..did you try this?
  • Too bad that Hive doesn't support it because Cloudera Impala does it
  • That gives me the same syntax error (FAILED: ParseException line 1:235 missing EOF at 'clustered' near 'NULL' (state=42000,code=40000))
  • @Andrew this does not work for you as you are selecting rows from table1 where the keys are null and then using clusterd by on those null keys. Clustered by should run on a non-null column.
  • No, table1 keys aren't null in the query. table2 keys are.
  • But it looks like you are trying to pull all records with key as null.
  • 1. Please try to change NUll to NULL 2. Try this query FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) INSERT OVERWRITE TABLE BUCKET_TABLE SELECT a.* WHERE b.key IS NULL CLUSTERED BY (key) INTO 1000 BUCKETS;