Custom SerDe not supported by Impala, what's the best way to query files in CSV w/double quotes?

I have CSV data with each field surrounded by double quotes. When I created the Hive table, I used the SerDe 'com.bizo.hive.serde.csv.CSVSerde'. When I query that table in Impala, I get the error "SerDe not found".

I added the CSV SerDe JAR file to the /usr/lib/impala/lib folder.

Later I read in the Impala documentation that Impala does not support custom SerDes. In that case, how can I work around this so that my quoted CSV data is handled correctly? I want to use the CSV SerDe because it takes care of commas inside values, which are legitimate field values.

Thanks a lot

Can you use Hive? If so, here is an approach that might work. Create your table as an EXTERNAL TABLE in Hive and specify your SerDe in the CREATE statement (I think you need something like ROW FORMAT SERDE 'your_serde_here' after the column definitions of the CREATE TABLE statement). Before this you might need to do:

ADD JAR hdfs:///path/to/your_serde.jar;

Note that the JAR should be somewhere in HDFS, and the triple slash (///) is needed for this to work.
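Once the JAR has been added, the Hive-side table definition described above might look roughly like the following. This is only a sketch: the table name matches the placeholder used in this answer, the column names and the HDFS location are hypothetical, and the SerDe class is the one named in the question.

```sql
-- Sketch only: column names and the LOCATION path are hypothetical placeholders.
-- External table over the quoted CSV files, parsed by the custom SerDe.
CREATE EXTERNAL TABLE your_original_table (
  id   STRING,
  name STRING,
  note STRING
)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
STORED AS TEXTFILE
LOCATION 'hdfs:///path/to/csv_directory';
```

All columns are declared as STRING here on the assumption that this SerDe exposes every field as a string; if other types are needed, a CAST in a later SELECT is one way to get them.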

Then, still in Hive, duplicate the table into another table that is stored in a format with which Impala can easily work, such as PARQUET. Something like the following does this copying:

CREATE TABLE copy_of_table 
   STORED AS PARQUET AS
   SELECT * FROM your_original_table

Now in Impala use INVALIDATE METADATA to mark the metadata as stale:

INVALIDATE METADATA copy_of_table

You should be all set to happily work with copy_of_table in Impala now.

Let me know whether this works, as I might have to do something like this in the near future.

Impala Frequently Asked Questions | 5.9.x: Custom Hive Serializer/Deserializer classes (SerDes). Impala supports a set of common native file formats that have built-in SerDes in CDH. See How Impala Works…

I have been trying to figure out since yesterday why my table creation is not working. Since I can't link Impala to HBase, I can't run queries on my Twitter stream. Do I need a special JAR, as in Hive, for the SerDe properties? Here is my command: CREATE EXTERNAL TABLE HB_IMPALA_TWEETS ( id int, id_s

Within Hive

CREATE TABLE mydb.my_serde_table_impala AS SELECT * FROM mydb.my_serde_table

Within Impala

INVALIDATE METADATA mydb.my_serde_table_impala

Add these steps, preceded by dropping the _impala table, to whatever job populates or ingests files for the SerDe table.
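Put together, a refresh job along these lines might run the following each time new files arrive. It is a sketch that reuses the table names from this answer; the STORED AS PARQUET clause is borrowed from the other answer and is an optional assumption here.

```sql
-- In Hive: rebuild the Impala-readable copy from the SerDe-backed table.
DROP TABLE IF EXISTS mydb.my_serde_table_impala;

CREATE TABLE mydb.my_serde_table_impala
  STORED AS PARQUET
  AS SELECT * FROM mydb.my_serde_table;

-- In Impala (for example from impala-shell): pick up the recreated table.
INVALIDATE METADATA mydb.my_serde_table_impala;
```

Dropping and recreating the copy keeps it in step with the source table, at the cost of rewriting all of the data on every run.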

Impala bypasses MapReduce, unlike Hive. So Impala can't/doesn't use the SerDe the way MapReduce does.

Impala support for custom SerDe: Any comments on when Impala is planning to support custom SerDes? Also, can these be written in Java rather than C++? Thanks.

Impala does not support custom SerDes, but it natively supports text files with any separator you define. For example:

create table tsv (id int, s string, n int, t timestamp, b boolean) row format delimited fields terminated by '\t' stored as textfile;

The default SerDe class is now org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, and it is supported by Impala. Unfortunately it has far fewer features; for example, it does not understand quoted fields, so a separator inside a value has to be escaped instead.

Comments
  • The correct query is CREATE TABLE copy_of_table STORED AS PARQUET AS SELECT * FROM your_original_table; otherwise Hive returns a syntax exception.