Loading JSON data to AWS Redshift results in NULL values

I am trying to perform a load/copy operation to import data from JSON files in an S3 bucket directly to Redshift. The COPY operation succeeds, and after the COPY the table has the correct number of rows/records, but every record is NULL!

It takes the expected amount of time to load, the COPY command returns OK, and the Redshift console reports the load as successful with no errors... but a simple query against the table returns only NULL values.

The JSON is very simple + flat, and formatted correctly (according to examples I found here: http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html)

Basically, it is one JSON object per line, formatted like:

{ "col1": "val1", "col2": "val2", ... }
{ "col1": "val1", "col2": "val2", ... }
{ "col1": "val1", "col2": "val2", ... }

I have tried rewriting the schema based on the values and data types found in the JSON objects, and also copying from uncompressed files. I thought perhaps the JSON was not being parsed correctly on load, but presumably COPY would raise an error if the objects could not be parsed.

My COPY command looks like this:

copy events from 's3://mybucket/json/prefix' 
with credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
json 'auto' gzip;

Any guidance would be appreciated! Thanks.

So I have discovered the cause - This would not have been evident from the description I provided in my original post.

When you create a table in Redshift, the column names are converted to lowercase. When you perform a COPY from JSON with the 'auto' option, the matching of JSON keys to column names is case-sensitive.

The input data that I have been trying to load uses camelCase for its field names, so when I perform the COPY, the fields do not match up with the defined schema (which now uses all-lowercase column names).

The operation does not raise an error, though. It just leaves NULLs in all the columns that did not match (in this case, all of them).
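If renaming the keys in the source data is not practical, a JSONPaths file can map them to the lowercase columns explicitly. A minimal sketch of what that would look like for data shaped like mine (the camelCase keys and the S3 path of the JSONPaths file below are made-up placeholders):

{
  "jsonpaths": [
    "$['eventId']",
    "$['eventType']",
    "$['createdAt']"
  ]
}

The COPY command then points at that file instead of using 'auto':

copy events from 's3://mybucket/json/prefix'
with credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
json 's3://mybucket/jsonpaths/events.jsonpaths' gzip;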

Hope this helps somebody to avoid the same confusion!


For cases where the JSON data objects don't correspond directly to column names, you can use a JSONPaths file to map the JSON elements to columns, as mentioned by TimZ and described here.


COPY maps the data elements in the JSON source data to the columns in the target table by matching object keys, or names, in the source name/value pairs to the names of columns in the target table. The matching is case-sensitive. Column names in Amazon Redshift tables are always lowercase, so when you use the ‘auto’ option, matching JSON field names must also be lowercase. If the JSON field name keys aren't all lowercase, you can use a JSONPaths file to explicitly map column names to JSON field name keys.

The solution would be to use a JSONPaths file.

Example JSON:

{
  "Name": "Major",
  "Age": 19,
  "Add": {
    "street": {
      "st": "5 maint st",
      "ci": "Dub"
    },
    "city": "Dublin"
  },
  "Category_Name": ["MLB", "GBM"]
}

Example table:

create table customer (
  name varchar,
  age int,
  address varchar,
  catname varchar
);

Example JSONPaths file:

{
  "jsonpaths": [
    "$['Name']",
    "$['Age']",
    "$['Add']",
    "$['Category_Name']"
  ]
}

Example COPY command:

copy customer -- target Redshift table
from 's3://mybucket/customer.json'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
json 's3://mybucket/jpath.json'; -- JSONPaths file to map fields
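
Note that with a JSONPaths file the matching is positional rather than by name: the first expression fills the first column of the target table, the second fills the second, and so on. For the example above that works out to:

"$['Name']"          -> name
"$['Age']"           -> age
"$['Add']"           -> address
"$['Category_Name']" -> catname

so the JSON keys no longer need to match the column names (or their case) at all.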

Examples are taken from here


Comments
  • This is the same issue I found after some digging. But I was wondering if there was documentation / a solution where you could tell it to ignore case, or convert it. Changing the JSON key format will be quite a pain with the volume I'm dealing with. Edit: never mind, you'll have to use the JSONPaths solution
  • I stumbled across this error because a NOT NULL column was saying my JSON had no value for it, which was wrong. A quick Google search landed here. I'd say accept this answer as it was a huge help for a crucial component of a pipeline I'm working on. I will see about forwarding a request to the Amazon team via ticket to support case insensitive column names (as would be SQL standard anyways).
  • I can confirm this to be the case also. It's SUCH a shame for Amazon to NOT mention case at all in their docs and a HUGE miss for the Redshift "Auto" copy. Essentially, if your JSON property names use anything other than lowercase characters, you must use a JSONPaths file!