Problem
We want to convert SOME of the nested keys inside a JSON field into normal fields in a STRUCT to take advantage of columnar storage, while keeping the remaining keys (because these keys are very flexible/sparse) inside the original JSON. The final result is a STRUCT with the selected keys, plus the original JSON trimmed/reduced to a slimmer variant with the common keys removed (so the remaining JSON is much smaller).

Currently, DuckDB does not support JSONPath in json_transform().

It is a common ELT feature to schematize JSON/Variant data into a more structured schema, but we can only do so much because the input JSON can be messy. So the practical approach is to take good control of the commonly used and frequently appearing keys, and leave the other keys in the JSON.
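To make the desired semantics concrete, here is a minimal Python sketch of the split described above. The function name split_json is hypothetical (it is not an existing DuckDB API); stdlib json stands in for the engine's JSON type:

```python
import json

def split_json(doc: str, keys: list[str]) -> tuple[dict, str]:
    """Hypothetical illustration: pick `keys` into a struct-like dict
    and return the remaining document as a slimmer JSON string."""
    obj = json.loads(doc)
    struct = {k: obj.pop(k, None) for k in keys}  # selected keys -> struct
    return struct, json.dumps(obj)                # residual JSON without them

doc = '{"id": 1, "name": "a", "extra": {"x": true}, "rare_key": null}'
struct, slim = split_json(doc, ["id", "name"])
print(struct)  # {'id': 1, 'name': 'a'}
print(slim)    # {"extra": {"x": true}, "rare_key": null}
```

The struct holds the common, columnar-friendly keys, while the sparse/flexible keys stay in the (now smaller) JSON.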
Proposal
json_transform() is similar to Snowflake's object_pick: https://docs.snowflake.com/en/sql-reference/functions/object_pick

But object_delete (https://docs.snowflake.com/en/sql-reference/functions/object_delete) can effectively remove all the top-level keys that have been flattened/shredded. DuckDB probably needs a similar function as well.
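The complementary behavior of the two Snowflake functions can be sketched in plain Python (object_pick and object_delete below re-implement the documented top-level semantics for illustration; they are not DuckDB functions):

```python
def object_pick(obj: dict, keys: list[str]) -> dict:
    # Keep only the listed top-level keys (like Snowflake's OBJECT_PICK).
    return {k: v for k, v in obj.items() if k in keys}

def object_delete(obj: dict, keys: list[str]) -> dict:
    # Drop the listed top-level keys (like Snowflake's OBJECT_DELETE).
    return {k: v for k, v in obj.items() if k not in keys}

row = {"id": 7, "ts": "2024-01-01", "payload": {"a": 1}, "misc": "rare"}
picked = object_pick(row, ["id", "ts"])
rest = object_delete(row, ["id", "ts"])
# Together they partition the object: nothing lost, nothing duplicated.
assert {**picked, **rest} == row
```

This partition property is exactly why a delete-style counterpart is useful once the picked keys have been shredded into columns.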
The ultimate solution contains 2 more innovations:

1. Support JSONPath in the structure parameter for json_transform(), so that we can extract 2nd- and 3rd-level nested keys. Since there is a performance penalty for JSONPath, we could consider a separate function json_transformation_ext() to support JSONPath and keep json_transform() handling only top-level keys.
2. Support a 3rd parameter, remove_selected_keys=true, to also remove those transformed keys from the original JSON. The return then contains a new STRUCT and a new JSON.

This feature will be quite useful for data preparation and schematization.
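A rough sketch of what JSONPath-style extraction combined with remove_selected_keys=true could mean, using a toy dotted-path resolver (extract_paths is hypothetical, and real JSONPath is far richer than "a.b" paths):

```python
import json

def extract_paths(doc: str, paths: list[str]) -> tuple[dict, str]:
    """Pull dotted paths like 'a.b' into a flat struct and remove
    them from the JSON (the remove_selected_keys=true behavior)."""
    obj = json.loads(doc)
    struct = {}
    for path in paths:
        parts = path.split(".")
        node = obj
        for p in parts[:-1]:          # walk to the parent of the leaf
            node = node.get(p, {})
        struct[path] = node.pop(parts[-1], None)  # take and remove the leaf
    return struct, json.dumps(obj)

doc = '{"user": {"id": 42, "tags": ["x"]}, "meta": {"v": 1}}'
struct, slim = extract_paths(doc, ["user.id", "meta.v"])
print(struct)  # {'user.id': 42, 'meta.v': 1}
print(slim)    # {"user": {"tags": ["x"]}, "meta": {}}
```

Note that the extracted 2nd-level keys land in the struct while the nested containers stay behind in the slimmed JSON, which is the shape the proposal asks for.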
Thank you!