Loading shredded types
🚧 This page refers to an old version of RDB Loader. The documentation for the latest version can be found on the Snowplow documentation site.
WARNING: StorageLoader is deprecated and has been replaced by RDB Loader since R90 Lascaux. RDB Loader must be scheduled via EmrEtlRunner.
- Overview
- Loading Snowplow-authored JSONs
- Defining and installing a new table
- Creating and uploading a JSON Paths file
- Configuring loading
- Next steps
Snowplow has a Shredding process for Redshift which consists of three phases:
- Extracting unstructured event JSONs and context JSONs from enriched event files into their own files
- Removing endogenous duplicate records, which are sometimes introduced within the Snowplow pipeline (feature added in R76)
- Loading those files into corresponding tables in Redshift
The third phase is performed by RDB Loader and is documented on this wiki page; to configure the first phase, visit the Configuring shredding wiki page (the second phase requires no configuration).
Loading Snowplow-authored JSONs is straightforward: Snowplow provides pre-made Redshift table definitions for all Snowplow-authored JSONs. You can find these here:
https://github.com/snowplow/snowplow/tree/master/4-storage/redshift-storage/sql
For example, if you have link click tracking enabled in the JavaScript Tracker, then install com.snowplowanalytics.snowplow/link_click_1.sql into your Snowplow database.
Each table needs to be loaded using a JSON Paths file. Snowplow hosts JSON Paths files for all Snowplow-authored JSONs. RDB Loader will automatically locate these JSON Paths files and use them to load shredded types into Redshift.
RDB Loader loads each shredded type into its own table in Redshift. You need to create a Redshift table for each new shredded type you have defined.
The table name must be a SQL-friendly compound of the schema's vendor, name and model version, converted from CamelCase to snake_case and with any periods or hyphens replaced with underscores. For example, with the Iglu schema key:
iglu:com.acme.website/anonymous-customer/jsonschema/1-0-2
The table name would be:
com_acme_website_anonymous_customer_1
With the Iglu schema key:
iglu:de.company/AddToBasket/jsonschema/2-1-0
The table name would be:
de_company_add_to_basket_2
Note that only the model version is included - do not include the remaining portions of the version (SchemaVer revision or addition).
Each table definition starts with a set of standard "boilerplate" fields. These fields help to document the type hierarchy which has been shredded and are very useful for later analysis. These fields are as follows:
CREATE TABLE atomic.com_snowplowanalytics_snowplow_link_click_1 (
-- Schema of this type
schema_vendor varchar(128) encode runlength not null,
schema_name varchar(128) encode runlength not null,
schema_format varchar(128) encode runlength not null,
schema_version varchar(128) encode runlength not null,
-- Parentage of this type
root_id char(36) encode raw not null,
root_tstamp timestamp encode raw not null,
ref_root varchar(255) encode runlength not null,
ref_tree varchar(1500) encode runlength not null,
ref_parent varchar(255) encode runlength not null,
...
Now you can add the fields required for your JSON:
...
-- Properties of this type
element_id varchar(255) encode text32k,
element_classes varchar(2048) encode raw,
element_target varchar(255) encode text255,
target_url varchar(4096) encode text32k not null
...
Note that, in the example above, element_classes was originally a JSON array in the source JSON. Because our Shredding process does not yet support nested shredding, we simply set this field to a large varchar; an analyst can use Redshift's in-built JSON support to explore this field's contents.
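For instance, a query along these lines (a sketch only, using Redshift's built-in json_array_length and json_extract_array_element_text functions; the exact contents of element_classes depend on your tracker setup) could unpack the first class name from the array:

-- Count the classes attached to each clicked element, then pull out the first one
SELECT
  root_id,
  json_array_length(element_classes)                   AS class_count,
  json_extract_array_element_text(element_classes, 0)  AS first_class
FROM atomic.com_snowplowanalytics_snowplow_link_click_1
WHERE element_classes IS NOT NULL
LIMIT 10;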
And finally, all tables should have a standard DISTKEY and SORTKEY:
...
)
DISTSTYLE KEY
-- Optimized join to atomic.events
DISTKEY (root_id)
SORTKEY (root_tstamp);
These keys are designed to make JOINs from these tables back to atomic.events as performant as possible.
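As an illustration (a sketch only, assuming the usual Snowplow convention that a shredded row's root_id matches the parent event's event_id and root_tstamp matches its collector_tstamp), such a join might look like:

-- Join link click events back to their parent atomic events
SELECT
  ev.page_url,
  lc.target_url
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_link_click_1 lc
  ON  lc.root_id     = ev.event_id
  AND lc.root_tstamp = ev.collector_tstamp  -- matching sort keys lets Redshift prune blocks
LIMIT 10;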
Install the table into your Redshift database. The table must be stored in the same schema as your events table.
You need to create a JSON Paths file which RDB Loader will use to load your shredded type into Redshift.
The format is simple - a JSON Paths file consists of a JSON array, where each element corresponds to a column in the target table. For full details, see the Copy from JSON documentation from Amazon.
To correspond to the table definition, each JSON Paths file must start with the following elements:
{
"jsonpaths": [
"$.schema.vendor",
"$.schema.name",
"$.schema.format",
"$.schema.version",
"$.hierarchy.rootId",
"$.hierarchy.rootTstamp",
"$.hierarchy.refRoot",
"$.hierarchy.refTree",
"$.hierarchy.refParent",
...
Then finish your array of JSON Paths with an element for each custom field in your table. For example, here are the remaining fields for the Snowplow link click event:
...
"$.data.elementId",
"$.data.elementClasses",
"$.data.elementTarget",
"$.data.targetUrl"
]
}
A few things to note:
- Relational Database Shredder will nest all of your JSON's properties into a data property, hence the namespacing seen above
- Currently the Shredding process does not support nested tables. A nested property such as an array (like elementClasses above) should be loaded into a single field
The JSON Paths file should be named the same as the table you created in the previous section, minus the shredded type's vendor. For example, if your table is called:
com_acme_website_anonymous_customer_1
Then your JSON Paths file should be called:
anonymous_customer_1.json
Upload the JSON Paths file to a private S3 bucket which is accessible using the AWS credentials provided in your config.yml file.
Store the JSON Paths file in a sub-folder named after the vendor, for example:
s3://acme-jsonpaths-files/com.acme.website/anonymous_customer_1.json
The RDB Loader step uses the same config.yml as EmrEtlRunner.
Now you need to update config.yml to load the shredded types. First, make sure that jsonpath_assets: points to the private S3 bucket where you stored the JSON Paths file in the previous section:
buckets:
  jsonpath_assets: s3://acme-jsonpaths-files
Next, make sure that you have populated the shredded: section correctly:
shredded:
good: s3://my-data-bucket/shredded/good # e.g. s3://my-out-bucket/shredded/good
bad: s3://my-data-bucket/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
errors: s3://my-data-bucket/shredded/errors # Leave blank unless :continue_on_unexpected_error: set to true below
archive: s3://my-data-bucket/shredded/archive # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
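For orientation, in a typical EmrEtlRunner config.yml these keys sit under aws: → s3: → buckets: (a minimal sketch; the surrounding keys and bucket names here are illustrative only, so check them against your own file):

aws:
  s3:
    buckets:
      jsonpath_assets: s3://acme-jsonpaths-files   # private bucket holding your JSON Paths files
      shredded:
        good: s3://my-data-bucket/shredded/good
        bad: s3://my-data-bucket/shredded/bad
        errors: s3://my-data-bucket/shredded/errors
        archive: s3://my-data-bucket/shredded/archive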
If you are using separate configuration files, make sure this cross-checks with the corresponding paths in your EmrEtlRunner's config.yml.
That's it for configuring the pipeline for shredding. You should now be ready to load shredded types into Redshift.