Batch pipeline steps
🚧 The latest setup guidance for Snowplow can be found on the Snowplow documentation site.
This page refers to Snowplow R104
Corresponding documentation for earlier releases:
- R91-R103
- R90
- R89
- R87-R88
- R86 and earlier
The table below summarizes the recovery actions to take when a particular step from the dataflow diagram above fails.
| Failed step | Recovery actions |
|---|---|
| 1 | If no files have been moved yet (`raw:processing` [A] is empty), rerun the EmrEtlRunner as usual. If, on the other hand, some files have already been moved, rerun the EmrEtlRunner with the `--skip staging` option to proceed with processing those log files. |
| 2 | Rerun the EmrEtlRunner with the `--skip staging` option. |
| 3 | Rerun the EmrEtlRunner with the `--skip staging` option. Note: `enriched:bad` [D] and `enriched:error` [E] could contain files produced by step 3, so rerunning the EmrEtlRunner could result in duplicated bad/error files. This can be significant if the Elasticsearch step [8-9] is engaged for examining bad data [D]: the same data would be timestamped with different time values by different EMR runs. |
| 4 | Delete the `enriched:good` files [F] and rerun the EmrEtlRunner with either the `--skip staging` option or `--resume-from enrich`. |
| 5 | Delete the `enriched:good` files [F] and rerun the EmrEtlRunner with either the `--skip staging` option or `--resume-from enrich`. |
| 6 | Rerun the EmrEtlRunner with `--resume-from shred` (the enriched files will be copied from `enriched:good` [F]). |
| 7 | Rerun the EmrEtlRunner with `--resume-from shred` (the enriched files will be copied from `enriched:good` [F]). |
| 8 | Delete `shredded:good` [K] if any file has been moved, then rerun the EmrEtlRunner with `--resume-from shred` (the enriched files will be copied from `enriched:good` [F]). |
| 9 | Rerun the EmrEtlRunner with either the `--skip staging,enrich,shred` option, `--resume-from elasticsearch` (if Elasticsearch is used), or `--resume-from archive_raw`. |
| 10 | If duplicated bad data is not critical, rerun the EmrEtlRunner with the `--skip staging,enrich,shred` option. If duplicated bad data is critical, instructions to come (#2593). WARNING: in R90/R91, if you pass `--skip shred` to EmrEtlRunner then RDB Loader does not load unstructured events and contexts. This issue is resolved in R92. |
| 11 | If duplicated bad data is not critical, rerun the EmrEtlRunner with the `--skip staging,enrich,shred` option. If duplicated bad data is critical, instructions to come (#2593). WARNING: in R90/R91, if you pass `--skip shred` to EmrEtlRunner then RDB Loader does not load unstructured events and contexts. This issue is resolved in R92. |
| 12 | Rerun the EmrEtlRunner with the `--skip staging,enrich,shred,elasticsearch` option or `--resume-from archive_raw`. WARNING: in R90/R91, if you pass `--skip shred` to EmrEtlRunner then RDB Loader does not load unstructured events and contexts. This issue is resolved in R92. |
| 13 | The data loads are wrapped in a single transaction, so an RDB Loader failure will not result in a partial load. However, if multiple data targets are used and some targets have already been loaded, you may need to temporarily remove those targets from `config.yml` during your recovery process. The `rdb_load` step has three stages, in order: "discover", "load" and "analyze". The "discover" stage checks the availability of JSONPaths files; after the data is loaded in the "load" stage, the tables are analyzed to update the table statistics used by the query planner. To start RDB Loader from the beginning, use the `--resume-from rdb_load` option. If the failure occurred at the analyze stage (i.e. after the data was successfully loaded), you can skip the analyze with the `--resume-from archive_enriched` option; to run the analyze, resume with `--resume-from analyze`. |
| 14 | Rerun the EmrEtlRunner with the `--resume-from archive_enriched` option. |
| 15 | Rerun the EmrEtlRunner with the `--resume-from archive_shredded` option. |
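The recovery actions above can be sketched as shell commands. This is a minimal sketch, not a definitive runbook: the config and resolver file paths and the S3 bucket name are assumptions to be replaced with your own, while the `run` subcommand and the `--skip`/`--resume-from` flags come from the table above.

```shell
# Step 1 failure, with some raw files already moved to raw:processing:
# rerun, skipping the staging step.
# (config/config.yml and config/iglu_resolver.json are hypothetical paths.)
./snowplow-emr-etl-runner run \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --skip staging

# Steps 4-5 failure: delete the stale enriched:good output first, then
# resume from the enrich step. The bucket path is an assumption; use the
# enriched:good location from your config.yml.
aws s3 rm s3://my-snowplow-bucket/enriched/good/ --recursive
./snowplow-emr-etl-runner run \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --resume-from enrich
```

Note that `--skip` takes a comma-separated list of steps (e.g. `--skip staging,enrich,shred`), whereas `--resume-from` names the single step to restart from; the two options are alternatives, not combined.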
The following table applies to the Stream Enrich mode of the pipeline, in which the EmrEtlRunner stages already-enriched files (hence the `staging_stream_enrich` step).

| Failed step | Recovery actions |
|---|---|
| 1 | If no files have been moved yet (`enriched:good` [A] is empty), rerun the EmrEtlRunner as usual. If, on the other hand, some files have already been moved, rerun the EmrEtlRunner with the `--skip staging_stream_enrich` option to proceed with processing those enriched files. |
| 2 | Rerun the EmrEtlRunner with the `--skip staging_stream_enrich` option. |
| 3 | Rerun the EmrEtlRunner with the `--skip staging_stream_enrich` option. |
| 4 | Delete `shredded:good` [D], then rerun the EmrEtlRunner with the `--skip staging_stream_enrich` option. |
| 5 | You can ignore moving the `_SUCCESS` file; resume from step 6. |
| 6 | The data load cannot result in a partial load due to the use of `COMMIT`. However, if more than one data target is used, you would need to rerun the EmrEtlRunner with the successfully loaded targets removed from the `config.yml` configuration file in order to retry loading the failed target. Note: if the failure occurred at the analyze stage, you can skip it with the `--skip staging_stream_enrich,shred,rdb_load` option. |
| 7 | Rerun the EmrEtlRunner with the `--skip staging_stream_enrich,shred,rdb_load` option. |
| 8 | Rerun the EmrEtlRunner with the `--resume-from archive_shredded` option. |
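The Stream Enrich recovery actions follow the same pattern; a minimal sketch, again assuming hypothetical config and resolver paths:

```shell
# Steps 2-4 failure: the enriched files are already staged, so skip the
# staging_stream_enrich step on the rerun.
# (config/config.yml and config/iglu_resolver.json are hypothetical paths.)
./snowplow-emr-etl-runner run \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --skip staging_stream_enrich

# Step 6 failure at the analyze stage (data already loaded): skip past
# staging, shredding and loading entirely.
./snowplow-emr-etl-runner run \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --skip staging_stream_enrich,shred,rdb_load
```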
RDB Loader 2.0.0 adds SNS as a new shredding complete message destination to RDB Shredder. It has a new capability to run two or more loaders in parallel, each processing the same shredded data, and the same shredding complete messages. Admittedly running multiple loaders in parallel has limited benefit for now, but it is on our roadmap to add new alternative destinations to the RDB loader framework. When this is done, it will be possible to run a separate Redshift loader and, say, a Databricks loader from the same data.
Copyright © 2012-2021 Snowplow Analytics Ltd. Documentation terms of use.