Batch pipeline steps
🚧 The latest setup guidance for Snowplow can be found on the Snowplow documentation site.
This page refers to Snowplow R104
Corresponding documentation for earlier releases:
- R91-R103
- R90
- R89
- R87-R88
- R86 and earlier
The table below summarizes the recovery actions to take when a particular step from the dataflow diagram above fails.
| Failed step | Recovery actions |
|---|---|
| 1 | If no files have been moved yet (`raw:processing` [A] is empty), rerun the EmrEtlRunner as usual. If, on the other hand, some files have already been moved, rerun the EmrEtlRunner with the `--skip staging` option to proceed with processing those log files. |
| 2 | Rerun the EmrEtlRunner with the `--skip staging` option. |
| 3 | Rerun the EmrEtlRunner with the `--skip staging` option. Note: `enriched:bad` [D] and `enriched:error` [E] could contain files produced by step 3, so rerunning the EmrEtlRunner could result in duplicated bad/error files. This can be significant if the Elasticsearch step [8-9] is engaged for examining bad data [D]: the same data would be timestamped with different time values by different EMR runs. |
| 4 | Delete the `enriched:good` files [F] and rerun the EmrEtlRunner with either the `--skip staging` option or `--resume-from enrich`. |
| 5 | Delete the `enriched:good` files [F] and rerun the EmrEtlRunner with either the `--skip staging` option or `--resume-from enrich`. |
| 6 | Rerun the EmrEtlRunner with `--resume-from shred` (the enriched files will be copied from `enriched:good` [F]). |
| 7 | Rerun the EmrEtlRunner with `--resume-from shred` (the enriched files will be copied from `enriched:good` [F]). |
| 8 | Delete `shredded:good` [K] if any file has been moved, then rerun the EmrEtlRunner with `--resume-from shred` (the enriched files will be copied from `enriched:good` [F]). |
| 9 | Rerun the EmrEtlRunner with either the `--skip staging,enrich,shred` option, `--resume-from elasticsearch` (if Elasticsearch is used), or `--resume-from archive_raw`. |
| 10 | If duplicated bad data is not critical, rerun the EmrEtlRunner with the `--skip staging,enrich,shred` option. If duplicated bad data is critical, instructions to come (#2593). WARNING: in R90/R91, if you pass `--skip shred` to EmrEtlRunner then RDB Loader does not load unstructured events and contexts. This issue is resolved in R92. |
| 11 | If duplicated bad data is not critical, rerun the EmrEtlRunner with the `--skip staging,enrich,shred` option. If duplicated bad data is critical, instructions to come (#2593). WARNING: in R90/R91, if you pass `--skip shred` to EmrEtlRunner then RDB Loader does not load unstructured events and contexts. This issue is resolved in R92. |
| 12 | Rerun the EmrEtlRunner with the `--skip staging,enrich,shred,elasticsearch` option or `--resume-from archive_raw`. WARNING: in R90/R91, if you pass `--skip shred` to EmrEtlRunner then RDB Loader does not load unstructured events and contexts. This issue is resolved in R92. |
| 13 | The data loads are wrapped in a single transaction, so an RDB Loader failure will not result in a partial load. However, if multiple data targets are used and some targets have already been loaded, you may need to temporarily remove those targets from `config.yml` during your recovery process. The `rdb_load` step has three stages, in order: "discover", "load" and "analyze". The "discover" stage checks the availability of JSONPaths files; after the data is loaded in the "load" stage, the tables are analyzed to update the table statistics used by the query planner. To start RDB Loader from the beginning, use the `--resume-from rdb_load` option. If the failure occurred at the analyze stage (i.e. after the data was successfully loaded), you can skip the analyze with the `--resume-from archive_enriched` option; to run the analyze, resume with `--resume-from analyze`. |
| 14 | Rerun the EmrEtlRunner with the `--resume-from archive_enriched` option. |
| 15 | Rerun the EmrEtlRunner with the `--resume-from archive_shredded` option. |
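The recovery actions above can be sketched as shell commands. This is a minimal sketch, not a definitive runbook: the config and resolver file paths and the S3 bucket name are assumptions to be replaced with your own, while the `run` subcommand and the `--skip`/`--resume-from` flags come from the table above.

```shell
# Step 1 failure, with some raw files already moved to raw:processing:
# rerun, skipping the staging step.
# (config/config.yml and config/iglu_resolver.json are hypothetical paths.)
./snowplow-emr-etl-runner run \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --skip staging

# Steps 4-5 failure: delete the stale enriched:good output first, then
# resume from the enrich step. The bucket path is an assumption; use the
# enriched:good location from your config.yml.
aws s3 rm s3://my-snowplow-bucket/enriched/good/ --recursive
./snowplow-emr-etl-runner run \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --resume-from enrich
```

Note that `--skip` takes a comma-separated list of steps (e.g. `--skip staging,enrich,shred`), whereas `--resume-from` names the single step to restart from; the two options are alternatives, not combined.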
The following table applies to the Stream Enrich mode of the pipeline, in which the EmrEtlRunner stages already-enriched files (hence the `staging_stream_enrich` step).

| Failed step | Recovery actions |
|---|---|
| 1 | If no files have been moved yet (`enriched:good` [A] is empty), rerun the EmrEtlRunner as usual. If, on the other hand, some files have already been moved, rerun the EmrEtlRunner with the `--skip staging_stream_enrich` option to proceed with processing those enriched files. |
| 2 | Rerun the EmrEtlRunner with the `--skip staging_stream_enrich` option. |
| 3 | Rerun the EmrEtlRunner with the `--skip staging_stream_enrich` option. |
| 4 | Delete `shredded:good` [D], then rerun the EmrEtlRunner with the `--skip staging_stream_enrich` option. |
| 5 | You can ignore moving the `_SUCCESS` file; resume from step 6. |
| 6 | The data load cannot result in a partial load due to the use of `COMMIT`. However, if more than one data target is used, you would need to rerun the EmrEtlRunner with the successfully loaded targets removed from the `config.yml` configuration file in order to retry loading the failed target. Note: if the failure occurred at the analyze stage, you can skip it with the `--skip staging_stream_enrich,shred,rdb_load` option. |
| 7 | Rerun the EmrEtlRunner with the `--skip staging_stream_enrich,shred,rdb_load` option. |
| 8 | Rerun the EmrEtlRunner with the `--resume-from archive_shredded` option. |
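The Stream Enrich recovery actions follow the same pattern; a minimal sketch, again assuming hypothetical config and resolver paths:

```shell
# Steps 2-4 failure: the enriched files are already staged, so skip the
# staging_stream_enrich step on the rerun.
# (config/config.yml and config/iglu_resolver.json are hypothetical paths.)
./snowplow-emr-etl-runner run \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --skip staging_stream_enrich

# Step 6 failure at the analyze stage (data already loaded): skip past
# staging, shredding and loading entirely.
./snowplow-emr-etl-runner run \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --skip staging_stream_enrich,shred,rdb_load
```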
RDB Loader 2.0.0 adds SNS as a new shredding complete message destination to RDB Shredder. It has a new capability to run two or more loaders in parallel, each processing the same shredded data, and the same shredding complete messages. Admittedly running multiple loaders in parallel has limited benefit for now, but it is on our roadmap to add new alternative destinations to the RDB loader framework. When this is done, it will be possible to run a separate Redshift loader and, say, a Databricks loader from the same data.
Copyright © 2012-2021 Snowplow Analytics Ltd. Documentation terms of use.