Upgrade Guide

Łukasz Indykiewicz edited this page Nov 27, 2019 · 119 revisions

This page lists the steps required to upgrade to each Snowplow release, with the latest release at the top. Upgrades are sequential: each section describes how to move from the previous release to that one.

You can also use the Snowplow Version Matrix as a guide to the internal component dependencies of a particular release.

For easier navigation, please follow the links below.

Snowplow 118 Morgantina

This upgrade guide is to upgrade from R116.

Although this release contains a lot of refactoring (mainly around bad rows), the configuration is almost unchanged.

Only the configuration of the referer parser enrichment needs to be updated, so that it becomes:

{
  "schema": "iglu:com.snowplowanalytics.snowplow/referer_parser/jsonschema/2-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "referer_parser",
    "enabled": true,
    "parameters": {
      "database": "referer-tests.json",
      "internalDomains": [
        "www.subdomain1.snowplowanalytics.com"
      ],
      "uri": "s3://snowplow-hosted-assets/third-party/referer-parser/referer-tests.json"
    }
  }
}
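
As a rough illustration, an existing enrichment file could be brought to this layout with a few lines of Python. This is only a sketch and not part of Snowplow: the function name `upgrade_referer_parser` is hypothetical, and it assumes the older configuration already carries a `data.parameters` object holding `internalDomains`.

```python
import copy

NEW_SCHEMA = "iglu:com.snowplowanalytics.snowplow/referer_parser/jsonschema/2-0-0"


def upgrade_referer_parser(old_config, database, uri):
    """Return a copy of old_config following the 2-0-0 layout (hypothetical helper)."""
    new_config = copy.deepcopy(old_config)  # leave the caller's dict untouched
    new_config["schema"] = NEW_SCHEMA
    params = new_config["data"].setdefault("parameters", {})
    params.setdefault("internalDomains", [])  # keep existing domains if present
    params["database"] = database             # file name of the referers database
    params["uri"] = uri                       # location the database is fetched from
    return new_config
```

The values passed for `database` and `uri` would be the ones shown in the configuration above.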

Following an upgrade of pureconfig, the library used to parse the Scala Stream Collector configuration, the following lines of a collector configuration that uses PubSub

sink {
  enabled = googlepubsub

need to become

sink {
  enabled = google-pub-sub
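
If you manage several collector configurations, the rename can be scripted. The sketch below is an illustration only: `rename_pubsub_sink` is a hypothetical helper that treats the HOCON file as plain text and rewrites only lines of the form `enabled = googlepubsub` (optionally quoted, optionally followed by a `#` comment).

```python
import re

# Matches "enabled = googlepubsub" on its own line, with or without quotes.
_OLD_SINK = re.compile(
    r'^([ \t]*enabled[ \t]*=[ \t]*)("?)googlepubsub\2([ \t]*(?:#.*)?)$',
    re.MULTILINE,
)


def rename_pubsub_sink(hocon_text):
    """Replace the old sink name with the new one, keeping indentation and comments."""
    return _OLD_SINK.sub(r"\g<1>\g<2>google-pub-sub\g<2>\g<3>", hocon_text)
```

Other `enabled = ...` lines, such as `enabled = kinesis` or `enabled = true`, are left as they are.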

Snowplow 117 Biskupin

This upgrade guide is to upgrade from R116.

TLS port binding and certificate

A new version of the Scala Stream Collector can be found on our Docker Hub repository under the 0.17.0 tag.

For example, to start an SSL-enabled server that automatically upgrades HTTP connections to HTTPS, the collector configuration should contain:

ssl {
  enable = true
  redirect = true
  port = 443
}

However, this configuration uses the certificates that are defined in the environment and attached to the JVM. To override the default behaviour and use a custom certificate, define the low-level ssl-config section (an Akka configuration section) as follows:

ssl-config {
  keyManager = {
    stores = [
      {type = "PKCS12", classpath = false, path = ${CERT_FILE}, password = "pass" }
    ]
  }
}

IPv6 anonymization

A new version of Snowplow Common Enrich can be found on the Maven repository.

The schema for the configuration of the enrichment has been updated to version 1-0-1:

{
	"schema": "iglu:com.snowplowanalytics.snowplow/anon_ip/jsonschema/1-0-1",
	"data": {
		"name": "anon_ip",
		"vendor": "com.snowplowanalytics.snowplow",
		"enabled": true,
		"parameters": {
			"anonOctets": 1,
			"anonSegments": 1
		}
	}
}
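
To make the two parameters concrete, here is a minimal sketch of what anonymizing the trailing parts of an address could look like. It is not the enrichment's code: the function name is hypothetical, and masking with `x` and counting from the right-hand side are assumptions made for illustration. `anonOctets` applies to IPv4 addresses and `anonSegments` to IPv6 addresses.

```python
import ipaddress


def anonymize_ip(ip, anon_octets=1, anon_segments=1):
    """Mask the last N octets (IPv4) or segments (IPv6) of an address with 'x'."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    if addr.version == 4:
        parts, sep, count = str(addr).split("."), ".", anon_octets
    else:
        # Use the fully expanded form so that every segment is present.
        parts, sep, count = addr.exploded.split(":"), ":", anon_segments
    count = max(0, min(count, len(parts)))
    kept = parts[: len(parts) - count]
    return sep.join(kept + ["x"] * count)
```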

Additional event fingerprint hashing methods

A new version of Snowplow Common Enrich can be found on the Maven repository.

The schema for the configuration of the enrichment has been updated to version 1-0-1:

{
  "schema": "iglu:com.snowplowanalytics.snowplow/event_fingerprint_config/jsonschema/1-0-1",
  "data": {
    "name": "event_fingerprint_config",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "excludeParameters": ["cv", "eid", "nuid", "stm"],
      "hashAlgorithm": "SHA1"
    }
  }
}
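
The idea behind the two parameters can be sketched as follows: fields listed in `excludeParameters` are ignored, and the remaining fields are hashed with the algorithm named in `hashAlgorithm`. The code is an illustration under assumptions, not the enrichment's implementation: the function name is hypothetical, and the way fields are ordered and serialized before hashing is invented for the example.

```python
import hashlib


def event_fingerprint(params, exclude_parameters, hash_algorithm="SHA1"):
    """Hash the event's fields, skipping those listed in exclude_parameters."""
    hasher = hashlib.new(hash_algorithm.lower())  # e.g. "SHA1" -> "sha1"
    for key in sorted(params):                    # stable order, independent of input order
        if key in exclude_parameters:
            continue
        hasher.update(f"{key}={params[key]}&".encode("utf-8"))
    return hasher.hexdigest()
```

Two payloads that differ only in excluded fields such as `eid` or `stm` produce the same fingerprint, which is what makes the value useful for spotting duplicates.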

Support for the spot market for core instances

A new version of EmrEtlRunner can be found on our Bintray repository under the r117-biskupin version.

To enable spot instances, add a core_instance_bid setting to your config.yml file. This setting specifies the bid, in USD, for one hour of an EC2 spot instance.

aws:
  emr:
    jobflow:
      core_instance_bid: 0.3

Beam Enrich

A new version of Beam Enrich can be found on our Docker Hub repository under the 0.4.0 tag.

It contains the newest Snowplow Common Enrich.

Stream Enrich

A new version of Stream Enrich can be found on our Docker Hub repository under the 0.22.0 tag.

It contains the newest Snowplow Common Enrich.

Spark Enrich

A new version of the Spark Enrich can be used by setting it in your EmrEtlRunner configuration:

enrich:
  version:
    spark_enrich: 1.19.0

or directly make use of the new Spark Enrich available at:

s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.19.0.jar

It contains the newest Snowplow Common Enrich.

Read more

Snowplow 116 Madara Rider

This release focuses on adding new features to the Scala Stream Collector, including the ability to set first-party cookies server-side on multiple domains and to use custom path mappings.

It also includes an update to EmrEtlRunner, to add support for shredded data in tsv format.

Scala Stream Collector

A new version of the Scala Stream Collector can be found on our Bintray.

You can also find the images on Docker Hub.

To make use of the new features, you'll need to update your configuration as follows:

  • Add a collector.paths section if you want to provide custom path mappings:
paths {
  "/com.acme/track" = "/com.snowplowanalytics.snowplow/tp2" # for tracker protocol 2 requests
  "/com.acme/redirect" = "/r/tp2"                           # for redirect requests
  "/com.acme/iglu" = "/com.snowplowanalytics.iglu/v1"       # for Iglu webhook requests
}
  • In collector.cookie there is no longer a domain setting. Instead, you can provide a list of domains to be used and / or a fallbackDomain in case none of the origin domains matches the ones you specified:
domains = [
  "acme.com"
  "acme.net"
]

fallbackDomain = "roadrunner.com" # no leading dot

If you don't wish to use multiple domains and want to preserve the previous behaviour, leave domains empty and specify a fallbackDomain with the same value as collector.cookie.domain from your previous configuration (but leave out any leading dots).

Both domains and fallbackDomain are optional settings, just like domain is an optional setting in earlier versions.

  • Another addition to collector.cookie are controls for extra directives to be passed in the Set-Cookie response header.
secure = false    # set to true if you want to enforce secure connections
httpOnly = false  # set to true if you want to make the cookie inaccessible to non-HTTP requests
sameSite = "None" # or `Lax`, or `Strict`. This is an optional parameter.
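
The interplay between domains and fallbackDomain described above can be sketched in a few lines. This is an illustration only: both function names are hypothetical, and the suffix-based matching rule is an assumption rather than a description of the collector's code.

```python
def pick_cookie_domain(origin_hosts, domains, fallback_domain=None):
    """Return the first configured domain matching an origin host, else the fallback."""
    for host in origin_hosts:
        for domain in domains:
            # Assumed rule: exact match or any subdomain of the configured domain.
            if host == domain or host.endswith("." + domain):
                return domain
    return fallback_domain


def migrate_cookie_domain(old_domain):
    """Turn an old collector.cookie.domain value into a fallbackDomain value."""
    return old_domain.lstrip(".")  # fallbackDomain takes no leading dot
```

With `domains` left empty, every request ends up with the fallback, which matches the advice above for preserving the previous single-domain behaviour.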

Read more

Snowplow 115 Sigiriya

This release includes two updates to EmrEtlRunner: a bug fix and a reliability improvement.

It also includes an update to Event Manifest Populator, so that it can read the files containing the events produced by stream-enrich.

EmrEtlRunner

The latest version of EmrEtlRunner is available on our Bintray here.

Snowplow 114 Polonnaruwa

This release includes a number of new features and updates, most of which live in Scala Common Enrich. Mainly, a new user agent enrichment has been added, as well as the possibility to use a remote adapter.

Upgrading your enrichment platform

If you are a GCP pipeline user, a new Beam Enrich can be found on Bintray.

If you are a Kinesis or Kafka pipeline user, a new Stream Enrich can be found on Bintray.

Finally, if you are a batch pipeline user, a new Spark Enrich can be used by setting the new version in your EmrEtlRunner configuration:

enrich:
  version:
    spark_enrich: 1.18.0 # WAS 1.17.0

or directly make use of the new Spark Enrich available at:

s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.18.0.jar

A new version of EmrEtlRunner is also available in our Bintray.

Using YAUAA enrichment

This enrichment is based on in-memory HashMaps and requires roughly 400 MB of RAM (see here).

To use the new YAUAA enrichment, add yauaa_enrichment_config.json to the folder containing your enrichment configuration files, with the following content:

{
    "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/yauaa_enrichment_config/jsonschema/1-0-0",
    "data": {
        "enabled": true,
        "vendor": "com.snowplowanalytics.snowplow.enrichments",
        "name": "yauaa_enrichment_config"
    }
}

More information about this enrichment can be found on the dedicated wiki page.

Read more

Snowplow 113 Filitosa

This release focuses on improvements to the Scala Stream Collector as well as new features for Scala Common Enrich such as HubSpot webhook integration and POST support in the API request enrichment.

Upgrading the Scala Stream Collector

A new version of the Scala Stream Collector incorporating the changes discussed above can be found on our Bintray.

To make use of this new version, you’ll need to amend your configuration in the following ways:

  • Add a collector.cors section to specify the Access-Control-Max-Age duration:
cors {
  accessControlMaxAge = 5 seconds # -1 seconds disables the cache
}
  • Add a collector.prometheusMetrics section:
prometheusMetrics {
  enabled = false
  durationBuckets = [0.1, 3, 10] # optional buckets by which to group by the `http_request_duration_seconds` metric
}
  • Modify the collector.doNotTrackCookie section if you want to make use of a regex:
doNotTrackCookie {
  enabled = true
  name = cookie-name
  value = ".*cookie-value.*"
}
  • Add the optional collector.streams.sink.producerConf if you want to specify additional Kafka producer configuration:
producerConf {
  acks = all
}

This also holds true for Stream Enrich enrich.streams.sourceSink.{producerConf, consumerConf}.

A full example configuration can be found in the repository.
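
To clarify what durationBuckets controls: Prometheus histogram buckets are cumulative, so each bucket counts the observations that are less than or equal to its upper bound, and an implicit +Inf bucket counts all of them. The sketch below reproduces that counting for a list of request durations in seconds; the function name is hypothetical and the code is independent of the collector.

```python
def bucket_counts(durations, buckets):
    """Count observations per cumulative bucket, Prometheus-histogram style."""
    bounds = sorted(buckets) + [float("inf")]  # +Inf bucket always exists
    counts = {bound: 0 for bound in bounds}
    for duration in durations:
        for bound in bounds:
            if duration <= bound:  # cumulative: a fast request lands in every larger bucket
                counts[bound] += 1
    return counts
```

With buckets `[0.1, 3, 10]`, a request taking 0.05 seconds is counted in all four buckets, while one taking 30 seconds only appears under +Inf.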

Upgrading your enrichment platform

If you are a GCP pipeline user, a new Beam Enrich can be found on Bintray.

If you are a Kinesis or Kafka pipeline user, a new Stream Enrich can be found on Bintray.

Finally, if you are a batch pipeline user, a new Spark Enrich can be used by setting the new version in your EmrEtlRunner configuration:

enrich:
  version:
    spark_enrich: 1.17.0 # WAS 1.16.0

or directly make use of the new Spark Enrich available at:

s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.17.0.jar

A new version of EmrEtlRunner is also available in our Bintray.

Read more

Snowplow 112 Baalbek

This release focuses on reliability improvements for the batch pipeline. It also introduces support for persistent EMR clusters.

EmrEtlRunner

The latest version of the EmrEtlRunner is available from our Bintray here.

Updating config.yml

A new setting is needed to enable or disable compaction of the output of the shred job.

aws:
  s3:
    consolidate_shredded_output: false

If you're not making use of any enrichment and contexts, you'll need to disable this setting.

For a complete example, see our sample config.yml template.

Clojure Collector

The new Clojure Collector is stored in S3 at: s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.1.3-standalone.war.

Read more

Snowplow 111 Selinunte

This small release adds CORS-related headers to POST requests as a follow-up of R110 which added them to OPTIONS requests.

Clojure Collector

The new Clojure Collector is stored in S3 at: s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.1.2-standalone.war.

Read more

Snowplow 110 Valle dei Templi

This release brings a new enrichment platform for Google Cloud Platform: Beam Enrich as well as a couple of bugfixes.

Beam Enrich

Beam Enrich is the latest enrichment platform released by Snowplow; it runs on Google Cloud Dataflow.

To know more, check out the following resources:

Stream Enrich

The new version of Stream Enrich can be found in our Bintray here.

It incorporates a fix for users of the PII enrichment.

Clojure Collector

The new Clojure Collector is stored in S3 at: s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.1.1-standalone.war.

It incorporates a fix for CORS requests.

Read more

Snowplow 109 Lambaesis

This release brings the possibility to enable end-to-end encryption for the batch pipeline as well as a way to specify the cookie path for the Clojure Collector.

UA parser enrichment

If you want to leverage the monthly-updated database of useragent regexes we host on S3, you'll need to update your enrichment configuration to the following:

{
  "schema": "iglu:com.snowplowanalytics.snowplow/ua_parser_config/jsonschema/1-0-1", # Was 1-0-0
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "ua_parser_config",
    "enabled": true,
    "parameters": {
      "database": "regexes.yaml",                                                    # New
      "uri": "s3://snowplow-hosted-assets/third-party/ua-parser/"                    # New
    }
  }
}

Note that this change is not mandatory.

Stream Enrich

If you are a real-time pipeline user, a version of Stream Enrich can be found on our Bintray here.

Spark Enrich

If you are a batch pipeline user, you'll need to either update your EmrEtlRunner configuration to the following:

enrich:
  version:
    spark_enrich: 1.16.0 # WAS 1.15.0

or directly make use of the new Spark Enrich available at: s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.16.0.jar.

Scala Stream Collector

The latest version of the Scala Stream Collector is available from our Bintray here.

Updating the configuration

collector {
  crossDomain {
    enabled = true
    domains = [ "*"] # WAS domain and not an array
    secure = true
  }

  doNotTrackCookie { # New section
    enabled = false
    name = cookie-name
    value = cookie-value
  }

  rootResponse {     # New section
    enabled = false
    statusCode = 200
    body = "ok"
  }
}

For a complete example, see our sample config.hocon template.

EmrEtlRunner

The latest version of the EmrEtlRunner is available from our Bintray here.

Updating config.yml

We encourage people to change their S3 buckets to use the s3a scheme because usage of the s3a protocol doesn't generate empty files:

aws:
  s3:
   raw:
     in:
       - "s3a://in-bucket"
     processing: "s3a://processing-bucket"
     archive: "s3a://archive-bucket/raw"
   enriched:
     good: "s3a://enriched-bucket/good"
     bad: "s3a://enriched-bucket/bad"
     errors: "s3a://enriched-bucket/errors"
     archive: "s3a://archive-bucket/enriched"
   shredded:
     good: "s3a://shredded-bucket/good"
     bad: "s3a://shredded-bucket/bad"
     errors: "s3a://shredded-bucket/errors"
     archive: "s3a://archive-bucket/shredded"

For a complete example, see our sample config.yml template.
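
If your configuration has many bucket paths, the scheme change can be applied programmatically. The sketch below assumes the `aws.s3` part of config.yml has already been loaded into a Python dictionary (for example with a YAML library, not shown) and rewrites only the raw, enriched and shredded sections shown above. The function name `use_s3a` is hypothetical.

```python
import copy
import re

_LEGACY_SCHEME = re.compile(r"^s3n?://")       # matches s3:// and s3n://, not s3a://
_SECTIONS = ("raw", "enriched", "shredded")    # the sections shown above


def _rewrite(node):
    """Recursively replace legacy schemes in strings, lists and dicts."""
    if isinstance(node, dict):
        return {key: _rewrite(value) for key, value in node.items()}
    if isinstance(node, list):
        return [_rewrite(value) for value in node]
    if isinstance(node, str):
        return _LEGACY_SCHEME.sub("s3a://", node)
    return node


def use_s3a(s3_config):
    """Return a copy of the s3 config with raw/enriched/shredded paths on s3a."""
    updated = copy.deepcopy(s3_config)
    for section in _SECTIONS:
        if section in updated:
            updated[section] = _rewrite(updated[section])
    return updated
```

Any other keys in the dictionary are deliberately left untouched.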

Read more

Snowplow 108 Val Camonica

This release brings the possibility to enable end-to-end encryption for the batch pipeline as well as a way to specify the cookie path for the Clojure Collector.

EmrEtlRunner

The latest version of the EmrEtlRunner is available from our Bintray here.

Updating config.yml

This release brings the possibility to interact with SSE-S3 (AES 256 managed by S3) encrypted buckets through aws:s3:buckets:encrypted.

Additionally, you can now specify an EMR security configuration, which lets you configure local disk encryption as well as in-transit encryption, through aws:emr:security_configuration:

aws:
  s3:
    buckets:
      encrypted: false # Can be true or false depending on whether you interact with SSE-S3 encrypted buckets
  emr:
    security_configuration: name-of-the-security-configuration # Leave blank if you don't use a security configuration
monitoring:
  snowplow:
    port: 8080     # New and optional
    protocol: http # New and optional

For a complete example, see our sample config.yml template.

For more background on end-to-end encryption for the batch pipeline, you can refer to our dedicated wiki page.

Clojure Collector

The new Clojure Collector is stored in S3 at: s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.1.0-standalone.war.

By default, the cookie path will now be /. However, it can be customized by adding a SP_PATH environment property to your Elastic Beanstalk application.

Read more

Snowplow 107 Trypillia

This release introduces the IAB Spiders & Robots enrichment for detecting bots and spiders, as well as new Marketo and Vero webhook adapters and fixes to the Google Analytics enrichment.

Stream Enrich

If you are a streaming pipeline user, a version of Stream Enrich incorporating the new IAB enrichment can be found on our Bintray here.

Spark Enrich

If you are a batch pipeline user, you'll need to either update your EmrEtlRunner configuration to the following:

enrich:
  version:
    spark_enrich: 1.15.0 # WAS 1.14.0

or directly make use of the new Spark Enrich available at: s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.15.0.jar.

Read more

Snowplow 106 Acropolis

This release adds further capabilities to the PII Pseudonymization Enrichment to both stream and batch enrich. Specifically, it adds the capability to emit a stream of events which contain the original along with the modified value. The PII transformation event also contains information about the field and the parent event (the event whence this PII event originated).

Upgrading Spark Enrich

To upgrade, update your EmrEtlRunner configuration to the following:

enrich:
  version:
    spark_enrich: 1.14.0 # WAS 1.13.0

Upgrading Stream Enrich

The latest version of Stream Enrich is available from our Bintray here.

The following configuration is needed to enable the pii stream:

enrich {

  streams {

    in {...}                                                    # NO CHANGE

    out {
      enriched = my-enriched-output-event-without-pii           # NO CHANGE

      bad = my-events-that-failed-validation-during-enrichment  # NO CHANGE

      pii = my-output-event-that-contains-only-pii              # NEW FIELD

      partitionKey = ""                                         # NO CHANGE
    }

    sourceSink {...}                                            # NO CHANGE

    buffer {...}                                                # NO CHANGE

    appName = "some-name"                                       # NO CHANGE
  }
}

In addition, you need to configure the enrichment to emit events and to use a salt when hashing:

{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/pii_enrichment_config/jsonschema/2-0-0", # NEW VERSION
  "data": {
    "vendor": "com.snowplowanalytics.snowplow.enrichments",                                           # NO CHANGE
    "name": "pii_enrichment_config",                                                                  # NO CHANGE
    "emitEvent": true,                                                                                # NEW FIELD
    "enabled": true,                                                                                  # NO CHANGE
    "parameters": {
      "pii": [...],                                                                                   # NO CHANGE
      "strategy": {
        "pseudonymize": {
          "hashFunction": "SHA-1",                                                                    # NO CHANGE
          "salt": "pepper123"                                                                         # NEW FIELD
        }
      }
    }
  }
}
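
The effect of the salt can be illustrated with a short sketch. It is not the enrichment's code: the function name is hypothetical, and appending the salt to the value before hashing is an assumption made for the example. The point is that the same input hashed with a different salt yields a different pseudonym, which makes precomputed lookup tables much less useful.

```python
import hashlib


def pseudonymize(value, salt, hash_function="SHA-1"):
    """Hash a PII value together with a salt (illustrative combination order)."""
    algorithm = hash_function.replace("-", "").lower()  # "SHA-1" -> "sha1"
    return hashlib.new(algorithm, (value + salt).encode("utf-8")).hexdigest()
```

Keep the salt secret and stable: changing it changes every pseudonym produced from then on.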

Read more

Snowplow 105 Pompeii

This release focuses on solving an issue with the real-time pipeline which may result in duplicate events if you're using Kinesis.

More information is available in issue #3745 and the dedicated Discourse post.

Upgrading Stream Enrich

A version of Stream Enrich incorporating a fix can be found on our Bintray here.

Read more

Snowplow 104 Stoplesteinan

This release most notably fixes the EmrEtlRunner Stream Enrich mode bugs introduced in R102. More information is available in issues #3717 and #3722.

Upgrading EmrEtlRunner

The latest version of the EmrEtlRunner is available from our Bintray here.

Read more

Snowplow 103 Paestum

This release upgrades the IP lookups enrichment.

IP lookups enrichment upgrade

Whether you are using the batch or streaming pipeline, it is important to perform this upgrade if you make use of the IP lookups enrichment.

To make use of the new enrichment, you will need to update your ip_lookups.json so that it conforms to the new 2-0-0 schema. An example is provided in the GitHub repository.

Stream Enrich

If you are a streaming pipeline user, a version of Stream Enrich incorporating the upgraded ip lookups enrichment can be found on our Bintray here.

Spark Enrich

If you are a batch pipeline user, you'll need to either update your EmrEtlRunner configuration to the following:

enrich:
  version:
    spark_enrich: 1.13.0 # WAS 1.12.0

or directly make use of the new Spark Enrich available at: s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.13.0.jar.

Clojure Collector

The new Clojure Collector is stored in S3 at: s3://snowplow-hosted-assets/2-collectors/clojure-collector/clojure-collector-2.0.0-standalone.war.

By default, the /crossdomain.xml route is disabled. It has to be manually re-enabled by adding the following two environment properties to your Elastic Beanstalk application:

  • SP_CDP_DOMAIN: the domain that is granted access, *.acme.com will match both http://acme.com and http://sub.acme.com.
  • SP_CDP_SECURE: a boolean indicating whether to only grant access to HTTPS or both HTTPS and HTTP sources

Read more

Snowplow 102 Afontova Gora

This release brings stability improvements and a new "Stream Enrich" mode to EmrEtlRunner.

Upgrading EmrEtlRunner

The latest version of the EmrEtlRunner is available from our Bintray here.

Upgrading for Lambda architecture users

To turn this mode on, you need to add a new aws.s3.buckets.enriched.stream property to your config.yml file.

aws:
  s3:
    buckets:
      enriched:
        stream: s3://path-to-kinesis/output/

For a complete example, we now have a dedicated sample stream_config.yml template - this shows what you need to set, and what you can remove.

Read more

Snowplow 101 Neapolis

This release brings initial support for Google Cloud Platform to the realtime pipeline.

Scala Stream Collector

The latest version of the Scala Stream Collector is available from our Bintray here.

Updating the configuration

collector {
  # Became non-optional
  crossDomain {
    enabled = true # NEW
    domain = "*"
    secure = true
  }
}

For a complete example, see our sample config.hocon template.

Launching the JAR

This release splits the JARs according to their targeted platform. As a result, you'll need to run one of the following depending on your needs:

java -jar snowplow-stream-collector-google-pubsub-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-kinesis-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-kafka-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-nsq-0.13.0.jar --config config.hocon
java -jar snowplow-stream-collector-stdout-0.13.0.jar --config config.hocon

Stream Enrich

The latest version of Stream Enrich is available from our Bintray here.

Updating the configuration

enrich {
  streams {
    in { ... }                         # UNCHANGED
    out { ... }                        # UNCHANGED
    sourceSink {                       # NEW SECTION
      enabled = kinesis
      region = eu-west-1
      aws {
        accessKey = iam
        secretKey = iam
      }
      maxRecords = 10000
      initialPosition = TRIM_HORIZON
      backoffPolicy {
        minBackoff = 50
        maxBackoff = 1000
      }
    }
    buffer { ... }                     # UNCHANGED
    appName = ""                       # UNCHANGED
  }
  monitoring { ... }                   # UNCHANGED
}

For a complete example, see our sample config.hocon template.

Read more

Snowplow 100 Epidaurus

This release lets you pseudonymize PII fields in your streaming pipeline.

Stream Enrich

The latest version of Stream Enrich is available from our Bintray here.

Updating Redshift tables

If you are using Redshift as a storage target, it is important to update the atomic.events table using a migration script, so that the new fields will fit.

Read more

Snowplow 99 Carnac

This release lets you seamlessly integrate Google Analytics events into your Snowplow batch pipeline.

Snowplow Google Analytics plugin

The Snowplow Google Analytics plugin lets you tee your Google Analytics payloads directly to a Snowplow collector to be further processed.

Check out the setup guide to know more.

Updating config.yml

To benefit from the Google Analytics integration you'll need Spark Enrich 1.12.0 or higher:

enrich:
  version:
    spark_enrich: 1.12.0      # WAS 1.11.0

For a complete example, see our sample config.yml template.

Read more

Snowplow 98 Argentomagus

This release brings support for the webhooks introduced in Release 97 to the realtime pipeline as well as some nifty features to the Scala Stream Collector.

Scala Stream Collector

The latest version of the Scala Stream Collector is available from our Bintray here.

Updating the configuration

collector {
  # Optional cross domain policy configuration.
  # To disable, remove the "crossDomain" configuration and the collector will respond with a 404 to
  # the /crossdomain.xml route.
  crossDomain {  # NEW
    domain = "*"
    secure = true
  }

  cookie {
    # ...

    # Optionally, specify the name of the header containing the originating protocol for use in the
    # bounce redirect location. Use this if behind a load balancer that performs SSL termination.
    # The value of this header must be http or https. Example, if behind an AWS Classic ELB.
    forwardedProtocolHeader = "X-Forwarded-Proto"  # NEW
  }

  # When enabled, the redirect url passed via the `u` query parameter is scanned for a placeholder
  # token. All instances of that token are replaced with the network ID. If the placeholder isn't
  # specified, the default value is `${SP_NUID}`.
  redirectMacro {  # NEW
    enabled = false
    placeholder = "[TOKEN]"
  }
}

For a complete example, see our sample config.hocon template.
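
The redirectMacro behaviour described in the configuration comments can be pictured with the sketch below: the redirect target is read from the `u` query parameter, and every occurrence of the placeholder is replaced with the network user ID. The function name and the example URLs are hypothetical; only the `u` parameter, the placeholder and the `${SP_NUID}` default come from the comments above.

```python
from urllib.parse import parse_qs, urlsplit


def expand_redirect(request_url, network_user_id, enabled=True, placeholder="${SP_NUID}"):
    """Return the redirect target with the placeholder replaced by the network ID."""
    query = parse_qs(urlsplit(request_url).query)  # also percent-decodes the values
    target = query.get("u", [None])[0]
    if target is None:
        return None  # no redirect target in the request
    if enabled:
        target = target.replace(placeholder, network_user_id)  # all occurrences
    return target
```

With `enabled = false`, the target is returned as received and the token is left in place.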

Stream Enrich

The latest version of Stream Enrich is available from our Bintray here.

Read more

Snowplow 97 Knossos

This release brings four new webhook adapters (Mailgun, StatusGator, Unbounce, Olark) to Snowplow. Follow the corresponding webhook set-up guide in Setting up a webhook.

Upgrade steps

Upgrading EmrEtlRunner

The latest version of the EmrEtlRunner is available from our Bintray here.

Updating config.yml

enrich:
  version:
    spark_enrich: 1.11.0      # WAS 1.10.0

For a complete example, see our sample config.yml template.

Read more

Snowplow 96 Zeugma

This release brings NSQ support to the Scala Stream Collector and Stream Enrich.

Scala Stream Collector

The latest version of the Scala Stream Collector is available from our Bintray here.

Updating the configuration

collector {

  #sink = kinesis                     # REMOVED

  streams {

    sink {                            # ADDED
      enabled = kinesis               # or kafka or nsq

      # only the corresponding config is needed (e.g. kinesis or kafka config)
    }
  }
}

For a complete example, see our sample config.hocon template.

Stream Enrich

The latest version of Stream Enrich is available from our Bintray here.

Read more

Snowplow 95 Ellora

This release introduces ZSTD encoding to the Redshift model and updates the Spark components to 2.2.0, which is included in AMI 5.9.0.

Upgrade steps

Upgrading EmrEtlRunner

The latest version of the EmrEtlRunner is available from our Bintray here.

Updating config.yml

This release updates the Spark Enrich and RDB Shredder jobs to Spark 2.2.0. As a result, an AMI bump is warranted. RDB Loader has been updated too:

aws:
  # ...
  emr:
    ami_version: 5.9.0        # WAS 5.5.0
    # ...
enrich:
  version:
    spark_enrich: 1.10.0      # WAS 1.9.0
storage:
  versions:
    rdb_loader: 0.14.0        # WAS 0.13.0
    rdb_shredder: 0.13.0      # WAS 0.12.0

For a complete example, see our sample config.yml template.

Updating Redshift tables

Unlocking ZSTD compression relies on updating the atomic.events table through a migration script.

This script assumes that you're currently on version 0.8.0 of the atomic.events table. If you're upgrading from an earlier version, please refer to the appropriate migration script to get to version 0.8.0 first.

Updating the Redshift storage target

If you rely on an SSH tunnel to connect the RDB Loader to your Redshift cluster, you'll need to update your Redshift storage target to 2-1-0. Refer to the schema to incorporate a properly formatted sshTunnel field.

Updating your Iglu resolver

We've set up a mirror of Iglu central on Google Cloud Platform to maintain high availability in case of S3 outages. To benefit from this mirror, you'll need to add the following repository to your Iglu resolver JSON file:

{
  "name": "Iglu Central - Mirror 01",
  "priority": 1,
  "vendorPrefixes": [ "com.snowplowanalytics" ],
  "connection": {
    "http": {
      "uri": "http://mirror01.iglucentral.com"
    }
  }
}
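
Adding the mirror can also be scripted. The sketch below assumes the resolver file has been parsed into a dictionary whose repositories live under `data.repositories`; the helper name `add_repository` is hypothetical. It skips the addition when a repository with the same name is already present.

```python
import copy

MIRROR = {
    "name": "Iglu Central - Mirror 01",
    "priority": 1,
    "vendorPrefixes": ["com.snowplowanalytics"],
    "connection": {"http": {"uri": "http://mirror01.iglucentral.com"}},
}


def add_repository(resolver, repository=MIRROR):
    """Return a copy of the resolver with the repository appended if it is missing."""
    updated = copy.deepcopy(resolver)
    repositories = updated["data"]["repositories"]  # assumed location of the list
    if all(repo.get("name") != repository["name"] for repo in repositories):
        repositories.append(copy.deepcopy(repository))
    return updated
```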

Read more

Snowplow 94 Hill of Tara

This release fixes an issue in Stream Enrich introduced in R93.

The latest version of Stream Enrich is available from our Bintray here.

Read more

Snowplow 93 Virunum

This release refreshes the streaming Snowplow pipeline: the Scala Stream Collector and Stream Enrich.

Scala Stream Collector

The latest version of the Scala Stream Collector is available from our Bintray here.

Updating the configuration

collector {
  cookieBounce {                                                   # NEW
    enabled = false
    name = "n3pc"
    fallbackNetworkUserId = "00000000-0000-4000-A000-000000000000"
  }

  sink = kinesis                                                   # WAS sink.enabled

  streams {                                                        # REORGANIZED
    good = good-stream
    bad = bad-stream

    kinesis {
      // ...
    }

    kafka {
      // ...
      retries = 0                                                  # NEW
    }
  }
}

akka {
  http.server {                                                    # WAS spray.can.server
    // ...
  }
}

For a complete example, see our sample config.hocon template.

Launching

The Scala Stream Collector is no longer an executable jar. As a result, it will have to be launched as:

java -jar snowplow-stream-collector-0.10.0.jar --config config.hocon

Stream Enrich

The latest version of Stream Enrich is available from our Bintray here.

Updating the configuration

enrich {
  // ...
  streams {
    // ...
    out {
      // ...
      partitionKey = user_ipaddress             # NEW
    }

    kinesis {                                   # REORGANIZED
      // ...
      initialTimestamp = "2017-05-17T10:00:00Z" # NEW but optional
      backoffPolicy {                           # MOVED
        // ...
      }
    }

    kafka {
      // ...
      retries = 0                               # NEW
    }
  }
}

For a complete example, see our sample config.hocon template.

Launching

Stream Enrich is no longer an executable jar. As a result, it will have to be launched as:

java -jar snowplow-stream-enrich-0.11.0.jar --config config.hocon --resolver file:resolver.json

Additionally, a new --force-ip-lookups-download flag has been introduced in order to force the download of the ip lookup database when the application starts.

Read more

Snowplow 92 Maiden Castle

This release most notably fixes a bug that occurred when the shred step was skipped. More information is available in issue #3403 and the dedicated Discourse post.

Upgrade steps

Upgrading EmrEtlRunner

The latest version of the EmrEtlRunner is available from our Bintray here.

Updating config.yml

To update RDB Loader, make the following change to your configuration YAML:

storage:
  versions:
    rdb_loader: 0.13.0        # WAS 0.12.0

For a complete example, see our sample config.yml template.

Read more

Snowplow 91 Stonehenge

This release revolves around making EmrEtlRunner, the component launching the EMR steps for the batch pipeline, significantly more robust. Most notably, this release fixes a long-standing bug in the way the staging step was performed, which affected all users of the Clojure Collector (issue #3085).

Upgrade steps

Upgrading EmrEtlRunner

The latest version of the EmrEtlRunner is available from our Bintray here.

Make sure to use the run command when launching EmrEtlRunner, for example:

./snowplow-emr-etl-runner run \
  -c config.yml \
  -r resolver.json

Additionally, it is advised to set up a local (through a file) or distributed (through Consul) lock:

./snowplow-emr-etl-runner run \
  -c       config.yml \
  -r       resolver.json \
  --lock   path/to/lock \
  --consul http://127.0.0.1:8500 # Optional address to your Consul server

Read more

Snowplow 90 Lascaux

This release introduces RDB Loader, a new EMR-run application replacing our StorageLoader, as proposed in our Splitting EmrEtlRunner RFC. This release also brings various enhancements and alterations in EmrEtlRunner.

Upgrade steps

Upgrading EmrEtlRunner

The latest version of the EmrEtlRunner is available from our Bintray here.

Updating config.yml

In order to use RDB Loader you need to make the following addition to your configuration YAML:

storage:
  versions:
    rdb_loader: 0.12.0        # NEW

The following settings are no longer needed, because Postgres loading now also happens on the EMR node, and can therefore be deleted:

storage:
  download:                   # REMOVE
    folder:                   # REMOVE

To configure your EMR applications in finer detail, you can add the optional emr.configuration property:

emr:
  configuration:                                  # NEW
    yarn-site:
      yarn.resourcemanager.am.max-attempts: "1"
    spark:
      maximizeResourceAllocation: "true"

For a complete example, see our sample config.yml template.
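The two storage changes above can be sketched as a small Python function that works on the parsed config.yml as a plain dict. This is only an illustration, not part of EmrEtlRunner: the function name upgrade_storage_to_r90 is hypothetical, it assumes storage is a top-level key of the mapping you pass in, and it leaves the optional emr.configuration property for you to add by hand.

```python
import copy


def upgrade_storage_to_r90(config):
    """Return a copy of a parsed config.yml with the R90 storage changes applied."""
    upgraded = copy.deepcopy(config)
    storage = upgraded.setdefault("storage", {})
    # NEW: version of RDB Loader to run on EMR
    storage.setdefault("versions", {})["rdb_loader"] = "0.12.0"
    # REMOVE: storage.download.folder is no longer used
    storage.pop("download", None)
    return upgraded


# Hypothetical pre-R90 fragment, for illustration only
old_config = {
    "storage": {
        "versions": {"rdb_shredder": "0.12.0"},
        "download": {"folder": "/var/storageloader"},
    }
}
new_config = upgrade_storage_to_r90(old_config)
print(new_config["storage"])
```

The function copies its input, so the original mapping can be kept around for comparison.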

Updating EmrEtlRunner scripts

EmrEtlRunner now accepts a new --include option with a single possible vacuum argument, which will be passed to RDB Loader.

Also, --skip now accepts new rdb_load, archive_enriched and analyze arguments. Skipping rdb_load and archive_enriched steps is identical to running R89 EmrEtlRunner without StorageLoader.

Finally, note that the StorageLoader is no longer part of the batch pipeline apps archive.

Creating IAM Role for Redshift

As RDB Loader is an EMR step now, we wanted to make sure that user's AWS credentials are not exposed anywhere. To load Redshift we're using IAM Roles, which allow Redshift to load data from S3.

To create an IAM Role, go to AWS Console » IAM » Roles » Create new role. Then choose Amazon Redshift » AmazonS3ReadOnlyAccess and pick a role name, for example "RedshiftLoadRole". Once the role is created, copy the Role ARN, as you will need it later when updating the storage configs.

Now attach the new role to your running Redshift cluster: go to AWS Console » Redshift » Clusters » Manage IAM Roles and attach the role you just created.

Whitelisting EMR in Redshift

Your EMR cluster’s master node will need to be whitelisted in Redshift in order to perform the load.

If you are using an "EC2 Classic" environment, from the Redshift UI you will need to create a Cluster Security Group and add the relevant EC2 Security Group, most likely called ElasticMapReduce-master. Make sure to enable this Cluster Security Group against your Redshift cluster.

If you are using a modern VPC-based environment, you will need to modify the Redshift cluster, and add a VPC security group, most likely called ElasticMapReduce-Master-Private.

In both cases, you only need to whitelist access from the EMR master node, because RDB Loader runs exclusively from the master node.

Updating Storage configs

We have updated the Redshift storage target config - the new version requires the Role ARN that you noted down above:

{
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-0-0",       // WAS 1-0-0
    "data": {
        "name": "AWS Redshift enriched events storage",
        ...
        "roleArn": "arn:aws:iam::719197435995:role/RedshiftLoadRole",                               // NEW
        ...
    }
}
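As a rough sketch of the change above, the following Python function bumps an existing Redshift target JSON (parsed into a dict) to the 2-0-0 schema and adds the role ARN. The function name upgrade_redshift_target is hypothetical, and only the two changes shown above are applied; check the 2-0-0 schema itself for any other requirements.

```python
import copy

REDSHIFT_CONFIG_2_0_0 = (
    "iglu:com.snowplowanalytics.snowplow.storage/"
    "redshift_config/jsonschema/2-0-0"
)


def upgrade_redshift_target(target, role_arn):
    """Return a copy of a Redshift target config with the schema bumped and roleArn set."""
    upgraded = copy.deepcopy(target)
    upgraded["schema"] = REDSHIFT_CONFIG_2_0_0       # WAS 1-0-0
    upgraded["data"]["roleArn"] = role_arn           # NEW
    return upgraded


# Minimal illustrative input; a real target has more properties under "data"
old_target = {
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/1-0-0",
    "data": {"name": "AWS Redshift enriched events storage"},
}
new_target = upgrade_redshift_target(
    old_target, "arn:aws:iam::719197435995:role/RedshiftLoadRole"
)
print(new_target["schema"])
```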

Read more

Snowplow 89 Plain of Jars

This release ports the batch pipeline from Twitter Scalding to Apache Spark.

Upgrade steps

Upgrading EmrEtlRunner and StorageLoader

The latest versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.

Updating config.yml

  1. Update ami_version to 5.5.0
  2. Move job_name to aws -> emr -> jobflow
  3. Remove hadoop_shred from enrich -> versions
  4. Add rdb_shredder to a newly created storage -> versions
  5. Move hadoop_elasticsearch to storage -> versions
  6. Replace hadoop_enrich by spark_enrich
aws:
  emr:
    ami_version: 5.5.0          # WAS 4.5.0
    . . .
    jobflow:
      job_name: Snowplow ETL    # MOVED FROM enrich:
enrich:
  versions:
    spark_enrich: 1.9.0         # WAS 1.8.0
storage:
  versions:
    rdb_shredder: 0.12.0        # WAS 0.11.0
    hadoop_elasticsearch: 0.1.0 # UNCHANGED BUT MOVED

For a complete example, see our sample config.yml template.
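The six steps above can be sketched as a Python function operating on the parsed config.yml as nested dicts. This is an illustrative sketch rather than an official migration tool: the function name upgrade_config_to_r89 is hypothetical, and it assumes the pre-R89 layout keeps job_name and the versions block under enrich.

```python
import copy


def upgrade_config_to_r89(config):
    """Return a copy of a parsed pre-R89 config.yml with the six R89 steps applied."""
    cfg = copy.deepcopy(config)
    emr = cfg["aws"]["emr"]
    enrich = cfg["enrich"]
    enrich_versions = enrich["versions"]
    storage_versions = cfg.setdefault("storage", {}).setdefault("versions", {})

    emr["ami_version"] = "5.5.0"                                   # step 1
    if "job_name" in enrich:                                       # step 2
        emr.setdefault("jobflow", {})["job_name"] = enrich.pop("job_name")
    enrich_versions.pop("hadoop_shred", None)                      # step 3
    storage_versions["rdb_shredder"] = "0.12.0"                    # step 4
    if "hadoop_elasticsearch" in enrich_versions:                  # step 5
        storage_versions["hadoop_elasticsearch"] = enrich_versions.pop(
            "hadoop_elasticsearch"
        )
    enrich_versions.pop("hadoop_enrich", None)                     # step 6
    enrich_versions["spark_enrich"] = "1.9.0"
    return cfg


# Hypothetical R88-style fragment, for illustration only
r88_config = {
    "aws": {"emr": {"ami_version": "4.5.0", "jobflow": {}}},
    "enrich": {
        "job_name": "Snowplow ETL",
        "versions": {
            "hadoop_enrich": "1.8.0",
            "hadoop_shred": "0.11.0",
            "hadoop_elasticsearch": "0.1.0",
        },
    },
}
r89_config = upgrade_config_to_r89(r88_config)
print(r89_config)
```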

Note that using the Spark artifacts is incompatible with instance types having only one virtual CPU, such as m1.medium.

Read more

Snowplow 88 Angkor Wat

This release introduces event de-duplication across different pipeline runs, powered by DynamoDB, along with an important refactoring of the batch pipeline configuration.

Upgrade steps

Upgrading EmrEtlRunner and StorageLoader

The latest versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.

Creating new targets configuration

Storage targets configuration JSONs can be generated from your existing config.yml, using the 3-enrich/emr-etl-runner/config/convert_targets.rb script. These files should be stored in a folder, for example called targets, alongside your existing enrichments folder.

When complete, your folder layout will look something like this:

snowplow_config
├── config.yml
├── enrichments
│   ├── campaign_attribution.json
│   ├── ...
│   ├── user_agent_utils_config.json
├── iglu_resolver.json
├── targets
│   ├── duplicate_dynamodb.json
│   ├── enriched_redshift.json

For complete examples, see our storage target configuration JSONs. The properties are explained on the wiki page.

Updating config.yml

  1. Remove the whole storage.targets section (leaving storage.download.folder) from your config.yml file
  2. Update the hadoop_shred job version in your configuration YAML like so:
versions:
  hadoop_enrich: 1.8.0        # UNCHANGED
  hadoop_shred: 0.11.0        # WAS 0.10.0
  hadoop_elasticsearch: 0.1.0 # UNCHANGED

For a complete example, see our sample config.yml template.

Update EmrEtlRunner and StorageLoader scripts

  1. Append the option --targets $TARGETS_DIR to both snowplow-emr-etl-runner and snowplow-storage-loader applications
  2. Append the option --resolver $IGLU_RESOLVER to snowplow-storage-loader application. This is required to validate the storage target configurations

Enabling cross-batch de-duplication (optional)

Please be aware that enabling this will have a potentially high cost and performance impact on your Snowplow batch pipeline.

If you want to start to deduplicate events across batches you need to add a new DynamoDB config target to your newly created targets directory.

Optionally, before the first run of the Shred job with cross-batch deduplication, you may want to run Event Manifest Populator to back-fill the DynamoDB table.

When Relational Database Shredder runs and the table doesn’t exist, it is created automatically with the schema required to store and deduplicate events. By default, its provisioned throughput is set to 100 write capacity units and 100 read capacity units.

For relatively low volumes (1M events per run), the default settings will likely just work. However, we do strongly recommend monitoring the EMR job, and its AWS billing impact, closely and tweaking DynamoDB provisioned throughput and your EMR cluster specification accordingly.
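To get a feel for how the provisioned write throughput relates to run time, here is a back-of-the-envelope estimate in Python. It assumes one DynamoDB write per event and items of at most 1 KB, so that each write consumes one write capacity unit; both figures are assumptions made for illustration, not measurements of the Shred job, so treat the result as a rough guide to what to watch when monitoring rather than a prediction.

```python
def min_write_seconds(events, write_capacity_units, wcu_per_write=1):
    """Lower bound on the time needed to write `events` items at a sustained rate.

    Ignores DynamoDB burst capacity, retries and throttling back-off.
    """
    return events * wcu_per_write / write_capacity_units


# 1M events against the default 100 write capacity units
seconds = min_write_seconds(1_000_000, 100)
print(f"{seconds:.0f} s (~{seconds / 3600:.1f} h)")  # prints "10000 s (~2.8 h)"
```

Under these assumptions, doubling the write capacity units halves the estimate, which is the trade-off against AWS cost that the monitoring advice above is about.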

Read more

Snowplow 87 Chichen Itza

This release contains a wide array of new features, stability enhancements and performance improvements for EmrEtlRunner and StorageLoader. As of this release EmrEtlRunner lets you specify EBS volumes for your Hadoop worker nodes; meanwhile StorageLoader now writes to a dedicated manifest table to record each load.

Upgrade steps

Upgrading EmrEtlRunner and StorageLoader

The latest versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.

Updating config.yml

To make use of the new ability to specify EBS volumes for your EMR cluster’s core nodes, update your configuration YAML like so:

    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 1
      core_instance_type: c4.2xlarge
      core_instance_ebs:   # Optional. Attach an EBS volume to each core instance.
        volume_size: 200    # Gigabytes
        volume_type: "io1"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true

The above configuration will attach an EBS volume of 200 GiB to each core instance in your EMR cluster; the volumes will be Provisioned IOPS (SSD), provisioned with 400 IOPS each. The volumes will not be EBS optimized. Note that this configuration finally allows us to use the EBS-only c4 instance types for our core nodes.

For a complete example, see our sample config.yml template.

Upgrading Redshift

You will also need to deploy the following manifest table for Redshift:

This table should be deployed into the same schema as your events and other tables.

Read more

Snowplow 86 Petra

This release introduces additional event de-duplication functionality for our Redshift load process, plus a brand new data model that makes it easier to get started with web data. It also adds support for AWS’s newest regions: Ohio, Montreal and London.

Upgrade steps

Upgrading is simple - update the hadoop_shred job version in your configuration YAML like so:

versions:
  hadoop_enrich: 1.8.0        # UNCHANGED
  hadoop_shred: 0.10.0        # WAS 0.9.0
  hadoop_elasticsearch: 0.1.0 # UNCHANGED

For a complete example, see our sample config.yml template.

You will also need to deploy the following table for Redshift:

Read more

Snowplow 85 Metamorphosis

This release brings initial beta support for using Apache Kafka with the Snowplow real-time pipeline, as an alternative to Amazon Kinesis.

Please note that this Kafka support is extremely beta - we want you to use it and test it; do not use it in production.

Upgrade steps

The real-time apps for R85 Metamorphosis are available in the following zipfiles:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_scala_stream_collector_0.9.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_0.10.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_elasticsearch_sink_0.8.0_1x.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_elasticsearch_sink_0.8.0_2x.zip

Or you can download all of the apps together in this zipfile:

https://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r85_metamorphosis.zip

To upgrade the Stream Collector application:

  • Install the new Collector on each server in your auto-scaling group
  • Upgrade your config by:
    • Moving the collector.sink.kinesis.buffer section to collector.sink.buffer, as this section is now used to configure limits for both Kinesis and Kafka.
    • Adding a new section within the collector.sink block:
collector {
  ...

  sink {
    ...

    buffer {
      byte-limit:
      record-limit:  # Not supported by Kafka; will be ignored
      time-limit:
    }
    ...

    kafka {
      brokers: ""

      # Data will be stored in the following topics
      topic {
        good: ""
        bad: ""
      }
    }
    ...
  }
}
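The two collector changes can be sketched in Python, assuming the HOCON file has already been parsed into nested dicts by whatever tool you prefer. The function name upgrade_collector_to_r85 is hypothetical, and the empty-string Kafka placeholders simply mirror the template above.

```python
import copy


def upgrade_collector_to_r85(config):
    """Return a copy of a parsed collector config with the R85 sink changes applied."""
    cfg = copy.deepcopy(config)
    sink = cfg["collector"]["sink"]
    kinesis = sink.get("kinesis", {})
    # Move collector.sink.kinesis.buffer to collector.sink.buffer
    if "buffer" in kinesis:
        sink["buffer"] = kinesis.pop("buffer")
    # Add the new kafka section, leaving values for the operator to fill in
    sink.setdefault("kafka", {"brokers": "", "topic": {"good": "", "bad": ""}})
    return cfg


# Hypothetical pre-R85 fragment, for illustration only
old_collector = {
    "collector": {
        "sink": {
            "kinesis": {
                "buffer": {"byte-limit": 4000000, "record-limit": 500, "time-limit": 5000}
            }
        }
    }
}
new_collector = upgrade_collector_to_r85(old_collector)
print(new_collector["collector"]["sink"])
```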

To upgrade the Stream Enrich application:

  • Install the new Stream Enrich on each server in your auto-scaling group
  • Upgrade your config by:
    • Adding a new section within the enrich block:
enrich {
  ...

  # Kafka configuration
  kafka {
    brokers: "localhost:9092"
  }

  ...
}

Note: The app-name defined in your config will be used as your Kafka consumer group ID.

Read more

Snowplow 84 Steller's Sea Eagle

The Kinesis apps for R84 Steller's Sea Eagle are available in the following zipfiles:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_collector_0.8.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_0.9.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_elasticsearch_sink_0.8.0.zip

Or you can download all of the apps together in this zipfile:

https://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r84_stellers_sea_eagle.zip

Only the Elasticsearch Sink app config has changed. The change does not include breaking config changes. To upgrade the Elasticsearch Sink:

  • Install the new Elasticsearch Sink app on each server in your Elasticsearch Sink auto-scaling group
  • Update your Elasticsearch Sink config with the new elasticsearch.client.http section:
    • elasticsearch.client.http.conn-timeout
    • elasticsearch.client.http.read-timeout

NOTE: These timeouts are optional and will default to 300000 if they cannot be found in your Config.

See our sample config.hocon template.

Read more

Snowplow 83 Bald Eagle

This release introduces our powerful new SQL Query Enrichment, long-awaited support for the EU Frankfurt AWS region (eu-central-1), plus POST support for our Iglu webhook adapter.

Upgrade steps

Update the hadoop_enrich job version in your configuration YAML like so:

versions:
  hadoop_enrich: 1.8.0        # WAS 1.7.0
  hadoop_shred: 0.9.0         # UNCHANGED
  hadoop_elasticsearch: 0.1.0 # UNCHANGED

For a complete example, see our sample config.yml template.

Read more

Snowplow 82 Tawny Eagle

This is a real-time pipeline release. This release updates the Kinesis Elasticsearch Sink with support for sending events via HTTP, allowing us to support Amazon Elasticsearch Service.

Upgrade steps

The Kinesis apps for R82 Tawny Eagle are all available in a single zip file here:

https://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r82_tawny_eagle.zip

The individual Kinesis apps for R82 Tawny Eagle are also available in the following zipfiles:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_collector_0.7.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_0.8.1.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_elasticsearch_sink_0.7.0.zip

Only the Elasticsearch Sink app has actually changed. The change does, however, include breaking config changes, so you will need to make some minor changes to your configuration file. To upgrade the Elasticsearch Sink:

  1. Install the new Elasticsearch Sink app on each server in your Elasticsearch Sink auto-scaling group
  2. Update your Elasticsearch Sink config with the new elasticsearch section:
    • The only new fields are elasticsearch.client.type and elasticsearch.client.port
    • The following fields have been renamed:
      • elasticsearch.cluster-name is now elasticsearch.cluster.name
      • elasticsearch.endpoint is now elasticsearch.client.endpoint
      • elasticsearch.max-timeout is now elasticsearch.client.max-timeout
      • elasticsearch.index is now elasticsearch.cluster.index
      • elasticsearch.type is now elasticsearch.cluster.type
  3. Update your supervisor process to point to the new Kinesis Elasticsearch Sink app
  4. Restart the supervisor process on each server running the sink
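The field renames in the elasticsearch section can be sketched as a Python function over the old section parsed into a flat dict. The function name upgrade_elasticsearch_section is hypothetical; the two new fields, client.type and client.port, are deliberately not set here and still have to be added by hand.

```python
# Old key -> (new sub-section, new key), as listed in the renames above
RENAMES = {
    "cluster-name": ("cluster", "name"),
    "endpoint": ("client", "endpoint"),
    "max-timeout": ("client", "max-timeout"),
    "index": ("cluster", "index"),
    "type": ("cluster", "type"),
}


def upgrade_elasticsearch_section(old_section):
    """Return the R82 layout of an old, flat elasticsearch config section."""
    new_section = {"client": {}, "cluster": {}}
    for key, value in old_section.items():
        if key in RENAMES:
            group, name = RENAMES[key]
            new_section[group][name] = value
        else:
            # Keys that were not renamed are carried over unchanged
            new_section[key] = value
    return new_section


# Hypothetical values, for illustration only
old_es = {
    "cluster-name": "elasticsearch",
    "endpoint": "localhost",
    "max-timeout": "10000",
    "index": "snowplow",
    "type": "enriched",
}
new_es = upgrade_elasticsearch_section(old_es)
print(new_es)
```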

Read more

Snowplow 81 Kangaroo Island Emu

This is a real-time pipeline release. At the heart of it is the Hadoop Event Recovery project, which allows you to fix up Snowplow bad rows and make them ready for reprocessing.

Upgrade steps

The Kinesis apps for R81 Kangaroo Island Emu are all available in a single zip file here:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r81_kangaroo_island_emu.zip

Only the Stream Enrich app has actually changed. The change is not breaking, so you don’t have to make any changes to your configuration file. To upgrade Stream Enrich:

  • Install the new Stream Enrich app on each server in your Stream Enrich auto-scaling group
  • Update your supervisor process to point to the new Stream Enrich app
  • Restart the supervisor process on each server running Stream Enrich

Read more

Snowplow 80 Southern Cassowary

This is a real-time pipeline release which improves stability and brings the real-time pipeline up-to-date with our Hadoop pipeline.

As a result, you can now use R79 Black Swan’s API Request Enrichment and the HTTP Header Extractor Enrichment in your real-time pipeline. Also, you can now configure the number of records that the Kinesis Client Library should retrieve with each call to GetRecords.

Upgrade steps

Kinesis applications

The Kinesis apps for R80 Southern Cassowary are all available in a single zip file here:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r80_southern_cassowary.zip

There are no breaking changes in this release - you can upgrade the individual Kinesis apps without worrying about having to update the configuration files or indeed the Kinesis streams.

Configuration files

If you want to configure how many records Stream Enrich should read from Kinesis at a time, update its configuration file to add a maxRecords property like so:

enrich {
  ...
  streams {
    in: {
      ...
      maxRecords: 5000 # Default is 10000
      ...

If you want to configure how many records Kinesis Elasticsearch Sink should read from Kinesis at a time, again update its configuration file to add a maxRecords property:

sink {
  ...
  kinesis {
    in: {
      ...
      maxRecords: 5000 # Default is 10000
      ...

Read more

Snowplow 79 Black Swan

This release introduces our powerful new API Request Enrichment, plus a new HTTP Header Extractor Enrichment and several other improvements on the enrichments side.

It also updates the Iglu client used by our Spark Enrich and Relational Database Shredder components. Version 1.4.0 lets you fetch your schemas from Iglu registries with authentication support, allowing you to keep your proprietary schemas private.

Upgrade steps

Configuration file

The recommended AMI version to run Snowplow is now 4.5.0 - update your configuration YAML as follows:

emr:
  ami_version: 4.5.0 # WAS 4.3.0

Next, update your hadoop_enrich and hadoop_shred job versions like so:

versions:
  hadoop_enrich: 1.7.0        # WAS 1.6.0
  hadoop_shred: 0.9.0         # WAS 0.8.0
  hadoop_elasticsearch: 0.1.0 # UNCHANGED

For a complete example, see our sample config.yml template.

JSON resolver

If you want to use an Iglu registry with authentication, add a private apikey to the registry’s configuration entry and set the schema version to 1-0-1 as in the example below.

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
      {
        "name": "Private Acme repository for com.acme",
        "priority": 1,
        "vendorPrefixes": [ "com.acme" ],
        "connection": {
          "http": {
            "uri": "http://iglu.acme.com/api",
            "apikey": "APIKEY-FOR-ACME"
          }
        }
      }
    ]
  }
}
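The same change can be sketched as a Python function applied to an existing resolver JSON that has been parsed into a dict. The function name add_registry_apikey is hypothetical; it only performs the two edits described above, namely bumping the schema version and adding the apikey to one registry.

```python
import copy

RESOLVER_CONFIG_1_0_1 = "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1"


def add_registry_apikey(resolver, registry_name, apikey):
    """Return a copy of the resolver config with an apikey set on the named registry."""
    updated = copy.deepcopy(resolver)
    updated["schema"] = RESOLVER_CONFIG_1_0_1
    for repo in updated["data"]["repositories"]:
        if repo["name"] == registry_name:
            repo["connection"]["http"]["apikey"] = apikey
            break
    else:
        raise ValueError(f"no registry named {registry_name!r} in resolver config")
    return updated


old_resolver = {
    "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
    "data": {
        "cacheSize": 500,
        "repositories": [
            {
                "name": "Private Acme repository for com.acme",
                "priority": 1,
                "vendorPrefixes": ["com.acme"],
                "connection": {"http": {"uri": "http://iglu.acme.com/api"}},
            }
        ],
    },
}
new_resolver = add_registry_apikey(
    old_resolver, "Private Acme repository for com.acme", "APIKEY-FOR-ACME"
)
print(new_resolver["data"]["repositories"][0]["connection"])
```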

Read more

Snowplow 78 Great Hornbill

This release brings our Kinesis pipeline functionally up-to-date with our Hadoop pipeline, and makes various further improvements to the Kinesis pipeline.

Upgrade steps

The Kinesis apps for R78 Great Hornbill are now all available in a single zip file here:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r78_great_hornbill.zip

Scala Kinesis Enrich has been renamed to Stream Enrich. The name of the artifact has changed to "snowplow-stream-enrich".

Configuration file

Upgrading will require the following configuration changes to the applications' individual HOCON configuration files.

Scala Stream Collector

Add a collector.cookie.name field to the HOCON and set its value to "sp".

Also, note that the configuration file no longer supports loading AWS credentials from the classpath using ClasspathPropertiesFileCredentialsProvider. If your configuration looks like this:

{
    "aws": {
        "access-key": "cpf",
        "secret-key": "cpf"
    }
}

then you should change "cpf" to "default" to use the DefaultAWSCredentialsProviderChain. You will need to ensure that your credentials are available in one of the places the AWS Java SDK looks. For more information about this, see the Javadoc.
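As a small sketch of that substitution, the Python function below replaces every "cpf" value in a parsed aws section with "default". The function name replace_cpf_credentials is hypothetical, and values other than "cpf" are left untouched.

```python
def replace_cpf_credentials(aws_section):
    """Return a copy of the aws section with "cpf" credentials switched to "default"."""
    return {
        key: ("default" if value == "cpf" else value)
        for key, value in aws_section.items()
    }


old_aws = {"access-key": "cpf", "secret-key": "cpf"}
new_aws = replace_cpf_credentials(old_aws)
print(new_aws)
```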

Kinesis Elasticsearch Sink

Replace the sink.kinesis.out string with an object with two fields:

{
    "sink": {
        "good": "elasticsearch",  # or "stdout"
        "bad": "kinesis"          # or "stderr" or "none"
    }
}

Additionally, move the stream-type setting from the sink.kinesis.in section to the sink section.

If you are loading Snowplow bad rows into, for example, Elasticsearch, please make sure to update all applications.

For a complete example, see our sample config.hocon template.

Read more

Snowplow 77 Great Auk

This release focuses on the command-line applications used to orchestrate Snowplow, bringing Snowplow up-to-date with the new 4.x series of Elastic MapReduce releases.

Upgrade steps

Running EmrEtlRunner and StorageLoader as Ruby (rather than JRuby) apps is no longer actively supported.

The latest versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.

Note that the snowplow-runner-and-loader.sh script has also been updated to use the JRuby binaries rather than the raw Ruby project.

Configuration file

The recommended AMI version to run Snowplow is now 4.3.0 - update your configuration YAML as follows:

emr:
  ami_version: 4.3.0 # WAS 3.7.0

You will need to update the jar versions in the same section:

  versions:
    hadoop_enrich: 1.6.0        # WAS 1.5.1
    hadoop_shred: 0.8.0         # WAS 0.7.0
    hadoop_elasticsearch: 0.1.0 # UNCHANGED

For a complete example, see our sample config.yml template.

Read more

Snowplow 76 Changeable Hawk-Eagle

This release introduces an event de-duplication process which runs on Hadoop, and also includes an important bug fix for our SendGrid webhook support.

Upgrade steps

Upgrading to this release is simple - the only changed components are the jar versions for Hadoop Enrich and Hadoop Shred.

Configuration file

In the config.yml file for your EmrEtlRunner, update your hadoop_enrich and hadoop_shred job versions like so:

  versions:
    hadoop_enrich: 1.5.1 # WAS 1.5.0
    hadoop_shred: 0.7.0 # WAS 0.6.0
    hadoop_elasticsearch: 0.1.0 # Unchanged

For a complete example, see our sample config.yml template.

Read more

Snowplow 75 Long-Legged Buzzard

This release lets you warehouse the event streams generated by Urban Airship and SendGrid, and also updates our web-recalculate data model.

Upgrade steps

EmrEtlRunner and StorageLoader

The corresponding versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.

In your EmrEtlRunner’s config.yml file, update your hadoop_enrich job’s version to 1.5.0, like so:

  versions:
    hadoop_enrich: 1.5.0 # WAS 1.4.0

For a complete example, see our sample config.yml template.

Redshift

You'll need to deploy the Redshift tables for any webhooks you plan on ingesting into Snowplow. You can find the Redshift table deployment instructions on the corresponding webhook setup wiki pages:

Read more

Snowplow 74 European Honey Buzzard

This release adds a Weather Enrichment to the Hadoop pipeline - making Snowplow the first event analytics platform with built-in weather analytics!

Data provider: OpenWeatherMap

Upgrade steps

Configuration files

To take advantage of this new enrichment, update the hadoop_enrich jar version in the emr section of your configuration YAML:

  versions:
    hadoop_enrich: 1.4.0 # WAS 1.3.0
    hadoop_shred: 0.6.0 # UNCHANGED
    hadoop_elasticsearch: 0.1.0 # UNCHANGED

For a complete example, see our sample config.yml template.

Make sure to add a weather_enrichment_config.json configured as explained here into your enrichments folder too. The file should conform to this JSON Schema.

The corresponding JSONPaths file can be found here.

Redshift

If you are using Snowplow with Amazon Redshift, you will need to deploy the org_openweathermap_weather_1 table into your database.

Read more

Snowplow 73 Cuban Macaw

This release adds the ability to automatically load bad rows from the Snowplow Elastic MapReduce jobflow into Elasticsearch for analysis and formally separates the Snowplow enriched event format from the TSV format used to load Redshift.

Upgrade steps

EmrEtlRunner and StorageLoader

The corresponding versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.

Configuration file

You will need to update the jar versions in the emr section of your configuration YAML:

  versions:
    hadoop_enrich: 1.3.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.6.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process

In order to start loading bad rows from the EMR jobflow into Elasticsearch, you will need to add an Elasticsearch target to the targets section of your configuration YAML.

  targets:
    - name: "Our Elasticsearch cluster" # Name for the target - used to label the corresponding jobflow step
      type: elasticsearch # Marks the database type as Elasticsearch
      host: "ec2-43-1-854-22.compute-1.amazonaws.com" # Elasticsearch host
      database: snowplow # The Elasticsearch index
      port: 9200 # Port used to connect to Elasticsearch
      table: bad_rows # The Elasticsearch type
      es_nodes_wan_only: false # Set to true if using Amazon Elasticsearch Service
      username: # Not required for Elasticsearch
      password: # Not required for Elasticsearch
      sources: # Leave blank or specify: ["s3://out/enriched/bad/run=xxx", "s3://out/shred/bad/run=yyy"]
      maxerror:  # Not required for Elasticsearch
      comprows: # Not required for Elasticsearch

Note that the database and table fields actually contain the index and type respectively where bad rows will be stored.

The sources field is an array of buckets from which to load bad rows. If you leave this field blank, then the bad rows buckets created by the current run of the EmrEtlRunner will be loaded. Alternatively, you can explicitly specify an array of bad row buckets to load.

For a complete example, see our sample config.yml template.

Running EmrEtlRunner

Note these updates to EmrEtlRunner's command-line arguments:

  • You can skip loading data into Elasticsearch by running EmrEtlRunner with the --skip elasticsearch option
  • To run just the Elasticsearch load without any other EmrEtlRunner steps, explicitly skip all other steps using --skip staging,s3distcp,enrich,shred,archive_raw
  • Note that running EmrEtlRunner with --skip enrich,shred will no longer skip the EMR job, since there is still the Elasticsearch step to run
  • If you are using Postgres rather than Redshift, you should no longer pass the --skip shred option to EmrEtlRunner. This is because the shred step now removes JSON fields from the enriched event TSV.

Database

Use the appropriate migration script to update your version of the atomic.events table to the relevant schema:

If you are upgrading to this release from an older version of Snowplow, we also provide Redshift migration scripts to atomic.events version 0.8.0 from 0.4.0, 0.5.0 and 0.6.0 versions.

Warning: these migration scripts will alter your atomic.events table in-place, deleting the unstruct_event, contexts, and derived_contexts columns. We recommend that you make a full backup before running these scripts.

Read more

Snowplow 72 Great Spotted Kiwi

This release adds the ability to track clicks through the Snowplow Clojure Collector, adds a cookie extractor enrichment and introduces new de-duplication queries leveraging R71's event fingerprint.

Upgrade steps

Clojure Collector

This release bumps the Clojure Collector to version 1.1.0.

To upgrade to this release:

  1. Download the new warfile by right-clicking on this link and selecting “Save As…”
  2. Log in to your Amazon Elastic Beanstalk console
  3. Browse to your Clojure Collector’s application
  4. Click the “Upload New Version” and upload your warfile

Configuration files

You need to update the version of the Enrich jar in your configuration file:

    hadoop_enrich: 1.2.0 # Version of the Hadoop Enrichment process

If you wish to use the new cookie extractor enrichment, write a configuration JSON and add it to your enrichments folder. The example JSON can be found here.

This default configuration captures the Scala Stream Collector's own sp cookie - in practice, you would probably extract other, more valuable cookies available on your domain. Each extracted cookie will end up as a single derived context following the JSON Schema org.ietf/http_cookie/jsonschema/1-0-0.

Note: This enrichment only works with events recorded by the Scala Stream Collector - the CloudFront and Clojure Collectors do not capture HTTP headers.

JSONPaths files

Redshift

If you are using Snowplow with Amazon Redshift and wish to use the new cookie extractor enrichment, you will need to deploy the org_ietf_http_cookie_1 table into your database.

For the new URI redirect functionality, install the com_snowplowanalytics_snowplow_uri_redirect_1 table.

Read more

Snowplow 71 Stork-Billed Kingfisher

This release significantly overhauls Snowplow's handling of time and introduces event fingerprinting to support de-duplication efforts. It also brings our validation of unstructured events and custom context JSONs "upstream" from our Hadoop Shred process into our Hadoop Enrich process.

Upgrade steps

EmrEtlRunner and StorageLoader

The latest versions of the EmrEtlRunner and StorageLoader are available from our Bintray here.

Unzip this file to a sensible location (e.g. /opt/snowplow-r71).

Configuration files

You should update the versions of the Enrich and Shred jars in your configuration file (https://github.com/snowplow/snowplow/blob/r71-stork-billed-kingfisher/3-enrich/emr-etl-runner/config/config.yml.sample):

    hadoop_enrich: 1.1.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.5.0 # Version of the Hadoop Shredding process

You should also update the AMI version field:

    ami_version: 3.7.0

For each of your database targets, you must add the new ssl_mode field:

  targets:
    - name: "My Redshift database"
      ...
      ssl_mode: disable # One of disable (default), require, verify-ca or verify-full

If you wish to use the new event fingerprint enrichment, write a configuration JSON and add it to your enrichments folder. The example JSON can be found here.

Database

Use the appropriate migration script to update your version of the atomic.events table to the corresponding schema:

If you are ingesting Cloudfront access logs with Snowplow, use the Cloudfront access log migration script to update your com_amazon_aws_cloudfront_wd_access_log_1 table.

Read more

Snowplow 70 Bornean Green Magpie

This release focuses on improving our StorageLoader and EmrEtlRunner components and is the first step towards combining the two into a single CLI application.

Upgrade steps

EmrEtlRunner and StorageLoader

Download the EmrEtlRunner and StorageLoader from Bintray.

Unzip this file to a sensible location (e.g. /opt/snowplow-r70).

Check that you have a compatible JRE (1.7+) installed:

$ ./snowplow-emr-etl-runner --version
snowplow-emr-etl-runner 0.17.0

Configuration files

Your two old configuration files will no longer work. Use the combine_configurations.rb script to turn them into a unified configuration file and a resolver JSON.

For reference:

Note that field names in the unified configuration file no longer start with a colon - so region: us-east-1 not :region: us-east-1.
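
The combine_configurations.rb script is the supported way to migrate. Purely to illustrate the naming change, here is a minimal Python sketch that rewrites ":key:" field names as "key:". The function name strip_leading_colons is hypothetical, and the sketch does not merge the two files or produce a resolver JSON.

```python
import re

# Optional indentation (and optional list dash) followed by a ":key:" field name.
_COLON_KEY = re.compile(
    r"^([ \t]*(?:-[ \t]+)?):([A-Za-z_][A-Za-z0-9_]*):", re.MULTILINE
)


def strip_leading_colons(yaml_text):
    """Rewrite ':key:' field names as 'key:' without touching values."""
    return _COLON_KEY.sub(r"\1\2:", yaml_text)


old = ":emr:\n  :region: us-east-1\n"
print(strip_leading_colons(old))
```

Because the pattern is anchored to the start of each line, colons inside values (such as an s3:// path) are left alone.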

Running EmrEtlRunner and StorageLoader

The EmrEtlRunner now requires a --resolver argument which should be the path to your new resolver JSON.

Also note that when specifying steps to skip using the --skip option, the "archive" step has been renamed to "archive_raw" in the EmrEtlRunner and "archive_enriched" in the StorageLoader. This is in preparation for merging the two applications into one.
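
If you keep --skip values in wrapper scripts, the rename can be sketched as a small lookup in Python. This assumes --skip takes a comma-separated list of step names; the application labels used as dictionary keys, the function name migrate_skip and the "staging" step in the example are illustrative only.

```python
# Step renames described above, keyed by application.
RENAMED_STEPS = {
    "emr-etl-runner": {"archive": "archive_raw"},
    "storage-loader": {"archive": "archive_enriched"},
}


def migrate_skip(app, skip_value):
    """Translate an old comma-separated --skip value to the new step names."""
    renames = RENAMED_STEPS[app]
    steps = [s.strip() for s in skip_value.split(",") if s.strip()]
    # Steps that were not renamed pass through unchanged.
    return ",".join(renames.get(step, step) for step in steps)


print(migrate_skip("emr-etl-runner", "staging,archive"))
```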

Read more

Snowplow 69 Blue-Bellied Roller

This release contains new and updated SQL data models.

The SQL data models are a standalone and optional part of the Snowplow pipeline. Users who don't use the SQL data models are therefore not affected by this release.

Upgrade steps

To implement the SQL data models, first execute the relevant setup queries in Redshift. Then use SQL Runner to execute the queries on a regular basis. SQL Runner is an open source app that makes it easy to execute SQL statements programmatically as part of the Snowplow data pipeline.

The web and mobile data models come in two variants: recalculate and incremental.

The recalculate models drop and recalculate the derived tables using all events, and can therefore be replaced without having to upgrade the tables.

The incremental models update the derived tables using only the events from the most recent batch. The updated incremental model comes with a migration script.

Read more

Snowplow 68 Turquoise Jay

This is a small release which adapts the EmrEtlRunner to use the new Elastic MapReduce API.

Upgrade steps

EmrEtlRunner

You need to update EmrEtlRunner to version 0.16.0 on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r68-turquoise-jay
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment

Read more

Snowplow 67 Bohemian Waxwing

This release brings a host of upgrades to our real-time Amazon Kinesis pipeline as well as the embedding of Snowplow tracking into this pipeline.

Upgrade steps

The Kinesis apps for r67 Bohemian Waxwing are now all available in a single zip file here. Upgrading will require various configuration changes to each of the three applications’ HOCON configuration files.

Scala Stream Collector

  • Change collector.sink.kinesis.stream.name to collector.sink.kinesis.stream.good in the HOCON
  • Add collector.sink.kinesis.stream.bad to the HOCON
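
The two changes above can be sketched in Python on a dictionary that mirrors the HOCON structure. This is only an illustration: the function name migrate_collector_streams and the stream names in the example are hypothetical, and the real file should still be edited as HOCON.

```python
import copy


def migrate_collector_streams(config, bad_stream):
    """Rename stream.name to stream.good and add stream.bad."""
    updated = copy.deepcopy(config)
    stream = updated["collector"]["sink"]["kinesis"]["stream"]
    # The old single stream becomes the "good" stream.
    stream["good"] = stream.pop("name")
    stream["bad"] = bad_stream
    return updated


# Hypothetical stream names, for illustration only.
old = {"collector": {"sink": {"kinesis": {"stream": {"name": "raw-events"}}}}}
new = migrate_collector_streams(old, bad_stream="raw-events-bad")
print(new["collector"]["sink"]["kinesis"]["stream"])
```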

Scala Kinesis Enrich

If you want to include Snowplow tracking for this application please append the following:

enrich {

    ...

    monitoring {
        snowplow {
            collector-uri: ""
            collector-port: 80
            app-id: ""
            method: "GET"
        }
    }
}

Note that this is a wholly optional section; if you do not want to send application events to a second Snowplow instance, simply do not add this to your configuration file.

For a complete example, see our config.hocon.sample file.

Kinesis Elasticsearch Sink

  • Add max-timeout into the elasticsearch fields
  • Merge location fields into the elasticsearch section
  • If you want to include Snowplow tracking for this application please append the following:
sink {

    ...

    monitoring {
        snowplow {
            collector-uri: ""
            collector-port: 80
            app-id: ""
            method: "GET"
        }
    }
}

Again, note that Snowplow tracking is a wholly optional section.

For a complete example, see our config.hocon.sample file.

Read more

Snowplow 66 Oriental Skylark

This release upgrades our Hadoop Enrichment process to run on Hadoop 2.4, re-enables our Kinesis-Hadoop lambda architecture and also introduces a new scriptable enrichment powered by JavaScript.

Upgrade steps

EmrEtlRunner

You need to update EmrEtlRunner to version 0.15.0 on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r66-oriental-skylark
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment

Configuration file

You need to update your EmrEtlRunner's config.yml file to reflect the new Hadoop 2.4.0 and AMI 3.6.0 support:

:emr:
  :ami_version: 3.6.0 # WAS 2.4.2

And:

  :versions:
    :hadoop_enrich: 1.0.0 # WAS 0.14.1

JavaScript scripting enrichment

You can enable this enrichment by creating a self-describing JSON and adding it to your enrichments folder. The configuration JSON should validate against the javascript_script_config schema.

Read more

Snowplow 65 Scarlet Rosefinch

This release greatly improves the speed, efficiency, and reliability of Snowplow’s real-time Kinesis pipeline.

Upgrade steps

Kinesis applications

The Kinesis apps for r65 Scarlet Rosefinch are all available in a single zip file here.

Configuration files

Upgrading will require various configuration changes to each of the four applications.

Scala Stream Collector

Add backoffPolicy and buffer fields to the configuration HOCON.

Scala Kinesis Enrich

  • Add backoffPolicy and buffer fields to the configuration HOCON
  • Extract the resolver from the configuration HOCON into its own JSON file, which can be stored locally or in DynamoDB
  • Update the command line arguments as detailed here

Kinesis LZO S3 Sink

  • Rename the outermost key in the configuration HOCON from "connector" to "sink"
  • Replace the "s3/endpoint" field with an "s3/region" field (such as us-east-1)

Kinesis Elasticsearch Sink

Rename the outermost key in the configuration HOCON from "connector" to "sink"
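
The key changes for the two sink applications can be sketched in Python on a dictionary that mirrors the HOCON. This is a hedged illustration: it assumes the s3 section sits directly under the outermost key, and the function name migrate_sink_config, the endpoint and the bucket name are hypothetical.

```python
import copy


def migrate_sink_config(config, region=None):
    """Rename the top-level 'connector' key to 'sink'.

    If a region is given and an s3 section exists, also replace
    s3.endpoint with s3.region (Kinesis LZO S3 Sink only).
    """
    updated = copy.deepcopy(config)
    updated["sink"] = updated.pop("connector")
    s3 = updated["sink"].get("s3")
    if s3 is not None and region is not None:
        s3.pop("endpoint", None)
        s3["region"] = region
    return updated


# Hypothetical pre-upgrade configuration for the S3 sink.
old = {"connector": {"s3": {"endpoint": "https://s3.amazonaws.com", "bucket": "my-bucket"}}}
new = migrate_sink_config(old, region="us-east-1")
print(new)
```

For the Kinesis Elasticsearch Sink, calling the function without a region performs only the rename.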

Read more

Snowplow 64 Palila

This is a major release which adds a new data modeling stage to the Snowplow pipeline and fixes a small number of important bugs across the rest of Snowplow.

Upgrade steps

EmrEtlRunner

You need to update EmrEtlRunner to version 0.14.0 on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r64-palila
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment

Configuration file

From this release onwards, you must specify IAM roles for Elastic MapReduce to use. If you have not already done so, you can create these default EMR roles using the AWS Command Line Interface, like so:

$ aws emr create-default-roles

Now update your EmrEtlRunner's config.yml file to add the default roles you just created:

:emr:
  :ami_version: 2.4.2       # Choose as per http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-ami.html
  :region: eu-west-1        # Always set this
  :jobflow_role: EMR_EC2_DefaultRole # NEW LINE
  :service_role: EMR_DefaultRole     # NEW LINE

This release also bumps the Hadoop Enrichment process to version 0.14.1. Update config.yml like so:

  :versions:
    :hadoop_enrich: 0.14.1 # WAS 0.14.0

For a complete example, see our sample config.yml template.

Database

This release widens the mkt_clickid field in atomic.events. You need to use the appropriate migration script to update to the new table definition:

Read more

Snowplow 63 Red-Cheeked Cordon-Bleu

This is a major release which adds two new enrichments, upgrades existing enrichments and significantly extends and improves our Canonical Event Model for loading into Redshift, Elasticsearch and Postgres.

The new and upgraded enrichments are as follows:

  1. New enrichment: parsing useragent strings using the ua_parser library
  2. New enrichment: converting the money amounts in e-commerce transactions into a base currency using Open Exchange Rates
  3. Upgraded: extracting click IDs in our campaign attribution enrichment, so that Snowplow event data can be more precisely joined with campaign data
  4. Upgraded: our existing MaxMind-powered IP lookups
  5. Upgraded: useragent parsing using the user_agent_utils library can now be disabled

Upgrade steps

Enrichments

To continue parsing useragent strings using the user_agent_utils library, you must add a new JSON configuration file into your folder of enrichment JSONs:

{
    "schema": "iglu:com.snowplowanalytics.snowplow/user_agent_utils_config/jsonschema/1-0-0",
    "data": {
        "vendor": "com.snowplowanalytics.snowplow",
        "name": "user_agent_utils_config",
        "enabled": true,
        "parameters": {}
    }
}

The name of the file is not important but must end in .json.
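
If you script your deployments, writing this file can be sketched in Python as below. The dictionary repeats the JSON shown above; the helper name write_enrichment, the file name and the temporary directory (a stand-in for your own enrichments folder) are assumptions made for the example.

```python
import json
import os
import tempfile

# Configuration taken from the JSON shown above.
USER_AGENT_UTILS_CONFIG = {
    "schema": "iglu:com.snowplowanalytics.snowplow/user_agent_utils_config/jsonschema/1-0-0",
    "data": {
        "vendor": "com.snowplowanalytics.snowplow",
        "name": "user_agent_utils_config",
        "enabled": True,
        "parameters": {},
    },
}


def write_enrichment(directory, filename, config):
    """Write an enrichment configuration, insisting on the .json suffix."""
    if not filename.endswith(".json"):
        raise ValueError("enrichment file names must end in .json")
    path = os.path.join(directory, filename)
    with open(path, "w") as handle:
        json.dump(config, handle, indent=4)
    return path


enrichments_dir = tempfile.mkdtemp()  # stand-in for your enrichments folder
written = write_enrichment(
    enrichments_dir, "user_agent_utils_config.json", USER_AGENT_UTILS_CONFIG
)
print(written)
```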

Configuring other enrichments is at your discretion. Useful resources here are:

Elastic MapReduce Pipeline

There are two steps to upgrading the EMR pipeline:

  1. Upgrade your EmrEtlRunner to use the latest Hadoop job versions
  2. Upgrade your Redshift and/or Postgres atomic.events table to the relevant definitions

Configuration file

This release bumps:

  • The Hadoop Enrichment process to version 0.14.0
  • The Hadoop Shredding process to version 0.4.0

In your EmrEtlRunner's config.yml file, update your Hadoop jobs versions like so:

  :versions:
    :hadoop_enrich: 0.14.0 # WAS 0.13.0
    :hadoop_shred: 0.4.0 # WAS 0.3.0

For a complete example, see our sample config.yml template.

Database

You need to use the appropriate migration script to update to the new table definition:

If you want to make use of the new ua_parser based useragent parsing enrichment in Redshift, you must also deploy the new table into your atomic schema:

Kinesis pipeline

This release updates:

  • Scala Kinesis Enrich, to version 0.4.0
  • Kinesis Elasticsearch Sink, to version 0.2.0

The new version of the Kinesis pipeline is available on Bintray. The download contains the latest versions of all of the Kinesis apps (Scala Stream Collector, Scala Kinesis Enrich, Kinesis Elasticsearch Sink, and Kinesis S3 Sink).

Upgrading a live Kinesis pipeline

Our recommended approach for upgrading is as follows:

  1. Kill your Scala Kinesis Enrich cluster
  2. Leave your Kinesis Elasticsearch Sink cluster running until all remaining enriched events are loaded, then kill this cluster too
  3. Upgrade your Scala Kinesis Enrich cluster to the new version
  4. Upgrade your Kinesis Elasticsearch Sink cluster to the new version
  5. Restart your Scala Kinesis Enrich cluster
  6. Restart your Kinesis Elasticsearch Sink cluster

Read more

Snowplow 62 Tropical Parula

This release is designed to fix an incompatibility issue between r61's EmrEtlRunner and some older Elastic Beanstalk configurations. It also includes some other EmrEtlRunner improvements.

Upgrade steps

EmrEtlRunner

You need to update EmrEtlRunner to version 0.13.0 on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r62-tropical-parula
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment

You must also update your EmrEtlRunner's configuration file, or else you will get a Contract failure on start. See the next section for details.

Configuration file

Whether or not you use the new bootstrap option, you must add a :bootstrap: property to the :emr: section of your EmrEtlRunner's config.yml file, like so:

:emr:
  ...
  :ec2_key_name: ADD HERE
  :bootstrap: []          # No custom bootstrap actions
  :software:
    ...

For a complete example, see our sample config.yml template.
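
A quick pre-flight check for the new property could look like the Python sketch below. It is a plain text scan, not a YAML parser and not the validation EmrEtlRunner itself performs; the function name has_bootstrap_entry and the sample key name are hypothetical.

```python
import re


def has_bootstrap_entry(config_text):
    """Return True if a :bootstrap: key appears nested under :emr:."""
    in_emr = False
    for line in config_text.splitlines():
        if re.match(r"^:\w+:", line):
            # A new top-level section starts; track whether it is :emr:.
            in_emr = line.startswith(":emr:")
        elif in_emr and re.match(r"^\s+:bootstrap:", line):
            return True
    return False


sample = ":emr:\n  :ec2_key_name: my-key\n  :bootstrap: []\n"
print(has_bootstrap_entry(sample))
```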

Read more

Snowplow 61 Pygmy Parrot

This release has a variety of new features, operational enhancements and bug fixes. The major additions are:

  1. You can now parse Amazon CloudFront access logs using Snowplow
  2. The latest Clojure Collector version supports Tomcat 8 and CORS, ready for cross-domain POST from JavaScript and ActionScript
  3. EmrEtlRunner's failure handling and Clojure Collector log handling have been improved

Upgrade steps

EmrEtlRunner

You need to update EmrEtlRunner to version 0.12.0 on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r61-pygmy-parrot
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment

If you currently use snowplow-runner-and-loader.sh, upgrade to the relevant version too.

Configuration file

This release bumps the Hadoop Enrichment process to version 0.13.0.

In your EmrEtlRunner's config.yml file, update your hadoop_enrich job's version like so:

  :versions:
    :hadoop_enrich: 0.13.0 # WAS 0.12.0

For a complete example, see our sample config.yml template.

Clojure Collector

This release bumps the Clojure Collector to version 1.0.0.

You will not be able to upgrade an existing Tomcat 7 cluster to use this version. Instead, to upgrade to this release:

  1. Download the new warfile by right-clicking on this link and selecting "Save As…"
  2. Log in to your Amazon Elastic Beanstalk console
  3. Browse to your Clojure Collector's application
  4. Click the "Launch New Environment" action
  5. Click "Upload New Version" and upload your warfile

When you are confident that the new collector is performing as expected, you can choose the "Swap Environment URLs" action to put the new collector live.

Read more

Snowplow 60 Bee Hummingbird

This release focuses on the Snowplow Kinesis flow, and includes:

  1. A new Kinesis “sink app” that reads the Scala Stream Collector’s Kinesis stream of raw events and stores these raw events in Amazon S3 in an optimized format
  2. An updated version of our Hadoop Enrichment process that supports as an input format the events stored in S3 by the new Kinesis sink app

Together, these two features let you robustly archive your Kinesis event stream in S3, and process and re-process it at will using our tried-and-tested Hadoop Enrichment process.

Up until now, all Snowplow releases have used semantic versioning. We will continue to use semantic versioning for Snowplow's many constituent applications and libraries, but our releases of the Snowplow platform as a whole will be known by their release number plus a codename. The codenames for 2015 will be birds in ascending order of size, starting with the Bee Hummingbird.

Upgrade steps

EmrEtlRunner

We recommend upgrading EmrEtlRunner to version 0.11.0, given the bugs fixed in this release. You must also upgrade if you want to use Hadoop to process the events stored by the Kinesis LZO S3 Sink.

Upgrade as follows:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout r60-bee-hummingbird
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment

Configuration file

This release bumps the Hadoop Enrichment process to version 0.12.0.

In your EmrEtlRunner's config.yml file, update your hadoop_enrich job's version like so:

  :versions:
    :hadoop_enrich: 0.12.0 # WAS 0.11.0

If you want to run the Hadoop Enrichment process against the output of the Kinesis LZO S3 Sink, you will have to change the collector_format field in the configuration file to thrift:

:collector_format: thrift

For a complete example, see our sample config.yml template.

Kinesis pipeline

We are steadily moving over to Bintray for hosting binaries and artifacts which don't have to be hosted on S3. To make deployment easier, the Kinesis apps (Scala Stream Collector, Scala Kinesis Enrich, Kinesis Elasticsearch Sink, and Kinesis S3 Sink) are now all available in a single zip file.

Read more

Snowplow 0.9.14

This release contains a variety of important bug fixes, plus support for three new event streams which can be loaded into your Snowplow event warehouse and unified log:

  • Mandrill - for tracking email and email-related events delivered by Mandrill
  • PagerDuty - for tracking incidents generated by PagerDuty
  • Pingdom - for tracking site outages detected by Pingdom

Upgrade steps

EmrEtlRunner

You need to update EmrEtlRunner to version 0.10.0 on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.14
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment

Configuration file

This release bumps the Hadoop Enrichment process to version 0.11.0 and the Hadoop Shredding process to version 0.3.0.

In your EmrEtlRunner's config.yml file, update your hadoop_enrich and hadoop_shred jobs' versions like so:

  :versions:
    :hadoop_enrich: 0.11.0 # WAS 0.10.1
    :hadoop_shred: 0.3.0 # WAS 0.2.1

For a complete example, see our sample config.yml template.

Clojure Collector

This release bumps the Clojure Collector to version 0.9.1.

To upgrade to this release:

  1. Download the new warfile by right-clicking on this link and selecting "Save As…"
  2. Log in to your Amazon Elastic Beanstalk console
  3. Browse to your Clojure Collector’s application
  4. Click "Upload New Version" and upload your warfile

CloudFront Collector

You can find the new pixel in our GitHub repository as 2-collectors/cloudfront-collector/static/i - upload this to S3, overwriting your existing pixel.

Remember to invalidate the pixel in your CloudFront distribution.

Redshift

Make sure to deploy Redshift tables for any of the new webhooks that you plan on ingesting into Snowplow. You can find the Redshift table deployment instructions on the corresponding webhook setup wiki pages:

Read more

Snowplow 0.9.13

This release fixes two bugs found in the previous release:

  1. Safer URI parsing
  2. Dependency conflict with the version of Specs2 in Kinesis Enrich

Upgrade steps

This release bumps Common Enrich to 0.9.1, Hadoop Enrich to version 0.10.1, and Kinesis Enrich to 0.2.1, with the latter two publicly available on S3.

In your EmrEtlRunner's config.yml file, update your Hadoop enrich job's version to 0.10.1:

  :versions:
    :hadoop_enrich: 0.10.1

For a complete example, see our sample config.yml template.

Read more

Snowplow 0.9.12

This release significantly improves and extends our Kinesis support. The major new feature is our all new Kinesis Elasticsearch Sink, which streams event data from Kinesis into Elasticsearch in real-time. The data is then available to power real-time dashboards and analysis (e.g. using Kibana).

In addition to enabling real-time loading of data into Elasticsearch, we have made a number of other improvements to the real-time flow:

  1. Bad rows of data are now loaded into a dedicated bad rows stream in Kinesis
  2. The real-time flow now runs the latest version of Scala Common Enrich, making it possible to employ the same configurable enrichments in the real-time flow that are currently available in the batch flow.

This release also makes some improvements to Snowplow Common Enrich and Hadoop Enrich which should be invaluable for users of our batch-based event pipeline.

Upgrade steps

Kinesis pipeline

There are several changes you need to make to move to the new versions of the Scala Stream Collector and Scala Kinesis Enrich:

  • You must provide a "region" field (with a value like “us-east-1”) in the configuration files
  • You must provide a "resolver" field in the Scala Kinesis Enrich containing the data used to configure the Iglu resolver
  • If you run Scala Kinesis Enrich without the -enrichments option, the IP anonymization enrichment and the IP address lookup enrichment will not run automatically

New templates for the two configuration files can be found on GitHub (you will need to edit the AWS credentials and the stream names):

And a sample enrichment directory containing sensible configuration JSONs can be found here.

Hadoop pipeline

This release bumps the Hadoop Enrichment process to version 0.10.0.

In your EmrEtlRunner's config.yml file, update your Hadoop enrich job's version to 0.10.0, like so:

  :versions:
    :hadoop_enrich: 0.10.0 # WAS 0.9.0

For a complete example, see our sample config.yml template.

Read more

Snowplow 0.9.11

For the first time, you can now use Snowplow to collect, store and analyze event streams generated by supported third-party software.

Many Software-as-a-Service vendors publish their own internal event streams for customers to consume - these event stream APIs are often referred to as "webhooks", sometimes as "streaming APIs", "postbacks" or "HTTP response APIs". Snowplow 0.9.11 adds first-class support for an initial set of these third-party webhooks.

For our initial 0.9.11 release we are adding support for three different webhook sources:

  • MailChimp - for tracking email and email-related events delivered by MailChimp
  • CallRail - for tracking completed telephone calls recorded by CallRail
  • Iglu - for tracking Iglu-compatible self-describing events, enabling you to use schema-less webhook APIs such as AD-X Tracking

Upgrade steps

Configuration file

This release bumps the Hadoop Enrichment process to version 0.9.0.

In your EmrEtlRunner's config.yml file, update your Hadoop enrich job's version to 0.9.0, like so:

  :versions:
    :hadoop_enrich: 0.9.0 # WAS 0.8.0

For a complete example, see our sample config.yml template.

Clojure Collector

This release bumps the Clojure Collector to version 0.9.0.

To upgrade to this release:

  1. Download the new warfile by right-clicking on this link and selecting "Save As…"
  2. Log in to your Amazon Elastic Beanstalk console
  3. Browse to your Clojure Collector's application
  4. Click "Upload New Version" and upload your warfile

Redshift

If you have installed the com_snowplowanalytics_snowplow_change_form_1 table following the 0.9.10 release, then please upgrade it by using the upgrade script, migrate_change_form_1_r1_to_r2.sql.

Also, make sure to deploy Redshift tables for any webhooks you plan on ingesting into Snowplow. You can find the Redshift table deployment instructions on the corresponding webhook setup wiki pages:

Read more

Snowplow 0.9.10

This is a minimalistic release designed to support the new events and context of the Snowplow JavaScript Tracker v2.1.1.

This release is primarily targeted at Snowplow users of Amazon Redshift who are upgrading to the Snowplow JavaScript Tracker (v2.1.0+).

Upgrade steps

You will need to deploy the tables for any new events/context you want to support into your Amazon Redshift database. Make sure to deploy these into the same schema as your events table resides in.

You can find all Redshift table definitions in our GitHub repository under 4-storage/redshift-storage/sql.

The StorageLoader will automatically pick up the new JSON Paths files - you do not need to deploy these.

Read more

Snowplow 0.9.9

This is primarily a comprehensive bug fix release, although it also adds the new campaign_attribution enrichment to our enrichment registry.

Upgrade steps

EmrEtlRunner and StorageLoader

You need to update EmrEtlRunner and StorageLoader to versions 0.9.2 and 0.3.3 respectively on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.9
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment

Configuration file

This release bumps the Hadoop Enrichment process to version 0.8.0.

In your EmrEtlRunner's config.yml file, update your Hadoop enrich job’s version to 0.8.0, like so:

  :versions:
    :hadoop_enrich: 0.8.0 # WAS 0.7.0

For a complete example, see our sample config.yml template.

Campaign attribution

If you upgrade Hadoop Enrich to version 0.8.0 as above, you must also follow these steps, or else campaign attribution will be disabled.

To use the new enrichment, add a "campaign_attribution.json" file containing a campaign_attribution enrichment JSON to your enrichments directory. Note that the previously automatic behaviour of populating the mkt_ fields based on the utm_ querystring fields no longer occurs by default. To reproduce it you must use the Google-like manual tagging configuration.

Clojure Collector

This release bumps the Clojure Collector to version 0.8.0.

To upgrade to this release:

  1. Download the new warfile by right-clicking on this link and selecting "Save As…"
  2. Log in to your Amazon Elastic Beanstalk console
  3. Browse to your Clojure Collector's application
  4. Click "Upload New Version" and upload your warfile

Read more

Snowplow 0.9.8

With this release, we are adding event analytics support for iOS and Android applications. Mobile event analytics is a major step in Snowplow’s journey from a web analytics tool to a general-purpose event analytics platform.

Adding mobile support for Snowplow is really a few different releases:

  • Snowplow 0.9.8, which adds POST support to our Clojure Collector and upgrades our Enrichment process to support POST payloads containing multiple events
  • A new event tracker for iOS, see today’s accompanying iOS Tracker blog post
  • A new event tracker for Android, see today’s accompanying Android Tracker blog post
  • New mobile-specific JSON Schemas available in Iglu Central, mobile_context and geolocation_context

Upgrade steps

Configuration file

This release bumps the Hadoop Enrichment process to version 0.7.0.

In your EmrEtlRunner's config.yml file, update your Hadoop enrich job's version to 0.7.0, like so:

  :versions:
    :hadoop_enrich: 0.7.0 # WAS 0.6.0

For a complete example, see our sample config.yml template.

Clojure Collector

Please make sure that you upgrade the Hadoop Enrichment process to 0.7.0 before upgrading your collector.

This release bumps the Clojure Collector to version 0.7.0.

To upgrade to this release:

  1. Download the new warfile by right-clicking on this link and selecting "Save As…"
  2. Log in to your Amazon Elastic Beanstalk console
  3. Browse to your Clojure Collector's application
  4. Click "Upload New Version" and upload your warfile

Redshift

Both of the new trackers send mobile-related context conforming to the mobile_context JSON Schema, as a custom context automatically attached to each event.

If you are running Redshift, you can deploy the mobile_context table into your database using this script.

The Android Tracker also optionally sends a geolocation-related context relating to the geolocation_context JSON Schema; support for this in the iOS Tracker is planned soon.

Read more

Snowplow 0.9.7

This release is a "tidy-up" release which fixes some important bugs, particularly:

  1. A bug in v0.9.5 onwards which was preventing events containing multiple JSONs from being shredded successfully
  2. Our Hive table definition falling behind Snowplow 0.9.6’s enriched event format updates
  3. A bug in EmrEtlRunner causing issues running Snowplow inside some VPC environments

As well as these important fixes, 0.9.7 comes with a set of smaller bug fixes plus two new features:

  • The ability to perform shredding without prior enrichment (i.e. shred an existing folder of enriched events)
  • The ability to load Redshift from an S3 bucket in a region different to Redshift's own region

Upgrade steps

EmrEtlRunner and StorageLoader

You need to update EmrEtlRunner and StorageLoader to the 0.9.7 code release on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.7
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment

Configuration file

In your EmrEtlRunner's config.yml file, update your Hadoop shred job's version to 0.2.1, like so:

  :versions:
    ...
    :hadoop_shred: 0.2.1 # WAS 0.2.0

For a complete example, see our sample config.yml template.

Hive

Hive users can find the updated Hive file in our repository as 4-storage/hive-storage/hiveql/table-def.q.

Note that enriched events generated by pre-0.9.6 Snowplow are not compatible with this updated Hive definition, and will need to be re-generated.

Read more

Snowplow 0.9.6

This release does four things:

  1. It fixes some important bugs discovered in Snowplow 0.9.5, related to our new shredding functionality
  2. It introduces new JSON-based configurations for Snowplow's existing enrichments
  3. It extends our geo-IP lookup enrichment to support all five of MaxMind's commercial databases
  4. It extends our referer-parsing enrichment to support a user-configurable list of internal domains

Upgrade steps

EmrEtlRunner and StorageLoader

You need to update EmrEtlRunner and StorageLoader to the 0.9.6 code release on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.6
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment

Configuration file

Update your EmrEtlRunner's config.yml file. First, update both of your Hadoop job versions:

  :versions:
    :hadoop_enrich: 0.6.0 # WAS 0.5.0
    :hadoop_shred: 0.2.0 # WAS 0.1.0

Next, completely delete the :enrichments: section at the bottom:

:enrichments:
  :anon_ip:
    :enabled: true
    :anon_octets: 2

For a complete example, see our sample config.yml template.

Enrichments

Finally, if you wish to use any of the configurable enrichments, you need to create a directory of configuration JSONs and pass that directory to the EmrEtlRunner using the new --enrichments option.

For help on this, please read our release blog; also check out the example enrichments directory, and review the configuration guide for the new JSON-based enrichments.

Important: don’t forget to update any Bash script that you use to run your EmrEtlRunner job, to include the --enrichments argument. If you forget to do this, then all of your enrichments will be switched off. You can see updated versions of these Bash files here:

Database

You need to use the appropriate migration script to update to the new table definition:

Read more

Snowplow 0.9.5

This release makes Snowplow the first event analytics system to validate incoming event and context JSONs (using JSON Schema), and then automatically shred those JSONs into dedicated tables in Amazon Redshift.

Upgrade steps

EmrEtlRunner

You need to update EmrEtlRunner to the code release 0.9.5 on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.5
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment

You also need to update the config.yml file for EmrEtlRunner. For more information on how to populate the new configuration file correctly, see the Configuration section of the EmrEtlRunner setup guide.

StorageLoader

You need to upgrade your StorageLoader installation to the 0.9.5 code release on GitHub:

$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.5
$ cd 4-storage/storage-loader
$ bundle install --deployment

You also need to update the config.yml file for StorageLoader.

New Snowplow-authored events

If you want to add support for the new Snowplow-authored events (e.g. link clicks) to your Snowplow installation, this is a two-step process:

  1. Deploy the Redshift table definition available in the Snowplow repo into your Redshift database (same schema as atomic.events)
  2. (If using Looker) deploy the LookML model available in the Snowplow repo into your Looker instance

Custom events and contexts

Snowplow 0.9.5 lets you define your own custom unstructured events and contexts, and configure Snowplow to process these from collection through into Redshift and even Looker.

Setting this up is outside the scope of this upgrade guide. We have documented the process on our wiki, split into two pages:

  1. Configuring shredding in EmrEtlRunner
  2. Loading shredded types using StorageLoader

Read more

Snowplow 0.9.4

This release includes a new base LookML data model and dashboard to get Snowplow users started with Looker.

The new base model has some significant improvements over the old one:

  • Querying the data is much faster. When new Snowplow event data is loaded into Redshift, Looker automatically detects it and generates the relevant session-level and visitor-level derived tables, so that they are ready to be queried directly. We’ve tuned the derived tables with the relevant dist keys and sort keys to make sure any underlying table joins in Redshift are performant.
  • New visualizations are now supported, including geographic plots.
  • Looker's new global filters are supported. They make it possible to drill into subsets of visitors by a range of dimensions and to see a wide range of visualizations for that subset on the same screen, opening up new, creative ways of exploring your Snowplow data.
  • Metrics and dimensions have been renamed to make it easier for a new user unfamiliar with Snowplow to explore the data through Looker.

Upgrade steps

To make use of the new models, you'll need to have a Looker license or be on a Looker trial.

First, you will need to load a new country codes dataset into Redshift / Postgres. It maps two-character ISO country codes (output by our MaxMind enrichment) to three-character ISO country codes (used by Looker for geographic visualizations) and to country names.

Clone the Snowplow repo:

$ git clone https://github.com/snowplow/snowplow.git

You need to run the contents of snowplow/5-data-modeling/reference-data/redshift/iso-country-codes.sql in your Redshift database. This can be done using psql, e.g.

psql -U $username -p $port -h $host -d $database -f snowplow/5-data-modeling/reference-data/redshift/iso-country-codes.sql

Alternatively, you can copy and paste the content of the file into your favorite SQL editor.

You then need to make sure that your Looker user has access to the new data. In psql, execute:

GRANT USAGE ON SCHEMA reference_data TO looker;
GRANT SELECT ON TABLE reference_data.country_codes TO looker;

These statements assume that the user credentials you share with Looker have the username "looker".

Next, you need to transfer the LookML files from the Snowplow repo into the repo you use for Looker, either directly (via Git) or by creating the files in the Looker UI (in the models section) and then copying and pasting the contents. Note that you may need to update snowplow.model.lookml so that it references the connection to your Snowplow dataset in Redshift: the example file assumes that your connection is called "snowplow", which may not be the case.

Once copied over, you should be able to start exploring the "events", "sessions" and "visitors" views, and playing around directly with the "Traffic Pulse" dashboard.

Read more

Snowplow 0.9.3

This release deals with incremental improvements to EmrEtlRunner, plus two important bug fixes for Clojure Collector users.

The first Clojure Collector issue was a problem in the file move functionality in EmrEtlRunner, which was preventing Clojure Collector users from scaling beyond a single instance without data loss.

The second Clojure Collector issue involved the IP address(es) of Elastic Beanstalk's Apache proxy showing up in the atomic.events table in place of the expected end-user IPs. We were unable to reproduce this issue when running multiple instances, so we do not believe this problem is as widespread as the first.

Upgrade steps

Upgrading is a two-step process:

  1. Update EmrEtlRunner
  2. Update Clojure Collector [optional]

EmrEtlRunner

You need to update EmrEtlRunner to version 0.7.0, which is available under the 0.9.3 tag on GitHub:

$ git clone https://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.3
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment

You also need to update your EmrEtlRunner's config.yml file in a few places. First add a logging section at the top:

:logging:
  :level: DEBUG # You can optionally switch to INFO for production

Next you need to replace this:

:emr:
  :hadoop_version: 1.0.3

with this:

:emr:
  :ami_version: 2.4.2

If you need to use a different Hadoop version, check out this handy table to determine the correct AMI version.

Finally, add the region in:

:emr:
  :ami_version: 2.4.2
  :region: us-east-1 # Or your region

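Your :region: will be your existing :placement: without the final character, which is the availability zone letter (for example, a :placement: of us-east-1a gives a :region: of us-east-1). Note that if you are running your EMR job in an EC2 subnet, you no longer need to set the :placement: field.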
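As a minimal sketch of that rule, the following shell function strips the last character from a placement value. The function name region_from_placement is hypothetical and not part of Snowplow; it only illustrates how the two settings relate.

```shell
# Derive the :region: value from an existing :placement: value by dropping
# the trailing availability-zone letter. Hypothetical helper, POSIX sh.
region_from_placement() {
  placement="$1"
  # ${var%?} removes exactly one trailing character
  printf '%s\n' "${placement%?}"
}

region_from_placement "us-east-1a"   # prints us-east-1
```

The printed value is what you would put after :region: in config.yml.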

Once you have made these changes, do check your final version against the updated config.yml template.

Clojure Collector

This release bumps the Clojure Collector to version 0.6.0. Upgrading to this release is only necessary if you have been encountering the issue with proxy IPs appearing in atomic.events, as discussed in this email thread (issue #719).

To upgrade to this release:

  1. Download the new warfile by right-clicking on this link and selecting “Save As…”
  2. Log in to your Amazon Elastic Beanstalk console
  3. Browse to your Clojure Collector's application
  4. Click “Upload New Version” and upload your warfile

Read more

Snowplow 0.9.2

This release adds Snowplow support for the updated CloudFront access log file format introduced by Amazon on the morning of 29th April 2014.

If you currently use the Snowplow CloudFront-based event collector, we recommend upgrading to this release as soon as possible.

As well as support for the new log file format, this release also features a new standalone Scalding job to make re-processing “bad” rows easier, and also some Hive script updates to bring our Hive support in step with our Postgres and Redshift schemas.

Upgrade steps

Before upgrading, please ensure that you are on Snowplow 0.9.1, which introduced changes to the Snowplow enriched event format.

If you attempt to jump straight to 0.9.2 (from versions before 0.9.1), your enriched events will not load into your legacy Redshift or Postgres schema.
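One rough way to check this precondition is to look at the Hadoop ETL version in your existing config.yml, since the 0.9.1 upgrade sets :hadoop_etl_version: to 0.4.0. The sketch below assumes that your config file still carries that setting and that `sort -V` (GNU coreutils version sort) is available; the function names version_ge and hadoop_etl_version are hypothetical helpers, not part of Snowplow.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints the :hadoop_etl_version: value found in the config file given as $1.
hadoop_etl_version() {
  sed -n 's/^[[:space:]]*:hadoop_etl_version:[[:space:]]*\([0-9.]*\).*$/\1/p' "$1" | head -n 1
}

# Example usage (the config path is an assumption):
# if version_ge "$(hadoop_etl_version config/config.yml)" 0.4.0; then
#   echo "Config is at the 0.9.1 level or later"
# else
#   echo "Upgrade to 0.9.1 first"
# fi
```

This only inspects the configuration file; it does not confirm that the database migrations from 0.9.1 were applied.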

Configuration file

Upgrading is simple: update the config.yml file for EmrEtlRunner to use version 0.5.0 of the Hadoop ETL:

:snowplow:
  :hadoop_etl_version: 0.5.0

Recover missing data

Important: since releasing this version of Snowplow, we have learnt that the suggested upgrade process listed below has the unfortunate side effect of URL-encoding all string columns in the recovered data. For that reason, we recommend updating to Snowplow 0.9.3, where this bug is addressed.

Any Snowplow batch runs after the CloudFront change but before your upgrade to 0.9.2 will have resulted in valid events ending up in your bad rows bucket. Happily, we can use the Snowplow Hadoop Bad Rows job to recover them.

For each run that you want to recover data from, you can run the Hadoop Bad Rows job using the Amazon Ruby EMR client:

$ elastic-mapreduce --create --name "Extract raw events from Snowplow bad row JSONs" \
  --instance-type m1.xlarge --instance-count 3 \
  --jar s3://snowplow-hosted-assets/3-enrich/scala-bad-rows/snowplow-bad-rows-0.1.0.jar \
  --arg com.snowplowanalytics.hadoop.scalding.SnowplowBadRowsJob \
  --arg --hdfs \
  --arg --input --arg s3n://[[PATH_TO_YOUR_FIXABLE_BAD_ROWS]] \
  --arg --output --arg s3n://[[PATH_WILL_BE_STAGING_FOR_EMRETLRUNNER]]

Replace the [[...]] placeholders above with the appropriate bucket paths. Please note: if you have multiple runs to fix, then we suggest running the above multiple times, one per run to fix, rather than running it against your whole bad rows bucket - it should be much faster.
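If you have several runs to fix, a small wrapper can generate one invocation per run. The sketch below only prints the commands so that you can review them before executing anything. The function name bad_rows_job_cmd, the bucket names, and the run folder names are all hypothetical, and writing each run to its own output folder is an assumption rather than something this guide prescribes.

```shell
# Prints (does not execute) the Hadoop Bad Rows job command for one run.
# $1: input path of the fixable bad rows, $2: output path for recovered events
bad_rows_job_cmd() {
  printf '%s\n' "elastic-mapreduce --create \
--name \"Extract raw events from Snowplow bad row JSONs\" \
--instance-type m1.xlarge --instance-count 3 \
--jar s3://snowplow-hosted-assets/3-enrich/scala-bad-rows/snowplow-bad-rows-0.1.0.jar \
--arg com.snowplowanalytics.hadoop.scalding.SnowplowBadRowsJob \
--arg --hdfs \
--arg --input --arg $1 \
--arg --output --arg $2"
}

# One command per run folder; bucket and folder names are placeholders.
for run in run-1 run-2; do
  bad_rows_job_cmd "s3n://my-bad-rows-bucket/$run" "s3n://my-recovery-bucket/$run"
done
```

Each printed line can then be pasted into a shell once the paths have been checked.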

Now you are ready to process the recovered raw events with Snowplow. Unfortunately, the filenames generated by the Bad Rows job are not currently compatible with EmrEtlRunner (we will fix this in a future release). In the meantime, here is a workaround:

  1. Edit config.yml and change :collector_format: cloudfront to :collector_format: clj-tomcat
  2. Edit config.yml and point the :processing: bucket setting to wherever your extracted bad rows are located
  3. Run EmrEtlRunner with --skip staging
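The first step of this workaround can be scripted so that your normal config.yml stays untouched. The sketch below writes a modified copy of the config; the function name make_recovery_config and the file paths are hypothetical, and the EmrEtlRunner invocation in the trailing comment is an assumption based on a typical installation, so adjust it to match how you normally launch the runner.

```shell
# Writes a copy of an EmrEtlRunner config with :collector_format: switched
# from cloudfront to clj-tomcat (step 1 of the workaround).
# $1: original config file, $2: path for the modified copy
make_recovery_config() {
  sed 's/^\([[:space:]]*:collector_format:[[:space:]]*\)cloudfront/\1clj-tomcat/' "$1" > "$2"
}

# Example usage (paths are assumptions):
# make_recovery_config config/config.yml config/recovery.yml
# Then edit the :processing: bucket in config/recovery.yml by hand (step 2)
# and run EmrEtlRunner against the copy, skipping staging (step 3), e.g.:
# bundle exec bin/snowplow-emr-etl-runner --config config/recovery.yml --skip staging
```

Working on a copy makes it easy to return to your regular CloudFront configuration once the recovery run has finished.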

If you are a Qubole and/or Hive user, you can find an alternative approach to recovering the bad rows in our blog post, Reprocessing bad rows of Snowplow data using Hive, the JSON Serde and Qubole.

Read more

Snowplow 0.9.1

This release introduces initial support for JSON-based custom unstructured events and custom contexts in the Snowplow Enrichment and Storage processes; this is the most-requested feature from our community and a key building block for mobile and app event tracking in Snowplow.

Snowplow’s event trackers have supported custom unstructured events and custom contexts for some time, but prior to 0.9.1 there had been no way of working with these JSON-based objects “downstream” in the rest of the Snowplow data pipeline. This release adds preliminary support as follows:

  1. Parse incoming custom unstructured events and contexts to ensure that they are valid JSON
  2. Where possible, clean up the JSON (e.g. remove whitespace)
  3. Store the JSON as json-type fields in Postgres, and in large varchar fields in Redshift

As well as this new JSON-based functionality, 0.9.1 also includes a host of additional features and updates.

Upgrade steps

EmrEtlRunner

You need to update EmrEtlRunner to the 0.9.1 code on GitHub:

$ git clone https://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.1
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment

You also need to update the config.yml file for EmrEtlRunner to use the Hadoop ETL version 0.4.0:

:snowplow:
  :hadoop_etl_version: 0.4.0

Don't forget to add in the new subnet (VPC) argument too:

:emr:
  ...
  :ec2_subnet_id: ADD HERE # Leave blank if not running in VPC

See the complete example of the EmrEtlRunner config.yml file in the GitHub repo.

StorageLoader

You need to upgrade your StorageLoader installation to the 0.9.1 code on GitHub:

$ git clone https://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.1
$ cd 4-storage/storage-loader
$ bundle install --deployment

Database

We have updated the Redshift and Postgres table definitions for atomic.events. You can find the latest versions in the GitHub repository, along with migration scripts to handle the upgrade from recent prior versions. Please review any migration script carefully before running and check that you are happy with how it handles the upgrade.

| Database | Table definition | Migration script            |
|----------|------------------|-----------------------------|
| Redshift | 0.3.0            | Migrate from 0.2.2 to 0.3.0 |
| Postgres | 0.2.0            | Migrate from 0.1.x to 0.2.0 |

Read more

Snowplow 0.9.0

This release introduces our initial beta support for Amazon Kinesis in the Snowplow Collector and Enrichment components.

At Snowplow we are hugely excited about Kinesis's potential, not just to enable near-real-time event analytics, but more fundamentally to serve as a business’s unified log, aka its “digital nervous system”. This is a concept we introduced recently in our blog post The three eras of business data processing, and further explored at the Inaugural Kinesis London meetup.

Upgrade steps

There are no upgrade steps, because this release introduces an entirely new capability. If you want to adopt it, you need to set up a new environment.

Read more
