Decoder Job
A job for normalizing ingestion messages. Defined in the com.mozilla.telemetry.Decoder class (source).
Transforms
These transforms are currently executed against each message in order.
GeoIP Lookup
- Extract ip from the x_forwarded_for attribute
  - use the third-to-last value (since the second-to-last value is a forwarding rule IP added by Google load balancer, and the last value is a Google load balancer IP added by nginx)
- Execute the following steps until one fails and ignore the exception
  - Parse ip using InetAddress.getByName
  - Lookup ip in the configured GeoIP2City.mmdb
  - Extract country.iso_code as geo_country
  - Extract city.name as geo_city if cities15000.txt is not configured or city.geo_name_id is in the configured cities15000.txt
  - Extract subdivisions[0].iso_code as geo_subdivision1
  - Extract subdivisions[1].iso_code as geo_subdivision2
- Remove the x_forwarded_for and remote_addr attributes
- Remove any null values added to attributes
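The IP selection, city filter, and attribute cleanup steps above can be sketched roughly as follows. This is a minimal illustration, not the Decoder's actual code: the class and method names are hypothetical, the example load balancer addresses are made up, and returning null when fewer than three values are present is an assumption, since the behavior for short x_forwarded_for values is not described here. The GeoIP database lookup itself is omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names are illustrative and not taken from the Decoder source.
class ForwardedForSketch {

  // Returns the third-to-last entry of a comma-separated x_forwarded_for value.
  // Assumption: returns null when fewer than three entries are present.
  static String clientIp(String xForwardedFor) {
    if (xForwardedFor == null) {
      return null;
    }
    String[] hops = xForwardedFor.split(",");
    if (hops.length < 3) {
      return null;
    }
    return hops[hops.length - 3].trim();
  }

  // Keeps the city when no filter is configured (modeled here as null)
  // or when the city's geo_name_id is in the configured set.
  static boolean keepCity(Set<Integer> allowedGeoNameIds, Integer geoNameId) {
    return allowedGeoNameIds == null || allowedGeoNameIds.contains(geoNameId);
  }

  // Removes the raw address attributes and any attributes whose value is null.
  static Map<String, String> cleanAttributes(Map<String, String> attributes) {
    Map<String, String> cleaned = new HashMap<>(attributes);
    cleaned.remove("x_forwarded_for");
    cleaned.remove("remote_addr");
    cleaned.values().removeIf(value -> value == null);
    return cleaned;
  }

  public static void main(String[] args) {
    // The last two addresses stand in for the forwarding rule and load balancer IPs.
    System.out.println(clientIp("63.245.208.195, 203.0.113.7, 198.51.100.9"));
  }
}
```

Copying the map before cleaning keeps the caller's attributes untouched, which makes the sketch easier to test in isolation.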
Parse URI
Attempt to extract attributes from uri. On failure, send the message to the
configured error output.
Decompress
Attempt to decompress payload with gzip. On failure, pass the message through unmodified.
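The pass-through behavior can be illustrated with a short sketch using the standard java.util.zip classes. The class and method names are hypothetical, and the gzip helper exists only to build sample input; the Decoder's real implementation may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch; names are illustrative and not taken from the Decoder source.
class GunzipSketch {

  // Returns the decompressed bytes when payload is valid gzip,
  // otherwise returns the original payload unchanged.
  static byte[] gunzipOrPassThrough(byte[] payload) {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(payload));
        ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buffer = new byte[4096];
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
      return out.toByteArray();
    } catch (IOException e) {
      // Not gzip, or corrupt: pass the message body through unmodified.
      return payload;
    }
  }

  // Helper used only to build sample input for this sketch.
  static byte[] gzip(byte[] raw) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] raw = "test".getBytes(StandardCharsets.UTF_8);
    // Both calls yield the same bytes: one is decompressed, the other passed through.
    System.out.println(new String(gunzipOrPassThrough(gzip(raw)), StandardCharsets.UTF_8));
    System.out.println(new String(gunzipOrPassThrough(raw), StandardCharsets.UTF_8));
  }
}
```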
Parse Payload
- Parse the message body as a UTF-8 encoded JSON payload
  - Drop specific fields or entire messages that match a specific set of signatures for toxic data that we want to make sure we do not store
  - Maintain counter metrics for each type of dropped message
- Validate the payload structure based on the JSON schema for the specified document type
  - Invalid messages are routed to error output
- Extract some additional attributes such as client_id and os_name based on the payload contents
Parse User Agent
Attempt to extract browser, browser version, and os from the user_agent
attribute, drop any nulls, and remove user_agent from attributes.
Write Metadata Into the Payload
Add a nested metadata field and several normalized_* attributes into the
payload body.
Executing
Decoder jobs are executed the same way as sink jobs but with a few extra flags:
- -Dexec.mainClass=com.mozilla.telemetry.Decoder
  - For Dataflow Flex Templates, change the docker-compose build argument to --build-arg FLEX_TEMPLATE_JAVA_MAIN_CLASS=com.mozilla.telemetry.Decoder
- --geoCityDatabase=/path/to/GeoIP2-City.mmdb
- --geoCityFilter=/path/to/cities15000.txt (optional)
To download the GeoLite2 database,
you need to register for a MaxMind account
to obtain a license key. After generating a new license key, set MM_LICENSE_KEY to
your license key.
Example:
# create a test input file
mkdir -p tmp/
echo '{"payload":"dGVzdA==","attributeMap":{"remote_addr":"63.245.208.195"}}' > tmp/input.json
# Download `cities15000.txt`, `GeoLite2-City.mmdb`, and `schemas.tar.gz`
./bin/download-cities15000
./bin/download-schemas
export MM_LICENSE_KEY="Your MaxMind License Key"
./bin/download-geolite2
# do geo lookup on messages to stdout
./bin/mvn compile exec:java -Dexec.mainClass=com.mozilla.telemetry.Decoder -Dexec.args="\
--geoCityDatabase=GeoLite2-City.mmdb \
--geoCityFilter=cities15000.txt \
--schemasLocation=schemas.tar.gz \
--inputType=file \
--input=tmp/input.json \
--outputType=stdout \
--errorOutputType=stderr \
"
# check the DecoderOptions help page for options specific to Decoder
./bin/mvn compile exec:java -Dexec.args=--help=DecoderOptions