Decoder Job
A job for normalizing ingestion messages. Defined in the com.mozilla.telemetry.Decoder
class (source).
Transforms
These transforms are currently executed against each message in order.
GeoIP Lookup
- Extract ip from the x_forwarded_for attribute
  - Use the third-to-last value (since the second-to-last value is a forwarding rule IP added by Google load balancer, and the last value is a Google load balancer IP added by nginx)
- Execute the following steps until one fails and ignore the exception
  - Parse ip using InetAddress.getByName
  - Lookup ip in the configured GeoIP2City.mmdb
  - Extract country.iso_code as geo_country
  - Extract city.name as geo_city if cities15000.txt is not configured or city.geo_name_id is in the configured cities15000.txt
  - Extract subdivisions[0].iso_code as geo_subdivision1
  - Extract subdivisions[1].iso_code as geo_subdivision2
- Remove the x_forwarded_for and remote_addr attributes
- Remove any null values added to attributes
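
The GeoIP steps above can be sketched roughly as follows. This is a minimal Python illustration, not the Decoder's Java implementation. The names client_ip, add_geo, lookup, and city_ids are hypothetical: lookup stands in for the GeoIP2 database reader, city_ids stands in for the IDs loaded from cities15000.txt, and Python's ipaddress module stands in for InetAddress.getByName. Returning None when fewer than three values are present is an assumption, since the text above does not cover that case.

```python
import ipaddress


def client_ip(x_forwarded_for):
    """Return the third-to-last entry of a comma-separated header, or None."""
    if not x_forwarded_for:
        return None
    hops = [hop.strip() for hop in x_forwarded_for.split(",")]
    if len(hops) < 3:
        return None  # assumption: too few entries means no usable client IP
    return hops[-3]


def add_geo(attributes, lookup, city_ids=None):
    """Add geo_* attributes, stopping quietly at the first step that fails.

    lookup is a hypothetical callable mapping an IP string to a nested dict
    shaped like a GeoIP2 city record; city_ids is an optional set of allowed
    geo_name_id values.
    """
    out = dict(attributes)
    try:
        ip = client_ip(out.get("x_forwarded_for"))
        ipaddress.ip_address(ip)  # raises ValueError for None or a malformed IP
        record = lookup(ip)
        out["geo_country"] = record["country"]["iso_code"]
        city = record["city"]
        if city_ids is None or city["geo_name_id"] in city_ids:
            out["geo_city"] = city["name"]
        out["geo_subdivision1"] = record["subdivisions"][0]["iso_code"]
        out["geo_subdivision2"] = record["subdivisions"][1]["iso_code"]
    except Exception:
        pass  # keep whatever was extracted before the failing step
    # Always strip the raw IP attributes, then drop null values.
    out.pop("x_forwarded_for", None)
    out.pop("remote_addr", None)
    return {key: value for key, value in out.items() if value is not None}
```

Because each extraction runs inside one try block, a record with only one subdivision still yields geo_country, geo_city, and geo_subdivision1 in this sketch; only the step that fails and those after it are skipped.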
Parse URI
Attempt to extract attributes from uri; on failure, send messages to the
configured error output.
Decompress
Attempt to decompress the payload with gzip; on failure, pass the message through unmodified.
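
The pass-through behavior can be illustrated with a short Python sketch. The function name maybe_gunzip is hypothetical, and this is not the Decoder's Java code; it only shows the "try gzip, otherwise keep the bytes as they are" pattern described above.

```python
import gzip
import zlib


def maybe_gunzip(payload: bytes) -> bytes:
    """Return the gunzipped payload, or the original bytes if it is not valid gzip."""
    try:
        return gzip.decompress(payload)
    except (OSError, EOFError, zlib.error):
        # Bad magic number, truncated stream, or corrupt data: pass through.
        return payload
```

With this approach, senders that never compress their payloads need no special handling, since their messages simply fail the gzip check and continue unchanged.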
Parse Payload
- Parse the message body as a UTF-8 encoded JSON payload
  - Drop specific fields or entire messages that match a specific set of signatures for toxic data that we want to make sure we do not store
  - Maintain counter metrics for each type of dropped message
- Validate the payload structure based on the JSON schema for the specified document type
  - Invalid messages are routed to error output
- Extract some additional attributes such as client_id and os_name based on the payload contents
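
As a rough illustration of the parse-then-validate routing, the Python sketch below decodes a body as UTF-8 JSON and tags the result as either main output or error output. The names parse_payload and validate are hypothetical; validate is a placeholder for JSON schema validation and is assumed to raise ValueError on an invalid document. The toxic-data filtering and attribute extraction steps are not modeled here.

```python
import json


def parse_payload(body: bytes, validate=lambda doc: None):
    """Return ("output", doc) on success or ("error", reason) on failure.

    validate is a hypothetical hook standing in for schema validation;
    it should raise ValueError when the document is invalid.
    """
    try:
        doc = json.loads(body.decode("utf-8"))
        validate(doc)
    except ValueError as exc:
        # UnicodeDecodeError and json.JSONDecodeError both subclass ValueError.
        return "error", str(exc)
    return "output", doc
```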
Parse User Agent
Attempt to extract browser, browser version, and os from the user_agent attribute,
drop any nulls, and remove user_agent from attributes.
Write Metadata Into the Payload
Add a nested metadata field and several normalized_* attributes into the payload body.
Executing
Decoder jobs are executed the same way as sink jobs but with a few extra flags:
- -Dexec.mainClass=com.mozilla.telemetry.Decoder
  - For Dataflow Flex Templates, change the docker-compose build argument to --build-arg FLEX_TEMPLATE_JAVA_MAIN_CLASS=com.mozilla.telemetry.Decoder
- --geoCityDatabase=/path/to/GeoIP2-City.mmdb
- --geoCityFilter=/path/to/cities15000.txt (optional)
To download the GeoLite2 database, you need to register for a MaxMind account
to obtain a license key. After generating a new license key, set MM_LICENSE_KEY
to your license key.
Example:
# create a test input file
mkdir -p tmp/
echo '{"payload":"dGVzdA==","attributeMap":{"remote_addr":"63.245.208.195"}}' > tmp/input.json
# Download `cities15000.txt`, `GeoLite2-City.mmdb`, and `schemas.tar.gz`
./bin/download-cities15000
./bin/download-schemas
export MM_LICENSE_KEY="Your MaxMind License Key"
./bin/download-geolite2
# do geo lookup on messages to stdout
./bin/mvn compile exec:java -Dexec.mainClass=com.mozilla.telemetry.Decoder -Dexec.args="\
--geoCityDatabase=GeoLite2-City.mmdb \
--geoCityFilter=cities15000.txt \
--schemasLocation=schemas.tar.gz \
--inputType=file \
--input=tmp/input.json \
--outputType=stdout \
--errorOutputType=stderr \
"
# check the DecoderOptions help page for options specific to Decoder
./bin/mvn compile exec:java -Dexec.args=--help=DecoderOptions