Apache Beam Jobs for Ingestion
This ingestion-beam Java module contains our Apache Beam jobs for ingestion. Google Cloud Dataflow is a Google Cloud Platform service that runs Apache Beam jobs natively.
The source code lives in the ingestion-beam subdirectory of the gcp-ingestion repository.
There are currently three jobs defined; see the respective section of the documentation for each:
- Sink job: A job for delivering messages between Google Cloud services
  - Deprecated in favor of ingestion-sink
- Decoder job: A job for normalizing ingestion messages
- Republisher job: A job for republishing subsets of decoded messages to new destinations
Move to the ingestion-beam subdirectory of your gcp-ingestion checkout and run:
./bin/mvn clean compile
See the section on each job below for how to run what you've built.
Before anything else, be sure to download the test data:
./bin/download-cities15000
./bin/download-geolite2
./bin/download-schemas
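The three downloads can also be wrapped so that a missing or failing script stops the run early. The following is only a sketch: the `download_test_data` function and its optional directory argument are hypothetical names introduced here, not part of the repository.

```shell
# Hypothetical helper, not part of the repository: run the three download
# scripts in order and stop at the first one that is missing or fails.
download_test_data() {
  local bin_dir="${1:-./bin}"   # directory holding the download scripts
  local script
  for script in download-cities15000 download-geolite2 download-schemas; do
    if [ ! -x "${bin_dir}/${script}" ]; then
      echo "missing or not executable: ${bin_dir}/${script}" >&2
      return 1
    fi
    "${bin_dir}/${script}" || return 1
  done
}
```

Called with no arguments from the ingestion-beam directory, it uses ./bin.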
Run tests locally with the CircleCI Local CLI:
(cd .. && circleci build --job ingestion-beam)
To make more targeted test invocations, you can install Java and Maven locally or use the bin/mvn executable to run Maven in Docker:
./bin/mvn clean test
If you wish to just run a single test class or a single test case, try something like this:
# Run all tests in a single class
./bin/mvn test -Dtest=com.mozilla.telemetry.util.SnakeCaseTest
# Run only a single test case
./bin/mvn test -Dtest='com.mozilla.telemetry.util.SnakeCaseTest#testSnakeCaseFormat'
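Both invocations follow the same pattern: the value of `-Dtest` is a class name, optionally followed by `#` and a method name. A minimal sketch of that pattern is shown here. The names `test_selector` and `run_tests` and the `MVN` override variable are hypothetical, added for illustration only.

```shell
# Hypothetical helpers, not part of the repository.

# Build a -Dtest selector: "Class" or "Class#method".
test_selector() {
  if [ -n "${2:-}" ]; then
    printf '%s#%s\n' "$1" "$2"
  else
    printf '%s\n' "$1"
  fi
}

# Run the selected tests. MVN is an assumed override for the Maven wrapper;
# it defaults to ./bin/mvn as used in the commands above.
run_tests() {
  "${MVN:-./bin/mvn}" test "-Dtest=$(test_selector "$@")"
}
```

For example, `run_tests com.mozilla.telemetry.util.SnakeCaseTest testSnakeCaseFormat` runs the single test case, and omitting the second argument runs the whole class. Passing the selector as one quoted argument is a precaution against shells that treat `#` specially.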
To run the project in a sandbox against production data, see this document on configuring an integration testing workflow.
Use Spotless to automatically reformat code:
or just check what changes it requires:
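As a sketch of both steps, assuming Spotless is configured as a Maven plugin in this module and that bin/mvn passes its arguments through to Maven, the plugin's standard `spotless:apply` and `spotless:check` goals map onto them. The `run_spotless` wrapper and the `MVN` override variable are hypothetical names introduced here.

```shell
# Hypothetical wrapper, not part of the repository.
# Assumption: Spotless is set up as a Maven plugin in this module's pom.xml.
# "apply" rewrites files in place; "check" only reports required changes.
run_spotless() {
  case "${1:-check}" in
    apply|check)
      "${MVN:-./bin/mvn}" "spotless:${1:-check}"
      ;;
    *)
      echo "usage: run_spotless [apply|check]" >&2
      return 2
      ;;
  esac
}
```

`run_spotless apply` reformats the code, while `run_spotless check` (the default) fails if any file would be changed.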