A running list of things that are suboptimal in GCP.
For network-bound applications it can be prohibitively expensive. A PubSub push
subscription application that decodes protobuf and forwards messages to the
~300 instances at
$0.06 per instance hour to handle
~5krps, which is
Replaces certain components with custom behavior that is not part of the open source Beam API, so they cannot be extended (e.g. to expose a stream of messages that have been delivered to PubSub).
PubsubMessage.payload from JSON to a
TableRow, which gets
encoded as JSON to be sent to BigQuery.
Crashes the pipeline when the destination table does not exist.
Acknowledges messages in PubSub before they are written to accumulate data
across multiple bundles and produce reasonably sized files. Possible
workaround being investigated in #380. This also affects
in batch mode.
Does not support dynamic destinations.
Does not support NestedValueProvider for destinations in streaming mode on
Dataflow, which is needed to create classic templates that accept a mapping of
document type to a predetermined number of destinations. This is because
Dataflow moves the implementation into the shuffler to improve performance.
Current workaround is to specify mapping at classic template creation time, or
use Flex Templates.
Does not use the standard client library.
Does not expose an output of delivered messages, which is needed for
at-least-once delivery with deduplication. Current workaround is to use the deduplication
Uses the HTTPS JSON API, which requires base64 encoding and so increases message payload size by about 33% compared to protobuf, causing some messages to exceed the 10MB request size limit that otherwise would not.
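The size penalty can be checked with a little arithmetic. Base64 emits 4 output bytes for every 3 input bytes, so the encoded form is about 4/3 the size of the raw payload (equivalently, the raw payload is 25% smaller than its encoding). The sketch below is only an illustration: it assumes the 10MB limit means 10,000,000 bytes and ignores the JSON envelope around the encoded data, and the helper names are hypothetical.

```python
import base64
import math

# Assumption: "10MB" is read as 10,000,000 bytes; the real limit may be
# defined differently, and the JSON envelope adds further overhead.
REQUEST_LIMIT_BYTES = 10 * 1000 * 1000


def base64_size(raw_len: int) -> int:
    """Length of standard padded base64 output for raw_len input bytes."""
    return 4 * math.ceil(raw_len / 3)


def max_raw_payload(limit: int = REQUEST_LIMIT_BYTES) -> int:
    """Largest raw payload whose base64 encoding still fits within limit."""
    return (limit // 4) * 3


# Cross-check the formula against the standard library encoder.
payload = bytes(3_000_000)
encoded = base64.b64encode(payload)
assert len(encoded) == base64_size(len(payload))

print(len(encoded) / len(payload))  # ratio of encoded size to raw size
print(max_raw_payload())            # raw bytes that fit after encoding
```

Under these assumptions, a raw payload larger than roughly 7.5MB no longer fits in a single request once encoded, even though it would fit if sent as protobuf bytes.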
Does not support repeated parameters via
described in Dataflow Java SDK #632.
Can be prohibitively expensive. It costs
to use PubSub with a
70MiB/s stream published or consumed 7 times (Edge to
raw topic, raw topic to Cloud Storage, raw topic to Decoder, Decoder to decoded
topic, decoded topic to Decoder for deduplication, decoded topic to Cloud
Storage, decoded topic to BigQuery).
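To see why the volume adds up, here is a rough sketch of the monthly throughput implied by a 70MiB/s stream that is published or consumed 7 times. The function names and the 30-day month are assumptions made for illustration, and the price per TiB is a hypothetical placeholder rather than a quoted PubSub price.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def monthly_volume_tib(stream_mib_per_s: float, passes: int, days: int = 30) -> float:
    """Total TiB moved through PubSub in a month of `days` days (assumed 30)."""
    seconds = days * 24 * 60 * 60
    return stream_mib_per_s * passes * seconds * MIB / TIB


def monthly_cost(volume_tib: float, price_per_tib: float) -> float:
    """Flat-rate cost estimate; real pricing may be tiered or include free quota."""
    return volume_tib * price_per_tib


# Hypothetical placeholder, not a quoted price: substitute the current rate.
EXAMPLE_PRICE_PER_TIB = 40.0

volume = monthly_volume_tib(70, 7)
print(f"{volume:.0f} TiB/month")  # prints "1211 TiB/month"
estimate = monthly_cost(volume, EXAMPLE_PRICE_PER_TIB)
```

Because every hop in the pipeline counts as a separate publish or consume, removing a single hop reduces the total volume by one seventh.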
Push Subscriptions are limited to
min(10MB, 1000 messages) in flight, making
the theoretical maximum parallel latency per message ~
62ms to achieve