---
title: "Basics of the Beam model"
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Basics of the Beam model
Suppose you have a data processing engine that can pretty easily process graphs
of operations. You want to integrate it with the Beam ecosystem to get access
to other languages, great event time processing, and a library of connectors.
You need to know the core vocabulary:
* [_Pipeline_](#pipeline) - A pipeline is a graph of transformations that a user constructs
that defines the data processing they want to do.
* [_PCollection_](#pcollections) - Data being processed in a pipeline is part of a PCollection.
* [_PTransforms_](#ptransforms) - The operations executed within a pipeline. These are best
thought of as operations on PCollections.
* _SDK_ - A language-specific library for pipeline authors (we often call them
"users" even though we have many kinds of users) to build transforms,
construct their pipelines, and submit them to a runner.
* _Runner_ - You are going to write a piece of software called a runner that
takes a Beam pipeline and executes it using the capabilities of your data
processing engine.
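To make these concrete, here is a minimal sketch using the Java SDK; the class
name `MinimalPipeline` and the example data are illustrative:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalPipeline {
  public static void main(String[] args) {
    // The SDK constructs the pipeline graph.
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // PCollections are produced and consumed by PTransforms.
    PCollection<String> words = pipeline.apply(Create.of("hello", "beam"));
    PCollection<Integer> lengths =
        words.apply(
            MapElements.into(TypeDescriptors.integers())
                .via((String word) -> word.length()));

    // The runner (selected via --runner=...) executes the graph.
    pipeline.run().waitUntilFinish();
  }
}
```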
These concepts may be very similar to your processing engine's concepts. Since
Beam's design is for cross-language operation and reusable libraries of
transforms, there are some special features worth highlighting.
### Pipeline
A pipeline in Beam is a graph of PTransforms operating on PCollections. A
pipeline is constructed by a user in their SDK of choice, and makes its way to
your runner either via the SDK directly or via the Runner API's (forthcoming)
RPC interfaces.
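For example, with the Java SDK your translator can walk the constructed graph
with a visitor; a minimal sketch, where the printed message stands in for real
translation logic:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.runners.TransformHierarchy;

// Walk the pipeline graph in topological order, as a translator typically does.
void translate(Pipeline pipeline) {
  pipeline.traverseTopologically(
      new Pipeline.PipelineVisitor.Defaults() {
        @Override
        public void visitPrimitiveTransform(TransformHierarchy.Node node) {
          System.out.println("translate primitive: " + node.getFullName());
        }
      });
}
```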
### PTransforms
In Beam, a PTransform can be one of the five primitives or it can be a
composite transform encapsulating a subgraph. The primitives are:
* [_Read_](#implementing-the-read-primitive) - parallel connectors to external
systems
* [_ParDo_](#implementing-the-pardo-primitive) - per element processing
* [_GroupByKey_](#implementing-the-groupbykey-and-window-primitive) -
aggregating elements per key and window
* [_Flatten_](#implementing-the-flatten-primitive) - union of PCollections
* [_Window_](#implementing-the-window-primitive) - set the windowing strategy
for a PCollection
When implementing a runner, these are the operations you need to implement.
Composite transforms may or may not be important to your runner. If you expose
a UI, maintaining some of the composite structure will make the pipeline easier
for a user to understand. But the result of processing is not changed.
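For example, in the Java SDK a composite is just a `PTransform` whose `expand`
method builds the subgraph; the `CountWords` transform below is an illustrative
sketch:

```java
import java.util.Arrays;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// A composite transform: expand() wires up a subgraph of other transforms,
// ultimately bottoming out in the five primitives.
class CountWords
    extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
    return lines
        .apply(
            FlatMapElements.into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("\\s+"))))
        .apply(Count.perElement()); // itself a composite over GroupByKey
  }
}
```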
### PCollections
A PCollection is an unordered bag of elements. Your runner will be responsible
for storing these elements. There are some major aspects of a PCollection to
note:
#### Bounded vs Unbounded
A PCollection may be bounded or unbounded.
- _Bounded_ - it is finite and you know it, as in batch use cases
- _Unbounded_ - it may never end, you don't know, as in streaming use cases
These derive from the intuitions of batch and stream processing, but the two
are unified in Beam and bounded and unbounded PCollections can coexist in the
same pipeline. If your runner can only support bounded PCollections, you'll
need to reject pipelines that contain unbounded PCollections. If your
runner is only really targeting streams, there are adapters in our support code
to convert everything to APIs targeting unbounded data.
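In the Java SDK, each PCollection carries this property, so a bounded-only
runner can check it during translation; a sketch:

```java
import org.apache.beam.sdk.values.PCollection;

// A bounded-only runner might reject unbounded collections at translation time.
void checkSupported(PCollection<?> pcollection) {
  if (pcollection.isBounded() == PCollection.IsBounded.UNBOUNDED) {
    throw new UnsupportedOperationException(
        "This runner only supports bounded PCollections: " + pcollection.getName());
  }
}
```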
#### Timestamps
Every element in a PCollection has a timestamp associated with it.
When you execute a primitive connector to some storage system, that connector
is responsible for providing initial timestamps. Your runner will need to
propagate and aggregate timestamps. If timestamps are not important, as with
certain batch processing jobs where elements do not denote events, they will be
the minimum representable timestamp, often referred to colloquially as
"negative infinity".
#### Watermarks
Every PCollection has to have a watermark that estimates how complete the
PCollection is.
The watermark is a guess that "we'll never see an element with an earlier
timestamp". Sources of data are responsible for producing a watermark. Your
runner needs to implement watermark propagation as PCollections are processed,
merged, and partitioned.
The contents of a PCollection are complete when a watermark advances to
"infinity". In this manner, you may discover that an unbounded PCollection is
finite.
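One common (though not mandated) propagation rule is that a step's output
watermark never exceeds the minimum of its input watermarks; a simplified
sketch in the Java SDK's terms:

```java
import java.util.List;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Instant;

// Simplified: an output watermark is held back by the slowest input.
// BoundedWindow.TIMESTAMP_MAX_VALUE is the "end of time" / "infinity" value.
Instant outputWatermark(List<Instant> inputWatermarks) {
  Instant min = BoundedWindow.TIMESTAMP_MAX_VALUE;
  for (Instant watermark : inputWatermarks) {
    if (watermark.isBefore(min)) {
      min = watermark;
    }
  }
  return min;
}
```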
#### Windowed elements
Every element in a PCollection resides in a window. No element resides in
multiple windows; two elements can be equal except for their window, but they
are not the same element.
When elements are read from the outside world they arrive in the global window.
When they are written to the outside world, they are effectively placed back
into the global window (any writing transform that doesn't take this
perspective probably risks data loss).
A window has a maximum timestamp, and when the watermark exceeds this plus the
user-specified allowed lateness, the window is expired. All data related
to an expired window may be discarded at any time.
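A sketch of the expiry check in the Java SDK's terms, where
`BoundedWindow.maxTimestamp()` is the window's maximum timestamp:

```java
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Duration;
import org.joda.time.Instant;

// A window is expired once the input watermark passes its maximum timestamp
// plus the pipeline's allowed lateness; its state may then be dropped.
boolean isExpired(BoundedWindow window, Duration allowedLateness, Instant inputWatermark) {
  return inputWatermark.isAfter(window.maxTimestamp().plus(allowedLateness));
}
```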
#### Coder
Every PCollection has a coder, a specification of the binary format of the elements.
In Beam, the user's pipeline may be written in a language other than the
language of the runner. There is no expectation that the runner can actually
deserialize user data. So the Beam model operates principally on encoded data -
"just bytes". Each PCollection has a declared encoding for its elements, called
a coder. A coder has a URN that identifies the encoding, and may have
additional sub-coders (for example, a coder for lists may contain a coder for
the elements of the list). Language-specific serialization techniques can be,
and frequently are, used, but there are a few key formats - such as key-value
pairs and timestamps - that are common enough that your runner can understand
them.
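For example, in the Java SDK a key-value coder is composed from sub-coders, and
a runner can round-trip elements through "just bytes" without understanding the
user's types; a sketch using the SDK's `CoderUtils` helper:

```java
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.values.KV;

// A composite coder: KvCoder delegates to one sub-coder per component.
// The runner shuttles the element around as encoded bytes.
KV<String, Long> roundTrip(KV<String, Long> element) throws CoderException {
  Coder<KV<String, Long>> coder = KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of());
  byte[] bytes = CoderUtils.encodeToByteArray(coder, element);
  return CoderUtils.decodeFromByteArray(coder, bytes);
}
```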
#### Windowing Strategy
Every PCollection has a windowing strategy, a specification of essential
information for grouping and triggering operations.
The details will be discussed below when we discuss the
[Window](#implementing-the-window-primitive) primitive, which sets up the
windowing strategy, and
[GroupByKey](#implementing-the-groupbykey-and-window-primitive) primitive,
which has behavior governed by the windowing strategy.
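As a brief preview before those sections, a pipeline author sets the strategy
with the Java SDK's `Window` transform; a sketch:

```java
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Window.into only updates the PCollection's windowing strategy; the
// strategy actually takes effect at the next GroupByKey.
PCollection<String> window(PCollection<String> input) {
  return input.apply(
      Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
          .triggering(AfterWatermark.pastEndOfWindow())
          .withAllowedLateness(Duration.standardSeconds(30))
          .discardingFiredPanes());
}
```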
### User-Defined Functions (UDFs)
Beam has seven varieties of user-defined function (UDF). A Beam pipeline
may contain UDFs written in a language other than your runner's, or even multiple
languages in the same pipeline (see the [Runner API](#the-runner-api)) so the
definitions are language-independent (see the [Fn API](#the-fn-api)).
The UDFs of Beam are:
* _DoFn_ - per-element processing function (used in ParDo)
* _WindowFn_ - places elements in windows and merges windows (used in Window
and GroupByKey)
* _Source_ - emits data read from external sources, including initial and
dynamic splitting for parallelism (used in Read)
* _ViewFn_ - adapts a materialized PCollection to a particular interface (used
in side inputs)
* _WindowMappingFn_ - maps one element's window to another, and specifies
bounds on how far in the past the result window will be (used in side
inputs)
* _CombineFn_ - associative and commutative aggregation (used in Combine and
state)
* _Coder_ - encodes user data; some coders have standard formats and are not really UDFs
The various types of user-defined functions will be described further alongside
the primitives that use them.
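As one example, a `DoFn`, the per-element UDF run by ParDo, looks like this in
the Java SDK; the word-splitting logic is illustrative:

```java
import org.apache.beam.sdk.transforms.DoFn;

// A DoFn is the per-element UDF executed by the ParDo primitive.
class ExtractWordsFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(@Element String line, OutputReceiver<String> out) {
    for (String word : line.split("\\s+")) {
      if (!word.isEmpty()) {
        out.output(word);
      }
    }
  }
}
```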
### Runner
The term "runner" is used for a couple of things. It generally refers to the
software that takes a Beam pipeline and executes it somehow. Often, this is the
translation code that you write. It usually also includes some customized
operators for your data processing engine, and is sometimes used to refer to
the full stack.
A runner has just a single method `run(Pipeline)`. From here on, I will often
use code font for proper nouns in our APIs, whether or not the identifiers
match across all SDKs.
The `run(Pipeline)` method should be asynchronous and result in a
PipelineResult, which generally will be a job descriptor for your data
processing engine, providing methods for checking its status, canceling it, and
waiting for it to terminate.
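In the Java SDK, this contract is the abstract `PipelineRunner` class; a
skeleton runner might look like the following sketch, where `MyRunner` and the
translation details are placeholders:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.PipelineRunner;
import org.apache.beam.sdk.options.PipelineOptions;

// Skeleton of a runner: translate the pipeline, submit it to the engine,
// and return a job handle without blocking on completion.
class MyRunner extends PipelineRunner<PipelineResult> {
  private final PipelineOptions options;

  private MyRunner(PipelineOptions options) {
    this.options = options;
  }

  // By convention, runners are instantiated via a static fromOptions factory.
  public static MyRunner fromOptions(PipelineOptions options) {
    return new MyRunner(options);
  }

  @Override
  public PipelineResult run(Pipeline pipeline) {
    // Traverse and translate the graph, submit the resulting job to the
    // engine, and wrap its handle in a PipelineResult implementation.
    throw new UnsupportedOperationException("translation not implemented");
  }
}
```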