Unverified commit 6b927423, authored by Rebecca Szper, committed by GitHub

Editorial pass on content (#29094)

Parent: e7a64058
@@ -261,7 +261,7 @@ BigQuery's exported JSON format.
{{< paragraph class="language-py" >}}
***Note:*** `BigQuerySource()` is deprecated as of Beam SDK 2.25.0. Before 2.25.0, to read from
a BigQuery table using the Beam SDK, apply a `Read` transform on a `BigQuerySource`. For example,
`beam.io.Read(beam.io.BigQuerySource(table_spec))`.
{{< /paragraph >}}
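{{< paragraph class="language-py" >}}
As a rough sketch of the replacement pattern in Beam 2.25.0 and later, you can
read the table directly with `ReadFromBigQuery`. The table spec below is a
placeholder, and the export-based read typically also needs a `--temp_location`
pipeline option pointing to a Cloud Storage path:
{{< /paragraph >}}

{{< highlight py >}}
import apache_beam as beam

with beam.Pipeline() as pipeline:
    rows = (
        pipeline
        # 'my-project:my_dataset.my_table' is a hypothetical table spec.
        | 'ReadTable' >> beam.io.ReadFromBigQuery(table='my-project:my_dataset.my_table')
    )
{{< /highlight >}}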
@@ -397,8 +397,8 @@ for the destination table(s):
whether the destination table must exist or can be created by the write
operation.
* The destination table's write disposition. The write disposition specifies
whether the data you write replaces an existing table, appends rows to an
existing table, or writes only to an empty table.
In addition, if your write operation creates a new BigQuery table, you must also
supply a table schema for the destination table.
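{{< paragraph class="language-py" >}}
As an illustrative sketch in Python, assuming `quotes` is a `PCollection` of
dictionaries and that the table name and schema string below are placeholders,
both dispositions and the schema are passed directly to `WriteToBigQuery`:
{{< /paragraph >}}

{{< highlight py >}}
import apache_beam as beam

quotes | beam.io.WriteToBigQuery(
    'my-project:my_dataset.quotes',        # hypothetical destination table
    schema='source:STRING, quote:STRING',  # required if the write may create the table
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
{{< /highlight >}}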
@@ -512,7 +512,7 @@ use a string that contains a JSON-serialized `TableSchema` object.
To create a table schema in Python, you can either use a `TableSchema` object,
or use a string that defines a list of fields. Single string based schemas do
not support nested fields, repeated fields, or specifying a BigQuery mode for
fields (the mode is always set to `NULLABLE`).
{{< /paragraph >}}
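{{< paragraph class="language-py" >}}
For example, a minimal string-based schema might look like the following
sketch; the field names are only examples. Each field is declared as
`name:TYPE`, separated by commas:
{{< /paragraph >}}

{{< highlight py >}}
# Every field declared this way is NULLABLE; nested fields, repeated fields,
# and other modes are not expressible in a string-based schema.
table_schema = 'source:STRING, quote:STRING'
{{< /highlight >}}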
#### Using a TableSchema
@@ -539,7 +539,7 @@ To create and use a table schema as a `TableSchema` object, follow these steps.
2. Create and append a `TableFieldSchema` object for each field in your table.
3. Use the `schema` parameter to provide your table schema when you apply
a write transform. Set the parameter’s value to the `TableSchema` object.
{{< /paragraph >}}
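{{< paragraph class="language-py" >}}
A minimal sketch of these steps in Python, with illustrative field names, uses
the `TableSchema` and `TableFieldSchema` classes from the BigQuery client
bundled with the Beam SDK:
{{< /paragraph >}}

{{< highlight py >}}
from apache_beam.io.gcp.internal.clients import bigquery

table_schema = bigquery.TableSchema()

# Describe one field and append it to the schema (repeat for each field).
source_field = bigquery.TableFieldSchema()
source_field.name = 'source'
source_field.type = 'STRING'
source_field.mode = 'NULLABLE'
table_schema.fields.append(source_field)

# Then pass the schema to the write transform, for example:
# quotes | beam.io.WriteToBigQuery(table_spec, schema=table_schema, ...)
{{< /highlight >}}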
@@ -728,8 +728,8 @@ The following examples use this `PCollection` that contains quotes.
The `writeTableRows` method writes a `PCollection` of BigQuery `TableRow`
objects to a BigQuery table. Each element in the `PCollection` represents a
single row in the table. This example uses `writeTableRows` to write elements of a
`PCollection<TableRow>` to a BigQuery table. The write operation creates a table if needed. If the
table already exists, it is replaced.
{{< /paragraph >}}
{{< highlight java >}}
@@ -745,7 +745,7 @@ table already exists, it will be replaced.
{{< paragraph class="language-py" >}}
The following example code shows how to apply a `WriteToBigQuery` transform to
write a `PCollection` of dictionaries to a BigQuery table. The write operation
creates a table if needed. If the table already exists, it is replaced.
{{< /paragraph >}}
{{< highlight py >}}
@@ -759,8 +759,8 @@ The `write` transform writes a `PCollection` of custom typed objects to a BigQuery
table. Use `.withFormatFunction(SerializableFunction)` to provide a formatting
function that converts each input element in the `PCollection` into a
`TableRow`. This example uses `write` to write a `PCollection<String>`. The
write operation creates a table if needed. If the table already exists, it is
replaced.
{{< /paragraph >}}
{{< highlight java >}}
@@ -786,7 +786,7 @@ BigQuery Storage Write API for Python SDK currently has some limitations on supported data types.
{{< /paragraph >}}
{{< paragraph class="language-py" >}}
**Note:** If you want to run WriteToBigQuery with Storage Write API from the source code, you need to run `./gradlew :sdks:java:io:google-cloud-platform:expansion-service:build` to build the expansion-service jar. If you are running from a released Beam SDK, the jar is already included.
**Note:** Auto sharding is not currently supported for Python's Storage Write API exactly-once mode on DataflowRunner.
@@ -877,32 +877,33 @@ Similar to streaming inserts, `STORAGE_WRITE_API` supports dynamically determining
the number of parallel streams to write to BigQuery (starting 2.42.0). You can
explicitly enable this using [`withAutoSharding`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.html#withAutoSharding--).
`STORAGE_WRITE_API` defaults to dynamic sharding when
`numStorageWriteApiStreams` is set to 0 or is unspecified.
***Note:*** Auto sharding with `STORAGE_WRITE_API` is supported by Dataflow, but **not** on Runner v2.
{{< /paragraph >}}
When using `STORAGE_WRITE_API`, the `PCollection` returned by
[`WriteResult.getFailedStorageApiInserts`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/WriteResult.html#getFailedStorageApiInserts--)
contains the rows that failed to be written to the Storage Write API sink.
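{{< paragraph class="language-py" >}}
In the Python SDK, the write method is selected on `WriteToBigQuery`. The
following is a minimal sketch: the table and schema are placeholders, the
triggering frequency shown applies only to streaming pipelines, and the
cross-language expansion service must be available as noted above:
{{< /paragraph >}}

{{< highlight py >}}
import apache_beam as beam

quotes | beam.io.WriteToBigQuery(
    'my-project:my_dataset.quotes',        # hypothetical destination table
    schema='source:STRING, quote:STRING',
    method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
    triggering_frequency=5)                # example value in seconds; streaming only
{{< /highlight >}}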
#### At-least-once semantics
If your use case allows for potential duplicate records in the target table, you
can use the
[`STORAGE_API_AT_LEAST_ONCE`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#STORAGE_API_AT_LEAST_ONCE)
method. This method doesn’t persist the records to be written to
BigQuery into its shuffle storage, which is needed to provide the exactly-once semantics
of the `STORAGE_WRITE_API` method. Therefore, for most pipelines, using this method is often
less expensive and results in lower latency.
If you use `STORAGE_API_AT_LEAST_ONCE`, you don’t need to
specify the number of streams, and you can’t specify the triggering frequency.
Auto sharding is not applicable for `STORAGE_API_AT_LEAST_ONCE`.
When using `STORAGE_API_AT_LEAST_ONCE`, the `PCollection` returned by
[`WriteResult.getFailedStorageApiInserts`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/WriteResult.html#getFailedStorageApiInserts--)
contains the rows that failed to be written to the Storage Write API sink.
#### Quotas
@@ -16,46 +16,58 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
# Unrecoverable errors in Beam Python

Unrecoverable errors are issues that occur at job start-up time and
prevent jobs from ever running successfully. The problem usually stems
from a misconfiguration. This page provides context about
common errors and troubleshooting information.
## Job submission or Python runtime version mismatch {#python-version-mismatch}
If the Python version that you use to submit your job doesn't match the
Python version used to build the worker container, the job doesn't run.
The job fails immediately after job submission.

To resolve this issue, ensure that the Python version used to submit the job
matches the Python container version.
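As a quick local sanity check (not a Beam API), you can print the
submission-side interpreter version before launching the job and compare it
with the Python version of the worker container image:

```py
import sys

# For example, a worker container built for Python 3.11 requires
# submitting the job from a Python 3.11 interpreter.
print(sys.version_info[:2])
```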
## Dependency resolution failures with pip {#dependency-resolution-failures}
During worker start-up, the worker might fail and, depending on the
runner, try to restart.

Before workers accept work, dependencies are checked and installed in
the worker container. If a pipeline requires
dependencies not already present in the runtime environment,
they are installed at this time.
When a problem occurs during this process, you might encounter
dependency resolution failures.
Examples of problems include the following:

- A dependency version can't be found.
- A worker can't connect to PyPI.
To resolve this issue, before submitting your job, ensure that the
dependency versions provided in your `requirements.txt` file exist
and that you can install them locally.
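One way to have the worker install these dependencies, sketched here with a
hypothetical file name, is to pass a requirements file through the pipeline
options:

```py
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# 'requirements.txt' is a placeholder. Pin versions that you have verified
# install locally, for example with `pip install -r requirements.txt`.
options = PipelineOptions(['--requirements_file=requirements.txt'])

with beam.Pipeline(options=options) as pipeline:
    ...  # pipeline transforms go here
```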
## Dependency version mismatches {#dependency-version}
When your pipeline has dependency version mismatches, you might
see `ModuleNotFound` errors or `AttributeError` messages.
- The `ModuleNotFound` errors occur when additional dependencies,
such as `torch` and `transformers`, are neither specified in a
`requirements_file` nor preinstalled in a custom container.
In this case, the worker might fail to deserialize (unpickle) the user code.
- Your pipeline might have `AttributeError` messages when dependencies
are installed but their versions don't match the versions in the submission environment.
To resolve these problems, ensure that the required dependencies and their versions are the same
at runtime and in the submission environment. To help you identify these issues,
in Apache Beam 2.52.0 and later versions, debug logs specify the dependencies at both stages.
For more information, see
[Control the dependencies the pipeline uses](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#control-dependencies).
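As a quick way to record what the submission environment actually has
installed (the package names below are only examples), you can log the
installed versions before submitting the job:

```py
import importlib.metadata

for pkg in ("torch", "transformers"):  # example package names
    try:
        print(pkg, importlib.metadata.version(pkg))
    except importlib.metadata.PackageNotFoundError:
        print(pkg, "is not installed in the submission environment")
```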
@@ -21,17 +21,18 @@ limitations under the License.
# Get Started with Apache Beam
Learn how to use Beam to create data processing pipelines that run on supported processing back-ends.
## Tour of Beam
[Learn Beam with an interactive tour](https://tour.beam.apache.org).
Topics include core Beam concepts, from simple to advanced.
You can try examples, do exercises, and solve challenges along the learning journey.
## Beam Overview
Read the [Apache Beam Overview](/get-started/beam-overview) to learn about the Beam model,
the currently available Beam SDKs and runners, and Beam's native I/O connectors.
## Quickstarts for Java, Python, Go, and TypeScript
......@@ -49,10 +50,15 @@ See detailed walkthroughs of complete Beam pipelines.
- [WordCount](/get-started/wordcount-example): Simple example pipelines that demonstrate basic Beam programming, including debugging and testing
- [Mobile Gaming](/get-started/mobile-gaming-example): A series of more advanced pipelines that demonstrate use cases in the mobile gaming domain
## Downloads and Releases
Find download links and information about the latest Beam releases, including versioning and release notes,
on the [Apache Beam Downloads](/get-started/downloads) page.
## Support
- Find resources to help you use Beam, such as mailing lists and issue tracking,
on the [Support](/get-started/support) page.
- Ask questions and discuss topics on
[Stack Overflow](https://stackoverflow.com/questions/tagged/apache-beam)
or in the Beam [Slack Channel](https://apachebeam.slack.com).
@@ -19,7 +19,7 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
# Apache Beam<sup>®</sup> Downloads
> Beam SDK {{< param release_latest >}} is the latest released version.