- July 14, 2020

Authored by Randall Hauch

Authored by Chris Egerton
The `testBlockInConnectorStop` test is failing semi-frequently on Jenkins. It's difficult to verify the cause without complete logs and I'm unable to reproduce locally, but I suspect the Connect worker hasn't completed startup by the time the test begins, so the initial REST request to create a connector times out with a 500 error. This isn't an issue for normal tests, but these tests artificially reduce the REST request timeout because some requests are meant to exhaust it. The changes here use a small hack to verify, before the tests start, that the worker has started and is ready to handle all types of REST requests: they query the REST API for a non-existent connector. Reviewers: Boyang Chen <boyang@confluent.io>, Konstantine Karantasis <k.karantasis@gmail.com>
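
For context, a readiness probe of this sort can be written against the Connect REST API with plain JDK HTTP calls. The sketch below is illustrative only (the worker URL, connector name, and class name are hypothetical); it shows the idea of treating a 404 for an unknown connector as proof that the worker has fully started:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class ConnectReadinessProbe {

    // Returns true once the worker answers REST requests with a definitive
    // status (404 for an unknown connector), i.e. startup has completed.
    static boolean workerReady(String workerUrl) {
        try {
            URL url = new URL(workerUrl + "/connectors/nonexistent-connector");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(1_000);
            conn.setReadTimeout(1_000);
            int status = conn.getResponseCode();
            conn.disconnect();
            // A 404 means the request was routed and handled; a 500 or an
            // exception means the worker (or its herder) is not ready yet.
            return status == 404;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        while (!workerReady("http://localhost:8083")) {
            Thread.sleep(250); // poll until the worker is ready to serve requests
        }
        System.out.println("Connect worker is ready");
    }
}
```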

Authored by Jim Galasyn
Reviewer: Matthias J. Sax <matthias@confluent.io>

- July 12, 2020

Authored by John Roesler
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Matthias J. Sax <matthias@confluent.io>

Authored by Matthias J. Sax
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, John Roesler <john@confluent.io>

- July 11, 2020

Authored by Guozhang Wang
Also piggy-back a small fix to use TreeMap instead of HashMap to preserve iteration ordering. Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, John Roesler <vvcephei@apache.org>

Authored by Chia-Ping Tsai
Call KafkaStreams#cleanUp to reset local state before starting the application for the second run. Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Boyang Chen <boyang@confluent.io>, John Roesler <john@confluent.io>
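
To illustrate the pattern the test relies on, a Streams application can wipe its local state directory before a re-run by calling cleanUp() while the instance is stopped. The topology, topic names, and configuration below are placeholders, not the test's actual code:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class RestartWithCleanState {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");        // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");                     // trivial topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        // cleanUp() must be called while the instance is not running; it deletes
        // the local state directory so the second run starts from a clean slate.
        streams.cleanUp();
        streams.start();

        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```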

- July 10, 2020

Authored by A. Sophie Blee-Goldman
Fixes an asymmetry in which we avoid writing checkpoints for non-persistent stores but still expect to read them, resulting in a spurious TaskCorruptedException. Reviewers: Matthias J. Sax <mjsax@apache.org>, John Roesler <vvcephei@apache.org>

- July 09, 2020

Authored by Guozhang Wang
The intention of using poll(0) is to not block on rebalance but still return some data; however, `updateAssignmentMetadataIfNeeded` contains three different pieces of logic: 1) discover the coordinator if necessary, 2) join the group if necessary, 3) refresh metadata and fetch positions if necessary. We only want 2) to be non-blocking, not the others, since e.g. when the coordinator is down the heartbeat would expire and cause the consumer to fetch with timeout 0 as well, causing unnecessarily high CPU usage. Since splitting this function is a rather big change for a last-minute blocker fix for 2.6, I made a smaller change: `updateAssignmentMetadataIfNeeded` takes an optional boolean flag indicating whether 2) above should wait until it either expires or completes; otherwise it does not wait on the join-group future and just polls with a zero timer. Reviewers: Jason Gustafson <jason@confluent.io>
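
The internal flag described above is not part of the public API; for reference, the non-blocking poll call that this behavior sits behind looks roughly like the sketch below (broker address, group id, and topic are placeholders). `poll(Duration.ZERO)` is the non-deprecated replacement for `poll(0)` and returns whatever records are immediately available:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NonBlockingPollExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("group.id", "example-group");            // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic"));
            // Returns already-fetched records without blocking indefinitely on
            // coordinator discovery or rebalancing.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ZERO);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s%n", record.offset(), record.key());
            }
        }
    }
}
```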

- July 07, 2020

Authored by Boyang Chen
This is a bug fix for older admin clients that use static membership and call DescribeGroups. By making groupInstanceId ignorable, the client will not crash when handling the response. Added test coverage for DescribeGroups, plus some side cleanups. Reviewers: Jason Gustafson <jason@confluent.io>
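
For orientation, the kind of admin-client call affected is describing a consumer group and reading each member's group.instance.id, which only static members set. This is a hedged sketch using the public Admin API (group name and broker address are placeholders), not the internal response-schema fix itself:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;
import org.apache.kafka.clients.admin.MemberDescription;

public class DescribeGroupExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            ConsumerGroupDescription description = admin
                    .describeConsumerGroups(Collections.singletonList("example-group")) // placeholder
                    .describedGroups()
                    .get("example-group")
                    .get();
            for (MemberDescription member : description.members()) {
                // groupInstanceId is only present for static members; clients must
                // tolerate its absence in the DescribeGroups response.
                System.out.printf("member=%s groupInstanceId=%s%n",
                        member.consumerId(), member.groupInstanceId().orElse("<none>"));
            }
        }
    }
}
```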

Authored by Jim Galasyn
Reviewers: Matthias J. Sax <matthias@confluent.io>

Authored by A. Sophie Blee-Goldman
Two more edge cases I found producing extra TaskCorruptedException while playing around with the failing eos-beta upgrade test (sadly these are unrelated problems, as the test still fails with these fixes in place).

* Need to write the checkpoint when recycling a standby: although we do preserve the changelog offsets when recycling a task, and should therefore write the offsets when the new task is itself closed, we do NOT write the checkpoint for uninitialized tasks. So if the new task is ultimately closed before it gets out of the CREATED state, the offsets will not be written and we can get a TaskCorruptedException.
* We do not write the checkpoint file if the current offset map is empty; however, for eos the checkpoint file is not only used for restoration but also for clean shutdown. Although skipping a dummy checkpoint file does not actually violate any correctness, since we are going to re-bootstrap from the log-start-offset anyway, it throws an unnecessary TaskCorruptedException which has an overhead itself.

Reviewers: John Roesler <vvcephei@apache.org>, Guozhang Wang <wangguoz@gmail.com>

Authored by Ismael Juma
Without this change, we would catch the NPE and log it. This was misleading and could cause excessive log volume. The NPE can happen after `AlterReplicaLogDirs` completes successfully and when unmapping older regions. Example stacktrace:

```text
[2019-05-20 14:08:13,999] ERROR Error unmapping index /tmp/kafka-logs/test-0.567a0d8ff88b45ab95794020d0b2e66f-delete/00000000000000000000.index (kafka.log.OffsetIndex)
java.lang.NullPointerException
    at org.apache.kafka.common.utils.MappedByteBuffers.unmap(MappedByteBuffers.java:73)
    at kafka.log.AbstractIndex.forceUnmap(AbstractIndex.scala:318)
    at kafka.log.AbstractIndex.safeForceUnmap(AbstractIndex.scala:308)
    at kafka.log.AbstractIndex.$anonfun$closeHandler$1(AbstractIndex.scala:257)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
    at kafka.log.AbstractIndex.closeHandler(AbstractIndex.scala:257)
    at kafka.log.AbstractIndex.deleteIfExists(AbstractIndex.scala:226)
    at kafka.log.LogSegment.$anonfun$deleteIfExists$6(LogSegment.scala:597)
    at kafka.log.LogSegment.delete$1(LogSegment.scala:585)
    at kafka.log.LogSegment.$anonfun$deleteIfExists$5(LogSegment.scala:597)
    at kafka.utils.CoreUtils$.$anonfun$tryAll$1(CoreUtils.scala:115)
    at kafka.utils.CoreUtils$.$anonfun$tryAll$1$adapted(CoreUtils.scala:114)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at kafka.utils.CoreUtils$.tryAll(CoreUtils.scala:114)
    at kafka.log.LogSegment.deleteIfExists(LogSegment.scala:599)
    at kafka.log.Log.$anonfun$delete$3(Log.scala:1762)
    at kafka.log.Log.$anonfun$delete$3$adapted(Log.scala:1762)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at kafka.log.Log.$anonfun$delete$2(Log.scala:1762)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at kafka.log.Log.maybeHandleIOException(Log.scala:2013)
    at kafka.log.Log.delete(Log.scala:1759)
    at kafka.log.LogManager.deleteLogs(LogManager.scala:761)
    at kafka.log.LogManager.$anonfun$deleteLogs$6(LogManager.scala:775)
    at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
    at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```

Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Ismael Juma <ismael@juma.me.uk> Co-authored-by: Jaikiran Pai <jaikiran.pai@gmail.com>

Authored by Guozhang Wang

Authored by A. Sophie Blee-Goldman
The current failures we're seeing with this test are due to faulty assumptions that it makes and not any real bug in eos-beta (at least, from what I've seen so far). The test relies on tightly controlling the commits, which it does by setting the commit interval to MAX_VALUE and manually requesting commits on the context. In two phases, the test assumes that any pending data will be committed after a rebalance. But we actually take care to avoid unnecessary commits -- with eos-alpha, we only commit tasks that are revoked while in eos-beta we must commit all tasks if any are revoked, but only if the revoked tasks themselves need a commit. The failure we see occurs when we try to verify the committed data after a second client is started and the group rebalances. The already-running client has to give up two tasks to the newly started client, but those tasks may not need to be committed in which case none of the tasks would be. So we still have an open transaction on the partitions where we try to read committed data. Reviewers: John Roesler <john@confluent.io>, Guozhang Wang <wangguoz@gmail.com>

Authored by Jason Gustafson
Users often get confused after an unclean shutdown when log recovery takes a long time. This patch attempts to make the logging clearer and provide a simple indication of loading progress. Reviewers: Ismael Juma <ismael@juma.me.uk>

- July 04, 2020

Authored by Chia-Ping Tsai
There are two new configs introduced by 371f14c3 and 1c4eb1a5, so we have to update the expected configs in the connect_rest_test.py system test too. Reviewer: Konstantine Karantasis <konstantine@confluent.io>

- July 02, 2020

Authored by Ismael Juma
This includes important fixes. Netty is required by ZooKeeper if TLS is enabled. I verified that the netty jars were changed from 4.1.48 to 4.1.50 with this PR, `find . -name '*netty*'`:

```text
./core/build/dependant-libs-2.13.3/netty-handler-4.1.50.Final.jar
./core/build/dependant-libs-2.13.3/netty-transport-native-epoll-4.1.50.Final.jar
./core/build/dependant-libs-2.13.3/netty-codec-4.1.50.Final.jar
./core/build/dependant-libs-2.13.3/netty-transport-native-unix-common-4.1.50.Final.jar
./core/build/dependant-libs-2.13.3/netty-transport-4.1.50.Final.jar
./core/build/dependant-libs-2.13.3/netty-resolver-4.1.50.Final.jar
./core/build/dependant-libs-2.13.3/netty-buffer-4.1.50.Final.jar
./core/build/dependant-libs-2.13.3/netty-common-4.1.50.Final.jar
```

Note that the previous netty exclude no longer worked since we upgraded to ZooKeeper 3.5.x, as it switched to Netty 4 which has different module names. Also, the Netty dependency is needed by ZooKeeper for TLS support so we cannot exclude it. Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>

- July 01, 2020

Authored by Aakash Shah
Added a section about error reporting in Connect documentation, and another about how to safely use the new errant record reporter in SinkTask implementations. Author: Aakash Shah <ashah@confluent.io> Reviewer: Randall Hauch <rhauch@gmail.com>
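
The documented pattern for the errant record reporter (KIP-610) is roughly the sketch below. The task class and its delivery logic are hypothetical; the try/catch of NoSuchMethodError/NoClassDefFoundError is the recommended way to stay compatible with pre-2.6 Connect workers:

```java
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.connect.sink.ErrantRecordReporter;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class ExampleSinkTask extends SinkTask {  // hypothetical task for illustration
    private ErrantRecordReporter reporter;

    @Override
    public void start(Map<String, String> props) {
        try {
            // May be null if error reporting is not enabled in the connector config.
            reporter = context.errantRecordReporter();
        } catch (NoSuchMethodError | NoClassDefFoundError e) {
            reporter = null; // running against a pre-2.6 Connect worker
        }
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            try {
                process(record);
            } catch (Exception e) {
                if (reporter != null) {
                    reporter.report(record, e); // hand the bad record to the error handler
                } else {
                    throw new RuntimeException(e); // no reporter available: fail the task
                }
            }
        }
    }

    private void process(SinkRecord record) {
        // hypothetical delivery logic
    }

    @Override
    public void stop() {}

    @Override
    public String version() {
        return "1.0";
    }
}
```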

Authored by Chia-Ping Tsai
After 3661f981, security_config is cached, so later changes to the security flag can't affect the security_config used by subsequent tests. Issue: https://issues.apache.org/jira/browse/KAFKA-10214 Author: Chia-Ping Tsai <chia7712@gmail.com> Reviewers: Ron Dagostino <rdagostino@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com> Closes #8949 from chia7712/KAFKA-10214 (cherry picked from commit 6094af89) Signed-off-by: Manikumar Reddy <mkumar@xtreems.local>

- June 30, 2020

Authored by David Jacot
KAFKA-10212: Describing a topic with the TopicCommand fails if unauthorised to use the ListPartitionReassignments API. Since https://issues.apache.org/jira/browse/KAFKA-8834, describing topics with the TopicCommand requires privileges to use ListPartitionReassignments, or it fails to describe the topics with the following error:

> Error while executing topic command : Cluster authorization failed.

This is quite a hard restriction, as most secure clusters do not authorize non-admin members to access ListPartitionReassignments. This patch catches the `ClusterAuthorizationException` exception and gracefully falls back. We already do this when the API is not available, so it remains consistent. Author: David Jacot <djacot@confluent.io> Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com> Closes #8947 from dajac/KAFKA-10212 (cherry picked from commit 4be4420b) Signed-off-by: Manikumar Reddy <mkumar@xtreems.local>
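
The actual fix lives in the TopicCommand's Scala code; as a rough illustration of the same fallback pattern on the public Admin API, a caller can treat both "API not supported" and "not authorized" as "no reassignments visible". The helper class below is hypothetical:

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.ClusterAuthorizationException;
import org.apache.kafka.common.errors.UnsupportedVersionException;

public class ReassignmentLookup {  // hypothetical helper for illustration

    // Returns the ongoing reassignments, or an empty map if the cluster does not
    // support the API or the principal is not authorized to call it.
    static Map<TopicPartition, PartitionReassignment> safeListReassignments(Admin admin)
            throws InterruptedException, ExecutionException {
        try {
            return admin.listPartitionReassignments().reassignments().get();
        } catch (ExecutionException e) {
            if (e.getCause() instanceof UnsupportedVersionException
                    || e.getCause() instanceof ClusterAuthorizationException) {
                return Collections.emptyMap();
            }
            throw e;
        }
    }
}
```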

Authored by showuon
A simple increase in the timeout of the consumer that verifies that records have been replicated seems to fix the integration tests in `MirrorConnectorsIntegrationTest` that have been failing more often recently. Reviewers: Ryanne Dolan <ryannedolan@gmail.com>, Sanjana Kaundinya <skaundinya@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>, Konstantine Karantasis <konstantine@confluent.io>

- June 28, 2020

Authored by Nikolay
Reviewers: Jun Rao <junrao@gmail.com>

- June 27, 2020

Authored by John Roesler
We inadvertently changed the binary schema of the suppress buffer changelog in 2.4.0 without bumping the schema version number. As a result, it is impossible to upgrade from 2.3.x to 2.4+ if you are using suppression.

* Refactor the schema compatibility test to use serialized data from older versions as a more foolproof compatibility test.
* Refactor the upgrade system test to use the smoke test application so that we actually exercise a significant portion of the Streams API during upgrade testing.
* Add more recent versions to the upgrade system test matrix.
* Fix the compatibility bug by bumping the schema version to 3.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Guozhang Wang <wangguoz@gmail.com>

Authored by A. Sophie Blee-Goldman
This should address at least some of the excessive TaskCorruptedExceptions we've been seeing lately. Basically, at the moment we only commit tasks if commitNeeded is true -- this seems obvious by definition. But the problem is we do some essential cleanup in postCommit that should always be done before a task is closed:

* clear the PartitionGroup
* write the checkpoint

The second is actually fine to skip when commitNeeded = false with ALOS, as we will have already written a checkpoint during the last commit. But for EOS, we only write the checkpoint before a close -- so even if there is no new pending data since the last commit, we have to write the current offsets. If we don't, the task will be assumed dirty and we will run into our friend the TaskCorruptedException during (re)initialization. To fix this, we should just always call prepareCommit and postCommit at the TaskManager level. Within the task, it can decide whether or not to actually do something in those methods based on commitNeeded. One subtle issue is that we still need to avoid checkpointing a task that was still in CREATED, to avoid potentially overwriting an existing checkpoint with uninitialized empty offsets. Unfortunately, we always suspend a task before closing and committing, so we lose the information about whether the task was in CREATED or RUNNING/RESTORING by the time we get to the checkpoint. For this we introduce a special flag to keep track of whether a suspended task should actually be checkpointed or not.

Reviewers: Guozhang Wang <wangguoz@gmail.com>

- June 25, 2020

Authored by A. Sophie Blee-Goldman
We just needed to add the check in StreamTask#closeClean to closeAndRecycleState as well. I also renamed closeAndRecycleState to closeCleanAndRecycleState to drive this point home: it needs to be clean. This should be cherry-picked back to the 2.6 branch. Reviewers: Matthias J. Sax <matthias@confluent.io>, John Roesler <john@confluent.io>, Guozhang Wang <wangguoz@gmail.com>

- June 24, 2020

Authored by A. Sophie Blee-Goldman
KAFKA-10169: swallow non-fatal KafkaException and don't abort transaction during clean close (#8900)

If there's any pending data and we haven't flushed the producer when we abort a transaction, a KafkaException is returned for the previous send. This is a bit misleading, since the situation is not an unrecoverable error and so the KafkaException is really non-fatal. For now, we should just catch and swallow this in the RecordCollector (see also: KAFKA-10169).

The reason we ended up aborting an un-flushed transaction was due to the combination of:

a. always aborting the ongoing transaction when any task is closed/revoked
b. only committing (and flushing) if at least one of the revoked tasks needs to be committed (regardless of whether any non-revoked tasks have data/transaction in flight)

Given the above, we can end up with an ongoing transaction that isn't committed since none of the revoked tasks have any data in the transaction. We then abort the transaction anyway, when those tasks are closed. So in addition to the above (swallowing this exception), we should avoid unnecessarily aborting data for tasks that haven't been revoked. We can handle this by splitting the RecordCollector's close into a dirty and clean flavor: if dirty, we need to abort the transaction since it may be dirty due to the commit attempt failing. But if clean, we can skip aborting the transaction since we know that either we just committed and thus there is no ongoing transaction to abort, or else the transaction in flight contains no data from the tasks being closed.

Note that this means we still abort the transaction any time a task is closed dirty, so we must close/reinitialize any active task with pending data (that was aborted). In sum:

* hackily check the KafkaException message and swallow
* only abort the transaction during a dirty close
* refactor shutdown to make sure we don't closeClean a task whose data was actually aborted

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Boyang Chen <boyang@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>

Authored by feyman2016
KAFKA-10135: Extract Task#executeAndMaybeSwallow to be a general utility function into TaskManager (#8887)

- June 20, 2020

Authored by John Roesler
Reviewers: Guozhang Wang <wangguoz@gmail.com>

Authored by Matthias J. Sax
Ports the test from #8886 to trunk -- this should be merged to the 2.6 branch. One open question: in 2.6 and trunk we rely on the active tasks to wipe out the store if the task crashes. However, if there is a hard JVM crash and we don't call closeDirty(), the store would not be wiped out. Thus, I am wondering if we would need to fix this (for both active and standby tasks) and check on startup whether a local store must be wiped out. The current test passes, as we do a proper cleanup after the exception is thrown. Reviewers: Guozhang Wang <wangguoz@gmail.com>

- June 19, 2020

Authored by Jason Gustafson
This patch fixes a bug in the constructor of `LogTruncationException`. We were passing the divergent offsets to the super constructor as the fetch offsets. There is no way to fix this without breaking compatibility, but the harm is probably minimal since this exception was not getting raised properly until KAFKA-9840 anyway. Note that I have also moved the check for unknown offset and epoch into `SubscriptionState`, which ensures that the partition is still awaiting validation and that the fetch offset hasn't changed. Finally, I made some minor improvements to the logging and exception messages to ensure that we always have the fetch offset and epoch as well as the divergent offset and epoch included. Reviewers: Boyang Chen <boyang@confluent.io>, David Arthur <mumrah@gmail.com>

Authored by Guozhang Wang
Since the admin client allows us to use a flexible offset-spec, we can always use read-uncommitted regardless of the EOS config. Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Bruno Cadonna <bruno@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>

- June 18, 2020

Authored by Ego
Newer version of ducktape that updates some dependencies and adds some features. You can see that diff here: https://github.com/confluentinc/ducktape/compare/v0.7.7...v0.7.8 Reviewer: Konstantine Karantasis <konstantine@confluent.io>

Authored by David Arthur
Background: When a partition subscription is initialized, it has a `null` position and is in the INITIALIZING state. Depending on the consumer, it will then transition to one of the other states. Typically a consumer will either reset the offset to earliest/latest, or it will provide an offset (with or without offset metadata). For the reset case, we still have no position to act on, so fetches should not occur. Recently we made changes for KAFKA-9724 (#8376) to prevent clients from entering the AWAIT_VALIDATION state when targeting older brokers. New logic to bypass offset validation as part of this change exposed this new issue.

Bug and Fix: In the partition subscriptions, the AWAIT_RESET state was incorrectly reporting that it had a position. In some cases a position might actually exist (e.g., if we were resetting offsets during a fetch after a truncation), but in the initialization case no position had been set. We saw this issue in system tests where there is a race between the offset reset completing and the first fetch request being issued. Since AWAIT_RESET#hasPosition was incorrectly returning `true`, the new logic to bypass offset validation was transitioning the subscription to FETCHING (even though no position existed). The fix was simply to have AWAIT_RESET#hasPosition return `false`, which should have been the case from the start. Additionally, this fix includes some guards against NPE when reading the position from the subscription.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Jason Gustafson <jason@confluent.io>

Authored by Chia-Ping Tsai
Author: Chia-Ping Tsai <chia7712@gmail.com> Reviewers: Randall Hauch <rhauch@gmail.com>, David Jacot <david.jacot@gmail.com>

- June 17, 2020

Authored by Chia-Ping Tsai
KAFKA-10147 MockAdminClient#describeConfigs(Collection<ConfigResource>) is unable to handle broker resource (#8853) Author: Chia-Ping Tsai <chia7712@gmail.com> Reviewers: Boyang Chen <boyang@confluent.io>, Randall Hauch <rhauch@gmail.com>

Authored by John Roesler
Reduce the test data set from 1000 records to 500. Some recent test failures indicate that the Jenkins runners aren't able to process all 1000 records in two minutes. Also add a sanity check that all the test data are readable from the input topic. Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>

Authored by John Roesler
* Remove problematic Percentiles measurements until the implementation is fixed
* Fix leaking e2e metrics when task is closed
* Fix leaking metrics when tasks are recycled

Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>

Authored by A. Sophie Blee-Goldman
* KAFKA-10150: always transition to SUSPENDED during suspend, no matter the current state; only call prepareCommit before closing if task.commitNeeded is true
* Don't commit any consumed offsets during handleAssignment -- revoked active tasks (and any others that need committing) will be committed during handleRevocation, so we only need to worry about cleaning them up in handleAssignment
* KAFKA-10152: when recycling a task we should always commit consumed offsets (if any), but don't need to write the checkpoint (since changelog offsets are preserved across task transitions)
* Make sure we close all tasks during shutdown, even if an exception is thrown during commit

Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>

Authored by Guozhang Wang
Reviewers: John Roesler <vvcephei@apache.org>