[RCCA-8435] Remove Explicit Checks in Number of Brokers and Partitions in Subset Partitioner (#7151)

In [#inc-rcca-8435-dropoff-in-metrics-during-telemetry-cluster-roll](https://confluent.slack.com/archives/C03TFE3NGUF), we saw around 9% of metrics being dropped
during the metrics cluster roll. We also noticed that a **single** broker recalculated the partitions to
produce to up to around 150 times during a roll. This means that all brokers in the fleet were constantly connecting/reconnecting, which may leave a server-side broker temporarily unavailable.

The reconnection logic kicks in because the number of brokers changes during a roll. We really
shouldn't treat a change in the number of brokers as a topic topology change. For example,
if we add a broker and no partitions elect it as the preferred leader, we shouldn't be recalculating
the partitions to write to. If partitions do elect the new node as their preferred leader, we already
capture that case (see the sketch below).
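
To make the change concrete, here is a minimal sketch of the kind of check being simplified. All names here (`SubsetPartitionerSketch`, `shouldRecalculate`, the preferred-leader map) are hypothetical stand-ins, not the actual partitioner internals; the point is that the broker-count and partition-count comparisons go away, leaving only the preferred-leader comparison:

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of the subset partitioner's topology check; the real
// class and field names in the codebase may differ.
final class SubsetPartitionerSketch {

    // Partition id -> preferred leader broker id for the target topic.
    private Map<Integer, Integer> preferredLeaders = Map.of();

    /**
     * Decide whether to recompute the subset of partitions to produce to.
     * Before this change the method also compared the broker count and the
     * partition count; both comparisons are redundant, because any change
     * that matters shows up in the preferred-leader assignment.
     */
    boolean shouldRecalculate(Map<Integer, Integer> currentPreferredLeaders) {
        // A broker joining or leaving without becoming a preferred leader
        // leaves this map unchanged, so a cluster roll no longer triggers
        // repeated recalculations and reconnects.
        return !Objects.equals(preferredLeaders, currentPreferredLeaders);
    }

    void recalculate(Map<Integer, Integer> currentPreferredLeaders) {
        preferredLeaders = Map.copyOf(currentPreferredLeaders);
        // ... recompute the subset of partitions to write to ...
    }
}
```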

The fact that no unit tests needed to change also demonstrates that this check is unnecessary. We also remove
the check on the number of partitions, since that case too is covered by the preferred-partition-leader check: a new partition necessarily introduces a new preferred-leader entry, so the leader comparison already detects it.

Note: We're getting the 150 number from [this log message](https://prd.logs.aws.confluent.cloud/_dashboards/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:'2022-08-15T07:00:00.000Z',to:now))&_a=(columns:!(_source),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'7ae0cc50-dcdc-11ea-b484-556ef92a2241',key:clusterId,negate:!f,params:(query:pkc-688z3),type:phrase),query:(match_phrase:(clusterId:pkc-688z3))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'7ae0cc50-dcdc-11ea-b484-556ef92a2241',key:mdc.brokerId,negate:!f,params:(query:'1'),type:phrase),query:(match_phrase:(mdc.brokerId:'1')))),index:'7ae0cc50-dcdc-11ea-b484-556ef92a2241',interval:auto,query:(language:kuery,query:'message:%20%22Kafka%20Producer%20producing%20to%20the%20following%20subset%20partitions:%20%7B_confluent-telemetry-metrics%22'),sort:!())). This number may be inflated due to a second telemetry reporter running, as discovered in [#inc-rcca-8423-drop-in-tr-records-sent-rate-after-3787-upgrade](https://confluent.slack.com/archives/C03TK0C72KY).

Reviewers: Eric Sirianni <sirianni@confluent.io>