[RCCA-8435] Remove Explicit Checks in Number of Brokers and Partitions in Subset Partitioner (#7151)

In [#inc-rcca-8435-dropoff-in-metrics-during-telemetry-cluster-roll](https://confluent.slack.com/archives/C03TFE3NGUF), we saw around 9% of metrics being dropped during the metrics cluster roll. We also noticed that a **single** broker recalculated the partitions to produce to around 150 times during a roll. This means that all brokers in the fleet were constantly connecting and reconnecting, which can leave a server-side broker temporarily unavailable.

The reconnecting logic triggers because the number of brokers changes during a roll. We shouldn't treat a change in the number of brokers as a topic topology change: if we add a broker and no partitions elect it as the preferred leader, we shouldn't be recalculating the partitions to write to; if partitions do elect the new node as the preferred leader, the existing preferred-leader check already captures that case. The fact that no unit tests needed to change also demonstrates that this check is unnecessary. We also remove the check on the number of partitions, since that case is likewise covered by the preferred-partition-leader check.
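The logic change above can be sketched as follows. This is a minimal, hypothetical illustration (the method and variable names are illustrative, not the actual Subset Partitioner code): before this PR, a change in broker count or partition count was treated as a topology change; after it, only a change in the preferred partition leaders triggers a recalculation.

```java
import java.util.List;
import java.util.Objects;

public class TopologyChangeSketch {

    /** Before this change (sketch): any difference in broker count, partition
     *  count, or preferred leaders triggered a partition recalculation. */
    static boolean needsRecalcOld(int oldBrokerCount, int newBrokerCount,
                                  List<Integer> oldPreferredLeaders,
                                  List<Integer> newPreferredLeaders) {
        return oldBrokerCount != newBrokerCount
                || oldPreferredLeaders.size() != newPreferredLeaders.size()
                || !Objects.equals(oldPreferredLeaders, newPreferredLeaders);
    }

    /** After this change (sketch): only a change in preferred partition
     *  leaders matters. Adding a broker that no partition elects as
     *  preferred leader is a no-op. */
    static boolean needsRecalcNew(List<Integer> oldPreferredLeaders,
                                  List<Integer> newPreferredLeaders) {
        return !Objects.equals(oldPreferredLeaders, newPreferredLeaders);
    }

    public static void main(String[] args) {
        List<Integer> leaders = List.of(1, 2, 3);

        // A broker joins (3 -> 4) but preferred leaders are unchanged:
        // the old check recalculates spuriously, the new one does not.
        System.out.println(needsRecalcOld(3, 4, leaders, leaders)); // true
        System.out.println(needsRecalcNew(leaders, leaders));       // false

        // A partition's preferred leader actually moves: still detected.
        System.out.println(needsRecalcNew(leaders, List.of(1, 2, 4))); // true
    }
}
```

During a rolling restart the broker count changes once per broker bounce, so the old check fires repeatedly even though the preferred leader assignment is ultimately unchanged, which matches the ~150 recalculations observed on a single broker.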
Note: We're getting the 150 number from [this log message](https://prd.logs.aws.confluent.cloud/_dashboards/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:'2022-08-15T07:00:00.000Z',to:now))&_a=(columns:!(_source),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'7ae0cc50-dcdc-11ea-b484-556ef92a2241',key:clusterId,negate:!f,params:(query:pkc-688z3),type:phrase),query:(match_phrase:(clusterId:pkc-688z3))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'7ae0cc50-dcdc-11ea-b484-556ef92a2241',key:mdc.brokerId,negate:!f,params:(query:'1'),type:phrase),query:(match_phrase:(mdc.brokerId:'1')))),index:'7ae0cc50-dcdc-11ea-b484-556ef92a2241',interval:auto,query:(language:kuery,query:'message:%20%22Kafka%20Producer%20producing%20to%20the%20following%20subset%20partitions:%20%7B_confluent-telemetry-metrics%22'),sort:!())). This number may be inflated by a second telemetry reporter running, as discovered in [#inc-rcca-8423-drop-in-tr-records-sent-rate-after-3787-upgrade](https://confluent.slack.com/archives/C03TK0C72KY).

Reviewers: Eric Sirianni <sirianni@confluent.io>