In some projects the iidp platform still runs against a standalone Redis instance, which leaves a gap in high availability. After some investigation we decided to adopt Redis Sentinel to improve Redis availability and provide a stable storage and messaging service.
This article tests the high availability of the Redis Sentinel setup from two angles: the availability of Redis itself, and availability in real business scenarios.
Cluster setup
The test cluster consists of 1 master, 2 replicas, and 3 Sentinel nodes, configured as follows:
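Each Sentinel instance is started with a configuration along the following lines. This is a minimal sketch reconstructed from the parameters visible in the `sentinel masters` output further below (quorum 2, down-after-milliseconds 60000, failover-timeout 180000, parallel-syncs 1); the files actually used in this test may differ in detail.
```
# sentinel-1 listens on 26379 (use 26380 / 26381 for the other two instances)
port 26379
# monitor the master named "mymaster"; 2 Sentinels must agree before it is considered down
sentinel monitor mymaster 192.168.184.122 6379 2
# password used to reach the master and its replicas
sentinel auth-pass mymaster snest123
# mark the master as subjectively down after 60 s without a valid reply
sentinel down-after-milliseconds mymaster 60000
# give up on a failover that has not completed within 3 minutes
sentinel failover-timeout mymaster 180000
# reconfigure at most one replica at a time after a failover
sentinel parallel-syncs mymaster 1
```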
Master node redis-master at 192.168.184.122:6379. Its replication info shows that the role is master, with two connected replicas:
```bash
> redis-cli -p 6379 -a snest123 info replication
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
# Replication
role:master # the role is master
connected_slaves:2 # two connected replicas
slave0:ip=192.168.184.122,port=6381,state=online,offset=60270,lag=1 # first replica
slave1:ip=192.168.184.122,port=6380,state=online,offset=60270,lag=1 # second replica
master_failover_state:no-failover # no failover in progress
master_replid:ebd8e552d0ac04310950e95263d63961acf1a51c
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:60270
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:60270
```
Two replicas, redis-slave-1 and redis-slave-2, both on 192.168.184.122, listening on ports 6381 and 6380 respectively:
```bash
> redis-cli -p 6380 -a snest123 info replication
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
# Replication
role:slave # the role is slave
master_host:192.168.184.122 # master IP
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_read_repl_offset:72389
slave_repl_offset:72389
slave_priority:100
slave_read_only:1
replica_announced:1
connected_slaves:0
master_failover_state:no-failover
master_replid:ebd8e552d0ac04310950e95263d63961acf1a51c
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:72389
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:12494
repl_backlog_histlen:59896
```
Three Sentinel nodes, redis-sentinel-1, redis-sentinel-2, and redis-sentinel-3, all on 192.168.184.122, listening on ports 26379, 26380, and 26381 respectively:
> redis-cli -p 26379 sentinel masters
1) 1) "name"
2) "mymaster"
3) "ip"
4) "192.168.184.122" # 主节点IP
5) "port"
6) "6379"
7) "runid"
8) "804b4d520bc6d6bb4673f10084fd59ef9bdd518e"
9) "flags"
10) "master" # 主节点标志
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "352"
19) "last-ping-reply"
20) "352"
21) "down-after-milliseconds"
22) "60000"
23) "info-refresh"
24) "2495"
25) "role-reported"
26) "master"
27) "role-reported-time"
28) "351514574"
29) "config-epoch"
30) "0"
31) "num-slaves" # 从节点数量
32) "2"
33) "num-other-sentinels" # 其他哨兵数量
34) "2"
35) "quorum" # 哨兵选举的法定人数
36) "2"
37) "failover-timeout"
38) "180000"
39) "parallel-syncs"
40) "1"
Check the master address as seen by Sentinel:
127.0.0.1:26379> sentinel get-master-addr-by-name mymaster
1) "192.168.184.122"
2) "6379"
At this point the whole cluster is up and running. Next we run a series of failure tests against the Sentinel deployment.
Stopping the master node
After stopping the master process, observe the Sentinel log output:
1:X 12 Aug 2025 03:23:12.998 # +sdown master mymaster 192.168.184.122 6379 ---> Sentinel marks the master as subjectively down
1:X 12 Aug 2025 03:23:13.082 # +odown master mymaster 192.168.184.122 6379 #quorum 2/2 ---> quorum reached, the master is confirmed objectively down
1:X 12 Aug 2025 03:23:13.082 # +new-epoch 3
1:X 12 Aug 2025 03:23:13.082 # +try-failover master mymaster 192.168.184.122 6379 ---> Sentinel starts a failover attempt
1:X 12 Aug 2025 03:23:13.093 # Could not rename tmp config file (Device or resource busy)
1:X 12 Aug 2025 03:23:13.093 # WARNING: Sentinel was not able to save the new configuration on disk!!!: Device or resource busy
1:X 12 Aug 2025 03:23:13.093 # +vote-for-leader 247379459a8d917fdf61931a5111f921dfa47408 3 ---> this Sentinel votes for 247379459a8d917fdf61931a5111f921dfa47408 as the failover leader
1:X 12 Aug 2025 03:23:13.116 # e39828966d5ae85ef8292ce9aa085c84ef5d6203 voted for 247379459a8d917fdf61931a5111f921dfa47408 3
1:X 12 Aug 2025 03:23:13.116 # ef17ddf007aa36aa2d91f5a7896b8141adca2e07 voted for 247379459a8d917fdf61931a5111f921dfa47408 3
1:X 12 Aug 2025 03:23:13.156 # +elected-leader master mymaster 192.168.184.122 6379 ---> Sentinel 247379459a8d917fdf61931a5111f921dfa47408 is elected as the failover leader
1:X 12 Aug 2025 03:23:13.156 # +failover-state-select-slave master mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:23:13.247 # +selected-slave slave 192.168.184.122:6380 192.168.184.122 6380 @ mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:23:13.247 * +failover-state-send-slaveof-noone slave 192.168.184.122:6380 192.168.184.122 6380 @ mymaster 192.168.184.122 6379 ---> the selected replica (6380) is told to stop replicating and become the new master
1:X 12 Aug 2025 03:23:13.309 * +failover-state-wait-promotion slave 192.168.184.122:6380 192.168.184.122 6380 @ mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:23:13.339 # Could not rename tmp config file (Device or resource busy)
1:X 12 Aug 2025 03:23:13.339 # WARNING: Sentinel was not able to save the new configuration on disk!!!: Device or resource busy
1:X 12 Aug 2025 03:23:13.339 # +promoted-slave slave 192.168.184.122:6380 192.168.184.122 6380 @ mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:23:13.339 # +failover-state-reconf-slaves master mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:23:13.410 * +slave-reconf-sent slave 192.168.184.122:6381 192.168.184.122 6381 @ mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:23:14.216 # -odown master mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:23:14.270 * +slave-reconf-inprog slave 192.168.184.122:6381 192.168.184.122 6381 @ mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:23:14.270 * +slave-reconf-done slave 192.168.184.122:6381 192.168.184.122 6381 @ mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:23:14.322 # +failover-end master mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:23:14.322 # +switch-master mymaster 192.168.184.122 6379 192.168.184.122 6380
1:X 12 Aug 2025 03:23:14.323 * +slave slave 192.168.184.122:6381 192.168.184.122 6381 @ mymaster 192.168.184.122 6380
1:X 12 Aug 2025 03:23:14.323 * +slave slave 192.168.184.122:6379 192.168.184.122 6379 @ mymaster 192.168.184.122 6380
1:X 12 Aug 2025 03:23:14.336 # Could not rename tmp config file (Device or resource busy)
1:X 12 Aug 2025 03:23:14.337 # WARNING: Sentinel was not able to save the new configuration on disk!!!: Device or resource busy
1:X 12 Aug 2025 03:24:14.350 # +sdown slave 192.168.184.122:6379 192.168.184.122 6379 @ mymaster 192.168.184.122 6380
Check the Sentinel state:
> redis-cli -p 26379 sentinel masters
1) 1) "name"
2) "mymaster"
3) "ip"
4) "192.168.184.122" # 主节点IP
5) "port"
6) "6380" # 新的主节点端口, 之前的主节点6379已经停止, 现在6380成为新的主节点, 说明故障转移成功
7) "runid"
8) "539781d1f63dd363c26c6e37f632d3ced0a535f5"
9) "flags"
10) "master"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "302"
19) "last-ping-reply"
20) "302"
21) "down-after-milliseconds"
22) "60000"
23) "info-refresh"
24) "7943"
25) "role-reported"
26) "master"
27) "role-reported-time"
28) "149174"
29) "config-epoch"
30) "3"
31) "num-slaves"
32) "2"
33) "num-other-sentinels"
34) "2"
35) "quorum"
36) "2"
37) "failover-timeout"
38) "180000"
39) "parallel-syncs"
40) "1"
Check the master address again:
127.0.0.1:26379> sentinel get-master-addr-by-name mymaster
1) "192.168.184.122"
2) "6380"
Verify directly that 192.168.184.122:6380 has really become the new master:
> redis-cli -p 6380 -a snest123 info replication
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
# Replication
role:master # the role is indeed master
connected_slaves:1 # only one replica is connected now
slave0:ip=192.168.184.122,port=6381,state=online,offset=201890,lag=0 # the former replica on 6381 is still online and still acting as a replica
master_failover_state:no-failover
master_replid:ad6a01d6a123ea956ec35f0d80e3d1e83191bb7b
master_replid2:ebd8e552d0ac04310950e95263d63961acf1a51c
master_repl_offset:202035
second_repl_offset:154594
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:12494
repl_backlog_histlen:189542
Fetch a key that was written on the old master to confirm the data is still there:
redis-cli -p 6380 -a snest123 get foo
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
"bar" # 数据仍然存在
These results show that failover works correctly in Sentinel mode: the new master took over the role of the old one and the data stayed consistent.
Restart the old master's Redis process; it rejoins the cluster as a replica of 6380:
1:X 12 Aug 2025 03:24:14.350 # +sdown slave 192.168.184.122:6379 192.168.184.122 6379 @ mymaster 192.168.184.122 6380
1:X 12 Aug 2025 03:30:50.866 # -sdown slave 192.168.184.122:6379 192.168.184.122 6379 @ mymaster 192.168.184.122 6380
1:X 12 Aug 2025 03:31:00.799 * +convert-to-slave slave 192.168.184.122:6379 192.168.184.122 6379 @ mymaster 192.168.184.122 6380
Simulating Sentinel node failures
Stop one Sentinel process (redis-sentinel-1) and watch the logs of the other two Sentinels: nothing related to a failure is logged.
Now stop two Sentinel processes (redis-sentinel-1 and redis-sentinel-2) and watch the log output of the remaining Sentinel, redis-sentinel-3:
1:X 12 Aug 2025 03:38:08.991 # +sdown sentinel e39828966d5ae85ef8292ce9aa085c84ef5d6203 192.168.184.122 26381 @ mymaster 192.168.184.122 6380
1:X 12 Aug 2025 03:38:44.972 # +sdown sentinel ef17ddf007aa36aa2d91f5a7896b8141adca2e07 192.168.184.122 26380 @ mymaster 192.168.184.122 6380
1:X 12 Aug 2025 03:39:05.514 # +sdown master mymaster 192.168.184.122 6380
At this point, try to connect to the Redis cluster from the client program and observe the result:
redis: 2025/08/12 11:43:54 sentinel.go:770: sentinel: selected addr=192.168.184.122:26379 masterAddr=192.168.184.122:6380
redis: 2025/08/12 11:43:54 sentinel.go:759: sentinel: GetMasterAddrByName addr=192.168.184.122:26381, master="mymaster" failed: context canceled
redis: 2025/08/12 11:43:54 sentinel.go:759: sentinel: GetMasterAddrByName addr=192.168.184.122:26380, master="mymaster" failed: context canceled
redis: 2025/08/12 11:43:54 sentinel.go:920: sentinel: new master="mymaster" addr="192.168.184.122:6380"
panic: dial tcp 192.168.184.122:6380: connectex: No connection could be made because the target machine actively refused it.
When more than half of the Sentinel nodes are down, the program can no longer reach a usable master and panics. The surviving Sentinel still answers GetMasterAddrByName, but with fewer than a majority of Sentinels alive no failover can be authorized, so it keeps handing out the address of the now-unreachable master (6380) and the connection attempt is refused.
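For reference, the test program resolves the master through the Sentinels roughly as in the go-redis sketch below (a simplified illustration inferred from the sentinel.go log lines above, assuming go-redis v9; it is not the platform's actual client code):
```go
package main

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// The failover client asks the Sentinels for the current master of "mymaster"
	// and transparently switches over after a +switch-master event.
	rdb := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName: "mymaster",
		SentinelAddrs: []string{
			"192.168.184.122:26379",
			"192.168.184.122:26380",
			"192.168.184.122:26381",
		},
		Password: "snest123",
	})
	defer rdb.Close()

	// If the master is unreachable and no failover can be authorized (for example
	// when a majority of Sentinels are down), this call fails with a dial error.
	val, err := rdb.Get(ctx, "foo").Result()
	if err != nil {
		log.Fatalf("GET failed: %v", err)
	}
	log.Printf("foo = %s", val)
}
```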
Restoring the Sentinel nodes
Restart the previously stopped Sentinel processes redis-sentinel-1 and redis-sentinel-2 and watch the logs:
1:X 12 Aug 2025 03:47:09.266 # -sdown sentinel ef17ddf007aa36aa2d91f5a7896b8141adca2e07 192.168.184.122 26380 @ mymaster 192.168.184.122 6380
1:X 12 Aug 2025 03:47:11.234 * +sentinel-invalid-addr sentinel ef17ddf007aa36aa2d91f5a7896b8141adca2e07 192.168.184.122 26380 @ mymaster 192.168.184.122 6380
1:X 12 Aug 2025 03:47:11.234 * +sentinel sentinel 96ac7bcab8868b94de33ab08a10761545b6d42c6 192.168.184.122 26380 @ mymaster 192.168.184.122 6380
1:X 12 Aug 2025 03:47:11.242 # Could not rename tmp config file (Device or resource busy)
1:X 12 Aug 2025 03:47:11.242 # WARNING: Sentinel was not able to save the new configuration on disk!!!: Device or resource busy
1:X 12 Aug 2025 03:48:09.284 # Could not rename tmp config file (Device or resource busy)
1:X 12 Aug 2025 03:48:09.284 # WARNING: Sentinel was not able to save the new configuration on disk!!!: Device or resource busy
1:X 12 Aug 2025 03:48:09.284 # +new-epoch 4
1:X 12 Aug 2025 03:48:09.291 # Could not rename tmp config file (Device or resource busy)
1:X 12 Aug 2025 03:48:09.292 # WARNING: Sentinel was not able to save the new configuration on disk!!!: Device or resource busy
1:X 12 Aug 2025 03:48:09.292 # +vote-for-leader 96ac7bcab8868b94de33ab08a10761545b6d42c6 4
1:X 12 Aug 2025 03:48:09.497 # +odown master mymaster 192.168.184.122 6380 #quorum 2/2
1:X 12 Aug 2025 03:48:09.497 # Next failover delay: I will not start a failover before Tue Aug 12 03:54:09 2025
1:X 12 Aug 2025 03:48:09.633 # +config-update-from sentinel 96ac7bcab8868b94de33ab08a10761545b6d42c6 192.168.184.122 26380 @ mymaster 192.168.184.122 6380
1:X 12 Aug 2025 03:48:09.633 # +switch-master mymaster 192.168.184.122 6380 192.168.184.122 6379
1:X 12 Aug 2025 03:48:09.633 * +slave slave 192.168.184.122:6381 192.168.184.122 6381 @ mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:48:09.633 * +slave slave 192.168.184.122:6380 192.168.184.122 6380 @ mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:48:09.640 # Could not rename tmp config file (Device or resource busy)
1:X 12 Aug 2025 03:48:09.640 # WARNING: Sentinel was not able to save the new configuration on disk!!!: Device or resource busy
Once restarted, the Sentinel nodes automatically detect the change of master, update their configuration, and carry out the pending failover, and the Redis cluster returns to a normal working state. Connecting to the Redis cluster from the client program again works fine.
Finally, bring all Redis nodes back by restarting the previously stopped redis-master process:
1:X 12 Aug 2025 03:50:14.541 # -sdown slave 192.168.184.122:6380 192.168.184.122 6380 @ mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:50:20.838 * +reboot master mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:50:36.213 # -sdown sentinel e39828966d5ae85ef8292ce9aa085c84ef5d6203 192.168.184.122 26381 @ mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:50:37.284 * +sentinel-invalid-addr sentinel e39828966d5ae85ef8292ce9aa085c84ef5d6203 192.168.184.122 26381 @ mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:50:37.284 * +sentinel sentinel e4213f7cc46cbac21727dd4ec930e65cd2c1dba2 192.168.184.122 26381 @ mymaster 192.168.184.122 6379
1:X 12 Aug 2025 03:50:37.293 # Could not rename tmp config file (Device or resource busy)
1:X 12 Aug 2025 03:50:37.293 # WARNING: Sentinel was not able to save the new configuration on disk!!!: Device or resource busy
1:X 12 Aug 2025 03:51:16.272 * +fix-slave-config slave 192.168.184.122:6381 192.168.184.122 6381 @ mymaster 192.168.184.122 6379
Everything is back to normal: the master, the replicas, and the Sentinel nodes are all working again.
This concludes the failure testing of the Redis Sentinel setup. The tests verified automatic recovery and failover when the master, a replica, or Sentinel nodes fail. Next we test how the business logic behaves under these failures.
Platform Redis configuration
The Redis mode, node list, and related options are configured in the iidp-app configuration file; the available parameters and their meaning are shown in the configuration snippet and comments below.
#redis
redis.host=10.233.4.32 # only used in standalone mode; sentinel mode reads redis.cluster.nodes
redis.port=6379
redis.db=9 # effective in standalone and sentinel mode
redis.password=snest123 # effective in all modes
redis.max_total=100
redis.max_idle=100
redis.min_idle=0
redis.max_wait_millis=1000
redis.connection_timeout=2000
redis.so_timeout=30000
redis.host={aes}cVx1RYbhsw6NFjiNEZdKsBUUgok9hERqbvUyuHxwA6k=
redis.port={aes}7OuP9bX+lQ4AlwcFEgVMmw==
#redis.db=1
#redis.password={aes}XyWQ6RnjQ+vll5tFgWczLA==
# sentinel mode
redis.mode=sentinel
# master name used in sentinel mode
redis.master=mymaster
# multiple nodes separated by commas, no spaces; use the Sentinel ports here
redis.cluster.nodes=192.168.184.122:26379,192.168.184.122:26380,192.168.184.122:26381
# cluster node scan interval, in milliseconds
#redis.cluster.scanInterval=2000
#redis.db=0
#redis.password=
redis.max_total=100
redis.max_idle=100
redis.min_idle=0
redis.max_wait_millis=1000
redis.connection_timeout=2000
redis.so_timeout=30000
To switch back to the original Redis mode and configuration, simply revert these settings and restart all business pods.
Business-level failure tests
After roughly a day of running, no Redisson connection errors appeared in the logs and the slowlog looked normal.
Scenario 0: switching from standalone Redis to sentinel mode
Steps: install the app in a platform environment configured with standalone Redis and confirm that it works correctly. Stop all backend containers, switch the Redis configuration to sentinel mode, and restart all backend containers.
Result: front-end menus, pages, and create/update/delete operations all work normally.
Menu loading:
Menu deletion:
Tenant authorization succeeded:
Creating a new menu:
Checking that the new menu takes effect:
App installation:
Scenario 1: master-replica failover
Steps:
Stop the master node of the sentinel-managed Redis to simulate a master-replica switch.
Then update the APP and check whether the meta-models can be consumed.
Verification:
Using the topic-check document (provided earlier), verify that the corresponding clients consume messages promptly.
Check the slowlog for commands taking more than 200 ms.
Actual test result
After the master is stopped, the Sentinel cluster is configured with a 60 s down-after timeout, so during that window every Redis operation keeps failing, including message consumption:
2025-08-13 17:46:39.925 [http-nio-8060-exec-2] ERROR o.a.c.c.C.[.[localhost].[/].[dispatcherServlet] -Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception
com.sie.snest.engine.exception.ModelException: Unable to write command into connection! Check CPU usage of the JVM. Try to increase nettyThreads setting. Node source: NodeSource [slot=0, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection@1440995011 [redisClient=[addr=redis://192.168.184.122:6379], channel=[id: 0x0c4c951d, L:0.0.0.0/0.0.0.0:60620], currentCommand=null, usage=1], command: (GET), params: [iidp:open_api:route:/checkhealth] after 3 retry attempts
at com.sie.snest.engine.model.MethodMeta.invoke(MethodMeta.java:217)
at com.sie.snest.engine.api.distributed.RpcInvocationV2.invoke(RpcInvocationV2.java:168)
at com.sie.snest.engine.data.RecordSet.call(RecordSet.java:361)
at com.sie.snest.sdk.cache.RedisHelper.get(RedisHelper.java:26)
at com.sie.snest.api.transformer.filter.ApiRequestFilter.doFilter(ApiRequestFilter.java:115)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
at com.alibaba.druid.support.http.WebStatFilter.doFilter(WebStatFilter.java:124)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:197)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:540)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:135)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78)
at org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:769)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:357)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:382)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:895)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1732)
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.redisson.client.WriteRedisConnectionException: Unable to write command into connection! Check CPU usage of the JVM. Try to increase nettyThreads setting. Node source: NodeSource [slot=0, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection@1440995011 [redisClient=[addr=redis://192.168.184.122:6379], channel=[id: 0x0c4c951d, L:0.0.0.0/0.0.0.0:60620], currentCommand=null, usage=1], command: (GET), params: [iidp:open_api:route:/checkhealth] after 3 retry attempts
at org.redisson.command.RedisExecutor.checkWriteFuture(RedisExecutor.java:367)
at org.redisson.command.RedisExecutor.lambda$execute$4(RedisExecutor.java:197)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:557)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:999)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:860)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:889)
at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:956)
at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1263)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 common frames omitted
Caused by: io.netty.channel.StacklessClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
Redis connection errors:
Login failures:
Consumption errors:
2025/08/13 17:58:17 消费错误: dial tcp 192.168.184.122:6380: connectex: No connection could be made because the target machine actively refused it.
2025/08/13 17:58:28 消费错误: dial tcp 192.168.184.122:6380: connectex: No connection could be made because the target machine actively refused it.
2025/08/13 17:58:37 消费错误: dial tcp 192.168.184.122:6380: connectex: No connection could be made because the target machine actively refused it.
In real business code these exceptions must be caught so that the whole service does not go down, and the failed operation must be retried, indefinitely, until it succeeds; a minimal sketch of this is shown below.
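A minimal sketch of such a retry loop in Go (illustrative only, not the platform's actual code; the 2-second backoff is an arbitrary choice):
```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// doWithRetry retries op until it succeeds or ctx is cancelled, so a Redis
// outage during a failover does not bring the whole consumer service down.
func doWithRetry(ctx context.Context, op func(context.Context) error) error {
	for {
		err := op(ctx)
		if err == nil {
			return nil
		}
		log.Printf("redis operation failed, will retry: %v", err)
		select {
		case <-ctx.Done():
			return ctx.Err() // stop only if the caller gives up
		case <-time.After(2 * time.Second): // back off before the next attempt
		}
	}
}

func main() {
	// Dummy operation that fails twice and then succeeds, standing in for a real
	// Redis call such as reading a key or consuming from the meta-model topic.
	attempts := 0
	op := func(ctx context.Context) error {
		attempts++
		if attempts < 3 {
			return errors.New("connection refused")
		}
		return nil
	}
	if err := doWithRetry(context.Background(), op); err != nil {
		log.Fatal(err)
	}
	log.Printf("succeeded after %d attempts", attempts)
}
```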
Once the new master has been elected, the errors stop and Redis returns to normal.
Calling the loadView API:
Calling the loadMenu API:
Both respond normally.
Scenario 2: initial installation
Steps:
Clear the final_meta* keys in Redis, restart all backend containers, and check whether the meta-models are consumed correctly.
Verification:
Using the topic-check document (provided earlier), verify that the corresponding clients consume messages promptly.
Check the slowlog for commands taking more than 200 ms.
Actual result:
Consumption works normally and no consumption lag was observed.
Details of the consumed topic messages:
2025/08/13 15:57:34 收到消息 ID=1755071850670-0, 内容=map[key:OppmMonthTarget source:47d7894f-070b-47d9-8a75-94c242440260 type:update]
2025/08/13 15:57:35 收到消息 ID=1755071850700-0, 内容=map[key:OppmProLine source:47d7894f-070b-47d9-8a75-94c242440260 type:update]
2025/08/13 15:57:35 收到消息 ID=1755071852169-0, 内容=map[key:demo_role,data_source_test,crud_ds_test,tenant_ds_test,OppmCustomerArchive,a_app_model,buss_store,ops_trace_log,OppmOpportunity,test_order,sql_template_test,TestTest1,shard_year_test,ops_tracking_user_model,TestTest2,test_order_ref,OppmCompanyReport,TestUser,demo_product,service_node source:47d7894f-070b-47d9-8a75-94c242440260 type:update]
2025/08/13 15:57:35 收到消息 ID=1755071852244-0, 内容=map[key:group_by_test,sie_order,OppmProLine,apiOrderService,OppmCrmOpportunityLog,OppmCrmOpportunity,test_eam_maintenance_task,OppmOpportunityThisWeek,wf_process_policy_user,test_data_auth_vo,install_app,demoUserSendVariables,wf_service_method_hook,TestOrderOrg,wf_process_model_view,worker_job_info,seed_ds_test,TestOrg,shard_list_test,seed_test source:47d7894f-070b-47d9-8a75-94c242440260 type:update]
2025/08/13 15:57:35 收到消息 ID=1755071852292-0, 内容=map[key:er_a,test_eam_maintenance_document,worker_instance_info,demo_order,main_ds_test,host_node,demo_product_b,demo_product_c,TestDataSource,base_model_test,test_eam_maintenance_other_expenses,OppmIndustry,inherit_model_test,OppmRefProLineOrg,demoUserReceiverVariables,demo_supplier,TestRole,TestTest,MetaPropertyVm,wf_process_test source:47d7894f-070b-47d9-8a75-94c242440260 type:update]
2025/08/13 15:57:36 收到消息 ID=1755071852382-0, 内容=map[key:test_eam_fault_maintenance_order,OppmMonthTarget,normal_scene_test,many_to_many_c,test_eam_maintenance_item,broadcast_test,buss_order,base_importexportexcel,crud_ds_test_ref,test_eam_maintenance_working_hours,worker_registration_info,sie_order_item,host_node_vm,UserVm,buss_ds_test,test_eam_maintenance_information,er_d,MetaAppVm,er_b,test_data_auth source:47d7894f-070b-47d9-8a75-94c242440260 type:update]
2025/08/13 15:57:36 收到消息 ID=1755071852424-0, 内容=map[key:product_ds_test,OppmRefOrgAccount,OppmAreaSalesManager,buss_wallet,OppmOpportunityNextWeek,receiverVariables,shard_month_test,TestRule,ops_trace_handle_service,demo_user,senderVariables,OppmAreaReport,ops_tracking_service_model source:47d7894f-070b-47d9-8a75-94c242440260 type:update]
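The "收到消息" lines above come from a small Redis Streams consumer. A minimal go-redis sketch of such a consumer is shown below; the stream key meta:final_topic is a placeholder (the real topic name is not given in this document), and the real consumer may read the stream differently, for example via consumer groups:
```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	rdb := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName:    "mymaster",
		SentinelAddrs: []string{"192.168.184.122:26379", "192.168.184.122:26380", "192.168.184.122:26381"},
		Password:      "snest123",
	})
	defer rdb.Close()

	const stream = "meta:final_topic" // placeholder stream key
	lastID := "$"                     // start with new messages only

	for {
		// Block until new entries arrive on the stream.
		res, err := rdb.XRead(ctx, &redis.XReadArgs{
			Streams: []string{stream, lastID},
			Block:   0,
		}).Result()
		if err != nil {
			log.Printf("消费错误: %v", err) // e.g. while a failover is in progress
			time.Sleep(time.Second)      // brief pause before retrying
			continue
		}
		for _, s := range res {
			for _, msg := range s.Messages {
				log.Printf("收到消息 ID=%s, 内容=%v", msg.ID, msg.Values)
				lastID = msg.ID
			}
		}
	}
}
```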
Scenario 3: install, update, uninstall
Steps:
Test normal install, update, and uninstall operations under sentinel mode.
Verification:
Using the topic-check document (provided earlier), verify that the corresponding clients consume messages promptly.
Check the slowlog for commands taking more than 200 ms.
Actual result: consumption works normally with no lag, and install, update, and uninstall all succeed; after each operation the meta-models are consumed correctly. However, a few isolated stream events appear for unrelated meta-models.
Consumption details after installing the app:
2025/08/13 16:07:34 收到消息 ID=1755072452966-0, 内容=map[key:OppmMonthTarget source:97f97547-d38c-4f10-ab4e-4099f96a718f type:update]
2025/08/13 16:07:34 收到消息 ID=1755072452994-0, 内容=map[key:OppmProLine source:97f97547-d38c-4f10-ab4e-4099f96a718f type:update]
2025/08/13 16:07:37 收到消息 ID=1755072455636-0, 内容=map[key:demo_role,data_source_test,crud_ds_test,tenant_ds_test,OppmCustomerArchive,a_app_model,buss_store,ops_trace_log,OppmOpportunity,test_order,sql_template_test,TestTest1,shard_year_test,ops_tracking_user_model,TestTest2,test_order_ref,OppmCompanyReport,TestUser,demo_product,service_node source:97f97547-d38c-4f10-ab4e-4099f96a718f type:update]
2025/08/13 16:07:37 收到消息 ID=1755072455733-0, 内容=map[key:group_by_test,sie_order,OppmProLine,apiOrderService,OppmCrmOpportunityLog,OppmCrmOpportunity,test_eam_maintenance_task,OppmOpportunityThisWeek,wf_process_policy_user,test_data_auth_vo,install_app,demoUserSendVariables,wf_service_method_hook,TestOrderOrg,wf_process_model_view,worker_job_info,seed_ds_test,TestOrg,shard_list_test,seed_test source:97f97547-d38c-4f10-ab4e-4099f96a718f type:update]
2025/08/13 16:07:37 收到消息 ID=1755072455783-0, 内容=map[key:er_a,test_eam_maintenance_document,worker_instance_info,demo_order,main_ds_test,host_node,demo_product_b,demo_product_c,TestDataSource,base_model_test,test_eam_maintenance_other_expenses,OppmIndustry,inherit_model_test,OppmRefProLineOrg,demoUserReceiverVariables,demo_supplier,TestRole,TestTest,MetaPropertyVm,wf_process_test source:97f97547-d38c-4f10-ab4e-4099f96a718f type:update]
2025/08/13 16:07:37 收到消息 ID=1755072455881-0, 内容=map[key:test_eam_fault_maintenance_order,OppmMonthTarget,normal_scene_test,many_to_many_c,test_eam_maintenance_item,broadcast_test,buss_order,base_importexportexcel,crud_ds_test_ref,test_eam_maintenance_working_hours,worker_registration_info,sie_order_item,host_node_vm,UserVm,buss_ds_test,test_eam_maintenance_information,er_d,MetaAppVm,er_b,test_data_auth source:97f97547-d38c-4f10-ab4e-4099f96a718f type:update]
2025/08/13 16:07:37 收到消息 ID=1755072455909-0, 内容=map[key:product_ds_test,OppmRefOrgAccount,OppmAreaSalesManager,buss_wallet,OppmOpportunityNextWeek,receiverVariables,shard_month_test,TestRule,ops_trace_handle_service,demo_user,senderVariables,OppmAreaReport,ops_tracking_service_model source:97f97547-d38c-4f10-ab4e-4099f96a718f type:update]
Consumption details after uninstalling:
2025/08/13 16:02:14 收到消息 ID=1755072133153-0, 内容=map[key:test_eam_maintenance_document,TestTest,TestUser,TestDataSource,test_eam_maintenance_other_expenses,TestTest1,TestTest2,TestRole,test_order_ref,test_order,MetaPropertyVm,TestOrderOrg,test_eam_maintenance_information,test_eam_fault_maintenance_order,test_eam_maintenance_item,test_eam_maintenance_working_hours,UserVm,MetaAppVm,test_data_auth,base_importexportexcel,TestRule,TestOrg,test_data_auth_vo,test_eam_maintenance_task source:bb0ab662-2b19-4b9b-9ffe-0d8b5e9d9223 type:remove]

Consumption details after an update. Updates behave like installs: the pod is restarted, the final state is recomputed, and the final-state messages are published:
2025/08/13 17:08:56 收到消息 ID=1755076134622-0, 内容=map[key:OppmMonthTarget source:9c408177-4c1f-45ea-90e7-3194c04f0aef type:update]
2025/08/13 17:08:56 收到消息 ID=1755076134667-0, 内容=map[key:OppmProLine source:9c408177-4c1f-45ea-90e7-3194c04f0aef type:update]
2025/08/13 17:08:59 收到消息 ID=1755076137980-0, 内容=map[key:demo_role,data_source_test,crud_ds_test,tenant_ds_test,OppmCustomerArchive,a_app_model,buss_store,ops_trace_log,OppmOpportunity,test_order,sql_template_test,TestTest1,shard_year_test,ops_tracking_user_model,TestTest2,test_order_ref,worker_job_info,TestUser,demo_product,service_node source:9c408177-4c1f-45ea-90e7-3194c04f0aef type:update]
2025/08/13 17:08:59 收到消息 ID=1755076138058-0, 内容=map[key:group_by_test,sie_order,OppmProLine,apiOrderService,OppmCrmOpportunityLog,OppmCrmOpportunity,test_eam_maintenance_task,OppmOpportunityThisWeek,wf_process_policy_user,test_data_auth_vo,install_app,demoUserSendVariables,wf_service_method_hook,TestOrderOrg,wf_process_model_view,OppmCompanyReport,seed_ds_test,TestOrg,shard_list_test,seed_test source:9c408177-4c1f-45ea-90e7-3194c04f0aef type:update]
2025/08/13 17:08:59 收到消息 ID=1755076138108-0, 内容=map[key:er_a,test_eam_maintenance_document,worker_instance_info,demo_order,main_ds_test,host_node,demo_product_b,demo_product_c,TestDataSource,base_model_test,test_eam_maintenance_other_expenses,OppmIndustry,inherit_model_test,OppmRefProLineOrg,demoUserReceiverVariables,demo_supplier,TestRole,TestTest,MetaPropertyVm,wf_process_test source:9c408177-4c1f-45ea-90e7-3194c04f0aef type:update]
2025/08/13 17:08:59 收到消息 ID=1755076138218-0, 内容=map[key:test_eam_fault_maintenance_order,OppmMonthTarget,normal_scene_test,many_to_many_c,test_eam_maintenance_item,broadcast_test,buss_order,base_importexportexcel,crud_ds_test_ref,test_eam_maintenance_working_hours,worker_registration_info,sie_order_item,host_node_vm,UserVm,buss_ds_test,test_eam_maintenance_information,er_d,MetaAppVm,er_b,test_data_auth source:9c408177-4c1f-45ea-90e7-3194c04f0aef type:update]
2025/08/13 17:08:59 收到消息 ID=1755076138267-0, 内容=map[key:product_ds_test,OppmRefOrgAccount,OppmAreaSalesManager,buss_wallet,OppmOpportunityNextWeek,receiverVariables,shard_month_test,TestRule,ops_trace_handle_service,demo_user,senderVariables,OppmAreaReport,ops_tracking_service_model source:9c408177-4c1f-45ea-90e7-3194c04f0aef type:update]
2025/08/13 17:09:01 收到消息 ID=1755076140128-0, 内容=map[key:OppmMonthTarget source:4abda77c-aedb-4f48-9186-8f65fded6840 type:update]
2025/08/13 17:09:01 收到消息 ID=1755076140163-0, 内容=map[key:OppmProLine source:4abda77c-aedb-4f48-9186-8f65fded6840 type:update]
Scenario 4: simulating mass install/uninstall
Steps:
Run scripted install/uninstall of the APP for a whole night and check for problems.
Verification:
Using the topic-check document (provided earlier), verify that the corresponding clients consume messages promptly.
Check the slowlog for commands taking more than 200 ms.
Actual result:
Since the install/uninstall flow has to be driven by scripts, the first step is to prepare the JMeter test scripts.
The prepared scripts can be downloaded here: jmeter脚本-安装卸载.zip
Import the JMeter scripts and check that they run correctly, as shown below:
While the scripts run, the pods repeatedly perform scheduled install and uninstall operations, which causes frequent pod restarts, as shown below:
The events of the related deployments can also be inspected in KubeSphere, as shown below:
Meanwhile, check the consumer side:
The install/uninstall flow ran for the whole night of 2025-08-14; see the attached log file for the detailed logs.
Scenario 5: simulating long-term operation
Still in progress.
Steps:
Run for a week without stopping Redis or the backend and observe the behaviour.
Verification:
Using the topic-check document (provided earlier), verify that the corresponding clients consume messages promptly.
Check the slowlog for commands taking more than 200 ms (a sketch of this check follows below).
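The slowlog check listed in the verification steps can be scripted roughly as follows (a sketch; the 200 ms threshold comes from the criteria above, and SlowLogGet is assumed to be available in the go-redis version in use):
```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	rdb := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName:    "mymaster",
		SentinelAddrs: []string{"192.168.184.122:26379", "192.168.184.122:26380", "192.168.184.122:26381"},
		Password:      "snest123",
	})
	defer rdb.Close()

	// Fetch recent slowlog entries and flag anything slower than 200 ms.
	entries, err := rdb.SlowLogGet(ctx, 128).Result()
	if err != nil {
		log.Fatalf("SLOWLOG GET failed: %v", err)
	}
	for _, e := range entries {
		if e.Duration > 200*time.Millisecond {
			log.Printf("slow command (%v): %v", e.Duration, e.Args)
		}
	}
}
```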