Troubleshooting 100% CPU Usage
First, some background on this incident.
It was a perfectly ordinary day when Sentry suddenly reported a pile of exceptions:
```
org.redisson.client.RedisTimeoutException: Command still hasn't been written into connection! Check CPU usage of the JVM. Check that there are no blocking invocations in async/reactive/rx listeners or subscribeOnElements method. Check connection with Redis node: 10.1.92.15/10.1.92.15:6379 for TCP packet drops. Try to increase nettyThreads setting. Node source: NodeSource [slot=0, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection@351263091 [redisClient=[addr=redis://10.1.92.15:6379], channel=[id: 0x1917d706, L:/10.2.1.234:59100 - R:10.1.92.15/10.1.92.15:6379], currentCommand=CommandData [command=(EVAL), params: [local value = redis.call('hget', KEYS[1], ARGV[2]); if value == false then return nil; end; local t, val = struct.unpack('dLc0', value); local expireDate = 92233720368547758; local expireDateScore = redis.call('zscore', KEYS[2], ARGV[2]); if expireDateScore ~= false then expireDate = tonumber(expireDateScore) end; if t ~= 0 then local expireIdle = redis.call('zscore', KEYS[3], ARGV[2]); if expireIdle ~= false then if tonumber(expireIdle) > tonumber(ARGV[1]) then redis.call('zadd', KEYS[3], t + tonumber(ARGV[1]), ARGV[2]); end; expireDate = math.min(expireDate, tonumber(expireIdle)) end; end; if expireDate <= tonumber(ARGV[1]) then return nil; end; local maxSize = tonumber(redis.call('hget', KEYS[5], 'max-size')); if maxSize ~= nil and maxSize ~= 0 then local mode = redis.call('hget', KEYS[5], 'mode'); if mode == false or mode == 'LRU' then redis.call('zadd', KEYS[4], tonumber(ARGV[1]), ARGV[2]); else redis.call('zincrby', KEYS[4], 1, ARGV[2]); end; end; return val; , 5, LABEL:WORKFLOW_TRANSFER_INFO_CACHE, redisson__timeout__set:{LABEL:WORKFLOW_TRANSFER_INFO_CACHE}, redisson__idle__set:{LABEL:WORKFLOW_TRANSFER_INFO_CACHE}, redisson__map_cache__last_access__set:{LABEL:WORKFLOW_TRANSFER_INFO_CACHE}, {LABEL:WORKFLOW_TRANSFER_INFO_CACHE}:redisson_options, 1734502065741, PooledUnsafeDirectByteBuf(freed)], promise: io.opentelemetry.javaagent.instrumentation.redisson.CompletableFutureWrapper@3bfcc385[Not completed, 1 dependents], codec=org.redisson.codec.TypedJsonJacksonCodec], usage=1], command: (HGET), params: [LABEL:TASK_JOB_CACHE:827596914534928384, PooledUnsafeDirectByteBuf(ridx: 0, widx: 18, cap: 256)] after 3 retry attempts
	at org.redisson.command.RedisExecutor$1.run(RedisExecutor.java:294)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:713)
	at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:701)
	at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:788)
	at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:501)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:1583)
```
Redisson already points in the right direction. I grabbed the pod name from the Sentry tags and connected to it to look at CPU usage.
```shell
top -Hp 1
```

This shows the per-thread CPU usage inside the Java process (PID 1, as it is the main process of the container), so we can see which threads are the busiest.
```shell
java -jar arthas-boot.jar
```

Then attach to the process with arthas and run:

```shell
thread -n 10
```

This prints the stacks of the 10 threads consuming the most CPU. The top consumers were all Redisson threads:
```
"redisson-netty-2-6" #184 prio=5 os_prio=0 cpu=295856.81ms elapsed=3109.81s tid=0x00007fbd7c41bf40 nid=0xbd runnable [0x00007fbd6c189000]
   java.lang.Thread.State: RUNNABLE
	at org.redisson.client.handler.CommandsQueue.write(CommandsQueue.java:79)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:790)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:758)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:808)
	at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1025)
	at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:306)
	at org.redisson.client.RedisConnection.send(RedisConnection.java:206)
	at org.redisson.command.RedisExecutor.sendCommand(RedisExecutor.java:590)
	at org.redisson.command.RedisExecutor.lambda$execute$3(RedisExecutor.java:164)
	at org.redisson.command.RedisExecutor$$Lambda$2031/0x000000080241d570.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(java.base@17/CompletableFuture.java:863)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(java.base@17/CompletableFuture.java:841)
	at java.util.concurrent.CompletableFuture.postComplete(java.base@17/CompletableFuture.java:510)
	at java.util.concurrent.CompletableFuture.complete(java.base@17/CompletableFuture.java:2147)
	at org.redisson.connection.pool.ConnectionPool.connectedSuccessful(ConnectionPool.java:302)
	at org.redisson.connection.pool.ConnectionPool.connectTo(ConnectionPool.java:266)
	at org.redisson.connection.pool.ConnectionPool.lambda$acquireConnection$2(ConnectionPool.java:226)
	at org.redisson.connection.pool.ConnectionPool$$Lambda$2025/0x000000080241c838.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture$UniAccept.tryFire(java.base@17/CompletableFuture.java:718)
	at java.util.concurrent.CompletableFuture.postComplete(java.base@17/CompletableFuture.java:510)
	at java.util.concurrent.CompletableFuture.complete(java.base@17/CompletableFuture.java:2147)
	at org.redisson.misc.AsyncSemaphore.tryRun(AsyncSemaphore.java:79)
	at org.redisson.misc.AsyncSemaphore.release(AsyncSemaphore.java:96)
	at org.redisson.connection.ClientConnectionsEntry.releaseConnection(ClientConnectionsEntry.java:153)
	at org.redisson.connection.pool.ConnectionPool.releaseConnection(ConnectionPool.java:420)
	at org.redisson.connection.pool.ConnectionPool.returnConnection(ConnectionPool.java:416)
	at org.redisson.connection.pool.MasterConnectionPool.returnConnection(MasterConnectionPool.java:34)
	at org.redisson.connection.MasterSlaveEntry.releaseWrite(MasterSlaveEntry.java:611)
	at org.redisson.connection.SingleEntry.releaseRead(SingleEntry.java:48)
	at org.redisson.connection.MasterSlaveConnectionManager.releaseRead(MasterSlaveConnectionManager.java:585)
	at org.redisson.command.RedisExecutor.release(RedisExecutor.java:629)
	at org.redisson.command.RedisExecutor.releaseConnection(RedisExecutor.java:613)
	at org.redisson.command.RedisExecutor.lambda$execute$4(RedisExecutor.java:174)
	at org.redisson.command.RedisExecutor$$Lambda$2035/0x000000080241e278.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(java.base@17/CompletableFuture.java:863)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(java.base@17/CompletableFuture.java:841)
	at java.util.concurrent.CompletableFuture.postComplete(java.base@17/CompletableFuture.java:510)
	at java.util.concurrent.CompletableFuture.completeExceptionally(java.base@17/CompletableFuture.java:2162)
	at org.redisson.command.RedisExecutor.checkWriteFuture(RedisExecutor.java:327)
	at org.redisson.command.RedisExecutor.lambda$execute$2(RedisExecutor.java:169)
	at org.redisson.command.RedisExecutor$$Lambda$2034/0x000000080241e040.operationComplete(Unknown Source)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:999)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:860)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:709)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:792)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:702)
	at io.netty.handler.codec.MessageToByteEncoder.write(MessageToByteEncoder.java:113)
	at org.redisson.client.handler.CommandEncoder.write(CommandEncoder.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:709)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:792)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:702)
	at io.netty.handler.codec.MessageToByteEncoder.write(MessageToByteEncoder.java:120)
	at org.redisson.client.handler.CommandBatchEncoder.write(CommandBatchEncoder.java:45)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:790)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:758)
	at org.redisson.client.handler.CommandsQueue.write(CommandsQueue.java:83)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(java.base@17/Thread.java:833)
```
This points to the offending code, shown below:
```java
@Override
public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception {
    if (msg instanceof QueueCommand) {
        QueueCommand data = (QueueCommand) msg;
        QueueCommandHolder holder = new QueueCommandHolder(data, promise);
        Queue<QueueCommandHolder> queue = ctx.channel().attr(COMMANDS_QUEUE).get();
        while (true) {
            if (lock.compareAndSet(false, true)) { // CommandsQueue.java:79
                try {
                    queue.add(holder);
                    try {
                        holder.getChannelPromise().addListener(future -> {
                            if (!future.isSuccess()) {
                                queue.remove(holder);
                            }
                        });
                        ctx.writeAndFlush(data, holder.getChannelPromise());
                    } catch (Exception e) {
                        queue.remove(holder);
                        throw e;
                    }
                } finally {
                    lock.set(false);
                }
                break;
            }
        }
    } else {
        super.write(ctx, msg, promise);
    }
}
```
At CommandsQueue.java:79 there is a compareAndSet. As many readers will know, compareAndSet is an optimistic-locking primitive and one of the standard building blocks of a spin lock.
```java
/**
 * Atomically sets the value to {@code newValue}
 * if the current value {@code == expectedValue},
 * with memory effects as specified by {@link VarHandle#compareAndSet}.
 *
 * @param expectedValue the expected value
 * @param newValue the new value
 * @return {@code true} if successful. False return indicates that
 * the actual value was not equal to the expected value.
 */
public final boolean compareAndSet(boolean expectedValue, boolean newValue) {
    return VALUE.compareAndSet(this,
                               (expectedValue ? 1 : 0),
                               (newValue ? 1 : 0));
}
```
This is the logic of Java's AtomicBoolean#compareAndSet. It is simple: if the current value equals the expected value, replace it with the new value and return true; otherwise return false.
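This compare-then-set behavior is easy to observe directly. The following standalone snippet (my own illustration, not Redisson code) shows that only the first CAS succeeds and that a failed CAS leaves the value untouched:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CasSemantics {
    public static void main(String[] args) {
        AtomicBoolean flag = new AtomicBoolean(false);
        // The current value matches the expected false, so it is set to true.
        System.out.println(flag.compareAndSet(false, true)); // true
        // The current value is now true, not the expected false, so nothing changes.
        System.out.println(flag.compareAndSet(false, true)); // false
        System.out.println(flag.get()); // true: the failed CAS did not modify the value
    }
}
```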
Looking back at the Redisson source: whenever compareAndSet returns false, the while loop immediately retries, with no backoff and no exit. Because the flag could never be set successfully, the CAS kept failing, the loop spun forever, and CPU usage was pinned at 100%.
With the cause identified, there were two options: rewrite this piece of Redisson logic ourselves, or upgrade the Redisson version.
I then found issue #6153 in the Redisson repository.
The maintainer's explanation: this problem only occurs when the thread that set lock = true was killed.
The bug is fixed in version 3.37.0, so the simplest remedy is to upgrade to 3.37.0 or later.
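The failure mode the maintainer describes can be reproduced in isolation. The sketch below is my own simplified reconstruction, not Redisson's actual code: one thread flips the flag to true and terminates without ever resetting it, after which every subsequent compareAndSet(false, true) fails and the retry loop spins.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinLockLeak {
    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean lock = new AtomicBoolean(false);

        // This thread "acquires" the flag and terminates without resetting it,
        // simulating a thread killed after compareAndSet(false, true) but
        // before the finally block runs lock.set(false).
        Thread holder = new Thread(() -> lock.compareAndSet(false, true));
        holder.start();
        holder.join();

        // A later writer now spins: the flag is permanently true, so
        // compareAndSet(false, true) can never succeed. Redisson's loop has
        // no exit condition at all; here we cap it so the demo terminates.
        int failedAttempts = 0;
        while (!lock.compareAndSet(false, true) && failedAttempts < 1_000_000) {
            failedAttempts++;
        }
        System.out.println(failedAttempts); // 1000000: every single attempt failed
    }
}
```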
To close, here is a summary of the general approach for troubleshooting 100% CPU usage.
First, locate the process:

```shell
top
```

By default, top sorts by CPU usage.
- -c shows each process's full command line instead of just its name; you can also press c inside top to toggle this.
- -p monitors only the given process id.
- -n exits after n refreshes.
- -H monitors threads instead of processes; the first column then contains thread ids, not process ids.
- -E sets the unit for memory figures (k/m/g/t/p/e); you can also press E inside top to cycle through them.
All of these options are documented in the top manual (man top).
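For instance, the flags above can be combined into a single non-interactive snapshot. One assumption here: -b (batch mode) is an extra flag not listed above that makes top print plain text once instead of running interactively, and $$ (the current shell's pid) stands in for whatever pid you are investigating:

```shell
# One-shot, pipe-friendly snapshot of the threads of one process,
# with full command lines shown.
top -b -c -H -n 1 -p $$
```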
Once you have the process, find which of its threads are hogging the CPU. A CPU pinned at 100% is usually caused by some thread stuck in a busy loop.

```shell
top -Hp <pid>
```

This gives you the offending thread id.
```shell
jps
```

jps prints the process ids of all running Java processes. If you already know the CPU hog is a Java process, this gets you its pid directly.
Find the problematic thread:

```shell
jstack -l <pid>
```

jstack dumps the current stack of every thread in the Java program.

- -l additionally prints lock information, including which threads hold locks and which are waiting on them.
In the jstack output you can see each thread's name and ids. nid is the OS-level thread id, so you can search for nid=<thread id> using the id obtained from top to locate the thread:

```shell
jstack -l <pid> | grep "nid=0x<thread id>" -C 10
```

One catch: jstack prints the nid in hexadecimal (with a 0x prefix), while top shows thread ids in decimal, so a conversion is needed:

```shell
jstack -l <pid> | grep "nid=0x$(printf '%x' <thread id>)" -C 10
```
Yes, printf '%x' n converts a number to hexadecimal.
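Tying this back to the dump earlier: the busy redisson-netty thread above was reported with nid=0xbd, which is decimal 189, so top's thread id 189 maps to it like this:

```shell
printf '%x\n' 189   # prints bd, matching nid=0xbd in the jstack output
# The full lookup for top's thread id 189 would then be:
# jstack -l <pid> | grep "nid=0x$(printf '%x' 189)" -C 10
```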
arthas

If arthas is available, things are much simpler: attach arthas and run

```shell
thread -n 10
```

to print the stacks of the 10 threads with the highest CPU usage.