最近,我们升级了 SpringBoot 3.x,并且,升级后,我们全面改造了原来的 Sleuth 以及 Micrometer 监控,最新的 io.micrometer.observation.Observation
抽象将链路追踪以及指标监控(opentracing 和 opentelemetry 两个标准)结合,这样,我们就可以在链路追踪中,同时记录指标监控数据了。
并且,在此基础上,我们还加入了全局的 io.micrometer.observation.ObservationHandler
,用于在 Observation
start 的时候,生成 JFR 事件,在 stop 的时候 commit JFR 事件。这样我们就实现了通过一个 Observation
:
/actuator/prometheus
,供 Prometheus 采集,Grafana 展示:为何要这么做呢?
这样,既节约了成本,又能及时报警,又能容易通过 Jaeger 通过链路追踪数据定位出问题的实例,然后通过获取对应实例的 JFR 事件定位详细问题。
但是,我们在全面使用 Observation
的时候,发现了一个问题,就是内存溢出以及 CPU 飙高(非不断 GC 导致),刚开始我们因为 CPU 飙高是内存溢出引起的,但是后来发现,并不只这个原因。
首先为何会出现内存溢出,我们先做个测试,添加依赖:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>3.2.1</version>
</parent>
<groupId>com.github.hashjang</groupId>
<artifactId>wwsmbjysymrdo</artifactId>
<properties>
<!--所有项目 Java 基线版本为 17,如果需要升级,修改这里,注意不可降级-->
<!--Baseline Java Version 17, if you need to upgrade, modify here, note that it cannot be downgraded-->
<maven.compiler.source>17</maven.compiler.source>
<maven.compiler.target>17</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<!--这里放我们自定义的依赖版本属性-->
<disruptor.version>3.4.4</disruptor.version>
</properties>
<dependencies>
<!--lombok简化代码-->
<!--lombok simplifies code-->
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
</dependency>
<!--日志需要用log4j2-->
<!--choose log4j2 for logging-->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter</artifactId>
<exclusions>
<exclusion>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-logging</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-log4j2</artifactId>
</dependency>
<!--log4j2异步日志需要的依赖,所有项目都必须用log4j2和异步日志配置-->
<!--Dependencies required for log4j2 asynchronous logging, all projects must use log4j2 and asynchronous logging configuration-->
<dependency>
<groupId>com.lmax</groupId>
<artifactId>disruptor</artifactId>
<version>${disruptor.version}</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
</dependencies>
</project>
之后编写代码测试:
package com.github.hashjang.wwsmbjysymrdo;
import io.micrometer.common.KeyValue;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tag;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.observation.MeterObservationHandler;
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import lombok.extern.log4j.Log4j2;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.stereotype.Component;
import java.util.ArrayList;
import java.util.List;
@Log4j2
@SpringBootApplication
public class Main {
public static void main(String[] args) {
SpringApplication.run(Main.class);
}
@Component
public static class Runner implements CommandLineRunner {
@Autowired
private ObservationRegistry observationRegistry;
@Override
public void run(String... args) throws Exception {
//不断创建新的 Observation,然后 start
while(true) {
try {
Observation test = Observation.createNotStarted("test", observationRegistry);
test.start();
} catch (Throwable r) {
log.error("error", r);
break;
}
}
log.info("complete");
}
}
}
之后,限制 JVM 最大堆内存为 32m 之后启动,可以看到,不断创建新的 Observation,然后 start,最终导致内存溢出:
. ____ _ __ _ _
/\\ / ___'_ __ _ _(_)_ __ __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
\\/ ___)| |_)| | | | | || (_| | ) ) ) )
' |____| .__|_| |_|_| |_\__, | / / / /
=========|_|==============|___/=/_/_/_/
:: Spring Boot :: (v3.2.1)
2024-01-19T12:50:37.474+08:00 INFO 21401 --- [ main] c.g.h.w.Main : Starting Main using Java 17.0.8 with PID 21401 (/Users/hash/Desktop/Project/Java/spring-cloud-native/article/Problem/wwsmbjysymrdo/target/classes started by hash in /Users/hash/Desktop/Project/Java/spring-cloud-native)
2024-01-19T12:50:37.475+08:00 INFO 21401 --- [ main] c.g.h.w.Main : No active profile set, falling back to 1 default profile: "default"
2024-01-19T12:50:37.853+08:00 INFO 21401 --- [ main] c.g.h.w.Main : Started Main in 0.529 seconds (process running for 0.762)
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
WARN StatusConsoleListener org.apache.logging.log4j.spi.AbstractLogger caught java.lang.OutOfMemoryError logging ReusableSimpleMessage: error
java.lang.OutOfMemoryError: Java heap space
2024-01-19T12:50:40.437+08:00 INFO 21401 --- [ main] c.g.h.w.Main : complete
2024-01-19T12:50:40.913+08:00 WARN 21401 --- [ionShutdownHook] o.s.b.f.s.DisposableBeanAdapter : Invocation of close method failed on bean with name 'log4j2Metrics': java.lang.OutOfMemoryError: Java heap space
这里发生了内存溢出,但是仅仅因为 Observation
的创建和 start
,就导致内存溢出,这是不合理的,我们分析一下,为何会出现这个问题。
我们通过增加如下启动参数启动并且在退出的时候 dump JFR:
-XX:StartFlightRecording=disk=true,dumponexit=true
或者使用下面的参数在内存溢出的时候 dump 整个堆:
-XX:+HeapDumpOnOutOfMemoryError
都可以看到,内存溢出的时候,堆中有大量的 io.micrometer.core.instrument.internal.DefaultLongTaskTimer$SampleImpl
对象:
查看源码,发现在 org.springframework.boot.actuate.autoconfigure.observation.ObservationAutoConfiguration
中,会自动配置一些全局的 ObservationHandler
:
@Configuration(proxyBeanMethods = false)
@ConditionalOnClass(MeterRegistry.class)
@ConditionalOnMissingClass("io.micrometer.tracing.Tracer")
static class OnlyMetricsConfiguration {
@Bean
ObservationHandlerGrouping metricsObservationHandlerGrouping() {
return new ObservationHandlerGrouping(MeterObservationHandler.class);
}
}
@Configuration(proxyBeanMethods = false)
@ConditionalOnClass(Tracer.class)
@ConditionalOnMissingClass("io.micrometer.core.instrument.MeterRegistry")
static class OnlyTracingConfiguration {
@Bean
ObservationHandlerGrouping tracingObservationHandlerGrouping() {
return new ObservationHandlerGrouping(TracingObservationHandler.class);
}
}
@Configuration(proxyBeanMethods = false)
@ConditionalOnClass({ MeterRegistry.class, Tracer.class })
static class MetricsWithTracingConfiguration {
@Bean
ObservationHandlerGrouping metricsAndTracingObservationHandlerGrouping() {
return new ObservationHandlerGrouping(
List.of(TracingObservationHandler.class, MeterObservationHandler.class));
}
}
@Configuration(proxyBeanMethods = false)
@ConditionalOnBean(MeterRegistry.class)
@ConditionalOnMissingBean(MeterObservationHandler.class)
static class MeterObservationHandlerConfiguration {
@ConditionalOnMissingBean(type = "io.micrometer.tracing.Tracer")
@Configuration(proxyBeanMethods = false)
static class OnlyMetricsMeterObservationHandlerConfiguration {
@Bean
DefaultMeterObservationHandler defaultMeterObservationHandler(MeterRegistry meterRegistry) {
return new DefaultMeterObservationHandler(meterRegistry);
}
}
@ConditionalOnBean(Tracer.class)
@Configuration(proxyBeanMethods = false)
static class TracingAndMetricsObservationHandlerConfiguration {
@Bean
TracingAwareMeterObservationHandler<Observation.Context> tracingAwareMeterObservationHandler(
MeterRegistry meterRegistry, Tracer tracer) {
DefaultMeterObservationHandler delegate = new DefaultMeterObservationHandler(meterRegistry);
return new TracingAwareMeterObservationHandler<>(delegate, tracer);
}
}
}
@Configuration(proxyBeanMethods = false)
@ConditionalOnClass(Advice.class)
static class ObservedAspectConfiguration {
@Bean
@ConditionalOnMissingBean
ObservedAspect observedAspect(ObservationRegistry observationRegistry) {
return new ObservedAspect(observationRegistry);
}
}
以上代码的意思是,根据你的项目中是否添加了链路追踪,或者指标监控的依赖,来初始化不同的 ObservationHandler,如果你的项目中只有指标监控,那么就会初始化 DefaultMeterObservationHandler,这个 DefaultMeterObservationHandler 会在每个 Observation 的 start 的时候,创建一个 io.micrometer.core.instrument.internal.DefaultLongTaskTimerSampleImpl 对象,然后将这个 io.micrometer.core.instrument.internal.DefaultLongTaskTimerSampleImpl 对象放入 DefaultLongTaskTimer 的 activeTasks 中,这是一个 ConcurrentLinkedDeque。在调用 Observation 的 stop 的时候,会从 DefaultLongTaskTimer 的 activeTasks 中移除这个 io.micrometer.core.instrument.internal.DefaultLongTaskTimer
知道问题之后,我们给遗漏 stop 的地方加上了 try finally stop。但是,我们发现,即使加上了 try finally stop,也会出现 CPU 飙高的问题,我们通过 JFR 看一下,CPU 究竟消耗在哪里:
我们惊奇的发现,还是和 DefaultLongTaskTimer
有关,看来,即使我们加上了 try finally stop,但是 DefaultLongTaskTimer
的 activeTasks
里面还是有很多 io.micrometer.core.instrument.internal.DefaultLongTaskTimer$SampleImpl
对象,导致多线程 stop 的时候 CPU 飙高。我们做个实验:
package com.github.hashjang.wwsmbjysymrdo;
import io.micrometer.common.KeyValue;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tag;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.observation.MeterObservationHandler;
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import lombok.extern.log4j.Log4j2;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.stereotype.Component;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
@Log4j2
@SpringBootApplication
public class Main {
public static void main(String[] args) {
SpringApplication.run(Main.class);
}
@Component
public static class Runner implements CommandLineRunner {
@Autowired
private ObservationRegistry observationRegistry;
@Override
public void run(String... args) throws Exception {
ExecutorService executor = Executors.newFixedThreadPool(100);
//预热
for (int j = 0; j < 100000; j++) {
executor.execute(() -> {
Observation observation = Observation.createNotStarted("test", observationRegistry);
observation.start();
observation.stop();
});
}
//测试
long start = System.currentTimeMillis();
Future<?> test[] = new Future[100];
for (int i = 0; i < 100; i++) {
test[i] = executor.submit(() -> {
for (int j = 0; j < 100000; j++) {
Observation observation = Observation.createNotStarted("test", observationRegistry);
observation.start();
observation.stop();
}
});
}
for (int i = 0; i < 100; i++) {
test[i].get();
}
log.info("cost {} ms", System.currentTimeMillis() - start);
}
}
}
在我的电脑上,这个测试最后输出显示大概需要 5300ms 左右。我们将全局的 ObservationHandler
改为什么都不做的,对比下:
package com.github.hashjang.wwsmbjysymrdo;
import io.micrometer.common.KeyValue;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tag;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.observation.MeterObservationHandler;
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import lombok.extern.log4j.Log4j2;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.stereotype.Component;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
@Log4j2
@SpringBootApplication
public class Main {
public static void main(String[] args) {
SpringApplication.run(Main.class);
}
@Bean
MeterObservationHandler<Observation.Context> defaultMeterObservationHandler(MeterRegistry meterRegistry) {
return new MeterObservationHandler<>() {
};
}
@Component
public static class Runner implements CommandLineRunner {
@Autowired
private ObservationRegistry observationRegistry;
@Override
public void run(String... args) throws Exception {
ExecutorService executor = Executors.newFixedThreadPool(100);
//预热
for (int j = 0; j < 100000; j++) {
executor.execute(() -> {
Observation observation = Observation.createNotStarted("test", observationRegistry);
observation.start();
observation.stop();
});
}
//测试
long start = System.currentTimeMillis();
Future<?> test[] = new Future[100];
for (int i = 0; i < 100; i++) {
test[i] = executor.submit(() -> {
for (int j = 0; j < 100000; j++) {
Observation observation = Observation.createNotStarted("test", observationRegistry);
observation.start();
observation.stop();
}
});
}
for (int i = 0; i < 100; i++) {
test[i].get();
}
log.info("cost {} ms", System.currentTimeMillis() - start);
}
}
}
最后大概需要 1400 ms 左右。这个差距还是很明显的。
我们看一下 DefaultMeterObservationHandler
的源码:
@Override
public void onStart(Observation.Context context) {
LongTaskTimer.Sample longTaskSample = LongTaskTimer.builder(context.getName() + ".active")
.tags(createTags(context)).register(meterRegistry).start();
context.put(LongTaskTimer.Sample.class, longTaskSample);
Timer.Sample sample = Timer.start(meterRegistry);
context.put(Timer.Sample.class, sample);
}
@Override
public void onStop(Observation.Context context) {
Timer.Sample sample = context.getRequired(Timer.Sample.class);
sample.stop(Timer.builder(context.getName()).tags(createErrorTags(context)).tags(createTags(context))
.register(this.meterRegistry));
LongTaskTimer.Sample longTaskSample = context.getRequired(LongTaskTimer.Sample.class);
longTaskSample.stop();
}
可以看出,默认情况下,DefaultMeterObservationHandler
会在 Observation
的 start
的时候,创建一个 LongTaskTimer.Sample
对象,然后将这个 LongTaskTimer.Sample
对象放入 LongTaskTimer
的 activeTasks
中,然后还有一个 Timer.Sample
对象,这个 Timer.Sample
对象是用于记录 Observation
的耗时的。
其中,LongTaskTimer.Sample
对象会引起如果 Observation
只 start 没有 stop 的话,会导致内存溢出,而 Timer.Sample
对象没有这个问题。并且,LongTaskTimer.Sample
对象的在 stop
的时候,因为多线程的原因可能 ConcurrentLinkedQueue
很大导致 CPU 飙高。
我们可以替换掉 DefaultMeterObservationHandler
,自己实现一个 MeterObservationHandler
,在 start
的时候,不创建 LongTaskTimer.Sample
对象,只保留 Timer.Sample
对象,这样就不会出现内存溢出以及 CPU 飙高的问题了。
package com.github.hashjang.wwsmbjysymrdo;
import io.micrometer.common.KeyValue;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tag;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.observation.MeterObservationHandler;
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import lombok.extern.log4j.Log4j2;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.stereotype.Component;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
@Log4j2
@SpringBootApplication
public class Main {
public static void main(String[] args) {
SpringApplication.run(Main.class);
}
@Bean
MeterObservationHandler<Observation.Context> defaultMeterObservationHandler(MeterRegistry meterRegistry) {
return new MeterObservationHandler<>() {
@Override
public void onStart(Observation.Context context) {
Timer.Sample sample = Timer.start(meterRegistry);
context.put(Timer.Sample.class, sample);
}
@Override
public void onStop(Observation.Context context) {
List<Tag> tags = createTags(context);
tags.add(Tag.of("error", getErrorValue(context)));
Timer.Sample sample = context.getRequired(Timer.Sample.class);
sample.stop(Timer.builder(context.getName()).tags(tags).register(meterRegistry));
}
@Override
public void onEvent(Observation.Event event, Observation.Context context) {
Counter.builder(context.getName() + "." + event.getName())
.tags(createTags(context))
.register(meterRegistry)
.increment();
}
private String getErrorValue(Observation.Context context) {
Throwable error = context.getError();
return error != null ? error.getClass().getSimpleName() : "none";
}
private List<Tag> createTags(Observation.Context context) {
List<Tag> tags = new ArrayList<>();
for (KeyValue keyValue : context.getLowCardinalityKeyValues()) {
tags.add(Tag.of(keyValue.getKey(), keyValue.getValue()));
}
return tags;
}
};
}
@Component
public static class Runner implements CommandLineRunner {
@Autowired
private ObservationRegistry observationRegistry;
@Override
public void run(String... args) throws Exception {
ExecutorService executor = Executors.newFixedThreadPool(100);
//预热
for (int j = 0; j < 100000; j++) {
executor.execute(() -> {
Observation observation = Observation.createNotStarted("test", observationRegistry);
observation.start();
observation.stop();
});
}
//测试
long start = System.currentTimeMillis();
Future<?> test[] = new Future[100];
for (int i = 0; i < 100; i++) {
test[i] = executor.submit(() -> {
for (int j = 0; j < 100000; j++) {
Observation observation = Observation.createNotStarted("test", observationRegistry);
observation.start();
observation.stop();
}
});
}
for (int i = 0; i < 100; i++) {
test[i].get();
}
log.info("cost {} ms", System.currentTimeMillis() - start);
}
}
}
并且,针对这个问题,我们已经提交了 Issue,希望能够尽快采纳建议到 Micrometer 的主干分支中:Remove LongtaskTimer.Sample in DefaultMeterObservationHandler (for the purpose of prevent memory leak and lower CPU consuming)