spring-boot-Micrometer+Prometheus

environment:
micrometer 1.8.2
prometheus 0.14.1
spring-boot-actuator 2.6.6

Use Cases

<!-- Spring Boot Actuator starter; pulls in micrometer-core by default -->
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
  <version>2.6.6</version>
</dependency>
<!-- Micrometer bridging package for Prometheus; pulls in io.prometheus:simpleclient by default -->
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
  <version>1.8.2</version>
</dependency> 

Timer

Records the execution time of each task. The default time window is 1 minute; to change it, configure io.micrometer.core.instrument.distribution.DistributionStatisticConfig.Builder#expiry

Metrics.timer("my_name", "my_tag_1", "my_tag_2").record(() -> {
    doMyJob();
}); 

LongTaskTimer

Like Timer, it records task execution time. The official comments also note that what counts as a "long task" is a subjective judgment, for example a task that runs for more than 1 minute.
One major difference is an additional interface method, io.micrometer.core.instrument.LongTaskTimer#activeTasks,
which returns the number of tasks currently executing.

Metrics.more().longTaskTimer("my_name", "my_tag_1", "my_tag_2").record(() -> { doMyJob(); });
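
To illustrate what "number of tasks currently executing" means, here is a minimal JDK-only sketch. It is not Micrometer's implementation: the class ActiveTasksSketch, its methods, and the explicit nowNanos parameter are all hypothetical, chosen only to make the idea easy to follow.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of in-flight task tracking; not Micrometer's implementation. */
final class ActiveTasksSketch {
    private final Map<Long, Long> startNanosByTaskId = new ConcurrentHashMap<>();
    private final AtomicLong nextTaskId = new AtomicLong();

    /** Marks a task as started and returns its id. */
    long start(long nowNanos) {
        long id = nextTaskId.incrementAndGet();
        startNanosByTaskId.put(id, nowNanos);
        return id;
    }

    /** Marks a task as finished and returns its duration, or -1 if the id is unknown. */
    long stop(long taskId, long nowNanos) {
        Long startedAt = startNanosByTaskId.remove(taskId);
        return startedAt == null ? -1L : nowNanos - startedAt;
    }

    /** Number of tasks that have started but not yet stopped. */
    int activeTasks() {
        return startNanosByTaskId.size();
    }

    /** Total time spent so far by all tasks that are still running. */
    long activeDurationNanos(long nowNanos) {
        long total = 0;
        for (long startedAt : startNanosByTaskId.values()) {
            total += nowNanos - startedAt;
        }
        return total;
    }

    public static void main(String[] args) {
        ActiveTasksSketch timer = new ActiveTasksSketch();
        long first = timer.start(0L);
        timer.start(100L);
        System.out.println(timer.activeTasks());             // two tasks in flight
        System.out.println(timer.activeDurationNanos(300L)); // (300-0) + (300-100)
        timer.stop(first, 400L);
        System.out.println(timer.activeTasks());             // one task in flight
    }
}
```

Unlike a plain timer, which only learns about a task when it finishes, this kind of structure can report on tasks that are still running.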

Gauge

When the server scrapes metrics, or when the client pushes them, the supplied object and method are called to obtain the current value. In other words, a gauge records the current state.

RingBuffer<MatchingOutput> rb = disruptor.getRingBuffer();
Metrics.gauge("ringbuffer_remaining", Tags.of("type", "my_tag_1"), rb, RingBuffer::remainingCapacity);

Counter

Records a count; each call to increment() adds one.

Metrics.counter("my_request", "my_tag_1", "my_tag_2").increment(); 

DistributionSummary

Tracks the distribution of sampled event values, for example the response sizes of requests to an HTTP server.

DistributionSummary ds =  DistributionSummary.builder("my.data.size")
    .tag("type", "my_type_1")
    .publishPercentileHistogram()
    .register(Metrics.globalRegistry);
ds.record(myValue); 

Configure actuator

Configure the port used for scraping metrics and the web endpoints to expose

management:
  server:
    port: 9999
  endpoints:
    web:
      exposure:
        include: '*'
  metrics:
    tags:
      application: myAppName 

Spring Boot integration startup process

Scrape metrics: http://localhost:9999/actuator/prometheus

servlet configuration

Endpoint auto-configuration has many entry points, for example the following two:

  1. Regular web service: org.springframework.boot.actuate.autoconfigure.endpoint.web.servlet.WebMvcEndpointManagementContextConfiguration#webEndpointServletHandlerMapping
  2. Cloud service provider (Cloud Foundry): org.springframework.boot.actuate.autoconfigure.cloudfoundry.servlet.CloudFoundryActuatorAutoConfiguration#cloudFoundryWebEndpointServletHandlerMapping

servlet logic

org.springframework.boot.actuate.metrics.export.prometheus.PrometheusScrapeEndpoint

@ReadOperation(producesFrom = TextOutputFormat.class)
public WebEndpointResponse<String> scrape(TextOutputFormat format, @Nullable Set<String> includedNames) {
    try {
        Writer writer = new StringWriter(this.nextMetricsScrapeSize);
        Enumeration<MetricFamilySamples> samples = (includedNames != null)
        ? this.collectorRegistry.filteredMetricFamilySamples(includedNames)
        : this.collectorRegistry.metricFamilySamples();
        format.write(writer, samples);

        String scrapePage = writer.toString();
        this.nextMetricsScrapeSize = scrapePage.length() + METRICS_SCRAPE_CHARS_EXTRA;

        return new WebEndpointResponse<>(scrapePage, format);
    }
    catch (IOException ex) {
        // This actually never happens since StringWriter doesn't throw an IOException
        throw new IllegalStateException("Writing metrics failed", ex);
    }
} 

When no filter is configured, the enumeration object is obtained as follows:
io.prometheus.client.CollectorRegistry#metricFamilySamples -> io.prometheus.client.CollectorRegistry.MetricFamilySamplesEnumeration#MetricFamilySamplesEnumeration()

io.prometheus.client.CollectorRegistry.MetricFamilySamplesEnumeration

  1. sampleNameFilter
  2. collectorIter: an iterator over the values of the io.prometheus.client.CollectorRegistry#namesToCollectors attribute
  3. The constructor looks up the first element once by calling findNextElement

findNextElement

Advance the collectorIter iterator by one step and collect the metrics of that collector:

  1. io.prometheus.client.Collector#collect(io.prometheus.client.Predicate<java.lang.String>)
  2. io.micrometer.prometheus.MicrometerCollector#collect
  3. Traverse all io.micrometer.prometheus.MicrometerCollector.Child objects in the io.micrometer.prometheus.MicrometerCollector#children collection
  4. For the Gauge type, for example, the Child is the anonymous lambda implementation created in io.micrometer.prometheus.PrometheusMeterRegistry#newGauge
  5. Group all samples in the traversed child according to conventionName (for example: ringbuffer_remaining). Each group corresponds to a sample family: io.prometheus.client.Collector.MetricFamilySamples
  6. Return the List and assign its iterator to the next property: io.prometheus.client.CollectorRegistry.MetricFamilySamplesEnumeration#next
  7. Traverse the samples: io.prometheus.client.Collector.MetricFamilySamples#samples
  8. Write each sample (io.prometheus.client.Collector.MetricFamilySamples.Sample) into the response: org.springframework.boot.actuate.metrics.export.prometheus.TextOutputFormat#CONTENT_TYPE_004#write
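
The grouping described in step 5 can be pictured with a small JDK-only sketch. SampleSketch and FamilyGroupingSketch are hypothetical stand-ins invented for this illustration; they are not the simpleclient or Micrometer classes, and the real code carries more information (type, help text, label lists).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one collected sample; not the simpleclient class. */
final class SampleSketch {
    final String name;
    final String labels;
    final double value;

    SampleSketch(String name, String labels, double value) {
        this.name = name;
        this.labels = labels;
        this.value = value;
    }
}

final class FamilyGroupingSketch {
    /** Groups samples by metric name, keeping the order in which names first appear. */
    static Map<String, List<SampleSketch>> groupByName(List<SampleSketch> samples) {
        Map<String, List<SampleSketch>> families = new LinkedHashMap<>();
        for (SampleSketch sample : samples) {
            families.computeIfAbsent(sample.name, key -> new ArrayList<>()).add(sample);
        }
        return families;
    }

    public static void main(String[] args) {
        List<SampleSketch> samples = Arrays.asList(
                new SampleSketch("ringbuffer_remaining", "type=\"a\"", 1024.0),
                new SampleSketch("my_request_total", "type=\"a\"", 3.0),
                new SampleSketch("ringbuffer_remaining", "type=\"b\"", 512.0));
        // Each map entry plays the role of one sample family.
        groupByName(samples).forEach((name, group) ->
                System.out.println(name + " -> " + group.size() + " sample(s)"));
    }
}
```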

Endpoint output example

Commonly configured tag, carried by every metric: application=myAppName
Metric name: ringbuffer_remaining
Metric tag: type=my_tag_1
Metric type: gauge

# HELP ringbuffer_remaining 
# TYPE ringbuffer_remaining gauge
ringbuffer_remaining{application="myAppName",type="my_tag_1",} 1024.0 
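
As a rough illustration of how such a line is put together, the sketch below renders one sample in the same shape as the output above, including the trailing comma after the last label. TextLineSketch and its methods are hypothetical helpers written for this post, not part of simpleclient or Spring Boot.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical renderer for a single sample line; not the simpleclient writer. */
final class TextLineSketch {
    /** Renders name{key="value",...,} value, matching the shape of the output above. */
    static String render(String name, Map<String, String> tags, double value) {
        StringBuilder line = new StringBuilder(name);
        if (!tags.isEmpty()) {
            line.append('{');
            for (Map.Entry<String, String> tag : tags.entrySet()) {
                // Every label is followed by a comma, including the last one.
                line.append(tag.getKey()).append("=\"").append(escape(tag.getValue())).append("\",");
            }
            line.append('}');
        }
        return line.append(' ').append(value).toString();
    }

    /** Escapes backslashes, double quotes and newlines inside a label value. */
    static String escape(String labelValue) {
        return labelValue.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("application", "myAppName");
        tags.put("type", "my_tag_1");
        System.out.println("# TYPE ringbuffer_remaining gauge");
        System.out.println(render("ringbuffer_remaining", tags, 1024.0));
    }
}
```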

Sampling logic

Gauge

Building on the code from the Gauge use case above:
io.micrometer.core.instrument.internal.DefaultGauge

  1. ref: weak reference corresponding to the ringbuffer instance
  2. value: corresponds to the RingBuffer::remainingCapacity method

Sampling simply calls that method on the instance and uses the returned result as the recorded value.

public class DefaultGauge<T> extends AbstractMeter implements Gauge {
    ...
    private final WeakReference<T> ref;
    private final ToDoubleFunction<T> value;
    ...
    @Override
    public double value() {
        T obj = ref.get();
        if (obj != null) {
            try {
                return value.applyAsDouble(obj);
            }
            catch (Throwable ex) {
                logger.log("Failed to apply the value function for the gauge '" + getId().getName() + "'.", ex);
            }
        }
        return Double.NaN;
    }
    ...
} 
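
The effect of this design, that the value is read at sampling time rather than at registration time, can be seen in a small JDK-only sketch. LazyGaugeSketch is a hypothetical simplification written for this post (no logging, no meter id); a plain queue stands in for the ring buffer.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.ToDoubleFunction;

/** Hypothetical simplified gauge: a weak reference plus a value function. */
final class LazyGaugeSketch<T> {
    private final WeakReference<T> ref;
    private final ToDoubleFunction<T> valueFunction;

    LazyGaugeSketch(T obj, ToDoubleFunction<T> valueFunction) {
        this.ref = new WeakReference<>(obj);
        this.valueFunction = valueFunction;
    }

    /** Reads the current state of the observed object; NaN once it has been collected. */
    double value() {
        T obj = ref.get();
        return obj == null ? Double.NaN : valueFunction.applyAsDouble(obj);
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        LazyGaugeSketch<Queue<String>> gauge = new LazyGaugeSketch<>(queue, Queue::size);
        System.out.println(gauge.value()); // queue is empty at this point
        queue.add("job-1");
        queue.add("job-2");
        System.out.println(gauge.value()); // the same gauge now sees two elements
    }
}
```

Because only a weak reference is held, the gauge does not keep the observed object alive; the caller has to hold its own strong reference for as long as the metric should report real values.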

Timer

io.micrometer.prometheus.PrometheusTimer

  1. count: LongAdder, an incrementing counter
  2. totalTime: LongAdder, the accumulated task duration
  3. max: io.micrometer.core.instrument.distribution.TimeWindowMax, a simplified ring buffer that records the maximum value within the time window
  4. histogramFlavor: the histogram flavor (type); the current version has only two: Prometheus and VictoriaMetrics
  5. histogram, whose implementation depends on the flavor:
     1. Prometheus type: io.micrometer.core.instrument.distribution.TimeWindowFixedBoundaryHistogram#TimeWindowFixedBoundaryHistogram
     2. VictoriaMetrics type: io.micrometer.core.instrument.distribution.FixedBoundaryVictoriaMetricsHistogram#FixedBoundaryVictoriaMetricsHistogram

Recording is triggered when the monitored method is actually called. Sampling only calls the interface method implemented by the instance to take a snapshot when the endpoint is scraped.

  1. io.micrometer.core.instrument.distribution.HistogramSupport#takeSnapshot()
  2. io.micrometer.prometheus.PrometheusTimer#takeSnapshot
  3. io.micrometer.core.instrument.AbstractTimer#takeSnapshot
  4. If histogram != null, append histogramCounts data
--io.micrometer.core.instrument.AbstractTimer#takeSnapshot
    @Override
    public HistogramSnapshot takeSnapshot() {
        return histogram.takeSnapshot(count(), totalTime(TimeUnit.NANOSECONDS), max(TimeUnit.NANOSECONDS));
    }
--io.micrometer.prometheus.PrometheusTimer#takeSnapshot
    @Override
    public HistogramSnapshot takeSnapshot() {
        HistogramSnapshot snapshot = super.takeSnapshot();

        if (histogram == null) {
            return snapshot;
        }

        return new HistogramSnapshot(snapshot.count(),
                snapshot.total(),
                snapshot.max(),
                snapshot.percentileValues(),
                histogramCounts(),
                snapshot::outputSummary);
    } 
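
The split between the write path (recording) and the read path (snapshot at scrape time) can be sketched with JDK classes only. TimerSketch below is a hypothetical simplification: it keeps only count and total time, and leaves out the max and histogram parts.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical minimal timer: LongAdder accumulators plus an on-demand snapshot. */
final class TimerSketch {
    private final LongAdder count = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();

    /** Write path: called whenever a monitored task finishes. */
    void record(long amount, TimeUnit unit) {
        count.increment();
        totalNanos.add(unit.toNanos(amount));
    }

    /** Read path: called only when the metrics are scraped. */
    Snapshot takeSnapshot() {
        return new Snapshot(count.sum(), totalNanos.sum());
    }

    /** Immutable view of the accumulated values at one point in time. */
    static final class Snapshot {
        final long count;
        final long totalNanos;

        Snapshot(long count, long totalNanos) {
            this.count = count;
            this.totalNanos = totalNanos;
        }

        double meanMillis() {
            return count == 0 ? 0.0 : totalNanos / 1_000_000.0 / count;
        }
    }

    public static void main(String[] args) {
        TimerSketch timer = new TimerSketch();
        timer.record(10, TimeUnit.MILLISECONDS);
        timer.record(30, TimeUnit.MILLISECONDS);
        Snapshot snapshot = timer.takeSnapshot();
        System.out.println(snapshot.count + " calls, mean " + snapshot.meanMillis() + " ms");
    }
}
```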

Time window

io.micrometer.core.instrument.distribution.TimeWindowMax

  1. rotatingUpdater: AtomicIntegerFieldUpdater, atomically updates the rotating flag
  2. clock: Clock, the system clock, returns the current system timestamp
  3. durationBetweenRotatesMillis: long, the rotation step (interval between rotations)
  4. ringBuffer: AtomicLong[], the queue
  5. currentBucket: int, the current cursor into the queue
  6. lastRotateTimestampMillis: long, the timestamp of the last rotation
  7. rotating: int, a flag: 0 = not rotating, 1 = rotating

Before every write (record) or query (poll), it first checks whether a rotation is needed by calling the rotate method:
io.micrometer.core.instrument.distribution.TimeWindowMax#rotate

  1. wallTime = the current system time
  2. timeSinceLastRotateMillis = wallTime - lastRotateTimestampMillis, that is, the interval between now and the last rotation
  3. If it is less than the step (timeSinceLastRotateMillis < durationBetweenRotatesMillis), return immediately without rotating
  4. Otherwise, update the flag to mark that a rotation is in progress; other callers have to wait for the next opportunity
  5. If timeSinceLastRotateMillis covers the whole queue (timeSinceLastRotateMillis >= durationBetweenRotatesMillis * ringBuffer.length), reset the entire queue and return:
     1. Traverse the ringBuffer and set every position to 0
     2. Set currentBucket to 0
     3. Update the last rotation time: lastRotateTimestampMillis = wallTime - timeSinceLastRotateMillis % durationBetweenRotatesMillis
  6. Otherwise, reset to 0 only the buckets that expired between the last rotation time and the current time:
int iterations = 0;
do {
    ringBuffer[currentBucket].set(0);
    if (++currentBucket >= ringBuffer.length) {
        currentBucket = 0;
    }
    timeSinceLastRotateMillis -= durationBetweenRotatesMillis;
    lastRotateTimestampMillis += durationBetweenRotatesMillis;
} while (timeSinceLastRotateMillis >= durationBetweenRotatesMillis && ++iterations < ringBuffer.length); 

For example: the current time is 4, the last rotation time is 2, the queue size is 3, durationBetweenRotatesMillis=1 and currentBucket=1, so timeSinceLastRotateMillis=4-2=2
Loop round 1

  1. Update ringBuffer[1]=0
  2. Update currentBucket=2
  3. Update timeSinceLastRotateMillis=2-1=1
  4. Update lastRotateTimestampMillis=2+1=3
  5. timeSinceLastRotateMillis=1 is not less than durationBetweenRotatesMillis, so iterations is updated to 1 and the loop continues

Loop round 2

  1. Update ringBuffer[2]=0
  2. Update currentBucket=3
  3. currentBucket>=queue length
  4. Reset currentBucket=0
  5. Update timeSinceLastRotateMillis=1-1=0
  6. Update lastRotateTimestampMillis=3+1=4
  7. timeSinceLastRotateMillis=0 is now less than durationBetweenRotatesMillis, so the loop ends; because && short-circuits, iterations is not incremented again and stays at 1
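
The walkthrough above can be replayed with a small JDK-only sketch. RotateLoopSketch is a hypothetical class written for this post: it copies only the do/while loop, uses a plain long[] instead of AtomicLong[], and assumes the early-return and full-reset branches have already been ruled out. The initial bucket contents 7, 8, 9 are arbitrary placeholder values.

```java
import java.util.Arrays;

/** Hypothetical replay of the partial-rotation loop with the numbers from the example. */
final class RotateLoopSketch {
    final long[] ringBuffer;
    final long durationBetweenRotatesMillis;
    int currentBucket;
    long lastRotateTimestampMillis;
    int iterations; // kept as a field only so the walkthrough can be inspected

    RotateLoopSketch(long[] ringBuffer, int currentBucket, long lastRotate, long step) {
        this.ringBuffer = ringBuffer;
        this.currentBucket = currentBucket;
        this.lastRotateTimestampMillis = lastRotate;
        this.durationBetweenRotatesMillis = step;
    }

    /** Clears every bucket that expired between the last rotation and wallTime. */
    void rotate(long wallTime) {
        long timeSinceLastRotateMillis = wallTime - lastRotateTimestampMillis;
        iterations = 0;
        do {
            ringBuffer[currentBucket] = 0;
            if (++currentBucket >= ringBuffer.length) {
                currentBucket = 0; // wrap around: the buffer is a ring
            }
            timeSinceLastRotateMillis -= durationBetweenRotatesMillis;
            lastRotateTimestampMillis += durationBetweenRotatesMillis;
        } while (timeSinceLastRotateMillis >= durationBetweenRotatesMillis
                && ++iterations < ringBuffer.length);
    }

    public static void main(String[] args) {
        // Queue size 3, currentBucket=1, last rotation at 2, step 1, current time 4.
        RotateLoopSketch sketch = new RotateLoopSketch(new long[] {7, 8, 9}, 1, 2, 1);
        sketch.rotate(4);
        System.out.println(Arrays.toString(sketch.ringBuffer)); // [7, 0, 0]
        System.out.println(sketch.currentBucket);               // 0
        System.out.println(sketch.lastRotateTimestampMillis);   // 4
        System.out.println(sketch.iterations);                  // 1
    }
}
```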

Illustration of one rotation

When the last rotation time (lastRotateTimestampMillis) is found to lag behind the current time (wallTime) by 4 units, lastRotateTimestampMillis moves 4 time units to the right, and currentBucket also moves 4 positions to the right. Because currentBucket is an array index, it wraps around to 0 when it goes out of bounds and continues from there (a ring). For example, see the following figure:
currentBucket moves 4 positions to the right; with a queue length of 3 and a current index of 0, the index after the move is (0 + 4) mod 3 = 1 (one full wrap-around)
[Figure: currentBucket wrapping around the ring buffer during a rotation]
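
Putting record, poll and rotate together, a single-threaded sketch of a time-windowed maximum could look like the following. WindowMaxSketch is hypothetical and simplified: an AtomicLong serves as a manually advanced clock, the buffer is a plain long[], and the atomic rotating flag is left out, so this is not thread-safe and not Micrometer's code.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical single-threaded time-windowed max with a manually advanced clock. */
final class WindowMaxSketch {
    private final AtomicLong clockMillis; // assumption: the caller advances this by hand
    private final long[] ringBuffer;
    private final long durationBetweenRotatesMillis;
    private int currentBucket;
    private long lastRotateTimestampMillis;

    WindowMaxSketch(AtomicLong clockMillis, long windowMillis, int bufferLength) {
        this.clockMillis = clockMillis;
        this.ringBuffer = new long[bufferLength];
        this.durationBetweenRotatesMillis = windowMillis / bufferLength;
        this.lastRotateTimestampMillis = clockMillis.get();
    }

    /** Write: rotate first, then raise the max in every bucket. */
    void record(long sample) {
        rotate();
        for (int i = 0; i < ringBuffer.length; i++) {
            ringBuffer[i] = Math.max(ringBuffer[i], sample);
        }
    }

    /** Read: rotate first, then return the max held by the current bucket. */
    long poll() {
        rotate();
        return ringBuffer[currentBucket];
    }

    private void rotate() {
        long wallTime = clockMillis.get();
        long timeSinceLastRotateMillis = wallTime - lastRotateTimestampMillis;
        if (timeSinceLastRotateMillis < durationBetweenRotatesMillis) {
            return; // not yet time to rotate
        }
        if (timeSinceLastRotateMillis >= durationBetweenRotatesMillis * ringBuffer.length) {
            Arrays.fill(ringBuffer, 0); // the whole window has expired
            currentBucket = 0;
            lastRotateTimestampMillis = wallTime - timeSinceLastRotateMillis % durationBetweenRotatesMillis;
            return;
        }
        int iterations = 0;
        do {
            ringBuffer[currentBucket] = 0;
            if (++currentBucket >= ringBuffer.length) {
                currentBucket = 0;
            }
            timeSinceLastRotateMillis -= durationBetweenRotatesMillis;
            lastRotateTimestampMillis += durationBetweenRotatesMillis;
        } while (timeSinceLastRotateMillis >= durationBetweenRotatesMillis
                && ++iterations < ringBuffer.length);
    }

    public static void main(String[] args) {
        AtomicLong clock = new AtomicLong(0);
        WindowMaxSketch max = new WindowMaxSketch(clock, 3000, 3);
        max.record(50);
        clock.set(1000);
        max.record(20);
        System.out.println(max.poll()); // 50 is still inside the window
        clock.set(3000);
        System.out.println(max.poll()); // 50 has expired, 20 remains
    }
}
```

Writing each sample into every bucket means that a bucket, once it becomes current, already holds the maximum of everything recorded since it was last cleared.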

Summary

Micrometer can integrate with Prometheus and with time series databases such as InfluxDB. Its main role is to act as a bridge, much like the relationship between SLF4J and Log4j or Logback: it provides a common instrumentation API and delivers the recorded data to the corresponding time series database, so users only need to care about when to record. For example:

  1. Use io.micrometer.prometheus.PrometheusMeterRegistry from the bridging package to bridge the recorded data into io.prometheus.client.CollectorRegistry
  2. Use io.micrometer.influx.InfluxMeterRegistry from the bridging package to push the recorded data to InfluxDB using the Influx protocol:
     1. The default push frequency is once per minute and can be configured as needed: io.micrometer.core.instrument.push.PushRegistryConfig#step
     2. The thread pool is single-threaded by default: java.util.concurrent.Executors#newSingleThreadScheduledExecutor(java.util.concurrent.ThreadFactory)
     3. The thread naming rule implemented for InfluxDB: influx-metrics-publisher
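
The push model described here (a single scheduler thread with a fixed name, firing once per step) can be sketched with the JDK alone. PushSchedulerSketch and its start method are hypothetical; the 50 ms step is only there to keep the demo short, and the publish action is a placeholder rather than a real InfluxDB write.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a periodic publisher on one named scheduler thread. */
final class PushSchedulerSketch {
    /** Starts a single-threaded scheduler that runs publish once per step. */
    static ScheduledExecutorService start(String threadName, long stepMillis, Runnable publish) {
        ThreadFactory factory = runnable -> {
            Thread thread = new Thread(runnable, threadName);
            thread.setDaemon(true); // do not keep the JVM alive just for publishing
            return thread;
        };
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(factory);
        scheduler.scheduleAtFixedRate(publish, stepMillis, stepMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch twoPushes = new CountDownLatch(2);
        ScheduledExecutorService scheduler = start("influx-metrics-publisher", 50, () -> {
            System.out.println(Thread.currentThread().getName() + " publishing");
            twoPushes.countDown();
        });
        twoPushes.await(5, TimeUnit.SECONDS); // wait for two publish rounds
        scheduler.shutdownNow();
    }
}
```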

The actuator works like a launcher: it auto-configures what is needed to connect to a specific time series database, such as the configuration for exposing the Prometheus web endpoint, the Influx configuration, and Micrometer metrics configurations such as the following:

  1. org.springframework.boot.actuate.autoconfigure.metrics.JvmMetricsAutoConfiguration
  2. org.springframework.boot.actuate.autoconfigure.metrics.KafkaMetricsAutoConfiguration
  3. org.springframework.boot.actuate.autoconfigure.metrics.Log4J2MetricsAutoConfiguration
  4. org.springframework.boot.actuate.autoconfigure.metrics.LogbackMetricsAutoConfiguration
  5. org.springframework.boot.actuate.autoconfigure.metrics.SystemMetricsAutoConfiguration

Finally, a reporting tool such as Grafana can connect to the data source and display charts. A common architecture is: Micrometer -> Prometheus -> Grafana
Note: front-end page rendering can become a bottleneck. For example, if a metric has too many tags, the report becomes very laggy. Typically the slowdown is noticeable at around 5k tags, and 10,000+ clearly affects usability.