首页
About Me
推荐
weibo
github
Search
1
linuxea:gitlab-ci之docker镜像质量品质报告
49,451 阅读
2
linuxea:如何复现查看docker run参数命令
23,044 阅读
3
Graylog收集文件日志实例
18,580 阅读
4
linuxea:jenkins+pipeline+gitlab+ansible快速安装配置(1)
18,275 阅读
5
git+jenkins发布和回滚示例
18,181 阅读
ops
Openvpn
Sys Basics
rsync
Mail
NFS
Other
Network
HeartBeat
server 08
Code
Awk
Shell
Python
Golang
virtualization
KVM
Docker
openstack
Xen
kubernetes
kubernetes-cni
Service Mesh
Data
Mariadb
PostgreSQL
MongoDB
Redis
MQ
Ceph
TimescaleDB
kafka
surveillance system
zabbix
ELK Stack/logs
Open-Falcon
Prometheus
victoriaMetrics
Web
apache
Tomcat
Nginx
自动化
Puppet
Ansible
saltstack
Proxy
HAproxy
Lvs
varnish
更多
互联咨询
最后的净土
软件交付
持续集成
gitops
devops
登录
Search
标签搜索
kubernetes
docker
zabbix
Golang
mariadb
持续集成工具
白话容器
elk
linux基础
nginx
dockerfile
Gitlab-ci/cd
最后的净土
基础命令
gitops
jenkins
docker-compose
Istio
haproxy
saltstack
marksugar
累计撰写
690
篇文章
累计收到
139
条评论
首页
栏目
ops
Openvpn
Sys Basics
rsync
Mail
NFS
Other
Network
HeartBeat
server 08
Code
Awk
Shell
Python
Golang
virtualization
KVM
Docker
openstack
Xen
kubernetes
kubernetes-cni
Service Mesh
Data
Mariadb
PostgreSQL
MongoDB
Redis
MQ
Ceph
TimescaleDB
kafka
surveillance system
zabbix
ELK Stack/logs
Open-Falcon
Prometheus
victoriaMetrics
Web
apache
Tomcat
Nginx
自动化
Puppet
Ansible
saltstack
Proxy
HAproxy
Lvs
varnish
更多
互联咨询
最后的净土
软件交付
持续集成
gitops
devops
页面
About Me
推荐
weibo
github
搜索到
90
篇与
的结果
2023-08-25
linuxea: openobseve HA本地单集群模式
ha默认就不支持本地存储了,集群模式下openobseve会运行多个节点,每个节点都是无状态的,数据存储在对象存储中,元数据在etcd中,因此理论上openobseve可以随时进行水平扩容组件如下:router:处理数据写入和页面查询,作为路由etcd: 存储用户信息,函数,规则,元数据等s3: 数据本身querier: 数据查询ingester: 数据没有在被写入到s3中之前,数据会进行临时通过预写来确保数据不会丢失,这类似于prometheus的walcompactor: 合并小文件到大文件,以及数据保留时间要配置集群模式,我们需要一个 对象存储,awk的s3,阿里的oss,或者本地的minio,还需要部署一个etcd作为元数据的存储,并且为ingester数据提供一个pvc,因为openobseve是运行在k8s上etcd我们将etcd运行在外部k8s之外的外部节点version: '2' services: oo_etcd: container_name: oo_etcd #image: 'docker.io/bitnami/etcd/3.5.8-debian-11-r4' image: uhub.service.ucloud.cn/marksugar-k8s/etcd:3.5.8-debian-11-r4 #network_mode: host restart: always environment: - ALLOW_NONE_AUTHENTICATION=yes - ETCD_ADVERTISE_CLIENT_URLS=http://0.0.0.0:2379 #- ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379 #- ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380 - ETCD_DATA_DIR=/bitnami/etcd/data volumes: - /etc/localtime:/etc/localtime:ro # 时区2 - /data/etcd/date:/bitnami/etcd # chown -R 777 /data/etcd/date/ ports: - 2379:2379 - 2380:2380 logging: driver: "json-file" options: max-size: "50M" mem_limit: 2048mpvc需要一个安装好的storageClass,我这里使用的是nfs-subdir-external-provisioner创建的nfs-latestminio部署一个单机版本的minio进行测试即可version: '2' services: oo_minio: container_name: oo_minio image: "uhub.service.ucloud.cn/marksugar-k8s/minio:RELEASE.2023-02-10T18-48-39Z" volumes: - /etc/localtime:/etc/localtime:ro # 时区2 - /docker/minio/data:/data command: server --console-address ':9001' /data environment: - MINIO_ACCESS_KEY=admin #管理后台用户名 - MINIO_SECRET_KEY=admin1234 #管理后台密码,最小8个字符 ports: - 9000:9000 # api 端口 - 9001:9001 # 控制台端口 logging: driver: "json-file" options: max-size: "50M" mem_limit: 2048m healthcheck: test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"] interval: 30s timeout: 20s retries: 3启动后创建一个名为openobserve的桶安装openObserve我们仍然使用helm进行安装helm repo add openobserve https://charts.openobserve.ai helm repo update kubectl create ns openobserve对values.yaml定制的内容如下,latest.yaml:image: repository: uhub.service.ucloud.cn/marksugar-k8s/openobserve pullPolicy: IfNotPresent # Overrides the image tag whose default is the chart appVersion. tag: "latest" # 副本数 replicaCount: ingester: 1 querier: 1 router: 1 alertmanager: 1 compactor: 1 ingester: persistence: enabled: true size: 10Gi storageClass: "nfs-latest" # NFS的storageClass accessModes: - ReadWriteOnce # Credentials for authentication # 账号密码 auth: ZO_ROOT_USER_EMAIL: "root@example.com" ZO_ROOT_USER_PASSWORD: "abc123" # s3地址 ZO_S3_ACCESS_KEY: "admin" ZO_S3_SECRET_KEY: "admin1234" etcd: enabled: false # if true then etcd will be deployed as part of openobserve externalUrl: "172.16.100.47:2379" config: # ZO_ETCD_ADDR: "172.16.100.47:2379" # etcd地址 # ZO_HTTP_ADDR: "172.16.100.47:2379" ZO_DATA_DIR: "./data/" #数据目录 # 开启minio ZO_LOCAL_MODE_STORAGE: s3 ZO_S3_SERVER_URL: http://172.16.100.47:9000 ZO_S3_REGION_NAME: local ZO_S3_ACCESS_KEY: admin ZO_S3_SECRET_KEY: admin1234 ZO_S3_BUCKET_NAME: openobserve ZO_S3_BUCKET_PREFIX: openobserve ZO_S3_PROVIDER: minio ZO_TELEMETRY: "false" # 禁用匿名 ZO_WAL_MEMORY_MODE_ENABLED: "false" # 内存模式 ZO_WAL_LINE_MODE_ENABLED: "true" # wal写入模式 #ZO_S3_FEATURE_FORCE_PATH_STYLE: "true" # 数据没有在被写入到s3中之前,数据会进行临时通过预写来确保数据不会丢失,这类似于prometheus的wal resources: ingester: {} querier: {} compactor: {} router: {} alertmanager: {} autoscaling: ingester: enabled: false minReplicas: 1 maxReplicas: 100 targetCPUUtilizationPercentage: 80 # targetMemoryUtilizationPercentage: 80 querier: enabled: false minReplicas: 1 maxReplicas: 100 targetCPUUtilizationPercentage: 80 # targetMemoryUtilizationPercentage: 80 router: enabled: false minReplicas: 1 maxReplicas: 100 targetCPUUtilizationPercentage: 80 # targetMemoryUtilizationPercentage: 80 compactor: enabled: false minReplicas: 1 maxReplicas: 100 targetCPUUtilizationPercentage: 80 # targetMemoryUtilizationPercentage: 80指定本地minio,桶名称,认证信息等;指定etcd地址;为ingester指定sc; 而后安装 helm upgrade --install openobserve -f latest.yaml --namespace openobserve openobserve/openobserve如下[root@master-01 ~/openObserve]# helm upgrade --install openobserve -f latest.yaml --namespace openobserve openobserve/openobserve Release "openobserve" does not exist. Installing it now. NAME: openobserve LAST DEPLOYED: Sun Aug 20 18:04:31 2023 NAMESPACE: openobserve STATUS: deployed REVISION: 1 TEST SUITE: None NOTES: 1. Get the application URL by running these commands: kubectl --namespace openobserve port-forward svc/openobserve-openobserve-router 5080:5080 [root@master-01 ~/openObserve]# kubectl -n openobserve get pod NAME READY STATUS RESTARTS AGE openobserve-alertmanager-6f486d5df5-krtxm 1/1 Running 0 53s openobserve-compactor-98ccf664c-v9mkb 1/1 Running 0 53s openobserve-ingester-0 1/1 Running 0 53s openobserve-querier-695cf4fcc9-854z8 1/1 Running 0 53s openobserve-router-65b68b4899-j9hs7 1/1 Running 0 53s [root@master-01 ~/openObserve]# kubectl -n openobserve get pvc NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE data-openobserve-ingester-0 Bound pvc-5d86b642-4464-4b3e-950a-d5e0b4461c27 10Gi RWO nfs-latest 2m47s而后配置一个Ingress指向openobserve-routerapiVersion: networking.k8s.io/v1 kind: Ingress metadata: name: openobserve-ui namespace: openobserve labels: app: openobserve annotations: # kubernetes.io/ingress.class: nginx cert-manager.io/issuer: letsencrypt kubernetes.io/tls-acme: "true" nginx.ingress.kubernetes.io/enable-cors: "true" nginx.ingress.kubernetes.io/connection-proxy-header: keep-alive nginx.ingress.kubernetes.io/proxy-connect-timeout: '600' nginx.ingress.kubernetes.io/proxy-send-timeout: '600' nginx.ingress.kubernetes.io/proxy-read-timeout: '600' nginx.ingress.kubernetes.io/proxy-body-size: 32m spec: ingressClassName: nginx rules: - host: openobserve.test.com http: paths: - path: / pathType: ImplementationSpecific backend: service: name: openobserve-router port: number: 5080添加本地hosts后打开此时是没有任何数据的测试我们手动写入测试数据[root@master-01 ~/openObserve]# curl http://openobserve.test.com/api/linuxea/0820/_json -i -u 'root@example.com:abc123' -d '[{"author":"marksugar","name":"www.linuxea.com"}]' HTTP/1.1 200 OK Date: Sun, 20 Aug 2023 11:02:08 GMT Content-Type: application/json Content-Length: 65 Connection: keep-alive Vary: Accept-Encoding vary: accept-encoding Access-Control-Allow-Origin: * Access-Control-Allow-Credentials: true Access-Control-Allow-Methods: GET, PUT, POST, DELETE, PATCH, OPTIONS Access-Control-Allow-Headers: DNT,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range,Authorization Access-Control-Max-Age: 1728000 {"code":200,"status":[{"name":"0820","successful":1,"failed":0}]}数据插入同时,在NFS的本地磁盘也会写入[root@Node-172_16_100_49 ~]# cat /data/nfs-share/openobserve/data-openobserve-ingester-0/wal/files/linuxea/logs/0820/0_2023_08_20_13_2c624affe8540b70_7099015230658842624DKMpVA.json {"_timestamp":1692537124314778,"author":"marksugar","name":"www.linuxea.com"}在minio内的数据也进行写入minio中存储的数据无法查看,因为元数据在etcd中。
2023年08月25日
208 阅读
0 评论
0 点赞
2023-08-20
linuxea: openobseve单节点和查询语法
OpenObserve声称可以比Elasticsearch 它⼤约可以节省 140 倍的存储成本,同时由Rust开发的可观测性平台(⽇志、指标、追踪),它可以进行日志搜索,基于sql查询语句和搜索的日志关键字的上下周围数据,高压缩比的存储,身份验证和多租户,支持S3,miniio的高可用和集群,并且兼容elasticsearch的摄取,搜索,聚合api,计划报警和实时报警等功能。如果只是对日志搜索引擎感兴趣,相比于Elasticsearch的和zincSearch,OpenObserve更轻量,它不依赖于数据索引,数据被压缩后存储或者使用parquet列格式存储到对象存储中,尽管如此,在分区和缓存等技术的加持下,速度也不会太慢,并且在聚合查询数据情况下,OpenObserve的速度要比es快的多。OpenObserve的节点是无状态的,因此在水平扩展中,无需担心数据的复制损坏,他的运维工作和成本比Elasticsearch要低得多。并且OpenObserve内置的图像界面,单节点无需使用其他组件,仅仅OpenObserve就完成了存储和查询。同时OpenObserve作为prometheus的远程存储和查看,但是对于查询语句并非全部支持,因此,我们只进行日志收集的处理,其他不进行测试。单节点sled单节点和本地磁盘单节点使用本地磁盘模式,也是默认的方式,对于简单使用和测试在官方数据中,每天可以处理超过2T的数据我们可以简单的理解为sled是存储的元数据的地方,本地磁盘存储的数据。sled单机和对象存储数据存放在对象存储后,高可用的问题交给了对象存储提供商,并且openobseve使用列式存储的方式,并且进行分区,这样的情况下就规避了一定部分因为对象存储与openobseve之间的网络延迟导致的网络问题etcd和对象存储除此之外,进一步来说,元数据也需要进行有个妥善的地方存储,因此,元数据存储的etcd,meta数据存储在本地或者s3来保证数据的安全性单机安装在k8s的用户可以参考官方的deploy文件, 而使用docker-compose我们需要指定三个环境变量,分别是数据的存储目录,用户名和密码 - ZO_DATA_DIR=/data - ZO_ROOT_USER_EMAIL=root@example.com - ZO_ROOT_USER_PASSWORD=Complexpass#123如下version: "2.2" services: openobserve: container_name: openobserve restart: always image: public.ecr.aws/zinclabs/openobserve:latest ports: - "5080:5080" volumes: - /etc/localtime:/etc/localtime:ro # 时区2 - /data/openobserve:/data environment: - ZO_DATA_DIR=/data - ZO_ROOT_USER_EMAIL=root@example.com - ZO_ROOT_USER_PASSWORD=Complexpass#123 logging: driver: "json-file" options: max-size: "100M" mem_limit: 4096m而后我们直接打开映射的5080端口当前的版本中多语言通过机器翻译,因此并不准确loglog支持curl,filebeat,fluentbit,fluentd, vector等,并且这些提供了一定的示例curlcurl在低版本中,我们需要-d指定文件最快,如果你的版本在7.82,那么可以使用--json指定,参考官方文档1.创建一个linuxea的组,组下创建一个名为0819的分区,在0819中写入'[{"author":"marksugar"}]'[root@Node-172_16_100_151 /data/openObserve]# curl http://172.16.100.151:5080/api/linuxea/0819/_json -i -u 'root@example.com:Complexpass#123' -d '[{"author":"marksugar"}]' HTTP/1.1 200 OK content-length: 65 content-type: application/json vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers date: Sat, 19 Aug 2023 07:43:15 GMT {"code":200,"status":[{"name":"0819","successful":1,"failed":0}]}而后创建一组数据,内容如下cat > linuxea.json << LOF [{ "name": "linuxea", "web": "www.linuxea.com", "time": "2023-08-19", "log": "2023-08-18 09:04:01 Info Super Saiyan , this is a normal phenomenon", "info": "this is test" }] LOF接着也添加到linuxea/0819curl http://172.16.100.151:5080/api/linuxea/0819/_json -i -u 'root@example.com:Complexpass#123' --data-binary "@linuxea.json"如:[root@Node-172_16_100_151 /data/openObserve]# cat > linuxea.json << LOF > [{ > "name": "linuxea", > "web": "www.linuxea.com", > "time": "2023-08-19", > "log": "2023-08-18 09:04:01 Info Super Saiyan , this is a normal phenomenon", > "info": "this is test" > }] > LOF [root@Node-172_16_100_151 /data/openObserve]# curl http://172.16.100.151:5080/api/linuxea/0819/_json -i -u 'root@example.com:Complexpass#123' --data-binary "@linuxea.json" HTTP/1.1 200 OK content-length: 65 content-type: application/json vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers date: Sat, 19 Aug 2023 07:46:37 GMT {"code":200,"status":[{"name":"0819","successful":1,"failed":0}]}3, 在linuxea组下创建一个0820的分区写入不同的数据curl http://172.16.100.151:5080/api/linuxea/0820/_json -i -u 'root@example.com:Complexpass#123' -d '[{"author":"marksugar","name":"www.linuxea.com"}]'如下:[root@Node-172_16_100_151 /data/openObserve]# curl http://172.16.100.151:5080/api/linuxea/0820/_json -i -u 'root@example.com:Complexpass#123' -d '[{"author":"marksugar","name":"www.linuxea.com"}]' HTTP/1.1 200 OK content-length: 65 content-type: application/json vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers date: Sat, 19 Aug 2023 07:47:45 GMT {"code":200,"status":[{"name":"0820","successful":1,"failed":0}]}接着 ,我们回到界面查看数据的插入情况08190820查询默认情况下,如果将要查询的数据的字段是msg, message, log, logs,可以使用match_all('error'),match_all('error')是说匹配所有的error,但前提是msg, message, log, logs的字段内如果不是msg, message, log, logs的,比如是body,我们可以使用str_match(body, 'error')当然,这也可以使用str_match(log, 'error')来进行查询其他更多用法,参考example-queries我们重组数据进行查询如果此时,以下字段中的log中包含了我们需要的数据,那么我们就可以使用match_allcat > linuxea.json << LOF [{ "name": "linuxea", "web": "www.linuxea.com", "time": "2023-08-19", "log": "2023-08-18 09:04:01 Info Super Saiyan , this is a normal phenomenon linuxea", "info": "this is test", "author":"marksugar", }] LOF比如,我们查询的关键字是linuxea如果我们要查询的字段是marksugar,就不能使用match_all('marksugar'),因为match_all只能在是msg, message, log, logs中默认使用,正确的语法是:str_match(author, 'marksugar'),并且str_match的查询比match_all要快而对于其他的,可以直接使用key:value来, 比如: name='linuxea'对于这条查询,也可以使用sql语句的方式 name='linuxea' 等于 SELECT * FROM "0819" WHERE name='linuxea'参考基于k8s上loggie/vector/openobserve日志收集https://openobserve.ai/docs/example-queries/https://everything.curl.dev/http/post/jsonhttps://openobserve.ai/docs/ingestion/logs/curl/
2023年08月20日
263 阅读
0 评论
0 点赞
2023-08-19
linuxea: 基于k8s上loggie/vector/openobserve日志收集
在上次的日志收集组件变化中简单的介绍了新方案,通常要么基于K8s收集容器的标准输出,要么收集文件。我们尝试使用最新的方式进行配置日志收集的组合进行测试,如下:但是,在开始之前,我们需要部署kafka,zookeeper和kowl1.kafka修改kafka的ip地址version: "2" services: zookeeper: container_name: zookeeper image: uhub.service.ucloud.cn/marksugar-k8s/zookeeper:latest container_name: zookeeper restart: always ports: - '2182:2181' environment: - ALLOW_ANONYMOUS_LOGIN=yes logging: driver: "json-file" options: max-size: "100M" mem_limit: 2048m kafka: hostname: 172.16.100.151 image: uhub.service.ucloud.cn/marksugar-k8s/kafka:2.8.1 container_name: kafka user: root restart: always ports: - '9092:9092' volumes: - "/data/log/kafka:/bitnami/kafka" # chmod 777 -R /data/kafka environment: - KAFKA_BROKER_ID=1 - KAFKA_LISTENERS=PLAINTEXT://:9092 - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://172.16.100.151:9092 - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 - ALLOW_PLAINTEXT_LISTENER=yes depends_on: - zookeeper logging: driver: "json-file" options: max-size: "100M" mem_limit: 2048m kowl: container_name: kowl # network_mode: host restart: always # image: quay.io/cloudhut/kowl:v1.5.0 image: uhub.service.ucloud.cn/marksugar-k8s/kowl:v1.5.0 restart: on-failure hostname: kowl ports: - "8081:8080" environment: KAFKA_BROKERS: 172.16.100.151:9092 volumes: - /etc/localtime:/etc/localtime:ro # 时区2 depends_on: - kafka logging: driver: "json-file" options: max-size: "100M" mem_limit: 2048m2.loggie接着参考官网helm-chart下载,而后解压,配置loggie的用例VERSION=v1.4.0 helm pull https://github.com/loggie-io/installation/releases/download/$VERSION/loggie-$VERSION.tgz && tar xvzf loggie-$VERSION.tgz根据官网的配置示例进行修改,而后得到一个如下的latest.yaml,我们关键需要定义资源配额,加速后镜像地址,外部挂载容器的实际目录image: uhub.service.ucloud.cn/marksugar-k8s/loggie:v1.4.0 resources: limits: cpu: 2 memory: 2Gi requests: cpu: 100m memory: 100Mi extraArgs: {} # log.level: debug # log.jsonFormat: true extraVolumeMounts: - mountPath: /var/log/pods name: podlogs - mountPath: /var/lib/docker/containers name: dockercontainers - mountPath: /var/lib/kubelet/pods name: kubelet extraVolumes: - hostPath: path: /var/log/pods type: DirectoryOrCreate name: podlogs - hostPath: # path: /var/lib/docker/containers path: /data/containerd # containerd的实际目录 type: DirectoryOrCreate name: dockercontainers - hostPath: path: /var/lib/kubelet/pods type: DirectoryOrCreate name: kubelet extraEnvs: {} timezone: Asia/Shanghai ## ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/ nodeSelector: {} ## Affinity for pod assignment ## ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity affinity: {} # podAntiAffinity: # requiredDuringSchedulingIgnoredDuringExecution: # - labelSelector: # matchExpressions: # - key: app # operator: In # values: # - loggie # topologyKey: "kubernetes.io/hostname" ## Tolerations for pod assignment ## ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/ tolerations: [] # - effect: NoExecute # operator: Exists # - effect: NoSchedule # operator: Exists updateStrategy: type: RollingUpdate ## Agent mode, ignored when aggregator.enabled is true config: loggie: reload: enabled: true period: 10s monitor: logger: period: 30s enabled: true listeners: filesource: period: 10s filewatcher: period: 5m reload: period: 10s sink: period: 10s queue: period: 10s pipeline: period: 10s discovery: enabled: true kubernetes: # Choose: docker or containerd containerRuntime: containerd # Collect log files inside the container from the root filesystem of the container, no need to mount the volume rootFsCollectionEnabled: false # Automatically parse and convert the wrapped container standard output format into the original log content parseStdout: false # If set to true, it means that the pipeline configuration generated does not contain specific Pod paths and meta information, # and these data will be dynamically obtained by the file source, thereby reducing the number of configuration changes and reloads. dynamicContainerLog: false # Automatically add fields when selector.type is pod in logconfig/clusterlogconfig typePodFields: logconfig: "${_k8s.logconfig}" namespace: "${_k8s.pod.namespace}" nodename: "${_k8s.node.name}" podname: "${_k8s.pod.name}" containername: "${_k8s.pod.container.name}" http: enabled: true port: 9196 ## Aggregator mode, by default is disabled aggregator: enabled: false replicas: 2 config: loggie: reload: enabled: true period: 10s monitor: logger: period: 30s enabled: true listeners: reload: period: 10s sink: period: 10s discovery: enabled: true kubernetes: cluster: aggregator http: enabled: true port: 9196 servicePorts: - name: monitor port: 9196 targetPort: 9196 # - name: gprc # port: 6066 # targetPort: 6066 serviceMonitor: enabled: false ## Scrape interval. If not set, the Prometheus default scrape interval is used. interval: 30s relabelings: [] metricRelabelings: []而后调试并安装helm install loggie -f latest.yaml -nloggie --create-namespace --dry-run ./ helm install loggie -f latest.yaml -nloggie --create-namespace ./默认情况下会以ds的方式进行部署,也就是每个Node节点安装一个。[root@master-01 ~/loggie-io]# kubectl -n loggie get pod NAME READY STATUS RESTARTS AGE loggie-42rcs 1/1 Running 0 15d loggie-56sz8 1/1 Running 0 15d loggie-jnzrc 1/1 Running 0 15d loggie-k5xqj 1/1 Running 0 15d loggie-v84wf 1/1 Running 0 14d2.1 配置收集在配置收集日志之前,我们先创建一个pod,加入此时有一组pod,他的标签是app: linuxea,在kustomize中表现如下:commonLabels: app: linuxea而后开始loggie的配置。在loggie的配置可以大致理解为局部配置和全局配置,如果没有特别的要求,默认的全局配置是够用,倘若不够我们需要局部声明不同的配置信息。1,此时创建一个sink上游是kafka,ip地址是172.16.100.151:9092,我们输入类型,地址,即将创建的topic的名称apiVersion: loggie.io/v1beta1 kind: Sink metadata: name: default-kafka spec: sink: | type: kafka brokers: ["172.16.100.151:9092"] topic: "pod-${fields.environment}-${fields.topic}"but,如果这是一个加密的,你需要配置如下apiVersion: loggie.io/v1beta1 kind: Sink metadata: name: default-kafka spec: sink: | type: kafka brokers: ["172.16.100.151:9092"] topic: "pod-${fields.environment}-${fields.topic}" sasl: type: scram userName: 用户名 password: 密码 algorithm: sha2562,而在LogConfig使用的是标签来关联那些pod的日志将会被收集到,如下 labelSelector: app: linuxea # 对应deployment的标签 标记有app: linuxea标签的pod均被收集3,而这些pod的日志的路径paths是pod中标准输出stdout,如果是文件目录这里应该填写对应的地址和正则匹配4,接着配置一个fields来描述资源,key:value fields: topic: "java-demo" environment: "dev"而这个自定义的描述被sink中的环境变量所提取,既:topic: "pod-${fields.environment}-${fields.topic}"5,在interceptors中我们进行了限流,这意味着每秒最多只能处理 interceptors: | - type: rateLimit qps: 900006,最后使用sinkRef关联创建的sink: sinkRef: default-kafka完整的yaml如下:apiVersion: loggie.io/v1beta1 kind: Sink metadata: name: default-kafka spec: sink: | type: kafka brokers: ["172.16.100.151:9092"] topic: "pod-${fields.environment}-${fields.topic}" --- apiVersion: loggie.io/v1beta1 kind: LogConfig metadata: name: java-demo namespace: linuxea-dev spec: selector: type: pod labelSelector: app: linuxea # 对应deployment的标签 pipeline: sources: | - type: file name: production-java-demo paths: - stdout ignoreOlder: 12h workerCount: 128 fields: topic: "java-demo" environment: "dev" interceptors: | - type: rateLimit qps: 90000 - type: transformer actions: - action: jsonDecode(body) sinkRef: default-kafka interceptorRef: default创建完成[root@master-01 ~/loggie-io]# kubectl -n loggie get sink NAME AGE default-kafka 15d [root@master-01 ~/loggie-io]# kubectl -n linuxea-dev get LogConfig NAME POD SELECTOR AGE java-demo {"app":"linuxea"} 15d日志写入后,到kafka查看的日志格式如下:{ "fields":{ "containername":"java-demo" "environment":"dev" "logconfig":"java-demo" "namespace":"linuxea-dev" "nodename":"172.16.100.83" "podname":"production-java-demo-5cf5b97645-4xh89" "topic":"java-demo" } "body":"2023-08-15T22:10:22.773955049+08:00 stdout F 2023-08-15 22:10:22.773 INFO 7 --- [ main] com.example.demo.DemoApplication : Started DemoApplication in 1.492 seconds (JVM running for ..." }3.openobserve我们需要安装openobserve,日志将会被消费到openobserve,安装openobserve在172.16.100.151的节点上version: "2.2" services: openobserve: container_name: openobserve restart: always image: public.ecr.aws/zinclabs/openobserve:latest ports: - "5080:5080" volumes: - /etc/localtime:/etc/localtime:ro # 时区2 - /data/openobserve:/data environment: - ZO_DATA_DIR=/data - ZO_ROOT_USER_EMAIL=root@example.com - ZO_ROOT_USER_PASSWORD=Complexpass#123 logging: driver: "json-file" options: max-size: "100M" mem_limit: 4096m接着我们就可以消费kafka后,将日志用vector写入到172.16.100.151上的openobserve了4.vectorvector作为替代logstash的角色,在此处的作用是消费kafka中的数据此时,我们需要配置在github的vector的releases页面下载安装包,我直接下载的rpmhttps://github.com/vectordotdev/vector/releases/download/v0.31.0/vector-0.31.0-1.x86_64.rpm安装完之后,我们需要创建一个配置文件vector.toml。格式非常简单,如下:mv /etc/vector/vector.toml /etc/vector/vector.toml-bak cat > /etc/vector/vector.toml << EOF [api] enabled = true address = "0.0.0.0:8686" [sources.kafka151] type = "kafka" bootstrap_servers = "172.16.100.151:9092" group_id = "consumer-group-name" topics = [ "pod-dev-java-demo" ] [sources.kafka151.decoding] codec = "json" [sinks.openobserve] type = "http" inputs = [ "kafka151" ] uri = "http://172.16.100.151:5080/api/pod-dev-java-demo/default/_json" method = "post" auth.strategy = "basic" auth.user = "root@example.com" auth.password = "Complexpass#123" compression = "gzip" encoding.codec = "json" encoding.timestamp_format = "rfc3339" healthcheck.enabled = false EOFbut,如果kafka加密了的话,我们需要添加额外的sasl配置[sources.kafka151] type = "kafka" bootstrap_servers = "172.16.100.151:9092" group_id = "consumer-group-name" topics = [ "pod-dev-java-demo" ] sasl.enabled = true sasl.mechanism = "SCRAM-SHA-256" sasl.password = "密码" sasl.username = "用户名" [sources.kafka151.decoding] codec = "json"对于日志的内容的处理,可以借助https://playground.vrl.dev/将上述文件替换到/etc/vector/vector.toml而后,启动systemctl start vector systemctl enable vector注意:uri = "http://172.16.100.151:5080/api/pod-dev-java-demo/default/_json",我们可以理解成http://172.16.100.151:5080/api/[group]/[items]/_json,如果在一个项目组的多个项目,我们可以通过这种方式进行归类回到openobserve查看而后点击explore查看日志回到logs查看5.openobsever搜索此时我的日志字段如下{ "fields":{ "podname":"production-java-demo-5cf5b97645-9ws4w" "topic":"java-demo" "containername":"java-demo" "environment":"dev" "logconfig":"java-demo" "namespace":"linuxea-dev" "nodename":"172.16.100.83" } "body":"2023-08-15T23:19:33.032689346+08:00 stdout F 2023-08-15 23:19:33.032 INFO 7 --- [ main] com.example.demo.DemoApplication : Started DemoApplication in 1.469 seconds (JVM running for ..." }如果我想搜索的内容是body中包含DemoApplication的内容,语法如下str_match(body, 'DemoApplication')默认情况下,只有msg,meesage,logs才会被全局匹配,对于不是这些字段的,我们需要使用str_match,如果匹配的字段是body的,包含DemoApplication的日志,可以使用如下命令str_match(body, 'DemoApplication')现在,一个可以替代传统ELK的日志方案就完成了。
2023年08月19日
347 阅读
0 评论
0 点赞
2023-08-11
linuxea: 日志收集的悄然变化
日志收集短期发展史日志的查看和告警是日志收集最核心的两个原因之一,通常99%的日志都是无用的,除非这些日志被用来做数据聚合环比数据分析。而传统的ELK,无论是Logstash还是ES都是非常消耗系统资源的应用,大规模场景中,要即时消费kafka的数据是一件不太容易的事情。观测性我们知道,现在的大多数应用皆是分布式或者微服务。微服务架构是让开发人员能够更快构建和发布,而随着服务进一步扩张,我们越发的不清楚服务运行的状态。而opentelmetry是来解决这种问题手段之一,微服务扩展后自己的服务和服务依赖之间的关系,通过可观察的方式使开发和维护都能够获得对系统的可见性。为了具备这种能力,系统就需要具备观测行。观测性是用来描述对系统所发生情况的理解程度,正常运行还是已经停止,用户察觉是变快还是变慢。如何构建KPI和SLA约定指标,或者说能够接受怎么样的最坏状态。能够回答如上的这些问题,并且能够指出问题。理想情况下在中断服务之前快速响应并且快速解决。而在术语上,观测性分为:事件日志,链路追踪和聚合指标。如果听到这个,让我们想起了最近的开源领域一些状态,或许就明白了,他们在做什么小米夜莺监控Nightingale在2023.7月底发布了V6版,转而构建可观测性平台。需要时刻掌握运行状态,可能是无法避免浪费,云产品提供了昂贵的观测平台服务:阿里云的商用产品arms和腾讯的商用产品介绍到这里,我想今天的主题是从日志发展作为开始,而结束必然是与事件日志(logs),链路追踪(traces)和聚合指标(metrics)有关。ELK的开始最早的ELK日志收集逻辑如下后来Graylog也成为更多的选择。而随着容器的发展,收集端logstash显然不够轻量,作为收集段fluentd,Fluent-bit用插件的方式比logstash更加轻便。而日志告警除开logstash,就是es插件。但是,他们都有一个同样的问题,假如你只愿意收集某一些日志,而不是所有的日志 ,无论是Logstash还是fluentd,Fluent-bit都i不会很 轻松,你需要配置一些过滤规则或者标签,而整个配置清单至少需要100行上下。而在中间的插曲是阿里开源的 log-point,log-pilot不在对所有的pod收集,你可以通过传入环境变量的方式进行选择,这与早期阿里的平台收集日志是一样 但是好景不长,log-pilot突然就停更了,它不在支持新特性和变化。事情在短期内由回到了开源领域的fluentd。而在这个 过程中 ,有一家石墨公司推出了clickvisual,它不在使用es,而是clickhouse,因此在同配置下它的集群性能超过了ES集群并且,clickhouse的数据很容易通过ttl来进行修改删除过期数据。但是clickvisual的产品是自己公司内部使用,而后免费开源出来的,因此它的界面似乎并没有获得广大用户青睐,clickvisual社区比较清淡。但是维护人员很活跃。当然,这并没有完。除了上述这些,log-pilot在停更的2年后,阿里随后推出了新的开源项目ilogtail,但是ilogtail似乎和log-pilot的宿命一样,ilogtail的社区更多时候永远慢一些 ,无论是补丁还是PR合并,以及ISSUE回复,这让社区旁观的人仍然会认为这依旧是一个 KPI产品。而在ilogtail出现的同时,另一个 loggie-io悄然出现,loggie-io是网易公司的日志收集端, loggie-io与clickvisual一样,都是商业公司内部的产品,而后进行开源公开。而clickvisual和 loggie-io的维护者相对要活跃,因此使用者居多。并且 loggie-io成功了接替了没有log-pilot这些日志的空白。并且 loggie-io提供了更多的使用功能 。此时的拓扑如下而在这其中,唯一没有被取代的是logstash, logstash是老牌日志处理中最关键的一环,他几乎包含了所有能够被用到的功能他都有。但是Datadog公司的vector出现后,logstash有了被替代的可能。vector是由rust编写,相比较java 的logstash使用资源更小,vector能够替代logstash的日志收集,中转,过滤,处理。它几乎可以替代logstash.事情到了 这里 并没有完,VictoriaLogs还没有结束,而github上openobserve通过不到一年的时间收获6K星,它的出现对标了es和kibana,因为它可以同时替代es和kibana。并声称与 Elasticsearch 相比,后者可以将日志存储成本降低约 140 倍。支持日志、指标、跟踪(Opentelemetry),集群支持S3,警报和查询, SQL 和 PromQL。或许是得益于openobserve的parquet,openobserve声称单机每天可以处理超过 2 TB 的数据。Mac M2的处理速度为约 31 MB/秒,即每分钟处理 1.8 GB,每天处理 2.6 TB。而这种情况在上一次这样的描述的是vm存储TimescaleDB相比。而且openobserve与vm的存储都是无状态的,尽管他们并不相同 ,但他可以仍然水平扩展,这样一来,这一切就更加明显了。
2023年08月11日
341 阅读
0 评论
0 点赞
2023-08-08
linuxea: vector与alertmanager的调试日志警报
日志告警一直都是一个无法回避的问题,无论是在什么时候,能够掌握程序日志的报错信息是有利于早期发现并定位问题。而在过去,常用手段可以通过logstash的if判断进行正则匹配,或者通过第三方工具读取ES,再或者通过grafan来进行触发而在阿里云或者腾讯云中同样也具备日志过滤,并且自带多级处理。而在传统的ELK中,fluentd也是可以承担这个任务,而在新兴的开源软件中,以上逐渐被慢慢剥离。取而代之的是阿里的ilogtail, 网易的 loggie-io,以及Datadog公司的vector。vector是由rust编写,在处理和消费速度上优于logstash,我将会分享如何通过vector调试vector处理日志关键字触发告警。在logstash上是可以支持重复日志计数和沉默的,而vector只负责过滤和转发,因此alertmanager可以承担这一个功能开始之前,我们需要了解alertmanager是如何接受告警的:alertmanager安装alertmanager提供一个config.yml的示例mkdir /data/alertmanager -p cat > /data/alertmanager/config.yml << EOF global: resolve_timeout: 5m route: group_by: ['alertname', 'instance'] group_wait: 30s group_interval: 5m repeat_interval: 24h receiver: email routes: - receiver: 'webhooke' group_by: ['alertname', 'instance'] group_wait: 30s group_interval: 5m repeat_interval: 24h match: severity: 'critical' - receiver: 'webhookw' group_by: ['alertname', 'instance'] group_wait: 30s group_interval: 5m repeat_interval: 24h match: severity: '~(warning)$' receivers: - name: 'webhookw' webhook_configs: - send_resolved: true url: 'http://webhook-dingtalk:8060/dingtalk/webhookw/send' - name: 'webhooke' webhook_configs: - send_resolved: true url: 'http://webhook-dingtalk:8060/dingtalk/webhooke/send' inhibit_rules: - source_match: alertname: node_host_lost,PodMemoryUsage severity: 'critical' target_match: severity: 'warning' equal: ['ltype'] EOFdocker-composeversion: "2.2" services: kafka: container_name: alertmanager restart: always image: registry.cn-zhangjiakou.aliyuncs.com/marksugar-k8s/alertmanager:v0.24.0 ports: - "9093:9093" volumes: - /etc/localtime:/etc/localtime:ro # 时区2 - /data/alertmanager/config.yml:/etc/alertmanager/config.yml # chmod 777 -R /data/kafka environment: - ALLOW_PLAINTEXT_LISTENER=yes logging: driver: "json-file" options: max-size: "100M" mem_limit: 4096m想要发送到alertmanager,我们需要符合的格式,如下[ { "labels": { "alertname": "name1", "dev": "sda1", "instance": "example3", "severity": "warning" } } ]如下alerts1='[ { "labels": { "alertname": "name1", "dev": "sda1", "instance": "example3", "severity": "warning" } } ]' curl -XPOST -d"$alerts1" http://172.16.100.151:9093/api/v1/alerts返回success[root@master-01 /var/log]# curl -XPOST -d"$alerts1" http://172.16.100.151:9093/api/v1/alerts {"status":"success"}可以在界面查看vectoralertmanager了解之后,我们按照官方的配置拿到如下信息,并且进行调试:配置说明[sources.filetest] : 数据来源[transforms.ftest]: 数据处理[transforms.remap_alert_udev]: ramap数据,相当于此前logstash的grok,比grok功能强大condition = "match!(.message, r'.*WebApplicationContext*.')" 过滤包含WebApplicationContext的关键字的日志而后将日志格式为json,重新组合为alertmanager的数据格式source = """ . = parse_json!(.message) . = [ { "labels": { "alertname": .fields.podname, "namespace": .fields.namespace, "environment": .fields.environment, "podname": .fields.podname, "nodename": .fields.nodename, "topic": .fields.topic, "body": .body, "severity": "critical" } } ] """用于调试打印[sinks.sink0] inputs = ["remap_alert_*"] target = "stdout" type = "console" [sinks.sink0.encoding] codec = "json"用于发送alertmanager[sinks.alertmanager] type = "http" inputs = ["remap_alert_*"] uri = "http://172.16.100.151:9093/api/v1/alerts" compression = "none" encoding.codec = "json" acknowledgements.enabled = truevector.toml最终如下[api] enabled = true address = "0.0.0.0:8686" [sources.filetest] type = "file" include = ["/var/log/test.log"] [transforms.ftest] type = "filter" inputs = ["filetest"] condition = "match!(.message, r'.*WebApplicationContext*.')" [transforms.remap_alert_udev] type = "remap" inputs = ["ftest"] source = """ . = parse_json!(.message) . = [ { "labels": { "alertname": .fields.podname, "namespace": .fields.namespace, "environment": .fields.environment, "podname": .fields.podname, "nodename": .fields.nodename, "topic": .fields.topic, "body": .body, "severity": "critical" } } ] """ [sinks.sink0] inputs = ["remap_alert_*"] target = "stdout" type = "console" [sinks.sink0.encoding] codec = "json" [sinks.alertmanager] type = "http" inputs = ["remap_alert_*"] uri = "http://172.16.100.151:9093/api/v1/alerts" compression = "none" encoding.codec = "json" acknowledgements.enabled = true对于其他的日志格式处理,参考https://playground.vrl.dev/启动 vector[root@master-01 ~/vector]# vector -c vector.toml 2023-08-05T06:42:30.336918Z INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=info,rdkafka=info,buffers=info,lapin=info,kube=info" 2023-08-05T06:42:30.337720Z INFO vector::app: Loading configs. paths=["vector.toml"] 2023-08-05T06:42:30.355841Z INFO vector::topology::running: Running healthchecks. 2023-08-05T06:42:30.355886Z INFO vector::topology::builder: Healthcheck passed. 2023-08-05T06:42:30.355907Z INFO vector::topology::builder: Healthcheck passed. 2023-08-05T06:42:30.355930Z INFO vector: Vector has started. debug="false" version="0.31.0" arch="x86_64" revision="0f13b22 2023-07-06 13:52:34.591204470" 2023-08-05T06:42:30.355940Z INFO source{component_kind="source" component_id=filetest component_type=file component_name=filetest}: vector::sources::file: Starting file server. include=["/var/log/test.log"] exclude=[] 2023-08-05T06:42:30.356284Z INFO source{component_kind="source" component_id=filetest component_type=file component_name=filetest}:file_server: file_source::checkpointer: Loaded checkpoint data. 2023-08-05T06:42:30.356411Z INFO source{component_kind="source" component_id=filetest component_type=file component_name=filetest}:file_server: vector::internal_events::file::source: Resuming to watch file. file=/var/log/test.log file_position=4068 2023-08-05T06:42:30.356959Z INFO vector::internal_events::api: API server running. address=0.0.0.0:8686 playground=http://0.0.0.0:8686/playground手动 追加一条信息[root@master-01 ~]# echo '{"body":"2023-08-02T00:18:34.866228161+08:00 stdouts.b.w.embedded.tomcat.TomcatWebServer WebApplicationContext","fields":{"containername":"java-demo","environment":"dev","logconfig":"java-demo","namespace":"linuxea-dev","nodename":"172.16.100.83","podname":"production-java-demo-5cf5b97645-tsmxx","topic":"java-demo"}}' >> /var/log/test.log如果没有问题,这里 将会将日志打印到console,并且会发送到alertmanager[root@master-01 ~/vector]# vector -c vector.toml 2023-08-05T06:42:30.336918Z INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=info,rdkafka=info,buffers=info,lapin=info,kube=info" 2023-08-05T06:42:30.337720Z INFO vector::app: Loading configs. paths=["vector.toml"] 2023-08-05T06:42:30.355841Z INFO vector::topology::running: Running healthchecks. 2023-08-05T06:42:30.355886Z INFO vector::topology::builder: Healthcheck passed. 2023-08-05T06:42:30.355907Z INFO vector::topology::builder: Healthcheck passed. 2023-08-05T06:42:30.355930Z INFO vector: Vector has started. debug="false" version="0.31.0" arch="x86_64" revision="0f13b22 2023-07-06 13:52:34.591204470" 2023-08-05T06:42:30.355940Z INFO source{component_kind="source" component_id=filetest component_type=file component_name=filetest}: vector::sources::file: Starting file server. include=["/var/log/test.log"] exclude=[] 2023-08-05T06:42:30.356284Z INFO source{component_kind="source" component_id=filetest component_type=file component_name=filetest}:file_server: file_source::checkpointer: Loaded checkpoint data. 2023-08-05T06:42:30.356411Z INFO source{component_kind="source" component_id=filetest component_type=file component_name=filetest}:file_server: vector::internal_events::file::source: Resuming to watch file. file=/var/log/test.log file_position=4068 2023-08-05T06:42:30.356959Z INFO vector::internal_events::api: API server running. address=0.0.0.0:8686 playground=http://0.0.0.0:8686/playground {"labels":{"alertname":"production-java-demo-5cf5b97645-tsmxx","body":"2023-08-02T00:18:34.866228161+08:00 stdouts.b.w.embedded.tomcat.TomcatWebServer WebApplicationContext","environment":"dev","namespace":"linuxea-dev","nodename":"172.16.100.83","podname":"production-java-demo-5cf5b97645-tsmxx","severity":"critical","topic":"java-demo"}}alertmanager已经收到一个匹配到的日志警报已经被发送到alertmanager,接着你可以用它发往任何地方。
2023年08月08日
293 阅读
0 评论
0 点赞
2022-05-15
linuxea:kube-prometheus远程存储victoriametrics
我们知道,在使用promentheus的过程中,默认的数据量一旦到一个量级后,查询区间的数据会非常缓慢,甚至一个查询就可能导致promentheus的崩溃,尽管我们不需要存储多久的数据,但是集群pod在一定的数量后,短期的数据仍然非常多,对于Promentheus本身的存储引擎来讲,仍是一个不小的问题,而使用外部存储就显得很有必要。早期流行的influxDB,由于社区对Promentheus并不友好,因此早些就放弃。此前,尝试了Prometheus远程存储Promscale和TimescaleDB测试,而后在讨论中发现VictoriaMetrics是更可取的方式。而VictoriaMetrics也有自己的一套系统监控。而在官方的介绍中,VictoriaMetrics强烈diss了TimescaleDBIt provides high data compression, so up to 70x more data points may be crammed into limited storage comparing to TimescaleDB and up to 7x less storage space is required compared to Prometheus, Thanos or Cortex.VictoriaMetrics可用于 Prometheus 监控数据做长期远程存储的时序数据库之一,而在github上是这样介绍的,截取部分如下可以直接用于 Grafana 作为 Prometheus 数据源使用指标数据摄取和查询具备高性能和良好的可扩展性,性能比 InfluxDB 和 TimescaleDB 高出 20 倍内存方面也做了优化,比 InfluxDB 少 10x 倍,比 Prometheus、Thanos 或 Cortex 少 7 倍其他有能够理解的部分话术针对具有高延迟 IO 和低 IOPS 的存储进行了优化提供全局的查询视图,多个 Prometheus 实例或任何其他数据源可能会将数据摄取到 VictoriaMetricsVictoriaMetrics 由一个没有外部依赖的小型可执行文件组成所有的配置都是通过明确的命令行标志和合理的默认值完成的所有数据都存储在 - storageDataPath 命令行参数指向的目录中可以使用 vmbackup/vmrestore 工具轻松快速地从实时快照备份到 S3 或 GCS 对象存储中支持从第三方时序数据库获取数据源由于存储架构原因,它可以保护存储在非正常关机(即 OOM、硬件重置或 kill -9)时免受数据损坏同样支持指标的 relabel 操作注意VictoriaMetrics 不支持prometheus本身读取,但是为了解决报警的问题,开发人员建议配置--storage.tsdb.retention.time=24h保留24小时的数据在prometheus中,而其他的数据写入到远程VictoriaMetrics ,通过grafana展示。VictoriaMetrics wiki说不支持prometheus读取,因为它发送的数据量很大; remote_read api 可以解决警报问题。我们可以启动一个 prometheus 实例,它只有 remote_read 配置部分和规则部分。victoriaMetrics 警报非常好!由于Prometheus中的这个问题,Prometheus 远程读取 API 不是为读取由其他 Prometheus 实例写入远程存储的数据而设计的。至于 Prometheus 中的警报,则将 Prometheus 本地存储保留设置为涵盖所有已配置警报规则的持续时间。通常 24 小时就足够了:--storage.tsdb.retention.time=24h. 在这种情况下,Prometheus 将对本地存储的数据执行警报规则,同时remote_write像往常一样将所有数据复制到配置的 url。而这些在github的wiki中以及为什么 VictoriaMetrics 不支持Prometheus 远程读取 API?有过说明远程读取 API 需要在给定时间范围内传输所有请求指标的所有原始数据。例如,如果一个查询包含 1000 个指标,每个指标有 10K 个值,那么远程读取 API 必须1000*10K向 Prometheus 返回 =10M 个指标值。这是缓慢且昂贵的。Prometheus 的远程读取 API 不适用于查询外部数据——也就是global query view. 有关详细信息,请参阅此问题。因此,只需通过vmui、Prometheus Querying API 或Grafana 中的 Prometheus 数据源直接查询 VictoriaMetrics 。VictoriaMetrics在VictoriaMetrics 中介绍如下VictoriaMetrics uses their modified version of LSM tree (Logging Structure Merge Tree). All the tables and indexes on the disk are immutable once created. When it's making the snapshot, they just create the hard link to the immutable files.VictoriaMetrics stores the data in MergeTree, which is from ClickHouse and similar to LSM. The MergeTree has particular design decision compared to canonical LSM.MergeTree is column-oriented. Each column is stored separately. And the data is sorted by the "primary key", and the "primary key" doesn't have to be unique. It speeds up the look-up through the "primary key", and gets the better compression ratio. The "parts" is similar to SSTable in LSM; it can be merged into bigger parts. But it doesn't have strict levels.The Inverted Index is built on "mergeset" (A data structure built on top of MergeTree ideas). It's used for fast lookup by given the time-series selector.提到的技术点, LSM 树,以及MergeTreeVictoriaMetrics 将数据存储在 MergeTree 中,MergeTree 来自 ClickHouse,类似于 LSM。与规范 LSM 相比,MergeTree 具有特定的设计决策。MergeTree 是面向列的。每列单独存储。并且数据按“主键”排序,“主键”不必是唯一的。它通过“主键”加快查找速度,获得更好的压缩比。“部分”类似于 LSM 中的 SSTable;它可以合并成更大的部分。但它没有严格的等级。倒排索引建立在“mergeset”(建立在 MergeTree 思想之上的数据结构)之上。通过给定时间序列选择器,它用于快速查找。为了能够有更多的理解,可以参考LSM Tree原理详解):https://www.jianshu.com/p/b43b856e09bb应用到kube-prometheus对照如下kubernetes版本安装对应的kube-prometheus版本kube-prometheus stackKubernetes 1.19Kubernetes 1.20Kubernetes 1.21Kubernetes 1.22Kubernetes 1.23release-0.7✔✔✗✗✗release-0.8✗✔✔✗✗release-0.9✗✗✔✔✗release-0.10✗✗✗✔✔main✗✗✗✔✔Quickstart找到符合集群对应的版本进行安装,如果你是ack,需要卸载ack-arms-prometheus替换镜像k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1 v5cn/prometheus-adapter:v0.9.1k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.4.2 bitnami/kube-state-metrics:2.4.2quay.io/brancz/kube-rbac-proxy:v0.12.0 bitnami/kube-rbac-proxy:0.12.0开始部署$ cd kube-prometheus $ git checkout main kubectl.exe create -f .\manifests\setup\ kubectl.exe create -f .\manifests配置ingress-nginx> kubectl.exe -n monitoring get svc NAME TYPE CLUSTER-IP PORT(S) alertmanager-main ClusterIP 192.168.31.49 9093/TCP,8080/TCP alertmanager-operated ClusterIP None 9093/TCP,9094/TCP,9094/UDP blackbox-exporter ClusterIP 192.168.31.69 9115/TCP,19115/TCP grafana ClusterIP 192.168.130.3 3000/TCP kube-state-metrics ClusterIP None 8443/TCP,9443/TCP node-exporter ClusterIP None 9100/TCP prometheus-adapter ClusterIP 192.168.13.123 443/TCP prometheus-k8s ClusterIP 192.168.118.39 9090/TCP,8080/TCP prometheus-operated ClusterIP None 9090/TCP prometheus-operator ClusterIP None 8443/TCP ingress-nginxapiVersion: networking.k8s.io/v1 kind: Ingress metadata: name: monitoring-ui namespace: monitoring spec: ingressClassName: nginx rules: - host: local.grafana.com http: paths: - path: / pathType: Prefix backend: service: name: grafana port: number: 3000 --- apiVersion: networking.k8s.io/v1 kind: Ingress metadata: name: prometheus-ui namespace: monitoring spec: ingressClassName: nginx rules: - host: local.prom.com http: paths: - path: / pathType: Prefix backend: service: name: prometheus-k8s port: number: 9090配置nfs测试apiVersion: apps/v1 kind: Deployment metadata: name: nfs-client-provisioner labels: app: nfs-client-provisioner # replace with namespace where provisioner is deployed namespace: default spec: replicas: 1 strategy: type: Recreate selector: matchLabels: app: nfs-client-provisioner template: metadata: labels: app: nfs-client-provisioner spec: serviceAccountName: nfs-client-provisioner containers: - name: nfs-client-provisioner image: quay.io/external_storage/nfs-client-provisioner:latest imagePullPolicy: IfNotPresent volumeMounts: - name: nfs-client-root mountPath: /persistentvolumes env: - name: PROVISIONER_NAME value: fuseim.pri/ifs - name: NFS_SERVER value: 192.168.3.19 - name: NFS_PATH value: /data/nfs-k8s volumes: - name: nfs-client-root nfs: server: 192.168.3.19 path: /data/nfs-k8s --- apiVersion: v1 kind: ServiceAccount metadata: name: nfs-client-provisioner # replace with namespace where provisioner is deployed namespace: default --- kind: ClusterRole apiVersion: rbac.authorization.k8s.io/v1 metadata: name: nfs-client-provisioner-runner rules: - apiGroups: [""] resources: ["persistentvolumes"] verbs: ["get", "list", "watch", "create", "delete"] - apiGroups: [""] resources: ["persistentvolumeclaims"] verbs: ["get", "list", "watch", "update"] - apiGroups: ["storage.k8s.io"] resources: ["storageclasses"] verbs: ["get", "list", "watch"] - apiGroups: [""] resources: ["events"] verbs: ["create", "update", "patch"] --- kind: ClusterRoleBinding apiVersion: rbac.authorization.k8s.io/v1 metadata: name: run-nfs-client-provisioner subjects: - kind: ServiceAccount name: nfs-client-provisioner # replace with namespace where provisioner is deployed namespace: default roleRef: kind: ClusterRole name: nfs-client-provisioner-runner apiGroup: rbac.authorization.k8s.io --- kind: Role apiVersion: rbac.authorization.k8s.io/v1 metadata: name: leader-locking-nfs-client-provisioner # replace with namespace where provisioner is deployed namespace: default rules: - apiGroups: [""] resources: ["endpoints"] verbs: ["get", "list", "watch", "create", "update", "patch"] --- kind: RoleBinding apiVersion: rbac.authorization.k8s.io/v1 metadata: name: leader-locking-nfs-client-provisioner # replace with namespace where provisioner is deployed namespace: default subjects: - kind: ServiceAccount name: nfs-client-provisioner # replace with namespace where provisioner is deployed namespace: default roleRef: kind: Role name: leader-locking-nfs-client-provisioner apiGroup: rbac.authorization.k8s.iovm配置创建一个pvc-victoriametricsapiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: nfs-storage namespace: default provisioner: fuseim.pri/ifs # or choose another name, must match deployment's env PROVISIONER_NAME' parameters: archiveOnDelete: "false" # Supported policies: Delete、 Retain , default is Delete reclaimPolicy: Retain --- kind: PersistentVolumeClaim apiVersion: v1 metadata: name: pvc-victoriametrics namespace: monitoring spec: accessModes: - ReadWriteMany storageClassName: nfs-storage resources: requests: storage: 10Gi准备pvc[linuxea.com ~/victoriametrics]# kubectl apply -f pvc.yaml storageclass.storage.k8s.io/nfs-storage created persistentvolumeclaim/pvc-victoriametrics created [linuxea.com ~/victoriametrics]# kubectl get pv NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM ... pvc-97bea5fe-0131-4fb5-aaa9-66eee0802cb4 10Gi RWX Retain Bound monitoring/pvc-victoriametrics ... [linuxea.com ~/victoriametrics]# kubectl get pvc -A NAMESPACE NAME STATUS VOLUME CAPACITY ... monitoring pvc-victoriametrics Bound pvc-97bea5fe-0131-4fb5-aaa9-66eee0802cb4 10Gi 创建victoriametrics,并配置上面的pvc1w : 一周# vm-grafana.yaml apiVersion: apps/v1 kind: Deployment metadata: name: victoria-metrics namespace: monitoring spec: selector: matchLabels: app: victoria-metrics template: metadata: labels: app: victoria-metrics spec: containers: - name: vm image: victoriametrics/victoria-metrics:v1.76.1 imagePullPolicy: IfNotPresent args: - -storageDataPath=/var/lib/victoria-metrics-data - -retentionPeriod=1w ports: - containerPort: 8428 name: http resources: limits: cpu: "1" memory: 2048Mi requests: cpu: 100m memory: 512Mi readinessProbe: httpGet: path: /health port: 8428 initialDelaySeconds: 30 timeoutSeconds: 30 livenessProbe: httpGet: path: /health port: 8428 initialDelaySeconds: 120 timeoutSeconds: 30 volumeMounts: - mountPath: /var/lib/victoria-metrics-data name: victoriametrics-storage volumes: - name: victoriametrics-storage persistentVolumeClaim: claimName: nas-csi-pvc-oms-fat-victoriametrics --- apiVersion: v1 kind: Service metadata: name: victoria-metrics namespace: monitoring spec: ports: - name: http port: 8428 protocol: TCP targetPort: 8428 selector: app: victoria-metrics type: ClusterIPapply[linuxea.com ~/victoriametrics]# kubectl apply -f vmctoriametrics.yaml deployment.apps/victoria-metrics created service/victoria-metrics created [linuxea.com ~/victoriametrics]# kubectl -n monitoring get pod NAME READY STATUS RESTARTS AGE alertmanager-main-0 2/2 Running 88 268d blackbox-exporter-55c457d5fb-6rc8m 3/3 Running 114 260d grafana-756dc9b545-b2skg 1/1 Running 38 260d kube-state-metrics-76f6cb7996-j2hx4 3/3 Running 153 260d node-exporter-4hxzp 2/2 Running 120 316d node-exporter-54t9p 2/2 Running 124 316d node-exporter-8rfht 2/2 Running 120 316d node-exporter-hqzzn 2/2 Running 126 316d prometheus-adapter-59df95d9f5-7shw5 1/1 Running 78 260d prometheus-k8s-0 2/2 Running 89 268d prometheus-operator-7775c66ccf-x2wv4 2/2 Running 115 260d promoter-66f6dd475c-fdzrx 1/1 Running 3 8d victoria-metrics-56d47f6fb-qmthh 0/1 Running 0 15s [linuxea.com ~/victoriametrics]# kubectl -n monitoring get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE alertmanager-main NodePort 10.68.30.147 <none> 9093:30092/TCP 316d alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 316d blackbox-exporter ClusterIP 10.68.25.245 <none> 9115/TCP,19115/TCP 316d etcd-k8s ClusterIP None <none> 2379/TCP 316d external-node-k8s ClusterIP None <none> 9100/TCP 315d external-pve-k8s ClusterIP None <none> 9221/TCP 305d external-windows-node-k8s ClusterIP None <none> 9182/TCP 316d grafana NodePort 10.68.133.224 <none> 3000:30091/TCP 316d kube-state-metrics ClusterIP None <none> 8443/TCP,9443/TCP 316d node-exporter ClusterIP None <none> 9100/TCP 316d prometheus-adapter ClusterIP 10.68.138.175 <none> 443/TCP 316d prometheus-k8s NodePort 10.68.207.185 <none> 9090:30090/TCP 316d prometheus-operated ClusterIP None <none> 9090/TCP 316d prometheus-operator ClusterIP None <none> 8443/TCP 316d promoter ClusterIP 10.68.26.69 <none> 8080/TCP 11d victoria-metrics ClusterIP 10.68.225.139 <none> 8428/TCP 18s修改prometheus的远程存储配置,我们主要修改如下,其他参数可在官方文档查看首先修改远程写如到vm remoteWrite: - url: "http://victoria-metrics:8428/api/v1/write" queueConfig: capacity: 5000 remoteTimeout: 30s并且prometheus的存储时间为1天retention: 1d一天的本地存储只是为了应对告警,而远程写入到vm后通过grafana来看Prometheus-prometheus.yamlapiVersion: monitoring.coreos.com/v1 kind: Prometheus metadata: labels: app.kubernetes.io/component: prometheus app.kubernetes.io/instance: k8s app.kubernetes.io/name: prometheus app.kubernetes.io/part-of: kube-prometheus app.kubernetes.io/version: 2.35.0 name: k8s namespace: monitoring spec: retention: 1d alerting: alertmanagers: - apiVersion: v2 name: alertmanager-main namespace: monitoring port: web enableFeatures: [] externalLabels: {} image: quay.io/prometheus/prometheus:v2.35.0 nodeSelector: kubernetes.io/os: linux podMetadata: labels: app.kubernetes.io/component: prometheus app.kubernetes.io/instance: k8s app.kubernetes.io/name: prometheus app.kubernetes.io/part-of: kube-prometheus app.kubernetes.io/version: 2.35.0 podMonitorNamespaceSelector: {} podMonitorSelector: {} probeNamespaceSelector: {} probeSelector: {} replicas: 1 resources: requests: memory: 400Mi remoteWrite: - url: "http://victoria-metrics:8428/api/v1/write" queueConfig: capacity: 5000 remoteTimeout: 30s ruleNamespaceSelector: {} ruleSelector: {} securityContext: fsGroup: 2000 runAsNonRoot: true runAsUser: 1000 serviceAccountName: prometheus-k8s serviceMonitorNamespaceSelector: {} serviceMonitorSelector: {} version: 2.35.0而此时的配置不出意外会被应用到URL/configremote_write: - url: http://victoria-metrics:8428/api/v1/write remote_timeout: 5m follow_redirects: true queue_config: capacity: 5000 max_shards: 200 min_shards: 1 max_samples_per_send: 500 batch_send_deadline: 5s min_backoff: 30ms max_backoff: 100ms metadata_config: send: true send_interval: 1m查看日志level=info ts=2022-04-28T15:26:12.047Z caller=main.go:944 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml ts=2022-04-28T15:26:12.053Z caller=dedupe.go:112 component=remote level=info remote_name=1a1964 url=http://victoria-metrics:8428/api/v1/write msg="Starting WAL watcher" queue=1a1964 ts=2022-04-28T15:26:12.053Z caller=dedupe.go:112 component=remote level=info remote_name=1a1964 url=http://victoria-metrics:8428/api/v1/write msg="Starting scraped metadata watcher" ts=2022-04-28T15:26:12.053Z caller=dedupe.go:112 component=remote level=info remote_name=1a1964 url=http://victoria-metrics:8428/api/v1/write msg="Replaying WAL" queue=1a1964 .... totalDuration=55.219178ms remote_storage=85.51µs web_handler=440ns query_engine=719ns scrape=45.6µs scrape_sd=1.210328ms notify=4.99µs notify_sd=352.209µs rules=47.503195ms回到nfs查看[root@Node-172_16_100_49 /data/nfs-k8s/monitoring-pvc-victoriametrics-pvc-97bea5fe-0131-4fb5-aaa9-66eee0802cb4]# ll total 0 drwxr-xr-x 4 root root 48 Apr 28 22:37 data -rw-r--r-- 1 root root 0 Apr 28 22:37 flock.lock drwxr-xr-x 5 root root 71 Apr 28 22:37 indexdb drwxr-xr-x 2 root root 43 Apr 28 22:37 metadata drwxr-xr-x 2 root root 6 Apr 28 22:37 snapshots drwxr-xr-x 3 root root 27 Apr 28 22:37 tmp修改grafana的配置此时看到的数据是用promenteus中获取到的,修改grefana来从vm读取数据 datasources.yaml: |- { "apiVersion": 1, "datasources": [ { "access": "proxy", "editable": false, "name": "prometheus", "orgId": 1, "type": "prometheus", "url": "http://victoria-metrics:8428", "version": 1 } ] }顺便修改时区stringData: # 修改 时区 grafana.ini: | [date_formats] default_timezone = CST如下apiVersion: v1 kind: Secret metadata: labels: app.kubernetes.io/component: grafana app.kubernetes.io/name: grafana app.kubernetes.io/part-of: kube-prometheus app.kubernetes.io/version: 8.5.0 name: grafana-datasources namespace: monitoring stringData: # 修改链接的地址 datasources.yaml: |- { "apiVersion": 1, "datasources": [ { "access": "proxy", "editable": false, "name": "prometheus", "orgId": 1, "type": "prometheus", "url": "http://victoria-metrics:8428", "version": 1 } ] } type: Opaque --- apiVersion: v1 kind: Secret metadata: labels: app.kubernetes.io/component: grafana app.kubernetes.io/name: grafana app.kubernetes.io/part-of: kube-prometheus app.kubernetes.io/version: 8.5.0 name: grafana-config namespace: monitoring stringData: # 修改 时区 grafana.ini: | [date_formats] default_timezone = CST type: Opaque # grafana: # sidecar: # datasources: # enabled: true # label: grafana_datasource # searchNamespace: ALL # defaultDatasourceEnabled: false # additionalDataSources: # - name: Loki # type: loki # url: http://loki-stack.loki-stack:3100/ # access: proxy # - name: VictoriaMetrics # type: prometheus # url: http://victoria-metrics-single-server.victoria-metrics-single:8428 # access: proxy而此时的datasources就变成了vm,远程写入到了vm,grafana读取的是vm,而Prometheus还是读的是prometheus监控vmdashboards与版本有关,https://github.com/VictoriaMetrics/VictoriaMetrics/tree/master/dashboards并且添加监控# victoriametrics-metrics apiVersion: v1 kind: Service metadata: name: victoriametrics-metrics namespace: monitoring labels: app: victoriametrics-metrics annotations: prometheus.io/port: "8428" prometheus.io/scrape: "true" spec: type: ClusterIP ports: - name: metrics port: 8428 targetPort: 8428 protocol: TCP selector: # 对应victoriametrics的service app: victoria-metrics --- apiVersion: monitoring.coreos.com/v1 kind: ServiceMonitor metadata: name: victoriametrics-metrics namespace: monitoring spec: endpoints: - interval: 15s port: metrics path: /metrics namespaceSelector: matchNames: - monitoring selector: matchLabels: app: victoriametrics-metrics参考Prometheus远程存储Promscale和TimescaleDB测试victoriametricsLSM Tree原理详解
2022年05月15日
1,684 阅读
0 评论
0 点赞
2022-03-03
linuxea:Prometheus远程存储Promscale和TimescaleDB测试
promscale 是一个开源的可观察性后端,用于由 SQL 提供支持的指标和跟踪。它建立在 PostgreSQL 和 TimescaleDB 的强大和高性能基础之上。它通过 OpenTelemetry Collector 原生支持 Prometheus 指标和 OpenTelemetry 跟踪以及许多其他格式,如 StatsD、Jaeger 和 Zipkin,并且100% 兼容 PromQL。其完整的 SQL 功能使开发人员能够关联指标、跟踪和业务数据,从而获得新的有价值的见解,当数据在不同系统中孤立时是不可能的。它很容易与 Grafana 和 Jaeger 集成,以可视化指标和跟踪。它建立在 PostgreSQL 和 TimescaleDB 之上,继承了坚如磐石的可靠性、高达 90% 的本机压缩、连续聚合以及在全球数百万个实例上运行的系统的操作成熟度。Promscale 可以用作 Grafana和PromLens等可视化工具的 Prometheus 数据源。Promscale 包括两个组件:Promscale 连接器:一种无状态服务,为可观察性数据提供摄取接口,处理该数据并将其存储在 TimescaleDB 中。它还提供了一个使用 PromQL 查询数据的接口。Promscale 连接器自动设置 TimescaleDB 中的数据结构以存储数据并在需要升级到新版本的 Promscale 时处理这些数据结构中的更改。TimescaleDB:存储所有可观察性数据的基于 Postgres 的数据库。它提供了用于查询数据的完整 SQL 接口以及分析函数、列压缩和连续聚合等高级功能。TimescaleDB 提供了很大的灵活性来存储业务和其他类型的数据,然后你可以使用这些数据与可观察性数据相关联。Promscale 连接器使用 Prometheusremote_write接口摄取 Prometheus 指标、元数据和 OpenMetrics 示例。它还使用 OpenTelemetry 协议 (OTLP) 摄取 OpenTelemetry 跟踪。它还可以使用 OpenTelemetry 收集器以其他格式摄取指标和跟踪,以通过 Prometheusremote_write接口和 OpenTelemetry 协议处理和发送它们。例如,你可以使用 OpenTelemetry Collector 将 Jaeger 跟踪和 StatsD 指标摄取到 Promscale。对于 Prometheus 指标,Promscale 连接器公开 Prometheus API 端点,用于运行 PromQL 查询和读取元数据。这允许你将支持 Prometheus API 的工具(例如 Grafana)直接连接到 Promscale 进行查询。也可以向 Prometheus 发送查询,并让 Prometheus 使用接口上的 Promscale 连接器从 Promscale 读取数据 remote_read。你还可以使用 SQL 在 Promscale 中查询指标和跟踪,这允许你使用与 PostgreSQL 集成的许多不同的可视化工具。例如,Grafana 支持通过 PostgreSQL 数据源使用开箱即用的 SQL 查询 Promscale 中的数据我准备通过容器的方式进行尝试,我们先安装docker和docker-composeyum install -y yum-utils \ device-mapper-persistent-data \ lvm2 yum-config-manager \ --add-repo \ https://mirrors.tuna.tsinghua.edu.cn/docker-ce/linux/centos/docker-ce.repo yum install docker-ce docker-ce-cli containerd.io docker-compose -y编排我按照官网的docker配置,进行编排了compose进行测试version: '2.2' services: timescaledb: image: timescaledev/promscale-extension:latest-ts2-pg13 container_name: timescaledb restart: always hostname: "timescaledb" network_mode: "host" environment: - csynchronous_commit=off - POSTGRES_PASSWORD=123 volumes: - /etc/localtime:/etc/localtime:ro - /data/prom/timescaledb/data:/var/lib/postgresql/data:rw mem_limit: 512m user: root stop_grace_period: 1m promscale: image: timescale/promscale:0.10 container_name: promscale restart: always hostname: "promscale" network_mode: "host" environment: - PROMSCALE_DB_PASSWORD=123 - PROMSCALE_DB_PORT=5432 - PROMSCALE_DB_NAME=postgres - PROMSCALE_DB_HOST=127.0.0.1 - PROMSCALE_DB_SSL_MODE=allow volumes: - /etc/localtime:/etc/localtime:ro # - /data/prom/postgresql/data:/var/lib/postgresql/data:rw mem_limit: 512m user: root stop_grace_period: 1m grafana: image: grafana/grafana:8.3.7 container_name: grafana restart: always hostname: "grafana" network_mode: "host" #environment: # - GF_INSTALL_PLUGINS="grafana-clock-panel,grafana-simple-json-datasource" volumes: - /etc/localtime:/etc/localtime:ro - /data/grafana/plugins:/var/lib/grafana/plugins mem_limit: 512m user: root prometheus: image: prom/prometheus:v2.33.4 container_name: prometheus restart: always hostname: "prometheus" network_mode: "host" #environment: volumes: - /etc/localtime:/etc/localtime:ro - /data/prom/prometheus/data:/prometheus:rw # NOTE: chown 65534:65534 /data/prometheus/ - /data/prom/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml - /data/prom/prometheus/alert:/etc/prometheus/alert #- /data/prom/prometheus/ssl:/etc/prometheus/ssl command: - '--config.file=/etc/prometheus/prometheus.yml' - '--storage.tsdb.path=/prometheus' - '--storage.tsdb.retention=45d' - '--web.console.libraries=/etc/prometheus/console_libraries' - '--web.console.templates=/etc/prometheus/consoles' - '--web.enable-lifecycle' - '--web.enable-admin-api' mem_limit: 512m user: root stop_grace_period: 1m node_exporter: image: prom/node-exporter:v1.3.1 container_name: node_exporter user: root privileged: true network_mode: "host" volumes: - /proc:/host/proc:ro - /sys:/host/sys:ro - /:/rootfs:ro command: - '--path.procfs=/host/proc' - '--path.sysfs=/host/sys' - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)' restart: unless-stoppedgrafana我们这里使用的root用户就是因为需要手动安装下插件bash-5.1# grafana-cli plugins install grafana-clock-panel ✔ Downloaded grafana-clock-panel v1.3.0 zip successfully Please restart Grafana after installing plugins. Refer to Grafana documentation for instructions if necessary. bash-5.1# grafana-cli plugins install grafana-simple-json-datasource ✔ Downloaded grafana-simple-json-datasource v1.4.2 zip successfully Please restart Grafana after installing plugins. Refer to Grafana documentation for instructions if necessary.配置grafana在这里下载一些模板https://grafana.com/grafana/dashboards/?pg=hp&plcmt=lt-box-dashboards&search=prometheusVisualize data in Promscaleprometheus我们可以尝试配置远程, 配置参数可以查看官网remote_write: - url: "http://127.0.0.1:9201/write" remote_read: - url: "http://127.0.0.1:9201/read" read_recent: true远程配置如下remote_write: - url: "http://127.0.0.1:9201/write" write_relabel_configs: - source_labels: [__name__] regex: '.*:.*' action: drop remote_timeout: 100s queue_config: capacity: 500000 max_samples_per_send: 50000 batch_send_deadline: 30s min_backoff: 100ms max_backoff: 10s min_shards: 16 max_shards: 16 remote_read: - url: "http://127.0.0.1:9201/read" read_recent: true prometheus.yaml如下global: scrape_interval: 15s evaluation_interval: 15s alerting: alertmanagers: - scheme: http static_configs: - targets: - '127.0.0.1:9093' rule_files: - "alert/host.alert.rules" - "alert/container.alert.rules" - "alert/targets.alert.rules" scrape_configs: - job_name: prometheus scrape_interval: 30s static_configs: - targets: ['127.0.0.1:9090'] - targets: ['127.0.0.1:9093'] - targets: ['127.0.0.1:9100'] remote_write: - url: "http://127.0.0.1:9201/write" write_relabel_configs: - source_labels: [__name__] regex: '.*:.*' action: drop remote_timeout: 100s queue_config: capacity: 500000 max_samples_per_send: 50000 batch_send_deadline: 30s min_backoff: 100ms max_backoff: 10s min_shards: 16 max_shards: 16 remote_read: - url: "http://127.0.0.1:9201/read" read_recent: true重新启动后查看日志ts=2022-03-03T01:35:28.123Z caller=main.go:1128 level=info msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml ts=2022-03-03T01:35:28.137Z caller=dedupe.go:112 component=remote level=info remote_name=797d34 url=http://127.0.0.1:9201/write msg="Starting WAL watcher" queue=797d34 ts=2022-03-03T01:35:28.138Z caller=dedupe.go:112 component=remote level=info remote_name=797d34 url=http://127.0.0.1:9201/write msg="Starting scraped metadata watcher" ts=2022-03-03T01:35:28.277Z caller=dedupe.go:112 component=remote level=info remote_name=797d34 url=http://127.0.0.1:9201/write msg="Replaying WAL" queue=797d34 ts=2022-03-03T01:35:38.177Z caller=main.go:1165 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=10.053377011s db_storage=1.82µs remote_storage=13.752341ms web_handler=549ns query_engine=839ns scrape=10.038744417s scrape_sd=44.249µs notify=41.342µs notify_sd=6.871µs rules=30.465µs ts=2022-03-03T01:35:38.177Z caller=main.go:896 level=info msg="Server is ready to receive web requests." ts=2022-03-03T01:35:53.584Z caller=dedupe.go:112 component=remote level=info remote_name=797d34 url=http://127.0.0.1:9201/write msg="Done replaying WAL" duration=25.446317635s查看数据[root@localhost data]# docker exec -it timescaledb sh / # su - postgres timescaledb:~$ psql psql (13.4) Type "help" for help. postgres=# 我们查询一个过去五分钟io的指标查询指标SELECT * from node_disk_io_now WHERE time > now() - INTERVAL '5 minutes'; time | value | series_id | labels | device_id | instance_id | job_id ----------------------------+-------+-----------+---------------+-----------+-------------+-------- 2022-03-02 21:03:58.373-05 | 0 | 348 | {51,140,91,3} | 140 | 91 | 3 2022-03-02 21:04:28.373-05 | 0 | 348 | {51,140,91,3} | 140 | 91 | 3 2022-03-02 21:04:58.373-05 | 0 | 348 | {51,140,91,3} | 140 | 91 | 3 2022-03-02 21:05:28.373-05 | 0 | 348 | {51,140,91,3} | 140 | 91 | 3 2022-03-02 21:05:58.376-05 | 0 | 348 | {51,140,91,3} | 140 | 91 | 3 2022-03-02 21:06:28.373-05 | 0 | 348 | {51,140,91,3} | 140 | 91 | 3 2022-03-02 21:06:58.373-05 | 0 | 348 | {51,140,91,3} | 140 | 91 | 3 2022-03-02 21:07:28.373-05 | 0 | 348 | {51,140,91,3} | 140 | 91 | 3 2022-03-02 21:07:58.373-05 | 0 | 348 | {51,140,91,3} | 140 | 91 | 3 2022-03-02 21:08:28.373-05 | 0 | 348 | {51,140,91,3} | 140 | 91 | 3 2022-03-02 21:03:58.373-05 | 0 | 349 | {51,252,91,3} | 252 | 91 | 3 2022-03-02 21:04:28.373-05 | 0 | 349 | {51,252,91,3} | 252 | 91 | 3 2022-03-02 21:04:58.373-05 | 0 | 349 | {51,252,91,3} | 252 | 91 | 3 2022-03-02 21:05:28.373-05 | 0 | 349 | {51,252,91,3} | 252 | 91 | 3 2022-03-02 21:05:58.376-05 | 0 | 349 | {51,252,91,3} | 252 | 91 | 3 2022-03-02 21:06:28.373-05 | 0 | 349 | {51,252,91,3} | 252 | 91 | 3 2022-03-02 21:06:58.373-05 | 0 | 349 | {51,252,91,3} | 252 | 91 | 3 2022-03-02 21:07:28.373-05 | 0 | 349 | {51,252,91,3} | 252 | 91 | 3 2022-03-02 21:07:58.373-05 | 0 | 349 | {51,252,91,3} | 252 | 91 | 3 2022-03-02 21:03:58.373-05 | 0 | 350 | {51,253,91,3} | 253 | 91 | 3 2022-03-02 21:04:28.373-05 | 0 | 350 | {51,253,91,3} | 253 | 91 | 3 2022-03-02 21:04:58.373-05 | 0 | 350 | {51,253,91,3} | 253 | 91 | 3 2022-03-02 21:05:28.373-05 | 0 | 350 | {51,253,91,3} | 253 | 91 | 3 2022-03-02 21:05:58.376-05 | 0 | 350 | {51,253,91,3} | 253 | 91 | 3 2022-03-02 21:06:28.373-05 | 0 | 350 | {51,253,91,3} | 253 | 91 | 3在进行一次聚合查询标签键的查询值每个标签键都扩展为自己的列,该列将外键标识符存储为其值。这允许JOIN按标签键和值进行聚合和过滤。要检索由标签 ID 表示的文本,可以使用该val(field_id) 函数。这使你可以使用特定的标签键对所有系列进行聚合等操作。例如,要查找指标的中值node_disk_io_now,按与其关联的工作分组:SELECT val(job_id) as job, percentile_cont(0.5) within group (order by value) AS median FROM node_disk_io_now WHERE time > now() - INTERVAL '5 minutes' GROUP BY job_id;如下postgres=# SELECT postgres-# val(job_id) as job, postgres-# percentile_cont(0.5) within group (order by value) AS median postgres-# FROM postgres-# node_disk_io_now postgres-# WHERE postgres-# time > now() - INTERVAL '5 minutes' postgres-# GROUP BY job_id; job | median ------------+-------- prometheus | 0 (1 row)查询指标的标签集任何度量行中的labels字段表示与测量相关的完整标签集。它表示为标识符数组。要以 JSON 格式返回整个标签集,你可以使用该jsonb()函数,如下所示:SELECT time, value, jsonb(labels) as labels FROM node_disk_io_now WHERE time > now() - INTERVAL '5 minutes';如下 time | value | labels ----------------------------+-------+------------------------------------------------------------------------------------------------------- 2022-03-02 21:09:58.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:10:28.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:10:58.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:11:28.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:11:58.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:12:28.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:12:58.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:13:28.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:13:58.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:14:28.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:09:58.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:10:28.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:10:58.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:11:28.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:11:58.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:12:28.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:12:58.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:13:28.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:13:58.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:09:58.373-05 | 0 | {"job": "prometheus", "device": "sda", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:10:28.373-05 | 0 | {"job": "prometheus", "device": "sda", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:10:58.373-05 | 0 | {"job": "prometheus", "device": "sda", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:11:28.373-05 | 0 | {"job": "prometheus", "device": "sda", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:11:58.373-05 | 0 | {"job": "prometheus", "device": "sda", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:12:28.373-05 | 0 | {"job": "prometheus", "device": "sda", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"}查询node_disk_infopostgres=# SELECT * FROM prom_series.node_disk_info; series_id | labels | device | instance | job | major | minor -----------+------------------------+--------+----------------+------------+-------+------- 250 | {150,140,91,3,324,325} | dm-0 | 127.0.0.1:9100 | prometheus | 253 | 0 439 | {150,253,91,3,508,325} | sda | 127.0.0.1:9100 | prometheus | 8 | 0 440 | {150,258,91,3,507,325} | sr0 | 127.0.0.1:9100 | prometheus | 11 | 0 516 | {150,252,91,3,324,564} | dm-1 | 127.0.0.1:9100 | prometheus | 253 | 1 (4 rows)带标签查询SELECT jsonb(labels) as labels, value FROM node_disk_info WHERE time < now(); Results:labels | value -----------------------------------------------------------------------------------------------------------------------------------+------- {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | NaN {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1通过命令进行查看她的指标视图postgres=# \d+ node_disk_info View "prom_metric.node_disk_info" Column | Type | Collation | Nullable | Default | Storage | Description -------------+--------------------------+-----------+----------+---------+----------+------------- time | timestamp with time zone | | | | plain | value | double precision | | | | plain | series_id | bigint | | | | plain | labels | label_array | | | | extended | device_id | integer | | | | plain | instance_id | integer | | | | plain | job_id | integer | | | | plain | major_id | integer | | | | plain | minor_id | integer | | | | plain | View definition: SELECT data."time", data.value, data.series_id, series.labels, series.labels[2] AS device_id, series.labels[3] AS instance_id, series.labels[4] AS job_id, series.labels[5] AS major_id, series.labels[6] AS minor_id FROM prom_data.node_disk_info data LEFT JOIN prom_data_series.node_disk_info series ON series.id = data.series_id;更多的查询,可以查看官网的教程计划删除promscale里是pg里配删除计划的是90天删除通过SELECT * FROM prom_info.metric;查看我们可以通过调整来修改TimescaleDB 包括一个后台作业调度框架,用于自动化数据管理任务,例如启用简单的数据保留策略。为了添加这样的数据保留策略,数据库管理员可以创建、删除或更改导致drop_chunks根据某个定义的计划自动执行的策略。要在超表上添加这样的策略,不断导致超过 24 小时的块被删除,只需执行以下命令:SELECT add_retention_policy('conditions', INTERVAL '24 hours');随后删除该策略:SELECT remove_retention_policy('conditions');调度程序框架还允许查看已调度的作业:SELECT * FROM timescaledb_information.job_stats;创建数据保留策略以丢弃超过 6 个月的数据块:SELECT add_retention_policy('conditions', INTERVAL '6 months');复制使用基于整数的时间列创建数据保留策略:SELECT add_retention_policy('conditions', BIGINT '600000');我们可以调整prometheus的数据,参考Data RetentionSELECT set_default_retention_period(180 * INTERVAL '1 day')如下postgres=# SELECT postgres-# set_default_retention_period(180 * INTERVAL '1 day'); set_default_retention_period ------------------------------ t (1 row)已经修改为180天我们打开prometheus和grafana都可以正常的查看grafana关于当前版本的压测在slack上,有一个朋友做了压测Promscale版本 2.30.0promscale 0.10.0Hello everyone, I am doing a promscale+timescaledb performance test with 1 promscale(8cpu 32GB memory), 1 timescaledb(postgre12.9+timescale2.5.2 with 16cpu 32G mem), 1 prometheus(8cpu 32G mem), simulate 2500 node_exporters( 1000 metrics/min * 2500 = 2.5 million metrics/min ) . But it seams not stable,做一个promscale+timescaledb性能测试,1个promscale(8cpu 32GB内存),1个timescaledb(postgre12.9+timescale2.5.2 with 16cpu 32G mem),1个prometheus(8cpu 32G mem),模拟2500个node_exporters(1000 指标/分钟 * 2500 = 250 万指标/分钟)。 但它的接缝不稳定异常如下there are warninigs in prometheus: level=info ts=2022-03-01T12:36:33.365Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=35000 metrics-max-sent-ts=2022-03-01T11:21:48.129Z level=info ts=2022-03-01T12:36:34.365Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=35000 metrics-max-sent-ts=2022-03-01T11:21:48.129Z level=warn ts=2022-03-01T12:36:34.482Z caller=watcher.go:101 msg="[WARNING] Ingestion is a very long time" duration=5m9.705407837s threshold=1m0s level=info ts=2022-03-01T12:36:35.365Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=35000 metrics-max-sent-ts=2022-03-01T11:21:48.129Z level=info ts=2022-03-01T12:36:40.365Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=70000 metrics-max-sent-ts=2022-03-01T11:21:48.129Z and errors in prometheus: Mar 01 20:38:55 localhost start-prometheus.sh[887]: ts=2022-03-01T12:38:55.288Z caller=dedupe.go:112 component=remote level=warn remote_name=ceec38 url=http://192.168.105.76:9201/write msg="Failed to send batch, retrying" err="Post \"http://192.168.105.76:9201/write\": context deadline exceeded" any way to increase the thoughput at current configuration?他们推荐使用 remote_write: - url: "http://promscale:9201/write" write_relabel_configs: - source_labels: [__name__] regex: '.*:.*' action: drop remote_timeout: 100s queue_config: capacity: 500000 max_samples_per_send: 50000 batch_send_deadline: 30s min_backoff: 100ms max_backoff: 10s min_shards: 16 max_shards: 16 随后将配置pg 更改为 14.2 和 remote_write 设置,将 promscale mem 增加到 32G,仍然不稳定。请注意,这是2500台的节点压测,那么,这个朋友的测试可以看到至少在目前看来,promscale的开发版本仍然是处于一个初期。官方并没有进行可靠性压测 .我们期待未来的稳定版本
2022年03月03日
1,968 阅读
0 评论
0 点赞
1
2
...
13