Replies: 1 comment · 11 replies
-
This error looks like the server is getting errors while sending requests to the S3 service. Unfortunately, we can't determine the root cause from this alone.
You can specify the version of the self-monitoring db in the YAML. Normally, space usage on S3 should be about the same as usage on local disk. It would be helpful if we could see the earliest error.
-
How long is the timeout for S3 queries? This is all in our data center, so we shouldn't be seeing issues with latency or anything; there's 10G networking between my VM and the S3 service.
-
It should be 30s by default; see `greptimedb/config/standalone.example.toml`, lines 477 to 482 at commit 2f82e75.
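For reference, the relevant section of that example config looks roughly like this (a sketch from memory; exact key names and defaults should be checked against the file at that commit):

```toml
# Sketch of the storage HTTP client options in standalone.example.toml
# (key names and defaults are assumptions; verify against your version).
[storage.http_client]
# Timeout for each request to the object store, 30s by default.
timeout = "30s"
# Timeout for establishing the connection.
connect_timeout = "30s"
```

Raising `timeout` here is one way to rule out slow S3 responses, though with 10G networking inside the data center the default should normally be plenty.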
-
Thanks. OK, is there any chance a "datatype incompatible" error could cause something like this? I was watching my logs and a timeout error occurred. I immediately tried to find the file using rclone, and it was not there. So I started watching the logs, plus the 10 lines around each match, grepping for that file, and when it popped up again, I got the following: Is there anything in there that helps us diagnose things? Oh, and for reference, here is my docker compose file.

networks:
traefiknet:
external: True
volumes:
observe_db_share:
driver_opts:
type: "nfs"
o: "addr=internal1,nolock,soft,rw"
device: ":/vol/volname/group/prod/observe/share"
observe_db_data:
driver_opts:
type: "nfs"
o: "addr=internal1,nolock,soft,rw"
device: ":/vol/volname/group/prod/observe/data"
services:
observe-db:
container_name: observe-db
hostname: observe-db
image: greptime/greptimedb:v1.0.0-beta.4
restart: always
command:
- "standalone"
- "start"
- "--user-provider"
- "static_user_provider:file:/etc/greptime/users"
- "--http-addr"
- "0.0.0.0:4000"
- "--rpc-bind-addr"
- "0.0.0.0:4001"
- "--mysql-addr"
- "0.0.0.0:4002"
- "--postgres-addr"
- "0.0.0.0:4003"
networks:
traefiknet:
extra_hosts:
- "s3host:pinipforserver"
volumes:
- "observe_db_share:/opt/share"
- "/srv/observe-db/config:/etc/greptime"
# - "observe_db_data:/greptimedb_data"
- "/var/greptime-data:/greptimedb_data"
environment:
GREPTIMEDB_STANDALONE__LOGGING__LEVEL: "info"
GREPTIMEDB_STANDALONE__STORAGE__TYPE: "S3"
GREPTIMEDB_STANDALONE__STORAGE__DATA_HOME: "/greptimedb_data"
GREPTIMEDB_STANDALONE__STORAGE__BUCKET: "greptime-data"
GREPTIMEDB_STANDALONE__STORAGE__ROOT: "/observe-db-prod"
GREPTIMEDB_STANDALONE__STORAGE__REGION: "us-west-2"
GREPTIMEDB_STANDALONE__STORAGE__ENDPOINT: "https://s3host"
GREPTIMEDB_STANDALONE__STORAGE__ACCESS_KEY_ID: "redacted"
GREPTIMEDB_STANDALONE__STORAGE__SECRET_ACCESS_KEY: "redacted"
GREPTIMEDB_STANDALONE__STORAGE__REGION_ENGINE__MITO__MANIFEST_CACHE_SIZE: "2048MB" # Did I get the var format correct for this?
GREPTIMEDB_STANDALONE__STORAGE__REGION_ENGINE__MITO__WRITE_CACHE_SIZE: "20000MB" # Did I get the var format correct for this?
mem_limit: "20GB"
cpus: 4.0
ulimits:
nofile:
soft: 262144
hard: 262144
labels:
traefik.enable: "true"
traefik.docker.network: traefiknet
traefik.http.services.observe-db-http.loadbalancer.server.port: "4000"
traefik.tcp.services.observe-db-rpc.loadbalancer.server.port: "4001"
traefik.tcp.services.observe-db-mysql.loadbalancer.server.port: "4002"
traefik.tcp.services.observe-db-postgres.loadbalancer.server.port: "4003"
traefik.http.routers.observe-db-http.entrypoints: "web, greptime4000"
traefik.http.routers.observe-db-http.rule: "Host(`greptimehost`)"
traefik.http.routers.observe-db-http.service: "observe-db-http"
traefik.tcp.routers.observe-db-rpc.entrypoints: "greptime4001"
traefik.tcp.routers.observe-db-rpc.rule: "HostSNI(`*`)"
traefik.tcp.routers.observe-db-rpc.service: "observe-db-rpc"
traefik.tcp.routers.observe-db-mysql.entrypoints: "greptime4002"
traefik.tcp.routers.observe-db-mysql.rule: "HostSNI(`*`)"
traefik.tcp.routers.observe-db-mysql.service: "observe-db-mysql"
traefik.tcp.routers.observe-db-postgres.entrypoints: "greptime4003"
traefik.tcp.routers.observe-db-postgres.rule: "HostSNI(`*`)"
traefik.tcp.routers.observe-db-postgres.service: "observe-db-postgres"
traefik.http.middlewares.observe-greptime-internal-only.ipallowlist.sourcerange: "ipranges"
traefik.http.middlewares.observe-greptime-internal-only.ipallowlist.ipstrategy.depth: "0"
traefik.http.routers.observe-db-http.middlewares: "observe-greptime-internal-only"

For the write cache size, I was wondering if greptime was running out of space to cache the data I am pushing, and whether that was doing something to the system. So I tried setting it super high to see if anything changed. Either I have the wrong pattern for the variable, or I was wrong.
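One way to sanity-check the variable naming: GreptimeDB's env-var convention maps a variable to a config key by stripping the `GREPTIMEDB_STANDALONE__` prefix, lowercasing, and turning each `__` into a dot. A small sketch of that mapping (the helper function is mine, and the convention itself is worth double-checking against the docs):

```shell
# Map a GreptimeDB-style env var name to the config key it should target
# (assumed convention: strip prefix, lowercase, "__" becomes ".").
env_to_key() {
  echo "$1" | sed 's/^GREPTIMEDB_STANDALONE__//' | tr '[:upper:]' '[:lower:]' | sed 's/__/./g'
}

env_to_key "GREPTIMEDB_STANDALONE__STORAGE__TYPE"
# -> storage.type
env_to_key "GREPTIMEDB_STANDALONE__STORAGE__REGION_ENGINE__MITO__WRITE_CACHE_SIZE"
# -> storage.region_engine.mito.write_cache_size
```

Under that mapping, the two cache variables above would land under `storage.region_engine...`, while `region_engine` appears to be a top-level section in the example config rather than a child of `storage`, which might be why they had no visible effect.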
-
It was unrelated; it happened in request validation. The data you ingested had a column, `in_predicate_conversion_threshold`, with an inconsistent data type.
The log you provided was a write operation that timed out, so the file wasn't visible yet. The database retries the request later, and in this case the request succeeded after a retry. If the timeouts don't persist for a long time, it should be fine; it was a temporary failure. You can also ignore the error if the ingestion looks OK. You can check the metrics to confirm.
If you have Grafana, you can install our dashboard. If not, a small Prometheus instance will also be helpful; it only has to scrape and store metrics for your db.
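A minimal scrape config for that small Prometheus instance might look like the following (the metrics path and port are assumptions based on the HTTP port in the compose file above; verify against your deployment):

```yaml
# prometheus.yml (sketch; endpoint details are assumptions to verify)
scrape_configs:
  - job_name: "greptimedb"
    metrics_path: "/metrics"          # assumed metrics endpoint
    scrape_interval: 15s
    static_configs:
      - targets: ["observe-db:4000"]  # HTTP port from the compose file
```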
The region engine config cannot be configured via environment variables; you have to specify it in a configuration file. You can curl the server afterwards to check which configuration actually took effect.
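A sketch of what that configuration file might contain, assuming the goal is the mito write cache (key names vary between releases, so check the `standalone.example.toml` shipped with your version; the size here is illustrative, not a recommendation):

```toml
# Passed to greptimedb via a config file rather than env vars
# (key names are assumptions; verify against standalone.example.toml).
[region_engine.mito]
# Capacity of the local write cache used when storage is on S3.
write_cache_size = "20GiB"
```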
-
Just FYI, I have not had time to dig back into this recently.
-
(I posted an earlier discussion on this topic before I had all the details, and I asked for help on Slack earlier, but it is the end of the day, so I'm going to post this here and come back to it tomorrow.)
I'm running into a bunch of errors in my logs that I haven't been able to diagnose.
They seem to all relate to errors reading files from s3 storage.
I expect some, if not all, of this has to do with the fact that I ran out of space in my storage bucket.
I'm also wondering whether being in the middle of migrating data might have something to do with it.
So here is my detailed background:
I have a standalone, dockerized instance of greptime with storage on an iSCSI local mount. This instance has data from around 5/01/2025 to around 12/15/2025 and currently uses 106G on the iSCSI mount.
Per my post here I am migrating that data, a small chunk at a time, into a new instance.
This new instance is also dockerized, but has the entire virtual machine to itself, and the data storage is now on an S3-compatible system (specifically NetApp's service; that's not my area, so I can't recall exactly what they call it).
For migration, I wrote a script that splits up the data into 15-minute chunks while exporting to a shared NFS mount on the old instance. I have the importing scripted as well, and that runs on the new instance.
So I export some data, then I import it, then I repeat the process.
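The export loop is roughly this shape (a sketch, not the actual script: the 15-minute window math is the real idea, but the export command itself is left as a hypothetical placeholder):

```shell
# Walk a time range in 15-minute windows (assumes GNU date).
start=$(date -u -d "2025-05-01T00:00:00Z" +%s)
end=$(date -u -d "2025-05-01T01:00:00Z" +%s)
step=900   # 15 minutes in seconds

t=$start
while [ "$t" -lt "$end" ]; do
  from=$(date -u -d "@$t" +"%Y-%m-%dT%H:%M:%SZ")
  to=$(date -u -d "@$((t + step))" +"%Y-%m-%dT%H:%M:%SZ")
  echo "exporting window $from -> $to"
  # Hypothetical export step: the real script issues a time-bounded
  # export for [$from, $to) to the shared NFS mount here.
  t=$((t + step))
done
```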
Monday I was checking the status of my migration when I found that the s3 bucket was full.
This was a surprise since I had 250G available, and had only imported December, November, and part/most of October. Not the full 7 months worth of data. And that 7 months only took up less than half of what was available...
So I turned off all the ingest, except the greptime self-monitoring stuff, and then tried to clear out some data. I immediately ran into the fact that greptime actually needs free space in order to do anything, and it had none.
After getting the bucket more space, I started trying to figure things out today.
My first step was to turn off greptime self-monitoring so I could be sure none of the errors had to do with that. I had also noticed that a lot of the paths mentioned in the errors were for the self-monitoring db, so I went ahead and dropped that database to see if it would clear things up.
It did recover a lot of space, but I'm still seeing errors for my other databases.
My current attempt at figuring something out is running compaction on the tables in my metrics db. Considering the results, I think compaction isn't actually about saving space and is more about rearranging data to be easier to read. And I do keep seeing errors for some of the tables, though it does work for others. Like:
And I think this is the error log from the container for that message:
So, how do I clear up these errors so that I can get my monitoring working again?
Why does data usage seem so much higher on S3? Or is it? I had recently started shipping a few more VMs' worth of metrics and logs, but only about 6 more, where I had about 25-30 VMs prior.
Is the way I'm importing, in such small chunks of data, causing issues?
Is there a way to tell greptime to find all the s3 paths it can't read and then remove anything related to them? That way it wouldn't try to use paths that don't exist.
Or am I just following the wrong path entirely and there's something else I should look into?