Description
What happened:
When writing a file on a host, sometimes a zero-filled region may be written instead of the actual data.
What you expected to happen:
No data corruption.
How to reproduce it (as minimally and precisely as possible):
On one host: pv -L 3k /dev/urandom > /mnt/jfs-partition/test-file
On another host: while ! hexdump -C /mnt/jfs-partition/test-file | grep '00 00 00 00'; do sleep 1; done
After some time, the command on the second host will find a large cluster of zeroes (>1k) and stop. In the file, you see, for instance:
00033130 26 32 91 00 00 00 00 00 00 00 00 00 00 00 00 00 |&2..............|
00033140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00036000 88 ff a0 0b c6 17 26 95 78 b7 e3 28 f5 35 8b 98 |......&.x..(.5..|
Anything else we need to know?
More details:
- The network on the writing host is quite overloaded/unstable; I believe this may be related because the issue only occurs when the network is overloaded. No other (hardware, disk) failures are observed on the host;
- The zero-byte clusters often, but not always, have a size divisible by 1 KiB;
- EIO errors sometimes happen on the writing side (because of the bad networking), but they don't seem to correlate directly with the corruption;
- Zero clusters appear in the middle of the file, not at the beginning or end;
- Compression and encryption are enabled.
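To make the zero-cluster observations above reproducible, here is a small diagnostic sketch (not part of JuiceFS; `find_zero_runs` is a hypothetical helper) that scans a file for NUL runs and reports whether each run's size is 1 KiB-divisible, matching the >1k threshold used in the repro:

```python
#!/usr/bin/env python3
"""Scan a file for runs of zero bytes and report their offsets and sizes."""
import sys

def find_zero_runs(data: bytes, min_len: int = 1024):
    """Return [(offset, length), ...] for every run of NUL bytes >= min_len."""
    runs = []
    start = None
    for i, b in enumerate(data):
        if b == 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # Handle a run that extends to the end of the file.
    if start is not None and len(data) - start >= min_len:
        runs.append((start, len(data) - start))
    return runs

if __name__ == "__main__":
    data = open(sys.argv[1], "rb").read()
    for off, length in find_zero_runs(data):
        print(f"offset=0x{off:08x} len={length} 1k-divisible={length % 1024 == 0}")
```

Running it over the affected file should list each corrupted region once, which is easier to tabulate than grepping hexdump output.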
Unfortunately, the writing server is a production host, so its logs contain a lot of unrelated noise. The following entries may be relevant (the file inode is 2439005):
2024/07/30 07:53:45.891705 juicefs[4019926] <ERROR>: write inode:2439005 indx:0 input/output error [writer.go:211]
2024/07/30 07:53:52.457389 juicefs[4019926] <WARNING>: slow request: PUT chunks/20/20750/20750481_3_4194304 (req_id: "", err: Put "https://somehost/somedb/%2Fchunks%2F20%2F20750%2F20750481_3_4194304": write tcp 10.42.43.15:60050->100.118.102.36:443: use of closed network connection, cost: 59.994437087s) [cached_store.go:667]
[mysql] 2024/07/30 07:54:35 packets.go:37: read tcp 10.42.43.15:47762->10.0.8.236:3306: read: connection reset by peer
2024/07/30 07:54:35.379260 juicefs[4019926] <WARNING>: Upload chunks/20/20750/20750498_5_4194304: timeout after 1m0s: function timeout (try 1) [cached_store.go:407]
2024/07/30 07:54:39.800111 juicefs[4019926] <INFO>: slow operation: flush (2439005,17488,6D9AF143D1C50E5B) - input/output error <53.869292> [accesslog.go:83]
What else do we plan to try:
- Writing from different hosts with similar networking to rule out hardware issues;
- Checking the metadata to see if the zero clusters correspond to single chunks.
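For the second item, a sketch of the arithmetic we plan to use: mapping a zero run's file offset to the chunk and block it falls in. This assumes the default JuiceFS layout of 64 MiB chunks split into 4 MiB blocks (the `_4194304` suffix in the log's object keys matches a 4 MiB block size); adjust the constants if the volume was created with different settings.

```python
# Hypothetical helper, assuming JuiceFS defaults: 64 MiB chunks, 4 MiB blocks.
CHUNK_SIZE = 64 << 20   # 64 MiB per chunk
BLOCK_SIZE = 4 << 20    # 4 MiB per object-store block (4194304 bytes)

def locate(offset: int):
    """Return (chunk_index, block_index_within_chunk, offset_within_block)
    for a byte offset in the file."""
    chunk = offset // CHUNK_SIZE
    in_chunk = offset % CHUNK_SIZE
    return chunk, in_chunk // BLOCK_SIZE, in_chunk % BLOCK_SIZE

# Example: the zero run observed at file offset 0x33130
print(locate(0x33130))  # -> (0, 0, 209200)
```

The resulting chunk indexes can then be compared against the slice list that the metadata engine holds for inode 2439005, to see whether each zero cluster is confined to a single chunk.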
Environment:
- JuiceFS version (use juicefs --version) or Hadoop Java SDK version: juicefs version 1.2.0+2024-06-18.873c47b922ba (both hosts)
- Cloud provider or hardware configuration running JuiceFS: Bare-metal host on the writing side, Aliyun ECS on the reading side
- OS (e.g. cat /etc/os-release): NixOS 24.11 (Vicuna) (both hosts)
- Kernel (e.g. uname -a): 6.1.90 (both hosts)
- Object storage (cloud provider and region, or self maintained): Aliyun OSS
- Metadata engine info (version, cloud provider managed or self maintained): Aliyun RDS, MySQL
- Network connectivity (JuiceFS to metadata engine, JuiceFS to object storage): Private Aliyun networking on the reading side, public Internet on the writing side
- Others: