Bug description
On a test Raspberry Pi I have, I experienced this issue:
netdata was running
disk got almost full
netdata agent properly raised disk full alarm to WARNING, which was sent to the cloud
disk got almost 100% full
netdata agent tried to commit dbengine data to disk, which failed
10min_dbengine_global_io_errors was raised and the alarm was sent to the cloud
disk got 100% full
netdata agent properly raised disk full alarm to CRITICAL, which was NOT sent to the cloud
At this point, the netdata agent was still running. I didn't check if it was responsive or not.
I removed a file to make disk space and tried to see if netdata could recover from the situation.
I realized that netdata could not respond to dashboard queries. All queries were just hanging.
I tried to restart netdata. systemd killed it after some time (90 secs I think).
After netdata restart, the cloud didn't sync alarms properly. Certain alarms are still raised in the cloud, while none is raised at the agent, so the cloud failed to detect that there is a discrepancy in the alarm log.
Expected behavior
netdata should survive a disk full situation. Even if data cannot be saved to disk, netdata should continue to function properly, given that old data may have to be discarded.
netdata should always send alerts to the cloud, even if it cannot commit the alarm log to disk. The cloud should be aware of all alerts, even if disk is not usable.
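A minimal sketch of this expectation, not netdata's actual code: cloud delivery happens first and never depends on the local write succeeding. The names `AlertDispatcher`, `persist` and `send` are hypothetical; `persist` stands in for the alarm log write and `send` for the cloud link.

```python
from collections import deque


class AlertDispatcher:
    """Deliver alerts to the cloud even when the local alarm log cannot be written."""

    def __init__(self, persist, send, max_pending=1000):
        self.persist = persist                    # hypothetical: writes to the alarm log
        self.send = send                          # hypothetical: pushes to the cloud
        self.unsaved = deque(maxlen=max_pending)  # bounded, so memory use stays fixed

    def raise_alert(self, alert):
        # Cloud delivery comes first and does not depend on the disk.
        self.send(alert)
        try:
            self.persist(alert)
        except OSError:
            # Disk full (ENOSPC), I/O error, missing disk: keep the alert for a retry.
            self.unsaved.append(alert)

    def retry_unsaved(self):
        """Persist queued alerts in order; stop at the first failure."""
        while self.unsaved:
            try:
                self.persist(self.unsaved[0])
            except OSError:
                return False
            self.unsaved.popleft()
        return True
```

Calling `retry_unsaved()` periodically would let the alarm log catch up once space is available again. The queue is bounded, so under a long outage the oldest unsaved alerts are dropped from the local log, but they have already been sent.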
netdata should properly re-sync with the cloud after a crash. A crash may mean an alarm snapshot re-sync is needed. At least, we should be able to trigger this somehow.
Disk full situation
Netdata may have to change the dbengine rotation policy to adapt to disk full situations.
So, once it cannot append metrics to a disk file, it could trigger a dbengine file rotation. dbengine file rotation could be done by moving the oldest file to the newest position, zeroing its headers and using the existing file as a preallocated buffer to commit data to disk.
This strategy of using preallocated disk space could be the default, to allow dbengine to always use a fixed amount of disk space for metrics.
Ideally, we would like to have something similar for sqlite. I hope there is a solution for this.
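The rotation idea above can be sketched roughly as follows. This is an illustration only, not the dbengine file format: `PreallocatedRing`, `HEADER_SIZE` and `FILE_SIZE` are hypothetical names and values. All files are written out in full at startup, appends overwrite in place, and when the newest file is full the oldest one is moved to the newest position and its header is zeroed.

```python
import os
import tempfile

HEADER_SIZE = 16   # hypothetical header length, in bytes
FILE_SIZE = 4096   # hypothetical fixed size of each data file, in bytes


class PreallocatedRing:
    """A fixed set of equally sized files; the oldest is recycled instead of growing."""

    def __init__(self, directory, count):
        self.paths = []
        for i in range(count):
            path = os.path.join(directory, f"datafile-{i}.bin")
            with open(path, "wb") as f:
                # Write real zeros so the space is claimed up front.
                f.write(b"\0" * FILE_SIZE)
            self.paths.append(path)
        self.order = list(range(count))  # oldest first, newest (active) last
        self.offset = HEADER_SIZE        # next write position in the active file

    def rotate(self):
        """Move the oldest file to the newest position and zero its header."""
        oldest = self.order.pop(0)
        self.order.append(oldest)
        with open(self.paths[oldest], "r+b") as f:
            f.write(b"\0" * HEADER_SIZE)
        self.offset = HEADER_SIZE

    def append(self, payload):
        """Overwrite in place inside the active file; return the index of the file used."""
        if len(payload) > FILE_SIZE - HEADER_SIZE:
            raise ValueError("payload larger than one data file")
        if self.offset + len(payload) > FILE_SIZE:
            self.rotate()
        active = self.order[-1]
        with open(self.paths[active], "r+b") as f:
            f.seek(self.offset)
            f.write(payload)
        self.offset += len(payload)
        return active
```

With this layout the total size on disk stays at `count * FILE_SIZE` no matter how much data is appended, so a full disk cannot make an append grow a file.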
No disk situation
Disks may also fail completely. No disk at all, suddenly, at runtime.
Netdata runtime should be able to survive such a situation and continue running, triggering alarms, streaming metrics to a parent, communicating with netdata cloud.
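One way to picture this requirement is a storage wrapper that drops to memory-only mode when the disk fails and retries the disk occasionally. This is a sketch under assumptions, not how the agent is implemented: `ResilientStore`, `disk_write`, `memory_samples` and `retry_every` are all hypothetical.

```python
from collections import deque


class ResilientStore:
    """Keep collecting when the disk fails; retry the disk every few writes."""

    def __init__(self, disk_write, memory_samples=3600, retry_every=60):
        self.disk_write = disk_write                 # hypothetical: commits one sample to disk
        self.memory = deque(maxlen=memory_samples)   # recent samples; old ones are discarded
        self.retry_every = retry_every
        self.degraded = False
        self.skipped = 0

    def write(self, sample):
        """Store a sample; return True only if it also reached the disk."""
        self.memory.append(sample)
        if self.degraded:
            self.skipped += 1
            if self.skipped % self.retry_every != 0:
                return False  # memory-only mode: do not touch the disk
        try:
            self.disk_write(sample)
        except OSError:
            self.degraded = True
            return False
        self.degraded = False
        self.skipped = 0
        return True
```

Queries, alarms and streaming would read from `memory` in this model, so they keep working while `degraded` is set. Samples collected during the outage are not backfilled to disk in this sketch.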
Steps to reproduce
Run netdata
Get the disk 100% full
or
Run netdata from a removable disk
remove the disk on the fly
Installation method
kickstart.sh
System info
Any
Netdata build info
# /opt/netdata/bin/netdata -W buildinfo
Version: netdata v1.33.1-99-gcf90fc9e8
Configure options: '--prefix=/opt/netdata/usr' '--sysconfdir=/opt/netdata/etc' '--localstatedir=/opt/netdata/var' '--libexecdir=/opt/netdata/usr/libexec' '--libdir=/opt/netdata/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--enable-cloud' '--without-bundled-protobuf' '--with-bundled-libJudy' 'CFLAGS=-static -O3 -I/openssl-static/include' 'LDFLAGS=-static -L/openssl-static/lib' 'PKG_CONFIG_PATH=/openssl-static/lib/pkgconfig'
Install type: kickstart-static
Binary architecture: armv7l
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
ACLK Next Generation: YES
ACLK-NG New Cloud Protocol: YES
ACLK Legacy: NO
TLS Host Verification: YES
Machine Learning: YES
Stream Compression: YES
Libraries:
protobuf: YES (system)
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: NO
EBPF: YES
IPMI: NO
NFACCT: NO
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: YES
I generally agree that the preallocated space solution should probably be the default here (and I’m 99% certain that there is a way to get SQLite to preallocate space too). In most cases, that should actually improve performance for us and lower the overall impact on the rest of the system.
However, it will not fix this issue in all cases. Specifically, on filesystems that use copy-on-write semantics for internal updates (BTRFS, ZFS, etc) this approach can still fail (because you need at least enough room for the new data there regardless).
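For the SQLite side, one commonly used trick is to insert a large zero-filled blob and delete it again: with auto-vacuum off, the freed pages stay in the file on a freelist and are reused by later writes. The sketch below assumes that approach; the `preallocate` helper and the `_reserve` table are hypothetical, and this is not confirmed to be what netdata would use. Note that the rollback journal (or WAL) still needs free space of its own, so this alone does not guarantee that writes succeed on a full disk.

```python
import os
import sqlite3
import tempfile


def preallocate(path, reserve_bytes):
    """Grow the database file by about reserve_bytes and return the free page count."""
    conn = sqlite3.connect(path)
    try:
        # With auto_vacuum off (the usual default), freed pages stay in the file.
        conn.execute("PRAGMA auto_vacuum = NONE")
        conn.execute("CREATE TABLE IF NOT EXISTS _reserve (filler BLOB)")
        # zeroblob(N) produces an N-byte blob of zeros, forcing the file to grow.
        conn.execute("INSERT INTO _reserve VALUES (zeroblob(?))", (reserve_bytes,))
        conn.commit()
        # Deleting the row moves its pages to the freelist without shrinking the file.
        conn.execute("DELETE FROM _reserve")
        conn.commit()
        return conn.execute("PRAGMA freelist_count").fetchone()[0]
    finally:
        conn.close()
```

A call such as `preallocate("/path/to/meta.db", 64 * 1024 * 1024)` would reserve roughly 64 MiB; the path and size here are placeholders. Running `VACUUM` later would give the reserved space back.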
Additional info
Related to netdata/netdata-cloud#323