
Log files on labs instance fill up disk (/var is only 2GB) (tracking)
Closed, Resolved (Public)

Description

The instances provided on labs eqiad (such as deployment-bastion.eqiad.wmflabs) have a /var/ of only 2GB. That tends to fill up quickly because logs are written there, which often causes issues.

This is a tracking bug to list all /var/ related issues in labs, such as daemons not compressing their logs or logrotate keeping too much data.


Version: unspecified
Severity: normal

Details

Reference
bz69601

Event Timeline

bzimport raised the priority of this task from to High. Nov 22 2014, 3:33 AM
bzimport set Reference to bz69601.
bzimport added a subscriber: Unknown Object (MLST).

There are two things that can be done to alleviate/fix this issue entirely:

(a) /var/log itself can be made much bigger if needed (there is a puppet class for that very purpose).

(b) you can replace the default /var with an LVM volume instead. This is harder to do after the fact, but not ridiculously so (it does require a reboot for daemons to use the new one, though).

Either of those could be used - please ping me on IRC if you want to discuss either in detail and for help deploying them.

greg lowered the priority of this task from High to Medium. Nov 25 2014, 8:28 PM

I'll note that this is finally going to be fixed shortly by T87003.

This is "fixed" only insofar as there is now more /var to fill before things break; but that provides no guard against unbounded log growth.
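
As a rough illustration of what such a guard could start from, the sketch below lists the largest files under a directory so that runaway logs can be spotted before the partition fills. It is only an example under stated assumptions, not something deployed on labs: the helper name `largest_files` and the default limit of ten results are hypothetical.

```python
import heapq
import os


def largest_files(root, limit=10):
    """Return up to `limit` (size_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat measures a symlink itself rather than its target
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # The file vanished or is unreadable; skip it
                continue
    return heapq.nlargest(limit, sizes)
```

Calling `largest_files("/var")` from a periodic job and alerting on unexpectedly large entries would be one possible use; it reports growth but does not bound it.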

Apr 12 20:39:13 <bd808>	Krenair: on deployment-mediawiki01 it looks like the biggest disk hog is /var/cache/hhvm/fcgi.hhbc.sq3 (2.3G).
Apr 12 20:39:47 <bd808>	Krenair: judging by the output of `sqlite3 fcgi.hhbc.sq3 .tables` we aren't cleaning that file up when we deploy new HHVM builds
Apr 12 20:40:48 <bd808>	a "feature" of hhbc files is that they version the cache to match the exact HHVM build
Apr 12 20:41:10 <bd808>	so when we deploy a new hhvm binary it starts a fresh cache segment
Apr 12 20:41:22 <bd808>	but nothing automatically cleans up the old cache data
Apr 12 20:43:09 <bd808>	it looks like we have 9 versions of each table in that particular hhbc cache
Apr 12 20:46:38 <bd808>	!log Cleaned up large hhbc cache file on deployment-mediawiki01 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
Apr 12 20:47:18 <bd808>	!log Cleaned up large hhbc cache file on deployment-mediawiki02 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
Apr 12 20:47:47 <bd808>	!log Cleaned up large hhbc cache file on deployment-mediawiki03 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
Apr 12 20:48:13 <bd808>	That got back about 2G on each MW server
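
The check described in the log above (listing tables to see how many cache versions have piled up) can be approximated in a few lines of Python. This is a minimal sketch that assumes each table name ends in an underscore followed by a build-specific identifier; that naming scheme is an assumption and has not been verified against HHVM here. `count_cache_versions` is a hypothetical helper name.

```python
import sqlite3
from collections import Counter


def count_cache_versions(db_path):
    """Count tables per base name, treating the text after the last '_' as a version id."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()
    counts = Counter()
    for (name,) in rows:
        base, sep, _version = name.rpartition("_")
        # Names without an underscore are counted under their full name
        counts[base if sep else name] += 1
    return dict(counts)
```

A count well above one for a given base name would suggest stale segments left behind by earlier builds, which is consistent with the "9 versions of each table" observation.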

I haven't seen this issue occur in a long while.

hashar claimed this task.

That was a transient issue caused by labs instances having a /var of just 2GB, which is really not enough. It was solved by cutting down on log spam, and I believe most instances now have a much larger /var.

There is still the diamond probe that fills up the apache2 log, but that is really a low-priority issue ( T74175 ).