Log files on labs instance fill up disk (/var is only 2GB) (tracking)
Closed, Resolved (Public)

Assigned To

Authored By

	hashar
	Aug 15 2014, 12:28 PM

Description

The instances provided on labs eqiad (such as deployment-bastion.eqiad.wmflabs) have a /var/ of only 2GB. It tends to fill up quickly because logs are written there, which often causes issues.

This is a tracking bug to list all /var/ related issues in labs, such as daemons not compressing their logs or logrotate keeping too much data.
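When triaging an instance for this tracking bug, it helps to see how full /var is and which files take the most space. The following is a minimal sketch, not a tool that exists on the instances; the function names `largest_files`, `usage_percent` and `report` are hypothetical, and the default paths are only assumptions based on the description above.

```python
import os
import shutil


def largest_files(root, limit=10):
    """Return up to `limit` (size_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so a symlink is counted by its own size, not its target's
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # The file vanished or is unreadable; skip it
                continue
    sizes.sort(reverse=True)
    return sizes[:limit]


def usage_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def report(var_root="/var", log_root="/var/log"):
    """Print how full var_root is, then the biggest files under log_root."""
    print(f"{var_root} is {usage_percent(var_root):.1f}% full")
    for size, path in largest_files(log_root):
        print(f"{size / 1024 ** 2:8.1f} MiB  {path}")
```

Calling `report()` as a user who can read /var/log prints the usage percentage followed by the ten largest log files, which is usually enough to tell which subtask an instance belongs under.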

Version: unspecified
Severity: normal

Details

Reference: bz69601

Related Objects

Status	Assigned	Task
Resolved	hashar	T71590 rsync errors to beta cluster, inconsistent state after scap
Resolved	hashar	T71601 Log files on labs instance fill up disk (/var is only 2GB) (tracking)
Resolved	hashar	T71604 acct (process and login accounting) fill up instances /var/ partition
Declined	None	T71602 diamond does not compress its logs
Declined	None	T71605 atop (monitoring system) logs fill up instances /var/ partition
Invalid	None	T71976 HHVM emits logs filling /var/log/upstart/hhvm.log and /var/log/syslog/ filling disk
Resolved	yuvipanda	T71979 hhvm creates core file in /tmp/ filling mediawiki02 labs instance root partition
Resolved	hashar	T91354 Process accounting routinely fill up /var on deployment-bastion
Resolved	None	T1259 HHVM core dumps in Beta Cluster
Resolved	hashar	T75262 Beta-cluster web server fills up /var/log with Apache logs
Resolved	bd808	T76119 Get HHVM logs into logstash
Resolved	hashar	T74175 Diamond logstash monitor fills /var/log/apache2 access log
Resolved	• Tgr	T132730 deployment-sentry2 /var is full

Event Timeline

• bzimport raised the priority of this task to High. Nov 22 2014, 3:33 AM

• bzimport added projects: Cloud-VPS, Tracking-Neverending.

• bzimport set Reference to bz69601.

• bzimport added a subscriber: Unknown Object (MLST).

hashar created this task. Aug 15 2014, 12:28 PM

There are two things that can be done to alleviate or entirely fix this issue:

(a) /var/log itself can be made much bigger as needed (there is a puppet class for that very purpose).

(b) You can replace the default /var with an LVM volume instead. This is harder to do after the fact, but not ridiculously so (it does require a reboot so that daemons use the new one, though).

Either of those could be used. Please ping me on IRC if you want to discuss either in detail or need help deploying them.

yuvipanda closed subtask T71976: HHVM emits logs filling /var/log/upstart/hhvm.log and /var/log/syslog/ filling disk as Resolved. Nov 24 2014, 1:00 PM

hashar reopened subtask T71976: HHVM emits logs filling /var/log/upstart/hhvm.log and /var/log/syslog/ filling disk as Open. Nov 24 2014, 2:17 PM

greg added a project: Beta-Cluster-Infrastructure. Nov 25 2014, 12:05 AM

greg moved this task from To Triage to Next: Maintenance on the Beta-Cluster-Infrastructure board.

greg lowered the priority of this task from High to Medium. Nov 25 2014, 8:28 PM

I'll note that this is finally going to be fixed shortly by T87003.

hashar added a subtask: T91354: Process accounting routinely fill up /var on deployment-bastion. Mar 3 2015, 9:42 AM

Krenair subscribed. Mar 3 2015, 3:44 PM

yuvipanda closed subtask T71979: hhvm creates core file in /tmp/ filling mediawiki02 labs instance root partition as Resolved. Mar 5 2015, 1:19 PM

greg added a subtask: T1259: HHVM core dumps in Beta Cluster. Mar 10 2015, 8:53 PM

This is "fixed" only insofar as there is now more /var to fill before things break, but that provides no guard against unbounded log growth.
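To illustrate what a guard against unbounded growth looks like, here is a minimal sketch using size-based rotation from Python's standard `logging.handlers.RotatingFileHandler`. It is not how any of the daemons tracked here are configured; the helper name `bounded_logger` and the default sizes are assumptions chosen for the example.

```python
import logging
from logging.handlers import RotatingFileHandler


def bounded_logger(name, path, max_bytes=10 * 1024 * 1024, backups=3):
    """Return a logger whose files on disk stay near max_bytes * (backups + 1)."""
    # The handler rolls path -> path.1 -> path.2 ... and deletes the oldest,
    # so the total footprint is capped no matter how much is logged.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    # Keep records out of the root logger so they are not written twice
    logger.propagate = False
    return logger
```

With the defaults above, a service logging through `bounded_logger("myservice", "/var/log/myservice.log")` (a hypothetical name and path) would never hold more than roughly 40 MiB of logs, whatever the size of /var. For daemons that write their own files, a logrotate size limit serves the same purpose.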

Aklapper added a project: Cloud-Services. Apr 29 2015, 12:18 PM

zhuyifei1999 moved this task from Triage to Tracking on the Cloud-Services board. Jul 16 2015, 6:01 AM

Restricted Application added a subscriber: Luke081515. Jul 16 2015, 6:01 AM

Krenair added a subtask: T75262: Beta-cluster web server fills up /var/log with Apache logs. Apr 13 2016, 8:53 PM

Krenair added a subtask: T74175: Diamond logstash monitor fills /var/log/apache2 access log.

Apr 12 20:39:13 <bd808>	Krenair: on deployment-mediawiki01 it looks like the biggest disk hog is /var/cache/hhvm/fcgi.hhbc.sq3 (2.3G).
Apr 12 20:39:47 <bd808>	Krenair: judging by the output of `sqlite3 fcgi.hhbc.sq3 .tables` we aren't cleaning that file up when we deploy new HHVM builds
Apr 12 20:40:48 <bd808>	a "feature" of hhbc files is that they version the cache to match the exact HHVM build
Apr 12 20:41:10 <bd808>	so when we deploy a new hhvm binary it starts a fresh cache segment
Apr 12 20:41:22 <bd808>	but nothing automatically cleans up the old cache data
Apr 12 20:43:09 <bd808>	it looks like we have 9 versions of each table in that particular hhbc cache
Apr 12 20:46:38 <bd808>	!log Cleaned up large hhbc cache file on deployment-mediawiki01 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
Apr 12 20:47:18 <bd808>	!log Cleaned up large hhbc cache file on deployment-mediawiki02 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
Apr 12 20:47:47 <bd808>	!log Cleaned up large hhbc cache file on deployment-mediawiki03 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
Apr 12 20:48:13 <bd808>	That got back about 2G on each MW server
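The `sqlite3 fcgi.hhbc.sq3 .tables` check in the log above can also be scripted to count how many versions of each table have piled up. This is only a sketch: it assumes each table name ends in `_<version tag>`, which the log does not confirm, and the function name `table_versions` is hypothetical.

```python
import sqlite3
from collections import Counter
from pathlib import Path


def table_versions(db_path):
    """Count tables per base name, treating the text after the last '_' as a version tag."""
    # Open read-only so inspecting the cache can never modify or create it
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()
    counts = Counter()
    for (name,) in rows:
        base, sep, _tag = name.rpartition("_")
        # Names without an underscore are counted under their full name
        counts[base if sep else name] += 1
    return counts
```

If the naming assumption holds, a count well above 1 for every base name in `table_versions("/var/cache/hhvm/fcgi.hhbc.sq3")` would suggest that stale cache segments from older HHVM builds are still in the file.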

Krenair created subtask T132730: deployment-sentry2 /var is full. Apr 14 2016, 10:18 PM

hashar closed subtask T1259: HHVM core dumps in Beta Cluster as Resolved. Apr 19 2016, 8:44 AM

hashar closed subtask T91354: Process accounting routinely fill up /var on deployment-bastion as Resolved. Apr 19 2016, 9:03 AM

• Quiddity removed a parent task: T4007: [DO NOT USE] Tracking bug [superseded by #Tracking]. Jul 14 2016, 4:09 AM

Joe closed subtask T71976: HHVM emits logs filling /var/log/upstart/hhvm.log and /var/log/syslog/ filling disk as Invalid. Jul 22 2016, 6:04 AM

hashar closed subtask T71605: atop (monitoring system) logs fill up instances /var/ partition as Declined. Jul 22 2016, 12:07 PM

• Tgr closed subtask T132730: deployment-sentry2 /var is full as Resolved. Aug 4 2016, 2:11 AM

greg moved this task from Next: Maintenance to Epics / Tracking on the Beta-Cluster-Infrastructure board. Aug 5 2016, 8:49 PM

I haven't seen this issue occur in a long while.

hashar closed subtask T75262: Beta-cluster web server fills up /var/log with Apache logs as Resolved. Aug 22 2016, 8:26 AM

That was a transient issue caused by labs instances having a /var of just 2 GB, which is really not enough. It was solved by reducing log spam, and I believe most instances now have a much larger /var.

The diamond probe still fills the apache2 access log, but that is a low-priority issue (T74175).

hashar closed subtask T74175: Diamond logstash monitor fills /var/log/apache2 access log as Resolved. Sep 26 2016, 10:12 AM

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:58 PM

Restricted Application added a project: Release-Engineering-Team (Kanban). Jun 7 2017, 6:58 PM

• Phabricator_maintenance edited projects, added RelEng-Archive-FY201718-Q1; removed Release-Engineering-Team (Kanban), Cloud-Services. Sep 26 2017, 11:48 PM

Liuxinyu970226 moved this task from Tag to Transition completed / Archived on the Tracking-Neverending board. Mar 21 2019, 10:31 AM