
Crunch and delete many old dumps logs
Open, High, Public

Description

So it seems that @Addshore set up this job to pull and process dumps access logs. It was set up as part of T118739, writes the logs to stat1007 at /srv/log/webrequest/archive/dumps.wikimedia.org, and is processed here.

We found this because @WDoranWMF would like some basic stats on dumps downloads. He will detail this here so we can know how to crunch the existing data.

After said crunching, we should set up a job to delete old logs in accordance with our retention policy. Thanks to @ArielGlenn for the data archeology.

Event Timeline

fdans lowered the priority of this task from High to Medium. Apr 26 2021, 3:49 PM
fdans moved this task from Incoming to Datasets on the Analytics board.

@Milimetric Just following up here. What we would like to know is downloads per dump over time; I think that might have been the original ask for the retention. Not sure what else might be available - what's the log format?

The format looks like Common Log Format with two additional fields, "full URI requested" and "user agent".
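
For illustration, a line in that format could be parsed with something like the sketch below. The exact field order and quoting are assumptions (standard CLF fields followed by the two extra quoted fields), not checked against the actual files:

```python
import re

# Common Log Format plus two trailing quoted fields (assumed: full URI, user agent).
CLF_PLUS = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<full_uri>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line: str):
    """Return a dict of fields, or None if the line does not match the assumed layout."""
    m = CLF_PLUS.match(line)
    return m.groupdict() if m else None
```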

If it's just a simple data crunch, maybe @Addshore already did it in the computation mentioned above, or maybe someone in Product-Analytics would like to try.
When this analysis is done, let's work with @Addshore and/or @ArielGlenn to truncate logs older than 90 days.
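For the retention step, something along these lines could do it. This is only a sketch: the archive path is taken from the task description, the flat 90-day cutoff and running it directly on stat1007 are assumptions for whoever ends up owning the cleanup:

```python
import time
from pathlib import Path

# Archive path from the task description; 90-day cutoff per the retention policy.
ARCHIVE = Path("/srv/log/webrequest/archive/dumps.wikimedia.org")
CUTOFF = time.time() - 90 * 24 * 3600

for f in ARCHIVE.rglob("*"):
    if f.is_file() and f.stat().st_mtime < CUTOFF:
        f.unlink()  # delete log files older than 90 days
```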

Pinging @Addshore directly, any chance you are still generating this data, or alternatively, that you still have the tools around and could easily do so?

Also asking the Analytics folks generally, and I guess @Milimetric specifically: these are just plain old boring web logs; isn't there some existing tool the team has that they could be dumped into?

There's no need for a fancy tool; this would be a few lines of Spark to read the data and save it to, probably, a Hive table with an explicit schema. It should take a day to set up and some time after that to run some analysis. We just don't have the capacity right now; there's a lot of higher-priority work going on. But it's relatively easy for anyone to play with. The only concern for me that's a bit time-sensitive is that there are a bunch of IPs in the logs.
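
As a rough PySpark sketch of that approach (the HDFS path, table name, and column set are hypothetical placeholders, not an actual job):

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("dumps-access-logs")   # hypothetical job name
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical HDFS location after copying the archived logs over from stat1007.
raw = spark.read.text("/wmf/data/raw/dumps_access_logs/*.gz")

# Split the CLF-style line into columns; the regex mirrors the assumed format above.
pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$'
parsed = raw.select(
    F.regexp_extract("value", pattern, 1).alias("ip"),
    F.regexp_extract("value", pattern, 2).alias("ts"),
    F.regexp_extract("value", pattern, 3).alias("request"),
    F.regexp_extract("value", pattern, 4).cast("int").alias("status"),
    F.regexp_extract("value", pattern, 5).alias("size"),       # may be "-", kept as string
    F.regexp_extract("value", pattern, 6).alias("full_uri"),
    F.regexp_extract("value", pattern, 7).alias("user_agent"),
)

# Hypothetical target table; the schema is made explicit by the select above.
parsed.write.mode("overwrite").saveAsTable("some_db.dumps_access_logs")
```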

Hey @WDoranWMF @Milimetric,

I had a chat with @ArielGlenn today; I can help with a one-off analysis, but I'd need to understand needs and scope. Before moving forward let's make sure we would not be replicating analysis work already made available by @Addshore.

@WDoranWMF the data is split between access and error logs (https://phabricator.wikimedia.org/T280678#7066462).

Accessing the data seems pretty straightforward, and the data volumes (~11G) are manageable.

What level of reporting would you need?

  • A count of (unique) requests per day/hour, with some minimal bot filtering (for obvious cases - if needed), should be easy.
  • Breakdowns per wiki, platform (web/mobile/bot), or similar would be more work. It would require parsing URIs and user agents before crunching the data. From the full URI field we can extract which dump is being requested (e.g. /enwiki/, /wikidatawiki/, /other/pageviews/, /liwikisource/, /frwikivoyage/, etc.); see the sketch after this list. There's timestamped user access from mobile and desktop, and what looks like a non-trivial amount of bots / scrapers.
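
A sketch of that URI parsing, reading from the hypothetical table in the earlier Spark snippet; the timestamp format and the assumption that full_uri starts with a path like "/enwiki/..." would need checking against the real data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical table written by the earlier sketch.
logs = spark.table("some_db.dumps_access_logs")

# First path segment of the requested URI, e.g. "enwiki", "wikidatawiki", "other".
# Assumes full_uri begins with a path; adjust if it includes scheme and host.
with_dump = logs.withColumn("dump", F.regexp_extract("full_uri", r'^/([^/]+)/', 1))

# Requests per day per dump; the timestamp pattern assumes the standard CLF layout.
daily = (with_dump
         .withColumn("day", F.to_date(F.to_timestamp("ts", "dd/MMM/yyyy:HH:mm:ss Z")))
         .groupBy("day", "dump")
         .count()
         .orderBy("day", "dump"))

daily.show(20, truncate=False)
```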

@Milimetric's effort estimate seems spot on, with the caveat that getting the data to a certain quality (anomalies & bots filtered out) will require some back and forth. IMHO:

  1. Total counts per day (no breakdowns, only minimal filtering/polish) we could do in a day of work.
  2. UA parsing and breakdowns per wiki I'd estimate at more than a day, less than a week.

Would we need to ask a security review for exporting aggregated data out of hadoop?

> Would we need to ask a security review for exporting aggregated data out of hadoop?

As I understood it this data is just for internal use, so I don't think a review is needed. If we publish it, that would involve a whole bunch more work, not just a review.

Regarding the dumps logs that I got sucked into:
Right now this data is still used for this dashboard https://grafana.wikimedia.org/d/000000264/wikidata-dump-downloads?orgId=1&refresh=5m&from=now-2y&to=now
This certainly doesn't need very old log files in order to continue.

@Milimetric The dashboard @Addshore shared is good enough for our purposes. So I guess removal of old logs can proceed.

Moving back to incoming to triage with Olja

odimitrijevic raised the priority of this task from Medium to High. Jul 26 2021, 3:59 PM