
[EPIC] Data definitions and logic for 'Pages Created' downloadable reports
Closed, Invalid · Public

Description

DO NOT USE THIS TICKET. IT HAS BEEN SUPERSEDED BY T206058. THIS IS HERE FOR JOE'S PURPOSES ONLY.

The Pages Created downloadable reports, in Wikitext and CSV formats, give details on all articles created during an event (and, potentially, other pages once we implement a namespace filter). Their purpose is to give event organizers data they can share with partners, grantors, bosses, and event participants to demonstrate an event's scope and impact.

This task defines the total set of data we want for those reports. The subset of data actually included in each report, and the reports' other elements, are defined in two separate tasks:

  • CSV: The first, partial version of the CSV report is defined in T206058.
  • Wikitext: The first, partial version of the Wikitext report is defined in T205502.

Data / column names

  • Title
  • URL
  • Creator
  • Wiki
  • Edits during event
  • Bytes changed during event
  • Pageviews, cumulative
  • Avg. pageviews per day
  • Incoming links
  • More page metrics
Deprecated metrics, not for MVP
  • Description
  • Edits subsequently
  • Words added subsequently
  • Bytes changed subsequently
  • Namespace
  • Still exists? [T206695]
  • Words added during event [T206690]
  • Article class (where available)

Metric definitions

  • Title: Include pages as defined in Default Filter Settings, below. Main namespace only.
  • URL: The URL of the page.
  • Creator: The username of the person who created the article.
  • Wiki: The wiki where the article exists. Limited to the short list of wikis defined on the Event Setup screen for the event.
  • Still exists?: Answers = yes/deleted. Tells whether the page still exists at the time the data was updated. [full details in separate ticket, T206695]
  • Edits during event: The edit count to the article during the event period.
  • Bytes changed during event: The net bytes changed to the page during the event period. Show all numbers with a sign to indicate direction of change. (For this report, all numbers will be positive, but in other reports this number may be negative.)
  • Article class (where available): These rankings are available on five wikis. Each has its own ranking system and codes. Use the codes appropriate to the wiki. [handled separately in TK]
    • For wikis where article class is unavailable, omit the column.
    • For individual articles that don't have a class rating but are in a wiki that does have article classes, answer = unrated.
  • Pageviews, cumulative: Pageviews to the Main-namespace page from creation until the most recent data as of the last data update. (Granularity of the Pageviews API is one day, meaning you always get yesterday's data.) If the user requests stats during the day of creation, we will show "n/a", for "not available", rather than 0, which is misleading.
  • Avg. pageviews per day: In order to provide an accurate picture of how many views the page gets now, rather than over its entire history, this will be an average over the preceding 30 days. If 30 days are not available, use the average of however many days are available. If the user requests stats during the day of creation (when no figures are available), we will show "n/a", for "not available", rather than 0.
  • Incoming links: A count, as of the last data update, of links to the article.
  • More page metrics: Provides a URL that links users to the XTools "Page History" page for that article.
Deprecated metrics, not for MVP
  • Description: Pull from the first sentence of the article, truncated to [100?] characters (not including wikitext).
  • Edits subsequently: The edit count to the article from the end of the event period until the last data update. If the event is ongoing, answer = "ongoing".
  • Words added during event: The net change in words to the given article. [full details in separate ticket, T206690]
  • Words added subsequently: The net change in words to the page from the end of the event period until the last data update. If the event is ongoing, answer = "ongoing". Show all numbers with a + or - sign to indicate direction of change. As above, omit for scripts/languages where not feasible and present as decided.
  • Bytes changed subsequently: The net bytes changed to the page from the end of the event period until the last data update. If the event is ongoing, answer = "ongoing". Show all numbers with a + or - sign to indicate direction of change.
  • Namespace: [leave out of reports until we add Namespace filters]
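The pageview rules above (data is one day behind, a trailing average of at most 30 days, and "n/a" rather than 0 on the day of creation) can be sketched as follows. This is illustrative only; the function and field names are not Event Metrics code.

```python
def avg_pageviews_per_day(daily_views, creation_date, today):
    """Average daily pageviews over up to the 30 days preceding `today`.

    daily_views: mapping of day -> view count (one entry per day of data).
    Returns a float, or the string "n/a" when no full day of data exists
    yet (e.g. the report is requested on the page's creation day).
    """
    # Pageviews API granularity is one day, so the newest usable
    # figure is yesterday's: include only days strictly before `today`.
    available = sorted(d for d in daily_views if creation_date <= d < today)
    if not available:
        return "n/a"  # no full day of data yet; 0 would be misleading
    window = available[-30:]  # at most the preceding 30 days
    return sum(daily_views[d] for d in window) / len(window)
```

The same helper covers both the "fewer than 30 days available" fallback (it just averages whatever days exist) and the creation-day "n/a" case.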

Fixed vs. Continuing Data

Figures like pageviews naturally continue to develop after the event is over and must be calculated anew every time the data is updated. Other figures can be considered fixed once the event period is over; these could be stored and never need to be calculated again. Here is a breakdown of the two types:

Remains fixed

  • Creator
  • Wiki
  • Edits during event
  • Bytes changed during event
  • Words added during event

Continues to develop

  • Title/URL [These may change, though continuity will be maintained by the article ID.]
  • Description
  • Still exists?
  • Edits subsequently
  • Bytes changed subsequently
  • Words added subsequently
  • Article class (where available)
  • Pageviews, cumulative
  • Avg. pageviews per day
  • Incoming links

Default filter settings and logic

The following is the minimum requirement for filtering: Time period AND Wikis AND (Participants OR Categories).

Required filters
  • Time period: Required; to be counted, all edits, uploads, pages created, etc. must have been performed during the time period of the event defined in Event Setup.
  • Wikis: Required; to be counted, all edits, uploads, pages created, etc. must be in wikis defined for the event in Event Setup.
User must supply at least ONE of the following
  • Participants: If the user has provided a list of participants, then metrics will be limited to edits, uploads, pages created, etc. by the specified participants.
  • Categories: If the user has set categories for the event, all edits, uploads, pages created (or their talk pages, as per T200373), etc. must be in those categories.
Logic
  • Logic = AND: The relationship among the filters above is as follows: Wikis AND Time period AND Participants AND Categories. In other words, each type of filter supplied narrows the results. So, if the organizer supplies all four types of filtering info, then all four will be applied, and results will be presented only for articles at the intersection of all four, or of whichever of the four the organizer has supplied.
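The filter logic above can be sketched as a single predicate. The dict fields below are illustrative, not Event Metrics' actual schema.

```python
def passes_filters(edit, event):
    """Decide whether one contribution counts toward the event's metrics.

    Required: time period AND wiki. At least one of participants or
    categories must be configured; each one that is configured narrows
    the results further (AND semantics).
    """
    # Required filters: time period AND wiki.
    if not (event["start"] <= edit["timestamp"] <= event["end"]):
        return False
    if edit["wiki"] not in event["wikis"]:
        return False
    # At least one of participants/categories must be set for the event.
    if not event["participants"] and not event["categories"]:
        raise ValueError("event must define participants or categories")
    # Each supplied optional filter narrows the results further.
    if event["participants"] and edit["user"] not in event["participants"]:
        return False
    if event["categories"] and not set(edit["categories"]) & set(event["categories"]):
        return False
    return True
```

For example, with only a participant list configured, an edit by a listed participant on a listed wiki within the event window passes, and the category check is skipped entirely.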

Related Objects

Mentioned In
T213470: Implement UX for Updating and Downloading
T212547: Add 'Files Uploaded' data to downloadable csv
T210898: Add 'Pages Improved' data to downloadable Wikitext report
T210775: Add 'Pages Improved' data to downloadable csv
T206820: Create a method to fetch total number of bytes changed during an event
T206821: Expand the current Wikipedia query to request number of edits done during an event
T206817: Create a method to fetch page view data
T206692: Implement ‘Event Summary’ downloadable Wikitext report
T206690: Investigate how to put 'Words added' into 'Event Summary' and 'Pages Created' reports
T206576: Create new UI elements: ‘Download report’ menu, 'Popup format selector', Animated 'Working' Indicator'
T206058: Implement 'Pages Created' downloadable csv
T206045: Add new 'Pages Created' data to downloadable CSV report
T205561: Add 'Event Summary' data to downloadable csv report
T205553: Design 'Pages Created' downloadable Wikitext report
T205502: Add 'Pages Created' data to downloadable Wikitext report
Mentioned Here
T200373: Allow Category filtering to include Talk page categories, but count changes only to associated Main namespace pages [timebox 4 hours]
T206690: Investigate how to put 'Words added' into 'Event Summary' and 'Pages Created' reports
T206695: Create a method for 'New page survival rate' and 'Still exists' metrics
T205502: Add 'Pages Created' data to downloadable Wikitext report
T206058: Implement 'Pages Created' downloadable csv
T199189: [2.3] External links/references event stream
T206045: Add new 'Pages Created' data to downloadable CSV report
T12331: Introduce page creation log
T182183: [Pages Created] Redirects are counted as "pages created"
T192739: [Pages Created] Use oldest revision with rev_parent_id = 0

Event Timeline

jmatazzoni renamed this task from Implement Pages Created downloadable csv report to Implement 'Pages Created' downloadable csv report .Sep 26 2018, 4:47 PM
> Format: The report will be downloadable as a CSV file. Do we need to let users pick a delimiter? What does Grant Metrics do now?

I'd say not for the first MVP. To make this viable, we'll need a settings page and that's out of scope.

> The default order will be alphabetical [how hard is it to ignore articles and such?]

What do you mean?

> Descriptive info and settings
>
> [snip]
> Location:

We aren't asking for this detail in the event creation yet, that should be in a different ticket.

> Date of last data update

This is misleadingly harder than it looks; We only store the numbers in the DB, not the full information, and we are not displaying the page title and specific (per page) information yet. This means that when you hit "download", the information needs to be regenerated -- so the "last update" will always be "now."

> Filter settings for download:
> Article Worklist: Applied/Not applied/None supplied

We're not asking or accepting any worklist. This should be skipped (requires a separate ticket when relevant)

> Categories: Applied/Not applied/None supplied. If applied, list categories [only those actually entered by the user, not subcategories included automatically]

Just a small comment -- the wikis will have to be listed alongside the categories, because categories relate to wikis.

> Participants list: Applied/Not applied/None supplied
> Wikis: list

If the event has *.wikipedia.org, we display that, and not the 300 wikis under it, right? (Verifying)

> When we have filters, what is the minimum? As we add download controls and users gain the ability to turn filters off, we'll have to face the question: what is the minimum before performance degrades too much? We've had one request from a user already to be able to turn Time Period off and just get all data about a Category, for example. This would be something that would happen on a per-download basis (i.e., not across the board for all metrics at once). Would it work? @MusikAnimal, @Mooeypoo?

We actually don't know this yet; we'll have to run some tests and figure it out as we go, especially when the level of data we're fetching is going up.

> URL of the page. I don't know whether we have the ability to set default column width for downloads, but if we do there is no need to show the whole URL. [Is this possible?]

Yup, should be doable.

> Description: Pull from the article the first sentence, truncated to X characters. [X to be determined; @Mooeypoo, what did we do in Notifications? I remember we had to increase the allotment at some point because it was too few for some scripts/languages. On English, for reference, 100 chars would be plenty. That is, of course, counting display text only, not wikitext.] If this gets too complex, we can drop it, but it is a useful feature for users.

Having truncated text shouldn't be hard in itself; the API can give us only the beginning of the article (it would just mean a whole bunch of API requests for each page alongside the current API/SQL that we're running). It might make the job unreasonably longer.

So, two issues with this:

  1. The query can take significantly longer with this feature if there are a lot of pages. Getting the description of each page for 50,000 pages is significant.
  2. Even if we truncate, we are adding a lot of text per row for the CSV file, and it could very quickly become unreasonably large. We need to check how large the csv can be as it is, but also avoid making it so big that we won't be able to produce it for the user to download.

> Creator: The username of the person who created the article.

Might be misleadingly harder than we think (need to check on this). I don't think the API gives us the "creator" of the article, since Wikis are inherently collaborative; we'll have to fetch the first revision and that might be expensive for edits of already existing pages.

Needs some more input. @MusikAnimal, does XTools do this?

> Namespace: Until we add filtering controls, this will be Main only. [Should we include it as a marker for later or leave it out?]

We'll add other namespaces later; we can either have this in but have everything with "main" or just hold off on this data point until we can allow other namespaces.

> Edits during event: The edit count to the article during the event period.
> Edits subsequently: The edit count to the article from the end of the event period until the last data update. If the event is ongoing, answer = "ongoing".

We're not storing the difference or any data about this currently. If we want to add this information we'll need to start storing and seeing how to manage this.

Definitely requires engineering thinking, and a separate ticket. Perhaps will mean making this not a part of the MVP.

Same for "Bytes changed subsequently" and anything else that has "subsequently".

> Words changed during event: The net change in words to the page during the event period. Show all numbers with a sign to indicate direction of change. In our discussions, the point was raised that this calculation may be feasible for some scripts/languages more than others. We will implement for the scripts where we can and leave others for later.
> For scripts/languages where this calculation is not feasible, we can either 1) omit the column, which is preferred, or 2) answer = unavailable. [@Mooeypoo, should I make a separate ticket to investigate this?]

This is definitely a separate ticket, and I am not sure it's that easy to do with our consideration for loads. This would mean we'll have to parse the diffs and more or less guess about words (using a space delimiter is a good start but is actually not entirely accurate, even in English :)

This is a significant change. I'd reconsider including this in the MVP. Either way, it requires a separate ticket.

> Article class (where available): These rankings are available on five wikis. Each has its own ranking system and codes. Use the codes appropriate to the wiki.
> For wikis where article class is unavailable, we can either 1) omit the column, which is preferred, or 2) answer = unavailable. [@Mooeypoo, should I make a separate ticket to investigate this?]

Requires a separate ticket for investigation on how to do that properly, and how to make sure we are setting things up per wiki where this exists. Not trivial.

> Incoming links: A count, as of the last data update, of links to the article.

Only concern: Adding loads, though if we only want a number of links, it might be okay. @MusikAnimal might have insight on this?

> Creator: The username of the person who created the article.
>
> Might be misleadingly harder than we think (need to check on this). I don't think the API gives us the "creator" of the article, since wikis are inherently collaborative; we'll have to fetch the first revision and that might be expensive for edits of already existing pages.
>
> Needs some more input. @MusikAnimal, does XTools do this?

https://xtools.wmflabs.org/pages

It goes by revisions with rev_parent_id = 0, which is what we're currently doing in Event Metrics for counting the pages created. There are known caveats, such as T182183 and T192739 but for the most part it is quite accurate. I am sure you could go by MIN(rev_timestamp) but I have yet to craft such a query (please enlighten!). The new page creation log (T12331) is a bit more exacting but it also doesn't handle redirect → article and vice versa. I think this would be an uncommon scenario for event participants anyway.
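As a sketch of the rev_parent_id = 0 approach described above, a simplified query against the public MediaWiki schema might look like the following. This is illustrative only: the real query would also apply the participant/category filters, use bound parameters, and handle the caveats in T182183 and T192739.

```python
def pages_created_sql(start_ts, end_ts):
    """Build a query for pages created during the event window, using
    the rev_parent_id = 0 heuristic.

    Timestamps are MediaWiki-style YYYYMMDDHHMMSS strings. In real code,
    pass them as bound parameters rather than interpolating.
    """
    return f"""
        SELECT page_title, rev_timestamp
        FROM revision
        JOIN page ON page_id = rev_page
        WHERE rev_parent_id = 0          -- first revision of the page
          AND page_namespace = 0         -- Main namespace only
          AND page_is_redirect = 0       -- see T182183: skip redirects
          AND rev_timestamp BETWEEN '{start_ts}' AND '{end_ts}'
    """
```

The redirect exclusion here only covers pages that are redirects now; the redirect-to-article conversions mentioned above would still slip through, as noted.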

> Incoming links: A count, as of the last data update, of links to the article.
>
> Only concern: Adding loads, though if we only want a number of links, it might be okay. @MusikAnimal might have insight on this?

Surprisingly it's very fast. There is an XTools API for it too, though we'll probably want to write our own query.

> Article class (where available): These rankings are available on five wikis. Each has its own ranking system and codes. Use the codes appropriate to the wiki.
> For wikis where article class is unavailable, we can either 1) omit the column, which is preferred, or 2) answer = unavailable. [@Mooeypoo, should I make a separate ticket to investigate this?]
>
> Requires a separate ticket for investigation on how to do that properly, and how to make sure we are setting things up per wiki where this exists. Not trivial.

Most of the dirty work has been done in XTools. You can use the project/assessments API endpoint to get a list of supported wikis and the available assessments. You can get assessments for a set of specific articles using the page/assessments endpoint, which gives you the localized name of the article class, the associated colour, and the icon. I believe the Dashboard is using the same API, or at least the same configuration.
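For reference, a minimal URL builder for the page/assessments endpoint mentioned above. The exact route shape, including the pipe-separated article titles, is an assumption to verify against the current XTools API docs.

```python
XTOOLS_API = "https://xtools.wmflabs.org/api"  # base URL as of this discussion

def assessments_url(project, articles):
    """URL for the XTools page/assessments endpoint (assumed shape:
    /api/page/assessments/{project}/{title1|title2|...}).
    """
    return f"{XTOOLS_API}/page/assessments/{project}/" + "|".join(articles)
```

A consumer would fetch this URL and read the per-article class name, colour, and icon out of the JSON response.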

> Words changed during event: The net change in words to the page during the event period. Show all numbers with a sign to indicate direction of change. In our discussions, the point was raised that this calculation may be feasible for some scripts/languages more than others. We will implement for the scripts where we can and leave others for later.
> For scripts/languages where this calculation is not feasible, we can either 1) omit the column, which is preferred, or 2) answer = unavailable. [@Mooeypoo, should I make a separate ticket to investigate this?]
>
> This is definitely a separate ticket, and I am not sure it's that easy to do with our consideration for loads. This would mean we'll have to parse the diffs and more or less guess about words (using a space delimiter is a good start but is actually not entirely accurate, even in English :)
> This is a significant change. I'd reconsider including this in the MVP. Either way, it requires a separate ticket.

Agreed that this will slow things down noticeably. But if we do get to it, I find parsing HTML content to be more exacting in getting word counts of prose, as then it is easier to exclude things like infoboxes and references. We do the same thing in XTools, available via the page/prose API endpoint. The MediaWiki text extracts API does something similar but we do not need content, just numbers.

> Words changed during event: The net change in words to the page during the event period. Show all numbers with a sign to indicate direction of change. In our discussions, the point was raised that this calculation may be feasible for some scripts/languages more than others. We will implement for the scripts where we can and leave others for later.
> For scripts/languages where this calculation is not feasible, we can either 1) omit the column, which is preferred, or 2) answer = unavailable. [@Mooeypoo, should I make a separate ticket to investigate this?]
>
> This is definitely a separate ticket, and I am not sure it's that easy to do with our consideration for loads. This would mean we'll have to parse the diffs and more or less guess about words (using a space delimiter is a good start but is actually not entirely accurate, even in English :)
> This is a significant change. I'd reconsider including this in the MVP. Either way, it requires a separate ticket.
>
> Agreed that this will slow things down noticeably. But if we do get to it, I find parsing HTML content to be more exacting in getting word counts of prose, as then it is easier to exclude things like infoboxes and references. We do the same thing in XTools, available via the page/prose API endpoint. The MediaWiki text extracts API does something similar but we do not need content, just numbers.

I feel like this has been a long desired set of data, especially for classroom assignments where prose length is a bit more interesting than bytes (which include anything). It also helps folks measure differently considering the size of references (which on big projects, take up huge chunks of the bytes)

Have we thought at all about counting ref tags added? References are a measure of quality of content in some Wikipedias.

> Have we thought at all about counting ref tags added? References are a measure of quality of content in some Wikipedias.

That is, I assume, the same as citations? If so, yes, we've debated quite a bit. Counting these would be very desirable but they are not in the metadata. So we would have to parse the wikitext, which is too slow and resource intensive.

aezell renamed this task from Implement 'Pages Created' downloadable csv report to Gather 'Pages Created' data and store in Event Metrics database.Oct 2 2018, 9:37 PM
aezell updated the task description. (Show Details)

Here's an idea: Would it be useful to link to the XTools Page History tool in each row? Most of what we have listed here XTools offers, so if there's any stats in particular we have to omit for performance reasons, we can at least link to a tool where they can get that data. You can also give XTools the date range so it only shows data relevant to the event. Obviously it's preferable to have the numbers directly in the CSV export so they could do sorting and other calculations, but I assume a link is better than nothing? And of course it'd promote another Community Tech (and volunteer) tool 😉
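A per-row link of that kind could be built as below. The articleinfo route shape and YYYY-MM-DD date format are assumptions based on XTools' URLs; verify against the live tool before shipping.

```python
from urllib.parse import quote

def page_history_link(project, title, start, end):
    """Per-row 'More page metrics' link to XTools' Page History tool.

    Assumed route: /articleinfo/{project}/{page}/{start}/{end},
    with dates as YYYY-MM-DD scoping the stats to the event window.
    """
    return (f"https://xtools.wmflabs.org/articleinfo/{project}/"
            f"{quote(title.replace(' ', '_'))}/{start}/{end}")
```

Passing the event's start and end dates means the linked page shows only data relevant to the event, as suggested above.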

jmatazzoni renamed this task from Gather 'Pages Created' data and store in Event Metrics database to Implement 'Pages Created' downloadable csv report.Oct 3 2018, 12:36 AM
jmatazzoni updated the task description. (Show Details)

> Have we thought at all about counting ref tags added? References are a measure of quality of content in some Wikipedias.
>
> That is, I assume, the same as citations? If so, yes, we've debated quite a bit. Counting these would be very desirable but they are not in the metadata. So we would have to parse the wikitext, which is too slow and resource intensive.

Research is creating an event stream focused on adding URLs as references, I think, in order to support work like: http://blog.archive.org/2018/10/01/more-than-9-million-broken-links-on-wikipedia-are-now-rescued/

@DarTar @Samwalton9 do you have a ticket for that yet?

> Agreed that this will slow things down noticeably. But if we do get to it, I find parsing HTML content to be more exacting in getting word counts of prose, as then it is easier to exclude things like infoboxes and references. We do the same thing in XTools, available via the page/prose API endpoint. The MediaWiki text extracts API does something similar but we do not need content, just numbers.

Yeah I'm not saying it's impossible, but we will need to think about how to do this while keeping performance up; my recommendation is to first go with the other metrics, make sure our performance is solid, make sure we do things in a sustainable way, and then we can take a look at either the word count and/or the counting of ref tags, etc, that require going over each diff per calculation.

jmatazzoni renamed this task from Implement 'Pages Created' downloadable csv report to Implement 'Pages Created' downloadable reports.Oct 11 2018, 4:31 PM
jmatazzoni updated the task description. (Show Details)
jmatazzoni renamed this task from Implement 'Pages Created' downloadable reports to Data definitions and logic for 'Pages Created' downloadable reports.Oct 11 2018, 7:09 PM
jmatazzoni updated the task description. (Show Details)
jmatazzoni renamed this task from Data definitions and logic for 'Pages Created' downloadable reports to EPIC Data definitions and logic for 'Pages Created' downloadable reports.Oct 11 2018, 11:44 PM
Niharika renamed this task from EPIC Data definitions and logic for 'Pages Created' downloadable reports to [EPIC] Data definitions and logic for 'Pages Created' downloadable reports.Oct 12 2018, 2:31 AM
Niharika added a project: Epic.

> Here's an idea: Would it be useful to link to the XTools Page History tool in each row? Most of what we have listed here XTools offers, so if there's any stats in particular we have to omit for performance reasons, we can at least link to a tool where they can get that data. You can also give XTools the date range so it only shows data relevant to the event. Obviously it's preferable to have the numbers directly in the CSV export so they could do sorting and other calculations, but I assume a link is better than nothing? And of course it'd promote another Community Tech (and volunteer) tool 😉

This seems like a smart idea, and Niharika says it's not at all hard. So I'm adding it to the document definition and to the first release (defined in T206058).

@jmatazzoni For "Edits during event" and "Bytes changed during event", I assume we want to limit it to the participants (if there are any)?

"Words added during event" can't be restricted to participants, for performance reasons.

In T205363#4945840, @MusikAnimal wrote:

> @jmatazzoni For "Edits during event" and "Bytes changed during event", I assume we want to limit it to the participants (if there are any)?
>
> "Words added during event" can't be restricted to participants, for performance reasons.

Don't look at this ticket; all the definitions and info you need should be in the CSV ticket for this report. (as I'm pretty sure it says in the ticket.)

Yes, filters should apply to all metrics, including bytes and edits. Words added has been dropped (which is why you shouldn't look at this ticket.)

jmatazzoni updated the task description. (Show Details)