
Research talk:Social media traffic report pilot

From Meta, a Wikimedia project coordination wiki
Latest comment: 8 months ago by Isaac (WMF) in topic General feedback goes here

Please provide feedback on the English Wikipedia social media traffic report here.

General feedback goes here


Questions, comments, concerns, suggestions that don't fit into any of the sections below

  • I was surprised to see the report published on the English Wikipedia rather than Meta. Is this just for the pilot, or do you think that can scale to all projects? Thanks, Nemo 19:46, 23 March 2020 (UTC)Reply
  • Is the YouTube traffic coming via their official use of Wikipedia (as context beneath videos) or through user comments? (I had heard about the former but I can't recall the last time I saw such usage on a video.) If these can be distinguished, I think it would make an analytic difference. czar 00:49, 30 March 2020 (UTC)Reply
  • Czar, unfortunately we can't distinguish whether the traffic is coming from video descriptions, creator pages, comments, or banners added to videos by YouTube itself. A lot of the perennial top traffic hits from YouTube are news organizations; this is likely because YouTube has a policy of linking to the Wikipedia article for news providers on high-traffic videos from that provider. But that's just an assumption on my part. Cheers, Jmorgan (WMF) (talk) 23:53, 30 March 2020 (UTC)
  • This is a great idea! Quick thought: Are you planning on running this as a randomized controlled trial? I think that could make your analysis more powerful. The idea would be to not put on the list all articles that received at least 500 views from social media, but only a randomly sampled fraction (e.g., 50%) of those. Then you could compare those to the held-out sample of articles that would usually have made it onto the list but didn't because of randomization, and quantifying the impact of the list would be a breeze. If you don't randomize, but publish the complete list, you'll have to deal with all sorts of unobserved confounds, which will challenge the validity of the analysis. Ciao, Bob West. --Cervisiarius (talk) 07:41, 31 March 2020 (UTC)
  • Just to add another note -- we have a strict threshold of 500 pageviews for security reasons but retain data privately for pages with fewer than 500 pageviews, so there is also the opportunity to do a regression discontinuity analysis -- e.g., compare pages with 500-600 pageviews with pages with 400-500 pageviews. This obviously is not as powerful as a randomized controlled trial, but it also means we don't have to withhold information that we could otherwise be sharing. --Isaac (WMF) (talk) 00:29, 1 April 2020 (UTC)
  • That's a great idea, Isaac! On top of comparing [400,500[ to [500,600[, you could also try checking for a "dose-response" relationship: create more fine-grained buckets (as much as the data allows; I'm using a width of 10 in the example to follow) and then see if the jump from [490,500[ to [500,510[ is significantly larger than that between buckets that don't cross the discontinuity (e.g., [480,490[ and [490,500[). That said, if I were you, I'd still go for the randomized controlled trial if at all possible... ;) --Cervisiarius (talk) 09:58, 4 April 2020 (UTC)
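The bucketed "dose-response" check described above can be sketched as follows. This is a minimal illustration, not the team's actual analysis code: the (pageviews, outcome) pairs are hypothetical, and the choice of outcome (e.g., next-day edit count) is up to the analyst.

```python
from collections import defaultdict

def bucket_means(pages, width=10, lo=400, hi=600):
    """Mean outcome per pageview bucket of the given width.

    `pages` is an iterable of (pageviews, outcome) pairs; buckets are
    half-open intervals [b, b + width), keyed by their lower bound b.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for views, outcome in pages:
        if lo <= views < hi:
            b = lo + ((views - lo) // width) * width
            sums[b] += outcome
            counts[b] += 1
    return {b: sums[b] / counts[b] for b in sums}

def adjacent_jumps(means, width=10):
    """Difference in mean outcome between consecutive buckets.

    Under a dose-response reading, the jump from [490, 500) to [500, 510)
    should stand out against the jumps that don't cross the threshold.
    """
    return {b: means[b + width] - means[b]
            for b in sorted(means) if b + width in means}
```

With synthetic data, the jump at the 500-pageview discontinuity can then be compared directly against the jumps between the other adjacent buckets.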

The page has a notice that "This report will no longer be regularly maintained as of 31 May 2020." but still seems to be regularly maintained. Is the template outdated, or is it still going away? — Rhododendrites talk \\ 14:27, 1 November 2020 (UTC)Reply

(@Isaac (WMF):) @Rhododendrites: thanks for the heads-up. I believe Isaac has started the reports up again for a limited time. Cheers, Jtmorgan (talk) 21:08, 1 November 2020 (UTC)Reply
Thanks for checking -- to add to what Jtmorgan said, we had a request to restart the pilot for another few weeks around the US election. It will likely be taken offline again in the next few weeks though (hence the not regularly maintained aspect). --Isaac (WMF) (talk) 13:14, 2 November 2020 (UTC)Reply

@Isaac (WMF): Old page/project, but some new questions if you have the time. I'm interested in the infopanels on YouTube that link to Wikipedia (for topics and for media outlets):

  1. Has YouTube shared a master list somewhere of those topics/outlets? Perhaps we can presume all of them got at least one click during the HostBot run such that the HostBot dataset is all-inclusive?
  2. Has YouTube shared any other metrics about those infopanels? (Whether you can share them or not, I understand, may be a different question.)
  3. When HostBot stopped posting, did you continue to retrieve the data? (i.e. does data exist for the whole period?)

For context, I was thinking about a research project involving US elections/voting misinformation on YouTube and was curious how often people click those links... — Rhododendrites talk \\ 01:57, 21 March 2024 (UTC)Reply

@Rhododendrites attempts below at answering your questions:
  • When HostBot stopped posting, did you continue to retrieve the data: yes and no -- we shut down the internal job too, so I don't have data past December 2021. But in 2023, we added some general support for this within the internal pageview datasets, so we do have this data more recently, but it's not exposed publicly anywhere. I know that's not satisfying, but you could connect with Pablo for instance (office hours) to see if there's any way to coordinate a project around it.
  • Master list of topics/outlets: at one point YouTube had shared some of those details with us at WMF. I don't think I have an updated version though and generally they did not want them to be made public. It was just the articles being linked to as well -- not the videos the links are on.
  • Clicks = sufficient to identify dataset: I looked at December 2021, and about half of the pages that show up with any pageviews exceeded the 500-pageview threshold at least once during the month (and so were published in the traffic report). That said, the set of linked pages has likely expanded since this was turned off at the end of 2021. I suspect you're right more generally though -- if a page was linked, it would show up in our pageviews at least once.
  • Other metrics about those infopanels: I don't think they shared metrics on them but it's also been a few years so I might not remember.
Generally, I think it's definitely an interesting space. You can see my evaluation from the time. I focused on the Wikipedia side because we have the data for it -- unfortunately, I don't know of any way to connect the pageviews to the specific videos where these links originate. Also, they were expanding to other language editions, FYI, but I'm not sure whether that continued or not.
Hope this helps! Isaac (WMF) (talk) 18:53, 21 March 2024 (UTC)Reply

New column suggestions go here


Suggestions of new columns to include in the report (e.g. ORES quality scores, historical traffic averages per article)

Platform traffic as a percentage of all traffic

  • Would it be useful to have a calculation of which percentage of all traffic is coming from a specific platform? (I.e. (Platform traffic / All traffic) * 100%). And should the report possibly be filtered on that instead of just on ">500 views from that platform"? Rchard2scout (talk) 13:08, 24 March 2020 (UTC)Reply
  • Rchard2scout this is a good suggestion. Thanks! I can definitely add a platform_percent_of_total_current_day column. That would make it easier for people to sort the table by the articles that are receiving the highest, or lowest, percent of their traffic from a particular platform. We can do that and still keep the "> 500" values for the previous day (the reason we have that is just that we aren't able to report previous day counts that are less than 500, for privacy reasons). Let me know if you have additional thoughts on that. Cheers, Jmorgan (WMF) (talk) 23:22, 25 March 2020 (UTC)Reply
  • From Stuart A. Yates on wikiresearch-l on 2020/03/23: "My immediate thought is how to connect this to the wiki projects for each article, because wiki projects are the primary sources of expert knowledge and have the resources to deal with many issues." Jmorgan (WMF) (talk) 19:54, 24 March 2020 (UTC)Reply
Some WikiProjects already post about an influx of traffic potentially coming from an event. It would be different for each project. At the very least, it could be interesting to have a bot note current traffic spikes on the article's talk page, so that regular stewards of the page at least have some idea whence the traffic comes and can start a discussion on how long it might be sustained. czar 00:52, 30 March 2020 (UTC)
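The platform_percent_of_total_current_day column suggested above amounts to a one-line calculation. A minimal sketch (the row keys here are hypothetical, not the report's actual column names):

```python
def platform_percent_of_total(platform_views, total_views):
    """Percentage of an article's total daily traffic coming from one
    platform: (platform traffic / all traffic) * 100."""
    if total_views <= 0:
        return 0.0
    return 100.0 * platform_views / total_views

def sort_by_platform_share(rows):
    """Sort report rows (dicts with hypothetical 'platform_views' and
    'total_views' keys) by descending platform share."""
    return sorted(rows,
                  key=lambda r: platform_percent_of_total(
                      r["platform_views"], r["total_views"]),
                  reverse=True)
```

Sorting by this share rather than by raw platform counts surfaces quiet articles whose traffic is dominated by a single platform, which is the case the report is most interested in.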

Number of edits

  • As I understand it a major use of this will be to see whether there are disruptive edits associated with the additional traffic. If possible could you include the number of edits for that day, or even the number of IP edits and the number of edits by registered users. Smallbones (talk) 22:50, 24 March 2020 (UTC)Reply
  • Smallbones Thank you! This is a great suggestion. I'm considering adding a "number of edits in the past 24 hours to this article that ORES#Advanced_support predicts are likely damaging". These are the same filters available in the Recent Changes feed. I think that this would serve the same purpose, but without putting good-faith edits by IPs or new editors under unfair scrutiny. Do you think that would address the basic need you're articulating here? Cheers, Jmorgan (WMF) (talk) 23:14, 25 March 2020 (UTC)Reply
    • It might just come down to whichever is easiest to get. I don't think people will normally assume all IP edits are made in bad faith. OTOH if there are 50 IP edits made to an article that normally gets 1 edit per month, that would indicate a possible problem whether it's good faith or not. If the damaging edit prediction is working well and easy to get, then it should probably work as well. I doubt that the type of edits we'd be looking for can be subtly indicated by the referrer to beat the system. BTW I may put a paragraph or 2 in The Signpost about this, unless you object. If you want to email me a short sentence or three about the pilot, I won't have to just paraphrase what's on these pages. Or I may contact you in a couple of days. Smallbones (talk) 23:29, 25 March 2020 (UTC)
      • Smallbones I plan on implementing some version of this over the next week or two, and I'll keep you posted. Re: the next Signpost (whenever the next one comes out; I noticed one went out today): Here's a blurb/summary: "The social media traffic report is intended to help editors identify articles that are either going viral, or are being used by social media platforms to "fact check" misinformation posted by their users. In both of these scenarios, previously quiet Wikipedia articles may receive a huge influx of traffic all at once. Until now, editors had no easy way of monitoring these spikes in near-real time unless the social media spike also corresponded to an overall traffic spike that would be visible in the public page traffic reports. A sudden surge may result in bad faith and/or otherwise damaging edits. In some cases, a spike in traffic from a particular social media platform may even reflect a coordinated attempt to insert disinformation into Wikipedia. The WMF Research team thought that these four platforms in particular would be good initial candidates for this data release, but we're eager to hear additional suggestions. In the near future, we'll be rolling out a reporting form so that editors can flag suspicious diffs that they encounter while browsing the pages on the traffic report. Specific examples help the research team understand what disinformation campaigns on Wikipedia might look like, which in turn will help us develop machine learning models that can detect this kind of activity automatically and dashboards or other tools where these edits can be flagged for editor review. If the social media traffic report proves useful, we're considering making it available long-term, and on multiple Wikipedias." Cheers, Jmorgan (WMF) (talk) 23:48, 30 March 2020 (UTC)Reply
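The likely-damaging count discussed above could be derived from the ORES scores API. A rough sketch, assuming a parsed response in the shape returned by the v3 batch endpoint (e.g. https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging&revids=...); the 0.7 cutoff is an arbitrary illustration, not the threshold the Recent Changes filters actually use:

```python
def count_likely_damaging(ores_response, wiki="enwiki", threshold=0.7):
    """Count revisions whose ORES 'damaging' probability meets or exceeds
    `threshold`, given a parsed v3 batch response."""
    n = 0
    for rev in ores_response[wiki]["scores"].values():
        score = rev["damaging"].get("score")  # absent if scoring errored
        if score and score["probability"]["true"] >= threshold:
            n += 1
    return n

# Hypothetical parsed response for two revisions:
sample = {"enwiki": {"scores": {
    "1001": {"damaging": {"score": {"prediction": False,
                                    "probability": {"true": 0.08, "false": 0.92}}}},
    "1002": {"damaging": {"score": {"prediction": True,
                                    "probability": {"true": 0.91, "false": 0.09}}}},
}}}
```

The revision IDs for "the past 24 hours" would come from the article's revision history via the MediaWiki API before being scored in a batch.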
Top referring link

Are you able to pull the URL of the top referring link? E.g., if a page has gone viral via Reddit, can you link the post that is trending? czar 00:46, 30 March 2020 (UTC)Reply

Czar, no, because we enforce HTTPS and our webrequest logs don't provide any granularity beyond the referring platform. Cheers, Jmorgan (WMF) (talk) 23:50, 30 March 2020 (UTC)Reply

New editor conversion


Is there any way to determine the number of editors who registered for an account after visiting the listed page? I would guess not given your statement about enforcing HTTPS (which is a good thing and should not be changed), but it would be interesting to know the effect viral posts like these have on editor recruitment. Wugapodes (talk) 18:41, 1 April 2020 (UTC)Reply

Wugapodes I agree this would be interesting and potentially useful to know. It's possible we could make this determination using the raw webrequest and event logs (which are not public), as long as the person who visited the page created their account within the same browsing session (i.e. they didn't close the tab/window between viewing the article and clicking "sign up"). This data could not be published at the individual-editor level, but it is possible we could publish the aggregated results of such an analysis. Thanks for the suggestion! Jmorgan (WMF) (talk) 17:27, 2 April 2020 (UTC)

Current Protection Level


Would it be possible to include the current protection level in the report? For our articles that are linked from reputable sources (CNN, BBC, WHO, etc) I wouldn't imagine we would have too much to worry about, but the further down we go on the social media list (Facebook, Reddit, "4chan", etc) the greater the risk would be that the edits are being added to upset or disrupt an article. Having a quick column to see what the current protection level is and perhaps when it was added and/or when it will expire could help make this a useful tool for admins to get out in front of efforts by less benevolent social sites to change content here. TomStar81 (talk) 19:15, 1 April 2020 (UTC)Reply

TomStar81 this is an excellent suggestion. I'll look into how it might be implemented. Thank you! Jmorgan (WMF) (talk) 17:22, 2 April 2020 (UTC)Reply
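For reference, the protection level is publicly available from the MediaWiki API via action=query&prop=info&inprop=protection. A sketch of parsing one page object from that response (the sample dict mirrors the documented response shape):

```python
def edit_protection(page_info):
    """Return (level, expiry) of a page's edit protection, or (None, None)
    if the page is unprotected. `page_info` is one page object from the
    MediaWiki API (action=query&prop=info&inprop=protection)."""
    for entry in page_info.get("protection", []):
        if entry.get("type") == "edit":
            return entry.get("level"), entry.get("expiry")
    return None, None

# Hypothetical page object for a semi-protected article:
page = {"title": "Example",
        "protection": [{"type": "edit", "level": "autoconfirmed",
                        "expiry": "infinity"}]}
```

The expiry field distinguishes indefinite protection ("infinity") from temporary protection, which would cover the "when it will expire" part of the suggestion.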
Twitter link search

Twitter's link search is often useful for identifying the likely reason for a traffic spike. To take two examples from the current report:

Regards, HaeB (talk) 06:06, 31 October 2021 (UTC)Reply

Thanks @HaeB: -- I wasn't actually aware of that search functionality. I'll add it to the header as a tip for now. I've left the page up for quite some time beyond when we intended and want to do some rethinking of the scope/format of this report and how to best maintain it. I plan to take it down early in the new year but maybe for future versions, we could add an "Investigate" column with this Twitter search link and equivalents for other platforms -- e.g., [3] for Reddit. --Isaac (WMF) (talk) 21:00, 23 December 2021 (UTC)Reply
Quick follow-up: I went to add the links and realized that Twitter search links at least are blocked so unfortunately that might not be an easy addition. --Isaac (WMF) (talk) 21:16, 23 December 2021 (UTC)Reply
Thanks for looking into it! That's a pity about the link block (and probably a reason for reconsidering it at some point; it looks like such a use had not been considered when the blacklist entry was originally added).
I'm a bit confused right now about why the example links above saved OK (considering that the entry is on the global spam blacklist, which applies here on Meta-wiki too). Maybe that's a manifestation of phab:T251047.
Regards, HaeB (talk) 09:39, 26 December 2021 (UTC)Reply
I think I found the explanation (Meta whitelists Twitter search URLs, though it seems not entirely intentional): MediaWiki:Spam-whitelist --Isaac (WMF) (talk) 15:31, 10 January 2022 (UTC)
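If the blacklist issue is ever resolved, the proposed "Investigate" column could be generated along these lines. The search-URL formats are assumptions based on each site's public search pages, not documented APIs:

```python
from urllib.parse import quote

def investigate_links(title, lang="en"):
    """Per-platform search URLs for an article -- a sketch of the proposed
    'Investigate' column."""
    article_url = f"{lang}.wikipedia.org/wiki/{title.replace(' ', '_')}"
    return {
        # Twitter's link search accepts a URL as the query string
        "twitter": "https://twitter.com/search?q=" + quote(article_url, safe=""),
        # Reddit supports a url: prefix in its search box
        "reddit": "https://www.reddit.com/search/?q=" + quote("url:" + article_url, safe=""),
    }
```

Equivalents for other platforms could be added to the same dict as their search pages allow.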

Social media platform suggestions go here


Suggestions of social media platforms to include in the report (e.g. MySpace, Friendster, Vine)


Design and formatting suggestions go here


Suggestions about how to make the report more usable (e.g. highlight some cells, make it more mobile-friendly)