
Build a taxonomy for "impactful topics"
Open, Needs Triage, Public

Description

We are looking at improving the taxonomy of topics we use to classify articles so it better reflects topics that are relevant to the community.

  • At the moment, the taxonomy includes an expanded version of the second level of the WikiProject directory taxonomy.
  • We would like to stick to WikiProjects as a reference unit because
    • They are widely adopted by the community as a way to organize labour
    • They provide high-quality label data to train topic classifiers
  • The idea is that we can expand the existing topic taxonomy to include more WikiProjects whose topics are considered relevant/impactful by the community
  • The above implies setting up community consultations to define what those topical categories are.
  • Once (a part of) this set of topical categories is finalized, we can retrain our topic models so that they can classify articles according to a more impactful set of topics.

Details

Other Assignee
Astinson

Event Timeline

@Astinson hi! We had originally created this task to start a conversation about creating the "taxonomy of topics". I am happy to modify it based on what we discussed yesterday and assign it to you!

@Miriam Sounds good! Yeah, we have two layers of work that I think need to happen: first, examining the topic areas where we have signal from the Grants space and seeing if we can build for those (which @Rmaung has the most recent data on), and then thinking about how we can gain the most insight from the communities we do have to improve the overall data model to reflect on- and off-wiki organizing beyond enwiki. I am going to work with Isaac next week to figure out how complex some of the methodologies might be, and propose a timeline or process.

Summary of some data analysis I did for evaluating the current topic taxonomy and gathering some thoughts about potential changes (google doc with more data/notes):

  • We'll have to do some cleaning up of the WikiProject->Topic mapping as WikiProject names etc. have shifted since it was created in 2020. For example, WikiProject Climate Change used to be a task force of WikiProject Environment (I think) and then became its own project so is not currently tracked in the data. This seems pretty doable though as a one-time manual pass and allocation of larger WikiProjects to specific topics.
  • Big changes that I think we are pretty certain about:
    • Shifting of geographic topics to a country-based model. This will allow for more granularity than the current regions and will incorporate data from Wikidata, so it builds on that community work.
    • Shifting of model-based outputs for people (biography/women topics) to a Wikidata-based output (deterministic based on instance-of:human and gender properties). This will lose some of the hazier, women-related topics that the model-based women topic could surface but be clearer (less likely to provide problematic predictions) and we will see about addressing some of this change with the new topics.
  • A number of small changes to the arts/science topics -- e.g., perhaps merge a few categories that get low usage and have low coverage.
  • The larger discussion will be around how to handle some of the existing history/society topics and what topics are possible for folks engaged in sustainability and human rights work.
  • Expanding the data pipeline to incorporate WikiProjects from other language editions wouldn't have a large effect at the moment (most major wikiprojects with coverage of non-English articles are for geographic/biographical topics and only a few are in areas where we probably do need more diverse data like history/society topics). But this might be useful for certain topics if we do have low data volume/diversity from English and we know there are relevant WikiProjects in other language editions supported by PageAssessments.
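
The deterministic, Wikidata-based people labels proposed above could look roughly like the sketch below. This is a minimal illustration, not the actual pipeline: the label names, the `people_labels` helper, and the shape of the `claims` input are all assumptions, and the set of gender values that should count towards the women label is a policy choice that is deliberately left minimal here. The property and item IDs (P31 instance of, P21 sex or gender, Q5 human, Q6581072 female) are real Wikidata identifiers.

```python
# Real Wikidata identifiers.
P_INSTANCE_OF = "P31"
P_SEX_OR_GENDER = "P21"
Q_HUMAN = "Q5"

# Assumption: which gender values map to the women label is a policy
# decision; this set is intentionally minimal and likely incomplete.
WOMEN_GENDER_VALUES = {"Q6581072"}  # female

# Hypothetical label names, for illustration only.
LABEL_BIOGRAPHY = "biography"
LABEL_WOMEN = "women"


def people_labels(claims):
    """Return people-related labels for one Wikidata item.

    `claims` is assumed to be a dict mapping a property ID to the set of
    item IDs the item has for that property.
    """
    labels = []
    if Q_HUMAN in claims.get(P_INSTANCE_OF, set()):
        labels.append(LABEL_BIOGRAPHY)
        # The women label is only considered for items that are humans.
        if claims.get(P_SEX_OR_GENDER, set()) & WOMEN_GENDER_VALUES:
            labels.append(LABEL_WOMEN)
    return labels


print(people_labels({"P31": {"Q5"}, "P21": {"Q6581072"}}))
```

Because the output depends only on explicit claims, an item with no gender statement simply gets no women label rather than a guessed one, which is the trade-off described above.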
Miriam renamed this task from Brainstorm taxonomy for "impactful topics" to Build a taxonomy for "impactful topics". May 9 2024, 2:15 PM
Miriam updated the task description.

As a brief follow-up note to @Isaac above: I am currently reviewing the data collected by Isaac and comparing it with the reported use of WikiProjects and other topical collaborations in community reporting areas such as Diff and This Month in GLAM, to better understand which topical networks would be most prepared for the conversation about "history/society" and other topics like climate/biodiversity/sustainability identified by Isaac. I currently have a sketched timeframe for targeted data modeling discussions about the rebuild in Q2 of FY24-25.

Hi @Astinson thanks!! Double checking something! Is this work going to be folded under WE1.1.3 (The hypothesis text seems pretty aligned with what we are trying to achieve)? If not, should we also take that work into account when designing the topic taxonomy? @MMulaudzi-WMF CC

@Miriam yep, exactly: the Q2 discussions should be part of 1.1.3, and some Q1 awareness building and WikiProject identification are covered in 1.1.2. The idea is that these consultations/outreach processes feed into each other in a way that keeps us from doing too many parallel outreach moments for different related things (from the perspective of organizers/editors).

Wonderful, thank you @Astinson! Is it OK if I assign this task to you and Isaac for now?

Miriam updated Other Assignee, added: Astinson.

Update:

  • I did a pass on updating the existing groundtruth mapping of WikiProject -> topics in collaboration with AS/EH. You can see the end result and a diff that compares it to the previous taxonomy. This is not finalized but hopefully is a good starting point for our smaller group workshops on the specific areas, and it also lets me see what it would mean to put some of these changes into practice.
  • On that note, I did a pass of retraining the model based on these new topics. Details are below for each category (called mid-level-category) and include the number of articles that the model was tested on (n -- this is approximately 10% of the total number of articles that the model was trained on) and the performance (avg_pre, which is the average precision score). With regard to interpretation of avg_pre, which I have found is the best overall measure of how well-calibrated the model is:
    • Anything over about 0.9 is amazing (no further work needed)
    • Anything between 0.75 and 0.9 is quite acceptable but there might be some small improvements we should look for
    • Anything below 0.75 (especially as it gets around 0.5) is definitely something to explore.
  • More specific details on what's going on can be found in the precision and recall results below, which show the model is generally pretty precise -- i.e. when it applies a label like Sustainability, it's correct 2/3 of the time (and we'd have to look at those 1/3 of incorrect predictions to see if they're truly "wrong") -- but it really suffers in recall, so it only tags 1/3 of the articles we know to be about sustainability with that topic. In the past, we have used precision of >0.7 as the baseline for considering deployment for these sorts of models but ideally we get over 0.8 at least (4 out of every 5 predictions are definitely relevant).
  • Overall this suggests that some of the new society/environment-related topics are still a bit too hazy for the model to fully pick up on and generalize. I'll produce some datasets of sample outputs too and set up an API with the model so it's easier to see what's actually going on with these topics.
=== Mid Level Categories ===
                                                         n  ...   avg_pre
mid-level-category                                          ...          
Culture.Sports                                      421323  ...  0.990916
STEM.STEM*                                          288310  ...  0.939329
History and Society.History                         172086  ...  0.812263
Culture.Media.Film and Television                   137456  ...  0.955852
Culture.Media.Music                                 128297  ...  0.960705
History and Society.Politics and government         123212  ...  0.841499
Culture.Visual arts.Visual arts*                    120018  ...  0.848226
Culture.Literature and Languages                    122679  ...   0.82772
Culture.Philosophy and religion                     111922  ...  0.817572
History and Society.Military and warfare            112274  ...  0.852824
History and Society.Transportation                   91726  ...  0.941979
Culture.Women                                        77818  ...  0.660833
History and Society.Business and economics           74998  ...  0.670014
STEM.Earth and the Environment.Physical Geography    71613  ...  0.921147
STEM.Physics and Space                               57959  ...  0.947662
Culture.Visual arts.Architecture                     55401  ...  0.783706
Culture.Media.Entertainment                          52509  ...  0.772975
STEM.Engineering                                     54602  ...   0.82607
STEM.Medicine & Health                               51805  ...  0.844046
History and Society.Society and Culture              35227  ...  0.545802
History and Society.Education                        35212  ...  0.638314
STEM.Computing                                       35725  ...  0.871113
History and Society.Human Rights                     35128  ...  0.425688
STEM.Biology                                         34063  ...  0.805851
Culture.Visual arts.Comics and Anime                 24942  ...  0.883274
STEM.Technology                                      26483  ...  0.585959
STEM.Chemistry                                       25342  ...  0.880904
Culture.Food and drink                               24565  ...  0.852065
STEM.Earth and the Environment.Humans and the e...   26751  ...  0.651183
Culture.Performing arts                              22287  ...  0.672227
Culture.Media.Journalism                             20354  ...  0.534463
STEM.Mathematics                                     19137  ...  0.869299
STEM.Earth and the Environment.Sustainability        10671  ...  0.452361
Culture.Visual arts.Fashion                           7259  ...  0.669888
                                                   precision    recall  \
mid-level-category                                                       
Culture.Sports                                      0.986563  0.961937   
STEM.STEM*                                          0.921664   0.83507   
History and Society.History                         0.813958  0.658891   
Culture.Media.Film and Television                   0.932337  0.880034   
Culture.Media.Music                                 0.948667  0.903887   
History and Society.Politics and government         0.835157  0.711197   
Culture.Visual arts.Visual arts*                    0.854203  0.722042   
Culture.Literature and Languages                    0.864754   0.65519   
Culture.Philosophy and religion                     0.859226  0.666464   
History and Society.Military and warfare            0.855601  0.707341   
History and Society.Transportation                   0.93814  0.865371   
Culture.Women                                       0.743657  0.491159   
History and Society.Business and economics          0.765867  0.508427   
STEM.Earth and the Environment.Physical Geography   0.917432  0.814419   
STEM.Physics and Space                              0.934848  0.865991   
Culture.Visual arts.Architecture                    0.812041  0.637353   
Culture.Media.Entertainment                         0.823381  0.595555   
STEM.Engineering                                    0.876094  0.685927   
STEM.Medicine & Health                              0.873536  0.722536   
History and Society.Society and Culture             0.685215  0.414937   
History and Society.Education                       0.808388   0.48225   
STEM.Computing                                      0.881269  0.755017   
History and Society.Human Rights                    0.690182   0.26597   
STEM.Biology                                        0.851505  0.658721   
Culture.Visual arts.Comics and Anime                0.899285  0.791516   
STEM.Technology                                     0.710038  0.443643   
STEM.Chemistry                                      0.885143  0.751125   
Culture.Food and drink                                0.8816  0.740199   
STEM.Earth and the Environment.Humans and the e...  0.768689   0.50585   
Culture.Performing arts                             0.774552  0.531521   
Culture.Media.Journalism                            0.713815  0.393977   
STEM.Mathematics                                    0.889434  0.757485   
STEM.Earth and the Environment.Sustainability        0.66919  0.331365   
Culture.Visual arts.Fashion                         0.804892  0.562061
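
As a rough aid for reading the table, the avg_pre bands and the >0.7 precision baseline described in the update can be written down as a small sketch. The function names and tier strings here are hypothetical, and the treatment of values exactly at 0.75 or 0.9 is an assumption since the comment does not specify it; the sample values are copied from the table above.

```python
def triage(avg_pre):
    """Bucket an average precision score using the bands from the update.

    Assumption: 0.9 itself falls in the middle band and 0.75 itself is
    treated as acceptable; the original comment leaves the edges open.
    """
    if avg_pre > 0.9:
        return "amazing"
    if avg_pre >= 0.75:
        return "acceptable"
    return "explore"


def meets_precision_baseline(precision, baseline=0.7):
    """Check precision against the historical deployment baseline (>0.7)."""
    return precision > baseline


# (avg_pre, precision) pairs copied from the table above.
sample = {
    "Culture.Sports": (0.990916, 0.986563),
    "History and Society.History": (0.812263, 0.813958),
    "History and Society.Human Rights": (0.425688, 0.690182),
    "STEM.Earth and the Environment.Sustainability": (0.452361, 0.66919),
}

for topic, (avg_pre, precision) in sample.items():
    print(topic, triage(avg_pre), meets_precision_baseline(precision))
```

On these rows, Human Rights and Sustainability land in the "explore" band and also fall just under the 0.7 precision baseline, which matches the reading given in the update.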

Weekly update:

  • I put together a simple UI for exploring how this new model prototype works: https://wiki-topic.toolforge.org/topic-prototype
  • Early feedback suggested that in some areas, we're not seeing good coverage of people (beyond the biography/geography labels) -- e.g., scientists not having their associated science topic show up. I've been looking into revitalizing some work I did a while back on using the occupation ontology on Wikidata for this. The idea is that we'd maintain a mapping of high-level occupations to topics -- e.g., physicist (Q169470) -> STEM.Physics_and_Space -- and then if someone has a claim saying that they are a nuclear astrophysicist (Q115979636), the code would follow that up to astrophysicist (Q752129) and then physicist (Q169470) and apply the appropriate label. There is still some work to do to figure out how to weight the different topics that a person's multiple occupations might map to (some occupations might map to multiple topics too), but it's promising and should be doable via a single SPARQL query (example).
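
The occupation walk described above can be sketched in memory as follows. This is only an illustration under stated assumptions: it is not the SPARQL query mentioned in the update, the hard-coded `PARENTS` edges contain just the one chain given in the comment, and it assumes the hierarchy being followed is Wikidata's subclass of (P279) relation. The `topics_for_occupation` helper is hypothetical, and weighting across multiple occupations is left out because it is still an open question.

```python
from collections import deque

# Assumed subclass-of (P279) edges, child -> parents. Only the chain from
# the comment above is included; a real run would fetch these from Wikidata.
PARENTS = {
    "Q115979636": ["Q752129"],  # nuclear astrophysicist -> astrophysicist
    "Q752129": ["Q169470"],     # astrophysicist -> physicist
}

# Manually maintained mapping of high-level occupations to topics.
OCCUPATION_TO_TOPICS = {
    "Q169470": ["STEM.Physics_and_Space"],  # physicist
}


def topics_for_occupation(qid, parents=PARENTS, mapping=OCCUPATION_TO_TOPICS):
    """Walk up the occupation hierarchy until mapped occupations are found."""
    seen = {qid}
    queue = deque([qid])
    topics = []
    while queue:
        current = queue.popleft()
        if current in mapping:
            for topic in mapping[current]:
                if topic not in topics:
                    topics.append(topic)
            continue  # do not climb past an occupation that is already mapped
        for parent in parents.get(current, []):
            if parent not in seen:  # guards against cycles in the hierarchy
                seen.add(parent)
                queue.append(parent)
    return topics


print(topics_for_occupation("Q115979636"))
```

Stopping at the first mapped ancestor on each branch keeps a specific occupation from also picking up the labels of very generic ancestors; whether that is the right behaviour is a design choice the thread has not settled.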

Weekly update:

  • made some improvements to the UI so it's easier to interpret (including presenting the current topic outputs alongside -- thanks Giovanna!)
  • recorded a draft 15-minute intro to the topic space and this new generation of models and just shared it with Alex for feedback. I will continue to iterate on this to try to make it as easy as possible for folks to understand the feedback space and help guide us. As one example, I put together a simple updated taxonomy that shows what topics we've adjusted: https://www.mediawiki.org/wiki/User:Isaac_(WMF)/sandbox
  • still working on incorporating the occupations, but I have a good sense of what I'm doing there now so it should be relatively quick