User Details
- User Since
- Aug 8 2017, 10:56 AM (374 w, 22 h)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- Diego (WMF) [ Global Accounts ]
Mon, Oct 7
Fri, Oct 4
Wed, Oct 2
Confirm if the hypothesis was supported or contradicted
Fri, Sep 13
Progress update
- I’m working on building a set of keywords related to peacock behavior and promotional tone. To do this, I’m using a TF-IDF approach, a well-known method to identify terms (keywords) that characterize a set of documents.
- This and next week are short for me (taking several days off), so it might take a bit more time to finalize this.
- I also communicated to my manager that we might try to build a product based on the fine-tuned model. If we decide to move forward, we would need to coordinate with her and the other teams involved on how to proceed.
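The TF-IDF idea mentioned above can be sketched in pure Python. This is a toy illustration only (made-up corpus, simplified scoring), not the actual pipeline or data:

```python
# Minimal pure-Python TF-IDF sketch for surfacing "peacock" keywords.
# Corpus and scoring are illustrative; real work would use a proper
# vectorizer over the actual labeled articles.
import math
from collections import Counter

peacock_docs = [
    "legendary world renowned visionary leader",
    "award winning acclaimed innovative visionary pioneer",
    "best most influential renowned expert",
]
background_docs = [
    "the city is located in the northern region",
    "the species was first described in 1758",
]

def tfidf_scores(target_docs, all_docs):
    """Score terms in target_docs by term frequency * inverse document frequency."""
    n_docs = len(all_docs)
    df = Counter()
    for doc in all_docs:
        df.update(set(doc.split()))
    tf = Counter()
    for doc in target_docs:
        tf.update(doc.split())
    return {term: count * math.log(n_docs / df[term])
            for term, count in tf.items()}

scores = tfidf_scores(peacock_docs, peacock_docs + background_docs)
keywords = sorted(scores, key=scores.get, reverse=True)
print(keywords[:5])
```

Terms that recur in the flagged documents but are rare in the background corpus float to the top, which is the behavior we want from a promotional-tone keyword list.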
Sep 7 2024
Progress update
- Experiments:
- As planned, I studied whether the model fine-tuned to detect peacock behavior can also detect other promotion-related content issues, described in this data set.
- I ran the model on 4 other datasets: {{fanpov}}, {{advert}}, {{autobiography}}, {{weasel}}
- The results (see below) show behavior similar to the peacock detection task. The model shows good precision and low recall (lower for templates other than peacock). This suggests that there is information about promotional tone that the model can detect, and depending on the setup the model could prioritize precision or recall.
- Coordination:
- We had a meeting with Peter Pelberg, Nicola Ayub, and Megan Neisler to discuss next steps.
- First, we decided that the model needs to be tested against a simple baseline, which can be just a string-matching approach looking for common peacock keywords. I'll be working on this during the next week(s) (note I'll be OoO a few days during the next two weeks).
- Peter is going to decide whether we want to go deeper on this specific task and analyze other factors related to turning this model into a product (serving time, UX, etc.), or work on other tasks that involve ML and user experiences.
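The string-matching baseline could be as simple as the following sketch. The keyword list and labeled examples are made up for illustration:

```python
# Hedged sketch of a keyword string-matching baseline for peacock
# detection, evaluated with precision/recall on toy data.
PEACOCK_KEYWORDS = {"legendary", "renowned", "acclaimed", "world-class"}

def keyword_baseline(text):
    """Predict 'peacock' if any keyword appears in the text."""
    words = set(text.lower().split())
    return bool(words & PEACOCK_KEYWORDS)

# Toy labeled examples: (text, has_peacock_template)
examples = [
    ("a legendary and renowned artist", True),
    ("an acclaimed world-class performer", True),
    ("a painter born in 1901", False),
    ("a renowned physicist", True),
    ("a village in northern Spain", False),
]

tp = sum(keyword_baseline(t) and y for t, y in examples)
fp = sum(keyword_baseline(t) and not y for t, y in examples)
fn = sum(not keyword_baseline(t) and y for t, y in examples)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Comparing the fine-tuned model against this kind of baseline shows how much of its signal goes beyond surface keyword matching.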
Sep 4 2024
@achou, just out of curiosity, is the "predict time" the total end-to-end period, or total = preprocess + predict?
Aug 30 2024
Progress update
Aug 23 2024
Progress update
This looks great @jsn.sherman. Do you know if there is an overlap in the revisions that return an error for each model?
I'm just wondering if the ML model fails on very long diffs (given that it needs to process the text itself).
Aug 19 2024
Aug 14 2024
Aug 8 2024
@Samwalton9-WMF, just keep in mind that the scores from RRML and RRLA are different. This means that you may need to run new user tests to (re)define the thresholds.
Hi @Samwalton9-WMF, we chose RRLA because it was more stable, but since then we have made some updates to RRML (it was not only about serving time, but also errors on some revisions) aimed at making it more stable.
So, if there is interest in switching to RRML (for the 47 languages with coverage), my recommendation would be to run some stress tests on that service, measure the % of errors, and check whether Automoderator can tolerate them.
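A stress test like the one suggested could measure the error rate as sketched below. `query_model` here is a stand-in stub that fails randomly; a real test would call the RRML service instead:

```python
# Sketch of a stress test measuring a scoring service's error rate.
# query_model is a stub (random ~5% failures); a real run would hit
# the RRML endpoint and catch HTTP/timeout errors instead.
import random

random.seed(0)

def query_model(rev_id):
    """Stub for the model API call; fails ~5% of the time."""
    if random.random() < 0.05:
        raise RuntimeError("service error")
    return {"rev_id": rev_id, "score": random.random()}

def error_rate(rev_ids):
    errors = 0
    for rev_id in rev_ids:
        try:
            query_model(rev_id)
        except RuntimeError:
            errors += 1
    return errors / len(rev_ids)

rate = error_rate(range(1000))
print(f"error rate: {rate:.1%}")
```

The measured rate can then be compared against whatever error budget Automoderator can tolerate.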
Aug 2 2024
Progress update
Jul 26 2024
Progress update
- I've been coordinating with the ML-team to show code examples that make their (experimental) infrastructure fail. They will use this code as part of their use-case studies when testing new LLM infrastructure.
- In the meantime I've been working on writing code to fine-tune smaller Language Models, this requires:
- Data preprocessing and cleaning (done)
- Experimental design (done)
- Run experiments on stats machine (in progress)
- Met with the KR owner (Peter Pelberg) and explained the progress and next steps for this hypothesis.
Jul 23 2024
Hi! Apparently the data is missing again:
Jul 22 2024
Jul 19 2024
- Studied how to create prompts for Gemma2. Noticed the importance of using special tokens and format.
- Designed zero-shot experiment for detecting Peacock behavior.
- Wrote code for testing the Gemma2 instance hosted by the ML-team.
- The instance took more than 5 seconds per query.
- After a few requests (around 200) the instance stopped responding.
- I've reported this issue to the ML-Team; my understanding is that they will be working on fixing it during the next week (cc: Chris Albon)
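For reference, Gemma's chat format wraps the prompt in special turn tokens, which is the formatting detail noted above. A minimal sketch (the task wording is invented for illustration):

```python
# Sketch of building a Gemma-2-style prompt using its special turn
# tokens (<start_of_turn>/<end_of_turn>, per the Gemma chat format).
# The zero-shot task wording below is made up for illustration.
def build_gemma_prompt(article_text):
    instruction = (
        "Does the following Wikipedia text contain peacock "
        "(promotional) language? Answer yes or no.\n\n" + article_text
    )
    return (
        "<start_of_turn>user\n"
        f"{instruction}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = build_gemma_prompt("An acclaimed and visionary leader ...")
print(prompt)
```

Omitting these tokens, or using a generic chat template, tends to degrade instruction-tuned model output noticeably.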
Jul 18 2024
You are right @leila, we should merge them.
Jul 12 2024
Based on our previous research, we have created a dataset containing 9276 articles affected by peacock and other related policy violations on English Wikipedia. For each of them we have negative (no policy violations) and positive examples:
- autobiography: 1472
- fanpov: 350
- peacock: 2587
- weasel: 805
- advert: 4062
- Total: 9276
Jul 5 2024
I'm resolving this task; the model's deployment is tracked in T369371
Jul 4 2024
- covid-data.wmf-research-tools.eqiad1.wikimedia.cloud (this one is shut off, so maybe it just needs to be deleted?)
- wikipediaWikidata.wmf-research-tools.eqiad1.wikimedia.cloud
I've just removed these two
Thanks for this work @isarantopoulos!
Jun 28 2024
@Trokhymovych, please post here the models' performance results
To keep this task updated, models for Wikipedia are ready and can be found here:
@Trokhymovych has addressed the comments and submitted the merge request. The model binary can be found here.
I'm going to coordinate with research engineers to decide next steps.
Jun 25 2024
Just for the record, we have migrated the fact-checking API to another instance and deleted the old one.
Jun 24 2024
Jun 18 2024
Thanks @JAllemandou !
May 20 2024
@XiaoXiao-WMF can you please provide more context?
May 6 2024
@lbowmaker the proposed solution sounds ok to me. I have two questions:
May 3 2024
@lbowmaker if I understand correctly, there is no alternative for obtaining historical data for Wikidata edits? If that's the case, we can't keep the Wikidata Revert Risk model updated.
May 2 2024
Apr 29 2024
This task has been resolved, please follow the model deployment here: T363718
Apr 17 2024
This was solved. More details here T341820
Mar 1 2024
- We have improved the model's accuracy; currently I'm working on making the model faster, so it can work in real time.
Feb 29 2024
@kostajh, to the best of my knowledge @KStoller-WMF is leading this project. We had a meeting in January and I gave my input there. I think other teams that have run community testing processes can say more about this. Technically we could target a certain precision, which would involve different thresholds per wiki. Using the Knowledge Observatory data it should be easy to compute these numbers; however, maintenance could be hard, so my understanding was that the decision was to go with a single threshold for all wikis.
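Picking a per-wiki threshold that targets a fixed precision could look like the sketch below. The scores and labels are toy values; real numbers would come from the Knowledge Observatory data mentioned above:

```python
# Sketch: find the smallest score threshold whose predictions reach a
# target precision. Toy (score, was_reverted) pairs for illustration.
def threshold_for_precision(scored, target_precision):
    """Smallest threshold reaching target precision, else None."""
    for threshold in sorted({s for s, _ in scored}):
        preds = [(s >= threshold, y) for s, y in scored]
        tp = sum(p and y for p, y in preds)
        fp = sum(p and not y for p, y in preds)
        if tp and tp / (tp + fp) >= target_precision:
            return threshold
    return None

scored = [(0.2, False), (0.4, False), (0.6, True),
          (0.7, False), (0.8, True), (0.9, True)]
print(threshold_for_precision(scored, 0.75))
```

Running this per wiki gives the per-wiki thresholds discussed above; the maintenance cost comes from recomputing them whenever the score distributions drift.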
Feb 28 2024
Feb 17 2024
my two cents:
Feb 9 2024
- To improve the interaction between structured and text data, I'm experimenting with a full PyTorch approach.
Feb 5 2024
Jan 24 2024
Jan 22 2024
We are using this task as an umbrella for reporting improvements and our coordination with product teams regarding the Revert Risk models.
Given that the model proved to be good enough for the Automoderator project, and will also be integrated into the MediaWiki Recent Changes feed (T352217), I think we can resolve this task and report future updates related to revert risk in the EPIC task: T314384
Jan 20 2024
- We are preparing a new dataset (using diffs) to train the model.
- We are experimenting with language models, such as mBERT and LaBSE, to evaluate structured (claims) edits.
Jan 18 2024
@MunizaA, until we have enough training data we should treat temporary accounts as anonymous users. In practice this means overwriting temporary users' features.
So, basically
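A minimal sketch of what that overwrite could look like; all field names here are hypothetical, not the actual RR feature schema:

```python
# Hypothetical sketch: make temporary accounts look like anonymous
# users by overwriting their features before scoring. Field names
# are invented for illustration.
def normalize_user_features(features):
    """Return a copy where temporary accounts look anonymous."""
    out = dict(features)
    if out.get("user_is_temp"):
        out["user_is_anonymous"] = True
        # Drop registered-user signals temp accounts shouldn't carry.
        out["user_editcount"] = 0
        out["user_age_days"] = 0
    return out

row = {"user_is_temp": True, "user_is_anonymous": False,
       "user_editcount": 12, "user_age_days": 3}
print(normalize_user_features(row))
```

This keeps the trained model unchanged while routing temp-account edits through the code path the model already understands (anonymous users), until there is enough temp-account data to retrain.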
Jan 17 2024
Ideally, by the time we are deploying to pilot wikis, the model will understand that revisions made by temp accounts should be scored differently than if those revisions came from full accounts. I am not sure how much you'll be able to do, though, without a lot of real world data of temp account edits?
Dec 22 2023
- We have obtained 590 labels from 540 different revisions. Data is available here.
- This is the confusion matrix:
| 92 | 28 |
| 56 | 364 |
Given the following scores:
|  | Revert Risk | ORES |
| Precision | 0.93 | 0.91 |
| F1 | 0.90 | 0.91 |
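As a sanity check, the reported Revert Risk precision and F1 can be recomputed from the confusion matrix above, assuming the usual [[TN, FP], [FN, TP]] layout (an assumption, but it reproduces the reported numbers):

```python
# Recompute Revert Risk precision and F1 from the confusion matrix
# above, read in the [[TN, FP], [FN, TP]] layout.
tn, fp = 92, 28
fn, tp = 56, 364

precision = tp / (tp + fp)   # 364 / 392
recall = tp / (tp + fn)      # 364 / 420
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} f1={f1:.2f}")
```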
Ok! I understand.
Currently, Revert Risk uses several user features. I think the "revision count" could be used as a replacement for the "anonymous" field. However, probably the most straightforward solution would be to replace the "anonymous" column with a "temporary" column.
Dec 18 2023
Hi @kostajh, I'm not sure I'm understanding the question. Are you proposing to add the "user status" (temporary/full) as a feature in Revert Risk?
A few comments:
Dec 8 2023
- The Privacy Engineering team has reviewed the model and found no privacy-related concerns.
- The patch for adding Revert Risk to the recent changes feed has been merged. This enables the option of integrating Revert Risk into MediaWiki. Now I'm working on finding adequate thresholds for RR scores (T351897) to add the corresponding MediaWiki tags.
- Currently we have around 200 labels. WMDE is helping to increase this number.
- We are preparing a new dataset for training the RR Wikidata model.
Dec 1 2023
No updates this week.
- The Moderation Tools team is running tests and community discussions to implement the Automoderator project; we are coordinating with them to learn about potential areas of improvement for RR.
- There have been some community initiatives to evaluate the quality of the RR models. In T336934 a group of rowiki editors manually labeled a set of risky revisions. We have analyzed these results, which show reasonably good performance.
- The ML-team is working on integrating RRLA into the recent changes feed T348298. We are working on defining the best thresholds for this integration T351897.
- We have been working with Wikimedia Enterprise to clarify some doubts about the RRLA model T346095
Nov 24 2023
- I presented the Revert Risk Wikidata model and the Annotool at the WikiProject LD4 Wikidata gathering.
- We have started collecting new annotations in the second Wikidata labeling campaign. The campaign is available here. @Lydia_Pintscher is helping us find more annotators (thanks!).
Nov 23 2023
I can't think of a case where this is possible, but I'll have a look.
Anyhow, I've done some cleaning, merged the datasets, and then computed some scores:
These scores seem to be based on the prediction, not the score returned by the algorithm, so they seem a bit useless in the context of a reverter - the community will almost certainly not accept a 53% success rate. Can you advise on why you chose these and not the score-based results, which seem better?
I've done both; you can find them in the Jupyter notebook. But in summary, the precision is very similar (almost identical) to ORES rowiki-damaging.
Nov 21 2023
- Older revisions.
Concerns around our understanding of the model's limitations for older edits, given the training window. If a user is looking at taking a full snapshot of either our current corpus of Wikipedia or a past version, both include revisions from a broader window of time than the training window specifically, and may show "latent" bad revisions that either perform differently with the LiftWing model or go uncaught. I am curious what you may recommend to evaluate older content that could be vandalized without us knowing, due to a lack of revisions/content attention by editors.
I'm not completely sure I'm understanding your question. What I can say is that any model will have a certain time drift, and that includes RR and ORES. I think the model's precision would decay if we used it on very old data, but it probably tends toward a certain limit (I would assume the same is true for ORES, and that model is probably already working close to its limits). The Language Agnostic model shouldn't be difficult to run on a large old dataset; I understand that @fkaelin and @Pablo have been working on running the model on large data, so if you have a specific question to be answered, the four of us could try to design an experiment to answer it.
- Performance on different types of pages. You already addressed this in part, but what I mean by different page types isn’t necessarily subject-related (though cross-language data is helpful as well) but instead based on the metadata of the page.
How does the model typically perform on revisions in pages with low/high pageviews, low/high amounts of content, more/less edits, etc. This is less critical for our use-case, but we are imagining cases where a user may want to create their own filtering system based on their tolerance for risk and may want an approach that divides article approaches based on metadata.
Let us know if there are potential low-risk exercises we can collaborate on to subset the data.
I don't have such statistics; maybe the Knowledge Integrity Observatory has some data to answer this (@Pablo?)
- What to know that we do not know.
This is what I was trying to pull on with the question on use. If ORES had fallen out of style among some users and/or grown in use with others, why? If we can understand points of friction with use (Usability? Performance? Different approach needed?) it will help us integrate learnings as we design similar features (credibility signals, etc.).
I don't think there is a clear pattern here. I think the adoption/attrition of these tools is opportunistic, in the sense that people use them according to their needs. With no other options, people will use whatever is available. And even with more tools available, developers will use what fits better in their workflows, or even what they have seen working in the past. Unless there is a dramatic difference in accuracy between models, I don't think differences in model quality are easy for developers to assess.
Probably the attrition is related to the (lack of) success of the tools created using ML models, and not directly to the model itself (although model quality probably has an impact on a tool's success).
Great. Having some manual labels is always valuable.
I did a quick check and saw there are a few cases where the RR scores are not higher than 0.93. For example, this one:
Nov 14 2023
Hi @Strainu, this is Diego from the WMF Research team.
Nov 13 2023
We would also include @cwylo !
Nov 6 2023
Hi, this is Diego from Research. I'll take some of the questions you raised:
<3
Oct 31 2023
@MunizaA got it. Would this mean creating another project?