This repository is for the Modul M.Inf.2801 Research Lab Rotation at the Georg-August Universität Göttingen.
This work is based on Rogue Scores (Grusky, ACL 2023) and the corresponding Rogue Scores: Reproducibility Guide. After evaluating over 2,000 papers that use the ROUGE score, the authors reported three conclusions:
- (A) Critical evaluation decisions and parameters are routinely omitted, making most reported scores irreproducible.
- (B) Differences in evaluation protocol are common, affect scores, and impact the comparability of results reported in many papers.
- (C) Thousands of papers use nonstandard evaluation packages with software defects that produce provably incorrect scores.
Our work primarily investigates the METEOR score, its implementations, and its use in scientific papers. We implement a semi-automated literature review and identify decisions and aspects responsible for discrepancies in the use of METEOR.
This repository also provides a preliminary review of BLEU scores and reproduces the paper review of ROUGE scores.
```
├── README.md
├── METEOR.ipynb             # literature review of METEOR
├── meteor_code_review.csv   # results of the manual code review
├── 2024-03-15_Lübbers.pdf   # short presentation of the results
├── ROUGE.ipynb              # preliminary literature review of ROUGE
├── BLEU.ipynb               # preliminary literature review of BLEU
├── meteorscores             # directory containing the Docker image
```
METEOR is a parameterized metric. While this makes it flexible, it can be difficult to report correctly. There are also five official versions of METEOR, v1.0 through v1.5; the metric evolved over time and gained new parameters that change the score. The parameters in the official package are tuned to specific datasets and tasks. Any deviation in parameters, or failure to report the METEOR version used, can hinder the reproducibility of the scoring.
We use the ACL Anthology dataset, which contains bibliographic records for 73,285 machine learning papers, most of them with full texts.
Initially, we searched the dataset for occurrences of the keyword `meteor` to filter relevant papers.
All papers found this way are searched with regular expressions for
- Parameters, given as a string of letters and numbers
- Protocol variants taken from the official documentation
- Evaluation packages that were identified through online searches and manual codebase reviews.
We use regular expressions to search for links or mentions of code repositories (GitHub, GitLab, Bitbucket, Google Code, Sourceforge) within the papers.
We use the GitHub API to search each of the found GitHub repositories for the keyword `meteor`; a sketch of this search follows the list below.
We restrict this step to GitHub repositories because
- this mirrors the procedure in Rogue Scores (Grusky, ACL 2023)
- GitHub repositories are by far the most commonly used ones (see Evaluation)
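The sketch below shows one way such a keyword search could look, using the GitHub code search API (a minimal sketch assuming the `requests` package and a personal access token; the function name and placeholder token are illustrative, not taken from the notebooks):

```python
import requests

GITHUB_API_KEY = "ghp_your_token_here"  # placeholder; add your own token

def repo_contains_keyword(owner_repo: str, keyword: str = "meteor") -> bool:
    """Check via the GitHub code search API whether a repository contains a keyword."""
    response = requests.get(
        "https://api.github.com/search/code",
        params={"q": f"{keyword} repo:{owner_repo}"},
        headers={
            "Authorization": f"Bearer {GITHUB_API_KEY}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    response.raise_for_status()  # code search requires authentication and is rate-limited
    return response.json()["total_count"] > 0

# Example: repo_contains_keyword("nltk/nltk")
```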
All found repositories are manually reviewed for reproducibility. Detailed criteria for code can be found in Appendix B of Rogue Scores (Grusky, ACL 2023).
A paper is considered reproducible if one of three conditions holds, based on Rogue Scores (Grusky, ACL 2023):
- The paper cites the METEOR package and its parameters
- The paper cites a METEOR package that needs no configuration
- The codebase includes a complete METEOR evaluation.
The original METEOR software is a Java package and cannot be used directly from Python, the most popular machine-learning language. Therefore, either a wrapper or a reimplementation is needed; both are potential sources of error. We check how these implementations differ.
We download all METEOR packages that are available for Python 3 and have more than 3 citations, resulting in 6 packages in total. They were identified through the code review and online searches.
As in Rogue Scores (Grusky, ACL 2023), we use the CNN / Daily Mail dataset. Human-written bullet-point "highlights" serve as reference summaries. We use METEOR to evaluate the specimen model hypotheses against the references on a development set. See Appendix E in Rogue Scores (Grusky, ACL 2023).
Evaluation is performed on Lead-3. This summarizes an article by returning its first three sentences. See Appendix F in Rogue Scores (Grusky, ACL 2023).
We calculate the METEOR score for the generated summaries using each of the packages. We only checked packages compatible with Python 3; older packages were not considered, for reasons of recency and simplicity. The CNN / Daily Mail development set consists of 13,000 test cases, and we report the mean METEOR score over them. To run the packages, open the Docker image and run `generate_packages_table()`.
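For illustration, the sketch below computes Lead-3 summaries for the CNN / Daily Mail development set and averages METEOR scores over them, using the NLTK implementation as just one example; it is a simplified stand-in for the actual package comparison run inside the Docker image, not the code used there.

```python
import nltk
from datasets import load_dataset
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.translate.meteor_score import meteor_score

# Tokenizer and WordNet data for NLTK; depending on your NLTK version you may
# also need "punkt_tab" and "omw-1.4".
nltk.download("punkt")
nltk.download("wordnet")

dev = load_dataset("cnn_dailymail", "3.0.0", split="validation")

def lead_3(article: str) -> str:
    """Lead-3 baseline: the first three sentences of the article."""
    return " ".join(sent_tokenize(article)[:3])

scores = []
for example in dev.select(range(100)):  # small subset for illustration; the full set has ~13,000 cases
    hypothesis = word_tokenize(lead_3(example["article"]))  # recent NLTK expects pre-tokenized input
    reference = word_tokenize(example["highlights"])
    scores.append(meteor_score([reference], hypothesis))

print(f"Mean METEOR over {len(scores)} examples: {sum(scores) / len(scores):.4f}")
```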
- You need to install the following Python packages for the Jupyter notebooks:
pandas=2.1.4
numpy=1.26.2
requests=2.31.0
datasets=2.16.1
- The ACL Anthology dataset by Rohatgi et al. is available here.
`acl-publication-info.74k.v2.parquet` has to be placed in the `data` directory (a loading sketch follows this list).
- You need a GitHub API key to search the repositories for keywords. Add it at the top of the Jupyter notebooks.
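A minimal sketch of loading the dataset in the notebooks (assuming `pandas` with a parquet engine such as `pyarrow`; column names are not listed here because they depend on the dataset release):

```python
import pandas as pd

# Load the ACL Anthology dump used throughout the notebooks
papers = pd.read_parquet("data/acl-publication-info.74k.v2.parquet")

print(papers.shape)    # number of papers and metadata columns
print(papers.columns)  # inspect the available fields (titles, abstracts, full texts, ...)
```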
For software validation testing and the creation of the graphics, there is a docker image similar to the one provided for ROUGE. The full guide for the source image is here.
```
# create the image and download required files
make build
# start the image
make start
# clean up after you are done
make clean
```
When you are done, use one of the following:
```
# stop Docker and return later to continue working
docker stop meteorscores
# clean everything; requires make start to set up again
make clean
```
There are 73,285 papers in the dataset. The 5,871 papers without a full text get their `error_download` variable set to `True`.
Then, we identify papers containing the keyword `meteor` in the full text. This sets the `paper_meteor_prelim` variable of 1,613 papers to `True`. Only those papers are included in the following experiments.
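A sketch of how this preliminary filter could be implemented on the dataframe loaded above (the column name `full_text` is an assumption; the notebooks may use a different field name):

```python
# Flag papers whose full text mentions "meteor" (case-insensitive);
# papers without a full text were already marked via error_download.
papers["paper_meteor_prelim"] = (
    papers["full_text"]
    .fillna("")
    .str.contains("meteor", case=False, regex=False)
)

meteor_papers = papers[papers["paper_meteor_prelim"]]
print(len(meteor_papers))  # 1,613 in our run
```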
The keyword search may produce some false positives, since it can also match papers on astronomical topics; excluding those would require a manual paper review.
The regular expression to search for parameters anywhere in the text:
```python
pattern = r"((?: -[a-z123](?: [a-z0-9.]{1,4})?){2,})"
```
The matched parameters are copied to a string in `paper_meteor_params`.
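A sketch of applying this pattern, joining all matches per paper into a single string (as before, `full_text` is an assumed column name and the helper is illustrative):

```python
import re

param_pattern = re.compile(r"((?: -[a-z123](?: [a-z0-9.]{1,4})?){2,})")

def extract_meteor_params(full_text: str) -> str:
    """Concatenate all command-line-like parameter strings found in a paper."""
    return "; ".join(match.strip() for match in param_pattern.findall(full_text))

papers["paper_meteor_params"] = papers["full_text"].fillna("").map(extract_meteor_params)
```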
Only 9 papers with parameters are found, and those parameters are not related to METEOR. This suggests that METEOR is mostly run with its default configuration. As long as nobody changes the tuned parameters, this is arguably a good thing.
The METEOR documentation states:
"For the majority of scoring scenarios, only the -l and -norm options should be used. These are the ideal settings for which language-specific parameters are tuned."
The regular expressions to search for protocol-related terms within 500 characters of "meteor":
```python
regex_meteor_protocol = {
    'rank': r'\b(?:rank|ranking|WMT09|WMT10)\b',
    'adq': r'\b(?:adq|adequacy|NIST Open MT 2009)\b',
    'hter': r'\b(?:hter|HUMAN-targeted translation edit rate|GALE P2|GALE P3)\b',
    'li': r'\b(?:li|language[- ]independent)\b',
    'tune': r'\b(?:tune|tuning|parameter optimization)\b',
    'modules': r'\b(?:-m\s+(?:exact|stem|synonym|paraphrase)|module|exact|stem|synonym|paraphrase)\b',
    'normalize': r'\b(?:-norm|normalize|normalization|tokenize\s+and\s+lowercase|normalize\s+punctuation)\b',
    'lowercase': r'\b(?:-lower|lowercase|lowercasing|casing)\b',
    'ignore_punctuation': r'\b(?:-noPunct|no\s+Punctuation|ignore\s+punctuation|remove\s+punctuation)\b',
    'character_based': r'\b(?:-ch|character[- ]based|calculate\s+character[- ]based\s+precision\s+and\s+recall)\b',
    'verbose_output': r'\b(?:-vOut|verbose\s+output|output\s+verbose\s+scores)\b'
}
```
The protocol terms found are added to a list in `paper_meteor_protocol`.
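A sketch of this windowed search, using the `regex_meteor_protocol` dictionary above (the case-insensitive matching is an assumption about the notebook implementation):

```python
import re

def protocol_terms_near_meteor(full_text: str, window: int = 500) -> list:
    """Collect protocol terms that occur within `window` characters of 'meteor'."""
    found = set()
    for match in re.finditer("meteor", full_text, flags=re.IGNORECASE):
        start = max(0, match.start() - window)
        context = full_text[start:match.end() + window]
        for term, pattern in regex_meteor_protocol.items():
            if re.search(pattern, context, flags=re.IGNORECASE):
                found.add(term)
    return sorted(found)

papers["paper_meteor_protocol"] = papers["full_text"].fillna("").map(protocol_terms_near_meteor)
```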
METEOR protocol terms are mentioned, mostly ones related to the tasks:
- rank: parameters tuned to human rankings from WMT09 and WMT10
- adq: parameters tuned to adequacy scores from NIST Open MT 2009
- hter: parameters tuned to HTER scores from GALE P2 and GALE P3
- li: language-independent parameters
The regular expressions to search for METEOR packages anywhere in the paper:
```python
regex_meteor_packages = {
    'Meteor_coco': r'(?:coco.*?meteor|meteor.*?coco)',
    'pymeteor': r'pymeteor|zembrodt',
    'generationeval_meteor': r'generationeval.*?meteor|meteor.*?generationeval|webnlg.*?meteor|meteor.*?webnlg',
    'evaluatemetric_meteor': r'evaluatemetric.*?meteor|meteor.*?evaluate(?:-)?metric',
    'fairseq_meteor': r'fairseq.*?meteor|meteor.*?fairseq',
    'nlgeval_meteor': r'nlgeval.*?meteor|meteor.*?nlg(?:-)?eval',
    'nltk_meteor': r'nltk.*?meteor|meteor.*?nltk',
    'meteorscorer': r'meteorscorer',
    'beer_meteor': r'beer.*?meteor|meteor.*?beer',
    'compare_mt_meteor': r'compare(?:-)?mt.*?meteor|meteor.*?compare(?:-)?mt',
    'pysimt_meteor': r'pysimt.*?meteor|meteor.*?pysimt',
    'blend_meteor': r'blend.*?meteor|meteor.*?blend',
    'stasis_meteor': r'stasis.*?meteor|meteor.*?stasis',
    'huggingface': r'hugging\s*face.*?evaluate.*?meteor|meteor.*?hugging\s*face.*?evaluate',
    # Template for other versions
    'Meteor_x': r'Meteor\s+[xX]\.?[\d\.]*',
}
```
The terms found are added to a list in `paper_meteor_packages`.
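A sketch of applying these patterns and counting how many papers mention each package (the `re.IGNORECASE` and `re.DOTALL` flags are assumptions so that patterns can span line breaks and different casings):

```python
import re

def packages_in_paper(full_text: str) -> list:
    """Return the package labels whose pattern matches anywhere in the paper."""
    return [
        name
        for name, pattern in regex_meteor_packages.items()
        if re.search(pattern, full_text, flags=re.IGNORECASE | re.DOTALL)
    ]

papers["paper_meteor_packages"] = papers["full_text"].fillna("").map(packages_in_paper)

# Count how many papers mention each package
print(papers["paper_meteor_packages"].explode().value_counts())
```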
The most used packages are variations of Pycocoeval and NLTK.
GitHub | number of citations |
---|---|
coco-caption | 128 |
NLTK | 48 |
generation eval | 44 |
huggingface/evaluate | 37 |
fairseq | 29 |
nlg_eval | 8 |
Repository URLs are extracted from the papers with a regular expression and added to a list in `code_meteor_url`.
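The exact pattern is in the notebook; a simplified sketch of such a URL search could look like this (the hosts follow the list given above, the pattern itself is only an illustration):

```python
import re

# Match links to the code hosting platforms listed above
repo_url_pattern = re.compile(
    r"https?://(?:www\.)?"
    r"(?:github\.com|gitlab\.com|bitbucket\.org|code\.google\.com|sourceforge\.net)"
    r"/[\w\-./]+",
    flags=re.IGNORECASE,
)

def find_repo_urls(full_text: str) -> list:
    """Return the unique repository URLs mentioned in a paper."""
    return sorted(set(repo_url_pattern.findall(full_text)))

papers["code_meteor_url"] = papers["full_text"].fillna("").map(find_repo_urls)
```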
227 unique GitHub repositories were found in the papers. The found GitHub codebases are searched for the term `meteor`; if the keyword is found, `code_meteor_prelim` is set to `True`. 55 of the repositories contained the keyword `meteor`.
All 55 code repositories were manually reviewed for reproducibility and for the METEOR packages they use. 33 were found to be reproducible regarding the METEOR score, which sets their `code_meteor` variable to `True`.
The software validation produces the following results:
GitHub | METEOR score |
---|---|
WebNLG/GenerationEval | 0.217303 |
salaniz/pycocoevalcap | 0.218336 |
bckim92/language-evaluation | 0.167222 |
Maluuba/nlg-eval | 0.221483 |
Yale-LILY/SummEval | 0.221483 |
nltk/nltk | 0.382306 |
huggingface/evaluate | 0.382306 |
facebookresearch/vizseq | 0.332000 |
Out of the 1,613 papers
- 23 % are deemed reproducible
- 34 % cite METEOR software packages
- 16 % release code
- 2 % release code with METEOR evaluation
- 0 % list METEOR configuration parameters
The implementations fall into two groups, and their scores differ substantially.
Packages with the lower score use the original Java METEOR implementation, its paraphrase file, and the wrapper from coco-caption. They are used in roughly 60 % of the papers. The wrapper was developed in cooperation with an author of METEOR, so we consider it correct. It was written for Python 2 and had to be adapted slightly to work with the Python 3 packages we investigated. Different adaptations for Python 3 may have led to slightly different scores.
Packages with the higher score use the NLTK reimplementation with WordNet paraphrases. They are used in roughly 40 % of the papers. There is a closed GitHub issue regarding the higher scores with NLTK. NLTK implements METEOR v1.0; implementing METEOR v1.5 is far more complex, as it introduces paraphrases as an additional matching scheme and adds word weights. The default parameters of METEOR v1.0 were tuned for adequacy and fluency, while v1.5 uses rank as its default task. The NLTK developers are aware of the higher scores, but since they have no open-source implementation of METEOR v1.5, they stick with what they have.
NLTK also uses slightly different values for the alpha, beta, and gamma parameters than the original Java implementation.
All of this could be mitigated by citing the correct METEOR version when reporting METEOR scores. However, the NLTK documentation does not mention the METEOR version; this information is hidden in the Python code. As a result, nobody references the appropriate METEOR v1.0 paper, and everyone references the most recent one (METEOR v1.5).
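For illustration, this is roughly how the NLTK reimplementation is called; the alpha, beta, and gamma defaults shown are NLTK's documented values (a minimal sketch, assuming a recent NLTK with WordNet data installed, which expects pre-tokenized input):

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")  # required for synonym matching

reference = "the cat sat on the mat".split()
hypothesis = "the cat is sitting on the mat".split()

# NLTK's defaults (alpha=0.9, beta=3.0, gamma=0.5) differ from the parameters
# used by the official Java METEOR 1.5 package; they can be overridden here.
score = meteor_score([reference], hypothesis, alpha=0.9, beta=3.0, gamma=0.5)
print(round(score, 4))
```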
Low reproducibility was expected: there is no reason why papers using METEOR should fare significantly better than papers using ROUGE. The numbers are slightly higher because METEOR scoring is usually done without any configuration.
Correctness is the more problematic issue. NLTK and its derivatives are used in about 40 % of the papers citing METEOR packages. Researchers likely use whichever implementation fits into their environment, for example because they already use it for a different task.
If you try to build on a paper, there are these possibilities:
- the paper reports the packages used and you use the same ones
- you choose another implementation, and it comes from the same group
- the paper uses a wrapper and you use a reimplementation: your scores will be about 10 points higher
- the paper uses a reimplementation and you use a wrapper: your scores will be about 10 points lower
Chances are that progress is hindered, or at least obscured, by invalid comparisons.
The only upside: METEOR is not used on leaderboards at Papers With Code.
All limitations stated in Rogue Scores (Grusky, ACL 2023) apply. Additionally:
- Our ACL Anthology Dataset only includes papers up to September 2022
- No manual paper review to filter for METEOR scoring
- Software validation testing was only done for packages available for Python 3
Feel free to fork this repository and improve it.
- do a manual review of all papers
- continue the investigation on METEOR and BLEU
- build one docker image for investigating METEOR, BLEU, and ROUGE
We recreate the ROUGE reproducibility investigation with the same regular expressions as Rogue Scores (Grusky, ACL 2023).
- 25 list parameters
- 459 list protocol terms
- 1,168 list variants
- 274 list packages
- 62 are deemed reproducible (excluding Code Review)
Finding | Rogue Scores | our reproduction |
---|---|---|
overall citations | 110,689 | 73,285 |
manually excluded | 887 | 0 |
papers in review | 2,834 | 2,593 |
reproducible | 20 % | 2 % |
list parameters | 5 % | 1 % |
cite software package | 35 % | 11 % |
Our relative numbers are significantly lower because we did not manually review the automatically identified ROUGE papers; such a review would have reduced the number of papers in review significantly. Our reproducibility number is also lower because we did not perform a code review.
This is only the start of an investigation.
- 54 list parameters
- 3,825 list protocol terms
- 631 list variants
- 5,336 list packages
Finding | BLEU | ROUGE | METEOR |
---|---|---|---|
papers in review | 9,122 | 2,593 | 1,613 |
list parameters | 0.6 % | 1 % | 0 % |
cite software package | 58 % | 11 % | 34 % |