Tags: zhhao1/sacrebleu
Tags
Bugfix in using max_ngram_order (mjpost#174)
* bugfix: corpus_score() was ignoring self.max_ngram_order (fixes mjpost#173)
* added test case for max_ngram_order
* simplified pytest build
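To make the effect of the fix concrete, here is a minimal toy sketch of a corpus-level BLEU computation that honors a configurable maximum n-gram order. It is not sacreBLEU's implementation: the function names (`ngram_counts`, `toy_corpus_bleu`) are hypothetical, and it assumes whitespace tokenization, a single reference per hypothesis, and no smoothing (it simply returns 0.0 when any order has no matches).

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def toy_corpus_bleu(hyps, refs, max_ngram_order=4):
    """Toy corpus BLEU in [0, 100]; hypothetical helper, not the sacreBLEU API."""
    matches = [0] * max_ngram_order
    totals = [0] * max_ngram_order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        # The loop bound is the point of the bugfix: it must follow the
        # configured order rather than a hard-coded value.
        for n in range(1, max_ngram_order + 1):
            hc, rc = ngram_counts(h, n), ngram_counts(r, n)
            matches[n - 1] += sum((hc & rc).values())  # clipped matches
            totals[n - 1] += sum(hc.values())

    # Simplification: no smoothing, so any order without matches gives 0.0.
    if min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals))
    log_precision /= max_ngram_order
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


hyps = ["the cat sat on the mat"]
refs = ["the cat sat on a mat"]
for order in (1, 2, 4):
    print(order, round(toy_corpus_bleu(hyps, refs, max_ngram_order=order), 2))
```

Because higher orders match less often, the score should drop as `max_ngram_order` grows; a scorer that ignored the setting would print the same value for every order.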
Merge changes for 2.0.0 (mjpost#152)
- Build: Add Windows and OS X testing to github workflow
- Improve documentation and type annotations.
- Drop `Python < 3.6` support and migrate to f-strings.
- Drop input type manipulation through `isinstance` checks. If the user does not follow the expected annotations, exceptions will be raised. Past robustness attempts led to confusion and obfuscated score errors (fixes mjpost#121)
- Use colored strings in tabular outputs (multi-system evaluation mode) with the help of the `colorama` package.
- tokenizers: Add caching to tokenizers, which seems to speed things up a bit.
- `intl` tokenizer: Use `regex` module. Speed goes from ~4 seconds to ~0.6 seconds for a particular test set evaluation. (fixes mjpost#46)
- Signature: Formatting changed (mostly to remove the '+' separator as it was interfering with chrF++). The field separator is now '|' and key values are separated with ':' rather than '.'.
- Metrics: Scale all metrics into the [0, 100] range (fixes mjpost#140)
- BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes mjpost#141).
- BLEU: allow modifying max_ngram_order (fixes mjpost#156)
- CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
- CHRF: Added chrF++ support through `word_order` argument. Added test cases against chrF++.py. Exposed it through the CLI (--chrf-word-order) (fixes mjpost#124)
- CHRF: Add possibility to disable effective order smoothing (pass --chrf-eps-smoothing). This way, the scores obtained are exactly the same as the chrF++, Moses and NLTK implementations. We keep the effective ordering as the default for compatibility, since this only affects sentence-level scoring with very short sentences. (fixes mjpost#144)
- CLI: Allow modifying TER arguments through CLI. We still keep the TERCOM defaults.
- CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
- CLI: Added `--format/-f` flag. The single-system output mode is now `json` by default. If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your shell.
- CLI: sacreBLEU now supports evaluating multiple systems for a given test set in an efficient way. Through the use of the `tabulate` package, the results are rendered into a plain text table, LaTeX, HTML or RST (cf. --format/-f argument). The systems can be given either as a list of plain text files to `-i/--input` or as a tab-separated single stream redirected into `STDIN`. In the former case, the basenames of the files will be automatically used as system names.
- Statistical tests: sacreBLEU now supports confidence interval estimation through bootstrap resampling for single-system evaluation (`--confidence` flag) as well as paired bootstrap resampling (`--paired-bs`) and paired approximate randomization tests (`--paired-ar`) when evaluating multiple systems (fixes mjpost#40 and fixes mjpost#78).
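As a rough illustration of the bootstrap resampling idea mentioned in the statistical tests entry, the sketch below estimates a percentile confidence interval for a mean score. It is a simplified stand-in under stated assumptions, not sacreBLEU's code: the function name `bootstrap_ci`, its parameters, and the example scores are hypothetical, and it averages per-sentence scores instead of recomputing a corpus-level metric on each resample.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=12345):
    """Percentile bootstrap CI for the mean of per-sentence scores.

    Hypothetical helper for illustration only. Returns (point, low, high).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(scores)

    estimates = []
    for _ in range(n_resamples):
        # Draw n sentences with replacement and re-score the resample.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        estimates.append(mean(resample))
    estimates.sort()

    # Cut alpha/2 off each tail of the sorted estimates, clamping indices.
    low_idx = max(0, int((alpha / 2) * n_resamples))
    high_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return mean(scores), estimates[low_idx], estimates[high_idx]


# Made-up per-sentence scores in the [0, 100] range, for demonstration only.
sentence_scores = [12.0, 35.5, 40.1, 22.3, 55.0, 31.7, 47.9, 28.4]
point, low, high = bootstrap_ci(sentence_scores)
print(f"score = {point:.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
```

The paired variants compare two systems on the same resampled sentence indices, so the difference between systems is measured on identical samples.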
dataset: fix mTEDx hashes (mjpost#145)
Fix the md5 sums for the newly added mTEDx test/valid sets
Added WMT20 newstest (mjpost#109)
* Added WMT20 newstest (mjpost#103)
* updated CHANGELOG and README
Co-authored-by: Ozan Caglayan
Refactoring & Fixes (mjpost#88)
* Added Multi30k multimodal MT test set metadata
* Refactored all tokenizers into respective classes (fixes mjpost#85)
* Refactored all metrics into respective classes
* Moved utility functions into utils.py
* Implemented signatures using BLEUSignature and CHRFSignature classes, expose `Signature().info`
* metrics: Signature().info is now exposed (fixes mjpost#75)
* Simplified checking of Chinese characters (fixes mjpost#5)
* Unified common regexp tokenization codes for tokenizers (fixes mjpost#27)
* Fixed --detail failing when no test sets are provided
* Fixed multi-reference BLEU failing when tab-delimited reference stream is used
* Removed lowercase option for ChrF which was not functional (mjpost#85)
* Simplified ChrF and used the same I/O logic as BLEU to allow for future multi-reference reading
* Added score regression tests for chrF using reference chrF implementation
* Added multi-reference & tokenizer & signature tests
* Pin mecab version to 0.996.5 as the newer ones are incompatible (fixes mjpost#94)
* bump version to 1.4.11