What's New in v3.6
spaCy v3.6 adds the new SpanFinder component to the core spaCy library and new trained pipelines for Slovenian.
SpanFinder
The SpanFinder component identifies potentially overlapping, unlabeled spans by identifying span start and end tokens. It is intended for use in combination with a component like SpanCategorizer that may further filter or label the spans. See our Spancat blog post for a more detailed introduction to the span finder.
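The idea of proposing spans from start and end tokens can be sketched in plain Python. This is only an illustration, not spaCy's implementation: the function name, the per-token scores, and the threshold and max_length values below are assumptions made up for the example.

```python
from typing import List, Tuple


def candidate_spans(
    start_scores: List[float],
    end_scores: List[float],
    threshold: float = 0.5,
    max_length: int = 3,
) -> List[Tuple[int, int]]:
    """Pair every likely start token with every likely end token.

    Returns (start, end) token offsets with an exclusive end.
    """
    starts = [i for i, score in enumerate(start_scores) if score >= threshold]
    ends = [i for i, score in enumerate(end_scores) if score >= threshold]
    spans = []
    for start in starts:
        for end in ends:
            length = end - start + 1
            # Keep only well-formed spans up to the maximum length
            if 1 <= length <= max_length:
                spans.append((start, end + 1))
    return spans


# Hypothetical scores for the tokens ["New", "York", "City", "is", "big"]
start_scores = [0.9, 0.6, 0.1, 0.0, 0.0]
end_scores = [0.1, 0.7, 0.8, 0.0, 0.0]
spans = candidate_spans(start_scores, end_scores)
print(spans)
```

The candidates overlap and carry no labels, which is why a downstream component is still needed to filter or label them.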
To train a pipeline with span_finder and spancat, remember to add span_finder (and its tok2vec or transformer if required) to [training.annotating_components] so that the spancat component can be trained directly from its predictions.
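A minimal sketch of the relevant settings, assuming a pipeline made up of tok2vec, span_finder and spancat. Only the two keys discussed here are shown, so this is not a complete training config. The standard-library configparser stands in for spaCy's own config loader, purely to check that every component ahead of spancat is listed as an annotating component.

```python
import configparser
import json

# Assumed pipeline layout; a real config contains many more sections
CONFIG_FRAGMENT = """
[nlp]
pipeline = ["tok2vec","span_finder","spancat"]

[training]
annotating_components = ["tok2vec","span_finder"]
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_FRAGMENT)

# Config values are JSON-formatted, so json.loads can decode the lists
pipeline = json.loads(parser["nlp"]["pipeline"])
annotating = json.loads(parser["training"]["annotating_components"])

# Components that run before spancat but are not set to annotate
upstream = pipeline[: pipeline.index("spancat")]
missing = [name for name in upstream if name not in annotating]
print("missing annotating components:", missing)
```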
In practice it can be helpful to initially train the span_finder separately before sourcing it (along with its tok2vec) into the spancat pipeline for further training. Otherwise, memory usage for spancat can spike in the first few training steps if the span_finder makes a large number of predictions.
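Sourcing the separately trained components could look roughly like the fragment generated below. The helper function and the model path are hypothetical; the sketch only relies on the config convention of a source setting inside a [components.<name>] block.

```python
def sourced_component_block(name: str, model_path: str) -> str:
    """Return a [components.<name>] block that sources a trained component."""
    return f'[components.{name}]\nsource = "{model_path}"\n'


# Hypothetical output directory of the separate span_finder training run
SPAN_FINDER_MODEL = "training/span_finder/model-best"

# Source the span_finder together with the tok2vec it was trained with
fragment = "\n".join(
    sourced_component_block(name, SPAN_FINDER_MODEL)
    for name in ("tok2vec", "span_finder")
)
print(fragment)
```

The spancat component itself would still be defined as usual in the same config and trained on top of the sourced components.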
Additional features and improvements
- Language updates:
  - Add initial support for Malay.
  - Update Latin defaults to support noun chunks, update lexical/tokenizer settings and add example sentences.
- Support spancat_singlelabel in the spacy debug data CLI.
- Add doc.spans rendering to the spacy evaluate CLI displaCy output.
- Support custom token/lexeme attribute for vectors.
- Add option to return scores separately keyed by component name with spacy evaluate --per-component, Language.evaluate(per_component=True) and Scorer.score(per_component=True). This is useful when the pipeline contains more than one of the same component, like textcat, that may have overlapping score keys.
- Typing updates for PhraseMatcher and SpanGroup.
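Why per-component scores help can be shown with plain dictionaries. The component names and score values below are made up for illustration, and the code is not spaCy's scorer. It only demonstrates how identical score keys collide when merged into one flat dictionary.

```python
# Hypothetical scores from two text classifiers in the same pipeline
topic_scores = {"cats_score": 0.91, "cats_macro_f": 0.88}
sentiment_scores = {"cats_score": 0.74, "cats_macro_f": 0.70}

# Flat merge: the later component overwrites the earlier one's keys
flat = {**topic_scores, **sentiment_scores}

# Keyed by component name: both sets of scores survive
per_component = {
    "textcat_topic": topic_scores,
    "textcat_sentiment": sentiment_scores,
}

print(flat["cats_score"])
print(per_component["textcat_topic"]["cats_score"])
```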
Trained pipelines
New trained pipelines
v3.6 introduces new pipelines for Slovenian, which use the trainable lemmatizer and floret vectors.
Package | UPOS | Parser LAS | NER F |
---|---|---|---|
sl_core_news_sm | 96.9 | 82.1 | 62.9 |
sl_core_news_md | 97.6 | 84.3 | 73.5 |
sl_core_news_lg | 97.7 | 84.3 | 79.0 |
sl_core_news_trf | 99.0 | 91.7 | 90.0 |
Pipeline updates
The English pipelines have been updated to improve handling of contractions with various apostrophes and to lemmatize “get” as a passive auxiliary.
The Danish pipeline da_core_news_trf has been updated to use vesteinn/DanskBERT with performance improvements across the board.
Notes about upgrading from v3.5
SpanGroup spans are now required to be from the same doc
When initializing a SpanGroup, there is a new check to verify that all added spans refer to the current doc. Without this check, it was possible to run into string store or other errors.
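One place this may crop up is when creating Example objects for training with custom spans: spans added to example.reference.spans must be created from example.reference, not from the predicted doc.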
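A minimal sketch of creating reference spans that pass the new check, assuming spaCy v3.6 is installed. The blank English pipeline, the example words and the "sc" spans key are assumptions chosen for illustration.

```python
import spacy
from spacy.tokens import Doc, Span
from spacy.training import Example

nlp = spacy.blank("en")
words = ["Acme", "Corp", "hired", "Ada", "."]

doc = Doc(nlp.vocab, words=words)  # predicted doc
example = Example.from_dict(doc, {"words": words})

# Create the span from the reference doc. A span created from the
# predicted doc (Span(doc, 0, 2, label="ORG")) belongs to a different
# Doc object and would not pass the same-doc check.
span = Span(example.reference, 0, 2, label="ORG")
example.reference.spans["sc"] = [span]

print([(s.text, s.label_) for s in example.reference.spans["sc"]])
```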
Pipeline package version compatibility
When you’re loading a pipeline package trained with an earlier version of spaCy v3, you will see a warning telling you that the pipeline may be incompatible. This doesn’t necessarily have to be true, but we recommend running your pipelines against your test suite or evaluation data to make sure there are no unexpected results.
If you’re using one of the trained pipelines we provide, you should run spacy download to update to the latest version. To see an overview of all installed packages and their compatibility, you can run spacy validate.
If you’ve trained your own custom pipeline and you’ve confirmed that it’s still working as expected, you can update the spaCy version requirements in the meta.json.
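The change amounts to widening the spacy_version range. The sketch below works on an in-memory dictionary. The package name and both version ranges are assumptions for a pipeline originally pinned to v3.5, and the helper function is hypothetical.

```python
import json


def update_spacy_version(meta: dict, new_range: str) -> dict:
    """Return a copy of the meta data with a new spacy_version range."""
    updated = dict(meta)
    updated["spacy_version"] = new_range
    return updated


# Assumed meta.json contents of a package trained with v3.5
meta = {
    "lang": "en",
    "name": "custom_pipeline",
    "spacy_version": ">=3.5.0,<3.6.0",
}

updated = update_spacy_version(meta, ">=3.5.0,<3.7.0")
print(json.dumps(updated, indent=2))
```

For a real package, the same edit would be applied to the meta.json file in the pipeline directory.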
Updating v3.5 configs
To update a config from spaCy v3.5 with the new v3.6 settings, run init fill-config.
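The command takes the existing config and an output path. The sketch below only assembles the call so it can be inspected before running. The two file names and the helper function are hypothetical.

```python
import sys
from typing import List


def fill_config_command(base_config: str, output_config: str) -> List[str]:
    """Build the fill-config call for the current Python interpreter."""
    return [
        sys.executable, "-m", "spacy", "init", "fill-config",
        base_config, output_config,
    ]


# Hypothetical file names for the old and the updated config
cmd = fill_config_command("config-v3.5.cfg", "config-v3.6.cfg")
print("python " + " ".join(cmd[1:]))

# To execute it (requires spaCy to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```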
In many cases (spacy train, spacy.load), the new defaults will be filled in automatically, but you’ll need to fill in the new settings to run debug config and debug data.