Compute similarity between trees, e.g. dependency trees
For example, CoNLL-U's ['id', 'head']
fields form an adjacency list of a dependency tree.
Traversing an adjacency list is slower than reading a nested set.
Thus, converting a adjacency list to a nested set table once, makes sense if we need to read the three several times lateron.
import treesimi as ts
adjac = [(1, 0), (2, 1), (3, 1), (4, 2)]
nested = ts.adjac_to_nested(adjac)
# columns: node id, left, right, depth
# [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]
To extract a subtree we just need to iterate through the list ($O(n)$)
_, lft0, rgt0, _ = nested[1]
subtree = [(i, l, r, d) for i, l, r, d in nested if (l >= lft0) and (r <= rgt0)]
# [[2, 2, 5, 1], [4, 3, 4, 2]]
or ts.get_subtree(nested, sub_id=2)
import treesimi as ts
nested = [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]
attrs = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
nested = ts.set_attr(nested, attrs)
# columns: node id, left, right, depth, attributes
# [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
import treesimi as ts
adjac = [(1, 0, 'dat1'), (2, 1, 'dat2'), (3, 1, 'dat3'), (4, 2, 'dat3')]
nested = ts.adjac_to_nested_with_attr(adjac)
# columns: node id, left, right, depth
# [[1, 1, 8, 0, 'dat1'], [2, 2, 5, 1, 'dat2'], [4, 3, 4, 2, 'dat2'], [3, 6, 7, 1, 'dat4']]
We can extract the following patterns from one tree:
- Depth dimension
- Full subtrees
- Truncate leaves
- Sibling dimension
- All siblings
- Drop siblings (and their subtree)
- Placeholder attribute field
The function extract_subtrees
returns all subtrees of a tree.
The depth information is adjusted accordingly for each subtree.
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.extract_subtrees(nested)
# [
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
# [[1, 4, 0, 'b'], [2, 3, 1, 'd']],
# [[1, 2, 0, 'd']],
# [[1, 2, 0, 'c']]
# ]
In the first step, the function trunc_leaves
removes leaves of the largest depth level.
The result is always an incomplete tree, and the lft
and rgt
values are not adjusted to indicate that there is a missing node.
In the next steps, the depth level is further removed down to depth=1
.
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.trunc_leaves(nested)
# [
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [6, 7, 1, 'c']]
# ]
Hint: Run trunc_leaves
for each subtree extracted by extract_subtrees
. Call unique_trees
after each step.
Generate variants of a tree by dropping each node once.
Again, the result is always an incomplete tree, and the lft
and rgt
values are not adjusted to indicate that there is a missing node.
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.drop_nodes(nested)
# [
# [[1, 8, 0, 'a']],
# [[1, 8, 0, 'a'], [2, 5, 1, 'b']],
# [[1, 8, 0, 'a']]
# ]
Hints: Create subtrees with extract_subtrees
and trunc_leaves
, and run drop_nodes
on these subtrees. If you want to drop N nodes/leaves of a tree, then call the function twice, e.g. drop_nodes(drop_nodes(...))
.
The replace_attr
removes the data attribute of a node with a generic placeholder.
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.replace_attr(nested, placeholder='[MASK]')
# [
# [[1, 8, 0, '[MASK]'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
# [[1, 8, 0, 'a'], [2, 5, 1, '[MASK]'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, '[MASK]'], [6, 7, 1, 'c']],
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, '[MASK]']]
# ]
We recommend using the mmh3
hash function, and 32 permutations in datasketch.MinHash
.
- Create subtrees as shingle sets
- Jaccard Similarity between Dependency Trees
- Shingle Dependency Trees for datasketch's Minhash
Start jupyter to run the demo notebook
source .venv/bin/activate
jupyter lab
The treesimi
git repo is available as PyPi package
pip install treesimi
pip install git ssh://[email protected]/ulf1/treesimi.git
Install a virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir
(If your git repo is stored in a folder with whitespaces, then don't use the subfolder .venv
. Use an absolute path without whitespaces.)
- Check syntax:
flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
- Run Unit Tests:
pytest
Publish
python setup.py sdist
twine upload -r pypi dist/*
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
Please open an issue for support.
Please contribute using Github Flow. Create a branch, add commits, and open a pull request.
The "Evidence" project was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 433249742 (GU 798/27-1; GE 1119/11-1).
- till 31.Aug.2023 (v0.2.0) the code repository was maintained within the DFG project 433249742
- since 01.Sep.2023 (v0.3.0) the code repository is maintained by Ulf Hamster.