Skip to content
/ treesimi Public

Compute similarity between trees, e.g. dependency trees

License

Notifications You must be signed in to change notification settings

ulf1/treesimi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PyPI version PyPi downloads DOI

treesimi: Shingling for measuring tree similarity

Compute similarity between trees, e.g. dependency trees

Convert an Adjacency List into a Nested Set Table

For example, CoNLL-U's ['id', 'head'] fields form an adjacency list of a dependency tree. Traversing an adjacency list is slower than reading a nested set. Thus, converting a adjacency list to a nested set table once, makes sense if we need to read the three several times lateron.

import treesimi as ts
adjac = [(1, 0), (2, 1), (3, 1), (4, 2)]
nested = ts.adjac_to_nested(adjac)
# columns: node id, left, right, depth
# [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]

Demo: Query a nested set table

To extract a subtree we just need to iterate through the list ($O(n)$)

_, lft0, rgt0, _ = nested[1]
subtree = [(i, l, r, d) for i, l, r, d in nested if (l >= lft0) and (r <= rgt0)]
# [[2, 2, 5, 1], [4, 3, 4, 2]]

or ts.get_subtree(nested, sub_id=2)

Set node attributes

import treesimi as ts
nested = [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]
attrs = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
nested = ts.set_attr(nested, attrs)
# columns: node id, left, right, depth, attributes
# [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]

Convert Adjacency List with attributes

import treesimi as ts
adjac = [(1, 0, 'dat1'), (2, 1, 'dat2'), (3, 1, 'dat3'), (4, 2, 'dat3')]
nested = ts.adjac_to_nested_with_attr(adjac)
# columns: node id, left, right, depth
# [[1, 1, 8, 0, 'dat1'], [2, 2, 5, 1, 'dat2'], [4, 3, 4, 2, 'dat2'], [3, 6, 7, 1, 'dat4']]

Extract subtree patterns

We can extract the following patterns from one tree:

  • Depth dimension
    • Full subtrees
    • Truncate leaves
  • Sibling dimension
    • All siblings
    • Drop siblings (and their subtree)
  • Placeholder attribute field

Full subtrees

The function extract_subtrees returns all subtrees of a tree. The depth information is adjusted accordingly for each subtree.

import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.extract_subtrees(nested)
# [
#    [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
#    [[1, 4, 0, 'b'], [2, 3, 1, 'd']],
#    [[1, 2, 0, 'd']],
#    [[1, 2, 0, 'c']]
# ]

Truncate leaves

In the first step, the function trunc_leaves removes leaves of the largest depth level. The result is always an incomplete tree, and the lft and rgt values are not adjusted to indicate that there is a missing node. In the next steps, the depth level is further removed down to depth=1.

import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.trunc_leaves(nested)
# [
#   [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [6, 7, 1, 'c']]
# ]

Hint: Run trunc_leaves for each subtree extracted by extract_subtrees. Call unique_trees after each step.

Drop sibling nodes

Generate variants of a tree by dropping each node once. Again, the result is always an incomplete tree, and the lft and rgt values are not adjusted to indicate that there is a missing node.

import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.drop_nodes(nested)
# [
#   [[1, 8, 0, 'a']],
#   [[1, 8, 0, 'a'], [2, 5, 1, 'b']],
#   [[1, 8, 0, 'a']]
# ]

Hints: Create subtrees with extract_subtrees and trunc_leaves, and run drop_nodes on these subtrees. If you want to drop N nodes/leaves of a tree, then call the function twice, e.g. drop_nodes(drop_nodes(...)).

Placeholder attribute field

The replace_attr removes the data attribute of a node with a generic placeholder.

import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.replace_attr(nested, placeholder='[MASK]')
# [
#   [[1, 8, 0, '[MASK]'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
#   [[1, 8, 0, 'a'], [2, 5, 1, '[MASK]'], [3, 4, 2, 'd'], [6, 7, 1, 'c']], 
#   [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, '[MASK]'], [6, 7, 1, 'c']], 
#   [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, '[MASK]']]
# ]

Demo Notebooks about Shingling for MinHash

We recommend using the mmh3 hash function, and 32 permutations in datasketch.MinHash.

Start jupyter to run the demo notebook

source .venv/bin/activate
jupyter lab

Appendix

Installation

The treesimi git repo is available as PyPi package

pip install treesimi
pip install git ssh://[email protected]/ulf1/treesimi.git

Commands

Install a virtual environment

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir

(If your git repo is stored in a folder with whitespaces, then don't use the subfolder .venv. Use an absolute path without whitespaces.)

Python commands

  • Check syntax: flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
  • Run Unit Tests: pytest

Publish

python setup.py sdist 
twine upload -r pypi dist/*

Clean up

find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv

Support

Please open an issue for support.

Contributing

Please contribute using Github Flow. Create a branch, add commits, and open a pull request.

Acknowledgements

The "Evidence" project was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 433249742 (GU 798/27-1; GE 1119/11-1).

Maintenance

  • till 31.Aug.2023 (v0.2.0) the code repository was maintained within the DFG project 433249742
  • since 01.Sep.2023 (v0.3.0) the code repository is maintained by Ulf Hamster.

About

Compute similarity between trees, e.g. dependency trees

Resources

License

Stars

Watchers

Forks

Sponsor this project

 

Packages

No packages published

Languages