Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The cost is not optimal #89

Open
aeft opened this issue May 18, 2024 · 0 comments
Open

The cost is not optimal #89

aeft opened this issue May 18, 2024 · 0 comments

Comments

@aeft
Copy link

aeft commented May 18, 2024

from_tree = json.build_tree(["de", 1])
to_tree = json.build_tree([2])

print(from_tree.diff(to_tree).edited_cost())

for edit in from_tree.get_all_edits(to_tree):
    print(edit)

output:

4                                                                                                                                                                                                       
Insert(to_insert=IntegerNode(2), insert_into=ListNode((StringNode('de'), IntegerNode(1))))                                                                                                              
Remove(StringNode('de'), remove_from=ListNode((StringNode('de'), IntegerNode(1))))
Remove(IntegerNode(1), remove_from=ListNode((StringNode('de'), IntegerNode(1))))

But the minimal cost should be 3.
output:

3                                                                                                                                                                                                       
Remove(StringNode('de'), remove_from=ListNode((StringNode('de'), IntegerNode(1))))                                                                                                                      
Match(match_from=IntegerNode(1), match_to=IntegerNode(2), cost=1)

I found the problem is in EditDistance._best_match:

dcost = (self.costs[row - 1][col - 1], self.path_costs[row - 1][col - 1])
lcost = (self.costs[row][col - 1], self.path_costs[row][col - 1])
ucost = (self.costs[row - 1][col], self.path_costs[row - 1][col])
diag_is_best = dcost <= lcost and dcost <= ucost
if diag_is_best:
    make_distinct(self.edit_matrix[row][col], self.edit_matrix[row][0], self.edit_matrix[0][col])
if diag_is_best and \ 
        self.edit_matrix[row][col].bounds() < self.edit_matrix[row][0].bounds() and \
        self.edit_matrix[row][col].bounds() < self.edit_matrix[0][col].bounds():  # condition1
    brow, bcol, edit = row - 1, col - 1, self.edit_matrix[row][col]
elif ucost <= dcost:  # condition2
    brow, bcol, edit = row - 1, col, self.edit_matrix[row][0]
else: # condition3
    brow, bcol, edit = row, col - 1, self.edit_matrix[0][col]
self.path_costs[row][col] = self.path_costs[brow][bcol]   1
self.costs[row][col] = self.costs[brow][bcol]   edit.bounds().upper_bound
# In the above example, when we call _best_match(1, 1).
# value information:
# self.edit_matrix[row][col].bounds() = (2,2)
# self.edit_matrix[0][col].bounds() = (2,2)
# ucost = (2, 1)
# dcost = (0, 0)
# lcost = (1, 1)

brow = row, bcol = col-1 (condition3 success) because self.edit_matrix[row][col].bounds() == self.edit_matrix[0][col].bounds() (condition1 fail) and ucost > dcost (condition2 fail).
But actually self.costs[row][col-1] self.edit_matrix[0][col].bounds().upper_bound > self.costs[row-1][col-1] self.edit_matrix[row][col].bounds().upper_bound. The optimal self.costs[row][col] should be self.costs[row-1][col-1] self.edit_matrix[row][col].bounds().upper_bound.

Here is a fix (success in the above example, may need further testing):

make_distinct(self.edit_matrix[row][col], self.edit_matrix[row][0], self.edit_matrix[0][col])
dcost = (self.costs[row - 1][col - 1]   self.edit_matrix[row][col].bounds().upper_bound, self.path_costs[row - 1][col - 1])
lcost = (self.costs[row][col - 1]   self.edit_matrix[0][col].bounds().upper_bound, self.path_costs[row][col - 1])
ucost = (self.costs[row - 1][col]   self.edit_matrix[row][0].bounds().upper_bound, self.path_costs[row - 1][col])
if dcost <= lcost and dcost <= ucost:
    brow, bcol, edit = row - 1, col - 1, self.edit_matrix[row][col]
elif ucost <= dcost and ucost <= lcost:
    brow, bcol, edit = row - 1, col, self.edit_matrix[row][0]
else:
    brow, bcol, edit = row, col - 1, self.edit_matrix[0][col]
self.path_costs[row][col] = self.path_costs[brow][bcol]   1
self.costs[row][col] = self.costs[brow][bcol]   edit.bounds().upper_bound

In summary, we should use the final real cost to guide transition. Like the canonical implementation of the Levenshtein distance metric:

dist[row][col] = min(dist[row - 1][col]   1,
                                 dist[row][col - 1]   1,
                                 dist[row - 1][col - 1]   cost)

If this is a bug, I would be happy to provide a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant