
difflib.GestaltSequenceMatcher implementation #144132

@dg-pb

Description


Feature or enhancement

Proposal:

Improve result quality by making it practical to run Longest Common Subsequence matching with autojunk=False, with little to no degradation in performance.

The implemented GestaltSequenceMatcher, with default settings, gives results equivalent to SequenceMatcher(autojunk=False).

| case | (len a, len b) | Myers | difflib(aj=1) | difflib(aj=0) | Gestalt | Gestalt(balancing=0.66) |
|---|---|---|---|---|---|---|
| [L] diff_key | (15, 15) | 8 | 5 (-3) | 5 (-3) | 5 (-3) | 8 (0) |
| [L] diff_mid1 | (28, 27) | 11 | 11 (0) | 11 (0) | 11 (0) | 11 (0) |
| [L] diff_mid2 | (128, 128) | 73 | 72 (-1) | 72 (-1) | 72 (-1) | 72 (-1) |
| [L] diff_mid3 | (131, 131) | 63 | 50 (-13) | 50 (-13) | 50 (-13) | 57 (-6) |
| [L] difflib.py | (2041, 2334) | 1883 | 1878 (-5) | 1879 (-4) | 1879 (-4) | 1879 (-4) |
| [L] linux_sched.c | (11373, 11832) | 10844 | 10793 (-51) | 10837 (-7) | 10837 (-7) | 10844 (0) |
| [L] react.js | (3555, 3337) | 2631 | 2568 (-63) | 2581 (-50) | 2581 (-50) | 2581 (-50) |
| [L] webpack.js | (2327, 3429) | 929 | 772 (-157) | 899 (-30) | 899 (-30) | 916 (-13) |
| Total: | | 16442 | 16149 (-293) | 16334 (-108) | 16334 (-108) | 16368 (-74) |
| Avg % diff: | | 0% | 9.94% | 8.11% | 8.11% | 1.80% |
| Runtime (s): | | 3.311 | 0.108 | 2.789 | 0.093 | 0.113 |

The table above shows the number of individual match points, which is not a definitive metric. A more definitive metric would be the number of alignment points sacrificed for the sake of quality.

However, it can be safely assumed that autojunk=False (and the new matcher) produces higher-quality diffs by not resorting to approximations. Conveniently, it also increases the number of match points.
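For reference, "match points" here presumably means the total number of matched elements, i.e. the sum of matching-block sizes (an assumption about how the table figures were computed):

```python
import difflib

a = 'qabxcd'
b = 'abycdf'
sm = difflib.SequenceMatcher(None, a, b, autojunk=False)

# Sum of matching-block sizes; the final block returned by
# get_matching_blocks() is a zero-size sentinel, so it adds nothing.
match_points = sum(block.size for block in sm.get_matching_blocks())
print(match_points)  # 4  ('ab' and 'cd' match)
```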

The new optional balancing argument can be tuned to reduce the chance of greedily committing to unbalanced matches. This often produces more concise diffs with more lines matched, while retaining the block-oriented nature of the algorithm.
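For background, the "block-oriented nature" refers to the Ratcliff/Obershelp (gestalt) strategy: commit to the longest common block, then recurse on the pieces to either side. A rough sketch of that recursion, built on the existing SequenceMatcher (illustrative only, not the proposed implementation; a balancing knob would presumably discount long but lopsided candidate blocks before committing to them):

```python
import difflib

def gestalt_match_points(a, b):
    # Greedy gestalt recursion: take the longest common block,
    # then match the regions to its left and right independently.
    sm = difflib.SequenceMatcher(None, a, b, autojunk=False)
    m = sm.find_longest_match(0, len(a), 0, len(b))
    if m.size == 0:
        return 0
    return (m.size
            + gestalt_match_points(a[:m.a], b[:m.b])
            + gestalt_match_points(a[m.a + m.size:], b[m.b + m.size:]))

print(gestalt_match_points('qabxcd', 'abycdf'))  # 4
```

Because each step commits greedily to a single block, a long match that is badly misaligned between the two sides can lock in a poor overall alignment; that is the failure mode the balancing parameter targets.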

For visual inspection, the outputs of the different algorithms on a tailored case can be seen HERE!

Legacy

The O(n) solution is not that complex: about 100 lines of Python.

I hacked in an initial implementation and ran context_diff on difflib.py itself, comparing the file before and after the change:

```python
import difflib, timeit

with open('/Users/Edu/lib/src/cx_amb/src/cx/cli/snd/diff/difflib.py') as f:
    s1 = f.read().splitlines(keepends=True)

with open('/Users/Edu/lib/src/cx_amb/src/cx/cli/snd/diff/difflib_new.py') as f:
    s2 = f.read().splitlines(keepends=True)

timeit.timeit(
    lambda: list(difflib.context_diff(s1, s2, fromfile='before.py', tofile='after.py')),
    number=100)
# 2.4 ms per run (Current)
# 2.9 ms per run (New)
```

So roughly 20% slower on this input.
Also, this is a fairly rough plug-in, so there is a chance that the gap can be bridged.

An O(n) solution should also avoid some of the user dissatisfaction:

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

I made a couple of posts in an unrelated thread. The performance tables there show that this is faster even for inputs of size 10.
https://discuss.python.org/t/include-time-and-space-complexity-in-documentation/105683/32

Linked PRs

Metadata

Labels

performance: Performance or resource usage
stdlib: Standard Library Python modules in the Lib/ directory
type-feature: A feature request or enhancement
