Feature or enhancement
Proposal:
Increase the quality of results by making it possible to run longest-common-subsequence matching with autojunk=False with little to no degradation in performance.
The implemented GestaltSequenceMatcher gives results equivalent to SequenceMatcher(autojunk=False) by default.
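For context, a minimal sketch of the autojunk heuristic this targets, using only the existing SequenceMatcher API (the synthetic sequences below are illustrative only):

```python
import difflib

# Synthetic "files" longer than 200 lines in which every line is common.
# With the default autojunk=True, elements occurring in more than roughly 1%
# of b (once len(b) >= 200) are treated as popular junk and ignored when
# anchoring matches, which can noticeably degrade result quality.
a = ["x%d\n" % (i % 5) for i in range(300)]
b = ["pad\n"] * 10 + a

def match_points(sm):
    return sum(size for _, _, size in sm.get_matching_blocks())

fast = difflib.SequenceMatcher(None, a, b)                  # autojunk=True (default)
full = difflib.SequenceMatcher(None, a, b, autojunk=False)  # quality, but can be slow

# The autojunk run matches far fewer elements here than the autojunk=False run.
print(match_points(fast), match_points(full))
```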
| case | Myers | difflib (aj=1) | difflib (aj=0) | Gestalt | Gestalt (balancing=0.66) |
|---|---|---|---|---|---|
| [L] diff_key (15, 15) | 8 | 5 (-3) | 5 (-3) | 5 (-3) | 8 (0) |
| [L] diff_mid1 (28, 27) | 11 | 11 (0) | 11 (0) | 11 (0) | 11 (0) |
| [L] diff_mid2 (128, 128) | 73 | 72 (-1) | 72 (-1) | 72 (-1) | 72 (-1) |
| [L] diff_mid3 (131, 131) | 63 | 50 (-13) | 50 (-13) | 50 (-13) | 57 (-6) |
| [L] difflib.py (2041, 2334) | 1883 | 1878 (-5) | 1879 (-4) | 1879 (-4) | 1879 (-4) |
| [L] linux_sched.c (11373, 11832) | 10844 | 10793 (-51) | 10837 (-7) | 10837 (-7) | 10844 (0) |
| [L] react.js (3555, 3337) | 2631 | 2568 (-63) | 2581 (-50) | 2581 (-50) | 2581 (-50) |
| [L] webpack.js (2327, 3429) | 929 | 772 (-157) | 899 (-30) | 899 (-30) | 916 (-13) |
| **Total:** | 16442 | 16149 (-293) | 16334 (-108) | 16334 (-108) | 16368 (-74) |
| **Avg % diff:** | 0% | 9.94% | 8.11% | 8.11% | 1.80% |
| **Runtime (s):** | 3.311 | 0.108 | 2.789 | 0.093 | 0.113 |
The table above shows the number of individual match points (the delta relative to Myers is shown in parentheses), which is not a definitive metric. A more definitive metric could be the number of alignment points sacrificed for the sake of quality. However, it can safely be assumed that autojunk=False / the new matcher produces higher-quality diffs by not resorting to approximations. Luckily, it also increases the number of match points.
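Assuming the match points above are the summed sizes of matching blocks, they could be reproduced with something like the sketch below (the file names are placeholders):

```python
import difflib

def match_points(a_lines, b_lines, autojunk=True):
    """Sum of matched elements reported by get_matching_blocks()."""
    sm = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=autojunk)
    return sum(size for _, _, size in sm.get_matching_blocks())

with open("before.py") as f:
    old = f.read().splitlines(keepends=True)
with open("after.py") as f:
    new = f.read().splitlines(keepends=True)

print(match_points(old, new))                  # difflib (aj=1) column
print(match_points(old, new, autojunk=False))  # difflib (aj=0) column
```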
A new optional balancing argument can be tuned to reduce the chance of greedily committing to unbalanced matches. This often produces more concise diffs with more lines matched, while retaining the block-oriented nature of the algorithm.
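A minimal usage sketch, under the assumption that the proposed matcher keeps the SequenceMatcher interface; GestaltSequenceMatcher and the balancing keyword are the additions proposed in this issue, not an existing difflib API:

```python
import difflib

old = ["a\n", "b\n", "c\n", "d\n"]
new = ["a\n", "x\n", "c\n", "d\n"]

# Today: quality requires explicitly disabling the autojunk heuristic,
# which can be slow on large, repetitive inputs.
sm = difflib.SequenceMatcher(None, old, new, autojunk=False)
print(sm.get_opcodes())

# Proposed (hypothetical API, not in difflib today): equivalent quality by
# default, with an optional balancing knob for more concise block-oriented diffs.
# gsm = difflib.GestaltSequenceMatcher(None, old, new, balancing=0.66)
# print(gsm.get_opcodes())
```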
For visual inspection, the outputs of the different algorithms on a tailored case can be seen HERE!
Legacy
The O(n) solution is not that complex - about 100 lines of Python.
I hacked in an initial implementation and ran context_diff on difflib.py - i.e. on the change itself.
```python
import difflib, time

with open('/Users/Edu/lib/src/cx_amb/src/cx/cli/snd/diff/difflib.py') as f:
    s1 = f.read().splitlines(keepends=True)
with open('/Users/Edu/lib/src/cx_amb/src/cx/cli/snd/diff/difflib_new.py') as f:
    s2 = f.read().splitlines(keepends=True)

list(difflib.context_diff(s1, s2, fromfile='before.py', tofile='after.py'))
# 2.4 ms (Current)
# 2.9 ms (New)
```
So 20% slower.
Also, this is a fairly rough plug-in - there is a chance that the gap can be bridged.
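For reference, a hedged sketch of how millisecond timings like those above could be collected with timeit (the file names are placeholders, not the actual paths used):

```python
import difflib
import timeit

with open('before.py') as f:
    s1 = f.read().splitlines(keepends=True)
with open('after.py') as f:
    s2 = f.read().splitlines(keepends=True)

def one_diff():
    # Materialize the generator so the whole diff is actually computed.
    return list(difflib.context_diff(s1, s2, fromfile='before.py', tofile='after.py'))

# Average over repeated runs to get stable millisecond-scale numbers.
n = 100
secs = timeit.timeit(one_diff, number=n)
print(f"{secs / n * 1000:.2f} ms per call")
```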
The O(n) solution should also avoid some of the user dissatisfaction:
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
I made a couple of posts in an unrelated thread.
The performance tables there show that this is faster even for input sizes of 10.
https://discuss.python.org/t/include-time-and-space-complexity-in-documentation/105683/32