Exploring Python's Difflib: A Hidden Treasure for Developers
Chapter 1: Introduction to Difflib
In recent times, I've been delving into Python's built-in libraries, and it's been quite an enjoyable experience. Python offers a plethora of intriguing features that present ready-made implementations and solutions for various challenges.
Among these, I want to highlight one particular module: Difflib. It is part of Python 3's standard library, so there's nothing extra to download or install; just import it like this:
import difflib as dl
Let's dive in!
Section 1.1: Identifying Changed Elements
If you're familiar with Git, you have probably come across a raw file containing conflict markers, such as this:
<<<<<<< HEAD:file.txt
Hello world
=======
Goodbye
>>>>>>> 77976da35a11db4580b80ae27e8d65caf5208086:file.txt
Graphical Git tools like Atlassian Sourcetree present changes in a similar spirit, as a diff: a minus sign marks a line that has been removed in the updated version, while a plus sign marks an added line that was not present in the previous version.
In Python, we can replicate this functionality effortlessly using Difflib with just one line of code. The first function I'll demonstrate is context_diff().
Let's create two lists with some string elements:
s1 = ['Python', 'Java', 'C++', 'PHP']
s2 = ['Python', 'JavaScript', 'C', 'PHP']
Now, the magic happens when we generate a comparison "report":
dl.context_diff(s1, s2)
The context_diff() function yields a Python generator, allowing us to loop through it and print the results:
for diff in dl.context_diff(s1, s2):
    print(diff)
The output clearly shows that the 2nd and 3rd elements differ, marked with an exclamation mark (!), while the unchanged 1st and 4th elements are not highlighted. Let's complicate the scenario by changing the two lists:
s1 = ['Python', 'Java', 'C++', 'PHP']
s2 = ['Python', 'Java', 'PHP', 'Swift']
The output will show that "C++" has been removed from the original list, and "Swift" has been added to the new one.
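To make the markers concrete, here is a minimal sketch that runs context_diff() on both pairs of lists from this section. The variable names pairs and reports are just for illustration, and lineterm='' is passed only because the list elements don't end with newlines.

```python
import difflib as dl

# The two comparisons discussed above: one with replaced elements,
# one with a removed and an added element.
pairs = [
    (['Python', 'Java', 'C++', 'PHP'], ['Python', 'JavaScript', 'C', 'PHP']),
    (['Python', 'Java', 'C++', 'PHP'], ['Python', 'Java', 'PHP', 'Swift']),
]

reports = []
for old, new in pairs:
    # lineterm='' stops the header lines from carrying a trailing newline,
    # which matches our newline-free list elements.
    report = list(dl.context_diff(old, new, lineterm=''))
    reports.append(report)
    print('\n'.join(report))
    print()
```

In the first report the changed elements appear as "! Java" and "! C++" (and "! JavaScript" and "! C" on the new side); in the second, the removal shows up as "- C++" and the addition as "+ Swift".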
Now, if you'd rather have output that resembles what Git tools such as Sourcetree display, Python has you covered with the unified_diff() function:
dl.unified_diff(s1, s2)
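Like context_diff(), this function returns a generator. It "unifies" the two lists into a single listing in which removed lines are prefixed with a minus sign and added lines with a plus sign, which is usually more visually intuitive.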
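As a small sketch of what that looks like in practice, the snippet below collects the generator into a list and prints it. The labels 'before' and 'after' are made-up names used only for the header lines.

```python
import difflib as dl

s1 = ['Python', 'Java', 'C++', 'PHP']
s2 = ['Python', 'Java', 'PHP', 'Swift']

# fromfile/tofile only label the header; 'before' and 'after' are
# illustrative names, not real files.
unified = list(dl.unified_diff(s1, s2, fromfile='before', tofile='after', lineterm=''))
print('\n'.join(unified))
```

The result starts with the "--- before" and "+++ after" header lines and a hunk marker, followed by the elements themselves, where "-C++" and "+Swift" stand out among the unchanged lines.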
Section 1.2: Accurate Difference Indication
In the previous section, we focused on identifying differences at the line (element) level. Is it possible to drill down even further? Absolutely! If we wish to compare at the character level, we can use the ndiff() function.
For instance, consider the following lists:
['tree', 'house', 'landing']
['tree', 'horse', 'lending']
The first words are identical, while the 2nd and 3rd pairs each differ by a single character. Passing both lists to ndiff() not only marks the changed lines with minus and plus signs but also adds guide lines starting with a question mark (?), in which carets (^) point at the specific letters that differ.
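Here is a minimal sketch of that comparison. The names words_a, words_b, and ndiff_lines are chosen for illustration; the guide lines come back with their own trailing newline, so it is stripped before printing.

```python
import difflib as dl

words_a = ['tree', 'house', 'landing']
words_b = ['tree', 'horse', 'lending']

# ndiff() also returns a generator; collect it so the lines can be reused.
ndiff_lines = list(dl.ndiff(words_a, words_b))

for line in ndiff_lines:
    # Guide lines (starting with '? ') end with '\n'; strip it for tidy output.
    print(line.rstrip('\n'))
```

Each changed word appears twice, once with "- " and once with "+ ", and the "? " lines underneath place a caret under the character that changed.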
Section 1.3: Finding Close Matches
Have you ever mistyped "teh" and had it auto-corrected to "the"? I certainly have! With Difflib, you can easily implement this feature in your Python application using the get_close_matches() function.
Suppose we have a list of potential candidates and an input. This function will help us identify the closest match:
dl.get_close_matches('thme', ['them', 'that', 'this'])
In this case, it successfully identifies "them" as the closest match to "thme" (a typo), rather than "this" or "that". However, be aware that if no suitable matches are found, an empty list will be returned.
You can control the "similarity" threshold by passing a float value between 0 and 1 to the cutoff parameter. The default is 0.6; a lower value such as 0.1 allows for broader matching. The n parameter (3 by default) caps how many matches are returned.
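The sketch below shows the effect of cutoff and the empty-list case side by side. The candidate list is the one used above; the variable names and the 'xyz' input are only for illustration.

```python
import difflib as dl

candidates = ['them', 'that', 'this']

# Default cutoff (0.6): only sufficiently similar candidates are returned.
strict = dl.get_close_matches('thme', candidates)

# A very low cutoff lets weaker matches through, best match first.
loose = dl.get_close_matches('thme', candidates, cutoff=0.1)

# An input that shares nothing with the candidates gives an empty list.
nothing = dl.get_close_matches('xyz', candidates)

print(strict)
print(loose)
print(nothing)
```

With the default cutoff only "them" is returned; with cutoff=0.1 all three candidates come back, with "them" still ranked first.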
Section 1.4: Modifying Strings A to B
If you're familiar with Information Retrieval, the functions above may remind you of the Levenshtein Distance. This metric measures the difference between two textual terms as the minimum number of substitutions, insertions, and deletions required to transform one term into the other. Difflib does not actually compute it, though: according to the Python documentation, it relies on an approach similar to Ratcliff and Obershelp's "gestalt pattern matching", which repeatedly looks for the longest contiguous matching block, and the edits it reports are not guaranteed to be minimal.
For this article, I will bypass the algorithm specifics, but you can explore the Levenshtein Distance Wiki page for more details.
Interestingly, Difflib can still list the steps needed to turn one string into another, much like the edit operations behind the Levenshtein Distance. This can be accomplished with the SequenceMatcher class in Difflib.
Let's take two strings, "abcde" and "fabdc", and see how we can modify the first into the second. We start by instantiating the class:
s1 = 'abcde'
s2 = 'fabdc'
seq_matcher = dl.SequenceMatcher(None, s1, s2)
Next, we can utilize the get_opcodes() method to obtain a list of tuples that detail:
- The type of modification (replace, delete, insert, or equal)
- The starting and ending positions in the source string
- The starting and ending positions in the target string
We can present this information in a more readable format:
for tag, i1, i2, j1, j2 in seq_matcher.get_opcodes():
    print(f'{tag:7} s1[{i1}:{i2}] --> s2[{j1}:{j2}] {s1[i1:i2]!r:>6} --> {s2[j1:j2]!r}')
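To check that these opcodes really describe how to turn the first string into the second, the following sketch replays them. The helper apply_opcodes() is a hypothetical function written for this article, not part of Difflib.

```python
import difflib as dl


def apply_opcodes(a, b):
    """Rebuild b from a by replaying SequenceMatcher opcodes (illustrative helper)."""
    matcher = dl.SequenceMatcher(None, a, b)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            # Unchanged span: copy it from the source string.
            pieces.append(a[i1:i2])
        elif tag in ('replace', 'insert'):
            # New or substituted span: take it from the target string.
            pieces.append(b[j1:j2])
        # 'delete' contributes nothing to the result.
    return ''.join(pieces)


rebuilt = apply_opcodes('abcde', 'fabdc')
print(rebuilt)
```

Replaying the opcodes for "abcde" and "fabdc" gives back "fabdc", which confirms that the list of tuples is a complete recipe for the transformation.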
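Very cool, right? You might have noticed that the first argument passed to SequenceMatcher is None. This argument, isjunk, is an optional function that takes an element and returns True if that element should be treated as "junk" and ignored when looking for matches; None means nothing is ignored. To illustrate, consider this example: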
seq_matcher = dl.SequenceMatcher(lambda c: c in 'abc', s1, s2)
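In this case, the letters "a", "b", and "c" are treated as junk: the matcher will not anchor a matching block on them, so the leading "abc" is handled as a whole (replaced by "fab") rather than matched letter by letter.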
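A quick way to see the effect is to compare the opcode tags with and without the junk function. This is a small sketch using the same two strings; plain_tags and junk_tags are illustrative names.

```python
import difflib as dl

s1 = 'abcde'
s2 = 'fabdc'

# No junk function: every character may take part in a match.
plain = dl.SequenceMatcher(None, s1, s2)
plain_tags = [tag for tag, *_ in plain.get_opcodes()]

# 'a', 'b' and 'c' are junk: they cannot anchor a matching block.
junky = dl.SequenceMatcher(lambda c: c in 'abc', s1, s2)
junk_tags = [tag for tag, *_ in junky.get_opcodes()]

print(plain_tags)
print(junk_tags)
```

Without junk, "ab" and "c" are matched and the remaining characters are inserted or deleted around them. With the junk function, only "d" is matched, and everything before and after it is reported as a replacement.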
Section 1.5: Summary
In this article, I've introduced the Python built-in library, Difflib. It can generate reports showcasing the differences between two lists or strings, assist in finding the closest matching strings based on an input, and even facilitate more advanced functions through the SequenceMatcher class.