How to use the command 'comm' (with examples)
The comm command is a utility available in Unix/Linux environments that compares two sorted files line by line. It produces up to three columns: lines unique to the first file, lines unique to the second file, and lines common to both. This makes it useful in data analysis, software development, and system administration for comparing lists, configuration files, or other line-oriented text.
Use case 1: Produce three tab-separated columns: lines only in first file, lines only in second file and common lines
Code:
comm file1 file2
Motivation:
You may need to have a comprehensive view of what is unique to each file and what they share, such as when analyzing differences and similarities between two datasets or configuration lists. This can be particularly useful in scenarios like selecting unique product codes or syncing differences between two versions of a text document.
Explanation:
comm: The base command used to compare two files.
file1: The first input file to be compared.
file2: The second input file to be compared.
In this form, comm generates three columns of output: the first displays lines unique to file1, the second shows lines unique to file2, and the third contains lines common to both files. comm assumes both files are sorted in the same collation order; on unsorted input the columns will be wrong, so sort both files beforehand.
Example Output:
		Alice
Bob
		Charlie
David
	Edward
		Fiona
In this example, “Bob” and “David” are unique to file1, “Edward” is unique to file2, and “Alice”, “Charlie”, and “Fiona” are common to both.
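The example can be reproduced with a short sketch. The file contents below are an assumption, chosen to match the names used in this example, and the scratch directory is only there to avoid touching real files. In comm's default output, lines in column 2 are indented by one tab and lines in column 3 by two tabs.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Use byte-order collation so the sample data is sorted the way comm expects.
export LC_ALL=C

# Hypothetical sample data, already sorted.
printf '%s\n' Alice Bob Charlie David Fiona > file1
printf '%s\n' Alice Charlie Edward Fiona > file2

# Column 1: only in file1; column 2: only in file2; column 3: in both.
three_columns=$(comm file1 file2)
printf '%s\n' "$three_columns"
```

Lines with no leading tab belong to file1 only, lines with one leading tab to file2 only, and lines with two leading tabs to both files.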
Use case 2: Print only lines common to both files
Code:
comm -12 file1 file2
Motivation:
Use this form when you need to extract only the information shared between two files, such as finding common entries in two databases or lists, or identifying mutual contacts from two address books. This helps reduce redundancy and focus on shared data.
Explanation:
-12: Suppresses the first and second columns (the lines unique to each file), leaving only the third column (common lines).
file1: The first sorted file.
file2: The second sorted file.
The -12 option combines -1 and -2, so only lines present in both file1 and file2 are displayed.
Example Output:
Alice
Charlie
Fiona
The output lists the names (Alice, Charlie, Fiona) that appear in both file1 and file2.
Use case 3: Print only lines common to both files, reading one file from stdin
Code:
cat file1 | comm -12 - file2
Motivation:
This use case allows data to be piped directly to comm. It is beneficial when the first file is generated or transformed on the fly, or when comm is one step in a longer pipeline, because it avoids creating a temporary file.
Explanation:
cat file1: Reads the content of file1 and writes it to stdout.
|: Pipes the output from cat into the comm command.
-12: Suppresses the lines unique to each file, printing only common lines.
-: Tells comm to read the first input from standard input instead of a named file.
file2: The second sorted file.
Reading one input from stdin lets comm fit into pipelines and scripts. The piped data must still arrive sorted.
Example Output:
Alice
Charlie
Fiona
The result remains consistent, showing lines common to both files.
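As a sketch of the on-the-fly case, the first input below is produced by a pipeline rather than read from a file. The file name raw_names and the contents of both files are hypothetical; sort -u is used here to sort the stream and drop duplicate lines before comm reads it from standard input.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Use the same collation for sort and comm so their orderings agree.
export LC_ALL=C

# Hypothetical unsorted list with a duplicate entry.
printf '%s\n' Fiona Bob Alice Charlie Alice David > raw_names
# Hypothetical second list, already sorted.
printf '%s\n' Alice Charlie Edward Fiona > file2

# Sort and de-duplicate on the fly; "-" makes comm read that stream from stdin.
common=$(sort -u raw_names | comm -12 - file2)
printf '%s\n' "$common"
```

No intermediate sorted copy of raw_names is written to disk; the sorted stream exists only in the pipe.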
Use case 4: Get lines only found in first file, saving the result to a third file
Code:
comm -23 file1 file2 > file1_only
Motivation:
This usage allows for isolation of records unique to the first data set and saving them for future use, such as generating a report or update file. Efficiently managing your data often means extracting unique differences that require follow-up actions.
Explanation:
-23: Suppresses the second and third columns, displaying only lines unique to file1.
file1: The first input file, whose unique lines are wanted.
file2: The second file, against which file1 is compared.
> file1_only: Redirects the lines unique to file1 into a file named file1_only.
Output redirection (>) saves the result to a file instead of displaying it on the screen. By default, an existing file1_only is overwritten.
Example Output:
Assuming the content was redirected, file1_only will contain:
Bob
David
This isolates the lines exclusive to file1.
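A saved result is often the starting point for a follow-up step. The sketch below, with hypothetical file contents, writes the unique lines to file1_only and then acts only if that file is non-empty; the messages printed are illustrative, not part of comm.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C

# Hypothetical sample data, already sorted.
printf '%s\n' Alice Bob Charlie David Fiona > file1
printf '%s\n' Alice Charlie Edward Fiona > file2

# Keep only column 1 (lines unique to file1) and save it.
comm -23 file1 file2 > file1_only

# [ -s FILE ] is true when FILE exists and is not empty.
if [ -s file1_only ]; then
    count=$(wc -l < file1_only)
    echo "Lines only in file1: $count"
    cat file1_only
else
    echo "file1 has no unique lines"
fi
```

Testing the file with -s avoids running the follow-up action when the two inputs have no differences on the file1 side.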
Use case 5: Print lines only found in second file, when the files aren’t sorted
Code:
comm -13 <(sort file1) <(sort file2)
Motivation:
Files are often unsorted in their raw state and must be sorted before comm can compare them. This example combines comm and sort via process substitution, which is useful for quick comparisons and short scripts.
Explanation:
-13: Prints lines unique to the second file by suppressing the first and third columns.
<(sort file1): Process substitution that runs sort on file1 and feeds the sorted result into comm.
<(sort file2): Similarly, sorts file2 before feeding it into comm.
Process substitution, <(...), runs a command and hands its output to comm as if it were a file, so comm can work on unsorted input without intermediate files. It is a feature of shells such as bash, zsh, and ksh rather than of POSIX sh.
Example Output:
Edward
The output shows that “Edward” is found only in file2.
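In a shell without process substitution (plain POSIX sh, for example), the same comparison can be sketched with temporary sorted copies. The file contents and the .sorted file names below are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Use the same collation for sort and comm so their orderings agree.
export LC_ALL=C

# Hypothetical unsorted inputs.
printf '%s\n' David Alice Fiona Bob Charlie > file1
printf '%s\n' Fiona Edward Alice Charlie > file2

# Sort each input into a temporary copy, then compare the sorted copies.
sort file1 > file1.sorted
sort file2 > file2.sorted
only_in_file2=$(comm -13 file1.sorted file2.sorted)
printf '%s\n' "$only_in_file2"

# Remove the temporary sorted copies.
rm -f file1.sorted file2.sorted
```

This trades the convenience of <(...) for portability, at the cost of writing and cleaning up two extra files.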
Conclusion
The comm command is a versatile and efficient tool for text comparison on Unix/Linux systems. From the plain three-column comparison to on-the-fly piping and process substitution, comm covers the common cases of comparing sorted, line-oriented data.