How to Use the Command 'uniq' (with Examples)

The uniq command is a utility for filtering duplicate lines from input data, making it essential for data cleaning, text processing, and many other tasks. It reads a file or standard input line by line and prints each distinct line, along with additional details depending on the command-line options. A key aspect of using uniq effectively is sorting the input first, because uniq only detects duplicates on consecutive lines. This functionality is particularly useful when managing large datasets, log files, or any scenario that requires identifying unique records.
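The adjacency requirement is easy to demonstrate with a quick sketch (using printf to stand in for a file):

```shell
# uniq only collapses *adjacent* duplicates: without sorting,
# the second "apple" survives because "orange" sits between the pair.
printf 'apple\norange\napple\n' | uniq
# Prints: apple, orange, apple (three lines)

# Sorting first makes the duplicates adjacent, so uniq collapses them.
printf 'apple\norange\napple\n' | sort | uniq
# Prints: apple, orange (two lines)
```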

Use Case 1: Display Each Line Once

Code:

sort path/to/file | uniq

Motivation:
Sometimes, we face situations where we need to remove duplicate entries to get a clean list of unique lines from a file or input stream. This is particularly useful in data processing tasks where repeated entries can skew analysis results or when preparing data for further downstream processing.

Explanation:

  • sort: This command sorts the input file line by line. Sorting is necessary because uniq alone cannot detect non-consecutive duplicates. By ensuring that repeated entries are adjacent, the effectiveness of uniq is maximized.
  • uniq: This command outputs each unique line once. After sorting, it picks the first of every set of duplicate lines, which achieves the aim of filtering out all repeated lines.

Example Output:
Given a file containing the lines:

apple
orange
banana
apple
orange
grape

After applying sort path/to/file | uniq, the output will be:

apple
banana
grape
orange
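As an aside, sort's own -u flag combines sorting and de-duplication in a single step, producing the same result as the pipeline above:

```shell
# `sort -u` is equivalent here to `sort file | uniq`
# (printf stands in for the file in this sketch).
printf 'apple\norange\nbanana\napple\norange\ngrape\n' | sort -u
```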

Use Case 2: Display Only Unique Lines

Code:

sort path/to/file | uniq -u

Motivation:
When analyzing datasets, sometimes the interest lies in identifying which entries are not duplicated, rather than listing each entry once. This use case arises when we want to extract only those entries that are unique in the context of the entire input.

Explanation:

  • sort: As previously explained, sorting is crucial to ensure that duplicate lines become consecutive so uniq can process them properly.
  • uniq -u: Within the sorted input, the -u flag makes uniq print only the lines that appear exactly once, i.e. those that are not repeated anywhere in the file.

Example Output:
For a given file:

apple
orange
banana
apple
orange
grape
pear

The command sort path/to/file | uniq -u will yield:

banana
grape
pear

Use Case 3: Display Only Duplicate Lines

Code:

sort path/to/file | uniq -d

Motivation:
There are scenarios where knowing which entries are duplicated in a dataset is more important than knowing which are unique. In cases such as identifying recurring errors in log files, duplicates may indicate underlying issues that need attention.

Explanation:

  • sort: Pre-sorting the file gathers all duplicates, ensuring they are adjacent.
  • uniq -d: This option instructs uniq to print only those lines that appear more than once in the input.

Example Output:
With the file:

apple
orange
banana
apple
orange
grape

Applying sort path/to/file | uniq -d results in:

apple
orange
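Note that -d prints a single copy of each duplicated line. GNU uniq also offers -D (a GNU extension, not available in every implementation), which prints every occurrence of the duplicated lines instead:

```shell
# -d: one copy per duplicated line
printf 'apple\napple\norange\ngrape\n' | uniq -d
# Prints: apple

# -D (GNU uniq): all occurrences of duplicated lines
printf 'apple\napple\norange\ngrape\n' | uniq -D
```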

Use Case 4: Display Number of Occurrences of Each Line Along with That Line

Code:

sort path/to/file | uniq -c

Motivation:
Quantifying the frequency of occurrence of each line is fundamental in text analysis and reporting. This capability helps in generating statistics, such as counting occurrences of user IDs in a log file for audit purposes.

Explanation:

  • sort: Sorts the input so that duplicate lines are consecutive, which is necessary for accurate counting.
  • uniq -c: This flag causes uniq to prefix each line with the number of times it has occurred.

Example Output:
For a file with:

apple
orange
banana
apple
orange
grape

The command sort path/to/file | uniq -c provides:

      2 apple
      1 banana
      1 grape
      2 orange
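Because uniq -c separates the count and the line with whitespace, the output is easy to post-process. As a small sketch (not part of the original examples), awk can reorder the columns into tab-separated "line count" pairs:

```shell
# Reshape `uniq -c` output: $1 is the count, $2 is the line.
printf 'apple\napple\norange\n' | sort | uniq -c | awk '{print $2 "\t" $1}'
# Prints: apple<TAB>2, orange<TAB>1
```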

Use Case 5: Display Number of Occurrences of Each Line, Sorted by the Most Frequent

Code:

sort path/to/file | uniq -c | sort -nr

Motivation:
When tasked with identifying the most frequently occurring entries, sorting the counts in descending order provides immediate insights into which entries are the most prevalent. Useful in data analytics, marketing, and operations to prioritize high-frequency items or issues.

Explanation:

  • sort: Initially sorts input data to align duplicates.
  • uniq -c: Counts occurrences of each line.
  • sort -nr: Re-sorts the output of uniq -c numerically (-n) and in reverse (-r) order, listing lines from most to least frequent.

Example Output:
Given a file:

apple
orange
banana
apple
orange
grape
orange
apple

The command sort path/to/file | uniq -c | sort -nr outputs:

      3 apple
      3 orange
      1 banana
      1 grape
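When only the most frequent entries matter, the frequency-sorted list can be trimmed with head (the sample data here is illustrative):

```shell
# Keep only the top 2 most frequent lines.
printf 'apple\norange\nbanana\napple\norange\norange\n' \
  | sort | uniq -c | sort -nr | head -n 2
# Prints "3 orange" then "2 apple" (each prefixed by uniq -c's padding)
```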

Conclusion:

The uniq command serves as a powerful tool for managing and analyzing textual data by filtering, counting, and identifying duplicate and unique entries. Combined with sort, its few options cover a wide range of text-processing tasks with clear, predictable results.
