How to use the command duperemove (with examples)
- Linux
- December 25, 2023
Duperemove is a command-line tool that finds duplicate filesystem extents and can optionally submit them for deduplication. An extent is a contiguous region of a file as the filesystem stores it; when regions of different files hold identical content, a filesystem that supports extent sharing can make all of them reference a single copy. Duperemove provides options to customize its behavior, including storing extent hashes in a hash file and limiting the number of I/O and CPU threads. Deduplication works on Btrfs and, experimentally, on XFS.
Use case 1: Search for duplicate extents in a directory and show them
Code:
duperemove -r path/to/directory
Motivation: Without -d, duperemove only reports the duplicate extents it finds and does not deduplicate anything. This lets you see how much duplicate data a directory holds, in whole files or in parts of files, before deciding whether to deduplicate it or to remove redundant files yourself.
Explanation:
- duperemove: The command itself.
- -r: Specifies that we want to recursively search for duplicate extents in the given directory.
- path/to/directory: The path to the directory where the command should search for duplicates.
Example output (illustrative; the exact wording depends on the duperemove version):
Searching for duplicate extents in path/to/directory...
No duplicate extents found.
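To try the report-only mode without touching real data, one option is to scan a throwaway directory that holds two identical files. The following is a minimal sketch, not a recommended procedure: the helper write_sample and the file names a.bin and b.bin are made up for illustration, and the scan is skipped if duperemove is not installed.

```shell
#!/bin/sh
# Scratch directory for the experiment; remove it with rm -r when finished.
demo_dir=$(mktemp -d)

# write_sample is a hypothetical helper that writes 1 MiB of a fixed pattern.
# Calling it twice yields two independently written, byte-identical files.
write_sample() {
    dd if=/dev/zero bs=1024 count=1024 2>/dev/null | tr '\0' 'x' > "$1"
}
write_sample "$demo_dir/a.bin"
write_sample "$demo_dir/b.bin"

# Report-only scan: no -d is given, so duperemove is not asked to deduplicate.
if command -v duperemove >/dev/null 2>&1; then
    duperemove -r "$demo_dir" || echo "scan failed (unsupported filesystem?)" >&2
else
    echo "duperemove is not installed; skipping the scan" >&2
fi
```

The scratch directory usually lands under /tmp, which is often not Btrfs or XFS, so the scan may report an error there; pointing mktemp at a directory on a supported filesystem (for example with mktemp -d -p) avoids that.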
Use case 2: Deduplicate the extents found on a Btrfs or XFS (experimental) filesystem
Code:
duperemove -r -d path/to/directory
Motivation: If you are using a supported filesystem like Btrfs or XFS, this use case allows you to deduplicate the identified duplicate extents. By deduplicating duplicate extents, you can save storage space by eliminating redundant data.
Explanation:
- -d: Specifies that we want to schedule the duplicate extents for deduplication.
- path/to/directory: The path to the directory where the command should search for duplicates.
Example output (illustrative; the exact wording depends on the duperemove version):
Checking duplicate extents in path/to/directory...
Found 10 duplicate extents.
Scheduled 8 extents for deduplication.
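Because -d only makes sense on a supported filesystem, a script can check the filesystem type before asking for deduplication. The sketch below is one way to do that under stated assumptions: supports_dedupe and dedupe_dir are hypothetical helper names, and stat -f -c %T is GNU coreutils syntax that may not exist on other systems.

```shell
#!/bin/sh
# supports_dedupe succeeds if the given filesystem type name is one the
# article lists as supported for deduplication (Btrfs, or XFS experimentally).
supports_dedupe() {
    case "$1" in
        btrfs|xfs) return 0 ;;
        *) return 1 ;;
    esac
}

# dedupe_dir is a hypothetical wrapper: it looks up the filesystem type of the
# directory and runs duperemove with -d only when that type is supported.
dedupe_dir() {
    fs_type=$(stat -f -c %T "$1") || return 1   # GNU stat; prints e.g. btrfs
    if supports_dedupe "$fs_type"; then
        duperemove -r -d "$1"
    else
        echo "skipping $1: filesystem is $fs_type, not btrfs or xfs" >&2
        return 1
    fi
}
```

With these functions loaded, dedupe_dir path/to/directory would run the command from this use case only if the check passes.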
Use case 3: Use a hash file to store extent hashes
Code:
duperemove -r -d --hashfile=path/to/hashfile path/to/directory
Motivation: When searching for duplicate extents, Duperemove needs to calculate and compare hash values of the extents. This can take a lot of memory, especially for large directories. Using a hash file can reduce memory usage and allow you to reuse the hash values on subsequent runs of the command.
Explanation:
- --hashfile=path/to/hashfile: Specifies the path to the hash file that will store the extent hashes.
- All other arguments remain the same as in the previous example.
Example output (illustrative; the exact wording depends on the duperemove version):
Checking duplicate extents in path/to/directory...
Found 10 duplicate extents.
Scheduled 8 extents for deduplication.
Hash file created: path/to/hashfile
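Since the benefit of a hash file comes from reusing it, repeated runs need to point at the same file every time. The sketch below shows one possible convention, assuming a per-directory hash file kept in a cache directory outside the scanned tree. The helper hashfile_for and the DUPEREMOVE_CACHE variable are inventions for this example, not part of duperemove.

```shell
#!/bin/sh
# hashfile_for is a hypothetical helper: it maps a directory to a stable
# hash file path under a cache directory, so repeated runs reuse the same file.
hashfile_for() {
    # DUPEREMOVE_CACHE is a made-up variable for this sketch, not a duperemove setting.
    cache_dir=${DUPEREMOVE_CACHE:-$HOME/.cache/duperemove}
    # Turn slashes into underscores so each scanned directory gets its own file.
    name=$(printf '%s' "$1" | tr '/' '_')
    printf '%s/%s.hash\n' "$cache_dir" "$name"
}

target=path/to/directory
hashfile=$(hashfile_for "$target")

# Run only when duperemove is installed and the target directory exists.
if command -v duperemove >/dev/null 2>&1 && [ -d "$target" ]; then
    mkdir -p "$(dirname "$hashfile")"
    duperemove -r -d --hashfile="$hashfile" "$target"
fi
```

Running the script a second time passes the same --hashfile path, which is what allows the stored hashes to be reused.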
Use case 4: Limit I/O threads and CPU threads
Code:
duperemove -r -d --hashfile=path/to/hashfile --io-threads=N --cpu-threads=N path/to/directory
Motivation: In some cases, you may want to limit the number of I/O and CPU threads that Duperemove uses. This can be useful if you want to control resource usage to prevent excessive system load or prioritize other tasks.
Explanation:
- --io-threads=N: Specifies the number of I/O threads to be used (N is the desired number).
- --cpu-threads=N: Specifies the number of CPU threads to be used (N is the desired number).
- All other arguments remain the same as in the previous examples.
Example output (illustrative; the exact wording depends on the duperemove version):
Checking duplicate extents in path/to/directory...
Found 10 duplicate extents.
Scheduled 8 extents for deduplication.
Using 4 I/O threads and 2 CPU threads.
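The value of N is left to the user. As an illustration only, the sketch below picks half of the available cores for both options, which is an arbitrary policy chosen for this example rather than a recommendation from duperemove. The helper half_of is hypothetical, nproc comes from GNU coreutils, and the script prints the resulting command instead of running it.

```shell
#!/bin/sh
# half_of prints half of the given count, rounded down, but never less than 1.
half_of() {
    n=$(( $1 / 2 ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# nproc is GNU coreutils; fall back to a single core if it is unavailable.
cores=$(nproc 2>/dev/null || echo 1)
threads=$(half_of "$cores")

# Print the command that would be run, so it can be reviewed first.
echo "duperemove -r -d --hashfile=path/to/hashfile --io-threads=$threads --cpu-threads=$threads path/to/directory"
```

The two options do not have to share a value; separate numbers can be passed if I/O and CPU load need to be limited differently.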
Conclusion:
Duperemove is a command-line tool for finding duplicate filesystem extents and deduplicating them. Its options let you store hashes in a reusable hash file and limit the I/O and CPU threads it uses, and it can deduplicate on Btrfs and, experimentally, on XFS. By using Duperemove, you can reclaim the storage space that redundant data would otherwise occupy.