How to use the command 'dvc gc' (with examples)
The ‘dvc gc’ command removes unused files and directories from the DVC (Data Version Control) cache and, optionally, from remote storage, which frees up storage space. Scope options define what counts as "in use": recent DVC versions require at least one of them, such as ‘--workspace’, ‘--all-branches’, ‘--all-tags’, or ‘--all-commits’, and anything the selected scope does not reference is deleted.
Use case 1: Garbage collect from the cache, keeping only versions referenced by the current workspace
Code:
dvc gc --workspace
Motivation: The ‘--workspace’ option performs garbage collection while keeping only the file versions that the current workspace references. This is useful when you no longer need data from older commits or other branches and want to remove everything the current project state does not use.
Explanation:
dvc gc
: Command to initiate the garbage collection process.
--workspace
: Option to keep only the versions referenced by the current workspace; all other cached files and directories are removed.
Example output (illustrative; the exact messages depend on the DVC version):
Removing unused cache files...
Garbage collection completed successfully.
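To see how much space a workspace-scoped cleanup actually frees, the cache size can be compared before and after the run. The following is a minimal sketch, assuming the cache lives in the default `.dvc/cache` directory; `cache_size_kb` is a hypothetical helper, not part of DVC, and the `dvc gc` call is skipped when DVC or a `.dvc` directory is not found.

```shell
# Hypothetical helper (not part of DVC): size of a directory in KiB, 0 if absent.
# Assumes the default cache location .dvc/cache when no argument is given.
cache_size_kb() {
    dir="${1:-.dvc/cache}"
    if [ -d "$dir" ]; then
        du -sk "$dir" | cut -f1
    else
        echo 0
    fi
}

before=$(cache_size_kb)

# Run the cleanup only when DVC is installed and this is a DVC project.
if command -v dvc >/dev/null 2>&1 && [ -d .dvc ]; then
    dvc gc --workspace --force   # --force skips the confirmation prompt
fi

after=$(cache_size_kb)
echo "cache size: ${before} KiB -> ${after} KiB"
```

Dropping `--force` makes DVC ask for confirmation before deleting anything, which is the safer choice for interactive use.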
Use case 2: Garbage collect from the cache, keeping only versions referenced by branches, tags, and commits
Code:
dvc gc --all-branches --all-tags --all-commits
Motivation: The ‘--all-branches’, ‘--all-tags’, and ‘--all-commits’ options keep the versions referenced by the branches, tags, and commits in the repository, and remove only data that no part of the Git history uses. Because branch tips and tagged commits are themselves commits, ‘--all-commits’ already covers the other two. This is the most conservative cleanup and is helpful when every recorded version must remain recoverable.
Explanation:
dvc gc
: Command to initiate the garbage collection process.
--all-branches
: Option to keep the versions referenced by the latest commit of every Git branch.
--all-tags
: Option to keep the versions referenced by every Git tag.
--all-commits
: Option to keep the versions referenced by any Git commit in the repository.
Example output (illustrative; the exact messages depend on the DVC version):
Removing unused cache files...
Garbage collection completed successfully.
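The scope options differ in how much history they protect, so it can help to name the retention policies a project uses and map each one to its flags. The sketch below does that; the function `gc_scope_flags` and the policy names `workspace`, `refs`, and `history` are hypothetical and chosen only for illustration. It prints the resulting command instead of running it.

```shell
# Hypothetical helper: map a retention policy name to dvc gc scope flags.
gc_scope_flags() {
    case "$1" in
        workspace) echo "--workspace" ;;                # current workspace only
        refs)      echo "--all-branches --all-tags" ;;  # branch tips and tagged commits
        history)   echo "--all-commits" ;;              # every commit; covers the two above
        *)         echo "unknown policy: $1" >&2; return 1 ;;
    esac
}

# Print the command for review rather than executing it.
echo "dvc gc $(gc_scope_flags refs)"
```

The `refs` policy removes data that is referenced only by intermediate commits, while `history` keeps it, which is the practical difference between the narrower and broader scopes.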
Use case 3: Garbage collect from the cache, including the default cloud remote storage
Code:
dvc gc --all-commits --cloud
Motivation: The ‘--cloud’ option performs garbage collection not only in the local cache but also in the default remote storage (if one is set). This is beneficial when you want to reclaim space in both locations. Data deleted from the remote cannot be restored by DVC, so choose the scope carefully.
Explanation:
dvc gc
: Command to initiate the garbage collection process.
--all-commits
: Option to keep the versions referenced by any Git commit in the repository.
--cloud
: Option to also remove unused files from the default remote storage.
Example output (illustrative; the exact messages depend on the DVC version):
Removing unused cache files from the local cache...
Removing unused files from cloud storage...
Garbage collection completed successfully.
Use case 4: Garbage collect from the cache, including a specific cloud remote storage
Code:
dvc gc --all-commits --cloud --remote remote_name
Motivation: The ‘--remote’ option names the remote storage to clean when ‘--cloud’ is used. This is useful when you have multiple remotes and want to remove unused files and directories from one particular remote rather than the default.
Explanation:
dvc gc
: Command to initiate the garbage collection process.
--all-commits
: Option to keep the versions referenced by any Git commit in the repository.
--cloud
: Option to also remove unused files from remote storage.
--remote remote_name
: Option to specify the name of the remote storage to clean, instead of the default remote.
Example output (illustrative; the exact messages depend on the DVC version):
Removing unused cache files from the local cache...
Removing unused files from cloud storage 'remote_name'...
Garbage collection completed successfully.
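Because cleaning a remote is hard to undo, a small wrapper can insist on an explicit remote name so that the default remote is never cleaned by accident. This is a minimal sketch: `gc_remote_cmd` is a hypothetical helper, and it only prints the command so that it can be reviewed before being run by hand. The `dvc remote list` call shows the configured remotes and is skipped when DVC or a `.dvc` directory is not found.

```shell
# Hypothetical helper: build a dvc gc command that also cleans a named remote.
# Refuses to build the command when no remote name is given.
gc_remote_cmd() {
    remote="$1"
    if [ -z "$remote" ]; then
        echo "usage: gc_remote_cmd <remote_name>" >&2
        return 1
    fi
    echo "dvc gc --all-commits --cloud --remote $remote"
}

# List the configured remotes first (only inside a DVC project).
if command -v dvc >/dev/null 2>&1 && [ -d .dvc ]; then
    dvc remote list
fi

# Print the command for the remote called remote_name.
gc_remote_cmd remote_name
```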
Conclusion:
The ‘dvc gc’ command optimizes storage by removing unused files and directories. Its scope options determine which versions are kept, and ‘--cloud’ and ‘--remote’ determine whether remote storage is cleaned in addition to the local cache. Because the removal is permanent, pick a scope that keeps every version you may still need.