How to Use the Command 'dvc gc' (with Examples)
The dvc gc command in DVC (Data Version Control) removes unused files and directories from the local cache and, when asked to, from remote storage. Data science and machine learning projects tend to accumulate large datasets and many model iterations, so managing storage efficiently becomes crucial. Running dvc gc frees space by deleting data that is no longer referenced, so your cache or remote storage doesn’t fill up with outdated files. The command needs at least one scope option (such as --workspace or --all-commits) that tells it which data to keep. Below, we explore several use cases of the dvc gc command with examples and detailed explanations.
Use Case 1: Garbage Collect from the Cache, Keeping Only Versions Referenced by the Current Workspace
Code:
dvc gc --workspace
Motivation:
Imagine you’re actively working on a project and have run multiple experiments, creating various data and model files. Once you settle on the versions you’re going to use or deploy, many of these intermediate files are obsolete for your current work and just eat up disk space. The --workspace option makes the cache retain only the files your workspace currently references, clearing out the rest.
Explanation:
dvc gc: Invokes DVC’s garbage collection, which removes unused data from the project cache.
--workspace: Keeps only the data referenced by your project’s current workspace. Everything else in the cache is removed, including data referenced only by other commits, branches, or tags.
Example Output:
Files removed:
data/processed/old_data.csv
models/interim/checkpoint-092.pt
cache size decreased by 3.5 GB
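Because garbage collection deletes data, it can be worth previewing the result first. The following is a minimal sketch, assuming a recent DVC release whose dvc gc accepts the --dry option (confirm with dvc gc --help); the gc_preview function name is hypothetical and not part of DVC.

```shell
#!/bin/sh
# Hypothetical helper: preview what `dvc gc --workspace` would remove.
# Assumption: the installed DVC supports `--dry` for `dvc gc`.
gc_preview() {
    if command -v dvc >/dev/null 2>&1; then
        # --dry asks DVC to report what it would do without deleting anything.
        dvc gc --workspace --dry
    else
        echo "dvc not found on PATH; skipping preview" >&2
        return 127
    fi
}
```

Calling gc_preview from the project root should list what would be removed without deleting anything; once the list looks right, run dvc gc --workspace without --dry.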
Use Case 2: Garbage Collect from the Cache, Keeping Only Versions Referenced by Branches, Tags, and Commits
Code:
dvc gc --all-branches --all-tags --all-commits
Motivation:
In larger projects with extensive branching and tagging strategies, you might create lots of temporary branches or tags for various purposes. However, as your project evolves, many of these become obsolete. It’s unnecessary to keep all versioned files if they are not referenced by any active branch, tag, or commit anymore. This command helps to maintain storage efficiency across the entire versioning history of your project.
Explanation:
--all-branches: Preserves file versions referenced by any branch in the project.
--all-tags: Preserves file versions referenced by any Git tag.
--all-commits: Preserves file versions referenced by any commit in the repository’s Git history. Since that already covers every branch and tag, the other two options are redundant when this one is present.
Example Output:
Unused cache files removed, reducing storage by 10.2 GB
Total files removed: 145
Remaining cache size: 24.7 GB
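To check how much space a run actually reclaimed, you can measure the cache directory before and after. This is a rough sketch, assuming the cache lives in the default .dvc/cache directory; cache_kb and report_reclaimed are hypothetical helper names, not DVC commands.

```shell
#!/bin/sh
# Print the size of a directory in kilobytes.
# Defaults to .dvc/cache (assumed default cache location; adjust if yours differs).
cache_kb() {
    du -sk "${1:-.dvc/cache}" 2>/dev/null | cut -f1
}

# Run a command and report how many kilobytes the given directory shrank by.
# Usage: report_reclaimed <cache_dir> <command> [args...]
report_reclaimed() {
    cache_dir=$1
    shift
    before=$(cache_kb "$cache_dir")
    "$@" || return $?
    after=$(cache_kb "$cache_dir")
    echo "reclaimed $(( ${before:-0} - ${after:-0} )) KB"
}
```

For example, report_reclaimed .dvc/cache dvc gc --all-commits wraps the command from this use case and prints the difference in disk usage afterwards.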
Use Case 3: Garbage Collect from the Cache, Including the Default Cloud Remote Storage (if set)
Code:
dvc gc --all-commits --cloud
Motivation:
When a project relies heavily on cloud storage as its remote backend, keeping only essential data in that remote matters, particularly for controlling storage costs. Specifying --cloud instructs DVC to clean the default remote in addition to the local cache, so data not referenced by any commit is purged from both. Double-check the scope options before running it against a remote.
Explanation:
--all-commits: As in the previous use case, keeps all versions referenced by any commit in the project.
--cloud: Extends garbage collection beyond the local cache to the default remote.
Example Output:
Cloud cache cleanup completed, freeing up 15.3 GB
Number of cloud files deleted: 78
Use Case 4: Garbage Collect from the Cache, Including a Specific Cloud Remote Storage
Code:
dvc gc --all-commits --cloud --remote remote_name
Motivation:
When managing remote storage across multiple cloud services or regions, you may want to target a specific remote storage for cleanup operations. This is useful when you have different remotes for specific purposes or different stages of analysis. Using this command ensures that you have precise control over which remote storage you intend to clean up, providing flexibility in managing your data dependencies across different environments.
Explanation:
--all-commits: As before, keeps data referenced by any project commit.
--cloud: Extends the operation to remote storage, not just the local cache.
--remote remote_name: Selects which remote to target for garbage collection, identified by its name (remote_name here is a placeholder).
Example Output:
Garbage collection on 'remote_name' completed successfully
Released 8.7 GB from specified remote storage
Files removed: 43
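A mistyped remote name is easy to miss, so a small guard can check it before anything is deleted. This sketch assumes dvc remote list prints one remote per line with the name in the first column; remote_exists and gc_remote are hypothetical helpers, not DVC commands.

```shell
#!/bin/sh
# Return success if the named remote appears in `dvc remote list`.
# Assumption: the remote name is the first whitespace-separated field per line.
remote_exists() {
    dvc remote list | awk '{print $1}' | grep -Fxq -- "$1"
}

# Garbage collect a specific remote, refusing to run for unknown names.
gc_remote() {
    name=$1
    if ! remote_exists "$name"; then
        echo "unknown remote: $name" >&2
        return 1
    fi
    dvc gc --all-commits --cloud --remote "$name"
}
```

Running gc_remote remote_name then behaves like the command in this use case, but stops early if the name does not match a configured remote.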
Conclusion
The dvc gc command is instrumental in keeping a data environment tidy and efficient, especially in projects with numerous data iterations and differing storage needs. By selectively removing unused data, whether local or remote, it saves space and can reduce storage costs. Understanding these use cases allows project maintainers to make informed decisions about their storage clean-up strategies with DVC.