Finding Duplicate Files with PowerShell

For one of our projects, we needed a PowerShell script to find duplicate files in shared network folders on a server. There are plenty of utilities for finding and removing duplicate files on Windows, but most of them are commercial or poorly suited for automation.

Since files can have different names but identical content, comparing them by name alone is not reliable. It is better to calculate the hashes of all files and look for matching ones.

The following PowerShell one-liner recursively scans the specified folder (including subfolders) and finds duplicate files. In this example, two files with identical content (identical hashes) were found:

Get-ChildItem -Path C:\Share\ -Recurse -File | Get-FileHash | Group-Object -Property Hash | Where-Object { $_.Count -gt 1 } | ForEach-Object { $_.Group | Select-Object Path, Hash }

This PowerShell one-liner is handy for finding duplicates, but its performance leaves much to be desired. If the directory contains many files and a hash is calculated for each of them, the command takes a long time. It is faster to first compare files by size (a ready-made file attribute that does not need to be calculated) and get hashes only for files with the same size:

$file_duplicates = Get-ChildItem -Path C:\Share\ -Recurse -File | Group-Object -Property Length | Where-Object { $_.Count -gt 1 } | Select-Object -ExpandProperty Group | Get-FileHash | Group-Object -Property Hash | Where-Object { $_.Count -gt 1 } | ForEach-Object { $_.Group | Select-Object Path, Hash }

You can compare the performance of both commands on a test folder using Measure-Command:

Measure-Command {your_powershell_command}
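
For example, here is a rough sketch of timing both approaches against the same sample folder (C:\Share is the path used above; your times will vary):

# Hash every file in the folder (slow on large shares):
Measure-Command { Get-ChildItem -Path C:\Share\ -Recurse -File | Get-FileHash | Group-Object -Property Hash | Where-Object { $_.Count -gt 1 } }
# Pre-filter by file size, then hash only the potential duplicates (much faster):
Measure-Command { Get-ChildItem -Path C:\Share\ -Recurse -File | Group-Object -Property Length | Where-Object { $_.Count -gt 1 } | Select-Object -ExpandProperty Group | Get-FileHash | Group-Object -Property Hash | Where-Object { $_.Count -gt 1 } }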

On a small test directory with 2000 files, the second command turned out to be roughly two orders of magnitude faster (about 3 seconds versus 10 minutes).

You can prompt the user to select which files to delete. To do this, pipe the list of duplicate files to the Out-GridView cmdlet:

$file_duplicates | Out-GridView -Title "Select files to delete" -PassThru | Remove-Item -Verbose -WhatIf

In the grid view, the user can select the files to delete (hold down Ctrl to select multiple files) and click OK. Note that with the -WhatIf parameter Remove-Item only shows which files would be deleted; remove it to actually delete the files.

Instead of deleting them, you can move the selected files to another directory using Move-Item.
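
For example, a minimal sketch that moves the selected duplicates instead of deleting them (D:\Duplicates is an assumed destination folder, not part of the original example):

# Create the assumed destination folder if it does not exist
New-Item -ItemType Directory -Path D:\Duplicates -Force | Out-Null
# Let the user pick duplicates in the grid view and move them
$file_duplicates | Out-GridView -Title "Select files to move" -PassThru | Move-Item -Destination D:\Duplicates -Verbose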

You can use a script to replace duplicate files with hard links. This approach keeps the files accessible in their original locations while significantly saving disk space. Script code:

param(
    [Parameter(Mandatory=$True)]
    [ValidateScript({Test-Path -Path $_ -PathType Container})]
    [string]$dir1,
    [Parameter(Mandatory=$True)]
    [ValidateScript({(Test-Path -Path $_ -PathType Container) -and $_ -ne $dir1})]
    [string]$dir2
)
# Group files by size first, then by hash, to find duplicates across both directories
Get-ChildItem -Recurse -File $dir1, $dir2 |
    Group-Object Length | Where-Object {$_.Count -ge 2} |
    Select-Object -ExpandProperty Group | Get-FileHash |
    Group-Object Hash | Where-Object {$_.Count -ge 2} |
    ForEach-Object {
        # Replace the first copy with a hard link to the second one
        $f1 = $_.Group[0].Path
        Remove-Item $f1
        New-Item -ItemType HardLink -Path $f1 -Target $_.Group[1].Path | Out-Null
        #fsutil hardlink create $f1 $_.Group[1].Path
    }
To run the script, use the following command format:
.\hardlinks.ps1 -dir1 d:\fldr1 -dir2 d:\fldr2

This script can be used to find duplicate static files (files that do not change!) and replace them with hard links.
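
To check the result, you can list all hard links that point to the same file data with fsutil (the path below is just an example):

fsutil hardlink list d:\fldr1\file.txt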

On Windows Server, the problem of duplicate files can be solved radically with the built-in Data Deduplication component of the File Server role. However, if you use deduplication together with incremental backups, you may run into difficulties when restoring data.
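
For reference, a minimal sketch of enabling Data Deduplication on a volume with the built-in cmdlets (D: is an assumed volume; run this on Windows Server):

# Install the Data Deduplication feature
Install-WindowsFeature -Name FS-Data-Deduplication
# Enable deduplication on the D: volume for general file server data
Enable-DedupVolume -Volume D: -UsageType Default
# Start an optimization job immediately instead of waiting for the schedule
Start-DedupJob -Volume D: -Type Optimization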

You can also use the console utility dupemerge to replace duplicate files with hard links.
