
Efficiently find thousands of files by exact match in a directory containing millions of files (bash/python/perl)

Python

呼喚遠方 2022-11-24 15:24:25

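I am on Linux and I am trying to find thousands of files in a directory (SOURCE_DIR) that contains millions of files. I have a list of the file names I need to find, stored in a single text file (FILE_LIST). Each line of this file contains a single name corresponding to a file in SOURCE_DIR, and there are thousands of lines in the file.

## FILE_LIST contain single word file names, each per line
#Name0001
#Name0002
#..
#Name9999

I want to copy the files to another directory (DESTINATION_DIR). I wrote the loop below, with an inner loop that finds the files one by one.

#!/bin/bash
FILE_LIST='file.list'
## FILE_LIST contain single word file names, each per line
#Name0001
#Name0002
#..
#Name9999
SOURCE_DIR='/path/to/source/files' # Contain millions of files in sub-directories
DESTINATION_DIR='/path/to/destination/files' # Files will be copied to here

while read FILE_NAME
do
    echo $FILE_NAME
    for FILE_NAME_WITH_PATH in `find $SOURCE_DIR -maxdepth 3 -name "$FILE_NAME*" -type f -exec readlink -f {} \;`;
    do
        echo $FILE
        cp -pv $FILE_NAME_WITH_PATH $DESTINATION_DIR;
    done
done < $FILE_LIST

This loop takes a lot of time, and I was wondering whether there is a better way to achieve my goal. I searched but did not find a solution to my problem. Please point me to a solution if one already exists, or suggest any tweak to the code above. I am also fine with another approach, or even a python/perl solution. Thanks for your time and help!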


4 Answers

泛舟湖上清波郎朗

Contributed 1818 experience points, received more than 3 likes

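Note: code to handle identical names in different directories is added below.

The files to copy have to be found first, since they are given without a path (we don't know which directories they are in). Searching anew for each one is extremely wasteful, though, and greatly increases the complexity.

Instead, first build a hash that maps each file name to its full path.

One way, with Perl, using the fast core module File::Find: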

use warnings;
use strict;
use feature 'say';
use File::Find;
use File::Copy qw(copy);

my $source_dir = shift // '/path/to/source';  # give at invocation or default
my $copy_to_dir = '/path/to/destination';
my $file_list = 'file_list_to_copy.txt';

open my $fh, '<', $file_list or die "Can't open $file_list: $!";
my @files = <$fh>;
chomp @files;

# Map every file name to its full path in a single traversal
my %fqn;
find( sub { $fqn{$_} = $File::Find::name unless -d }, $source_dir );

# Now copy the ones from the list to the given location
foreach my $fname (@files) {
    # Skip names from the list that were not found under $source_dir
    if (not exists $fqn{$fname}) {
        warn "No file named $fname found under $source_dir\n";
        next;
    }
    copy $fqn{$fname}, $copy_to_dir
        or do {
            warn "Can't copy $fqn{$fname} to $copy_to_dir: $!";
            next;
        };
}

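The remaining question is about file names that may exist in multiple directories, but we need to be given a rule for what to do in that case.†

I disregard the maximum depth used in the question, since it is unexplained and seemed to me to be a fix related to the extreme runtimes (?). Also, the files are copied into a "flat" structure (without restoring their original hierarchy), taking the cue from the question.

Finally, I skip only directories, while various other file types come with their own issues (copying links around needs care). To accept only plain files, change unless -d to if -f.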
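Since the question also welcomes Python, the same index-then-copy idea can be sketched there. This is a minimal sketch, not the answer's code: the function names build_index and copy_listed are made up for illustration, and it assumes the list holds exact file names, one per line. Like the Perl hash, a later file with the same name overwrites an earlier one in the index.

```python
import os
import shutil


def build_index(source_dir):
    """Walk source_dir once and map each file name to one full path."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            # A later path with the same name replaces an earlier one.
            index[name] = os.path.join(dirpath, name)
    return index


def copy_listed(file_list, source_dir, dest_dir):
    """Copy every listed name found in the index; return the names not found."""
    index = build_index(source_dir)
    os.makedirs(dest_dir, exist_ok=True)
    missing = []
    with open(file_list, encoding="utf-8") as fh:
        for line in fh:
            name = line.rstrip("\n")
            if not name:
                continue  # ignore blank lines in the list
            path = index.get(name)
            if path is None:
                missing.append(name)
                continue
            # Flat copy: the original directory hierarchy is not recreated.
            shutil.copy2(path, os.path.join(dest_dir, name))
    return missing
```

The tree is traversed once, and each of the thousands of lookups is a dictionary access, which is what removes the per-name find from the original loop.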

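† It has been clarified that there may indeed be files with the same name in different directories. Those should be copied under the same name, suffixed with a sequential number before the extension.

For this we need to check whether a name already exists and keep track of duplicate names while building the hash, so this will take a little longer. There is also a small conundrum of how to account for duplicate names. I use another hash in which only the duped‡ names are kept, in arrayrefs; this simplifies and speeds up both parts of the job.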

my (%fqn, %dupe_names);

find( sub {
    return if -d;
    (exists $fqn{$_})
        ? push( @{ $dupe_names{$_} }, $File::Find::name )
        : ( $fqn{$_} = $File::Find::name );
}, $source_dir );

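To my surprise, even though a test now runs for every item, this is only slightly slower than the code that disregards duplicate names, on a quarter of a million files spread over a sprawling hierarchy.

The parentheses around the assignment in the ternary operator are needed because the operator itself may be assigned to (if its last two arguments are valid "lvalues", as they are here), so one needs to be careful with assignments inside the branches.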
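The same bookkeeping can be sketched in Python, where a plain if/else avoids the ternary precedence issue. This is only an illustration of the two-hash idea under the same assumptions; build_index_with_dupes is a hypothetical name, not part of the answer.

```python
import os
from collections import defaultdict


def build_index_with_dupes(source_dir):
    """Return (first_seen, dupes): one path per name, plus extra paths for repeated names."""
    first_seen = {}
    dupes = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name in first_seen:
                # Only names seen more than once end up in dupes.
                dupes[name].append(path)
            else:
                first_seen[name] = path
    return first_seen, dict(dupes)
```

Keeping the duplicates in a separate mapping means the main copy loop stays unchanged, and the second pass only touches names that actually repeat.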

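Then, after copying from %fqn as in the main part of the post, also copy the other files with the same name. We need to break up the file names so that the enumeration can be added before .ext; I use the core module File::Basename: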

use File::Basename qw(fileparse);

foreach my $fname (@files) {
    next if not exists $dupe_names{$fname};  # no dupe (and copied already)
    my $cnt = 1;
    foreach my $fqn (@{$dupe_names{$fname}}) {
        # Split off the extension so the counter goes in front of it
        my ($name, $path, $ext) = fileparse($fqn, qr/\.[^.]*/);
        copy $fqn, "$copy_to_dir/${name}_$cnt$ext"
            or do {
                warn "Can't copy $fqn to $copy_to_dir: $!";
                next;
            };
        ++$cnt;
    }
}

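(This has been tested at a basic level, but not much more.)

I would perhaps use undef instead of $path above, to indicate that the path is unused (which also avoids allocating and populating a scalar), but I left it this way for clarity, for those unfamiliar with what the module's sub returns.

Note. For files with duplicates there will be copies fname.ext, fname_1.ext, and so on. If you would rather have them all indexed, first rename fname.ext (in the destination, where it has already been copied via %fqn) to fname_1.ext, and change the counter initialization to my $cnt = 2;.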
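The naming rule in the note can be sketched in Python with os.path.splitext, which also splits at the last dot. The helper names numbered_name and plan_duplicate_copies are hypothetical, and the start parameter plays the role of the initial $cnt. One difference to keep in mind: splitext treats a name that is only a leading dot plus text, such as .bashrc, as having no extension.

```python
import os


def numbered_name(filename, count):
    """Insert _<count> before the extension: fname.ext -> fname_1.ext."""
    stem, ext = os.path.splitext(filename)
    return f"{stem}_{count}{ext}"


def plan_duplicate_copies(dupes, dest_dir, start=1):
    """Return (source_path, destination_path) pairs for the duplicate paths of each name."""
    plan = []
    for name, paths in dupes.items():
        for count, path in enumerate(paths, start=start):
            plan.append((path, os.path.join(dest_dir, numbered_name(name, count))))
    return plan
```

Building the plan as a list first makes it easy to inspect the target names before anything is copied; passing start=2 corresponds to the variant where the first copy is renamed to fname_1.ext.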

‡ Note that these need not be the same files.

Answered 2022-11-24

智慧大石

Contributed 1946 experience points, received more than 3 likes

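I suspect the speed problem comes (at least partly) from your nested loops: for every FILE_NAME you run a find and loop over its results. The Perl solution below uses the technique of building a regular expression dynamically (this works for large lists; I have tested it on lists of 100k+ words), so that you only need to loop over the files once and can let the regular expression engine handle the rest. It is fairly fast.

Note that I have made a couple of assumptions based on my reading of your script: that you want the patterns to match case-sensitively at the beginning of the file names, and that you want to recreate the same directory structure in the destination as in the source (set $KEEP_DIR_STRUCT=0 if you don't want this). Also, I am shelling out to find, which is not exactly best practice, instead of using Perl's own File::Find, because that makes it easier to implement the same options you are using (such as -maxdepth 3). It should work fine unless there are files with newlines in their names.

The script uses only core modules, so you should already have them installed.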

#!/usr/bin/env perl
use warnings;
use strict;
use File::Basename qw/fileparse/;
use File::Spec::Functions qw/catfile abs2rel/;
use File::Path qw/make_path/;
use File::Copy qw/copy/;

# user settings
my $FILE_LIST='file.list';
my $SOURCE_DIR='/tmp/source';
my $DESTINATION_DIR='/tmp/dest';
my $KEEP_DIR_STRUCT=1;
my $DEBUG=1;

# read the file list
open my $fh, '<', $FILE_LIST or die "$FILE_LIST: $!";
chomp( my @files = <$fh> );
close $fh;

# build a regular expression from the list of filenames
# explained at: https://www.perlmonks.org/?node_id=1179840
my ($regex) = map { qr/^(?:$_)/ } join '|', map {quotemeta}
    sort { length $b <=> length $a or $a cmp $b } @files;

# prep dest dir
make_path($DESTINATION_DIR, { verbose => $DEBUG } );

# use external "find"
my @cmd = ('find',$SOURCE_DIR,qw{ -maxdepth 3 -type f -exec readlink -f {} ; });
open my $cmd, '-|', @cmd or die $!;
while ( my $srcfile = <$cmd> ) {
    chomp($srcfile);
    my $basename = fileparse($srcfile);
    # only interested in files that match the pattern
    next unless $basename =~ /$regex/;
    my $newname;
    if ($KEEP_DIR_STRUCT) {
        # get filename relative to the source directory
        my $relname = abs2rel $srcfile, $SOURCE_DIR;
        # build new filename in destination directory
        $newname = catfile $DESTINATION_DIR, $relname;
        # create the directories in the destination (if necessary)
        my (undef, $dirs) = fileparse($newname);
        make_path($dirs, { verbose => $DEBUG } );
    }
    else {
        # flatten the directory structure
        $newname = catfile $DESTINATION_DIR, $basename;
        # warn about potential naming conflicts
        warn "overwriting $newname with $srcfile\n" if -e $newname;
    }
    # copy the file
    print STDERR "cp $srcfile $newname\n" if $DEBUG;
    copy($srcfile, $newname) or die "copy('$srcfile', '$newname'): $!";
}
close $cmd or die "external command failed: ".($!||$?);
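For comparison, the core of this approach (one alternation regex, one pass over the tree) might look as follows in Python. It is a rough sketch under stated assumptions, not a port of the script: build_prefix_regex and find_matches are hypothetical names, os.walk stands in for the external find, symlinks are not resolved the way readlink -f does, and the max_depth handling is my approximation of -maxdepth 3.

```python
import os
import re


def build_prefix_regex(names):
    """Compile one alternation of escaped names, longest first."""
    if not names:
        raise ValueError("the name list is empty")
    # Longest names first, so a longer name is tried before a shorter prefix of it.
    ordered = sorted(set(names), key=lambda n: (-len(n), n))
    return re.compile("(?:" + "|".join(re.escape(n) for n in ordered) + ")")


def find_matches(source_dir, names, max_depth=3):
    """Yield paths of files whose name starts with one of the listed names."""
    regex = build_prefix_regex(names)
    source_dir = os.path.normpath(source_dir)
    base_depth = source_dir.count(os.sep)
    for dirpath, dirnames, filenames in os.walk(source_dir):
        depth = dirpath.count(os.sep) - base_depth
        # Files in a directory at depth d sit at depth d + 1, so stop descending
        # once files in the subdirectories would exceed max_depth.
        if depth >= max_depth - 1:
            dirnames[:] = []
        for name in filenames:
            if regex.match(name):  # match() anchors at the start of the name
                yield os.path.join(dirpath, name)
```

As in the Perl script, matching is case-sensitive and anchored at the beginning of the base name, so Name0001 also matches Name0001.txt.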

You may also want to consider using hard links instead of copying the files.

Answered 2022-11-24

暮色呼如

Contributed 1853 experience points, received more than 9 likes

With rsync

I have no idea how fast this will be for millions of files, but here is a method that uses rsync.

Format your file.list as shown below (for example with $ cat file.list | awk '{print "+ *" $0}').

+ *Name0001

+ *Name0002

...

+ *Name9999

Pass file.list to the rsync command with the --include-from option:

$ rsync -v -r --dry-run --filter="+ **/" --include-from=/tmp/file.list --filter="- *" /path/to/source/files /path/to/destination/files

Explanation of the options:

-v : Show verbose info.

-r : Traverse directories when searching for files to copy.

--dry-run : Remove this if preview looks okay

--filter="+ **/" : Pattern to include all directories in search

--include-from=/tmp/file.list : Include patterns from file.

--filter="- *" : Exclude everything that didn't match previous patterns.

The order of the options matters.

Remove --dry-run if the verbose output looks acceptable.

Tested with rsync version 3.1.3.
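The awk one-liner that prepares the include file can also be sketched in Python, which may be convenient if the list needs other cleanup first. The function name write_rsync_include_file is hypothetical, and the sketch assumes the names contain none of rsync's wildcard characters (*, ?, [), since those would be interpreted as patterns.

```python
def write_rsync_include_file(file_list, include_file):
    """Write one '+ *NAME' rule per non-empty line of file_list; return the rule count."""
    count = 0
    with open(file_list, encoding="utf-8") as src, \
            open(include_file, "w", encoding="utf-8") as out:
        for line in src:
            name = line.strip()
            if not name:
                continue  # skip blank lines
            out.write(f"+ *{name}\n")
            count += 1
    return count
```

The resulting file is what --include-from expects in the command above.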

Answered 2022-11-24

HUX布斯

Contributed 1876 experience points, received more than 6 likes

Here is a bash v4+ solution with find, though I am not sure about its speed.

#!/usr/bin/env bash

files=file.list
sourcedir=/path/to/source/files
destination=/path/to/destination/files

mapfile -t lists < "$files"
total=${#lists[*]}

while IFS= read -rd '' files; do
  counter=0
  while ((counter < total)); do
    if [[ $files == *"${lists[counter]}" ]]; then
      echo cp -v "$files" "$destination" && unset 'lists[counter]' && break
    fi
    ((counter++))
  done
  lists=("${lists[@]}")
  total=${#lists[*]}
  (( ! total )) && break  ##: if the lists is already empty/zero, break.
done < <(find "$sourcedir" -type f -print0)

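If a match is found between file.list and the files in source_directory, the inner break exits the inner loop, so file.list is not processed to its end. The matched entry is also removed from "${lists[@]}" (which is an array) with unset, so the next inner loop skips the names that have already been matched.

File name collisions should not be a problem; the unset and the inner break ensure that. The downside shows up if you have multiple files to match in different subdirectories, because only the first one is picked up.

If speed is what you are after, use a general-purpose scripting language such as python, perl and friends.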
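As a rough illustration of that advice, the loop above could be sketched in Python like this. It keeps the same behavior of taking only the first match per name and stopping once every name has been found, but it assumes exact matches on the base name rather than the suffix match used in the bash test. find_first_matches is a hypothetical name.

```python
import os


def find_first_matches(source_dir, names):
    """Map each wanted name to the first path found; stop early when all are found."""
    wanted = set(names)
    found = {}
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            if name in wanted:
                found[name] = os.path.join(dirpath, name)
                # Equivalent of unset: later files with this name are ignored.
                wanted.discard(name)
                if not wanted:
                    return found  # nothing left to look for
    return found
```

The set lookup replaces the inner loop over the list, so each file found by the traversal costs one hash lookup instead of a scan of the remaining names.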

An alternative to the (excruciatingly slow) pattern matching inside the loop is grep:

#!/usr/bin/env bash

files=file.list
source_dir=/path/to/source/files
destination_dir=/path/to/destination/files

while IFS= read -rd '' file; do
  cp -v "$file" "$destination_dir"
done < <(find "$source_dir" -type f -print0 | grep -Fzwf "$files")

The -z option of grep is a GNU extension.

In the first script, remove the echo once you think the output is correct.

Answered 2022-11-24
