detect if "already exists" are true copies and delete or rename source file ?

Started by maxandersen, October 30, 2016, 12:23:48 PM


chuck lee

I have had the same problem for a while and just found a workaround: I wrote a function that lists the duplicate files and keeps either the newer or the larger one when compared with the original. It combines the find, sed and sort commands.  "%-3uc" is the copy number format I use to mark duplicate files, since many of my file names already end in xxx-01.jpg or xxx_01.jpg, so upper-case letters are a safer choice.  You can change it to fit your needs.


file_rm_dup(){
  # ((++func_counter))
  # sec_title "" "$FUNCNAME() --> `basename ${BASH_SOURCE[0]}` --> `basename $0`"
 
  [[ "$file_ulb" == "u" ]] && tier_2=3 || tier_2=4
 
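  # List every duplicate as "<original> <duplicate> <mtime seconds> <size>",
  # grouped by original name with the newest (or largest) duplicate first.
  find . -iname '*-AA[A-Z].*' -type f -printf "%p %T@ %s\n" |
    sed -rn 's/(.*)(-AA[A-Z])(\..*) (.*)(\..*) (.*)/\1\3 \1\2\3 \4 \6/p' |
    sort -b -s -S 10M -k 1,1 -k "$tier_2","$tier_2"nr > dup_out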

  if [[ -s dup_out ]] ; then
    f_dest_tmp=
    while read -r line;  do # read a line each loop
      read -r f_dest f_src f_src_sec f_src_size <<< "$line"
       
        # f_dest: filename order and also destination filename(without copy number -AA[A-Z])
        # f_src  : duplicate filename
        # f_src_sec  : file seconds since epoch
        # f_src_size : file size
         
        # echo "dest_tmp: $f_dest_tmp"
         
      if [[ "$f_dest" != "$f_dest_tmp" ]]; then
        f_dest_sec=$(stat -c%Y "$f_dest")
        f_dest_size=$(stat -c%s "$f_dest")
         
          #  %Y: time of last data modification, seconds since Epoch;
          #  %s: total size, in bytes
           
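        # Use if/else rather than "test && mv || rm", so that a failed mv
        # does not fall through to rm and delete the duplicate as well.
        if [[ "$file_ulb" = u && "$f_src_sec" -gt "$f_dest_sec" || "$file_ulb" = l && "$f_src_size" -gt "$f_dest_size" ]]; then
          mv -f -T "$f_src" "$f_dest"   # duplicate is newer/larger: it replaces the original
        else
          rm "$f_src"                   # otherwise the duplicate is discarded
        fi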
       
        f_dest_tmp="$f_dest"
      else
        rm "$f_src"
      fi
    done < dup_out
  fi
  rm dup_out
}



In the 'find' command, %p is the file name, %T@ is the time of last data modification in seconds since the Epoch (the 'sed' command keeps only the integer part), and %s is the file size in bytes.  The result is written to a file called dup_out.  In each line, the first field is the original file name, the 2nd field is the duplicate file name, the 3rd field is the modification time in seconds and the last field is the file size in bytes.  The first sort key is the original file name.  The 2nd sort key, in my case, is the 3rd field, the modification time, where a larger number means a newer file.  So in the 'sort' command the variable $tier_2 is 3 (it would be 4, the size field, when keeping the larger file), and that field is treated as a number and sorted in reverse order.
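To see what the sed and sort stages do without touching any real files, here is a minimal sketch that feeds them hand-written lines in the same "%p %T@ %s" layout. The function name build_dup_list and the sample lines are made up for illustration, -E is used as the equivalent of GNU sed's -r, and LC_ALL=C is only there to make the ordering predictable.

```shell
# Hypothetical helper: turn find-style lines on stdin into
# "<original> <duplicate> <seconds> <size>" records, sorted like dup_out.
build_dup_list() {
  sort_field=${1:-3}   # 3 = modification time, 4 = size
  sed -En 's/(.*)(-AA[A-Z])(\..*) (.*)(\..*) (.*)/\1\3 \1\2\3 \4 \6/p' |
    LC_ALL=C sort -b -s -k 1,1 -k "${sort_field},${sort_field}nr"
}

# Made-up sample input standing in for the output of find -printf.
printf '%s\n' \
  './8-AAB.jpg 1649512700.1111111110 6562827' \
  './8-AAC.jpg 1649512702.2222222220 6562827' \
  './t/x-AAB.jpg 1649512701.0000000000 6560771' |
  build_dup_list 3
```

With this input the ./8-AAC.jpg record should come before ./8-AAB.jpg, because its timestamp is larger and the 3rd field is sorted in reverse numeric order.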

Here is an example output:

./20210306_154312_00_90_8.jpg ./20210306_154312_00_90_8-AAB.jpg 1649512702 6562827
./8.jpg ./8-AAC.jpg 1649512702 6562827
./8.jpg ./8-AAB.jpg 1649512702 6562827
./t/202103xx_8.jpg ./t/202103xx_8-AAC.jpg 1649512702 6560771
./t/202103xx_8.jpg ./t/202103xx_8-AAB.jpg 1649512702 6560771

Then a while loop checks which file to keep.  Since the files are sorted, the only comparison needed is between the original and the first duplicate listed for it.  In this example,

./8.jpg ./8-AAB.jpg 1649512702 6562827
./t/202103xx_8.jpg ./t/202103xx_8-AAB.jpg 1649512702 6560771

These two files are removed directly without comparison.  Look through 'man find' to choose the -printf arguments you need and sort accordingly.  I hope this helps.
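If you want to check what the loop would do before letting it delete anything, a dry run along these lines may help. It is only a sketch: plan_actions is a hypothetical name, it reads records in the dup_out layout from stdin, and it prints the decision instead of calling mv or rm.

```shell
# Hypothetical dry run: print what would happen to each duplicate.
# The first record of each original is the one that gets compared;
# every later record for the same original is removed unconditionally.
plan_actions() {
  prev=""
  while read -r dest src sec size; do
    if [ "$dest" != "$prev" ]; then
      echo "compare $src with $dest"
      prev="$dest"
    else
      echo "remove $src"
    fi
  done
}

# Sample records copied from the example output above.
printf '%s\n' \
  './8.jpg ./8-AAC.jpg 1649512702 6562827' \
  './8.jpg ./8-AAB.jpg 1649512702 6562827' |
  plan_actions
```

Running the real dup_out through it (plan_actions < dup_out) shows which files would be compared and which would be removed without comparison.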

BTW, instead of comparing the modification time or size of the files, we can use md5sum to check whether two files are identical.  For example:

md5sum 201801xx_8.jpg 201801Xx_8.jpg


1db75b25856087ac0e45a9a891e3e97c  201801xx_8.jpg
1db75b25856087ac0e45a9a891e3e97c  201801Xx_8.jpg

Then keep the one with the filename you want.
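In a script, that check could be wrapped roughly like this. This is a sketch, not part of the function above: same_md5 and same_bytes are hypothetical names, and cmp -s is shown as an alternative that compares the bytes directly instead of going through a checksum.

```shell
# Hypothetical helper: succeed when both files have the same MD5 checksum.
# Reading from stdin keeps the file name out of md5sum's output.
same_md5() {
  sum_a=$(md5sum < "$1") || return 2
  sum_b=$(md5sum < "$2") || return 2
  [ "$sum_a" = "$sum_b" ]
}

# Hypothetical helper: succeed when both files are byte-for-byte identical.
same_bytes() {
  cmp -s "$1" "$2"
}

# Example use with the two files from the md5sum output above:
#   if same_md5 201801xx_8.jpg 201801Xx_8.jpg; then echo "identical"; fi
```

Either helper could replace the time or size test when you only want to drop duplicates that are true copies.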