Find command for the million
Jul 29 2018

find is one of the most useful, and most underutilized, Unix commands. To get the most out of it, there are two key things we need to explore and understand:

  1. How to iterate over the results
  2. Limitation on the number of results we can operate on

Learn by doing

Let's create a few files and directories to use as examples:

mkdir -p one/two/three
for i in {1..10}; do touch "./one/file_${i}"; done
for i in {1..10}; do touch "./one/two/file_${i}"; done
for i in {1..10}; do touch "./one/two/three/file_${i}"; done

This should generate the following file structure:

.
└── one
    ├── file_1
    ├── file_10
    ├── file_2
    ├── file_3
    ├── file_4
    ├── file_5
    ├── file_6
    ├── file_7
    ├── file_8
    ├── file_9
    └── two
        ├── file_1
        ├── file_10
        ├── file_2
        ├── file_3
        ├── file_4
        ├── file_5
        ├── file_6
        ├── file_7
        ├── file_8
        ├── file_9
        └── three
            ├── file_1
            ├── file_10
            ├── file_2
            ├── file_3
            ├── file_4
            ├── file_5
            ├── file_6
            ├── file_7
            ├── file_8
            └── file_9

Now if we wish to "find" all the directories inside the current directory we could use the following command:

find ./ -type d 

and we should see on the screen:

./
.//one
.//one/two
.//one/two/three

Similarly, if we wanted to list all the regular files under the current directory, we could use the command:

find ./ -type f

And we should see on the screen:

.//one/file_10
.//one/file_7
.//one/file_9
.//one/file_8
.//one/file_1
.//one/file_6
.//one/two/file_10
.//one/two/file_7
.//one/two/file_9
.//one/two/file_8
.//one/two/file_1
.//one/two/file_6
.//one/two/three/file_10
.//one/two/three/file_7
.//one/two/three/file_9
.//one/two/three/file_8
.//one/two/three/file_1
.//one/two/three/file_6
.//one/two/three/file_3
.//one/two/three/file_4
.//one/two/three/file_5
.//one/two/three/file_2
.//one/two/file_3
.//one/two/file_4
.//one/two/file_5
.//one/two/file_2
.//one/file_3
.//one/file_4
.//one/file_5
.//one/file_2

Notice that the results might not come back in the order we expect. find prints entries in the order the filesystem returns them, and some filesystems store directory entries in an order chosen for their own efficiency rather than alphabetically. If you need a specific order you have to ask for it, most likely by piping the results to the sort command, like the following:

find ./ -type f | sort

Iterate through results

OK, now that we have the basics covered, let's see how to iterate through the results. A common and useful pattern we find ourselves repeating is fixing file permissions to something like the following:

  - Regular files: 644
  - Directories: 755 *

(* Directories need the execute permission so that users can enter them and access their contents)

Now that we have defined our objective, let's see how to do it using find.

Exec

We can pass the -exec option to find to execute a command on each of the results. The command for regular files looks like this:

find . -type f -exec chmod 644 {} \;

and the command for directories will look like this:

find . -type d -exec chmod 755 {} \;

We already know what the first part, find . -type f and find . -type d, does. The interesting part is the -exec expression that ends in {} \;. Whenever find encounters {} in the command, it replaces it with the path of the current match.

Run the following command, and it'll be clear.

find . -type f -name "*4" -exec echo "Find found {} AND again: {}!" \;

You should see:

Find found ./one/two/three/file_4 AND again: ./one/two/three/file_4!
Find found ./one/two/file_4 AND again: ./one/two/file_4!
Find found ./one/file_4 AND again: ./one/file_4!
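
Going back to the permissions example, the two chmod commands can also be combined into a single pass over the tree. The following is only a minimal sketch: the demo directory, created with mktemp, and the file names in it are made up so that nothing real gets touched.

```shell
#!/bin/sh
# Hypothetical sandbox so the example does not touch real files.
demo=$(mktemp -d)
mkdir -p "$demo/one/two"
touch "$demo/one/a.txt" "$demo/one/two/b.txt"
# Start from "wrong" permissions so the fix is visible.
chmod 700 "$demo/one" "$demo/one/two"
chmod 600 "$demo/one/a.txt" "$demo/one/two/b.txt"

# One traversal: directories get 755, regular files get 644.
# -o means "or"; the escaped parentheses group each test with its action.
find "$demo" \( -type d -exec chmod 755 {} \; \) -o \( -type f -exec chmod 644 {} \; \)

# Collect anything that did NOT end up with the expected mode.
bad_dirs=$(find "$demo" -type d ! -perm 755)
bad_files=$(find "$demo" -type f ! -perm 644)
echo "directories with unexpected mode: ${bad_dirs:-none}"
echo "files with unexpected mode: ${bad_files:-none}"

rm -r "$demo"
```

Walking the tree once instead of twice matters little on a small directory, but it can add up on a large one.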

So now we are left with one more piece of the puzzle: \;. The -exec option expects the command to be terminated by either ; or +. The semicolon tells find to execute the command once for each result, and the plus sign tells find to pass as many matches as possible to a single invocation of the command. (The backslash only stops the shell from interpreting the semicolon itself.) This is better explained by an example:

find . -type f -name "*4" -exec ls {} \;

This will call ls once for each matching file, so we would see:

./one/two/three/file_4
./one/two/file_4
./one/file_4

and if we call it with the plus sign:

find . -type f -name "*4" -exec ls {} \+

ls will receive the whole list of files at once. The generated command would look something like:
ls ./one/file_4 ./one/two/file_4 ./one/two/three/file_4
and you'll see:

./one/file_4           ./one/two/file_4       ./one/two/three/file_4
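
If you want to see the difference between the two terminators directly, one way is to replace ls with a tiny inline script that reports how many file names it received. This is just a sketch under assumptions: the sandbox directory and the variable names are made up, and only the three file_4 files from the example layout are recreated.

```shell
#!/bin/sh
# Hypothetical sandbox mirroring the example layout (only the files we need).
demo=$(mktemp -d)
mkdir -p "$demo/one/two/three"
touch "$demo/one/file_4" "$demo/one/two/file_4" "$demo/one/two/three/file_4"

# The inline script prints how many file names it got per invocation.
# The "sh" after the script fills $0, so the matches land in $1, $2, ...
semicolon_out=$(find "$demo" -type f -name "*4" -exec sh -c 'echo "got $# file(s)"' sh {} \;)
plus_out=$(find "$demo" -type f -name "*4" -exec sh -c 'echo "got $# file(s)"' sh {} +)

echo "Terminated with a semicolon:"
echo "$semicolon_out"   # three lines, each "got 1 file(s)"
echo "Terminated with a plus:"
echo "$plus_out"        # one line, "got 3 file(s)"

rm -r "$demo"
```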

The limitation on the length of the command line

The operating system defines a limit on the length of the command line. We can check it using the command:

getconf ARG_MAX

This value is expressed in bytes, and find respects it when deciding how many arguments to pass to the command at once. This is very useful when we have a long list of files to operate on. Let's say we have close to a million files in a directory and we want to delete them. We would normally try:

rm *

But we will get an error: Argument list too long. This is because the shell expands the star into every matching file name before running rm, and the resulting command line exceeds the ARG_MAX limit. We can see now how useful -exec with the plus option is for overcoming this limitation: find splits the matches into as many invocations as needed, each one under the limit.
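
As a rough stand-in for rm * that never builds one giant command line, something like the following could be used. It is a sketch under a few assumptions: the sandbox directory and file names are invented, 2000 files stand in for the million, and your find is assumed to support -maxdepth (GNU and BSD find both do).

```shell
#!/bin/sh
# Hypothetical sandbox; 2000 files stand in for the "close to a million" case.
demo=$(mktemp -d)
mkdir "$demo/keep_me"
i=1
while [ "$i" -le 2000 ]; do
  : > "$demo/cache_$i.tmp"
  i=$((i + 1))
done

# Roughly what "rm *" would do, minus the shell expansion:
# -maxdepth 1 keeps find from descending into subdirectories, and + lets
# find split the names into as many rm calls as the limit requires.
find "$demo" -maxdepth 1 -type f -exec rm {} +

remaining=$(find "$demo" -type f | wc -l | tr -d ' ')
dir_kept=no
[ -d "$demo/keep_me" ] && dir_kept=yes
echo "files left: $remaining, subdirectory kept: $dir_kept"

rm -r "$demo"
```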

Fast file deletion

If you have ever tried deleting cached files on a filesystem that contains hundreds of thousands of files, you know it's like watching paint dry: it takes a long time. find can make this process faster.

We could use the command:

find . -type f -exec rm {} \;

This will call rm once for each of the files, which has roughly the same effect as calling rm * (except that find also descends into subdirectories). But find gives us the -delete option, which makes this process faster. Instead of starting a separate rm process for each file, we can just run the following command:

find . -type f -delete
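
Because -delete is unforgiving, it can be worth previewing the matches first. The following sketch assumes a made-up sandbox with one file to delete and one to keep. It runs the same expression twice, first with -print and then with -delete.

```shell
#!/bin/sh
# Hypothetical sandbox with one file to delete and one to keep.
demo=$(mktemp -d)
touch "$demo/old.cache" "$demo/keep.txt"

# Step 1: preview. Same tests, but -print instead of -delete.
find "$demo" -type f -name "*.cache" -print

# Step 2: once the list looks right, swap -print for -delete.
# Keep -delete LAST: find evaluates its expression left to right, so a
# -delete placed before the tests would remove files before they are filtered.
find "$demo" -type f -name "*.cache" -delete

left=$(ls "$demo")
echo "still there: $left"

rm -r "$demo"
```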

This will be much faster than running rm once per file. You can check it with the following test, which creates half a million files:

for i in $(seq 1 500000); do echo testing >> "${i}.txt"; done

And then we can measure:

time find ./ -type f -exec rm {} \;

Create the half a million files again and compare it with:

time find ./ -type f -exec rm {} \+

Create the half a million files once more and compare it with:

time find ./ -type f -delete
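
If you would rather not recreate the files by hand between runs, the three measurements can be wrapped in a small script. What follows is only a sketch: COUNT, make_files and timed are hypothetical names, the files live in a throwaway directory, and the timing uses whole seconds from date, so it only becomes meaningful once COUNT is raised well above the small default.

```shell
#!/bin/sh
# Hypothetical harness. COUNT is kept small so a trial run finishes quickly;
# raise it (the test above used 500000) for a meaningful comparison.
COUNT=${COUNT:-1000}
demo=$(mktemp -d)

# Recreate the test files inside the sandbox.
make_files() {
  i=1
  while [ "$i" -le "$COUNT" ]; do
    echo testing > "$demo/$i.txt"
    i=$((i + 1))
  done
}

# Run a command and report the elapsed wall-clock time in whole seconds.
timed() {
  label=$1
  shift
  start=$(date +%s)
  "$@"
  end=$(date +%s)
  printf '%s: %ss\n' "$label" "$((end - start))"
}

make_files
timed "rm, one process per file" find "$demo" -type f -exec rm {} \;
make_files
timed "rm, batched with plus" find "$demo" -type f -exec rm {} +
make_files
timed "built-in -delete" find "$demo" -type f -delete

left=$(find "$demo" -type f | wc -l | tr -d ' ')
echo "files left after the last run: $left"

rm -r "$demo"
```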

These are my results:

Using rm with the semicolon:

real    23m7.661s
user    5m37.056s
sys     11m28.556s

Using rm with plus option:

real    1m34.861s
user    0m1.203s
sys     1m14.445s

Using -delete:

real    1m30.699s
user    0m0.885s
sys     1m10.668s

As you can see, -delete was the fastest. rm with the plus option was close, only about 4 seconds slower, but going one by one, which is what most people do, took more than 23 minutes. If you are interested in more details and comparisons to other methods (using Perl and rsync) check this link.

I hope this gives you a better understanding of the find command. Explore its other options and share with me any tricks you've discovered.

P.S. If you want me to write about a specific topic let me know and I'll do my best to share my experience.


** If you want to check what else I'm currently doing be sure to follow me on twitter @rderik or subscribe to the newsletter. If you want to send me a direct message you can send it to derik@rderik.com.