[Novalug] bash commands

Michael Henry LUG-user at drmikehenry.com
Sat Jan 19 10:16:07 EST 2008


All,

Nino Pereira started an interesting thread on the use of `grep`
with the following question:

 > I'd like to search all the files in a directory
 > for the occurrence of a particular
 > string. It must be something like
 >
 > do (until you have had all the files)
 >   cat *.f | grep --with-filename 'string'
 > od

There were many helpful replies mentioning a number of good techniques.
For example, Richard Rognlie provided two general-purpose invocations
combining `find`, `xargs`, and `grep`:

 > cd WHEREVER
 >
 > then either
 >         find . -name \*.f -print0 | xargs -0 grep 'string' /dev/null
 >
 >
 > since not all unixes support print0 in their finds, you might have to
 > resort to the following instead,
 >
 >         find . -name \*.f -print | xargs grep 'string' /dev/null
 >
 >
 > (why /dev/null?   just in case grep gets invoked with a single filename
 > argument, this adds a 2nd filename... which will cause grep to report
 > which file matched)

The first invocation using `-print0` is best for general-purpose use
because the filenames located by `find` are separated by NUL characters
instead of newlines, and `xargs -0` splits its input only at those NULs.
Without `-0`, `xargs` splits its input at any whitespace, which can lead
to serious filename mis-interpretations.  For example, a file named
"the file.f" will fail on the second invocation (without `-print0`);
instead of searching the single file "the file.f", `grep` will be told
to search the pair of files "the" and "file.f".  For a more detailed
example of this phenomenon, see
http://calypso.tux.org/pipermail/novalug/2007-October/008060.html
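
To make the difference concrete, here is a minimal sketch that builds a
scratch directory holding one file named "the file.f" and runs both
invocations against it.  The directory, the file contents, and the
variable names are invented for illustration, and `-H` stands in for
`/dev/null` to keep the lines short.

```shell
# Scratch directory with one awkwardly named file (names are illustrative).
demo=$(mktemp -d)
printf 'needle\n' > "$demo/the file.f"

# Newline-separated names: xargs splits "./the file.f" at the space,
# so grep is asked to search two files that do not exist.
plain=$(cd "$demo" && find . -name '*.f' -print | xargs grep -H 'needle' 2>/dev/null || true)

# NUL-separated names: the filename reaches grep in one piece.
nul=$(cd "$demo" && find . -name '*.f' -print0 | xargs -0 grep -H 'needle')

printf 'with -print:  [%s]\n' "$plain"
printf 'with -print0: [%s]\n' "$nul"
rm -rf "$demo"
```

The first result should come back empty (the errors from `grep` are
discarded here), while the second reports the match in `./the file.f`.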

Richard rightly points out that support for `-print0` is not
necessarily available on older variants of Unix; however, it has been
part of GNU `find` for a very long time, and essentially all[1] Linux
systems use GNU `find`.  It's good to know the portability work-arounds
necessary to work on variants of Unix, but when on Linux, I like to take
advantage of the GNU extensions in the tools when they make the
invocation more robust or easier to type.

Along those lines, the practice of adding `/dev/null` to the command
line of `grep` is another portability work-around for variants of `grep`
that don't have the `--with-filename` option originally mentioned by
Nino.  In GNU `grep`, the `--with-filename` option (with short form
`-H`) forces the filename to be displayed even if only a single filename
is specified on the command line.  When `-H` is available, it is a more
concise solution to the problem.
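
A small sketch of the three variants side by side, assuming GNU `grep`
and a throwaway file named `one.f` (the file and the variable names are
made up for this example):

```shell
demo=$(mktemp -d)
printf 'needle\n' > "$demo/one.f"

# One filename on the command line: grep omits the filename.
bare=$(cd "$demo" && grep 'needle' one.f)

# Portable work-around: /dev/null acts as a second (empty) file.
padded=$(cd "$demo" && grep 'needle' one.f /dev/null)

# GNU shortcut: -H (--with-filename) forces the filename.
forced=$(cd "$demo" && grep -H 'needle' one.f)

printf '%s\n' "$bare" "$padded" "$forced"
rm -rf "$demo"
```

The first line printed is just `needle`; the other two are both
`one.f:needle`.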

Jeff Stoner pointed out the built-in support for recursion in GNU `grep`:

 > Got GNU grep?
 >
 > grep -r 'string' *.f

This example is subtly incorrect (if I understand Nino's original
question correctly).  All files matching the pattern `*.f` are to be
searched for `string`.  Without the `-r` flag, the above invocation
correctly searches `*.f` in the current directory.  The `-r` flag tells
`grep` to recursively search all directories provided on the command
line; but only paths matching `*.f` were provided, so only *directories*
matching `*.f` will be descended because of the `-r` flag.  In addition,
`grep` is totally unaware of the glob pattern `*.f`, so it has no way of
knowing it should restrict its search to only files matching `*.f` in
the recursed directories.  Once it has descended recursively into a
directory, `grep` will scan all files in that directory for the desired
pattern.  See below for more explanation of glob patterns and
command-line parsing.
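
The following sketch reproduces the problem in a scratch tree.  The
layout is invented for illustration: a file `top.f`, a file
`sub/deep.f`, and a *directory* that happens to be named `lib.f`
containing `readme.txt`.  Every file contains the search string.

```shell
demo=$(mktemp -d)
mkdir "$demo/sub" "$demo/lib.f"
printf 'needle\n' > "$demo/top.f"
printf 'needle\n' > "$demo/sub/deep.f"
printf 'needle\n' > "$demo/lib.f/readme.txt"

# The shell expands *.f to "lib.f top.f" before grep starts, so -r only
# descends into lib.f and then searches every file it finds there.
recursive=$(cd "$demo" && grep -rl 'needle' *.f | sort)

# find applies the pattern at every depth; -type f skips the directory.
found=$(cd "$demo" && find . -name '*.f' -type f -print0 | xargs -0 grep -l 'needle' | sort)

printf 'grep -r *.f:\n%s\n' "$recursive"
printf 'find + xargs:\n%s\n' "$found"
rm -rf "$demo"
```

The recursive `grep` reports `lib.f/readme.txt` and `top.f` but never
sees `sub/deep.f`; the `find` pipeline reports exactly the two `*.f`
files.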

Joel Fouse pointed out some commonly used `grep` options:

 > -i  ...  search case-insensitively
 > -r  ...  search recursively (in the example above, you wouldn't even
 > need the -r if all the files are in the same folder)
 > -l   ...  show only the filenames that contain matches rather than
 > displaying each matching line
 > -c  ...  show all filenames scanned with the number of matching lines
 > contained in each (including zero)

I'd add one of my favorite `grep` options that really adds benefit for
me: `--color`.  The invocation `grep --color string` causes GNU `grep`
to display the matching text in a highlighted color on the terminal.
This makes it *much* easier for me to pick out the matches in the screen
output.  In my `~/.bashrc` file, I alias `grep` to `grep --color` so I
don't have to type it in:

   alias grep='grep --color'

Then, for interactive use of `grep`, I get colorization for free.
Scripts are unaffected, because Bash does not expand aliases in
non-interactive shells by default.  (Note: with a bare `--color`, GNU
`grep` is smart enough to turn off colorization when the output is not
going to a terminal so you need not worry about getting spurious ANSI
escape sequences in those cases.)
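
A quick way to check this, sketched under the assumption that GNU
`grep` is installed: capture the output through a pipe (so it is not a
terminal) and compare a bare `--color` with `--color=always`, which
forces the escape sequences regardless.  The sample line and variable
names are illustrative.

```shell
line='a needle in a haystack'

# Output goes to a pipe, not a terminal: a bare --color adds nothing.
piped=$(printf '%s\n' "$line" | grep --color 'needle')

# --color=always emits the escape sequences even into a pipe.
forced=$(printf '%s\n' "$line" | grep --color=always 'needle')

[ "$piped" = "$line" ] && echo 'bare --color: output unchanged'
[ "$forced" != "$line" ] && echo '--color=always: escape sequences added'
```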

Joel goes on to say:

 > Grep is your friend.  It's very rare that I need to use some other
 > utility in conjunction with it (cat, find, etc.).

I guess usage patterns vary by the person.  I use `grep` by itself quite
a bit, but I often need to combine `grep` with `find` and `xargs` to get
full control over the files to search.  Often I need
to skip over Subversion[2] hidden directories (named `.svn`) and other
junk files: object files (`*.o`), backup files (`*~`, `*.bak`), etc.  This
speeds up my search and reduces spurious hits on uninteresting files.
For this purpose, I've got a small shell script that accepts `grep`
options and invokes `find`, `xargs`, and `grep` in the standard way
indicated by Richard, but with some additional arguments to `find` that
strip away the known-uninteresting files.  This improves my search speed
and results while reducing my typing.
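
The script itself is not shown in this message, so the following is
only a guess at its general shape.  The function name `srcgrep` and the
exact exclusion list are assumptions; the idea is to prune `.svn`
directories, drop the junk files by name, and hand everything else to
`grep` along with whatever options were given.

```shell
# Hypothetical helper: the name and the exclusion list are assumptions.
srcgrep() {
    # Prune .svn directories; otherwise keep regular files that are not
    # object or backup files, and pass them NUL-separated to grep.
    find . -name .svn -prune -o \
        -type f ! -name '*.o' ! -name '*~' ! -name '*.bak' -print0 |
        xargs -0 grep -H "$@"
}

# Try it on a scratch tree where every file contains the search string.
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/.svn"
for f in src/main.c src/main.o src/main.c~ src/main.bak .svn/entries; do
    printf 'needle\n' > "$demo/$f"
done

hits=$(cd "$demo" && srcgrep -l 'needle' | sort)
printf '%s\n' "$hits"
rm -rf "$demo"
```

Only `./src/main.c` should be listed, since the other four files are
either inside `.svn` or match one of the excluded name patterns.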

Finally, one last topic piqued my interest.  A discussion of globbing
began with Richard and DonJr discussing the limit on the size of a glob
expansion:

DonJr wrote:
 > On Fri, 2008-01-18 at 15:21 -0500, Richard Rognlie wrote:
 >> Unless the "pattern" you are trying to match (say *.f) results in a very
 >> large number of matches, and the bash command line balks...
 >>
 >> I don't know what that number is, but it's on the order of 1024...
 >
 > Depends on the arch and the amount of available memory.
 > I've seen "glob" break with as little as a 100 files on older systems.

Globbing is the process of using wildcard characters to match existing
names in the filesystem.  On a Unix system, generally globbing is
performed by the shell before passing arguments to an external command
(like `grep`).  This allows a single implementation of globbing to be
built into the shell and frees all of the external programs from
worrying about it.  It's important to realize that globbing is completed
before the external program is even started, so the program doesn't see
the glob pattern.  So in the following invocation:

   grep 'pattern' *.f

The shell (generally Bash on Linux) will expand the glob `*.f` into a
list of individual filenames, then pass that list to `grep` like this:

   grep 'pattern' file1.f file2.f

That's why the `-r` flag for grep doesn't have enough information to do
what the user probably intended by this:

   grep -r 'pattern' *.f

The glob pattern `*.f` has been removed before `grep` even starts, so it
really sees this command line:

   grep -r 'pattern' file1.f file2.f

Unless a directory name is passed on the command line, `grep` will not
descend any directories, and even if it does descend, it does not have a
glob pattern to apply to the files in the subdirectories, so it searches
through *all* files in the subdirectories.
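
One way to watch the shell do this is to replace `grep` with a tiny
stand-in that just prints each argument it receives.  The helper
`show_args` and the file names below are hypothetical and exist only
for this sketch.

```shell
# Hypothetical stand-in for an external command: print each argument
# it was given, wrapped in angle brackets.
show_args() { printf '<%s>' "$@"; echo; }

demo=$(mktemp -d)
touch "$demo/file1.f" "$demo/file2.f"

# Unquoted: the shell replaces *.f with the matching names first.
expanded=$(cd "$demo" && show_args 'pattern' *.f)

# Quoted: the command receives the glob itself as one argument.
literal=$(cd "$demo" && show_args 'pattern' '*.f')

printf '%s\n%s\n' "$expanded" "$literal"
rm -rf "$demo"
```

The first line shows `<pattern><file1.f><file2.f>`, with no trace of
the `*.f` that was typed; the second shows `<pattern><*.f>`.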

One implication of this consolidation of globbing support is that
command lines can grow rather long.  Each Unix-like system has a hard
limit on the maximum length for a command line.  If the `*.f` glob
expansion is performed in a directory with a large number of matching
files, the command line may not be able to hold all of them.  The
`xargs` utility was built to handle this particular problem.  `xargs`
accepts a list of arguments via standard input and groups the arguments
together in batches that will fit on a command line.  `xargs` then
invokes a supplied external command enough times to process all
arguments read from standard input.  In this way, the kernel's
command-line-length limitation may be circumvented.  Often, this also
mandates the use of a tool like the `find` command to generate the list,
because `find` has built-in support for globbing.  It matches names
internally and writes them to standard output rather than placing them
on a command line, so it can process much larger lists of filenames
without running into the command-line-length restriction.
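
The batching is easy to see by making the batches artificially small.
In this sketch `-n 2` tells `xargs` to use at most two arguments per
invocation (normally it packs in as many as will fit), and `echo` is
placed in front of `grep` so the command lines are printed instead of
run.  The file names are made up.

```shell
# Five names on standard input, at most two per generated command line.
batches=$(printf '%s\n' file1.f file2.f file3.f file4.f file5.f |
    xargs -n 2 echo grep 'string')

printf '%s\n' "$batches"
```

Three command lines come out: two with a pair of files each and a final
one with the leftover `file5.f`.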

In Richard's example invocation (repeated here):

   find . -name \*.f -print0 | xargs -0 grep 'string' /dev/null

it's interesting to point out why there is a backslash before the `*.f`
in the `find` command's argument list.  This is because we don't want
the shell to expand the glob `*.f`; instead, we want the literal glob
pattern to be passed as an argument to `find` so that it may perform
glob matching internally.  The backslash acts as an escape character to
prevent the shell from treating the `*` as a glob character.  I
generally put such arguments in single quotes instead so that the entire
glob is protected (whether it needs it or not).  But for a single glob
character, the backslash is shorter to type.
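
Here is a sketch of what goes wrong when the glob is left unprotected,
using an invented scratch tree with `top.f` in the starting directory
and `deep.f` one level down.

```shell
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/top.f" "$demo/sub/deep.f"

# Unprotected: the shell expands *.f to top.f, so find is really
# asked for "-name top.f" and never reports sub/deep.f.
unprotected=$(cd "$demo" && find . -name *.f | sort)

# Protected with a backslash or with single quotes: find receives the
# pattern itself and matches it at every depth.
escaped=$(cd "$demo" && find . -name \*.f | sort)
quoted=$(cd "$demo" && find . -name '*.f' | sort)

printf 'unprotected:\n%s\n' "$unprotected"
printf 'escaped:\n%s\n' "$escaped"
printf 'quoted:\n%s\n' "$quoted"
rm -rf "$demo"
```

The unprotected form finds only `./top.f`; the escaped and quoted forms
both find `./sub/deep.f` and `./top.f`.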

The maximum command line length can vary depending on kernel
configuration.  You can query the current value using the getconf[3]
command:

   $ getconf ARG_MAX
   131072

So on my Linux computer, the command line is limited to 128 KiB, a
budget shared by the arguments and the environment.  The beauty of
`xargs` is that it knows the value of ARG_MAX and will never exceed it,
so you don't have to worry about it yourself.

Globbing is part of the POSIX standard.  Basic globbing support is
available on every conforming shell, but some shells have extensions for
fancier globbing.  Richard pointed out the following about globbing:

 > Glob is the term for converting the command line you enter into the
 > command line that the command actually gets.
 > [...]
 > it might also apply to syntax like  {a,b,c}{1,2,3}f (which
 > expands to  a1f a2f a3f b1f b2f b3f c1f c2f c3f)

Bash actually draws a distinction between globbing and brace expansion.
Globbing does a pattern match against files that already exist in the
filesystem; brace expansion does not query the filesystem in any way to
perform the expansion.  Sometimes, this distinction doesn't matter, but
at times the behavior is very different.  For example, in an empty
directory, note the difference of behavior:

   $ echo {a,b}.txt
   a.txt b.txt
   $ echo [ab].txt
   [ab].txt

In the first case, brace expansion converts `{a,b}.txt` into `a.txt` and
`b.txt` (two separate arguments); in the second example, there are no
files that match the glob `[ab].txt`, so Bash leaves the glob untouched
as a single argument.  If you now create the single file `a.txt` in the
directory, notice the change:

   $ touch a.txt
   $ echo {a,b}.txt
   a.txt b.txt
   $ echo [ab].txt
   a.txt

Using brace expansion, there is no change; both `a.txt` and `b.txt` are
printed.  But with globbing, the `[ab].txt` expands to match the `a.txt`
file.

I often use brace expansion when I'm copying or renaming a file that is
deep in a directory.  Rather than type the following:

   mv deep/path/to/file deep/path/to/newfile

I type this:

   mv deep/path/to/{file,newfile}

and let brace expansion convert the second into the first.

It works for making a backup file as well:

   cp deep/path/to/file{,.bak}

which Bash expands to:

   cp deep/path/to/file deep/path/to/file.bak
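
A handy way to check a brace expansion before committing to it is to
put `echo` in front of the command, so the expanded command line is
printed rather than executed.  Brace expansion is a Bash feature rather
than part of POSIX `sh`, so this sketch runs each line through
`bash -c` to be explicit about the shell; the variable names are
illustrative.

```shell
# Preview the rename and the backup without touching any files.
renamed=$(bash -c 'echo mv deep/path/to/{file,newfile}')
backup=$(bash -c 'echo cp deep/path/to/file{,.bak}')

# Two brace groups side by side produce every combination.
grid=$(bash -c 'echo {a,b,c}{1,2,3}f')

printf '%s\n' "$renamed" "$backup" "$grid"
```

The last line prints `a1f a2f a3f b1f b2f b3f c1f c2f c3f`, nine words
in all, even though no such files exist.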

Michael Henry

[1]: Embedded versions of Linux, or rescue diskettes, or other
small-footprint distributions of Linux may well have a stripped-down
version of `find` to save disk space, but the general-purpose distros
I've seen have all shipped with GNU `find`.

[2]: http://subversion.tigris.org/

[3]: http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds2/getconf.htm

