[Novalug] Filename handling: correctness vs. convenience

Michael Henry lug-user at drmikehenry.com
Sat Feb 28 11:41:24 EST 2009


All,

In the Unix culture, there has been a historical bias against filenames
with "odd" characters.  Filenames containing whitespace, punctuation, or
other unusual characters are harder to deal with at the command line and
in scripts because these unusual characters are often used as
delimiters.  Unusual characters must be escaped in some way for the
shell to deal with them correctly.

Because of the inconvenience caused by these unusual characters, they
tend to be avoided in Unix filenames.  As a result of this scarcity of
unusual filenames, there are often shortcuts that can be taken to avoid
the hassle of fully general filename handling.  These shortcuts truly save
a lot of time, and I use them myself frequently; however, to paraphrase
a great philosopher[1], "A man's got to know his (tools') limitations".
Many of the shortcuts are in essence buggy approximations to a robust
general-case solution, but they are so much more convenient than the
fully correct solution and they are perfectly correct in a number of
interesting special cases.  It is my goal to point out some of these
shortcuts and their limitations, as well as the more awkward but
general-purpose solutions.

So to raise awareness of the general problem, I offer a couple of
suggestions for your consideration.  I'll be using Bash for my examples,
as that's the most common shell on Linux.  I'll try to keep my examples
generic enough to apply to any Bourne-shell derivative, but almost all
of my experience has been with Bash (and almost none with C-shell
derivatives), so though I believe the concepts generalize to most
shells, the details may vary if you are not using Bash.  You may want to
follow along in a temporary directory.  Throughout, commands typed at
the shell prompt begin with a dollar-sign::

    $ mkdir tmp
    $ cd tmp

Consider the task of creating a new file named ``dummy_file``::

    $ touch dummy_file
    $ ls -Q
    "dummy_file"

Note that the ``-Q`` option to GNU ``ls`` puts quotation marks around
the filenames to make the filename boundaries obvious.  The ``touch``
command updates the time stamp on a file, creating a new file if
necessary.  The shell splits the command line into the two words
``touch`` and ``dummy_file`` at the whitespace between the words.
Because neither the command nor the filename contained whitespace, no
special effort is required to invoke the command correctly.  But suppose
the desired filename had contained a space, such as ``dummy file``.  The
same simple invocation now produces an unintended result, creating two
new files (``dummy`` and ``file``)::

    $ touch dummy file
    $ ls -Q
    "dummy"  "dummy_file"  "file"

Some characters are always treated literally by the shell and taken at
"face value".  At a minimum, these include letters, numbers, and some
punctuation characters like period (``.``), slash (``/``), and
underscore (``_``).  Some other characters are always treated specially
unless some kind of quoting or escaping mechanism is used; these special
characters include whitespace, backslash (``\``), quotation marks, and
others.

To express a filename containing a special character requires some kind
of quoting or escaping.  One way to escape a special character is to use
a backslash before the special character::

    $ touch dummy\ file
    $ ls -Q
    "dummy"  "dummy file"  "dummy_file"  "file"

Now the desired ``dummy file`` is properly created.  The backslash
informs the shell to treat the escaped space character literally,
causing ``dummy file`` to be passed as a single argument to ``touch``.

Failure to escape special characters can have disastrous effects.  If the
goal is to remove ``dummy file``, the following command line is
erroneous; it will incorrectly remove the unrelated files ``dummy`` and
``file``::

    $ rm dummy file
    $ ls -Q
    "dummy file"  "dummy_file"

As an alternative to using backslash to escape the space, the filename
can be surrounded by quotes.  There is a distinction between single- and
double-quotes.  In general, single-quotes are "more powerful" in the
sense that all characters between a pair of single-quotes will be
treated literally[2].  Within double-quotes, whitespace is treated
literally but certain other characters are still treated specially; in
particular, the dollar sign (``$``) is used to expand shell variables.
For purely literal strings, I find single-quotes to be more convenient,
but double-quotes are required when expanding shell variables as
explained later.

To remove the special file ``dummy file``, quotes can be used to protect
the space from special treatment::

    $ rm 'dummy file'
    $ ls -Q
    "dummy_file"
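As a rough sketch of how the shell divides these command lines, the
following script counts the arguments each form produces.  Note that
``count_args`` is a helper made up for this illustration, not a standard
command, and the script is written to be run from a file rather than
typed at the prompt.

```shell
# count_args is a hypothetical helper for this illustration:
# it prints the number of arguments it received.
count_args() { echo "$#"; }

unquoted=$(count_args dummy file)      # split at the space: 2 arguments
escaped=$(count_args dummy\ file)      # backslash-escaped space: 1 argument
single=$(count_args 'dummy file')      # single-quoted: 1 argument
double=$(count_args "dummy file")      # double-quoted: 1 argument

echo "$unquoted $escaped $single $double"
```

Only the unquoted form yields two arguments; the three protected forms
each deliver the single argument ``dummy file``.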

Notice that quoting a filename is permissible even when it's not
necessary.  Though the filename ``dummy_file`` has no special
characters, it's legal to quote it anyway in the following command to
remove the file::

    $ rm 'dummy_file'
    $ ls -Q

Things get more interesting when shell variables are used.  A shell
variable can be assigned in many ways.  The following example sets the
variable ``filename`` to the value ``dummy_file``.  The subsequent
``echo`` command shows the value of ``filename``::

    $ export filename=dummy_file
    $ echo $filename
    dummy_file

When the shell encounters the variable expansion ``$filename`` on the
command line, it replaces it with the value of the variable.  Before the
``echo`` command is run, the shell updates that command line to become
``echo dummy_file``.  Similarly, the ``touch`` command could be used to
create a file named according to the value of the ``filename`` variable::

    $ touch $filename
    $ ls -Q
    "dummy_file"

As before, the command line ``touch $filename`` is replaced by the shell
with ``touch dummy_file`` before the ``touch`` command is run.
Command-line parsing continues after the variable replacement, but in
this case nothing interesting happens.  But suppose the variable held a
filename with a space::

    $ export filename='dummy file'
    $ echo $filename
    dummy file

The previous ``touch`` command would now erroneously create the two files
``dummy`` and ``file`` instead of the desired file ``dummy file``::

    $ touch $filename
    $ ls -Q
    "dummy"  "dummy_file"  "file"

The bug is the lack of quoting around the variable expansion of
``$filename``.  After expanding the variable, the shell continues to
parse the command line.  Because the expansion was not quoted, the space
in the filename is treated as an argument delimiter, so the ``touch``
command receives two distinct arguments, ``dummy`` and ``file``.  A
corrected invocation of ``touch`` that creates the filename specified by
the ``filename`` variable follows::

    $ touch "$filename"
    $ ls -Q
    "dummy"  "dummy file"  "dummy_file"  "file"

Notice the required quotes in the command.  Within double-quotes, the
shell will expand shell variables such as ``$filename``, but the
resulting string will not be further processed for argument splitting.
Therefore, the expanded filename will be treated as a single argument to
the ``touch`` command, yielding the correct result.
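Splitting at whitespace is not the only processing applied to an unquoted
expansion; the result is also subject to filename globbing.  The sketch
below shows both effects, assuming the default ``IFS`` and a scratch
directory holding two files named ``one`` and ``two``.  The helper
``count_args`` and the variable names are invented for the illustration.

```shell
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

workdir=$(mktemp -d)                 # scratch directory so the glob is predictable
cd "$workdir" || exit 1
touch one two

filename='dummy file'
split=$(count_args $filename)        # unquoted: split into "dummy" and "file"
whole=$(count_args "$filename")      # quoted: one argument

pattern='*'
globbed=$(count_args $pattern)       # unquoted: expands to "one" and "two"
literal=$(count_args "$pattern")     # quoted: the single argument "*"

echo "$split $whole $globbed $literal"
```

So a variable holding ``*`` is just as hazardous as one holding a space,
and the same double-quotes protect against both.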

This is a good place to point out the tension between correctness
and convenience.  A command such as ``touch $filename`` that lacks
proper quoting around a shell variable is buggy in the general case,
because the script writer has no control over
the naming of arbitrary files in the filesystem.  When a script must
work correctly in the general case, such quoting is required.  But when
the files of interest are known to contain no unusual characters,
quoting is not required.  Especially for one-liners and "throw-away"
scripts, the author often knows that the filenames contain no spaces or
other characters requiring special processing, so for convenience the
author leaves out the otherwise mandatory quoting.

In my view, such shortcuts are valuable productivity enhancers, as long
as the author is aware that he is cutting corners and does not come to
view the shortcut as the correct idiom for the general case.  Therefore,
whenever I suggest a shortcut idiom to a fellow hacker, I like to point
out where the shortcut fails in the general case.  This is especially
important for quoting because a missing quote is so easy to overlook.  An
unquoted expansion also tends to work fine during testing, due to the
relative scarcity of Unix filenames with unusual characters, and then fail
miserably in the wild.  In
addition, bad quoting can create some very serious security
vulnerabilities.

A common idiom for processing files in a directory is the shell ``for``
loop.  The ``for`` loop takes a list of space-separated words and
iterates across them.  For example::

    $ for i in one two three; do echo $i; done
    one
    two
    three

To iterate over words with unusual characters, the words must be
quoted.  For example, to iterate over the two phrases ``first   phrase``
and ``second   phrase`` (note the three spaces in each phrase)::

    $ for i in "first   phrase" "second   phrase"; do echo "$i"; done
    first   phrase
    second   phrase

Notice that for correctness, there must also now be quotes around the
expansion of the variable ``i`` in the ``echo`` command, in order to
preserve the three spaces in each phrase.  Without the quotes, the
``echo`` command will see each phrase as a list of words, and it will
put only a single space between them::

    $ for i in "first   phrase" "second   phrase"; do echo $i; done
    first phrase
    second phrase

Frequently the list of words will be taken from a filename glob (a
pattern to match filenames).  Here is an example that is analogous
to ``ls *``::

    $ for i in *; do echo "$i"; done
    dummy
    dummy file
    dummy_file
    file

It's common to use such a ``for`` loop to do something to each file
individually.  For example, to make a backup of each file::

    $ for i in *; do cp "$i" "$i".bak; done
    $ ls -Q
    "dummy"      "dummy file"  "dummy file.bak"  "file"
    "dummy.bak"  "dummy_file"  "dummy_file.bak"  "file.bak"

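This copes with whitespace and other special characters in filenames
because the variable expansions are quoted.  The idiomatic use of
filename globs does, however, run into trouble when there are too many
files and the expanded glob is handed to an external command (as in
``ls *``).  Because the shell expands the glob in-place, the size of the
command line grows.  The kernel limits the total size of a command's
arguments (historically around 128 KBytes on Linux; newer kernels allow
considerably more, and ``getconf ARG_MAX`` reports the local value), and
beyond that limit the command fails with "Argument list too long".  A
``for`` loop runs inside the shell itself, so the glob in the loop header
is not subject to this limit.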
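Quoting does not cover every unusual name the backup loop may meet.  A
filename beginning with a dash can be mistaken by ``cp`` for an option,
and a glob that matches nothing is left as the literal pattern.  A
slightly more defensive sketch, run in a scratch directory with two
made-up filenames:

```shell
workdir=$(mktemp -d)               # scratch directory for the demonstration
cd "$workdir" || exit 1
touch ./-n 'dummy file'            # one name starts with a dash, one has a space

for i in ./*; do                   # the ./ prefix keeps names from starting with "-"
    [ -e "$i" ] || continue        # glob matched nothing: skip the literal pattern
    cp -- "$i" "$i.bak"            # "--" also marks the end of options
done

ls
```

Either the ``./`` prefix or the ``--`` would be enough on its own for
``cp``; using both is simply belt and braces.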

One common technique to get around the command-line length limit is to
pipe a list of filenames into another program.  For example::

    $ ls *.bak | while read i; do echo "$i"; done
    dummy.bak
    dummy file.bak
    dummy_file.bak
    file.bak

This uses the ``read`` command to read one line of input at a time,
assigning the entire line to a shell variable.  (The ``*.bak`` glob handed
to ``ls`` is itself still subject to the limit; with a very large number
of files, something like ``ls | grep '\.bak$'`` avoids putting the
expanded glob on a command line.)  This works for
filenames with spaces and some other special characters, but in
particular does not work correctly for filenames containing newlines.
Though newlines are very rare in filenames, it's possible to create one,
so this idiom is only a shortcut, not a fully general solution.  For
example, consider the filename ``dummy\nfile``, where the embedded
``\n`` indicates a newline.  First, a little cleanup::

    $ rm *.bak

Now the following command will create a file with a newline::

    $ touch $'dummy\nfile'
    $ ls -Q dummy*file
    "dummy file"  "dummy_file"  "dummy\nfile"

Notice how the ``while read i`` idiom fails to treat ``dummy\nfile`` as
a single file::

    $ ls dummy*file | while read i; do echo The file is: "$i"; done
    The file is: dummy file
    The file is: dummy_file
    The file is: dummy
    The file is: file

Whereas the filename with the space is treated correctly,
``dummy\nfile`` is erroneously treated as the two separate files
``dummy`` and ``file``.  
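Two smaller limitations of the plain ``read i`` form are worth knowing
as well: without ``-r`` it treats backslashes as escape characters, and
with the default ``IFS`` it strips leading and trailing whitespace.  A
sketch with two made-up filenames, run in a scratch directory:

```shell
workdir=$(mktemp -d)               # scratch directory for the demonstration
cd "$workdir" || exit 1
touch ' padded' 'back\slash'       # leading space; embedded backslash

# Plain read: the leading space and the backslash are both lost.
plain=$(ls | while read i; do printf '[%s]' "$i"; done)

# IFS= keeps surrounding whitespace; -r keeps backslashes literal.
careful=$(ls | while IFS= read -r i; do printf '[%s]' "$i"; done)

echo "plain:   $plain"
echo "careful: $careful"
```

Even ``IFS= read -r`` still splits its input at newlines, so this only
narrows the gap; it does not make the idiom fully general.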

Another approach for dodging the command-line length limit is the use of
``find`` and ``xargs``.  ``find`` writes a list of matched filenames
to the standard output, and ``xargs`` then splits that list at
whitespace and passes the pieces as command-line arguments to another
program.  As an example::

    $ find -name 'dummy*file' | xargs ls -Q
    "./dummy"  "./dummy"  "./dummy_file"  "file"  "file"

Notice that both filenames containing whitespace were split incorrectly
and treated as the pair of filenames ``dummy`` and ``file``.  To correct
this erroneous behavior, you can use the ``-print0`` option to GNU
``find`` and the corresponding ``-0`` option to GNU ``xargs`` (these
switches are unfortunately not portable to all older Unix systems, but
they work on many modern Unix systems like Linux).  Correct behavior is
achieved using this idiom::

    $ find -name 'dummy*file' -print0 | xargs -0 ls -Q
    "./dummy file"  "./dummy_file"  "./dummy\nfile"
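Where the GNU switches are unavailable, one alternative (assuming a
``find`` that supports the POSIX ``-exec ... {} +`` form) is to let
``find`` run the command itself, so the names never pass through a text
stream.  The sketch below compares the three approaches in a scratch
directory, using ``sh -c 'echo "$#"'`` as a stand-in for a real command
so that each pipeline simply reports how many arguments it delivered.

```shell
workdir=$(mktemp -d)               # scratch directory for the demonstration
cd "$workdir" || exit 1
nl='
'                                  # a literal newline, for the third filename
touch 'dummy file' dummy_file "dummy${nl}file"

# Stand-in command: report how many arguments arrived.
naive=$(find . -name 'dummy*file' | xargs sh -c 'echo "$#"' sh)
nul=$(find . -name 'dummy*file' -print0 | xargs -0 sh -c 'echo "$#"' sh)
direct=$(find . -name 'dummy*file' -exec sh -c 'echo "$#"' sh {} +)

echo "naive=$naive nul=$nul direct=$direct"
```

With the three files above, the plain pipeline delivers five arguments,
while the ``-print0`` pipeline and the ``-exec`` form each deliver three.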

I hope this discussion has shed some light on what I feel are
some historically dark corners in the Unix culture.  It's surprisingly
easy to write scripts that behave properly in every tested case but which
fail spectacularly when presented with unusual filenames in the wild
(especially when black hats have the chance to choose the filenames).

Michael Henry

[1]: http://www.imdb.com/title/tt0070355/quotes

[2]: Note that you can't use single-quotes to quote another
single-quote.
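The usual workaround is to close the quoted string, add a
backslash-escaped single-quote, and reopen the quotes.  A small sketch
(the variable names are invented for the illustration):

```shell
# Close the quote, add an escaped single-quote, reopen the quote.
name='it'\''s a file'

# The same string written with double-quotes, for comparison.
same="it's a file"

echo "$name"
```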


