[Novalug] bash & grep question - best for optimizing?

James Ewing Cottrell 3rd JECottrell3 at Comcast.NET
Tue Nov 14 19:17:49 EST 2006


Ross Patterson wrote:

> At 17:42 11/14/2006, James Ewing Cottrell 3rd wrote:
>
>> One could argue, and I will, that hacking -r into grep is an 
>> abomination. The tool for recursing is FIND.
>
> For general purpose tools, I agree.  Of course, in that case, "ls" is 
> even more of an abomination than anyone who reads the man page 
> thinks.  For the Only Correct Answer to the "show me my files" 
> question is, naturally, "find . -print".

I agree in theory. But ls is such a basic command. Still, the lesson 
should have been learned from doing this. Or at the very least, from BSD 
implementing cp -r incorrectly (following symlinks).

> For special purpose tools, especially those that are intended (as grep 
> was) to process large volumes of "stuff", performance rules the roost, 
> and you do whatever you need to get it.  Back in the day when 
> everything was a uniprocessor system, "grep -r" was faster than "find 
> | grep".

What do you mean "back in the day?" The -r option to grep is a recent 
invention. And given that just about everything is I/O bound, there is 
more than enough parallelism.

>  Today, when many modern systems are at least hyperthreaded, there 
> might be genuine performance benefit to splitting the task across two 
> processes.  Then again, you said "find | xargs grep", not "find | 
> grep", because grep lacks a "take the list of files from stdin" mode...

OK, I am convinced that grep should be able to read a list of files from 
stdin. Good point. In fact, perhaps a general option for this should be 
added to as many commands as would make sense.
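
Until grep grows such a mode, it can be approximated with a small wrapper. 
This is only a sketch: grep_stdin is a made-up name, not a real grep 
option, and it assumes an xargs that supports -0 (GNU and BSD do) and 
file names that contain no newlines.

```shell
# grep_stdin: hypothetical wrapper, not part of grep.
# Reads one file name per line on stdin and runs grep, with the given
# options and pattern, on those files.
grep_stdin() {
    # Turn newlines into NULs so names with spaces or quotes reach grep
    # intact; -0 is an extension found in GNU and BSD xargs.
    # /dev/null is added so grep prefixes matches with the file name even
    # when a batch happens to contain a single file.
    tr '\n' '\0' | xargs -0 grep "$@" /dev/null
}

# Example: find . -type f | grep_stdin -l pat
```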

> ...and that means that you'll have potentially thousands of processes 
> (1 find, 1 xargs, many greps), so maybe it won't perform.  As always, 
> Performance Analyst's Answer Number Three applies: "It depends."

Yes, it does depend, but remember: "We are I/O Bound" here. Ferrying a 
list of names through a pipe hardly makes a difference. And xargs 
bundles up many, many arguments into the commands it executes. The 
default -s (command size) is 128K, so most commands generated will eat 
well over 1000 args.
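
The batching is easy to observe. In the sketch below, count_commands is a 
made-up helper and the ./some/dir/fileN names are invented; since each 
echo that xargs runs prints exactly one line, counting lines counts the 
commands generated.

```shell
# count_commands: hypothetical helper for illustration.
#   $1      number of fake file names to feed to xargs
#   $2...   extra options passed straight to xargs (for example -n 1000)
count_commands() {
    total=$1; shift
    awk -v n="$total" 'BEGIN { for (i = 1; i <= n; i++) print "./some/dir/file" i }' |
        xargs "$@" echo |   # one echo, and so one output line, per batch
        wc -l | tr -d ' '   # strip the padding some wc versions add
}

count_commands 5000           # default batching; depends on the system's limits
count_commands 5000 -n 1000   # cap each command at 1000 names
```

The second call should report 5, since 5000 names at 1000 per command 
need five commands. The first number varies with the xargs 
implementation and its size limit, but it will be far smaller than 5000.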

>> Hacking every tool's features into every other tool violates one of 
>> the ancient UNIX Fundamentals that "A Tool does one thing and does 
>> it well".
>
> Please tell me that you use the One True and Holy Editor: ed.  It does 
> one thing and one thing only, and you don't need to be polydactyl to 
> use it.

Editors are not Tools. They are Complete User Environments. And yet, the 
King of editors, emacs, relies heavily on composition. The foundation is 
C, the interior design is Lisp, and heavy use is made of external utilities.

>> but a better way is to do something like:
>>
>>    find . -type f | xargs grep -l pat |
>>    while read file
>>    do process "$file"; done
>
>
> Hey, while we're at it, what's with all this post-sh stuff?  "while"?  
> "do"?  Sheesh.  Any problem that needs to be solved should be solved 
> using pre-existing filters in a pipeline without new code.
>
> :-)

Piping a list of file names to "while read x" is a Design Pattern, one 
worth knowing. Unfortunately, it seems to be all too rare, probably 
because of decades of using t?csh.
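
A slightly more defensive version of the pattern might look like the 
sketch below. Two things here are assumptions rather than part of the 
original example: process is a stand-in for the real per-file work, and 
"find -exec ... {} +" is swapped in for "find | xargs" so that names 
containing spaces or quotes survive (it batches arguments much as xargs 
does). Names containing newlines are still not handled.

```shell
# process: hypothetical placeholder for whatever each matching file needs.
process() {
    printf 'processing: %s\n' "$1"
}

# grep_and_process: hypothetical helper.
#   $1  pattern to search for
#   $2  directory to search
grep_and_process() {
    # grep -l prints only the names of files that contain the pattern.
    find "$2" -type f -exec grep -l -- "$1" {} + |
    # IFS= keeps leading and trailing blanks in a name;
    # -r stops read from treating backslashes as escapes.
    while IFS= read -r file
    do process "$file"; done
}

# Example: grep_and_process pat .
```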

> Ross 

JIM


