[Novalug] bash & grep question - best for optimizing?

Richard Rognlie rrognlie at gamerz.net
Mon Nov 13 09:29:25 EST 2006


On Mon, Nov 13, 2006 at 09:08:40AM -0500, Nick Danger wrote:
> 
> I have a large volume of files.(*) I would like to run a grep through
> them and then act on the files that match. Easy enough to do. The
> question I have is, which way is best?
> 
> 1. This is a two layer deep hashed structure, and I have 4 patterns I
> want to match. I can either do a "grep -rl" at the top level, or cd into
> each hash (down 2 layers) and do a "grep " in that directory.
> 
> 2. Should I do one grep for each pattern, or a single grep with multiple
> matches?
> 
> There are anywhere from 200,000 to 250,000 files in there, so it's not
> exactly a speedy process and so any few mins I can eke out of my shell
> script, I'd like to :-)

If any given directory is going to be large, I'd highly recommend you do a
single recursive pass from the top level:
	egrep -r

For portability, since not all forms of grep grok the -r flag (yes, I know
I'm dating myself, but...) I'd even go so far as to change the command to

	find DIR -type f -print0 | xargs -0 egrep WHATEVER /dev/null

The /dev/null at the end of the egrep is important. Otherwise you run the risk
(however small) that xargs might choose to run egrep on a single file, and
you'd not see which file matched (if it did). /dev/null forces the command to
have at least 2 filenames, which in turn forces egrep to print the name of
the file that matched.
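
To see the /dev/null effect in isolation, here is a minimal sketch. The
temporary directory, the file name only.txt and the pattern are made up for
the demo, and it uses "grep -E" (the newer spelling of egrep) and assumes
mktemp is available:

```shell
#!/bin/sh
# Demo only: paths, file contents and the pattern are hypothetical.
tmp=$(mktemp -d)
printf 'needle here\n' > "$tmp/only.txt"

# One file operand: grep prints just the matching line, no filename.
without_null=$(grep -E 'needle' "$tmp/only.txt")

# Two file operands (/dev/null is empty, so it can never match):
# grep now prefixes each match with the name of the file it came from.
with_null=$(grep -E 'needle' "$tmp/only.txt" /dev/null)

echo "without /dev/null: $without_null"
echo "with /dev/null:    $with_null"

rm -rf "$tmp"
```

The first line of output carries no filename; the second starts with the
path to only.txt followed by a colon.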

At that point you no longer depend on grep supporting -r, only on find and
xargs supporting -print0 and -0 (the GNU and BSD versions do).

On the "1 grep or 2" front... remember: disk is slow, CPU is fast.

So it's almost always faster to do one more complex egrep than it is to
do two separate greps, because each separate grep has to read every file
all over again.

	egrep '(pattern1|pattern2)'
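
Putting the two ideas together, a rough sketch of a single pass that lists
the matching files so a script can act on them. The directory layout, file
names and patterns are invented for the demo, "grep -E -l" stands in for
egrep with the list-files-only flag, and mktemp is assumed to exist:

```shell
#!/bin/sh
# Demo only: a tiny two-level tree standing in for the real hashed structure.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/b" "$tmp/c/d"
printf 'has pattern1\n' > "$tmp/a/b/one"
printf 'has pattern2\n' > "$tmp/c/d/two"
printf 'nothing useful\n' > "$tmp/c/d/three"

# One walk of the tree, one read of each file, both patterns at once.
# -l prints only the names of matching files; /dev/null is kept from the
# command above and is harmless here.
matches=$(find "$tmp" -type f -print0 |
    xargs -0 grep -E -l '(pattern1|pattern2)' /dev/null | sort)

# "Act on" each match; here we just print it.
# (This simple loop assumes file names contain no newlines.)
printf '%s\n' "$matches" | while IFS= read -r f; do
    echo "matched: $f"
done

rm -rf "$tmp"
```

For four patterns, the alternation simply grows to
'(pattern1|pattern2|pattern3|pattern4)'.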

-- 
 /  \__  | Richard Rognlie / Sendmail Ninja / Gamerz.NET Lackey
 \__/  \ | http://www.gamerz.net/~rrognlie    <rrognlie at gamerz.net>
 /  \__/ | Creator of pbmserv at gamerz.net
 \__/    |                Helping reduce world productivity since 1994


