[Novalug] bash & grep question - best for optimizing?

Russell Evans russell-evans at qwest.net
Mon Nov 13 11:11:28 EST 2006


On Mon, 13 Nov 2006 09:08:40 -0500
"Nick Danger" <nick at hackermonkey.com> wrote:

> 
> I have a large volume of files.(*) I would like to run a grep through
> them and then act on the files that match. Easy enough to do. The
> question I have is, which way is best?
> 
> 1. This is a two-layer-deep hashed structure, and I have 4 patterns I
> want to match. I can either do a "grep -rl" at the top level, or cd
> into each hash (down 2 layers) and do a "grep" in that directory.
> 
> 2. Should I do one grep for each pattern, or a single grep with
> multiple matches?
> 
> There are anywhere from 200,000 to 250,000 files in there, so it's not
> exactly a speedy process, and any few minutes I can eke out of my shell
> script, I'd like to :-)
> 
> Thanks
> -Nick
> 
> (*) This is a mail spooler, or as we call it "where mail goes to die."
> Generally if it doesn't get spooled off in a few hours, it sits there
> the entire time until it expires out. I know we have issues with
> accepting too much email on the spooler itself, but I'm not fixing Postfix
> right now (someone else is working on that); I'm just trying to remove
> mail that matches specific patterns.



You might want to look at Swish++
http://swishplusplus.sourceforge.net/

From the features page,
http://swishplusplus.sourceforge.net/features.html#mail_news
under "Intelligently index mail and news files":

SWISH++ indexes words in headers and associates them with the name of
the headers as meta names that can later be queried against
specifically, e.g.:

        search subject = big-bang

Similarly, words in vCard fields are associated with the names of the
fields as meta names that can also later be queried against, e.g.:

        search title = professor
        search org = SLAC

Additionally, plain and enriched text, and HTML in any one of ASCII,
ISO-8859-1, UTF-7, or UTF-8 character sets in any one of 7-bit, 8-bit,
quoted-printable, or base-64 encodings is decoded and converted
on-the-fly thus properly indexing encoded bodies and attachments.

Lastly, attachments having other MIME types can be filtered on-the-fly
before being indexed, e.g., convert Microsoft Word or PDF attachments
to plain text.
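
If you would rather stay with plain grep, here is a rough sketch of the
"single grep, multiple patterns" option run once from the top level. The
directory layout, file names, and patterns below are made up for
illustration, and it assumes a grep that supports -r (GNU grep does). The
idea is that each file is read only once, and -l lets grep stop reading a
file at its first match.

```shell
#!/bin/sh
# Hypothetical two-level hashed spool with three sample messages.
spool=$(mktemp -d)
mkdir -p "$spool/A/1" "$spool/B/2"
printf 'Subject: cheap pills\n'       > "$spool/A/1/msg1"
printf 'Subject: meeting notes\n'     > "$spool/A/1/msg2"
printf 'From: lottery@example.com\n'  > "$spool/B/2/msg3"

# One recursive pass with all four (made-up) patterns:
#   -r  recurse from the top of the spool
#   -l  print only the names of matching files
#   -F  treat patterns as fixed strings, not regular expressions
#   -e  supply each pattern separately
matches=$(grep -rlF \
    -e 'cheap pills' \
    -e 'lottery@' \
    -e 'pattern three' \
    -e 'pattern four' \
    "$spool" | sort)

# Show the files that matched at least one pattern.
printf '%s\n' "$matches"

# Remove the sample spool again.
rm -rf "$spool"
```

To act on the matches instead of printing them, the list can be piped to
xargs. With GNU grep and GNU xargs, adding -Z to grep and -0 to xargs
keeps unusual file names intact. Whether this beats four separate passes
on your spool is something to time on real data rather than take from
me.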

Thank you
Russell


