[Novalug] bash & grep question - best for optimizing?
Russell Evans
russell-evans at qwest.net
Mon Nov 13 11:11:28 EST 2006
On Mon, 13 Nov 2006 09:08:40 -0500
"Nick Danger" <nick at hackermonkey.com> wrote:
>
> I have a large volume of files.(*) I would like to run a grep through
> them and then act on the files that match. Easy enough to do. The
> question I have is, which way is best?
>
> 1. This is a two-layer-deep hashed structure, and I have 4 patterns I
> want to match. I can either do a "grep -rl" at the top level, or cd
> into each hash (down 2 layers) and do a "grep " in that directory.
>
> 2. Should I do one grep for each pattern, or a single grep with
> multiple matches?
>
> There are anywhere from 200,000 to 250,000 files in there, so it's not
> exactly a speedy process, and any few minutes I can eke out of my shell
> script, I'd like to :-)
>
> Thanks
> -Nick
>
> (*) This is a mail spooler, or as we call it "where mail goes to die."
> Generally if it doesn't get spooled off in a few hours, it sits there
> the entire time until it expires out. I know we have issues with
> accepting too much email on the spooler itself, but I'm not fixing postfix
> right now (someone else is working on that), I'm just trying to remove
> mail that matches specific patterns.
You might want to look at Swish++

http://swishplusplus.sourceforge.net/
http://swishplusplus.sourceforge.net/features.html#mail_news

Intelligently index mail and news files

SWISH++ indexes words in headers and associates them with the name of
the headers as meta names that can later be queried against
specifically, e.g.:

    search subject = big-bang

Similarly, words in vCard fields are associated with the names of the
fields as meta names that can also later be queried against, e.g.:

    search title = professor
    search org = SLAC

Additionally, plain and enriched text, and HTML in any one of ASCII,
ISO-8859-1, UTF-7, or UTF-8 character sets in any one of 7-bit, 8-bit,
quoted-printable, or base-64 encodings is decoded and converted
on-the-fly thus properly indexing encoded bodies and attachments.

Lastly, attachments having other MIME types can be filtered on-the-fly
before being indexed, e.g., convert Microsoft Word or PDF attachments
to plain text.
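
If indexing is more than you need for a one-off cleanup, here is a rough
sketch of the plain grep route from your question: one walk of the tree
and one grep per batch of files, with all the patterns handed over
together. The function name list_matches and the example patterns are
made up for illustration, and I am assuming the patterns are fixed
strings (hence -F) rather than regular expressions. I have not timed
this against your spool, so treat it as a starting point, not a
benchmark.

```shell
# list_matches SPOOL_DIR PATTERN...
# Hypothetical helper: print the name of every regular file under
# SPOOL_DIR that contains at least one of the given fixed strings.
list_matches() {
    spool_dir=$1
    shift
    # Refuse to run without patterns: an empty pattern matches every file.
    [ "$#" -ge 1 ] || return 2
    # grep treats a newline-separated pattern list as alternatives,
    # so one -e argument carries all of the patterns.
    patterns=$(printf '%s\n' "$@")
    # -l prints only file names and stops reading a file at its first match.
    # "-exec ... {} +" passes many files to each grep invocation.
    find "$spool_dir" -type f -exec grep -lF -e "$patterns" -- {} +
}
```

Called as, say, list_matches /path/to/spool 'pattern one' 'pattern two',
it prints one matching file name per line, which you can review before
acting on it. Since this is a Postfix queue, it may be safer to map the
matches to queue IDs and remove them with postsuper -d than to delete
the files by hand.
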
Thank you
Russell