[Novalug] Converting HTML with sed and regular expressions
Ben Creitz
creitz at gmail.com
Mon Apr 9 07:07:48 EDT 2007
On 4/8/07, Jim Ide <jimsmaillists at yahoo.com> wrote:
> I have several HTML files that contain lines similar to these:
>
> <P CLASS="western" STYLE="margin-bottom: 0in"><FONT FACE="Comic Sans MS, cursive">
> <!--
> here is a
> multi line
> comment
> -->
>
> I want to:
> 1. change the <P *> lines to <p>
> 2. delete the <FONT> elements
> 3. remove the comments
>
> I am using sed as follows:
>
> sed -f fix.sed.txt < in.html > out.html
>
> fix.sed.txt contains the following:
>
> s/<FONT*>//g
> s/<P*>/<p>/g
> s/<!--*-->//g
>
> These sed regexps have no effect. What am I doing wrong?
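Parsing HTML with regular expressions usually turns out to be more
complicated than it initially seems, especially if the source HTML is
sloppy and you have to cover a bunch of corner cases. Luckily it is
fun. Your expressions have no effect because '*' in a regex is not a
shell-style wildcard: it means "zero or more of the preceding
character", so '<FONT*>' only matches '<FON', then zero or more 'T's,
then '>'. Assuming an opening font tag does not span a line break (sed
is looking at one line at a time), you could try this: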
s/<FONT[^>]*>//g
... in other words, match '<FONT' followed by any number of characters
that are not '>', followed by '>'. Don't forget to match the closing
tags as well:
s/<\/FONT>//g
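A quick way to see the difference is to run the original and the revised patterns against the same line. This is only a sketch: the sample line and the variable names are made up for illustration, modeled on the line in the question.

```shell
# Hypothetical sample line, modeled on the one in the original question.
line='<FONT FACE="Comic Sans MS, cursive">Hello</FONT>'

# Original pattern: 'T*' means "zero or more T", so this looks for '<FON',
# any number of 'T', and then '>' immediately. Nothing on the line matches.
broken=$(printf '%s\n' "$line" | sed 's/<FONT*>//g')

# Revised patterns: drop the opening tag (attributes included), then the
# closing tag.
fixed=$(printf '%s\n' "$line" | sed -e 's/<FONT[^>]*>//g' -e 's/<\/FONT>//g')

printf 'broken: %s\n' "$broken"
printf 'fixed:  %s\n' "$fixed"
```

The first result should come back unchanged, while the second should contain only the text that was between the tags.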
Your approach with <P> should be similar. Multiline comments will
require the advice of somebody else... they are tricky because, as I
mentioned, sed is matching one line at a time. You could remove the
first and last line of a comment, but you need a way to know that
what's in between was also part of the comment.
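One common sed idiom for the comment case is an address range, which deletes every line from the one matching '<!--' through the one matching '-->'. What follows is a rough sketch rather than a tested solution for real-world HTML. It assumes a multi-line comment opens and closes on lines that hold nothing else worth keeping, since the range deletes whole lines. The sample input and variable names are hypothetical, and the '<P' handling is just one guess at what "similar" could look like.

```shell
# Hypothetical input, modeled on the sample in the original question.
input='<P CLASS="western" STYLE="margin-bottom: 0in"><FONT FACE="Comic Sans MS, cursive">
<!--
here is a
multi line
comment
-->
Some text</FONT></P>'

# Order matters: strip comments that open and close on one line first, so
# they cannot start a range that swallows the lines after them.
cleaned=$(printf '%s\n' "$input" | sed \
  -e 's/<!--.*-->//g' \
  -e '/<!--/,/-->/d' \
  -e 's/<FONT[^>]*>//g' \
  -e 's/<\/FONT>//g' \
  -e 's/<P>/<p>/g' \
  -e 's/<P [^>]*>/<p>/g' \
  -e 's/<\/P>/<\/p>/g')

printf '%s\n' "$cleaned"
```

The '<P' rule is split into a bare '<P>' and a '<P ' followed by attributes so that tags such as '<PRE>' are left alone. The single-line comment rule is greedy, so a line holding two separate comments would lose the text between them.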
If you happen to have a Montgomery County, MD library card, you can
read the full text of "Mastering Regular Expressions" online here:
http://www.montgomerycountymd.gov/Apps/Libraries/researchinfo/safari_remote.asp
Of course, there are a ton of other online resources for regexes.
Have fun and good luck!
-Ben