[Novalug] Converting HTML with sed and regular expressions

Ben Creitz creitz at gmail.com
Mon Apr 9 07:07:48 EDT 2007


On 4/8/07, Jim Ide <jimsmaillists at yahoo.com> wrote:
> I have several HTML files that contain lines similar to these:
>
> <P CLASS="western" STYLE="margin-bottom: 0in"><FONT FACE="Comic Sans MS, cursive">
> <!--
> here is a
> multi line
> comment
> -->
>
> I want to:
> 1. change the <P *> lines to <p>
> 2. delete the <FONT> elements
> 3. remove the comments
>
> I am using sed as follows:
>
> sed -f fix.sed.txt < in.html > out.html
>
> fix.sed.txt contains the following:
>
> s/<FONT*>//g
> s/<P*>/<p>/g
> s/<!--*-->//g
>
> These sed regexps have no effect.  What am I doing wrong?

Parsing HTML with regular expressions usually turns out to be more
complicated than it initially seems, especially if the source HTML is
sloppy and you have to cover a bunch of corner cases.  Luckily it is
fun.  Assuming an opening font tag does not span a line break (sed
looks at one line at a time), you could try this:

  s/<FONT[^>]*>//g

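... in other words, match '<FONT' followed by any amount of anything
that is not '>', followed by '>'.  In your original pattern the '*' is
not a shell-style wildcard: 'T*' means "zero or more T's", so it could
only match a tag with no attributes, such as '<FONT>'.  Don't forget
to match the closing tags as well: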

  s/<\/FONT>//g
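
To see the difference between the two patterns, here is a small sketch
you can paste into a shell.  The sample line and the function name
strip_font are made up for illustration; only the sed expressions come
from this thread.

```shell
# Hypothetical sample line, modelled on the one quoted above.
line='<P CLASS="western"><FONT FACE="Comic Sans MS, cursive">Hello</FONT></P>'

# Original pattern: 'T*' is "zero or more T", so a tag with
# attributes is not matched and the line comes back unchanged.
printf '%s\n' "$line" | sed 's/<FONT*>//g'

# Bracket-expression version: remove opening FONT tags (with or
# without attributes), then the closing tags.
strip_font() {
  sed -e 's/<FONT[^>]*>//g' -e 's/<\/FONT>//g'
}

printf '%s\n' "$line" | strip_font
```

The first command should print the line untouched, and the second
should print it with both FONT tags gone and the <P> tag still in
place.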

Your approach with <P> should be similar.  Multiline comments will
require the advice of somebody else... they are tricky because, as I
mentioned, sed is matching one line at a time.  You could remove the
first and last line of a comment, but you need a way to know that
what's in between was also part of the comment.
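
For what it's worth, here is a rough sketch of how the <P> case and
the comments might be handled together.  It is not a tested recipe for
arbitrary HTML.  The function name fix_html is made up, and the
comment handling rests on an assumption: a comment that spans several
lines has its '<!--' and '-->' on lines with nothing else worth
keeping, as in your sample.  The range address /<!--/,/-->/ deletes
whole lines, so any other text sharing a line with those markers is
lost.  The single-line substitution before it is greedy, so two
comments on one line would also take the text between them.

```shell
fix_html() {
  # 1. Drop comments that open and close on the same line.
  # 2. Delete every line from one containing '<!--' through the next
  #    line containing '-->' (multi-line comments).
  # 3. Rewrite <P> and <P ...> as <p>; the space in the second pattern
  #    keeps tags such as <PRE> from matching.
  # 4. Lower-case the closing </P> too (an addition, not in the
  #    original request).
  # 5. Remove opening and closing FONT tags.
  sed -e 's/<!--.*-->//g' \
      -e '/<!--/,/-->/d' \
      -e 's/<P>/<p>/g' \
      -e 's/<P [^>]*>/<p>/g' \
      -e 's/<\/P>/<\/p>/g' \
      -e 's/<FONT[^>]*>//g' \
      -e 's/<\/FONT>//g'
}

# Hypothetical input modelled on the sample in the original question.
fix_html <<'EOF'
<P CLASS="western" STYLE="margin-bottom: 0in"><FONT FACE="Comic Sans MS, cursive">
<!--
here is a
multi line
comment
-->
Some text</FONT></P>
EOF
```

On that input this should leave just two lines, '<p>' and
'Some text</p>'.  The same expressions could go one per line into
fix.sed.txt (without the -e and the quotes) if you prefer to keep
using sed -f.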

If you happen to have a Montgomery County, MD library card, you can
read the full text of "Mastering Regular Expressions" online here:

http://www.montgomerycountymd.gov/Apps/Libraries/researchinfo/safari_remote.asp

Of course, there are a ton of other online resources for regexes.
Have fun and good luck!

-Ben