[Novalug] Converting HTML with sed and regular expressions

David Raleigh Arnold dra at openguitar.com
Mon Apr 9 02:52:45 EDT 2007


On Sunday 08 April 2007 23:12, Jim Ide wrote:
> Hello -
> 
> I have several HTML files that contain lines similar to these:
> 
> <P CLASS="western" STYLE="margin-bottom: 0in"><FONT FACE="Comic Sans MS, cursive">
> <!--
> here is a
> multi line
> comment
> -->
> 
> I want to:
> 1. change the <P *> lines to <p>
> 2. delete the <FONT> elements
> 3. remove the comments
> 
> I am using sed as follows:
> 
> sed -f fix.sed.txt < in.html > out.html
> 
> fix.sed.txt contains the following:
> 
> s/<FONT*>//g
> s/<P*>/<p>/g
> s/<!--*-->//g
> 
> These sed regexps have no effect.  What am I doing wrong?

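The "*" is a modifier meaning "any number of the previous character,
including none", so
s/<FONT*>//g would erase <FON>, <FONT>, <FONTT>, <FONTTT>, etc.
It never matches <FONT FACE=...> because the character after the T
there is a space, not ">".  To match everything up to the closing ">",
use [^>]* in place of the bare "*".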
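To see the difference, here is a small sketch run against a sample line modeled on the one in the question. The variable names and the sample text are made up for illustration; the only change to the sed command is [^>]* in place of the bare "*".

```shell
# Sample input: one <FONT ...> tag with attributes, plus a literal <FONTT>.
line='<FONT FACE="Comic Sans MS, cursive">text<FONTT>'

# Original pattern: "T*" is zero or more T, so only <FON>, <FONT>, <FONTT>, ... match.
star_only=$(printf '%s\n' "$line" | sed 's/<FONT*>//g')

# "[^>]*" is any run of characters other than ">", which covers the attributes.
with_class=$(printf '%s\n' "$line" | sed 's/<FONT[^>]*>//g')

printf '%s\n' "$star_only" "$with_class"
```

The first command removes only the <FONTT>; the second removes both tags and leaves just the text.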

The second command doesn't accomplish much.  Would it be better
to have actual paragraphs?  Substitute a line break for each opening
or closing tag.

s/<\/*[pP][^>]*>/\
/g

or, if your sed accepts \n in the replacement (GNU sed does):
s/<\/*[pP][^>]*>/\n/g
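A quick way to check what the paragraph substitution does is to run it on one line. This sketch uses the backslash-newline form, which does not depend on \n being understood in the replacement; the sample HTML and variable names are invented for the example.

```shell
# One line holding two paragraphs, with mixed-case tags and an attribute.
html='<P CLASS="western">first</P><p>second</p>'

# Every opening or closing p tag becomes a line break.
paragraphs=$(printf '%s\n' "$html" | sed 's/<\/*[pP][^>]*>/\
/g')

printf '%s\n' "$paragraphs"
```

Note that [pP][^>]* also matches any other tag whose name starts with p, such as <PRE>, so check the input for those first.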

For the comments, you don't have anything else on those lines anyway,
do you?

/<!--/,/-->/ d

or, to delete only the lines holding the comment markers (any lines
between them stay):

/<!--/ d
/-->/ d
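The two approaches behave differently on a multi-line comment like the one in the question. A small sketch, with sample input and variable names made up for the comparison:

```shell
# Sample input: a multi-line comment between two paragraphs.
input='<p>before</p>
<!--
here is a
multi line
comment
-->
<p>after</p>'

# Range address: delete from a line matching <!-- through the next line matching -->.
range_out=$(printf '%s\n' "$input" | sed '/<!--/,/-->/d')

# Two separate deletes: only the lines containing the markers are removed.
pair_out=$(printf '%s\n' "$input" | sed -e '/<!--/d' -e '/-->/d')

printf '%s\n' "--- range ---" "$range_out" "--- pair ---" "$pair_out"
```

One caution with the range form: sed starts looking for the closing pattern on the line after the one that opened the range, so a comment that opens and closes on a single line will run the deletion on to the next --> or to the end of the file.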

Do you wish to just strip all one-line tags?

s/<[^>]*>//g
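For example, applied to a sample line (invented here, with made-up variable names), the command leaves only the text between the tags:

```shell
# Sample line with nested tags, all opened and closed on the same line.
line='<P CLASS="western"><FONT FACE="Comic Sans MS, cursive">Hello, <B>world</B></FONT></P>'

# Remove every "<", run of non-">" characters, ">" sequence on the line.
stripped=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf '%s\n' "$stripped"
```

Because sed works one line at a time, a tag that is split across two lines is not matched by this pattern.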

You don't say what you want to convert the HTML to.  daveA

