[Novalug] Converting HTML with sed and regular expressions
David Raleigh Arnold
dra at openguitar.com
Mon Apr 9 02:52:45 EDT 2007
On Sunday 08 April 2007 23:12, Jim Ide wrote:
> Hello -
>
> I have several HTML files that contain lines similar to these:
>
> <P CLASS="western" STYLE="margin-bottom: 0in"><FONT FACE="Comic Sans MS, cursive">
> <!--
> here is a
> multi line
> comment
> -->
>
> I want to:
> 1. change the <P *> lines to <p>
> 2. delete the <FONT> elements
> 3. remove the comments
>
> I am using sed as follows:
>
> sed -f fix.sed.txt < in.html > out.html
>
> fix.sed.txt contains the following:
>
> s/<FONT*>//g
> s/<P*>/<p>/g
> s/<!--*-->//g
>
> These sed regexps have no effect. What am I doing wrong?
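The "*" is a modifier meaning "any number of the previous
character, including none", so
s/<FONT*>//g would erase <FON>, <FONT>, <FONTT>, <FONTTT>, etc.,
but not <FONT FACE="...">, because the attributes are not a run of T's.
To match everything up to the closing ">", use [^>]* in place of the
bare "*".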
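The difference is easy to check on a single line. This is only a sketch: the sample line is shortened from the one in the question, and the variable names are made up for the illustration.

```shell
# Sample line modelled on the one in the question.
line='<P CLASS="western"><FONT FACE="Comic Sans MS, cursive">text'

# "T*" is zero or more T's, so <FONT*> needs ">" directly after the T's.
# The tag here has a space and attributes there, so nothing matches.
unchanged=$(printf '%s\n' "$line" | sed 's/<FONT*>//g')

# "[^>]*" is any run of characters other than ">", which covers the attributes.
fixed=$(printf '%s\n' "$line" | sed 's/<FONT[^>]*>//g')

printf '%s\n' "$unchanged" "$fixed"
```

The first command leaves the line alone; the second removes the whole opening FONT tag. Closing </FONT> tags would need a pattern that also allows the slash.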
The second command doesn't accomplish much. Would it be better
to have actual paragraphs? Substitute an empty line for the tag.
s/<\/*[pP][^>]*>/\
/g
or, if your sed accepts \n in the replacement (GNU sed does):
s/<\/*[pP][^>]*>/\n/g
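If the aim is what item 1 of the question literally asks for, turning <P ...> into <p> rather than into a line break, something like the following might do. It is a sketch under assumptions: the sample line and its <PRE> element are invented for the illustration. Requiring a space or ">" right after the P keeps the pattern from also matching tags such as <PRE>, which <\/*[pP][^>]*> would.

```shell
# Invented sample: one P tag with attributes, plus a PRE element that must survive.
line='<P CLASS="western" STYLE="margin-bottom: 0in">Hello<PRE>x</PRE>'

# First expression: P followed by a space and attributes.
# Second expression: a bare <P> or <p> with no attributes.
out=$(printf '%s\n' "$line" |
  sed -e 's/<[pP] [^>]*>/<p>/g' -e 's/<[pP]>/<p>/g')

printf '%s\n' "$out"
```

Closing </P> tags are left as they are here; lowercasing them would take one more expression.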
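For the comments, you don't have anything else on those lines anyway,
do you? Then delete every line from the one that opens a comment
through the one that closes it:
/<!--/,/-->/ d
If you only want to drop the marker lines and keep the text between
them, use
/<!--/ d
/-->/ d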
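Here is the range form run against a small made-up file, as a sketch. The surrounding <p> lines are assumptions added so there is something left to see afterwards.

```shell
# Made-up input: the multi-line comment from the question between two paragraphs.
html='<p>before</p>
<!--
here is a
multi line
comment
-->
<p>after</p>'

# Delete from the line matching <!-- through the next line matching -->.
out=$(printf '%s\n' "$html" | sed '/<!--/,/-->/d')

printf '%s\n' "$out"
```

One caveat: sed starts looking for the end of a range on the line after the one that started it, so a comment that opens and closes on the same line would make the deletion run on to the next line containing "-->".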
Do you wish to just strip all one-line tags?
s/<[^>]*>//g
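A quick check of that last command on an invented line, as a sketch only:

```shell
# Invented sample line with a few simple inline tags.
line='<b>bold</b> and <i>italic</i>'

# Remove every "<", the run of non-">" characters after it, and the closing ">".
out=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf '%s\n' "$out"
```

Because sed works one line at a time, a tag that is split across two lines is not matched by this pattern.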
You don't say what you want to convert the HTML to.
daveA