
HTML to Text

Convert an HTML document to text

If you are a SUPERUSER for this SREhttp/2 web site, you can specify a fully qualified filename of a file on this server (the drive letter must be specified)

Advanced Options

The following parameters can be used to customize HTML_TXT.
Note that:

Basics
Use high ascii characters.
Width of text file (in characters) Infinite length for non-table lines
Record HTML syntax errors (in text file):
Emphasis
Capitalization emphasis used for these strong emphasis codes.
Underscores_between_words used for these underline codes.
"quotations" placed around these emphasis codes.

Left , and right emphasis quote characters

Section Emphasis
and list bullets
Precede <TITLE> with , and follow <TITLE> with
Left , and right link quote characters
Left , and right , IMG label quote characters how to display IMG information
Precede H1 headings with , and follow with .
Precede Hn headings (n=2...7) with , and follow with . Or .... display Hn headings as a hierarchical outline? YES || NO

UL bullets: MENU bullets:
OL numbers:
TABLES
Display tables using:
tabular display, unordered lists, paragraph breaks and <HR> rules
Display nested tables using:
tabular display, unordered lists, paragraph breaks and <HR> rules
If lists or paragraphs are selected above, this "table replacement" should occur for:
Display empty rows and empty tables: YES NO
Table borders (may be overridden by BORDER= attribute):
<TD WIDTH > attribute:
Suppress COLSPAN and ROWSPAN: YES NO
Table (and TEXTAREA) fill character Minimum column width adjustment factor
FORMS
Ignore <SELECT SIZE > attribute
Characters to use as:
OPTION bullets UnSelected: Selected:
CHECK Boxes: UnChecked: Checked:
RADIO buttons UnChecked: Checked:
SUBMIT and RESET quote characters Left Right ,
TEXT input box quote characters. Left:
Fill character:
Right
Miscellaneous
Minimize number of blank lines: YES || NO
Trimming long strings.
Width of a character (in pixels)


Notes

Using high ascii characters
High ascii (non-keyboard) characters are often useful as bullets, lines, and emphasis characters. In most fields, you can specify a high ascii character by entering its 3-digit decimal value.

For example: X and 88 are equivalent.

Yes, this example is not "high ascii"! That's because there is a small problem with high ascii: the actual character displayed depends on context. For example, browsers might use a URL encoding rule to display high ascii characters, while text editors might use a country-specific code page. Compare the "code page" rendition (described verbally) with the URL mapping your browser is using:

  • 219 : often a black box : URL mapping= Û
  • 174 : often a << : URL mapping= ®
  • 186 : often a || : URL mapping= º
  • 251 : often a "square root" sign : URL mapping= û
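
The difference can be reproduced with a minimal Python sketch that decodes the same byte value under two character tables. Code page 437 and ISO-8859-1 are assumptions chosen for illustration; the tables your editor and browser actually use may differ, and `render` is a hypothetical helper, not part of HTML_TXT.

```python
# Decode one byte value under two character tables.
# cp437 and latin-1 are illustrative assumptions, not HTML_TXT settings.

def render(code):
    """Return (code page 437 glyph, ISO-8859-1 glyph) for a value in 0..255."""
    raw = bytes([code])
    return raw.decode("cp437"), raw.decode("latin-1")

for code in (88, 219, 174, 186, 251):
    cp437, latin1 = render(code)
    print(f"{code:03d}: code page 437 = {cp437!r}, ISO-8859-1 = {latin1!r}")
```

Value 88 gives the same character in both tables, which is why it is a safe example; the four high values do not.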

Using infinite line lengths
If you intend to import your text file into a word processor, or into any program that can wrap long lines, you should probably use infinite line lengths. This means that each paragraph is written on a single line, with separate paragraphs on separate lines.

Note that this does not apply to lines in a table.
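
As a rough illustration of the two layouts (not HTML_TXT's own code), the following Python sketch either joins each paragraph onto one line or wraps it at a fixed width. The function name `layout` and its `width` argument are hypothetical.

```python
import textwrap

def layout(paragraphs, width=None):
    """Put each paragraph on one line (width=None) or wrap it at `width` characters."""
    out = []
    for para in paragraphs:
        text = " ".join(para.split())  # collapse internal newlines and runs of spaces
        out.append(text if width is None else textwrap.fill(text, width=width))
    return "\n".join(out)

# Infinite line length: one output line per paragraph.
print(layout(["first paragraph,\nsplit in the source", "second paragraph"]))
```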

How to display IMG information
Enter one of the following values:
  • -3 : Do not display (ignore all IMG elements)
  • -2 : Just display the pre and post img characters
  • -1 : Display a reference number (and generate a SRC and ALT reference list at the bottom of the text file)
  • 0 : Display all characters in an ALT attribute
  • 1 : Display, at most, current linelength characters from an ALT attribute
  • nnn: Display, at most, nnn characters from an ALT attribute.
    For example 30 means display at most 30 characters of the ALT attribute
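
The values above can be read as a small dispatch rule. The sketch below is one possible interpretation in Python; the function `img_label`, its arguments, and the choice to wrap the reference number in the quote characters are assumptions, not HTML_TXT's actual behavior.

```python
def img_label(mode, alt, ref_num, line_len=80, left="[", right="]"):
    """Build the text shown for one IMG element under the given display mode."""
    if mode == -3:
        return ""                             # ignore the image entirely
    if mode == -2:
        return left + right                   # only the pre and post IMG characters
    if mode == -1:
        return f"{left}{ref_num}{right}"      # reference number (assumed to be quoted)
    if mode == 0:
        text = alt                            # the whole ALT attribute
    elif mode == 1:
        text = alt[:line_len]                 # at most the current line length
    else:
        text = alt[:mode]                     # at most nnn characters
    return left + text + right

print(img_label(30, "A very long description of the company logo image", 1))
```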

Using special quote characters for emphasis
To indicate different kinds of textual emphasis, such as italics, the text can be bracketed by special "quote characters". Typically, a pair of quote characters is used: one for the left side (preceding the text), and another for the right side (following the text). For example:
  • [ and ] are the default quote characters for image labels;
  • the high-ascii equivalents of << and >> are the defaults for links.

Minimum column width adjustment
The minimum column width adjustment is used to augment cell widths. A non-zero value will increase the width of narrow cells and decrease the width of wide cells.
  • Small values (say, 6) are useful when short words are being clipped in narrow columns
  • Large values (say, 60) will tend to make all cells the same size.
  • 0 means "no adjustment"

Character width
Character width (in pixels) is used to convert pixel widths (as used in WIDTH attributes of table cells) into character equivalents. By default, HTML_TXT assumes that an 80 character wide text file is being mapped to a 640 pixel wide screen, hence the default character width is 8.

If you increase LINELEN (say, to 128), you should consider adjusting the CHARWIDTH (say, to 5).
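
The arithmetic can be checked with a short Python sketch. The 640 pixel screen and the LINELEN and CHARWIDTH values come from the text above; the helper names and the use of integer division are assumptions.

```python
def char_width(screen_pixels=640, line_len=80):
    """Pixels per character when `line_len` characters span `screen_pixels`."""
    return screen_pixels // line_len

def pixels_to_chars(pixel_width, charwidth=8):
    """Convert a WIDTH attribute given in pixels to a width in characters."""
    return pixel_width // charwidth

print(char_width(640, 80))        # default CHARWIDTH
print(char_width(640, 128))       # CHARWIDTH suggested for LINELEN=128
print(pixels_to_chars(160, 8))    # a 160 pixel cell at the default CHARWIDTH
```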

Hierarchical outline
HTML_TXT can use <Hn> headings (n=1,2,...,7) to create a hierarchical outline: a numbered list of section titles, with the numbering reflecting section and subsection.

For example:

 I)Main section

This is the main section

 I.a)Subsection
  Subsection 1 starts here
  and we also have a

 I.a.1) Sub subsection

  which contains very little

 I.b)Sub section 2
      This is the second subsection.
  
In the above example, an <H2>, an <H3>, an <H4>, and another <H3> heading could have created the I), I.a), I.a.1), and I.b) headings (respectively). Note that since <H1> is considered a "page heading", outline numbering starts with <H2> headings.
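
A minimal Python sketch of this numbering scheme follows. It is an illustration only, not HTML_TXT's implementation: the function names are hypothetical, and the treatment of skipped heading levels (counted as 1) is an assumption.

```python
def to_roman(n):
    """Convert a positive integer to an upper-case Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = ""
    for value, symbol in pairs:
        while n >= value:
            out += symbol
            n -= value
    return out

def to_letter(n):
    """Map 1..26 to a..z (larger values are not handled in this sketch)."""
    return chr(ord("a") + n - 1)

def outline_labels(levels):
    """Return outline labels for heading levels (2 for <H2>, 3 for <H3>, ...)."""
    counters, labels = [], []
    for level in levels:
        depth = level - 1                  # <H2> is outline depth 1
        if depth < 1:
            continue                       # <H1> is a page heading, not numbered
        del counters[depth:]               # leaving deeper sections resets them
        if len(counters) == depth:
            counters[-1] += 1              # next sibling at this depth
        else:
            # first entry at a new depth; skipped levels count as 1 (assumption)
            counters.extend([1] * (depth - len(counters)))
        parts = []
        for i, count in enumerate(counters):
            if i == 0:
                parts.append(to_roman(count))
            elif i == 1:
                parts.append(to_letter(count))
            else:
                parts.append(str(count))
        labels.append(".".join(parts) + ")")
    return labels

print(outline_labels([2, 3, 4, 3]))
```

Called with the heading sequence from the example (<H2>, <H3>, <H4>, <H3>), the sketch yields the four labels shown in the outline.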

Converting tables to lists
In some cases, the display of complicated tables may be quite messy, for example when the level of nesting becomes large. Should this happen, you can try displaying tables as unordered (<UL>) lists, or as separate paragraphs separated by horizontal rules (<HR>).

Since problems are most likely to occur with nested tables (rather than main tables), you can select whether to convert main tables, or just nested tables. In fact, you can even select whether to convert all nested tables, or just highly nested tables (that is, tables within a nested table).
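
One way to picture this choice is as a nesting-depth threshold. In the Python sketch below, the scope names `all`, `nested`, and `highly_nested` are hypothetical labels for the three choices described above, not actual HTML_TXT option values.

```python
# Minimum nesting depth at which a table is replaced by a list or paragraphs.
# Depth 0 is a main table, 1 a nested table, 2 a table within a nested table.
THRESHOLDS = {"all": 0, "nested": 1, "highly_nested": 2}

def should_convert(depth, scope):
    """Return True if a table at this nesting depth should be converted."""
    return depth >= THRESHOLDS[scope]

for depth in range(3):
    print(depth, [scope for scope in THRESHOLDS if should_convert(depth, scope)])
```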