to.etc.lexer
Class HtmlTextScanner

java.lang.Object
  extended by to.etc.util.TextScanner
      extended by to.etc.lexer.HtmlTextScanner

public class HtmlTextScanner
extends TextScanner

Helper class to scan HTML and remove invalid constructs.

Author:
Frits Jalvingh, created on Feb 22, 2010

Constructor Summary
HtmlTextScanner()
           
 
Method Summary
 java.util.Map<java.lang.String,to.etc.lexer.HtmlTextScanner.TagInfo> getMap()
           
static java.lang.String htmlRemoveAll(java.lang.String html, boolean lf)
           
static void htmlRemoveAll(java.lang.StringBuilder outsb, java.lang.String text, boolean lf)
           
static java.lang.String htmlRemoveUnsafe(java.lang.String html)
           
static void htmlRemoveUnsafe(java.lang.StringBuilder outsb, java.lang.String text)
          Scans the input and copies only "safe" HTML, meaning HTML with only simple constructs.
 void scan(java.lang.StringBuilder sb, java.lang.String html)
          Scan HTML and remove unsafe tags and attributes.
 void scanAndRemove(java.lang.StringBuilder sb, java.lang.String html, boolean includelf)
          Remove all HTML tags and collapse whitespace.
 
Methods inherited from class to.etc.util.TextScanner
accept, accept, append, append, append, clear, copy, copy, copy, currentChar, eof, getBuffer, getCopied, getInt, getLastInt, inc, index, LA, LA, length, nextChar, sb, scanDelimited, scanInt, scanLetters, scanWord, setIndex, setString, skip, skipWS
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlTextScanner

public HtmlTextScanner()

Method Detail

getMap

public java.util.Map<java.lang.String,to.etc.lexer.HtmlTextScanner.TagInfo> getMap()

scan

public void scan(java.lang.StringBuilder sb,
                 java.lang.String html)
Scan HTML and remove unsafe tags and attributes. The result is guaranteed to be safe and well-formed.

Parameters:
sb - buffer that receives the scanned output
html - the HTML input to scan
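
Removing unsafe attributes can be pictured with a small standalone sketch. It is not the HtmlTextScanner implementation: the class name AttributeFilterSketch, the ALLOWED table, and the filterTag method are hypothetical, and the whitelist contents are assumptions chosen only for illustration. The sketch keeps double-quoted attributes that a per-tag table allows and drops everything else, including href values that start with "javascript:".

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class AttributeFilterSketch {
    // Hypothetical table: tag name -> attribute names that may be kept.
    private static final Map<String, Set<String>> ALLOWED = Map.of(
        "a", Set.of("href"),
        "td", Set.of("colspan", "rowspan"));

    // Recognizes only name="value" pairs; other attribute forms are dropped.
    private static final Pattern ATTR =
        Pattern.compile("([A-Za-z][A-Za-z0-9-]*)\\s*=\\s*\"([^\"]*)\"");

    private AttributeFilterSketch() {}

    /** Rebuilds an opening tag with only allowed attributes; returns "" for an unknown tag. */
    public static String filterTag(String name, String attributeText) {
        String tag = name.toLowerCase(Locale.ROOT);
        Set<String> allowed = ALLOWED.get(tag);
        if (allowed == null) {
            return ""; // tag is not in the table: drop it entirely
        }
        StringBuilder sb = new StringBuilder("<").append(tag);
        Matcher m = ATTR.matcher(attributeText);
        while (m.find()) {
            String attr = m.group(1).toLowerCase(Locale.ROOT);
            String value = m.group(2);
            if (!allowed.contains(attr)) {
                continue; // e.g. onclick, style
            }
            if (attr.equals("href")
                    && value.trim().toLowerCase(Locale.ROOT).startsWith("javascript:")) {
                continue; // script URLs are treated as unsafe in this sketch
            }
            sb.append(' ').append(attr).append("=\"").append(value).append('"');
        }
        return sb.append('>').toString();
    }
}
```

With this table, filterTag("A", "href=\"http://x\" onclick=\"evil()\"") yields <a href="http://x">, and a script tag yields the empty string. Attribute values are copied as-is, so a real sanitizer would also need to escape them.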

scanAndRemove

public void scanAndRemove(java.lang.StringBuilder sb,
                          java.lang.String html,
                          boolean includelf)
Remove all HTML tags and collapse whitespace.

Parameters:
sb - buffer that receives the resulting text
html - the HTML input to scan
includelf -
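
The idea of removing all tags and collapsing whitespace can be sketched in a few lines of standalone Java. This is an illustration only, not the code behind scanAndRemove: the class StripTagsSketch and the method stripTags are hypothetical, and the meaning given to the boolean flag (a whitespace run that contains a line feed collapses to one line feed instead of one space) is an assumption, since this page does not document includelf.

```java
public final class StripTagsSketch {
    private StripTagsSketch() {}

    /**
     * Appends the text content of html to out, skipping everything between
     * '<' and '>' and collapsing each whitespace run to a single character.
     * Assumption: when keepLineFeeds is true, a run containing '\n' becomes '\n'.
     */
    public static void stripTags(StringBuilder out, String html, boolean keepLineFeeds) {
        boolean inTag = false;
        boolean pendingSpace = false; // a whitespace run was seen and not yet emitted
        boolean pendingLf = false;    // that run contained a line feed
        int start = out.length();     // used to suppress leading whitespace
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                }
                continue;
            }
            if (c == '<') {
                inTag = true;
                continue;
            }
            if (Character.isWhitespace(c)) {
                pendingSpace = true;
                if (c == '\n') {
                    pendingLf = true;
                }
                continue;
            }
            if (pendingSpace && out.length() > start) {
                out.append(keepLineFeeds && pendingLf ? '\n' : ' ');
            }
            pendingSpace = false;
            pendingLf = false;
            out.append(c);
        }
    }
}
```

For the input "<p>Hello   <b>world</b></p>\n<p>next</p>" the sketch appends "Hello world next" when the flag is false and "Hello world\nnext" when it is true. It does not decode entities, and it does not handle a '>' inside a quoted attribute value or a comment.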

htmlRemoveUnsafe

public static void htmlRemoveUnsafe(java.lang.StringBuilder outsb,
                                    java.lang.String text)
Scans the input and copies only "safe" HTML, meaning HTML with only simple constructs. It also makes sure the resulting document is XML-safe (well-formed): if the input is not well-formed, tags are added or removed until the result is valid.

Parameters:
outsb - buffer that receives the sanitized output
text - the HTML input to scan
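
Adding or removing tags until the result is well-formed can be done with a stack of open tags. The sketch below is one possible approach and is not taken from the HtmlTextScanner source: the class SafeHtmlSketch, the SAFE whitelist, and the sanitize method are hypothetical names. Unknown tags are dropped (their text content is kept as escaped text), attributes are discarded, stray closing tags are ignored, and tags left open are closed in the right order.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;

public final class SafeHtmlSketch {
    // Hypothetical whitelist; the real scanner's tag table is not shown on this page.
    private static final Set<String> SAFE = Set.of("b", "i", "u", "p", "em", "strong");

    private SafeHtmlSketch() {}

    public static String sanitize(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // currently open tags, innermost first
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int end = (c == '<') ? html.indexOf('>', i) : -1;
            if (end < 0) {
                // Plain text, or a '<' that never closes: emit escaped.
                // Note: existing entity references are re-escaped in this sketch.
                out.append(c == '<' ? "&lt;" : c == '>' ? "&gt;" : c == '&' ? "&amp;" : String.valueOf(c));
                i++;
                continue;
            }
            String body = html.substring(i + 1, end).trim();
            i = end + 1;
            boolean closing = body.startsWith("/");
            String name = tagName(closing ? body.substring(1).trim() : body);
            if (!SAFE.contains(name)) {
                continue; // unknown tag: drop it
            }
            if (!closing) {
                open.push(name);
                out.append('<').append(name).append('>'); // attributes are discarded
            } else if (open.contains(name)) {
                String top;
                do { // close any tags still open inside this one
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            } // else: stray closing tag, ignored
        }
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    private static String tagName(String s) {
        int n = 0;
        while (n < s.length() && Character.isLetterOrDigit(s.charAt(n))) {
            n++;
        }
        return s.substring(0, n).toLowerCase(Locale.ROOT);
    }
}
```

For example, sanitize("<p>Hi <b>there<script>x</script></p>") returns "<p>Hi <b>therex</b></p>": the script tags are removed and the missing </b> is inserted before </p>.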

htmlRemoveUnsafe

public static java.lang.String htmlRemoveUnsafe(java.lang.String html)

htmlRemoveAll

public static void htmlRemoveAll(java.lang.StringBuilder outsb,
                                 java.lang.String text,
                                 boolean lf)

htmlRemoveAll

public static java.lang.String htmlRemoveAll(java.lang.String html,
                                             boolean lf)