Class TagParseState

  • Direct Known Subclasses:
    HTMLParseState, XMLFuzzyParseState, XMLParseState

    public class TagParseState
    extends SingleCharacterReceiver
    This class implements a basic XML/HTML tag parser. It recognizes the following XML and HTML constructs:
     '<' <token> <attrs> '>' ... '</' <token> '>'
     '<' <token> <attrs> '/>'
     '<?' <token> <attrs>  '?>'
     '<![' [<token>] '[' ... ']]>'
     '<!' <token> ... '>'
     '<!--' ... '-->'
     
    Each of these constructs, except the comment, has corresponding protected methods that the parsing engine calls. An extending class can override these methods to perform higher-level data extraction and parsing. The messiest construct is <! ... > (a 'btag'), because it can contain multiple nested btags, CDATA-like escapes, and qtags. Ideally the parser should produce a sequence of preparsed tokens from these tags. Because btags can be nested, the parser also tracks their depth with a btag depth counter. In this case it is therefore the btag depth, not the current state, that determines whether the parser is operating inside a btag.
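The depth-tracking idea described above can be illustrated with a small standalone sketch. This is not the ManifoldCF implementation: the class `BTagDepthSketch` and its methods are hypothetical names, and the sketch deliberately ignores comments, quoted strings, and CDATA-like escapes. It only shows how a counter, rather than a parser state, can answer the question "are we inside a btag?".

```java
// Hypothetical, simplified illustration of tracking btag nesting depth.
// NOT the ManifoldCF parser: comments, quotes, and CDATA-like escapes are ignored.
class BTagDepthSketch {
    private int depth = 0;                 // plays the role of a btag depth counter
    private boolean sawLeftAngle = false;  // true if the previous character was '<'

    /** Feeds one character to the sketch, one at a time. */
    void accept(char c) {
        if (sawLeftAngle && c == '!') {
            depth++;                       // "<!" opens a (possibly nested) btag
        } else if (c == '>' && depth > 0) {
            depth--;                       // '>' closes the innermost open btag
        }
        sawLeftAngle = (c == '<');
    }

    /** The depth, not any particular state, tells us whether we are inside a btag. */
    boolean insideBTag() {
        return depth > 0;
    }

    int depth() {
        return depth;
    }

    /** Convenience helper: returns the depth after consuming the whole input. */
    static int depthAfter(String input) {
        BTagDepthSketch sketch = new BTagDepthSketch();
        for (int i = 0; i < input.length(); i++) {
            sketch.accept(input.charAt(i));
        }
        return sketch.depth();
    }
}
```

For example, feeding the prefix `<!DOCTYPE a [ <!ELEMENT a (#PCDATA)` leaves the sketch two levels deep, while the complete declaration returns it to depth 0.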
    • Field Detail

      • TAGPARSESTATE_SAWLEFTANGLE

        protected static final int TAGPARSESTATE_SAWLEFTANGLE
        See Also:
        Constant Field Values
      • TAGPARSESTATE_SAWEXCLAMATION

        protected static final int TAGPARSESTATE_SAWEXCLAMATION
        See Also:
        Constant Field Values
      • TAGPARSESTATE_SAWDASH

        protected static final int TAGPARSESTATE_SAWDASH
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_COMMENT

        protected static final int TAGPARSESTATE_IN_COMMENT
        See Also:
        Constant Field Values
      • TAGPARSESTATE_SAWCOMMENTDASH

        protected static final int TAGPARSESTATE_SAWCOMMENTDASH
        See Also:
        Constant Field Values
      • TAGPARSESTATE_SAWSECONDCOMMENTDASH

        protected static final int TAGPARSESTATE_SAWSECONDCOMMENTDASH
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_TAG_NAME

        protected static final int TAGPARSESTATE_IN_TAG_NAME
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_ATTR_NAME

        protected static final int TAGPARSESTATE_IN_ATTR_NAME
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_ATTR_VALUE

        protected static final int TAGPARSESTATE_IN_ATTR_VALUE
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_TAG_SAW_SLASH

        protected static final int TAGPARSESTATE_IN_TAG_SAW_SLASH
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_END_TAG_NAME

        protected static final int TAGPARSESTATE_IN_END_TAG_NAME
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_ATTR_LOOKING_FOR_VALUE

        protected static final int TAGPARSESTATE_IN_ATTR_LOOKING_FOR_VALUE
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_SINGLE_QUOTES_ATTR_VALUE

        protected static final int TAGPARSESTATE_IN_SINGLE_QUOTES_ATTR_VALUE
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_DOUBLE_QUOTES_ATTR_VALUE

        protected static final int TAGPARSESTATE_IN_DOUBLE_QUOTES_ATTR_VALUE
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE

        protected static final int TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_QTAG_NAME

        protected static final int TAGPARSESTATE_IN_QTAG_NAME
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_QTAG_ATTR_NAME

        protected static final int TAGPARSESTATE_IN_QTAG_ATTR_NAME
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_QTAG_SAW_QUESTION

        protected static final int TAGPARSESTATE_IN_QTAG_SAW_QUESTION
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_QTAG_ATTR_VALUE

        protected static final int TAGPARSESTATE_IN_QTAG_ATTR_VALUE
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_QTAG_ATTR_LOOKING_FOR_VALUE

        protected static final int TAGPARSESTATE_IN_QTAG_ATTR_LOOKING_FOR_VALUE
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_QTAG_SINGLE_QUOTES_ATTR_VALUE

        protected static final int TAGPARSESTATE_IN_QTAG_SINGLE_QUOTES_ATTR_VALUE
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_QTAG_DOUBLE_QUOTES_ATTR_VALUE

        protected static final int TAGPARSESTATE_IN_QTAG_DOUBLE_QUOTES_ATTR_VALUE
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_QTAG_UNQUOTED_ATTR_VALUE

        protected static final int TAGPARSESTATE_IN_QTAG_UNQUOTED_ATTR_VALUE
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_BRACKET_TOKEN

        protected static final int TAGPARSESTATE_IN_BRACKET_TOKEN
        See Also:
        Constant Field Values
      • TAGPARSESTATE_NEED_FINAL_BRACKET

        protected static final int TAGPARSESTATE_NEED_FINAL_BRACKET
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_BANG_TOKEN

        protected static final int TAGPARSESTATE_IN_BANG_TOKEN
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_CDATA_BODY

        protected static final int TAGPARSESTATE_IN_CDATA_BODY
        See Also:
        Constant Field Values
      • TAGPARSESTATE_SAWRIGHTBRACKET

        protected static final int TAGPARSESTATE_SAWRIGHTBRACKET
        See Also:
        Constant Field Values
      • TAGPARSESTATE_SAWSECONDRIGHTBRACKET

        protected static final int TAGPARSESTATE_SAWSECONDRIGHTBRACKET
        See Also:
        Constant Field Values
      • TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE_SAW_SLASH

        protected static final int TAGPARSESTATE_IN_UNQUOTED_ATTR_VALUE_SAW_SLASH
        See Also:
        Constant Field Values
      • currentState

        protected int currentState
      • bTagDepth

        protected int bTagDepth
        The btag nesting depth. A value greater than 0 means the parser is inside a btag and applies btag behavior.
      • accumBuffer

        protected java.lang.StringBuilder accumBuffer
        This is the only buffer in which characters are actually accumulated.
      • currentTagNameBuffer

        protected java.lang.StringBuilder currentTagNameBuffer
      • currentAttrNameBuffer

        protected java.lang.StringBuilder currentAttrNameBuffer
      • currentValueBuffer

        protected java.lang.StringBuilder currentValueBuffer
      • currentTagName

        protected java.lang.String currentTagName
      • currentAttrName

        protected java.lang.String currentAttrName
      • currentAttrList

        protected java.util.List<AttrNameValue> currentAttrList
      • inAmpersand

        protected boolean inAmpersand
        Whether an ampersand has been seen.
      • ampBuffer

        protected java.lang.StringBuilder ampBuffer
        Buffer of characters seen after ampersand.
      • mapLookup

        protected static final java.util.Map<java.lang.String,java.lang.String> mapLookup
    • Constructor Detail

      • TagParseState

        public TagParseState()
    • Method Detail

      • acceptNewTag

        protected boolean acceptNewTag()
        Allow parsing within tag.
      • newBuffer

        protected java.lang.StringBuilder newBuffer()
        Allocate the buffer.
      • noteTag

        protected boolean noteTag(java.lang.String tagName,
                                  java.util.List<AttrNameValue> attributes)
                           throws ManifoldCFException
        This method is called for every tag. Override it to intercept the start of a tag.
        Returns:
        true to halt further processing.
        Throws:
        ManifoldCFException
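The override-and-halt pattern used by this and the following callbacks can be sketched with a standalone stand-in. The classes `MiniTagScanner` and `StopAtTag` below are hypothetical and are not part of ManifoldCF; the stand-in hook omits attributes and exceptions to stay short. The only point illustrated is that a subclass overrides a protected hook and returns true to halt further processing.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the override pattern only; this is NOT the ManifoldCF class.
// Attributes and checked exceptions are omitted to keep the sketch short.
class MiniTagScanner {
    /** Hook called for each start tag; return true to halt further processing. */
    protected boolean noteTag(String tagName) {
        return false;
    }

    /** Scans the input and returns true if the hook asked to halt. */
    final boolean scan(String input) {
        int i = 0;
        while (i < input.length()) {
            boolean startsTag = input.charAt(i) == '<'
                    && i + 1 < input.length()
                    && Character.isLetter(input.charAt(i + 1));
            if (startsTag) {
                int start = i + 1;
                int end = start;
                while (end < input.length() && Character.isLetterOrDigit(input.charAt(end))) {
                    end++;
                }
                if (noteTag(input.substring(start, end))) {
                    return true;   // the hook requested a halt
                }
                i = end;
            } else {
                i++;
            }
        }
        return false;
    }
}

/** Records every tag name seen and halts as soon as a chosen tag appears. */
class StopAtTag extends MiniTagScanner {
    final List<String> seen = new ArrayList<>();
    private final String stopTag;

    StopAtTag(String stopTag) {
        this.stopTag = stopTag;
    }

    @Override
    protected boolean noteTag(String tagName) {
        seen.add(tagName);
        return tagName.equalsIgnoreCase(stopTag);
    }
}
```

Scanning `<html><head></head><body><p>x</p></body></html>` with `new StopAtTag("body")` records `html`, `head`, and `body`, then stops before reaching `p`.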
      • noteEndTag

        protected boolean noteEndTag(java.lang.String tagName)
                              throws ManifoldCFException
        This method is called for every end tag. Override it to intercept the end of a tag.
        Returns:
        true to halt further processing.
        Throws:
        ManifoldCFException
      • noteQTag

        protected boolean noteQTag(java.lang.String tagName,
                                   java.util.List<AttrNameValue> attributes)
                            throws ManifoldCFException
        This method is called for every <? ... ?> construct, or 'qtag'. Override it to intercept such constructs.
        Returns:
        true to halt further processing.
        Throws:
        ManifoldCFException
      • noteBTag

        protected boolean noteBTag(java.lang.String tagName)
                            throws ManifoldCFException
        This method is called for every <! <token> ... > construct, or 'btag'. Override it to intercept these.
        Returns:
        true to halt further processing.
        Throws:
        ManifoldCFException
      • noteEndBTag

        protected boolean noteEndBTag()
                               throws ManifoldCFException
        This method is called for the end of every btag, or any time there's a naked '>' in the document. Override it if you want to intercept these.
        Returns:
        true to halt further processing.
        Throws:
        ManifoldCFException
      • noteEscaped

        protected boolean noteEscaped(java.lang.String token)
                               throws ManifoldCFException
        Called for the start of every CDATA-like tag, e.g. <![ <token> [ ... ]]>.
        Parameters:
        token - the token, which may be empty.
        Returns:
        true to halt further processing.
        Throws:
        ManifoldCFException
      • noteEndEscaped

        protected boolean noteEndEscaped()
                                  throws ManifoldCFException
        Called for the end of every CDATA-like tag.
        Returns:
        true to halt further processing.
        Throws:
        ManifoldCFException
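To make the shape of a CDATA-like escape concrete, including the case where the token is empty, here is a minimal standalone sketch. The class `EscapeBlockSketch` and its method `split` are hypothetical; the real parser works character by character through its state machine, whereas this sketch simply slices a complete string.

```java
// Hypothetical helper, not part of ManifoldCF: splits a complete
// "<![token[body]]>" string into its token and body.
class EscapeBlockSketch {
    /** Returns {token, body}, or null if the input is not of the form <![token[body]]>. */
    static String[] split(String input) {
        if (!input.startsWith("<![") || !input.endsWith("]]>")) {
            return null;
        }
        // The second '[' separates the (possibly empty) token from the body.
        int secondBracket = input.indexOf('[', 3);
        if (secondBracket == -1 || secondBracket > input.length() - 4) {
            return null;
        }
        String token = input.substring(3, secondBracket).trim();
        String body = input.substring(secondBracket + 1, input.length() - 3);
        return new String[] { token, body };
    }
}
```

With this sketch, `<![CDATA[a < b]]>` yields the token `CDATA` and the body `a < b`, while `<![[x]]>` yields an empty token and the body `x`.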
      • noteBTagToken

        protected boolean noteBTagToken(java.lang.String token)
                                 throws ManifoldCFException
        This method gets called for every token inside a btag.
        Returns:
        true to halt further processing.
        Throws:
        ManifoldCFException
      • noteNormalCharacter

        protected boolean noteNormalCharacter(char thisChar)
                                       throws ManifoldCFException
        This method is called for every character that is not part of a tag or other markup construct. Override it to intercept such characters.
        Returns:
        true to halt further processing.
        Throws:
        ManifoldCFException
      • noteEscapedCharacter

        protected boolean noteEscapedCharacter(char thisChar)
                                        throws ManifoldCFException
        This method gets called for every character that is found within an escape block, e.g. CDATA. Override this method to intercept such characters.
        Returns:
        true to halt further processing.
        Throws:
        ManifoldCFException
      • attributeDecode

        protected static java.lang.String attributeDecode(java.lang.String input)
        Decode an HTML attribute.
      • mapChunk

        protected static java.lang.String mapChunk(java.lang.String input)
        Map an entity reference back to a character.
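A rough idea of how entity references in an attribute value might be decoded is sketched below. This is an assumption-laden illustration, not the ManifoldCF code: the class `EntityDecodeSketch`, the methods `decodeEntities` and `mapEntity`, and the five-entry lookup table are all hypothetical, and the real `mapLookup` table may contain different entries and handle unknown references differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of entity decoding; names and table contents are assumptions.
class EntityDecodeSketch {
    private static final Map<String, String> LOOKUP = new HashMap<>();
    static {
        LOOKUP.put("amp", "&");
        LOOKUP.put("lt", "<");
        LOOKUP.put("gt", ">");
        LOOKUP.put("quot", "\"");
        LOOKUP.put("apos", "'");
    }

    /** Maps the text between '&' and ';' to its replacement, or null if unknown. */
    static String mapEntity(String chunk) {
        if (chunk.startsWith("#")) {
            try {
                boolean hex = chunk.length() > 1
                        && (chunk.charAt(1) == 'x' || chunk.charAt(1) == 'X');
                int codePoint = hex
                        ? Integer.parseInt(chunk.substring(2), 16)
                        : Integer.parseInt(chunk.substring(1));
                return new String(Character.toChars(codePoint));
            } catch (IllegalArgumentException e) {
                // Covers malformed numbers and invalid code points.
                return null;
            }
        }
        return LOOKUP.get(chunk);
    }

    /** Replaces known entity references; unknown ones are left untouched. */
    static String decodeEntities(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '&') {
                int semi = input.indexOf(';', i + 1);
                if (semi != -1) {
                    String mapped = mapEntity(input.substring(i + 1, semi));
                    if (mapped != null) {
                        out.append(mapped);
                        i = semi + 1;
                        continue;
                    }
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

In this sketch, an unrecognized reference such as `&bogus;` and a lone `&` pass through unchanged; whether the real decoder does the same is not stated in this documentation.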
      • isWhitespace

        protected static boolean isWhitespace(char x)
        Determine whether a character is markup-language whitespace.
      • isPunctuation

        protected static boolean isPunctuation(char x)
        Determine whether a character is markup-language punctuation.