Package org.apache.nutch.parse
Class ParserFactory
- java.lang.Object
-
- org.apache.nutch.parse.ParserFactory
-
-
Field Summary
Fields
- static String DEFAULT_PLUGIN: Wildcard for default plugins.
-
Constructor Summary
Constructors
- ParserFactory(Configuration conf)
-
Method Summary
Methods
- protected List<Extension> getExtensions(String contentType): Finds the best-suited parse plugin for a given contentType.
- Parser getParserById(String id): Returns a Parser instance with the specified extId, representing its extension ID.
- Parser[] getParsers(String contentType, String url): Returns an array of Parsers for a given content type.
-
-
-
Field Detail
-
DEFAULT_PLUGIN
public static final String DEFAULT_PLUGIN
Wildcard for default plugins.
See Also:
- Constant Field Values
-
-
Constructor Detail
-
ParserFactory
public ParserFactory(Configuration conf)
-
-
Method Detail
-
getParsers
public Parser[] getParsers(String contentType, String url) throws ParserNotFound
Returns an array of Parsers for a given content type. The function consults the ParserFactory's internal list of parse plugins to determine the list of pluginIds, then gets the appropriate extension points to instantiate as Parsers.
Parameters:
contentType - The contentType to return the Array of Parsers for.
url - The url for the content, which may allow the type to be determined from the file suffix.
Returns:
An Array of Parsers for the given contentType. Plugins that were mapped to a contentType via the parse-plugins.xml file but never enabled via the plugin.includes Nutch conf are not part of this array, i.e., they are skipped. So, if the ordered list of parsing plugins for text/plain was [parse-text, parse-html, parse-rtf], and only parse-html and parse-rtf were enabled via plugin.includes, then this ordered Array would consist of two Parser interfaces, [parse-html, parse-rtf].
Throws:
ParserNotFound - if there is a runtime error locating a parser for the given content type and url
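The skipping rule described under Returns can be illustrated with a small, self-contained sketch. It is not Nutch code: the class EnabledPluginFilterSketch and the method filterEnabled are hypothetical names, and the set of enabled plugins is modeled as a plain set of IDs rather than the actual plugin.includes setting. It only shows how an ordered plugin list keeps its order while disabled entries drop out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Nutch.
class EnabledPluginFilterSketch {

    // Walks the ordered list (as it might come from parse-plugins.xml)
    // and keeps only the IDs that are enabled, preserving the order.
    static List<String> filterEnabled(List<String> orderedPluginIds,
                                      Set<String> enabledPluginIds) {
        List<String> result = new ArrayList<>();
        for (String id : orderedPluginIds) {
            if (enabledPluginIds.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Ordered plugins mapped to text/plain in the example above.
        List<String> ordered = Arrays.asList("parse-text", "parse-html", "parse-rtf");
        // Assumption: the enabled plugins are given as a simple set of IDs.
        Set<String> enabled = new LinkedHashSet<>(Arrays.asList("parse-html", "parse-rtf"));
        System.out.println(filterEnabled(ordered, enabled)); // prints [parse-html, parse-rtf]
    }
}
```

In this sketch parse-text is dropped because it is not in the enabled set, which mirrors the [parse-html, parse-rtf] result described in the text.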
-
getParserById
public Parser getParserById(String id) throws ParserNotFound
Returns a Parser instance with the specified extId, representing its extension ID. If the Parser instance isn't found, the function throws a ParserNotFound exception. If the function finds the Parser in the internal PARSER_CACHE, it returns the already instantiated Parser. Otherwise, it instantiates the Parser itself and caches it in the internal PARSER_CACHE.
Parameters:
id - The string extension ID (e.g., "org.apache.nutch.parse.rss.RSSParser", "org.apache.nutch.parse.rtf.RTFParseFactory") of the Parser implementation to return.
Returns:
A Parser implementation specified by the parameter id.
Throws:
ParserNotFound - If the Parser is not found (i.e., not registered with the extension point), or if there is a PluginRuntimeException instantiating the Parser.
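The lookup-then-cache behavior described for getParserById can be sketched in isolation. Everything below is hypothetical and stands in for the real classes: ParserCacheSketch, its registry of suppliers, and the nested NotFound exception are illustrative names only, and the real method may differ in detail. The sketch returns a cached instance when one exists, otherwise instantiates, caches, and returns it, and reports both a missing registration and an instantiation failure through the same checked exception.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration only; names below are not Nutch APIs.
class ParserCacheSketch {

    // Stand-in for the ParserNotFound exception named in the documentation.
    static class NotFound extends Exception {
        NotFound(String message) {
            super(message);
        }
    }

    // Stand-in for the internal cache of already instantiated parsers.
    private final Map<String, Object> cache = new HashMap<>();
    // Assumption: registered extensions are modeled as id -> factory.
    private final Map<String, Supplier<Object>> registry;
    // Counts real instantiations so the caching effect is observable.
    int instantiations = 0;

    ParserCacheSketch(Map<String, Supplier<Object>> registry) {
        this.registry = registry;
    }

    Object getById(String id) throws NotFound {
        Object cached = cache.get(id);
        if (cached != null) {
            return cached; // already instantiated earlier
        }
        Supplier<Object> factory = registry.get(id);
        if (factory == null) {
            throw new NotFound("No parser registered for id: " + id);
        }
        Object created;
        try {
            created = factory.get();
        } catch (RuntimeException e) {
            // Instantiation failures surface as the same "not found" error.
            throw new NotFound("Cannot instantiate parser " + id + ": " + e.getMessage());
        }
        instantiations++;
        cache.put(id, created);
        return created;
    }

    public static void main(String[] args) throws NotFound {
        Map<String, Supplier<Object>> registry = new HashMap<>();
        registry.put("demo.parser", Object::new);
        ParserCacheSketch sketch = new ParserCacheSketch(registry);
        Object first = sketch.getById("demo.parser");
        Object second = sketch.getById("demo.parser");
        System.out.println(first == second);       // prints true
        System.out.println(sketch.instantiations); // prints 1
    }
}
```

The second call returns the same object without invoking the factory again, which is the effect the cache is described as providing.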
-
-