Package org.apache.nutch.indexer.filter
Class MimeTypeIndexingFilter
java.lang.Object
    org.apache.nutch.indexer.filter.MimeTypeIndexingFilter

All Implemented Interfaces:
Configurable, IndexingFilter, Pluggable
public class MimeTypeIndexingFilter extends Object implements IndexingFilter
An IndexingFilter that allows filtering of documents based on the MIME type detected by Tika.
-
-
Field Summary
Fields
Modifier and Type   Field                   Description
static String       MIMEFILTER_REGEX_FILE
Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
X_POINT_ID
-
-
Constructor Summary
Constructors
Constructor                Description
MimeTypeIndexingFilter()
-
Method Summary
Modifier and Type   Method and Description
NutchDocument       filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
                    Adds fields or otherwise modifies the document that will be indexed for a parse.
Configuration       getConf()
static void         main(String[] args)
                    Main method for invoking this tool.
void                setConf(Configuration conf)
-
-
-
Field Detail
-
MIMEFILTER_REGEX_FILE
public static final String MIMEFILTER_REGEX_FILE
See Also:
Constant Field Values
-
-
Method Detail
-
filter
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Description copied from interface: IndexingFilter
Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
Specified by:
filter in interface IndexingFilter
Parameters:
doc - document instance for collecting fields
parse - parse data instance
url - page URL
datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
inlinks - page inlinks
Returns:
modified (or a new) document instance, or null (meaning the document should be discarded)
Throws:
IndexingException - if an error occurs during filtering
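The "return null to discard" contract described above can be illustrated with a minimal, self-contained sketch. It does not use the Nutch API: MimeFilterSketch, its constructor argument, and the Map-based document are hypothetical stand-ins chosen so that the example compiles with the JDK alone, and the blocklist-of-regexes rule format is an assumption, not the format this class reads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/**
 * Hypothetical stand-in, NOT the Nutch class: shows a filter that keeps or
 * discards a document based on its MIME type, using plain Java types.
 */
class MimeFilterSketch {

    // Compiled patterns for MIME types to keep out of the index (assumed rule set).
    private final List<Pattern> blocked = new ArrayList<>();

    MimeFilterSketch(List<String> blockedRegexes) {
        for (String regex : blockedRegexes) {
            blocked.add(Pattern.compile(regex));
        }
    }

    /**
     * Returns the document unchanged when its MIME type is allowed,
     * or null when the document should be dropped from indexing.
     */
    Map<String, String> filter(Map<String, String> doc, String mimeType) {
        for (Pattern p : blocked) {
            if (p.matcher(mimeType).matches()) {
                return null; // null signals "discard this document"
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        MimeFilterSketch filter =
            new MimeFilterSketch(Arrays.asList("image/.*", "application/pdf"));

        Map<String, String> doc = new HashMap<>();
        doc.put("url", "http://example.com/");

        // An HTML page is not on the blocklist, so the document comes back.
        System.out.println("text/html kept: " + (filter.filter(doc, "text/html") != null));
        // An image matches "image/.*", so the filter returns null.
        System.out.println("image/png kept: " + (filter.filter(doc, "image/png") != null));
    }
}
```

A caller that chains several filters would typically stop as soon as one of them returns null, since there is no document left to pass on.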
-
setConf
public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
Specified by:
getConf in interface Configurable
-
main
public static void main(String[] args) throws IOException, IndexingException
Main method for invoking this tool.
Parameters:
args - run with no arguments to print help
Throws:
IOException - if there is a fatal I/O error processing the input args
IndexingException - if there is a fatal error while indexing
-
-