Package org.apache.nutch.indexer.links
Class LinksIndexingFilter
- java.lang.Object
-
- org.apache.nutch.indexer.links.LinksIndexingFilter
-
- All Implemented Interfaces:
Configurable, IndexingFilter, Pluggable
public class LinksIndexingFilter extends Object implements IndexingFilter
An IndexingFilter that adds outlinks and inlinks field(s) to the document.

To ignore outlinks that point to the same host as the URL being indexed, use the following setting in your configuration file:

  <property>
    <name>index.links.outlinks.host.ignore</name>
    <value>true</value>
  </property>

The same setting is available for inlinks:

  <property>
    <name>index.links.inlinks.host.ignore</name>
    <value>true</value>
  </property>

To store only the host portion of each inlink URL or outlink URL, set the following property to true in your configuration file (the value false shown here keeps full URLs):

  <property>
    <name>index.links.hosts.only</name>
    <value>false</value>
  </property>
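The effect of these three settings can be illustrated with a small, self-contained sketch. It is not the plugin's source code: the class name LinkHostSketch, the helper methods, and the choice to de-duplicate results with a set are all assumptions made for illustration, and it uses only java.net.URI rather than any Nutch or Hadoop types.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is NOT the LinksIndexingFilter implementation.
final class LinkHostSketch {

    // Returns the lower-cased host of a URL, or null if the URL has no parsable host.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // ignoreSameHost mirrors the idea behind index.links.*.host.ignore;
    // hostsOnly mirrors the idea behind index.links.hosts.only.
    static Set<String> selectLinks(String pageUrl, List<String> links,
                                   boolean ignoreSameHost, boolean hostsOnly) {
        String pageHost = hostOf(pageUrl);
        Set<String> result = new LinkedHashSet<>(); // de-duplication is an assumption of this sketch
        for (String link : links) {
            String linkHost = hostOf(link);
            if (linkHost == null) {
                continue; // skip links whose host cannot be determined
            }
            if (ignoreSameHost && linkHost.equals(pageHost)) {
                continue; // link points back to the page's own host
            }
            result.add(hostsOnly ? linkHost : link);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> outlinks = List.of(
                "http://example.com/about",
                "http://other.example.org/a",
                "http://other.example.org/b");
        // Same-host link dropped, full URLs kept.
        System.out.println(selectLinks("http://example.com/", outlinks, true, false));
        // Same-host link dropped, only hosts kept.
        System.out.println(selectLinks("http://example.com/", outlinks, true, true));
    }
}
```

With both flags enabled, the two links to other.example.org collapse into a single host entry, which is the main practical difference between storing full URLs and storing hosts only.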
-
-
Field Summary
Fields
Modifier and Type    Field                  Description
static String        LINKS_INLINKS_HOST
static String        LINKS_ONLY_HOSTS
static String        LINKS_OUTLINKS_HOST
Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
X_POINT_ID
-
-
Constructor Summary
Constructors
Constructor              Description
LinksIndexingFilter()
-
Method Summary
Modifier and Type    Method    Description
NutchDocument    filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)    Adds fields or otherwise modifies the document that will be indexed for a parse.
Configuration    getConf()
void    setConf(Configuration conf)
-
-
-
Field Detail
-
LINKS_OUTLINKS_HOST
public static final String LINKS_OUTLINKS_HOST
- See Also:
- Constant Field Values
-
LINKS_INLINKS_HOST
public static final String LINKS_INLINKS_HOST
- See Also:
- Constant Field Values
-
LINKS_ONLY_HOSTS
public static final String LINKS_ONLY_HOSTS
- See Also:
- Constant Field Values
-
-
Method Detail
-
filter
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Description copied from interface: IndexingFilter
Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
- Specified by:
filter in interface IndexingFilter
- Parameters:
doc - document instance for collecting fields
parse - parse data instance
url - page url
datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
inlinks - page inlinks
- Returns:
modified (or a new) document instance, or null (meaning the document should be discarded)
- Throws:
IndexingException - if an error occurs during filtering
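The return contract (a modified document, or null to discard it) can be sketched with stand-in types. Everything below is hypothetical: DocFilter and runFilters are not Nutch APIs, a plain Map stands in for NutchDocument, and the assumption that a caller stops at the first null is made only to illustrate the contract described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; these are NOT the Nutch IndexingFilter or NutchDocument types.
final class FilterChainSketch {

    // A filter returns the (possibly modified) document, or null to discard it.
    interface DocFilter {
        Map<String, List<String>> filter(Map<String, List<String>> doc, String url);
    }

    // Applies each filter in order and stops as soon as one discards the document.
    static Map<String, List<String>> runFilters(List<DocFilter> filters,
                                                Map<String, List<String>> doc,
                                                String url) {
        for (DocFilter filter : filters) {
            doc = filter.filter(doc, url);
            if (doc == null) {
                return null; // document should not be indexed
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // Adds a field to the document and passes it on.
        DocFilter addUrlField = (doc, url) -> {
            doc.computeIfAbsent("url", key -> new ArrayList<>()).add(url);
            return doc;
        };
        // Discards documents under /private/ by returning null.
        DocFilter dropPrivatePages = (doc, url) -> url.contains("/private/") ? null : doc;

        List<DocFilter> chain = List.of(addUrlField, dropPrivatePages);
        System.out.println(runFilters(chain, new HashMap<>(), "http://example.com/a"));
        System.out.println(runFilters(chain, new HashMap<>(), "http://example.com/private/b"));
    }
}
```

The first call prints a document containing the url field; the second prints null because the second filter discards the page.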
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
- Specified by:
getConf in interface Configurable
-
-