Package org.apache.nutch.protocol.file
Class File
- java.lang.Object
  - org.apache.nutch.protocol.file.File
- All Implemented Interfaces:
Configurable, Pluggable, Protocol
public class File extends Object implements Protocol
This class is a protocol plugin for the file: scheme. It creates a FileResponse object and gets the content of the URL from it. The configurable parameters are file.content.limit and file.crawl.parent, defined in the "file properties" section of nutch-default.xml.
- Author:
John Xing
-
Field Summary
Fields
Modifier and Type | Field | Description
protected static org.slf4j.Logger | LOG |
Fields inherited from interface org.apache.nutch.protocol.Protocol
X_POINT_ID
-
Constructor Summary
Constructors
Constructor | Description
File() |
-
Method Summary
Modifier and Type | Method | Description
Configuration | getConf() | Get the Configuration object.
ProtocolOutput | getProtocolOutput(Text url, CrawlDatum datum) | Creates a FileResponse object corresponding to the URL and returns a ProtocolOutput object for the content received.
crawlercommons.robots.BaseRobotRules | getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent) | No robots parsing is done for the file protocol.
static void | main(String[] args) | Quick way to run this class.
void | setConf(Configuration conf) | Set the Configuration object.
void | setMaxContentLength(int maxContentLength) | Set the length after which content is truncated.
-
Method Detail
-
setConf
public void setConf(Configuration conf)
Set the Configuration object.
- Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
Get the Configuration object.
- Specified by:
getConf in interface Configurable
-
setMaxContentLength
public void setMaxContentLength(int maxContentLength)
Set the length after which content is truncated.
- Parameters:
maxContentLength - maximum content length in bytes
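To illustrate what truncating content at a byte limit means, here is a minimal sketch that uses only the JDK. It is not the Nutch implementation: the class ContentLimitSketch and the method readWithLimit are hypothetical names, and treating a negative limit as "no truncation" is an assumption made for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: hypothetical helper, not part of Nutch.
final class ContentLimitSketch {

    /**
     * Reads a local file and keeps at most maxContentLength bytes.
     * Assumption for this sketch: a negative limit disables truncation.
     */
    static byte[] readWithLimit(Path file, int maxContentLength) throws IOException {
        if (maxContentLength < 0) {
            return Files.readAllBytes(file);
        }
        try (InputStream in = Files.newInputStream(file)) {
            // readNBytes (Java 11+) returns fewer bytes if the file is shorter than the limit.
            return in.readNBytes(maxContentLength);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("content-limit", ".txt");
        try {
            Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
            System.out.println(readWithLimit(tmp, 4).length);   // truncated to 4 bytes
            System.out.println(readWithLimit(tmp, 100).length); // file is shorter than the limit: 10 bytes
            System.out.println(readWithLimit(tmp, -1).length);  // no truncation: 10 bytes
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Reading through a bounded call rather than loading the whole file first keeps memory use proportional to the limit, which matters when a crawl encounters very large local files.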
-
getProtocolOutput
public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
Creates a FileResponse object corresponding to the URL and returns a ProtocolOutput object for the content received.
- Specified by:
getProtocolOutput in interface Protocol
- Parameters:
url - Text containing the URL
datum - the CrawlDatum object corresponding to the URL
- Returns:
ProtocolOutput object for the content of the file indicated by url
-
main
public static void main(String[] args) throws Exception
Quick way to run this class. Useful for debugging.
- Parameters:
args - run with no args to print help
- Throws:
Exception - if there is a fatal error running this class with the given input
-
getRobotRules
public crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
No robots parsing is done for the file protocol, so this returns an empty set of rules that allows every URL.
- Specified by:
getRobotRules in interface Protocol
- Parameters:
url - URL to check
datum - page datum
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.
- Returns:
robot rules (specific for this URL or default), never null
-