Package org.apache.nutch.protocol.ftp
Class Ftp
- java.lang.Object
-
- org.apache.nutch.protocol.ftp.Ftp
-
- All Implemented Interfaces:
Configurable, Pluggable, Protocol
public class Ftp extends Object implements Protocol
This class is a protocol plugin used for the ftp: scheme. It creates an FtpResponse object and gets the content of the URL from it. Configurable parameters are ftp.username, ftp.password, ftp.content.limit, ftp.timeout, ftp.server.timeout, ftp.keep.connection and ftp.follow.talk. For details see the "FTP properties" section in nutch-default.xml.
-
-
Field Summary
Fields
protected static org.slf4j.Logger LOG
Fields inherited from interface org.apache.nutch.protocol.Protocol
X_POINT_ID
-
-
Constructor Summary
Constructors
Ftp()
-
Method Summary
All Methods
protected void finalize()
int getBufferSize()
Configuration getConf(): Get the Configuration object.
ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum): Creates an FtpResponse object corresponding to the URL and returns a ProtocolOutput object as per the content received.
crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent): Get the robots rules for a given URL.
static void main(String[] args): For debugging.
void setConf(Configuration conf): Set the Configuration object.
void setFollowTalk(boolean followTalk): Set followTalk, i.e. whether to log the dialogue between our client and the remote server.
void setKeepConnection(boolean keepConnection): Whether to keep the FTP connection.
void setMaxContentLength(int length): Set the length after which content is truncated.
void setTimeout(int to): Set the timeout.
-
-
-
Method Detail
-
setTimeout
public void setTimeout(int to)
Set the timeout.
Parameters:
to - a maximum timeout in milliseconds
-
setMaxContentLength
public void setMaxContentLength(int length)
Set the length after which content is truncated.
Parameters:
length - max content length in bytes
-
setFollowTalk
public void setFollowTalk(boolean followTalk)
Set followTalk, i.e. whether to log the dialogue between our client and the remote server. Useful for debugging.
Parameters:
followTalk - if true, the dialogue is logged; false by default
-
setKeepConnection
public void setKeepConnection(boolean keepConnection)
Whether to keep the FTP connection open. Useful when crawling the same host repeatedly. When set to true, it avoids connection, login and directory-list parser setup for subsequent URLs. If it is set to true, however, you must make sure (roughly) that:
(1) ftp.timeout is less than ftp.server.timeout
(2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay)
Otherwise there will be too many "delete client because idled too long" messages in the thread logs.
Parameters:
keepConnection - if true, the connection is kept; false by default
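The two rough conditions on ftp.timeout can be checked before enabling ftp.keep.connection. The following is a minimal sketch in plain Java, not part of Nutch: the class KeepConnectionCheck and its method isConsistent are hypothetical names, the numeric values are illustrative only, and it assumes that both FTP timeouts are given in milliseconds and that fetcher.server.delay is given in seconds.

```java
/**
 * Hypothetical helper (not part of Nutch): checks the two rough conditions
 * for keeping FTP connections open.
 */
final class KeepConnectionCheck {

    private KeepConnectionCheck() {}

    /**
     * @param ftpTimeoutMs          value of ftp.timeout, assumed to be in milliseconds
     * @param ftpServerTimeoutMs    value of ftp.server.timeout, assumed to be in milliseconds
     * @param fetcherThreads        value of fetcher.threads.fetch
     * @param fetcherServerDelaySec value of fetcher.server.delay, assumed to be in seconds
     * @return true if both conditions hold
     */
    static boolean isConsistent(long ftpTimeoutMs, long ftpServerTimeoutMs,
                                int fetcherThreads, double fetcherServerDelaySec) {
        // Convert threads * delay to milliseconds so it is comparable to ftp.timeout.
        long fetchCycleMs = Math.round(fetcherThreads * fetcherServerDelaySec * 1000.0);

        // (1) ftp.timeout is less than ftp.server.timeout
        boolean belowServerTimeout = ftpTimeoutMs < ftpServerTimeoutMs;
        // (2) ftp.timeout is larger than fetcher.threads.fetch * fetcher.server.delay
        boolean aboveFetchCycle = ftpTimeoutMs > fetchCycleMs;

        return belowServerTimeout && aboveFetchCycle;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(isConsistent(60_000, 100_000, 10, 5.0));
        System.out.println(isConsistent(60_000, 100_000, 50, 5.0));
    }
}
```

With these illustrative values, the first call satisfies both conditions. The second fails condition (2), because 50 threads with a 5 second delay exceed the 60 second timeout.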
-
getProtocolOutput
public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
Creates an FtpResponse object corresponding to the URL and returns a ProtocolOutput object as per the content received.
Specified by:
getProtocolOutput in interface Protocol
Parameters:
url - Text containing the FTP URL
datum - the CrawlDatum object corresponding to the URL
Returns:
ProtocolOutput object for the URL
-
main
public static void main(String[] args) throws Exception
For debugging.
Parameters:
args - run with no args for help
Throws:
Exception - if there is an error running this program
-
setConf
public void setConf(Configuration conf)
Set the Configuration object.
Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
Get the Configuration object.
Specified by:
getConf in interface Configurable
-
getRobotRules
public crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
Get the robots rules for a given URL.
Specified by:
getRobotRules in interface Protocol
Parameters:
url - URL to check
datum - page datum
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). The response Content is appended to the passed list. If null is passed, nothing is stored.
Returns:
robot rules (specific for this URL or default), never null
-
getBufferSize
public int getBufferSize()
-
-