Class Robots
- java.lang.Object
-
- org.apache.manifoldcf.crawler.connectors.rss.Robots
-
public class Robots extends java.lang.Object
This class is a cache of robots data. The data is fetched and cached according to standard robots rules: entries are cached for up to 24 hours, and the format and parsing rules are consistent with http://www.robotstxt.org/wc/robots.html. The Apache HttpClient is used to fetch the robots files when necessary. An instance of this class should be constructed statically so that the caching works to maximum advantage.
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
protected class | Robots.Host | This class maintains status for a given host.
protected static class | Robots.Record | This class represents a record in a robots.txt file.
-
Field Summary
Fields
Modifier and Type | Field | Description
static java.lang.String | _rcsid |
protected java.util.Map | cache | This is the cache hash, which is keyed by the protocol/host/port and has a Host object as the value.
protected ThrottledFetcher | fetcher | Fetcher to use to get the data
protected int | refCount | Reference count
protected static java.lang.String | ROBOT_CONNECTION_TYPE | Robots connection type value
protected static java.lang.String | ROBOT_FILE_NAME | Robot file name value
protected static int | ROBOT_TIMEOUT_MILLISECONDS | Robots fetch timeout value
-
Constructor Summary
Constructors
Constructor | Description
Robots(ThrottledFetcher fetcher) | Constructor.
-
Method Summary
Modifier and Type | Method | Description
protected static boolean | doesPathMatch(java.lang.String path, int pathIndex, java.lang.String spec, int specIndex) | Recursive method for matching specification to path.
protected static boolean | doesPathMatch(java.lang.String path, java.lang.String spec) | Check if path matches specification.
boolean | isFetchAllowed(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, java.lang.String throttleGroupName, java.lang.String protocol, int port, java.lang.String hostName, java.lang.String pathString, java.lang.String userAgent, java.lang.String from, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int connectionLimit) | Decide whether a specific robot can crawl a specific URL.
protected static java.lang.String | makeReadable(java.lang.String inputString) | Convert a string from the robots file into a readable form that does NOT contain NUL characters (since postgresql does not accept those).
void | noteConnectionEstablished() | Note that a connection has been established.
void | noteConnectionReleased() | Note that a connection has been released, and free resources if no reason to retain them.
void | poll() | Clean idle stuff out of cache.
-
-
-
Field Detail
-
_rcsid
public static final java.lang.String _rcsid
- See Also:
- Constant Field Values
-
ROBOT_TIMEOUT_MILLISECONDS
protected static final int ROBOT_TIMEOUT_MILLISECONDS
Robots fetch timeout value
- See Also:
- Constant Field Values
-
ROBOT_CONNECTION_TYPE
protected static final java.lang.String ROBOT_CONNECTION_TYPE
Robots connection type value
- See Also:
- Constant Field Values
-
ROBOT_FILE_NAME
protected static final java.lang.String ROBOT_FILE_NAME
Robot file name value
- See Also:
- Constant Field Values
-
fetcher
protected ThrottledFetcher fetcher
Fetcher used to retrieve the robots data.
-
refCount
protected int refCount
Reference count
-
cache
protected java.util.Map cache
The cache map, keyed by protocol/host/port, with a Host object as each value.
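To make the caching behaviour concrete, here is a minimal stand-alone sketch of a map keyed by protocol/host/port whose entries expire after 24 hours. It is not the ManifoldCF implementation: the class RobotsCacheSketch, the HostEntry stand-in for Robots.Host, the key format, the lowercasing of the key, and the explicit nowMillis parameters are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-alone sketch; none of these names come from ManifoldCF.
class RobotsCacheSketch {
    // Maximum age of a cached entry: 24 hours, as stated in the class description.
    static final long MAX_AGE_MILLIS = 24L * 60L * 60L * 1000L;

    // Hypothetical stand-in for the per-host value object.
    static final class HostEntry {
        final String robotsText;
        final long fetchedAtMillis;

        HostEntry(String robotsText, long fetchedAtMillis) {
            this.robotsText = robotsText;
            this.fetchedAtMillis = fetchedAtMillis;
        }

        boolean isExpired(long nowMillis) {
            return nowMillis - fetchedAtMillis >= MAX_AGE_MILLIS;
        }
    }

    private final Map<String, HostEntry> cache = new HashMap<>();

    // Build the protocol/host/port key; lowercasing is an assumption of this sketch.
    static String makeKey(String protocol, String hostName, int port) {
        return protocol.toLowerCase(Locale.ROOT) + "://"
            + hostName.toLowerCase(Locale.ROOT) + ":" + port;
    }

    synchronized void put(String protocol, String hostName, int port,
                          String robotsText, long nowMillis) {
        cache.put(makeKey(protocol, hostName, port), new HostEntry(robotsText, nowMillis));
    }

    // Returns the cached text, or null when the entry is missing or too old.
    synchronized String get(String protocol, String hostName, int port, long nowMillis) {
        String key = makeKey(protocol, hostName, port);
        HostEntry entry = cache.get(key);
        if (entry == null) {
            return null;
        }
        if (entry.isExpired(nowMillis)) {
            cache.remove(key);
            return null;
        }
        return entry.robotsText;
    }

    // Remove every expired entry, in the spirit of a periodic cleanup call.
    synchronized void poll(long nowMillis) {
        cache.values().removeIf(entry -> entry.isExpired(nowMillis));
    }

    synchronized int size() {
        return cache.size();
    }
}
```

Passing the current time in explicitly keeps the sketch deterministic; real code would more likely read the clock itself.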
-
-
Constructor Detail
-
Robots
public Robots(ThrottledFetcher fetcher)
Constructor.
-
-
Method Detail
-
noteConnectionEstablished
public void noteConnectionEstablished()
Note that a connection has been established.
-
noteConnectionReleased
public void noteConnectionReleased()
Note that a connection has been released, and free resources if no reason to retain them.
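The pairing of noteConnectionEstablished() and noteConnectionReleased() with the refCount field suggests a reference-counting pattern. The sketch below shows one plausible reading, in which cached data is dropped once the count returns to zero. This is an assumption: the source only says resources are freed "if no reason to retain them". The class RefCountedCacheSketch and its helper methods remember and cachedEntries are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of reference-counted cache ownership; not ManifoldCF code.
class RefCountedCacheSketch {
    private int refCount = 0;
    private final Map<String, String> cache = new HashMap<>();

    // Called when a connection starts using the shared cache.
    synchronized void noteConnectionEstablished() {
        refCount++;
    }

    // Called when a connection stops using the shared cache.
    synchronized void noteConnectionReleased() {
        if (refCount == 0) {
            throw new IllegalStateException("released more often than established");
        }
        refCount--;
        if (refCount == 0) {
            // Assumption: nobody needs the cached data any more, so free it.
            cache.clear();
        }
    }

    // Hypothetical helper that stores a value in the shared cache.
    synchronized void remember(String key, String value) {
        cache.put(key, value);
    }

    // Hypothetical helper that reports how many entries are cached.
    synchronized int cachedEntries() {
        return cache.size();
    }
}
```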
-
poll
public void poll()
Clean idle entries out of the cache.
-
isFetchAllowed
public boolean isFetchAllowed(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, java.lang.String throttleGroupName, java.lang.String protocol, int port, java.lang.String hostName, java.lang.String pathString, java.lang.String userAgent, java.lang.String from, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int connectionLimit) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Decide whether a specific robot can crawl a specific URL. A ServiceInterruption exception is thrown if the fetch itself fails in a transient way. A permanent failure (such as an invalid URL) will throw a ManifoldCFException.
- Parameters:
- userAgent - is the user-agent string used by the robot.
- from - is the email address.
- protocol - is the name of the protocol (e.g. "http").
- port - is the port number (-1 being the default for the protocol).
- hostName - is the FQDN of the host.
- pathString - is the path (non-query) part of the URL.
- Returns:
- true if fetch is allowed, false otherwise.
- Throws:
- org.apache.manifoldcf.core.interfaces.ManifoldCFException
- org.apache.manifoldcf.agents.interfaces.ServiceInterruption
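As a rough illustration of the kind of decision this method makes, the sketch below evaluates already-fetched robots.txt text against a user agent and a path, following the original robotstxt.org conventions (case-insensitive substring match on the agent name, prefix match on Disallow paths). It is a simplified stand-alone example, not the ManifoldCF logic: it does no fetching, throttling, or proxy handling, and the class RobotsDecisionSketch, its Record type, and the three-argument isFetchAllowed are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a robots.txt allow/deny decision; not ManifoldCF code.
class RobotsDecisionSketch {
    // Hypothetical stand-in for one record: a group of agents plus their Disallow paths.
    static final class Record {
        final List<String> agents = new ArrayList<>();
        final List<String> disallows = new ArrayList<>();
    }

    static List<Record> parse(String robotsText) {
        List<Record> records = new ArrayList<>();
        Record current = null;
        boolean lastWasAgent = false;
        for (String rawLine : robotsText.split("\r?\n")) {
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line, comment, or malformed line
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (name.equals("user-agent")) {
                // Consecutive User-agent lines share one record; otherwise start a new one.
                if (current == null || !lastWasAgent) {
                    current = new Record();
                    records.add(current);
                }
                current.agents.add(value.toLowerCase(Locale.ROOT));
                lastWasAgent = true;
            } else {
                if (name.equals("disallow") && current != null && !value.isEmpty()) {
                    current.disallows.add(value); // an empty Disallow means "allow everything"
                }
                lastWasAgent = false;
            }
        }
        return records;
    }

    static boolean isFetchAllowed(String robotsText, String userAgent, String pathString) {
        String agent = userAgent.toLowerCase(Locale.ROOT);
        Record specific = null;
        Record wildcard = null;
        for (Record record : parse(robotsText)) {
            for (String candidate : record.agents) {
                if (candidate.equals("*")) {
                    if (wildcard == null) wildcard = record;
                } else if (!candidate.isEmpty() && agent.contains(candidate)) {
                    if (specific == null) specific = record;
                }
            }
        }
        // A record naming this robot wins over the "*" record; no record means allowed.
        Record chosen = (specific != null) ? specific : wildcard;
        if (chosen == null) {
            return true;
        }
        for (String prefix : chosen.disallows) {
            if (pathString.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```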
-
makeReadable
protected static java.lang.String makeReadable(java.lang.String inputString)
Convert a string from the robots file into a readable form that does NOT contain NUL characters (since postgresql does not accept those).
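A minimal sketch of this kind of sanitizing step follows. The source only states that the result must not contain NUL characters; replacing each NUL with a space (rather than dropping or escaping it) is an assumption of this example, as is the class name MakeReadableSketch.

```java
// Hypothetical sketch of NUL sanitizing; not the ManifoldCF implementation.
class MakeReadableSketch {
    static String makeReadable(String inputString) {
        StringBuilder sb = new StringBuilder(inputString.length());
        for (int i = 0; i < inputString.length(); i++) {
            char c = inputString.charAt(i);
            // Assumption: substitute a space for each NUL so the text stays aligned.
            sb.append(c == '\0' ? ' ' : c);
        }
        return sb.toString();
    }
}
```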
-
doesPathMatch
protected static boolean doesPathMatch(java.lang.String path, java.lang.String spec)
Check if path matches specification.
-
doesPathMatch
protected static boolean doesPathMatch(java.lang.String path, int pathIndex, java.lang.String spec, int specIndex)
Recursive method for matching specification to path.
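The two overloads suggest a public entry point that delegates to an index-based recursive matcher. The sketch below shows how such a matcher could work. The matching rules are assumptions, since the source does not describe them: the specification is treated as a path prefix, `*` matches any run of characters, and a trailing `$` anchors the end of the path (common robots.txt extensions). The class name PathMatchSketch is hypothetical.

```java
// Hypothetical sketch of recursive path/specification matching; not ManifoldCF code.
class PathMatchSketch {
    // Entry point: start both cursors at the beginning.
    static boolean doesPathMatch(String path, String spec) {
        return doesPathMatch(path, 0, spec, 0);
    }

    static boolean doesPathMatch(String path, int pathIndex, String spec, int specIndex) {
        while (true) {
            if (specIndex == spec.length()) {
                return true; // specification exhausted: it matched as a prefix
            }
            char s = spec.charAt(specIndex);
            if (s == '$' && specIndex == spec.length() - 1) {
                return pathIndex == path.length(); // trailing '$' requires end of path
            }
            if (s == '*') {
                // Let the wildcard consume 0..n characters and recurse on the remainder.
                for (int i = pathIndex; i <= path.length(); i++) {
                    if (doesPathMatch(path, i, spec, specIndex + 1)) {
                        return true;
                    }
                }
                return false;
            }
            if (pathIndex == path.length() || path.charAt(pathIndex) != s) {
                return false; // path ran out or literal characters differ
            }
            pathIndex++;
            specIndex++;
        }
    }
}
```

Literal runs are handled in the loop and only wildcards recurse, which keeps the recursion depth bounded by the number of `*` characters in the specification.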
-
-