Class Robots


  • public class Robots
    extends java.lang.Object
    This class is a cache of robots data. The data is fetched and loaded according to the standard robots rules: it is cached for up to 24 hours, and the format and parsing rules are consistent with http://www.robotstxt.org/wc/robots.html. Apache HttpClient is used to fetch the robots files when necessary. An instance of this class should be constructed statically so that the caching properties work to maximum advantage.
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      protected class  Robots.Host
      This class maintains status for a given host.
      protected static class  Robots.Record
      This class represents a record in a robots.txt file.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static java.lang.String _rcsid  
      protected java.util.Map cache
      The cache map, keyed by protocol/host/port, with a Host object as the value.
      protected ThrottledFetcher fetcher
      Fetcher used to retrieve the robots data.
      protected int refCount
      Reference count
      protected static java.lang.String ROBOT_CONNECTION_TYPE
      Robots connection type value
      protected static java.lang.String ROBOT_FILE_NAME
      Robot file name value
      protected static int ROBOT_TIMEOUT_MILLISECONDS
      Robots fetch timeout value
    • Method Summary

      Modifier and Type Method Description
      protected static boolean doesPathMatch(java.lang.String path, int pathIndex, java.lang.String spec, int specIndex)
      Recursive method for matching specification to path.
      protected static boolean doesPathMatch(java.lang.String path, java.lang.String spec)
      Check if path matches specification
      boolean isFetchAllowed(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, java.lang.String throttleGroupName, java.lang.String protocol, int port, java.lang.String hostName, java.lang.String pathString, java.lang.String userAgent, java.lang.String from, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int connectionLimit)
      Decide whether a specific robot can crawl a specific URL.
      protected static java.lang.String makeReadable(java.lang.String inputString)
      Convert a string from the robots file into a readable form that does NOT contain NUL characters (since PostgreSQL does not accept those).
      void noteConnectionEstablished()
      Note that a connection has been established.
      void noteConnectionReleased()
      Note that a connection has been released, and free resources if no reason to retain them.
      void poll()
      Clean idle resources out of the cache.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • ROBOT_TIMEOUT_MILLISECONDS

        protected static final int ROBOT_TIMEOUT_MILLISECONDS
        Robots fetch timeout value
        See Also:
        Constant Field Values
      • ROBOT_CONNECTION_TYPE

        protected static final java.lang.String ROBOT_CONNECTION_TYPE
        Robots connection type value
        See Also:
        Constant Field Values
      • ROBOT_FILE_NAME

        protected static final java.lang.String ROBOT_FILE_NAME
        Robot file name value
        See Also:
        Constant Field Values
      • fetcher

        protected ThrottledFetcher fetcher
        Fetcher used to retrieve the robots data.
      • refCount

        protected int refCount
        Reference count
      • cache

        protected java.util.Map cache
        The cache map, keyed by protocol/host/port, with a Host object as the value.
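
        The following is a minimal sketch of a cache keyed by protocol/host/port with the 24-hour lifetime mentioned in the class description. It is not the code of this class: the names HostCacheSketch, Entry, makeKey, store, and lookup, as well as the key format, are assumptions made for illustration, and a plain string stands in for the Host object.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the actual Robots implementation.
final class HostCacheSketch {
    // Robots data is considered stale after 24 hours.
    public static final long MAX_AGE_MS = 24L * 60L * 60L * 1000L;

    // Stand-in for the Host object: the robots text plus its fetch time.
    private static final class Entry {
        final String robotsText;
        final long fetchedAtMs;

        Entry(String robotsText, long fetchedAtMs) {
            this.robotsText = robotsText;
            this.fetchedAtMs = fetchedAtMs;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    // Build a key from protocol, host, and port (the format is an assumption).
    public static String makeKey(String protocol, String hostName, int port) {
        return protocol.toLowerCase(Locale.ROOT) + "://"
            + hostName.toLowerCase(Locale.ROOT) + ":" + port;
    }

    public synchronized void store(String key, String robotsText, long nowMs) {
        cache.put(key, new Entry(robotsText, nowMs));
    }

    // Returns the cached text, or null if it is absent or older than 24 hours.
    public synchronized String lookup(String key, long nowMs) {
        Entry e = cache.get(key);
        if (e == null) {
            return null;
        }
        if (nowMs - e.fetchedAtMs >= MAX_AGE_MS) {
            cache.remove(key); // expired: the caller must fetch again
            return null;
        }
        return e.robotsText;
    }
}
```

        Passing the current time in as a parameter keeps the sketch easy to test; a real cache would more likely read the clock itself.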
    • Method Detail

      • noteConnectionEstablished

        public void noteConnectionEstablished()
        Note that a connection has been established.
      • noteConnectionReleased

        public void noteConnectionReleased()
        Note that a connection has been released, and free resources if no reason to retain them.
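
        The pairing of noteConnectionEstablished and noteConnectionReleased with the refCount field suggests a reference-counting pattern. A minimal sketch of that pattern follows, assuming that resources are freed once the count returns to zero; the class RefCountedCacheSketch and its put and size helpers are hypothetical and are not part of this API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of reference-counted cleanup; not the actual Robots code.
final class RefCountedCacheSketch {
    private final Map<String, String> cache = new HashMap<>();
    private int refCount = 0;

    // A new connection now depends on the cached data.
    public synchronized void noteConnectionEstablished() {
        refCount++;
    }

    // A connection went away; free the cached data when nobody is left.
    public synchronized void noteConnectionReleased() {
        if (refCount == 0) {
            throw new IllegalStateException("release without matching establish");
        }
        refCount--;
        if (refCount == 0) {
            cache.clear();
        }
    }

    public synchronized void put(String key, String value) {
        cache.put(key, value);
    }

    public synchronized int size() {
        return cache.size();
    }
}
```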
      • poll

        public void poll()
        Clean idle resources out of the cache.
      • isFetchAllowed

        public boolean isFetchAllowed(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                      java.lang.String throttleGroupName,
                                      java.lang.String protocol,
                                      int port,
                                      java.lang.String hostName,
                                      java.lang.String pathString,
                                      java.lang.String userAgent,
                                      java.lang.String from,
                                      java.lang.String proxyHost,
                                      int proxyPort,
                                      java.lang.String proxyAuthDomain,
                                      java.lang.String proxyAuthUsername,
                                      java.lang.String proxyAuthPassword,
                                      org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
                                      int connectionLimit)
                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                      org.apache.manifoldcf.agents.interfaces.ServiceInterruption
        Decide whether a specific robot can crawl a specific URL. A ServiceInterruption exception is thrown if the fetch itself fails in a transient way. A permanent failure (such as an invalid URL) will throw a ManifoldCFException.
        Parameters:
        userAgent - is the user-agent string used by the robot.
        from - is the email address.
        protocol - is the name of the protocol (e.g. "http")
        port - is the port number (-1 being the default for the protocol)
        hostName - is the fully qualified domain name (FQDN) of the host
        pathString - is the path (non-query) part of the URL
        Returns:
        true if fetch is allowed, false otherwise.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
      • makeReadable

        protected static java.lang.String makeReadable(java.lang.String inputString)
        Convert a string from the robots file into a readable form that does NOT contain NUL characters (since PostgreSQL does not accept those).
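
        One way such a conversion could work is sketched below. The documentation only states that the result contains no NUL characters; replacing each NUL with a space (rather than dropping it) is an assumption, and the class name ReadableTextSketch is hypothetical.

```java
// Hypothetical illustration only; the real makeReadable may behave differently.
final class ReadableTextSketch {
    private ReadableTextSketch() {
    }

    // Return a copy of the input in which every NUL character is replaced by a space.
    public static String makeReadable(String inputString) {
        StringBuilder sb = new StringBuilder(inputString.length());
        for (int i = 0; i < inputString.length(); i++) {
            char c = inputString.charAt(i);
            sb.append(c == '\u0000' ? ' ' : c);
        }
        return sb.toString();
    }
}
```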
      • doesPathMatch

        protected static boolean doesPathMatch(java.lang.String path,
                                               java.lang.String spec)
        Check if path matches specification
      • doesPathMatch

        protected static boolean doesPathMatch(java.lang.String path,
                                               int pathIndex,
                                               java.lang.String spec,
                                               int specIndex)
        Recursive method for matching specification to path.
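
        The two doesPathMatch overloads suggest a public entry point that delegates to an index-based recursive matcher. The sketch below shows one way to write such a matcher. The matching rules are assumptions, since this page does not specify them: the specification matches as a prefix, '*' matches any run of characters, and a trailing '$' anchors the match to the end of the path (both are common robots.txt extensions). The class name PathMatcherSketch is hypothetical.

```java
// Hypothetical illustration only; the matching rules of the real class may differ.
final class PathMatcherSketch {
    private PathMatcherSketch() {
    }

    // Entry point: start matching at the beginning of both strings.
    public static boolean doesPathMatch(String path, String spec) {
        return doesPathMatch(path, 0, spec, 0);
    }

    // Recursive matcher over the remaining path and specification.
    static boolean doesPathMatch(String path, int pathIndex, String spec, int specIndex) {
        while (true) {
            if (specIndex == spec.length()) {
                return true; // specification exhausted: prefix match succeeded
            }
            char s = spec.charAt(specIndex);
            if (s == '$' && specIndex == spec.length() - 1) {
                return pathIndex == path.length(); // trailing '$': path must end here
            }
            if (s == '*') {
                // Try every possible length for the wildcard, including zero.
                for (int i = pathIndex; i <= path.length(); i++) {
                    if (doesPathMatch(path, i, spec, specIndex + 1)) {
                        return true;
                    }
                }
                return false;
            }
            if (pathIndex == path.length() || path.charAt(pathIndex) != s) {
                return false; // literal character mismatch or path too short
            }
            pathIndex++;
            specIndex++;
        }
    }
}
```

        Under these assumptions, a specification of "/private/" matches "/private/data.html" but not "/public/x", and "/*.pdf$" matches "/a/b/c.pdf" but not "/a/b/c.pdfx".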