public class WebcrawlerConnector extends BaseRepositoryConnector
| Modifier and Type | Class and Description |
|---|---|
| protected static class | WebcrawlerConnector.CanonicalizationPolicies: Class representing a list of canonicalization rules. |
| protected static class | WebcrawlerConnector.CanonicalizationPolicy: Class representing a URL regular expression match, for the purposes of determining canonicalization policy. |
| protected class | WebcrawlerConnector.DocumentURLFilter: This class describes the URL filtering information (for crawling and indexing) obtained from a digested DocumentSpecification. |
| protected static class | WebcrawlerConnector.EvaluatorToken: Evaluator token. |
| protected static class | WebcrawlerConnector.EvaluatorTokenStream: Token stream. |
| protected class | WebcrawlerConnector.FeedContextClass |
| protected class | WebcrawlerConnector.FeedItemContextClass |
| protected static class | WebcrawlerConnector.FetchStatus |
| protected static class | WebcrawlerConnector.MappingRule: Class representing a mapping rule. |
| protected static class | WebcrawlerConnector.MappingRules: Class that represents all mappings. |
| protected static class | WebcrawlerConnector.NameValue: Name/value class. |
| protected class | WebcrawlerConnector.OuterContextClass: This class handles the outermost XML context for the feed document. |
| protected class | WebcrawlerConnector.ProcessActivityHTMLHandler: Class that describes HTML handling. |
| protected class | WebcrawlerConnector.ProcessActivityLinkHandler: This class is the handler for links that get added into an IProcessActivity object. |
| protected class | WebcrawlerConnector.ProcessActivityRedirectionHandler: Class that describes redirection handling. |
| protected class | WebcrawlerConnector.ProcessActivityXMLHandler: Class that describes XML handling. |
| protected class | WebcrawlerConnector.RDFContextClass |
| protected class | WebcrawlerConnector.RDFItemContextClass |
| protected class | WebcrawlerConnector.RSSChannelContextClass |
| protected class | WebcrawlerConnector.RSSContextClass |
| protected class | WebcrawlerConnector.RSSItemContextClass |
| protected class | WebcrawlerConnector.UrlsetContextClass |
| protected class | WebcrawlerConnector.UrlsetItemContextClass |
| Modifier and Type | Field and Description |
|---|---|
| static String | _rcsid |
| static String | ACTIVITY_FETCH |
| static String | ACTIVITY_LOGON_END |
| static String | ACTIVITY_LOGON_START |
| static String | ACTIVITY_PROCESS |
| static String | ACTIVITY_ROBOTSPARSE |
| protected static DataCache | cache: This is where we keep data around between the getVersions() phase and the processDocuments() phase. |
| protected int | connectionTimeoutMilliseconds: Connection timeout, milliseconds. |
| protected CookieManager | cookieManager: The cookie manager used by this instance |
| protected CredentialsDescription | credentialsDescription: The credentials description |
| protected DNSManager | dnsManager: The DNS manager currently used by this instance |
| protected static String | FETCH_LOGIN |
| protected static String | FETCH_ROBOTS |
| protected static String | FETCH_STANDARD |
| protected String | from: The email address for this connector instance |
| protected static String[] | interestingMimeTypeArray: This represents a list of the mime types that this connector knows how to extract links from. |
| protected static Set<String> | interestingMimeTypeMap |
| protected boolean | isInitialized: This flag is set when the instance has been initialized |
| protected static List<String> | potentiallyExcludedHeaders |
| protected String | proxyAuthDomain: Proxy auth domain |
| protected String | proxyAuthPassword: Proxy auth password |
| protected String | proxyAuthUsername: Proxy auth user name |
| protected String | proxyHost: Proxy host |
| protected int | proxyPort: Proxy port |
| static String | REL_LINK |
| static String | REL_REDIRECT |
| protected static Set<String> | reservedHeaders |
| protected static int | RESULT_NO_DOCUMENT |
| protected static int | RESULT_NO_VERSION |
| protected static int | RESULT_RETRY_DOCUMENT |
| protected static int | RESULT_VERSION_NEEDED |
| protected static int | RESULTSTATUS_FALSE |
| protected static int | RESULTSTATUS_NOTYETDETERMINED |
| protected static int | RESULTSTATUS_TRUE |
| protected static int | ROBOTS_ALL |
| protected static int | ROBOTS_DATA |
| protected static int | ROBOTS_NONE |
| protected RobotsManager | robotsManager: The robots manager currently used by this instance |
| protected int | robotsUsage: Robots usage flag |
| protected static int | SESSIONSTATE_LOGIN: We're in 'login mode' |
| protected static int | SESSIONSTATE_NORMAL: Normal fetch of content document. |
| protected int | socketTimeoutMilliseconds: Socket timeout, milliseconds |
| protected ThrottleDescription | throttleDescription: The throttle description |
| protected String | throttleGroupName: Throttle group name |
| protected TrustsDescription | trustsDescription: The trusts description |
| protected static Set<String> | understoodProtocols |
| protected String | userAgent: The user-agent for this connector instance |

Fields inherited from class BaseConnector: currentContext, params
Fields inherited from interface IRepositoryConnector: GLOBAL_DENY_TOKEN, JOBMODE_CONTINUOUS, JOBMODE_ONCEONLY, MODEL_ADD, MODEL_ADD_CHANGE, MODEL_ADD_CHANGE_DELETE, MODEL_ALL, MODEL_CHAINED_ADD, MODEL_CHAINED_ADD_CHANGE, MODEL_CHAINED_ADD_CHANGE_DELETE, MODEL_PARTIAL

| Constructor and Description |
|---|
| WebcrawlerConnector(): Constructor. |

| Modifier and Type | Method and Description |
|---|---|
| String | addSeedDocuments(ISeedingActivity activities, Specification spec, String lastSeedVersion, long seedTime, int jobMode): Queue "seed" documents. |
| protected String[] | calculateDocumentEvents(INamingActivity activities, String documentIdentifier): Calculate events that should be associated with a document. |
| String | check(): Check status of connection. |
| protected int | checkFetchAllowed(String documentIdentifier, String protocol, String hostIPAddress, int port, PageCredentials credential, IKeystoreManager trustStore, String hostName, String[] binNames, long currentTime, String pathString, IVersionActivity versionActivities, int connectionLimit, String proxyHost, int proxyPort, String proxyAuthDomain, String proxyAuthUsername, String proxyAuthPassword): Check robots to see if fetch is allowed. |
| void | clearThreadContext(): Clear out any state information specific to a given thread. |
| protected static void | compileList(List<Pattern> output, List<String> input): Compile all regexp entries in the passed in list, and add them to the output list. |
| void | deinstall(IThreadContext threadContext): Uninstall the connector. |
| void | disconnect(): Close the connection. |
| protected String | doCanonicalization(WebcrawlerConnector.DocumentURLFilter filter, WebURL url): Code to canonicalize a URL. |
| protected String | documentIdentifiertoFileName(String documentIdentifier): Convert a document identifier to filename. |
| protected static String | extractContentType(String contentType) |
| protected static String | extractEncoding(String contentType) |
| protected boolean | extractLinks(String documentIdentifier, IProcessActivity activities, WebcrawlerConnector.DocumentURLFilter filter): Code to extract links from an already-fetched document. |
| protected static String | extractMimeType(String contentType) |
| protected static Set<String> | findExcludedHeaders(Specification spec): Read a document specification to get a set of excluded headers. |
| protected FormData | findHTMLForm(String currentURI, LoginParameters lp): Find matching HTML form data, if present. |
| protected String | findHTMLLinkURI(String currentURI, LoginParameters lp): Find HTML link URI, if present, making sure specified preference is matched. |
| protected static List<WebcrawlerConnector.NameValue> | findMetadata(Specification spec): Read a document specification to yield a map of name/value pairs for metadata. |
| protected String | findPreferredRedirectionURI(String currentURI, LoginParameters lp): Find a preferred redirection URI, if it exists. |
| protected String | findRedirectionURI(String currentURI): Find a redirection URI, if it exists. |
| protected String | findSpecifiedContent(String currentURI, LoginParameters lp): Find existence of specific content on the page (never finds a URL). |
| protected static String[] | getAcls(Specification spec): Grab forced acl out of document specification. |
| String[] | getActivitiesList(): Return the list of activities that this connector supports. |
| String[] | getBinNames(String documentIdentifier): Get the bin name string for a document identifier. |
| int | getConnectorModel(): Tell the world what model this connector uses for getDocumentIdentifiers(). |
| String | getFormCheckJavascriptMethodName(int connectionSequenceNumber): Obtain the name of the form check javascript method to call. |
| String | getFormPresaveCheckJavascriptMethodName(int connectionSequenceNumber): Obtain the name of the form presave check javascript method to call. |
| int | getMaxDocumentRequest(): Get the maximum number of documents to amalgamate together into one batch, for this connector. |
| protected PageCredentials | getPageCredential(String documentIdentifier): Get the page credentials for a given document identifier (URL). |
| String[] | getRelationshipTypes(): Return the list of relationship types that this connector recognizes. |
| protected SequenceCredentials | getSequenceCredential(String documentIdentifier): Get the sequence credentials for a given document identifier (URL). |
| protected void | getSession(): Start a session. |
| protected IKeystoreManager | getTrustStore(String documentIdentifier): Get the trust store for a given document identifier (URL). |
| protected void | handleHTML(String documentURI, IHTMLHandler handler): Handle document references from HTML. |
| protected static void | handleIOException(IOException e, String context) |
| protected void | handleRedirects(String documentURI, IRedirectionHandler handler): Handle extracting the redirect link from a redirect response. |
| protected void | handleXML(String documentURI, IXMLHandler handler): Handle document references from XML. |
| void | install(IThreadContext threadContext): Install the connector. |
| protected boolean | isContentInteresting(IFingerprintActivity activities, String documentIdentifier, int response, String contentType): Code to check if data is interesting, based on response code and content type. |
| protected boolean | isDocumentText(String documentURI): Is the document text, as far as we can tell? |
| protected static boolean | isStrange(byte x): Check if character is not typical ASCII or utf-8. |
| protected static boolean | isText(byte[] beginChunk, int chunkLength): Test to see if a document is text or not. |
| protected static boolean | isWhiteSpace(byte x): Check if a byte is a whitespace character. |
| protected void | loginAndFetch(WebcrawlerConnector.FetchStatus fetchStatus, IProcessActivity activities, String documentIdentifier, SequenceCredentials sessionCredential, String globalSequenceEvent) |
| protected int | lookupIPAddress(String documentIdentifier, IVersionActivity activities, String hostName, long currentTime, StringBuilder ipAddressBuffer): Look up an IP address given a non-canonical host name. |
| protected String | makeDNSEventName(INamingActivity activities, String hostNameKey): Calculate the event name for DNS access. |
| protected String | makeDocumentIdentifier(String parentIdentifier, String rawURL, WebcrawlerConnector.DocumentURLFilter filter): Convert an absolute or relative URL to a document identifier. |
| protected String | makeRobotsEventName(INamingActivity versionActivities, String robotsKey): Construct a name for the global web-connector robots event. |
| protected static String | makeRobotsKey(String protocol, String hostName, int port): Construct the robots key for a host. |
| protected String | makeSessionLoginEventName(INamingActivity activities, String sequenceKey): Calculate the event name for session login. |
| void | outputConfigurationBody(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters, String tabName): Output the configuration body section. |
| void | outputConfigurationHeader(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters, List<String> tabsArray): Output the configuration header section. |
| void | outputSpecificationBody(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber, int actualSequenceNumber, String tabName): Output the specification body section. |
| void | outputSpecificationHeader(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber, List<String> tabsArray): Output the specification header section. |
| void | poll(): This method is periodically called for all connectors that are connected but not in active use. |
| String | processConfigurationPost(IThreadContext threadContext, IPostParameters variableContext, Locale locale, ConfigParams parameters): Process a configuration post. |
| protected void | processDocument(IProcessActivity activities, String documentIdentifier, String versionString, boolean indexDocument, Map<String,Set<String>> metaHash, Map<String,Set<String>> metaHash2, String[] acls, WebcrawlerConnector.DocumentURLFilter filter) |
| void | processDocuments(String[] documentIdentifiers, IExistingVersions statuses, Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority): Process a set of documents. |
| String | processSpecificationPost(IPostParameters variableContext, Locale locale, Specification ds, int connectionSequenceNumber): Process a specification post. |
| protected static List<String> | stringToArray(String input): Read a string as a sequence of individual expressions, urls, etc. |
| void | viewConfiguration(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters): View configuration. |
| void | viewSpecification(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber): View specification. |

Methods inherited from class BaseRepositoryConnector: addSeedDocuments, addSeedDocuments, addSeedDocuments, getDocumentIdentifiers, getDocumentIdentifiers, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getRemainingDocumentIdentifiers, outputSpecificationBody, outputSpecificationBody, outputSpecificationHeader, outputSpecificationHeader, outputSpecificationHeader, processDocuments, processDocuments, processDocuments, processDocuments, processSpecificationPost, processSpecificationPost, releaseDocumentVersions, releaseDocumentVersions, requestInfo, viewSpecification, viewSpecification
Methods inherited from class BaseConnector: connect, getConfiguration, isConnected, outputConfigurationBody, outputConfigurationHeader, outputConfigurationHeader, pack, packFixedList, packList, packList, processConfigurationPost, setThreadContext, unpack, unpackFixedList, unpackList, viewConfiguration
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface IConnector: connect, getConfiguration, isConnected, setThreadContext

public static final String _rcsid
protected static final int RESULTSTATUS_FALSE
protected static final int RESULTSTATUS_TRUE
protected static final int RESULTSTATUS_NOTYETDETERMINED
protected static final String[] interestingMimeTypeArray
protected static final int ROBOTS_NONE
protected static final int ROBOTS_DATA
protected static final int ROBOTS_ALL
public static final String REL_LINK
public static final String REL_REDIRECT
public static final String ACTIVITY_FETCH
public static final String ACTIVITY_PROCESS
public static final String ACTIVITY_ROBOTSPARSE
public static final String ACTIVITY_LOGON_START
public static final String ACTIVITY_LOGON_END
protected static final String FETCH_ROBOTS
protected static final String FETCH_STANDARD
protected static final String FETCH_LOGIN
protected int robotsUsage
protected String userAgent
protected String from
protected int connectionTimeoutMilliseconds
protected int socketTimeoutMilliseconds
protected String throttleGroupName
protected ThrottleDescription throttleDescription
protected CredentialsDescription credentialsDescription
protected TrustsDescription trustsDescription
protected RobotsManager robotsManager
protected DNSManager dnsManager
protected CookieManager cookieManager
protected boolean isInitialized
protected static DataCache cache
protected String proxyHost
protected int proxyPort
protected String proxyAuthDomain
protected String proxyAuthUsername
protected String proxyAuthPassword
protected static final int SESSIONSTATE_NORMAL
protected static final int SESSIONSTATE_LOGIN
protected static final int RESULT_NO_DOCUMENT
protected static final int RESULT_NO_VERSION
protected static final int RESULT_VERSION_NEEDED
protected static final int RESULT_RETRY_DOCUMENT

public int getConnectorModel()
Specified by: getConnectorModel in interface IRepositoryConnector
Overrides: getConnectorModel in class BaseRepositoryConnector

public void install(IThreadContext threadContext) throws ManifoldCFException
Specified by: install in interface IConnector
Overrides: install in class BaseConnector
Parameters: threadContext - is the current thread context.
Throws: ManifoldCFException

public void deinstall(IThreadContext threadContext) throws ManifoldCFException
Specified by: deinstall in interface IConnector
Overrides: deinstall in class BaseConnector
Parameters: threadContext - is the current thread context.
Throws: ManifoldCFException

public String[] getActivitiesList()
Specified by: getActivitiesList in interface IRepositoryConnector
Overrides: getActivitiesList in class BaseRepositoryConnector

public String[] getRelationshipTypes()
Specified by: getRelationshipTypes in interface IRepositoryConnector
Overrides: getRelationshipTypes in class BaseRepositoryConnector

public void clearThreadContext()
Specified by: clearThreadContext in interface IConnector
Overrides: clearThreadContext in class BaseConnector

protected void getSession() throws ManifoldCFException
Throws: ManifoldCFException

public void poll() throws ManifoldCFException
Specified by: poll in interface IConnector
Overrides: poll in class BaseConnector
Throws: ManifoldCFException

public String check() throws ManifoldCFException
Specified by: check in interface IConnector
Overrides: check in class BaseConnector
Throws: ManifoldCFException

public void disconnect() throws ManifoldCFException
Specified by: disconnect in interface IConnector
Overrides: disconnect in class BaseConnector
Throws: ManifoldCFException

public String[] getBinNames(String documentIdentifier)
Specified by: getBinNames in interface IRepositoryConnector
Overrides: getBinNames in class BaseRepositoryConnector
Parameters: documentIdentifier - is the document identifier.

public String addSeedDocuments(ISeedingActivity activities, Specification spec, String lastSeedVersion, long seedTime, int jobMode) throws ManifoldCFException, ServiceInterruption
Specified by: addSeedDocuments in interface IRepositoryConnector
Overrides: addSeedDocuments in class BaseRepositoryConnector
Parameters:
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is a document specification (that comes from the job).
seedTime - is the end of the time range of documents to consider, exclusive.
lastSeedVersion - is the last seeding version string for this job, or null if the job has no previous seeding version string.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
Throws: ManifoldCFException, ServiceInterruption

public void processDocuments(String[] documentIdentifiers, IExistingVersions statuses, Specification spec, IProcessActivity activities, int jobMode, boolean usesDefaultAuthority) throws ManifoldCFException, ServiceInterruption
Specified by: processDocuments in interface IRepositoryConnector
Overrides: processDocuments in class BaseRepositoryConnector
Parameters:
documentIdentifiers - is the set of document identifiers to process.
statuses - are the currently-stored document versions for each document in the set of document identifiers passed in above.
activities - is the interface this method should use to queue up new document references and ingest documents.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
usesDefaultAuthority - will be true only if the authority in use for these documents is the default one.
Throws: ManifoldCFException, ServiceInterruption

protected void loginAndFetch(WebcrawlerConnector.FetchStatus fetchStatus, IProcessActivity activities, String documentIdentifier, SequenceCredentials sessionCredential, String globalSequenceEvent) throws ManifoldCFException, ServiceInterruption
protected void processDocument(IProcessActivity activities, String documentIdentifier, String versionString, boolean indexDocument, Map<String,Set<String>> metaHash, Map<String,Set<String>> metaHash2, String[] acls, WebcrawlerConnector.DocumentURLFilter filter) throws ManifoldCFException, ServiceInterruption
protected static void handleIOException(IOException e, String context) throws ManifoldCFException, ServiceInterruption
public int getMaxDocumentRequest()
Specified by: getMaxDocumentRequest in interface IRepositoryConnector
Overrides: getMaxDocumentRequest in class BaseRepositoryConnector

public void outputConfigurationHeader(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters, List<String> tabsArray) throws ManifoldCFException, IOException
Specified by: outputConfigurationHeader in interface IConnector
Overrides: outputConfigurationHeader in class BaseConnector
Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
Throws: ManifoldCFException, IOException

public void outputConfigurationBody(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters, String tabName) throws ManifoldCFException, IOException
Specified by: outputConfigurationBody in interface IConnector
Overrides: outputConfigurationBody in class BaseConnector
Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabName - is the current tab name.
Throws: ManifoldCFException, IOException

public String processConfigurationPost(IThreadContext threadContext, IPostParameters variableContext, Locale locale, ConfigParams parameters) throws ManifoldCFException
Specified by: processConfigurationPost in interface IConnector
Overrides: processConfigurationPost in class BaseConnector
Parameters:
threadContext - is the local thread context.
variableContext - is the set of variables available from the post, including binary file post information.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
Throws: ManifoldCFException

public void viewConfiguration(IThreadContext threadContext, IHTTPOutput out, Locale locale, ConfigParams parameters) throws ManifoldCFException, IOException
Specified by: viewConfiguration in interface IConnector
Overrides: viewConfiguration in class BaseConnector
Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
Throws: ManifoldCFException, IOException

public String getFormCheckJavascriptMethodName(int connectionSequenceNumber)
Specified by: getFormCheckJavascriptMethodName in interface IRepositoryConnector
Overrides: getFormCheckJavascriptMethodName in class BaseRepositoryConnector
Parameters: connectionSequenceNumber - is the unique number of this connection within the job.

public String getFormPresaveCheckJavascriptMethodName(int connectionSequenceNumber)
Specified by: getFormPresaveCheckJavascriptMethodName in interface IRepositoryConnector
Overrides: getFormPresaveCheckJavascriptMethodName in class BaseRepositoryConnector
Parameters: connectionSequenceNumber - is the unique number of this connection within the job.

public void outputSpecificationHeader(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber, List<String> tabsArray) throws ManifoldCFException, IOException
Specified by: outputSpecificationHeader in interface IRepositoryConnector
Overrides: outputSpecificationHeader in class BaseRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
locale - is the locale the output is preferred to be in.
ds - is the current document specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
Throws: ManifoldCFException, IOException

public void outputSpecificationBody(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber, int actualSequenceNumber, String tabName) throws ManifoldCFException, IOException
Specified by: outputSpecificationBody in interface IRepositoryConnector
Overrides: outputSpecificationBody in class BaseRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
locale - is the locale the output is preferred to be in.
ds - is the current document specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
actualSequenceNumber - is the connection within the job that has currently been selected.
tabName - is the current tab name. (actualSequenceNumber, tabName) form a unique tuple within the job.
Throws: ManifoldCFException, IOException

public String processSpecificationPost(IPostParameters variableContext, Locale locale, Specification ds, int connectionSequenceNumber) throws ManifoldCFException
Specified by: processSpecificationPost in interface IRepositoryConnector
Overrides: processSpecificationPost in class BaseRepositoryConnector
Parameters:
variableContext - contains the post data, including binary file-upload information.
locale - is the locale the output is preferred to be in.
ds - is the current document specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
Throws: ManifoldCFException

public void viewSpecification(IHTTPOutput out, Locale locale, Specification ds, int connectionSequenceNumber) throws ManifoldCFException, IOException
Specified by: viewSpecification in interface IRepositoryConnector
Overrides: viewSpecification in class BaseRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
locale - is the locale the output is preferred to be in.
ds - is the current document specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
Throws: ManifoldCFException, IOException

protected String makeSessionLoginEventName(INamingActivity activities, String sequenceKey)
protected String makeDNSEventName(INamingActivity activities, String hostNameKey)
protected int lookupIPAddress(String documentIdentifier, IVersionActivity activities, String hostName, long currentTime, StringBuilder ipAddressBuffer) throws ManifoldCFException, ServiceInterruption
Throws: ManifoldCFException, ServiceInterruption

protected static String makeRobotsKey(String protocol, String hostName, int port)
protected String makeRobotsEventName(INamingActivity versionActivities, String robotsKey)
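
The page gives only the signature of makeRobotsKey(String protocol, String hostName, int port), not the key format. As an illustration of why all three arguments matter (robots.txt rules are scoped to one scheme, host, and port), here is a minimal sketch. The class name, the `protocol:host:port` layout, and the default-port handling are all assumptions for this example, not the connector's actual behavior.

```java
import java.util.Locale;

// Hypothetical helper; not part of WebcrawlerConnector.
final class RobotsKeySketch {
    /**
     * Builds a cache key that is identical for equivalent spellings of the
     * same origin. A port of -1 is treated as "use the scheme default",
     * mirroring what java.net.URL.getPort() returns when no port is given.
     */
    static String robotsKey(String protocol, String hostName, int port) {
        String scheme = protocol.toLowerCase(Locale.ROOT);
        int effectivePort = port;
        if (effectivePort == -1) {
            effectivePort = scheme.equals("https") ? 443 : 80;
        }
        // Assumed key layout: scheme, host, and port joined by ':'.
        return scheme + ":" + hostName.toLowerCase(Locale.ROOT) + ":" + effectivePort;
    }

    public static void main(String[] args) {
        System.out.println(robotsKey("HTTP", "Example.COM", -1));
        System.out.println(robotsKey("https", "example.com", 8443));
    }
}
```

Normalizing case and filling in the default port means `http://Example.COM/` and `http://example.com:80/` would share one cached robots entry under this scheme.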
protected int checkFetchAllowed(String documentIdentifier, String protocol, String hostIPAddress, int port, PageCredentials credential, IKeystoreManager trustStore, String hostName, String[] binNames, long currentTime, String pathString, IVersionActivity versionActivities, int connectionLimit, String proxyHost, int proxyPort, String proxyAuthDomain, String proxyAuthUsername, String proxyAuthPassword) throws ManifoldCFException, ServiceInterruption
Throws: ManifoldCFException, ServiceInterruption

protected String makeDocumentIdentifier(String parentIdentifier, String rawURL, WebcrawlerConnector.DocumentURLFilter filter) throws ManifoldCFException
Parameters:
parentIdentifier - the identifier of the document in which the raw url was found, or null if none.
rawURL - the starting, un-normalized, un-canonicalized URL.
filter - the filter object, used to remove unmatching URLs.
Throws: ManifoldCFException

protected String doCanonicalization(WebcrawlerConnector.DocumentURLFilter filter, WebURL url) throws ManifoldCFException, URISyntaxException
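
To make the parentIdentifier and rawURL parameters of makeDocumentIdentifier concrete, the following is a rough sketch of resolving a raw link against its parent document with the standard java.net.URI class. It is not the connector's implementation: the class and method names are hypothetical, the DocumentURLFilter step is left out, and returning null for rejected URLs, accepting only http and https, and dropping the fragment are assumptions made for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper; not part of WebcrawlerConnector.
final class DocumentIdentifierSketch {
    /**
     * Resolves rawURL against parentIdentifier (which may be null) and
     * returns an absolute http/https URL without a fragment, or null when
     * the input cannot be turned into one.
     */
    static String toIdentifier(String parentIdentifier, String rawURL) {
        try {
            URI raw = new URI(rawURL.trim());
            URI resolved = (parentIdentifier == null)
                ? raw
                : new URI(parentIdentifier).resolve(raw);
            String scheme = resolved.getScheme();
            if (scheme == null
                || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
                return null; // relative without a parent, or a non-web scheme
            }
            // normalize() removes "." and ".." path segments.
            String text = resolved.normalize().toString();
            int hash = text.indexOf('#');
            return hash < 0 ? text : text.substring(0, hash);
        } catch (URISyntaxException e) {
            return null; // assumed policy: illegal URLs are silently skipped
        }
    }

    public static void main(String[] args) {
        System.out.println(toIdentifier("http://example.com/a/b.html", "../c/d.html#top"));
        System.out.println(toIdentifier(null, "mailto:someone@example.com"));
    }
}
```

A real canonicalization step would typically go further, for example reordering or removing query arguments according to configured rules.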
protected boolean isContentInteresting(IFingerprintActivity activities, String documentIdentifier, int response, String contentType) throws ServiceInterruption, ManifoldCFException
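
isContentInteresting decides on the basis of a response code and a raw Content-Type header value, and the method summary also lists extractMimeType(String contentType) and extractEncoding(String contentType). One plausible shape for that kind of logic is sketched below. Everything here is an assumption for illustration: the class and method names, the contents of the mime type set (the actual interestingMimeTypeArray values are not shown on this page), and the rule that only a 200 response counts.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of WebcrawlerConnector.
final class ContentTypeSketch {
    // Assumed example values only.
    static final Set<String> INTERESTING = new HashSet<>(Arrays.asList(
        "text/html", "application/xhtml+xml", "application/rss+xml", "text/xml"));

    /** Returns the lower-cased media type before any ';' parameters, or null. */
    static String mimeType(String contentType) {
        if (contentType == null) return null;
        int semi = contentType.indexOf(';');
        String type = (semi < 0 ? contentType : contentType.substring(0, semi)).trim();
        return type.isEmpty() ? null : type.toLowerCase(Locale.ROOT);
    }

    /** Returns the value of the charset parameter, or null if there is none. */
    static String encoding(String contentType) {
        if (contentType == null) return null;
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Assumed rule: only a 200 response with a known mime type is worth parsing. */
    static boolean isInteresting(int responseCode, String contentType) {
        String mime = mimeType(contentType);
        return responseCode == 200 && mime != null && INTERESTING.contains(mime);
    }

    public static void main(String[] args) {
        String header = "Text/HTML; charset=\"UTF-8\"";
        System.out.println(mimeType(header) + " / " + encoding(header));
        System.out.println(isInteresting(200, header));
    }
}
```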
protected String documentIdentifiertoFileName(String documentIdentifier) throws URISyntaxException
Parameters: documentIdentifier -
Throws: URISyntaxException

protected String findRedirectionURI(String currentURI) throws ManifoldCFException
Throws: ManifoldCFException

protected FormData findHTMLForm(String currentURI, LoginParameters lp) throws ManifoldCFException
Throws: ManifoldCFException

protected String findPreferredRedirectionURI(String currentURI, LoginParameters lp) throws ManifoldCFException
Throws: ManifoldCFException

protected String findSpecifiedContent(String currentURI, LoginParameters lp) throws ManifoldCFException
Throws: ManifoldCFException

protected String findHTMLLinkURI(String currentURI, LoginParameters lp) throws ManifoldCFException
Throws: ManifoldCFException

protected boolean extractLinks(String documentIdentifier, IProcessActivity activities, WebcrawlerConnector.DocumentURLFilter filter) throws ManifoldCFException, ServiceInterruption
protected void handleRedirects(String documentURI, IRedirectionHandler handler) throws ManifoldCFException
Throws: ManifoldCFException

protected void handleXML(String documentURI, IXMLHandler handler) throws ManifoldCFException, ServiceInterruption

protected void handleHTML(String documentURI, IHTMLHandler handler) throws ManifoldCFException
Throws: ManifoldCFException

protected boolean isDocumentText(String documentURI) throws ManifoldCFException
Throws: ManifoldCFException

protected static boolean isText(byte[] beginChunk, int chunkLength)
protected static boolean isStrange(byte x)
protected static boolean isWhiteSpace(byte x)
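
The three byte-level helpers isText, isStrange, and isWhiteSpace suggest a sniffing heuristic over the first chunk of a document. The page does not say how it works, so the following is only a sketch of one common approach: count control bytes that rarely appear in text and compare against a threshold. The class and method names, the choice of which bytes count as "strange", and the 5% threshold are assumptions for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of WebcrawlerConnector.
final class TextSniffSketch {
    /** Tab, line feed, carriage return, and space. */
    static boolean isWhitespaceByte(byte x) {
        return x == 0x09 || x == 0x0a || x == 0x0d || x == 0x20;
    }

    /**
     * Treats ASCII control bytes other than whitespace as suspicious.
     * Bytes with the high bit set are negative in Java and are allowed,
     * since they may be part of a multi-byte UTF-8 sequence.
     */
    static boolean looksStrange(byte x) {
        return x >= 0 && x < 0x20 && !isWhitespaceByte(x);
    }

    /** Assumed rule: text if at most 5% of the inspected bytes are strange. */
    static boolean looksLikeText(byte[] beginChunk, int chunkLength) {
        int n = Math.min(chunkLength, beginChunk.length);
        if (n <= 0) return true; // nothing to object to
        int strange = 0;
        for (int i = 0; i < n; i++) {
            if (looksStrange(beginChunk[i])) strange++;
        }
        return strange * 20 <= n;
    }

    public static void main(String[] args) {
        byte[] html = "<html>\n\t<body>caf\u00e9</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {0x00, 0x01, 0x02, 0x50, 0x4b, 0x03, 0x04, 0x00};
        System.out.println(looksLikeText(html, html.length));     // true
        System.out.println(looksLikeText(binary, binary.length)); // false
    }
}
```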
protected static List<String> stringToArray(String input)
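
stringToArray and compileList (signature just below) appear to form a pipeline: a block of configured text is split into individual expressions, and those are then compiled into Pattern objects. A minimal sketch of that pipeline using only java.util.regex follows. The class and method names are hypothetical, and the newline separator, the skipping of blank lines and `#` comment lines, and the IllegalArgumentException (standing in for ManifoldCFException) are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper; not part of WebcrawlerConnector.
final class PatternListSketch {
    /** Splits on line breaks, trimming and skipping blank and '#' lines. */
    static List<String> splitLines(String input) {
        List<String> out = new ArrayList<>();
        if (input == null) return out;
        for (String line : input.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) continue;
            out.add(trimmed);
        }
        return out;
    }

    /** Compiles every entry of input and appends the result to output. */
    static void compileAll(List<Pattern> output, List<String> input) {
        for (String expression : input) {
            try {
                output.add(Pattern.compile(expression));
            } catch (PatternSyntaxException e) {
                throw new IllegalArgumentException(
                    "Bad regular expression '" + expression + "': " + e.getDescription(), e);
            }
        }
    }

    /** True if any pattern is found somewhere in the URL. */
    static boolean matchesAny(List<Pattern> patterns, String url) {
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String config = "# include rules\n^https://example\\.com/\n\n\\.pdf$\n";
        List<Pattern> patterns = new ArrayList<>();
        compileAll(patterns, splitLines(config));
        System.out.println(matchesAny(patterns, "https://example.com/docs/"));
        System.out.println(matchesAny(patterns, "http://other.org/index.html"));
    }
}
```

Compiling once and reusing the Pattern list avoids recompiling the same expressions for every URL that is checked.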
protected static void compileList(List<Pattern> output, List<String> input) throws ManifoldCFException
Throws: ManifoldCFException

protected PageCredentials getPageCredential(String documentIdentifier)

protected SequenceCredentials getSequenceCredential(String documentIdentifier)

protected IKeystoreManager getTrustStore(String documentIdentifier) throws ManifoldCFException
Throws: ManifoldCFException

protected static String[] getAcls(Specification spec)
Parameters: spec - is the document specification.

protected static List<WebcrawlerConnector.NameValue> findMetadata(Specification spec) throws ManifoldCFException
Throws: ManifoldCFException

protected static Set<String> findExcludedHeaders(Specification spec) throws ManifoldCFException
Throws: ManifoldCFException

protected String[] calculateDocumentEvents(INamingActivity activities, String documentIdentifier)