public class TikaExtractor extends BaseTransformationConnector
| Modifier and Type | Class and Description |
|---|---|
| protected static interface | TikaExtractor.DestinationStorage |
| protected static class | TikaExtractor.FileDestinationStorage |
| protected static class | TikaExtractor.MemoryDestinationStorage |
| protected static class | TikaExtractor.SpecPacker |
| Modifier and Type | Field and Description |
|---|---|
| static String | _rcsid |
| protected static String[] | activitiesList |
| protected static String | ACTIVITY_EXTRACT |
| protected static long | inMemoryMaximumFile: We handle up to 64K in memory; after that we go to disk. |
Inherited fields:
currentContext, params
DOCUMENTSTATUS_ACCEPTED, DOCUMENTSTATUS_REJECTED

| Constructor and Description |
|---|
| TikaExtractor() |
| Modifier and Type | Method and Description |
|---|---|
| int | addOrReplaceDocumentWithException(String documentURI, VersionContext pipelineDescription, RepositoryDocument document, String authorityNameString, IOutputAddActivity activities): Add (or replace) a document in the output data store using the connector. |
| boolean | checkDocumentIndexable(VersionContext pipelineDescription, File localFile, IOutputCheckActivity checkActivity): Pre-determine whether a document (passed here as a File object) is acceptable or not. |
| boolean | checkLengthIndexable(VersionContext pipelineDescription, long length, IOutputCheckActivity checkActivity): Pre-determine whether a document's length is acceptable. |
| boolean | checkMimeTypeIndexable(VersionContext pipelineDescription, String mimeType, IOutputCheckActivity checkActivity): Detect if a mime type is acceptable or not. |
| protected static void | fillInBoilerplateSpecificationMap(Map<String,Object> paramMap, Specification os) |
| protected static void | fillInExceptionsSpecificationMap(Map<String,Object> paramMap, Specification os) |
| protected static void | fillInFieldMappingSpecificationMap(Map<String,Object> paramMap, Specification os) |
| String[] | getActivitiesList(): Return a list of activities that this connector generates. |
| String | getFormCheckJavascriptMethodName(int connectionSequenceNumber): Obtain the name of the form check javascript method to call. |
| String | getFormPresaveCheckJavascriptMethodName(int connectionSequenceNumber): Obtain the name of the form presave check javascript method to call. |
| VersionContext | getPipelineDescription(Specification os): Get an output version string, given an output specification. |
| protected static int | handleIOException(IOException e) |
| protected static int | handleSaxException(SAXException e) |
| protected static int | handleTikaException(org.apache.tika.exception.TikaException e) |
| void | outputSpecificationBody(IHTTPOutput out, Locale locale, Specification os, int connectionSequenceNumber, int actualSequenceNumber, String tabName): Output the specification body section. |
| void | outputSpecificationHeader(IHTTPOutput out, Locale locale, Specification os, int connectionSequenceNumber, List<String> tabsArray): Output the specification header section. |
| String | processSpecificationPost(IPostParameters variableContext, Locale locale, Specification os, int connectionSequenceNumber): Process a specification post. |
| void | viewSpecification(IHTTPOutput out, Locale locale, Specification os, int connectionSequenceNumber): View specification. |
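
The check methods listed above each receive a check activity object, which suggests that a pipeline stage can ask the following stage whether a document would be accepted before doing any work. The following is a minimal sketch of that idea, not the connector's implementation. `CheckActivity`, `SimpleDownstream`, and `precheck` are hypothetical, simplified stand-ins and are not part of the ManifoldCF API.

```java
import java.util.Set;

// Hypothetical sketch: names and behavior are assumptions, not ManifoldCF code.
class PrecheckSketch {
  // Simplified stand-in for a downstream check activity.
  interface CheckActivity {
    boolean checkMimeTypeIndexable(String mimeType);
    boolean checkLengthIndexable(long length);
  }

  // A toy downstream stage that accepts a fixed set of mime types up to a maximum length.
  static final class SimpleDownstream implements CheckActivity {
    private final Set<String> acceptedMimeTypes;
    private final long maximumLength;

    SimpleDownstream(Set<String> acceptedMimeTypes, long maximumLength) {
      this.acceptedMimeTypes = acceptedMimeTypes;
      this.maximumLength = maximumLength;
    }

    @Override
    public boolean checkMimeTypeIndexable(String mimeType) {
      return acceptedMimeTypes.contains(mimeType);
    }

    @Override
    public boolean checkLengthIndexable(long length) {
      return length <= maximumLength;
    }
  }

  // Run both checks against the next stage; the length check is skipped
  // when the mime type is already rejected.
  static boolean precheck(String mimeType, long length, CheckActivity next) {
    return next.checkMimeTypeIndexable(mimeType) && next.checkLengthIndexable(length);
  }
}
```

A caller could skip fetching or parsing a document entirely when `precheck` returns false.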
Inherited methods, grouped by declaring type:
checkDateIndexable, checkURLIndexable, requestInfo
check, clearThreadContext, connect, deinstall, disconnect, getConfiguration, install, isConnected, outputConfigurationBody, outputConfigurationBody, outputConfigurationHeader, outputConfigurationHeader, outputConfigurationHeader, pack, packFixedList, packList, packList, poll, processConfigurationPost, processConfigurationPost, setThreadContext, unpack, unpackFixedList, unpackList, viewConfiguration, viewConfiguration
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
check, clearThreadContext, connect, deinstall, disconnect, getConfiguration, install, isConnected, outputConfigurationBody, outputConfigurationHeader, poll, processConfigurationPost, setThreadContext, viewConfiguration

public static final String _rcsid
protected static final String ACTIVITY_EXTRACT
protected static final String[] activitiesList
protected static final long inMemoryMaximumFile
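
The description of inMemoryMaximumFile (up to 64K in memory, disk after that), together with the nested DestinationStorage, MemoryDestinationStorage, and FileDestinationStorage types, points to a size-based choice of buffer. The sketch below illustrates that pattern under stated assumptions. The `Destination` interface, its methods, the inclusive 64K boundary, and the `choose` helper are all hypothetical and are not taken from the connector's source.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;

// Hypothetical sketch: these types only mirror the idea behind the nested classes.
class StorageSketch {
  // Assumed threshold of 64K, taken from the field description.
  static final long IN_MEMORY_MAXIMUM_FILE = 64L * 1024L;

  interface Destination extends AutoCloseable {
    OutputStream getOutputStream() throws IOException;
    long length() throws IOException;
    boolean inMemory();
    @Override
    void close() throws IOException;
  }

  // Keeps extracted content in a byte array.
  static final class MemoryDestination implements Destination {
    private final ByteArrayOutputStream buffer;
    MemoryDestination(int sizeHint) { buffer = new ByteArrayOutputStream(sizeHint); }
    public OutputStream getOutputStream() { return buffer; }
    public long length() { return buffer.size(); }
    public boolean inMemory() { return true; }
    public void close() { /* nothing to release */ }
  }

  // Spills extracted content to a temporary file that is deleted on close.
  static final class FileDestination implements Destination {
    private final File file;
    private final OutputStream out;
    FileDestination() throws IOException {
      file = Files.createTempFile("extract", ".tmp").toFile();
      out = new FileOutputStream(file);
    }
    public OutputStream getOutputStream() { return out; }
    public long length() throws IOException { out.flush(); return file.length(); }
    public boolean inMemory() { return false; }
    public void close() throws IOException {
      out.close();
      Files.deleteIfExists(file.toPath());
    }
  }

  // Pick memory for known small inputs, disk for large or unknown (negative) lengths.
  static Destination choose(long expectedLength) throws IOException {
    if (expectedLength >= 0 && expectedLength <= IN_MEMORY_MAXIMUM_FILE) {
      return new MemoryDestination((int) expectedLength);
    }
    return new FileDestination();
  }
}
```

Treating an unknown length as "large" is a conservative design choice in this sketch: it avoids unbounded heap use at the cost of disk I/O for some small documents.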
public String[] getActivitiesList()
Specified by: getActivitiesList in interface ITransformationConnector
Overrides: getActivitiesList in class BaseTransformationConnector

public VersionContext getPipelineDescription(Specification os) throws ManifoldCFException, ServiceInterruption
Specified by: getPipelineDescription in interface IPipelineConnector
Overrides: getPipelineDescription in class BaseTransformationConnector
Parameters:
os - is the current output specification for the job that is doing the crawling.
Throws: ManifoldCFException, ServiceInterruption

public boolean checkMimeTypeIndexable(VersionContext pipelineDescription, String mimeType, IOutputCheckActivity checkActivity) throws ManifoldCFException, ServiceInterruption
Specified by: checkMimeTypeIndexable in interface IPipelineConnector
Overrides: checkMimeTypeIndexable in class BaseTransformationConnector
Parameters:
pipelineDescription - is the document's pipeline version string, for this connection.
mimeType - is the mime type of the document.
checkActivity - is an object including the activities that can be performed by this method.
Throws: ManifoldCFException, ServiceInterruption

public boolean checkDocumentIndexable(VersionContext pipelineDescription, File localFile, IOutputCheckActivity checkActivity) throws ManifoldCFException, ServiceInterruption
Specified by: checkDocumentIndexable in interface IPipelineConnector
Overrides: checkDocumentIndexable in class BaseTransformationConnector
Parameters:
pipelineDescription - is the document's pipeline version string, for this connection.
localFile - is the local file to check.
checkActivity - is an object including the activities that can be done by this method.
Throws: ManifoldCFException, ServiceInterruption

public boolean checkLengthIndexable(VersionContext pipelineDescription, long length, IOutputCheckActivity checkActivity) throws ManifoldCFException, ServiceInterruption
Specified by: checkLengthIndexable in interface IPipelineConnector
Overrides: checkLengthIndexable in class BaseTransformationConnector
Parameters:
pipelineDescription - is the document's pipeline version string, for this connection.
length - is the length of the document.
checkActivity - is an object including the activities that can be done by this method.
Throws: ManifoldCFException, ServiceInterruption

public int addOrReplaceDocumentWithException(String documentURI, VersionContext pipelineDescription, RepositoryDocument document, String authorityNameString, IOutputAddActivity activities) throws ManifoldCFException, ServiceInterruption, IOException
Specified by: addOrReplaceDocumentWithException in interface IPipelineConnector
Overrides: addOrReplaceDocumentWithException in class BaseTransformationConnector
Parameters:
documentURI - is the URI of the document. The URI is presumed to be the unique identifier which the output data store will use to process and serve the document. This URI is constructed by the repository connector which fetches the document, and is thus universal across all output connectors.
pipelineDescription - is the description that was constructed for this document by the getPipelineDescription() method.
document - is the document data to be processed (handed to the output data store).
authorityNameString - is the name of the authority responsible for authorizing any access tokens passed in with the repository document. May be null.
activities - is the handle to an object that the implementer of a pipeline connector may use to perform operations, such as logging processing activity, or sending a modified document to the next stage in the pipeline.
Throws:
IOException - only if there's a stream error reading the document data.
ManifoldCFException
ServiceInterruption

public String getFormCheckJavascriptMethodName(int connectionSequenceNumber)
Specified by: getFormCheckJavascriptMethodName in interface IPipelineConnector
Overrides: getFormCheckJavascriptMethodName in class BaseTransformationConnector
Parameters:
connectionSequenceNumber - is the unique number of this connection within the job.

public String getFormPresaveCheckJavascriptMethodName(int connectionSequenceNumber)
Specified by: getFormPresaveCheckJavascriptMethodName in interface IPipelineConnector
Overrides: getFormPresaveCheckJavascriptMethodName in class BaseTransformationConnector
Parameters:
connectionSequenceNumber - is the unique number of this connection within the job.

public void outputSpecificationHeader(IHTTPOutput out, Locale locale, Specification os, int connectionSequenceNumber, List<String> tabsArray) throws ManifoldCFException, IOException
Specified by: outputSpecificationHeader in interface IPipelineConnector
Overrides: outputSpecificationHeader in class BaseTransformationConnector
Parameters:
out - is the output to which any HTML should be sent.
locale - is the preferred locale of the output.
os - is the current pipeline specification for this connection.
connectionSequenceNumber - is the unique number of this connection within the job.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
Throws: ManifoldCFException, IOException

public void outputSpecificationBody(IHTTPOutput out, Locale locale, Specification os, int connectionSequenceNumber, int actualSequenceNumber, String tabName) throws ManifoldCFException, IOException
Specified by: outputSpecificationBody in interface IPipelineConnector
Overrides: outputSpecificationBody in class BaseTransformationConnector
Parameters:
out - is the output to which any HTML should be sent.
locale - is the preferred locale of the output.
os - is the current pipeline specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
actualSequenceNumber - is the connection within the job that has currently been selected.
tabName - is the current tab name.
Throws: ManifoldCFException, IOException

public String processSpecificationPost(IPostParameters variableContext, Locale locale, Specification os, int connectionSequenceNumber) throws ManifoldCFException
Specified by: processSpecificationPost in interface IPipelineConnector
Overrides: processSpecificationPost in class BaseTransformationConnector
Parameters:
variableContext - contains the post data, including binary file-upload information.
locale - is the preferred locale of the output.
os - is the current pipeline specification for this job.
connectionSequenceNumber - is the unique number of this connection within the job.
Throws: ManifoldCFException

public void viewSpecification(IHTTPOutput out, Locale locale, Specification os, int connectionSequenceNumber) throws ManifoldCFException, IOException
Specified by: viewSpecification in interface IPipelineConnector
Overrides: viewSpecification in class BaseTransformationConnector
Parameters:
out - is the output to which any HTML should be sent.
locale - is the preferred locale of the output.
connectionSequenceNumber - is the unique number of this connection within the job.
os - is the current pipeline specification for this job.
Throws: ManifoldCFException, IOException

protected static void fillInFieldMappingSpecificationMap(Map<String,Object> paramMap, Specification os)
protected static void fillInExceptionsSpecificationMap(Map<String,Object> paramMap, Specification os)
protected static void fillInBoilerplateSpecificationMap(Map<String,Object> paramMap, Specification os)
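
The three fillIn...SpecificationMap helpers take a parameter map and a specification but carry no description on this page. A plausible reading is that they copy values out of the specification into a map that a UI template can render. The sketch below shows that pattern only. The `Node` type, the "fieldmap" node type, the "source" and "target" attribute names, and the "FIELDMAPPINGS" key are hypothetical choices for illustration and are not confirmed by this page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a flat list of nodes stands in for a Specification.
class SpecMapSketch {
  // Simplified stand-in for one specification node.
  static final class Node {
    final String type;
    final Map<String, String> attributes;
    Node(String type, Map<String, String> attributes) {
      this.type = type;
      this.attributes = attributes;
    }
  }

  // Collect every field-mapping node into a list of rows and publish it under one key.
  static void fillInFieldMappings(Map<String, Object> paramMap, List<Node> spec) {
    List<Map<String, String>> mappings = new ArrayList<>();
    for (Node node : spec) {
      if ("fieldmap".equals(node.type)) {
        Map<String, String> row = new HashMap<>();
        row.put("source", node.attributes.get("source"));
        row.put("target", node.attributes.get("target"));
        mappings.add(row);
      }
    }
    // The key is always set, so a template sees an empty list rather than a missing value.
    paramMap.put("FIELDMAPPINGS", mappings);
  }
}
```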
protected static int handleTikaException(org.apache.tika.exception.TikaException e) throws IOException, ManifoldCFException, ServiceInterruption
protected static int handleSaxException(SAXException e) throws IOException, ManifoldCFException, ServiceInterruption
protected static int handleIOException(IOException e) throws ManifoldCFException
Throws: ManifoldCFException
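
The three handlers above each turn an exception into an int, which is consistent with returning a document status such as DOCUMENTSTATUS_REJECTED. This page does not say what they return or when they rethrow, so the following is only a sketch of one common approach. The constant values, the `InterruptedProcessing` exception, and the rule that a thread interruption is rethrown while other I/O failures reject the document are all assumptions.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

// Hypothetical sketch: constants and exception type are stand-ins, not ManifoldCF code.
class ExceptionHandlingSketch {
  static final int STATUS_ACCEPTED = 0; // assumed value
  static final int STATUS_REJECTED = 1; // assumed value

  // Stand-in for a framework exception that signals an interrupted worker thread.
  static final class InterruptedProcessing extends Exception {
    InterruptedProcessing(String message, Throwable cause) { super(message, cause); }
  }

  // A parser failure concerns only this document, so log it and reject the document.
  static int handleParseFailure(Exception e) {
    System.err.println("Extraction failed: " + e.getMessage());
    return STATUS_REJECTED;
  }

  // An interruption must propagate; any other I/O failure rejects the document.
  // SocketTimeoutException is a subclass of InterruptedIOException, so it is excluded explicitly.
  static int handleIOException(IOException e) throws InterruptedProcessing {
    if (e instanceof InterruptedIOException && !(e instanceof SocketTimeoutException)) {
      throw new InterruptedProcessing("Interrupted: " + e.getMessage(), e);
    }
    System.err.println("I/O failure during extraction: " + e.getMessage());
    return STATUS_REJECTED;
  }
}
```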