Package org.apache.nutch.tools
Class CommonCrawlFormatWARC
java.lang.Object
  org.apache.nutch.tools.AbstractCommonCrawlFormat
    org.apache.nutch.tools.CommonCrawlFormatWARC
All Implemented Interfaces:
Closeable, AutoCloseable, CommonCrawlFormat
public class CommonCrawlFormatWARC extends AbstractCommonCrawlFormat
-
Field Summary
Fields
Modifier and Type    Field
static String        MAX_WARC_FILE_SIZE
static String        TEMPLATE
Fields inherited from class org.apache.nutch.tools.AbstractCommonCrawlFormat
conf, content, inLinks, jsonArray, keyPrefix, LOG, metadata, reverseKey, reverseKeyValue, simpleDateFormat, url
-
-
Constructor Summary
Constructors
Constructor
CommonCrawlFormatWARC(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config, ParseData parseData)
CommonCrawlFormatWARC(Configuration nutchConf, CommonCrawlConfig config)
-
Method Summary
All Methods  Instance Methods  Concrete Methods
Modifier and Type    Method and Description
void                 close()
                     Optional method that could be implemented if the actual format needs some close procedure.
protected void       closeArray(String key, boolean nested, boolean newline)
protected void       closeObject(String key)
protected String     generateJson()
String               getJsonData()
                     Get a string representation of the JSON structure of the URL content.
String               getJsonData(String url, Content content, Metadata metadata, ParseData parseData)
                     Returns a string representation of the JSON structure of the URL content.
protected void       startArray(String key, boolean nested, boolean newline)
protected void       startObject(String key)
protected void       writeArrayValue(String value)
protected void       writeKeyNull(String key)
protected void       writeKeyValue(String key, String value)
protected URI        writeRequest(URI id)
protected URI        writeResponse()
Methods inherited from class org.apache.nutch.tools.AbstractCommonCrawlFormat
getImported, getInLinks, getJsonData, getKey, getMethod, getRequestAccept, getRequestAcceptEncoding, getRequestAcceptLanguage, getRequestContactEmail, getRequestContactName, getRequestHostAddress, getRequestHostName, getRequestRobots, getRequestSoftware, getRequestUserAgent, getResponseAddress, getResponseContent, getResponseContentEncoding, getResponseContentType, getResponseDate, getResponseHostName, getResponseServer, getResponseStatus, getTimestamp, getUrl, setInLinks
-
Field Detail
-
MAX_WARC_FILE_SIZE
public static final String MAX_WARC_FILE_SIZE
- See Also:
- Constant Field Values
-
TEMPLATE
public static final String TEMPLATE
- See Also:
- Constant Field Values
-
Constructor Detail
-
CommonCrawlFormatWARC
public CommonCrawlFormatWARC(Configuration nutchConf, CommonCrawlConfig config) throws IOException
- Throws:
IOException
-
CommonCrawlFormatWARC
public CommonCrawlFormatWARC(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config, ParseData parseData) throws IOException
- Throws:
IOException
-
Method Detail
-
getJsonData
public String getJsonData(String url, Content content, Metadata metadata, ParseData parseData) throws IOException
Description copied from interface: CommonCrawlFormat
Returns a string representation of the JSON structure of the URL content. Takes into consideration the Content, Metadata and ParseData.
- Specified by:
getJsonData in interface CommonCrawlFormat
- Overrides:
getJsonData in class AbstractCommonCrawlFormat
- Parameters:
url - the canonical url
content - url Content
metadata - url Metadata
parseData - url ParseData
- Returns:
the JSON URL content string
- Throws:
IOException - if there is a fatal I/O error obtaining JSON data
-
getJsonData
public String getJsonData() throws IOException
Description copied from interface: CommonCrawlFormat
Get a string representation of the JSON structure of the URL content.
- Specified by:
getJsonData in interface CommonCrawlFormat
- Overrides:
getJsonData in class AbstractCommonCrawlFormat
- Returns:
the JSON URL content string
- Throws:
IOException - if there is a fatal I/O error obtaining JSON data
-
writeResponse
protected URI writeResponse() throws IOException, ParseException
- Throws:
IOException
ParseException
-
writeRequest
protected URI writeRequest(URI id) throws IOException, ParseException
- Throws:
IOException
ParseException
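The signatures above show writeResponse() returning a URI and writeRequest(URI id) accepting one, but this page does not say what that URI identifies. One plausible reading, stated here purely as an assumption, is that it is a WARC record ID used to link a request record to its response. The sketch below illustrates that idea with the standard WARC-Record-ID and WARC-Concurrent-To header fields; RecordIdSketch and its methods are hypothetical names and are not part of Nutch.

```java
import java.net.URI;
import java.util.UUID;

public class RecordIdSketch {

    // Hypothetical helper: builds a urn:uuid record ID, a common choice for WARC records.
    static URI newRecordId() {
        return URI.create("urn:uuid:" + UUID.randomUUID());
    }

    // Hypothetical helper: named header fields identifying a response record.
    static String responseHeader(URI responseId) {
        return "WARC-Type: response\r\n"
                + "WARC-Record-ID: <" + responseId + ">\r\n";
    }

    // Hypothetical helper: a request record that points back at its response record.
    static String requestHeader(URI requestId, URI responseId) {
        return "WARC-Type: request\r\n"
                + "WARC-Record-ID: <" + requestId + ">\r\n"
                + "WARC-Concurrent-To: <" + responseId + ">\r\n";
    }

    public static void main(String[] args) {
        URI responseId = newRecordId();
        System.out.print(responseHeader(responseId));
        System.out.print(requestHeader(newRecordId(), responseId));
    }
}
```

The IDs are random, so the printed values differ on every run; only the linkage between the two records is the point of the sketch.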
-
generateJson
protected String generateJson() throws IOException
- Specified by:
generateJson in class AbstractCommonCrawlFormat
- Throws:
IOException
-
writeKeyValue
protected void writeKeyValue(String key, String value) throws IOException
- Specified by:
writeKeyValue in class AbstractCommonCrawlFormat
- Throws:
IOException
-
writeKeyNull
protected void writeKeyNull(String key) throws IOException
- Specified by:
writeKeyNull in class AbstractCommonCrawlFormat
- Throws:
IOException
-
startArray
protected void startArray(String key, boolean nested, boolean newline) throws IOException
- Specified by:
startArray in class AbstractCommonCrawlFormat
- Throws:
IOException
-
closeArray
protected void closeArray(String key, boolean nested, boolean newline) throws IOException
- Specified by:
closeArray in class AbstractCommonCrawlFormat
- Throws:
IOException
-
writeArrayValue
protected void writeArrayValue(String value) throws IOException
- Specified by:
writeArrayValue in class AbstractCommonCrawlFormat
- Throws:
IOException
-
startObject
protected void startObject(String key) throws IOException
- Specified by:
startObject in class AbstractCommonCrawlFormat
- Throws:
IOException
-
closeObject
protected void closeObject(String key) throws IOException
- Specified by:
closeObject in class AbstractCommonCrawlFormat
- Throws:
IOException
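The protected methods from generateJson through closeObject are all marked "Specified by" AbstractCommonCrawlFormat, which suggests a template-method arrangement: the base class drives the traversal and each format supplies the write hooks. This page does not show how the hooks are invoked, so the following is only a sketch of that general pattern. BaseFormat, TraceFormat, render and finish are hypothetical names, the hook signatures are simplified, and the call order is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HookSketch {

    // Hypothetical base class: render() fixes the call order, subclasses supply the hooks.
    abstract static class BaseFormat {
        final String render(String url, List<String> inLinks) {
            startObject(null);
            writeKeyValue("url", url);
            startArray("inLinks");
            for (String link : inLinks) {
                writeArrayValue(link);
            }
            closeArray("inLinks");
            closeObject(null);
            return finish();
        }

        protected abstract void startObject(String key);
        protected abstract void closeObject(String key);
        protected abstract void startArray(String key);
        protected abstract void closeArray(String key);
        protected abstract void writeKeyValue(String key, String value);
        protected abstract void writeArrayValue(String value);
        protected abstract String finish();
    }

    // Hypothetical subclass: records each hook call instead of emitting a real format.
    static final class TraceFormat extends BaseFormat {
        private final List<String> calls = new ArrayList<>();

        @Override protected void startObject(String key) { calls.add("startObject"); }
        @Override protected void closeObject(String key) { calls.add("closeObject"); }
        @Override protected void startArray(String key) { calls.add("startArray:" + key); }
        @Override protected void closeArray(String key) { calls.add("closeArray:" + key); }
        @Override protected void writeKeyValue(String key, String value) { calls.add("kv:" + key); }
        @Override protected void writeArrayValue(String value) { calls.add("value"); }
        @Override protected String finish() { return String.join(" ", calls); }
    }

    public static void main(String[] args) {
        String trace = new TraceFormat()
                .render("http://example.org/", Arrays.asList("http://a.example/"));
        System.out.println(trace);
    }
}
```

A format that targets a different container would override the same hooks and leave the traversal in the base class untouched.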
-
close
public void close()
Description copied from interface: CommonCrawlFormat
Optional method that could be implemented if the actual format needs some close procedure.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Specified by:
close in interface CommonCrawlFormat
- Overrides:
close in class AbstractCommonCrawlFormat
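Because the class implements Closeable and AutoCloseable, an instance can be managed with try-with-resources so that close() runs even when getJsonData() throws. The sketch below shows the pattern with hypothetical stand-in types (Format and StubFormat) rather than the real Nutch classes, so it compiles without Nutch on the classpath. The stub's behaviour is invented for illustration and says nothing about what CommonCrawlFormatWARC does on close.

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseSketch {

    // Hypothetical stand-in mirroring the two methods used here.
    interface Format extends Closeable {
        String getJsonData() throws IOException;

        @Override
        void close(); // declared without a checked exception, like close() above
    }

    // Hypothetical stub; a real format would flush and release its output here.
    static final class StubFormat implements Format {
        boolean closed = false;

        @Override
        public String getJsonData() throws IOException {
            if (closed) {
                throw new IOException("format already closed");
            }
            return "{}";
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // try-with-resources guarantees close() is called before this method returns.
    static String export(Format format) throws IOException {
        try (Format f = format) {
            return f.getJsonData();
        }
    }

    public static void main(String[] args) throws IOException {
        StubFormat stub = new StubFormat();
        System.out.println(export(stub));
        System.out.println("closed: " + stub.closed);
    }
}
```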