Package org.apache.nutch.tools
Class WARCUtils
- java.lang.Object
-
- org.apache.nutch.tools.WARCUtils
-
public class WARCUtils extends Object
-
-
Field Summary
Fields
Modifier and Type  Field
static String  COLONSP
static String  CONFORMS_TO
static String  CRLF
static String  FORMAT
static org.archive.uid.UUIDGenerator  generator
static String  HOSTNAME
static String  HTTP_HEADER_FROM
static String  HTTP_HEADER_USER_AGENT
static String  IP
static String  OPERATOR
protected static Pattern  PROBLEMATIC_HEADERS
static String  ROBOTS
static String  SOFTWARE
protected static String  X_HIDE_HEADER
-
Constructor Summary
Constructors
Constructor
WARCUtils()
-
Method Summary
All Methods (all static and concrete)
Modifier and Type  Method
static org.archive.io.warc.WARCRecordInfo  docToMetadata(NutchDocument doc)
static String  fixHttpHeaders(String headers, int contentLength)
    Modify verbatim HTTP response headers: fix, remove or replace the headers Content-Length, Content-Encoding and Transfer-Encoding, which may confuse WARC readers.
static String  getAgentString(String name, String version, String description, String URL, String email)
static String  getHostname(Configuration conf)
static String  getIPAddress(Configuration conf)
static org.archive.util.anvl.ANVLRecord  getWARCInfoContent(Configuration conf)
static byte[]  toByteArray(org.archive.format.http.HttpHeaders headers)
-
-
-
Field Detail
-
SOFTWARE
public static final String SOFTWARE
- See Also:
- Constant Field Values
-
HTTP_HEADER_FROM
public static final String HTTP_HEADER_FROM
- See Also:
- Constant Field Values
-
HTTP_HEADER_USER_AGENT
public static final String HTTP_HEADER_USER_AGENT
- See Also:
- Constant Field Values
-
HOSTNAME
public static final String HOSTNAME
- See Also:
- Constant Field Values
-
ROBOTS
public static final String ROBOTS
- See Also:
- Constant Field Values
-
OPERATOR
public static final String OPERATOR
- See Also:
- Constant Field Values
-
FORMAT
public static final String FORMAT
- See Also:
- Constant Field Values
-
CONFORMS_TO
public static final String CONFORMS_TO
- See Also:
- Constant Field Values
-
IP
public static final String IP
- See Also:
- Constant Field Values
-
generator
public static final org.archive.uid.UUIDGenerator generator
-
CRLF
public static final String CRLF
- See Also:
- Constant Field Values
-
COLONSP
public static final String COLONSP
- See Also:
- Constant Field Values
-
PROBLEMATIC_HEADERS
protected static final Pattern PROBLEMATIC_HEADERS
-
X_HIDE_HEADER
protected static final String X_HIDE_HEADER
- See Also:
- Constant Field Values
-
-
Method Detail
-
getWARCInfoContent
public static final org.archive.util.anvl.ANVLRecord getWARCInfoContent(Configuration conf)
-
getHostname
public static final String getHostname(Configuration conf) throws UnknownHostException
- Throws:
UnknownHostException
-
getIPAddress
public static final String getIPAddress(Configuration conf) throws UnknownHostException
- Throws:
UnknownHostException
-
toByteArray
public static final byte[] toByteArray(org.archive.format.http.HttpHeaders headers) throws IOException
- Throws:
IOException
-
getAgentString
public static final String getAgentString(String name, String version, String description, String URL, String email)
-
docToMetadata
public static final org.archive.io.warc.WARCRecordInfo docToMetadata(NutchDocument doc) throws UnsupportedEncodingException
- Throws:
UnsupportedEncodingException
-
fixHttpHeaders
public static final String fixHttpHeaders(String headers, int contentLength)
Modify verbatim HTTP response headers: fix, remove or replace the headers Content-Length, Content-Encoding and Transfer-Encoding, which may confuse WARC readers. Ensure that the returned header ends with a single empty line (\r\n\r\n).
- Parameters:
headers - HTTP 1.1 or 1.0 response header string, CR-LF-separated lines, first line is status line
contentLength - Effective uncompressed and unchunked length of content
- Returns:
safe HTTP response header
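The kind of rewriting described for fixHttpHeaders can be illustrated with a small standalone sketch. This is not the Nutch implementation: the class HeaderFixSketch, its method fix, and the pattern PROBLEMATIC are hypothetical names, and the sketch assumes the simplest possible policy (drop the three headers, then append a Content-Length carrying the effective length). The actual method may instead keep or rename the original headers.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the code of org.apache.nutch.tools.WARCUtils.
final class HeaderFixSketch {
    static final String CRLF = "\r\n";

    // Assumed pattern: header lines whose values no longer describe the stored payload.
    static final Pattern PROBLEMATIC = Pattern.compile(
        "(?i)^(content-length|content-encoding|transfer-encoding)\\s*:.*");

    // headers: status line plus header lines, CR-LF separated.
    // contentLength: length of the uncompressed, unchunked content.
    static String fix(String headers, int contentLength) {
        StringBuilder out = new StringBuilder();
        for (String line : headers.split(CRLF)) {
            if (line.isEmpty()) {
                break; // blank line marks the end of the header block
            }
            if (PROBLEMATIC.matcher(line).matches()) {
                continue; // simplification: drop the header entirely
            }
            out.append(line).append(CRLF);
        }
        // Re-add a Content-Length that matches the stored payload.
        out.append("Content-Length: ").append(contentLength).append(CRLF);
        // Terminate the header block with exactly one empty line.
        out.append(CRLF);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "HTTP/1.1 200 OK" + CRLF
            + "Content-Type: text/html" + CRLF
            + "Content-Encoding: gzip" + CRLF
            + "Content-Length: 10" + CRLF + CRLF;
        System.out.print(fix(raw, 42));
    }
}
```

In this sketch the status line and unrelated headers pass through unchanged, and the result always ends with \r\n\r\n regardless of how the input was terminated.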
-
-