Package org.apache.nutch.util
Class MimeUtil
java.lang.Object
    org.apache.nutch.util.MimeUtil

public final class MimeUtil extends Object
This is a facade class to insulate Nutch from its underlying Mime Type substrate library, Apache Tika. Any Mime handling code should be placed in this utility class, and hidden from the Nutch classes that rely on it.
-
-
Constructor Summary
Constructor
MimeUtil(Configuration conf)
-
Method Summary
Modifier and Type, Method: Description
String autoResolveContentType(String typeName, String url, byte[] data): A facade interface that tries all of the mime type resolution strategies available within Tika.
static String cleanMimeType(String origType): Cleans a MimeType name by extracting the actual MimeType from a string of the form: <primary type>/<sub type> ; <optional params>
String forName(String name): A facade interface to Tika's underlying MimeTypes.forName(String) method.
String getMimeType(File f): Facade interface to Tika's underlying MimeTypes.getMimeType(File) method.
String getMimeType(String url): Facade interface to Tika's underlying MimeTypes.getMimeType(String) method.
static void setPoolSize(int poolSize)
-
-
-
Constructor Detail
-
MimeUtil
public MimeUtil(Configuration conf)
-
-
Method Detail
-
setPoolSize
public static void setPoolSize(int poolSize)
-
cleanMimeType
public static String cleanMimeType(String origType)
Cleans a MimeType name by extracting the actual MimeType from a string of the form: <primary type>/<sub type> ; <optional params>
Parameters:
origType - The original mime type string to be cleaned.
Returns:
The primary type and subtype, concatenated, i.e., the actual mime type.
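To make the cleaning step concrete, here is a minimal, self-contained sketch of one way such a string could be reduced to its primary type and subtype. It is not the Nutch implementation: the class name MimeCleanSketch, the method name clean, the null handling, and the whitespace trimming are all assumptions made for illustration.

```java
public class MimeCleanSketch {

    /**
     * Hypothetical stand-in for cleanMimeType(String); not the Nutch source.
     * Keeps everything before the first ';' and trims surrounding whitespace.
     */
    public static String clean(String origType) {
        if (origType == null) {
            return null; // assumption: a null input yields a null result
        }
        int semicolon = origType.indexOf(';');
        String base = (semicolon >= 0) ? origType.substring(0, semicolon) : origType;
        return base.trim(); // assumption: whitespace around the type is dropped
    }

    public static void main(String[] args) {
        System.out.println(clean("text/html; charset=UTF-8"));
        System.out.println(clean("application/pdf"));
    }
}
```

In this sketch, a value with no optional parameters passes through unchanged apart from trimming.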
-
autoResolveContentType
public String autoResolveContentType(String typeName, String url, byte[] data)
A facade interface that tries all of the mime type resolution strategies available within Tika. First, the mime type provided in typeName is cleaned with cleanMimeType(String). The cleaned name is then looked up in the underlying Tika MimeTypes registry. If the MimeType is found, that mime type is used; otherwise, URL resolution is used to try to determine the mime type. However, if mime.type.magic is enabled in NutchConfiguration, mime type magic resolution is used to try to obtain a better-than-the-default approximation of the MimeType.
Parameters:
typeName - The original mime type, returned from a ProtocolOutput.
url - The URL that Nutch was trying to crawl.
data - The byte data returned from the crawl, if any.
Returns:
The automatically guessed MimeType name.
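The resolution order described above can be sketched in a few lines. The following is an illustration only, under stated assumptions: it does not call Tika or Nutch, the tiny registry, extension table, and magic signatures are hypothetical stand-ins, the magicEnabled parameter stands in for the mime.type.magic setting, and both the rule that a positive magic match overrides the earlier result and the application/octet-stream fallback are assumptions rather than documented behavior.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class ResolveSketch {

    // Hypothetical miniature registry; the real Tika registry is far larger.
    static final Set<String> KNOWN = Set.of("text/html", "application/pdf", "image/png");
    static final Map<String, String> BY_EXTENSION =
            Map.of("html", "text/html", "pdf", "application/pdf", "png", "image/png");
    static final String DEFAULT_TYPE = "application/octet-stream"; // assumed fallback

    // Strip optional parameters such as "; charset=UTF-8".
    static String clean(String type) {
        if (type == null) return null;
        int semicolon = type.indexOf(';');
        return (semicolon >= 0 ? type.substring(0, semicolon) : type).trim();
    }

    // URL resolution, reduced here to a file-extension lookup.
    static String fromUrl(String url) {
        if (url == null) return null;
        int dot = url.lastIndexOf('.');
        if (dot < 0 || dot == url.length() - 1) return null;
        return BY_EXTENSION.get(url.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    // Magic resolution, reduced here to two well-known leading byte signatures.
    static String fromMagic(byte[] data) {
        if (data == null || data.length < 4) return null;
        if (data[0] == '%' && data[1] == 'P' && data[2] == 'D' && data[3] == 'F') {
            return "application/pdf";
        }
        if ((data[0] & 0xFF) == 0x89 && data[1] == 'P' && data[2] == 'N' && data[3] == 'G') {
            return "image/png";
        }
        return null;
    }

    public static String resolve(String typeName, String url, byte[] data, boolean magicEnabled) {
        String cleaned = clean(typeName);
        // Use the declared type if the registry knows it, otherwise fall back to the URL.
        String result = (cleaned != null && KNOWN.contains(cleaned)) ? cleaned : fromUrl(url);
        if (magicEnabled) {
            String sniffed = fromMagic(data);
            if (sniffed != null) {
                result = sniffed; // assumption: a positive magic match wins
            }
        }
        return (result != null) ? result : DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(resolve("text/html; charset=UTF-8", "http://example.com/a", null, false));
        System.out.println(resolve("unknown/type", "http://example.com/doc.pdf", null, false));
        System.out.println(resolve("text/html", "http://example.com/doc", pdf, true));
    }
}
```

The third call in main shows why magic resolution can matter: the declared type says HTML, but the leading bytes identify the payload as a PDF.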
-
getMimeType
public String getMimeType(String url)
Facade interface to Tika's underlying MimeTypes.getMimeType(String) method.
Parameters:
url - A string representation of the document URL to sense the MimeType for.
Returns:
An appropriate MimeType, identified from the given document URL in string form.
-
forName
public String forName(String name)
A facade interface to Tika's underlying MimeTypes.forName(String) method.
Parameters:
name - The name of a valid MimeType in the Tika mime registry.
Returns:
The string representation of the MimeType, if it exists, or null otherwise.
-
-