Package org.apache.nutch.urlfilter.fast
Class FastURLFilter
- java.lang.Object
  - org.apache.nutch.urlfilter.fast.FastURLFilter
- All Implemented Interfaces:
  Configurable, URLFilter, Pluggable
public class FastURLFilter extends Object implements URLFilter
Filters URLs based on a file of regular expressions using host/domains matching first. The default policy is to accept a URL if no matches are found.

Rule Format:

  Host www.example.org
    DenyPath /path/to/be/excluded
    DenyPath /some/other/path/excluded

  # Deny everything from *.example.com and example.com
  Domain example.com
    DenyPath .*

  Domain example.org
    DenyPathQuery /resource/.*?action=exclude

Host rules are evaluated before Domain rules. For Host rules the entire host name of a URL must match, while a domain name in a Domain rule matches if it is a suffix of the host name consisting of complete host name parts. Longer domain suffixes are checked first. A single dot "." as "domain name" can be used to specify global rules applied to every URL. E.g., for "www.example.com" the rules given above are looked up in the following order:
- check "www.example.com" whether host-based rules exist and whether one of them matches
- check "www.example.com" for domain-based rules
- check "example.com" for domain-based rules
- check "com" for domain-based rules
- check for global rules ("Domain .")
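The lookup order listed above can be sketched in a few lines of plain Java. This is only an illustration of the documented order, not the class's actual implementation; the class name LookupOrderSketch and the helper lookupKeys are hypothetical, and edge cases such as a trailing dot in the host name are not handled.

```java
import java.util.ArrayList;
import java.util.List;

class LookupOrderSketch {

  /** Hypothetical helper: lists the rule keys consulted for a host name, in lookup order. */
  static List<String> lookupKeys(String hostname) {
    List<String> keys = new ArrayList<>();
    keys.add("Host " + hostname);   // host rules: the entire host name must match
    keys.add("Domain " + hostname); // domain rules defined for the full host name
    int start = 0;
    int pos;
    // Strip one leading host name part per iteration, so that only
    // suffixes made of complete host name parts are produced.
    while ((pos = hostname.indexOf('.', start)) != -1) {
      start = pos + 1;
      keys.add("Domain " + hostname.substring(start));
    }
    keys.add("Domain .");           // global rules
    return keys;
  }

  public static void main(String[] args) {
    lookupKeys("www.example.com").forEach(System.out::println);
  }
}
```

For "www.example.com" the helper yields the host key, then the domain keys "www.example.com", "example.com" and "com", and finally the global key.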
URLs without a host name (e.g., file:/path/file.txt) are checked for global rules only. URLs which fail to be parsed as URL are always rejected.

For each rule, either the URL path (DenyPath) or the path and query (DenyPathQuery) is checked for whether the given Java regular expression is found in it (see Matcher.find()). Rules are applied in the order of their definition. For better performance, regular expressions which are simpler/faster or match more URLs should be defined earlier.

Comments in the rule file start with the # character and reach until the end of the line.

The rules file is defined via the property urlfilter.fast.file, the default name is fast-urlfilter.txt.

In addition, the filter can reject URLs based on the length of the whole URL, its path element or its query element. See the urlfilter.fast.url.* configurations.
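Because rules use Matcher.find() rather than a full match, a pattern denies a URL as soon as it occurs anywhere in the checked part. The sketch below illustrates this with the standard library only; it is not the filter's own code. The class DenyRuleSketch and its methods are hypothetical, parsing via URI.create(...).toURL() is an assumption made for the example, and URL.getFile() is used to obtain the path plus query.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.regex.Pattern;

class DenyRuleSketch {

  /** Parses the string as a URL, or returns null if that fails. */
  static URL parse(String url) {
    try {
      return URI.create(url).toURL();
    } catch (IllegalArgumentException | MalformedURLException e) {
      return null;
    }
  }

  /** DenyPath-style check: is the pattern found anywhere in the URL path? */
  static boolean deniedByPath(String url, String regex) {
    URL u = parse(url);
    if (u == null) {
      return true; // unparsable URLs are always rejected
    }
    return Pattern.compile(regex).matcher(u.getPath()).find();
  }

  /** DenyPathQuery-style check: is the pattern found in path plus query? */
  static boolean deniedByPathQuery(String url, String regex) {
    URL u = parse(url);
    if (u == null) {
      return true; // unparsable URLs are always rejected
    }
    // URL.getFile() returns the path followed by "?" and the query, if any
    return Pattern.compile(regex).matcher(u.getFile()).find();
  }

  public static void main(String[] args) {
    String url = "https://example.org/resource/item?action=exclude";
    String regex = "/resource/.*?action=exclude";
    System.out.println(deniedByPath(url, regex));      // the path alone lacks "action=exclude"
    System.out.println(deniedByPathQuery(url, regex)); // path and query together contain it
  }
}
```

Note that in the example pattern `.*?` is a reluctant quantifier, not a literal question mark, so the pattern would also be found in a string such as "/resource/action=exclude".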
-
-
Nested Class Summary
Nested Classes
Modifier and Type    Class                              Description
static class         FastURLFilter.DenyAllRule          Rule for DenyPath .* or DenyPath .?
static class         FastURLFilter.DenyPathQueryRule
static class         FastURLFilter.DenyPathRule
static class         FastURLFilter.Rule
-
Field Summary
Fields
Modifier and Type                    Field                              Description
protected static org.slf4j.Logger    LOG
static String                        URLFILTER_FAST_FILE
static String                        URLFILTER_FAST_MAX_LENGTH
static String                        URLFILTER_FAST_PATH_MAX_LENGTH
static String                        URLFILTER_FAST_QUERY_MAX_LENGTH
-
Fields inherited from interface org.apache.nutch.net.URLFilter
X_POINT_ID
-
-
Constructor Summary
Constructors
Constructor        Description
FastURLFilter()
-
Method Summary
Modifier and Type    Method                         Description
String               filter(String url)             Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null
Configuration        getConf()
void                 reloadRules()
void                 setConf(Configuration conf)
-
-
-
Field Detail
-
LOG
protected static final org.slf4j.Logger LOG
-
URLFILTER_FAST_FILE
public static final String URLFILTER_FAST_FILE
- See Also:
- Constant Field Values
-
URLFILTER_FAST_MAX_LENGTH
public static final String URLFILTER_FAST_MAX_LENGTH
- See Also:
- Constant Field Values
-
URLFILTER_FAST_PATH_MAX_LENGTH
public static final String URLFILTER_FAST_PATH_MAX_LENGTH
- See Also:
- Constant Field Values
-
URLFILTER_FAST_QUERY_MAX_LENGTH
public static final String URLFILTER_FAST_QUERY_MAX_LENGTH
- See Also:
- Constant Field Values
-
-
Method Detail
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
- Specified by:
getConf in interface Configurable
-
filter
public String filter(String url)
Description copied from interface: URLFilter
Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null.
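The pass-through-or-null contract described above can be illustrated with a minimal, self-contained sketch. Nothing here is Nutch API: FilterContractSketch, maxLengthFilter and applyAll are hypothetical names, and the length limit of 30 is an arbitrary value chosen for the example.

```java
import java.util.List;
import java.util.function.UnaryOperator;

class FilterContractSketch {

  /** Hypothetical filter: rejects URLs longer than maxLength, passes others unchanged. */
  static UnaryOperator<String> maxLengthFilter(int maxLength) {
    return url -> url.length() > maxLength ? null : url;
  }

  /** Applies the filters in order and stops as soon as one returns null. */
  static String applyAll(List<UnaryOperator<String>> filters, String url) {
    for (UnaryOperator<String> filter : filters) {
      if (url == null) {
        break; // an earlier filter "deleted" the URL
      }
      url = filter.apply(url);
    }
    return url;
  }

  public static void main(String[] args) {
    List<UnaryOperator<String>> filters = List.of(maxLengthFilter(30));
    System.out.println(applyAll(filters, "https://example.org/"));
    System.out.println(applyAll(filters, "https://example.org/" + "a".repeat(50)));
  }
}
```

A caller following this contract treats a null result as "drop the URL" and any non-null result as the URL to keep.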
-
reloadRules
public void reloadRules() throws IOException
- Throws:
IOException
-
-