Package org.apache.nutch.urlfilter.api
Class RegexURLFilterBase
- java.lang.Object
-
- org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
- All Implemented Interfaces:
Configurable, URLFilter, Pluggable
- Direct Known Subclasses:
AutomatonURLFilter, RegexURLFilter
public abstract class RegexURLFilterBase extends Object implements URLFilter
Generic URLFilter based on regular expressions.
The regular expression rules are expressed in a file. The rules file is determined for each implementation by the getRulesReader(Configuration conf) method.
The file consists of rules, one per line:
[+-]<regex>
where plus (+) means go ahead and index it and minus (-) means do not.
- Author:
- Jérôme Charron
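The rule format described above can be made concrete with a minimal, self-contained sketch that parses `[+-]<regex>` lines and applies them to a URL. It does not use the Nutch classes: `RuleFileSketch`, `Rule`, `parse`, and `filter` are hypothetical names, and the first-matching-rule-wins evaluation, the skipping of blank and `#` lines, the rejection of URLs that match no rule, and the use of `Matcher.find()` are assumptions made for illustration rather than documented behavior.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-alone sketch; not part of Nutch. */
final class RuleFileSketch {

  /** One parsed rule: a sign (true = include, false = exclude) plus a compiled regex. */
  static final class Rule {
    final boolean sign;
    final Pattern pattern;

    Rule(boolean sign, String regex) {
      this.sign = sign;
      this.pattern = Pattern.compile(regex);
    }
  }

  /** Parses one rule per line; every rule line must start with '+' or '-'. */
  static List<Rule> parse(Reader reader) throws IOException {
    List<Rule> rules = new ArrayList<>();
    BufferedReader in = new BufferedReader(reader);
    String line;
    while ((line = in.readLine()) != null) {
      // Assumption: blank lines and '#' comment lines are skipped.
      if (line.trim().isEmpty() || line.startsWith("#")) {
        continue;
      }
      char first = line.charAt(0);
      if (first != '+' && first != '-') {
        throw new IllegalArgumentException("Invalid rule: " + line);
      }
      rules.add(new Rule(first == '+', line.substring(1)));
    }
    return rules;
  }

  /** Returns the URL if the first matching rule is '+', otherwise null. */
  static String filter(List<Rule> rules, String url) {
    for (Rule rule : rules) {
      if (rule.pattern.matcher(url).find()) {
        return rule.sign ? url : null;
      }
    }
    return null; // Assumption: a URL that matches no rule is rejected.
  }

  public static void main(String[] args) throws IOException {
    String ruleText = "# skip images\n-\\.(gif|jpg|png)$\n+^https?://\n";
    List<Rule> rules = parse(new StringReader(ruleText));
    System.out.println(filter(rules, "https://example.org/index.html"));
    System.out.println(filter(rules, "https://example.org/logo.png"));
  }
}
```

Under these assumptions rule order matters: the exclusion for image suffixes is listed before the general `+^https?://` rule, so it takes precedence for URLs that match both.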
-
-
Field Summary
Fields
Modifier and Type    Field                 Description
protected boolean    hasHostDomainRules    Whether there are host- or domain-specific rules.
-
Fields inherited from interface org.apache.nutch.net.URLFilter
X_POINT_ID
-
-
Constructor Summary
Constructors
Modifier     Constructor                          Description
public       RegexURLFilterBase()                 Constructs a new empty RegexURLFilterBase.
public       RegexURLFilterBase(File filename)    Constructs a new RegexURLFilterBase and initializes it with a file of rules.
protected    RegexURLFilterBase(Reader reader)    Constructs a new RegexURLFilterBase and initializes it with a Reader of rules.
public       RegexURLFilterBase(String rules)     Constructs a new RegexURLFilterBase and initializes it with a list of rules.
-
Method Summary
Methods
Modifier and Type               Method                                                         Description
protected abstract RegexRule    createRule(boolean sign, String regex)                         Creates a new RegexRule.
protected abstract RegexRule    createRule(boolean sign, String regex, String hostOrDomain)    Creates a new RegexRule.
String                          filter(String url)                                             Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null.
Configuration                   getConf()
protected abstract Reader       getRulesReader(Configuration conf)                             Returns a Reader over the rules to use for a particular implementation.
static void                     main(RegexURLFilterBase filter, String[] args)                 Filters the standard input using a RegexURLFilterBase.
void                            setConf(Configuration conf)
-
-
-
Field Detail
-
hasHostDomainRules
protected boolean hasHostDomainRules
Whether there are host- or domain-specific rules. If there are no such rules, the host and domain name are not extracted from the URL, which speeds up matching. readRules(Reader) automatically sets this to true if host- or domain-specific rules are used in the rule file.
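The optimization behind this flag can be sketched as follows. This is a stand-alone illustration, not the Nutch code: `HostExtractionSketch`, `hostFor`, and the `hostExtractions` counter are hypothetical names, and using `java.net.URI` to obtain the host is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical sketch of the hasHostDomainRules optimization; not the Nutch code. */
final class HostExtractionSketch {

  /** Mirrors the documented flag: true once any host- or domain-specific rule is loaded. */
  private final boolean hasHostDomainRules;

  /** Counts host extractions so the saving is observable. */
  int hostExtractions = 0;

  HostExtractionSketch(boolean hasHostDomainRules) {
    this.hasHostDomainRules = hasHostDomainRules;
  }

  /** Returns the host to match scoped rules against, or null when no rule needs it. */
  String hostFor(String url) {
    if (!hasHostDomainRules) {
      return null; // Fast path: skip URL parsing entirely.
    }
    hostExtractions++;
    try {
      return new URI(url).getHost();
    } catch (URISyntaxException e) {
      return null; // Assumption: an unparsable URL has no host to match.
    }
  }

  public static void main(String[] args) {
    HostExtractionSketch plain = new HostExtractionSketch(false);
    HostExtractionSketch scoped = new HostExtractionSketch(true);
    System.out.println(plain.hostFor("https://example.org/page"));
    System.out.println(scoped.hostFor("https://example.org/page"));
  }
}
```

When the flag is false, the per-URL parsing cost is avoided altogether, which is the saving the field description refers to.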
-
-
Constructor Detail
-
RegexURLFilterBase
public RegexURLFilterBase()
Constructs a new empty RegexURLFilterBase.
-
RegexURLFilterBase
public RegexURLFilterBase(File filename) throws IOException, IllegalArgumentException
Constructs a new RegexURLFilterBase and initializes it with a file of rules.
- Parameters:
filename - the name of the rules file.
- Throws:
IOException - if there is a fatal I/O error interpreting the input File
IllegalArgumentException - if there is a fatal error processing the regex rules within the URLFilter
-
RegexURLFilterBase
public RegexURLFilterBase(String rules) throws IOException, IllegalArgumentException
Constructs a new RegexURLFilterBase and initializes it with a list of rules.
- Parameters:
rules - string with a list of rules, one rule per line
- Throws:
IOException - if there is a fatal I/O error interpreting the input rules
IllegalArgumentException - if there is a fatal error processing the regex rules within the URLFilter
-
RegexURLFilterBase
protected RegexURLFilterBase(Reader reader) throws IOException, IllegalArgumentException
Constructs a new RegexURLFilterBase and initializes it with a Reader of rules.
- Parameters:
reader - a reader of rules.
- Throws:
IOException - if there is a fatal I/O error interpreting the input Reader
IllegalArgumentException - if there is a fatal error processing the regex rules within the URLFilter
-
-
Method Detail
-
createRule
protected abstract RegexRule createRule(boolean sign, String regex)
Creates a new RegexRule.
- Parameters:
sign - the sign of the regular expression. A true value means that any URL matching this rule must be included, whereas a false value means that any URL matching this rule must be excluded.
regex - the regular expression associated with this rule.
- Returns:
RegexRule
-
createRule
protected abstract RegexRule createRule(boolean sign, String regex, String hostOrDomain)
Creates a new RegexRule.
- Parameters:
sign - the sign of the regular expression. A true value means that any URL matching this rule must be included, whereas a false value means that any URL matching this rule must be excluded.
regex - the regular expression associated with this rule.
hostOrDomain - the host or domain to which this regex belongs
- Returns:
RegexRule
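The two createRule overloads act as factory methods: the base class decides when a rule is created, and each subclass decides which regular-expression engine backs it. The sketch below shows that pattern in isolation. `Rule`, `FilterBase`, and `JavaRegexFilter` are hypothetical stand-ins rather than the Nutch types, and treating a null `hostOrDomain` as "applies to every host" is an assumption.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the createRule factory methods; not the Nutch classes. */
final class CreateRuleSketch {

  /** Stand-in for a rule: a sign, an optional host or domain scope, and a matcher. */
  abstract static class Rule {
    final boolean sign;
    final String hostOrDomain; // Assumption: null means the rule applies to every host.

    Rule(boolean sign, String hostOrDomain) {
      this.sign = sign;
      this.hostOrDomain = hostOrDomain;
    }

    abstract boolean match(String url);
  }

  /** Stand-in for the base class: it owns rule loading, subclasses own the regex engine. */
  abstract static class FilterBase {
    abstract Rule createRule(boolean sign, String regex);

    abstract Rule createRule(boolean sign, String regex, String hostOrDomain);
  }

  /** One possible implementation backed by java.util.regex. */
  static final class JavaRegexFilter extends FilterBase {
    @Override
    Rule createRule(boolean sign, String regex) {
      return createRule(sign, regex, null);
    }

    @Override
    Rule createRule(boolean sign, String regex, String hostOrDomain) {
      final Pattern pattern = Pattern.compile(regex);
      return new Rule(sign, hostOrDomain) {
        @Override
        boolean match(String url) {
          return pattern.matcher(url).find();
        }
      };
    }
  }

  public static void main(String[] args) {
    FilterBase filter = new JavaRegexFilter();
    Rule exclude = filter.createRule(false, "\\.pdf$");
    Rule scoped = filter.createRule(true, "^https://", "example.org");
    System.out.println(exclude.sign + " " + exclude.match("https://example.org/a.pdf"));
    System.out.println(scoped.hostOrDomain + " " + scoped.match("https://example.org/"));
  }
}
```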
-
getRulesReader
protected abstract Reader getRulesReader(Configuration conf) throws IOException
Returns a Reader over the rules to use for a particular implementation.
- Parameters:
conf - the current configuration.
- Returns:
a Reader over the resource containing the rules to use.
- Throws:
IOException - if there is a fatal error obtaining the Reader
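As an illustration of what an implementation of this method might look like, the sketch below resolves the rules from a configuration object and returns a Reader over them. It is self-contained and makes several assumptions: `java.util.Properties` stands in for the Hadoop `Configuration` class, the property key `urlfilter.sketch.rules` is invented for this example, and the rules are stored inline rather than in a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical sketch of a getRulesReader implementation; not the Nutch code. */
final class RulesReaderSketch {

  /** Hypothetical property holding the rules inline, one per line. */
  static final String RULES_KEY = "urlfilter.sketch.rules";

  /** Returns a Reader over the configured rules; Properties stands in for Configuration. */
  static Reader getRulesReader(Properties conf) throws IOException {
    String rules = conf.getProperty(RULES_KEY);
    if (rules == null) {
      throw new IOException("No rules configured under " + RULES_KEY);
    }
    return new StringReader(rules);
  }

  public static void main(String[] args) throws IOException {
    Properties conf = new Properties();
    conf.setProperty(RULES_KEY, "-\\.pdf$\n+.\n");
    try (BufferedReader in = new BufferedReader(getRulesReader(conf))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println("rule: " + line);
      }
    }
  }
}
```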
-
filter
public String filter(String url)
Description copied from interface: URLFilter
Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null.
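The pass-through-or-null contract means a caller can apply a filter to a batch of URLs by keeping only the non-null results. The sketch below illustrates that usage with a `UnaryOperator<String>` standing in for a URLFilter; `FilterContractSketch`, `applyFilter`, and the `httpsOnly` filter are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of how a caller might consume the filter contract. */
final class FilterContractSketch {

  /** Keeps every URL for which the filter returns a non-null value. */
  static List<String> applyFilter(UnaryOperator<String> filter, List<String> urls) {
    List<String> kept = new ArrayList<>();
    for (String url : urls) {
      String result = filter.apply(url);
      if (result != null) {
        kept.add(result);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Hypothetical filter: passes https URLs through and "deletes" everything else.
    UnaryOperator<String> httpsOnly = url -> url.startsWith("https://") ? url : null;
    List<String> urls = Arrays.asList("https://example.org/", "ftp://example.org/file");
    System.out.println(applyFilter(httpsOnly, urls));
  }
}
```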
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
- Specified by:
getConf in interface Configurable
-
main
public static void main(RegexURLFilterBase filter, String[] args) throws IOException, IllegalArgumentException
Filters the standard input using a RegexURLFilterBase.
- Parameters:
filter - the RegexURLFilterBase to use for filtering the standard input.
args - some optional parameters (not used).
- Throws:
IOException - if there is a fatal I/O error interpreting the input arguments
IllegalArgumentException - if there is a fatal error processing the input arguments
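A command-line driver of this kind can be sketched as a loop that reads one URL per line and reports the filter's decision. The version below is a stand-alone illustration under stated assumptions: `StdinFilterSketch` and `run` are hypothetical names, a `UnaryOperator<String>` stands in for the filter, and the `+url` / `-url` output format is an assumption rather than something this page documents.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of a stdin-filtering driver; not the Nutch code. */
final class StdinFilterSketch {

  /** Reads one URL per line and prints "+url" if accepted or "-url" if rejected. */
  static void run(UnaryOperator<String> filter, Reader input, PrintStream out)
      throws IOException {
    BufferedReader in = new BufferedReader(input);
    String line;
    while ((line = in.readLine()) != null) {
      String result = filter.apply(line);
      out.println(result != null ? "+" + result : "-" + line);
    }
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical filter: rejects URLs ending in ".pdf" and passes everything else.
    UnaryOperator<String> noPdf = url -> url.endsWith(".pdf") ? null : url;
    run(noPdf, new InputStreamReader(System.in, StandardCharsets.UTF_8), System.out);
  }
}
```

Passing the input and output streams as parameters keeps the loop testable without touching the real standard input.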
-
-