Class LuceneTokenizer

java.lang.Object
  org.apache.nutch.scoring.similarity.util.LuceneTokenizer

public class LuceneTokenizer
extends Object
Nested Class Summary

Nested Classes:
- static class LuceneTokenizer.TokenizerType
Constructor Summary

Constructors:
- LuceneTokenizer(String content, LuceneTokenizer.TokenizerType tokenizer, boolean useStopFilter, LuceneAnalyzerUtil.StemFilterType stemFilterType)
  Creates a tokenizer based on param values.
- LuceneTokenizer(String content, LuceneTokenizer.TokenizerType tokenizer, List<String> stopWords, boolean addToDefault, LuceneAnalyzerUtil.StemFilterType stemFilterType)
  Creates a tokenizer based on param values.
- LuceneTokenizer(String content, LuceneTokenizer.TokenizerType tokenizer, LuceneAnalyzerUtil.StemFilterType stemFilterType, int mingram, int maxgram)
  Creates a tokenizer for the ngram model based on param values.
Method Summary

All Methods | Instance Methods | Concrete Methods
- org.apache.lucene.analysis.TokenStream getTokenStream()
  Gets the TokenStream created by the Tokenizer.
Constructor Detail

LuceneTokenizer

public LuceneTokenizer(String content,
                       LuceneTokenizer.TokenizerType tokenizer,
                       boolean useStopFilter,
                       LuceneAnalyzerUtil.StemFilterType stemFilterType)

Creates a tokenizer based on param values.

Parameters:
- content - the text to tokenize
- tokenizer - the type of tokenizer to use, CLASSIC or DEFAULT
- useStopFilter - if set to true, the token stream will be filtered using the default Lucene stop set
- stemFilterType - the preferred LuceneAnalyzerUtil.StemFilterType to use; one of LuceneAnalyzerUtil.StemFilterType.PORTERSTEM_FILTER, LuceneAnalyzerUtil.StemFilterType.ENGLISHMINIMALSTEM_FILTER, or LuceneAnalyzerUtil.StemFilterType.NONE
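A minimal usage sketch for this constructor, combined with getTokenStream(). It assumes the Nutch and Lucene jars are on the classpath; the input string and token-printing loop are illustrative, using the standard Lucene TokenStream consumer workflow (reset, incrementToken, end, close):

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil;
import org.apache.nutch.scoring.similarity.util.LuceneTokenizer;

public class TokenizerExample {
    public static void main(String[] args) throws Exception {
        // Tokenize with the DEFAULT tokenizer, drop default Lucene stop
        // words, and apply Porter stemming.
        LuceneTokenizer tokenizer = new LuceneTokenizer(
                "The quick brown foxes jumped over the lazy dogs",
                LuceneTokenizer.TokenizerType.DEFAULT,
                true, // useStopFilter: filter with the default Lucene stop set
                LuceneAnalyzerUtil.StemFilterType.PORTERSTEM_FILTER);

        TokenStream stream = tokenizer.getTokenStream();
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString());
        }
        stream.end();
        stream.close();
    }
}
```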
LuceneTokenizer

public LuceneTokenizer(String content,
                       LuceneTokenizer.TokenizerType tokenizer,
                       List<String> stopWords,
                       boolean addToDefault,
                       LuceneAnalyzerUtil.StemFilterType stemFilterType)

Creates a tokenizer based on param values.

Parameters:
- content - the text to tokenize
- tokenizer - the type of tokenizer to use, CLASSIC or DEFAULT
- stopWords - a set of user-defined stop words
- addToDefault - if set to true, the stopWords will be added to the Lucene default stop set; if false, only the user-provided words will be used as the stop set
- stemFilterType - the preferred LuceneAnalyzerUtil.StemFilterType to use; one of LuceneAnalyzerUtil.StemFilterType.PORTERSTEM_FILTER, LuceneAnalyzerUtil.StemFilterType.ENGLISHMINIMALSTEM_FILTER, or LuceneAnalyzerUtil.StemFilterType.NONE
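A hedged sketch of the custom stop-word variant, again assuming the Nutch and Lucene jars are on the classpath; the stop words and input text are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil;
import org.apache.nutch.scoring.similarity.util.LuceneTokenizer;

public class CustomStopWordsExample {
    public static void main(String[] args) throws Exception {
        // Domain-specific stop words supplied by the caller.
        List<String> stopWords = Arrays.asList("page", "site");

        // addToDefault = true merges these words into the Lucene default
        // stop set; false would use only the supplied list.
        LuceneTokenizer tokenizer = new LuceneTokenizer(
                "This page links to another site",
                LuceneTokenizer.TokenizerType.DEFAULT,
                stopWords,
                true,
                LuceneAnalyzerUtil.StemFilterType.NONE);

        TokenStream stream = tokenizer.getTokenStream();
        // consume the stream as usual (reset/incrementToken/end/close)
    }
}
```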
LuceneTokenizer

public LuceneTokenizer(String content,
                       LuceneTokenizer.TokenizerType tokenizer,
                       LuceneAnalyzerUtil.StemFilterType stemFilterType,
                       int mingram,
                       int maxgram)

Creates a tokenizer for the ngram model based on param values.

Parameters:
- content - the text to tokenize
- tokenizer - the type of tokenizer to use, CLASSIC or DEFAULT
- stemFilterType - the type of stemming to perform
- mingram - value of mingram for tokenizing
- maxgram - value of maxgram for tokenizing
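A sketch of the ngram constructor under the same classpath assumption. Note the parameter order differs from the other constructors (stemFilterType comes before the gram sizes); whether the grams are character- or token-based depends on the underlying Nutch implementation:

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil;
import org.apache.nutch.scoring.similarity.util.LuceneTokenizer;

public class NGramExample {
    public static void main(String[] args) throws Exception {
        // Build an ngram token stream with gram sizes between 2 and 3,
        // no stemming.
        LuceneTokenizer tokenizer = new LuceneTokenizer(
                "similarity scoring",
                LuceneTokenizer.TokenizerType.CLASSIC,
                LuceneAnalyzerUtil.StemFilterType.NONE,
                2,  // mingram
                3); // maxgram

        TokenStream stream = tokenizer.getTokenStream();
        // consume the stream as usual (reset/incrementToken/end/close)
    }
}
```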