Package org.apache.nutch.crawl
Class Injector
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.util.NutchTool
      - org.apache.nutch.crawl.Injector
All Implemented Interfaces:
Configurable, Tool
public class Injector extends NutchTool implements Tool
Injector takes a flat text file of URLs (or a folder containing text files) and merges ("injects") these URLs into the CrawlDb. This is useful for bootstrapping a Nutch crawl. The URL files contain one URL per line, optionally followed by custom metadata. Metadata entries are separated from the URL and from each other by tabs, and each metadata key is separated from its value by '='.
Note that some metadata keys are reserved:
- nutch.score
- sets a custom score for a specific URL
- nutch.fetchInterval
- sets a custom fetch interval for a specific URL
- nutch.fetchInterval.fixed
- sets a custom fetch interval for a specific URL that is not changed by AdaptiveFetchSchedule
Example (\t denotes a tab character):
http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
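The seed line format described above can be illustrated with a small stand-alone parser. This is only a sketch of the format, not the code Nutch itself uses: the class name SeedLine and its parse method are hypothetical, and the choices to trim whitespace around fields and to skip fields that lack a key or an '=' are assumptions made for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Nutch: splits one seed line into URL and metadata. */
final class SeedLine {
    final String url;
    final Map<String, String> metadata;

    private SeedLine(String url, Map<String, String> metadata) {
        this.url = url;
        this.metadata = metadata;
    }

    /** Parses "url TAB key=value TAB key=value ..."; returns null for blank lines. */
    static SeedLine parse(String line) {
        if (line == null || line.trim().isEmpty()) {
            return null;
        }
        // Fields are tab-separated; the first field is the URL.
        String[] fields = line.split("\t");
        String url = fields[0].trim();
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // assumption: ignore fields without a key or without '='
            }
            // Split at the first '=' so that values may themselves contain '='.
            metadata.put(field.substring(0, eq).trim(), field.substring(eq + 1).trim());
        }
        return new SeedLine(url, metadata);
    }

    public static void main(String[] args) {
        SeedLine seed = parse(
            "http://www.nutch.org/\tnutch.score=10\tnutch.fetchInterval=2592000\tuserType=open_source");
        System.out.println(seed.url);
        System.out.println(seed.metadata);
    }
}
```

All values are kept as strings here; interpreting the reserved keys is left to the caller.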
-
Nested Class Summary
Nested Classes (Modifier and Type, Class: Description)
- static class Injector.InjectMapper: InjectMapper reads both the CrawlDb that seeds are injected into and the plain-text seed files, and parses each line into the URL and metadata.
- static class Injector.InjectReducer: Combines multiple new entries for a URL.
-
Field Summary
Fields (Modifier and Type, Field: Description)
- static String nutchFetchIntervalMDName: metadata key reserved for setting a custom fetchInterval for a specific URL
- static String nutchFixedFetchIntervalMDName: metadata key reserved for setting a fixed custom fetchInterval for a specific URL
- static String nutchScoreMDName: metadata key reserved for setting a custom score for a specific URL
- static String URL_FILTER_NORMALIZE_ALL: property to pass value of command-line option -filterNormalizeAll to mapper
Fields inherited from class org.apache.nutch.util.NutchTool
currentJob, currentJobNum, numJobs, results, status
-
Constructor Summary
Constructors (Constructor: Description)
- Injector()
- Injector(Configuration conf)
-
Method Summary
Methods (Modifier and Type, Method: Description)
- void inject(Path crawlDb, Path urlDir)
- void inject(Path crawlDb, Path urlDir, boolean overwrite, boolean update)
- void inject(Path crawlDb, Path urlDir, boolean overwrite, boolean update, boolean normalize, boolean filter, boolean filterNormalizeAll)
- static void main(String[] args)
- int run(String[] args)
- Map<String,Object> run(Map<String,Object> args, String crawlId): Used by the Nutch REST service
- void usage()
Methods inherited from class org.apache.nutch.util.NutchTool
getProgress, getStatus, killJob, setConf, stopJob
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
-
Field Detail
-
URL_FILTER_NORMALIZE_ALL
public static final String URL_FILTER_NORMALIZE_ALL
Property used to pass the value of the command-line option -filterNormalizeAll to the mapper.
See Also:
- Constant Field Values
-
nutchScoreMDName
public static String nutchScoreMDName
metadata key reserved for setting a custom score for a specific URL
-
nutchFetchIntervalMDName
public static String nutchFetchIntervalMDName
metadata key reserved for setting a custom fetchInterval for a specific URL
-
nutchFixedFetchIntervalMDName
public static String nutchFixedFetchIntervalMDName
metadata key reserved for setting a fixed custom fetchInterval for a specific URL
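To illustrate how the three reserved keys differ from custom metadata, the following sketch separates them from an already parsed key/value map and converts their values to numbers. It is not Nutch code: the class ReservedSeedMetadata is hypothetical, the key literals are taken from the class description and are only assumed to match the values of the static fields above, and the intervals are assumed to be whole seconds (the example value 2592000 corresponds to 30 days). Malformed numbers simply raise NumberFormatException in this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Nutch: separates reserved keys from custom metadata. */
final class ReservedSeedMetadata {
    // Key literals from the class description; assumed to match the static fields.
    static final String SCORE_KEY = "nutch.score";
    static final String FETCH_INTERVAL_KEY = "nutch.fetchInterval";
    static final String FIXED_FETCH_INTERVAL_KEY = "nutch.fetchInterval.fixed";

    Float score;                 // null when the seed line sets no custom score
    Integer fetchInterval;       // null when unset; assumed to be seconds
    Integer fixedFetchInterval;  // null when unset; assumed to be seconds
    final Map<String, String> custom = new LinkedHashMap<>();

    /** Routes each entry to a typed reserved field or to the custom metadata map. */
    static ReservedSeedMetadata from(Map<String, String> metadata) {
        ReservedSeedMetadata result = new ReservedSeedMetadata();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String value = entry.getValue().trim();
            switch (entry.getKey()) {
                case SCORE_KEY:
                    result.score = Float.parseFloat(value);
                    break;
                case FETCH_INTERVAL_KEY:
                    result.fetchInterval = Integer.parseInt(value);
                    break;
                case FIXED_FETCH_INTERVAL_KEY:
                    result.fixedFetchInterval = Integer.parseInt(value);
                    break;
                default:
                    // Anything else is ordinary user metadata and is kept as a string.
                    result.custom.put(entry.getKey(), value);
            }
        }
        return result;
    }
}
```

Keeping the reserved values in nullable fields makes it possible to tell "not set on this seed line" apart from an explicit value such as a score of 0.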
-
Constructor Detail
-
Injector
public Injector()
-
Injector
public Injector(Configuration conf)
-
Method Detail
-
inject
public void inject(Path crawlDb, Path urlDir) throws IOException, ClassNotFoundException, InterruptedException
-
inject
public void inject(Path crawlDb, Path urlDir, boolean overwrite, boolean update) throws IOException, ClassNotFoundException, InterruptedException
-
inject
public void inject(Path crawlDb, Path urlDir, boolean overwrite, boolean update, boolean normalize, boolean filter, boolean filterNormalizeAll) throws IOException, ClassNotFoundException, InterruptedException
-
usage
public void usage()
-
run
public Map<String,Object> run(Map<String,Object> args, String crawlId) throws Exception
Used by the Nutch REST service
Specified by:
- run in class NutchTool
Parameters:
- args - a Map of arguments to be run with the tool
- crawlId - a crawl identifier to associate with the tool invocation
Returns:
- Map results object if tool executes successfully otherwise null
Throws:
- Exception - if there is an error during the tool execution