Package org.apache.nutch.crawl
Class CrawlDb
java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.util.NutchTool
      org.apache.nutch.crawl.CrawlDb
-
All Implemented Interfaces:
Configurable, Tool
public class CrawlDb extends NutchTool implements Tool
This class takes the output of the fetcher and updates the crawldb accordingly.
-
-
Field Summary
Fields
Modifier and Type    Field
static String        CRAWLDB_ADDITIONS_ALLOWED
static String        CRAWLDB_PURGE_404
static String        CRAWLDB_PURGE_ORPHANS
static String        CURRENT_NAME
static String        LOCK_NAME
-
Fields inherited from class org.apache.nutch.util.NutchTool
currentJob, currentJobNum, numJobs, results, status
-
-
Constructor Summary
Constructors
Constructor
CrawlDb()
CrawlDb(Configuration conf)
-
Method Summary
Methods
Modifier and Type     Method
static Job            createJob(Configuration config, Path crawlDb)
static void           install(Job job, Path crawlDb)
static Path           lock(Configuration job, Path crawlDb, boolean force)
static void           main(String[] args)
int                   run(String[] args)
Map<String,Object>    run(Map<String,Object> args, String crawlId)
                      Runs the tool, using a map of arguments.
void                  update(Path crawlDb, Path[] segments, boolean normalize, boolean filter)
void                  update(Path crawlDb, Path[] segments, boolean normalize, boolean filter, boolean additionsAllowed, boolean force)
-
Methods inherited from class org.apache.nutch.util.NutchTool
getProgress, getStatus, killJob, setConf, stopJob
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
-
Field Detail
-
CRAWLDB_ADDITIONS_ALLOWED
public static final String CRAWLDB_ADDITIONS_ALLOWED
- See Also:
- Constant Field Values
-
CRAWLDB_PURGE_404
public static final String CRAWLDB_PURGE_404
- See Also:
- Constant Field Values
-
CRAWLDB_PURGE_ORPHANS
public static final String CRAWLDB_PURGE_ORPHANS
- See Also:
- Constant Field Values
-
CURRENT_NAME
public static final String CURRENT_NAME
- See Also:
- Constant Field Values
-
LOCK_NAME
public static final String LOCK_NAME
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
CrawlDb
public CrawlDb()
-
CrawlDb
public CrawlDb(Configuration conf)
-
-
Method Detail
-
update
public void update(Path crawlDb, Path[] segments, boolean normalize, boolean filter) throws IOException, InterruptedException, ClassNotFoundException
-
update
public void update(Path crawlDb, Path[] segments, boolean normalize, boolean filter, boolean additionsAllowed, boolean force) throws IOException, InterruptedException, ClassNotFoundException
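
The boolean parameters of update have command-line counterparts when the tool is started through run(String[] args). The sketch below shows one way to assemble such an argument array. CrawlDbArgs is a hypothetical helper, not part of Nutch, and the flag spellings (-normalize, -filter, -noAdditions, -force) are assumed from the tool's usage message rather than taken from this page, so check them against the Nutch version in use.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Nutch): builds arguments for CrawlDb.run(String[]). */
class CrawlDbArgs {

    static String[] build(String crawlDb, List<String> segments,
                          boolean normalize, boolean filter,
                          boolean additionsAllowed, boolean force) {
        if (segments.isEmpty()) {
            throw new IllegalArgumentException("at least one segment is required");
        }
        List<String> args = new ArrayList<>();
        args.add(crawlDb);       // the crawldb path comes first
        args.addAll(segments);   // followed by one or more segment paths
        // Flag names are assumptions based on the tool's usage message.
        if (normalize) args.add("-normalize");
        if (filter) args.add("-filter");
        if (!additionsAllowed) args.add("-noAdditions");
        if (force) args.add("-force");
        return args.toArray(new String[0]);
    }

    public static void main(String[] argv) {
        // Paths are placeholders for illustration only.
        String[] args = build("crawl/crawldb", List.of("crawl/segments/segment-1"),
                              true, true, true, false);
        // Prints: crawl/crawldb crawl/segments/segment-1 -normalize -filter
        System.out.println(String.join(" ", args));
        // With Nutch and Hadoop on the classpath, the array could be passed on, e.g.:
        // int exit = ToolRunner.run(NutchConfiguration.create(), new CrawlDb(), args);
    }
}
```

Calling update directly with the same booleans avoids the string round trip. The argument form is mainly useful when the tool is launched from a script or a generic job runner.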
-
createJob
public static Job createJob(Configuration config, Path crawlDb) throws IOException
- Throws:
IOException
-
lock
public static Path lock(Configuration job, Path crawlDb, boolean force) throws IOException
- Throws:
IOException
-
install
public static void install(Job job, Path crawlDb) throws IOException
- Throws:
IOException
-
run
public Map<String,Object> run(Map<String,Object> args, String crawlId) throws Exception
Description copied from class: NutchTool
Runs the tool, using a map of arguments. May return results, or null.
- Specified by:
run in class NutchTool
- Parameters:
args - a Map of arguments to be run with the tool
crawlId - a crawl identifier to associate with the tool invocation
- Returns:
Map results object if the tool executes successfully, otherwise null
- Throws:
Exception - if there is an error during the tool execution
-
-