Package org.apache.nutch.crawl
Class MimeAdaptiveFetchSchedule
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.crawl.AbstractFetchSchedule
      - org.apache.nutch.crawl.AdaptiveFetchSchedule
        - org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
- All Implemented Interfaces:
Configurable, FetchSchedule
public class MimeAdaptiveFetchSchedule extends AdaptiveFetchSchedule
Extension of AdaptiveFetchSchedule that allows more flexible configuration of the DEC and INC factors for individual MIME types. This class is typically useful when a recrawl covers many different MIME types. It is not very common for MIME types other than text/html to change frequently. With this class you can configure different factors per MIME type so that frequently changing MIME types are preferred over others. To work, this class relies on the Content-Type metadata key being present in the CrawlDB. This can be arranged either when injecting new URLs or by adding "Content-Type" to the db.parsemeta.to.crawldb configuration setting, which forces the MIME types of newly discovered URLs to be added to the CrawlDB.
- Author:
- markus
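The per-MIME-type idea described above can be illustrated with a small, self-contained Java sketch. This is not the Nutch source: the class `MimeRateTable`, its methods, the stripping of Content-Type parameters, and all numeric values are hypothetical and only show how a Content-Type value might be mapped to its own INC and DEC factors, with a fallback to defaults for unlisted types.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not part of Nutch. */
public class MimeRateTable {

  /** Pair of adaptive factors for one MIME type. */
  public static final class Rates {
    public final float inc;
    public final float dec;

    public Rates(float inc, float dec) {
      this.inc = inc;
      this.dec = dec;
    }
  }

  private final Map<String, Rates> byMime = new HashMap<>();
  private final Rates defaults;

  /** The defaults apply to any MIME type without its own entry. */
  public MimeRateTable(float defaultInc, float defaultDec) {
    this.defaults = new Rates(defaultInc, defaultDec);
  }

  /** Registers factors for one MIME type, e.g. "text/html". */
  public void put(String mime, float inc, float dec) {
    byMime.put(normalize(mime), new Rates(inc, dec));
  }

  /** Returns the factors for a Content-Type value, or the defaults. */
  public Rates lookup(String contentType) {
    if (contentType == null) {
      return defaults;
    }
    return byMime.getOrDefault(normalize(contentType), defaults);
  }

  /** Assumption: parameters such as "; charset=UTF-8" are ignored. */
  private static String normalize(String contentType) {
    int semi = contentType.indexOf(';');
    String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
    return base.trim().toLowerCase(Locale.ROOT);
  }
}
```

In this sketch, a table built with `new MimeRateTable(0.2f, 0.2f)` and one entry for `text/html` returns the `text/html` factors for `"text/html; charset=UTF-8"` and the defaults for `"application/pdf"`.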
-
-
Field Summary
Fields
Modifier and Type    Field                 Description
static String        SCHEDULE_DEC_RATE
static String        SCHEDULE_INC_RATE
static String        SCHEDULE_MIME_FILE
-
Fields inherited from class org.apache.nutch.crawl.AdaptiveFetchSchedule
DEC_RATE, INC_RATE
-
Fields inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
defaultInterval, maxInterval
-
Fields inherited from interface org.apache.nutch.crawl.FetchSchedule
SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN
-
-
Constructor Summary
Constructors
Constructor                    Description
MimeAdaptiveFetchSchedule()
-
Method Summary
Modifier and Type    Method    Description
static void          main(String[] args)
void                 setConf(Configuration conf)
CrawlDatum           setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
                     Sets the fetchInterval and fetchTime on a successfully fetched page.
-
Methods inherited from class org.apache.nutch.crawl.AdaptiveFetchSchedule
getHostName, getMaxInterval, getMinInterval
-
Methods inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
calculateLastFetchTime, forceRefetch, initializeSchedule, setPageGoneSchedule, setPageRetrySchedule, shouldFetch
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf
-
-
-
-
Field Detail
-
SCHEDULE_INC_RATE
public static final String SCHEDULE_INC_RATE
- See Also:
- Constant Field Values
-
SCHEDULE_DEC_RATE
public static final String SCHEDULE_DEC_RATE
- See Also:
- Constant Field Values
-
SCHEDULE_MIME_FILE
public static final String SCHEDULE_MIME_FILE
- See Also:
- Constant Field Values
-
-
Method Detail
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
- Overrides:
setConf in class AdaptiveFetchSchedule
-
setFetchSchedule
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Description copied from class: AbstractFetchSchedule
Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter; extending classes should call super.setFetchSchedule() to preserve this behavior.
- Specified by:
setFetchSchedule in interface FetchSchedule
- Overrides:
setFetchSchedule in class AdaptiveFetchSchedule
- Parameters:
url - URL of the page.
datum - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
prevFetchTime - previous value of fetch time, or 0 if not available.
prevModifiedTime - previous value of modifiedTime, or 0 if not available.
fetchTime - the latest time at which the page was re-fetched. Most FetchSchedule implementations should update the value in the CrawlDatum to something greater than this value.
modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in the CrawlDatum to this value.
state - if FetchSchedule.STATUS_MODIFIED, the content is considered to have changed before the fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to FetchSchedule.STATUS_UNKNOWN, it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
- Returns:
- adjusted page information, including all original information. NOTE: this may be a different instance than the CrawlDatum passed in, but implementations should make sure that it contains at least all information from that CrawlDatum.
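To make the role of the state parameter and the INC and DEC factors more concrete, here is a minimal, self-contained Java sketch of a multiplicative interval update. It is an assumption-based illustration, not the actual Nutch implementation: the class `AdaptiveIntervalSketch`, the method `adjust`, the local status constants (stand-ins for the FetchSchedule fields of the same names), and the clamping to a minimum and maximum interval are all hypothetical, and the real method may do more, such as adjusting the next fetch time.

```java
/** Hypothetical illustration only; not part of Nutch. */
public class AdaptiveIntervalSketch {

  // Local stand-ins for the FetchSchedule status constants (values assumed).
  public static final int STATUS_UNKNOWN = 0;
  public static final int STATUS_MODIFIED = 1;
  public static final int STATUS_NOTMODIFIED = 2;

  /**
   * Shrinks the interval when the page changed, grows it when it did not,
   * and leaves it alone when the state is unknown. The result is clamped
   * to the range [minInterval, maxInterval].
   */
  public static float adjust(float interval, int state,
                             float incRate, float decRate,
                             float minInterval, float maxInterval) {
    switch (state) {
      case STATUS_MODIFIED:
        // Page changed: revisit sooner.
        interval *= (1.0f - decRate);
        break;
      case STATUS_NOTMODIFIED:
        // Page unchanged: revisit later.
        interval *= (1.0f + incRate);
        break;
      default:
        // Unknown state: keep the current interval.
        break;
    }
    return Math.max(minInterval, Math.min(maxInterval, interval));
  }
}
```

Combined with per-MIME-type factors, a larger DEC factor for a frequently changing type would shorten that type's interval faster after each detected change, which matches the preference described in the class summary.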
-
-