edu.umass.cs.mallet.base.extract
Class CRFExtractor

java.lang.Object
  extended by edu.umass.cs.mallet.base.extract.CRFExtractor
All Implemented Interfaces:
Extractor, java.io.Serializable

public class CRFExtractor
extends java.lang.Object
implements Extractor

Created: Oct 12, 2004

See Also:
Serialized Form

Constructor Summary
CRFExtractor(CRF4 crf)
           
CRFExtractor(CRF4 crf, Pipe tokpipe)
           
CRFExtractor(CRF4 crf, Pipe tokpipe, TokenizationFilter filter)
           
CRFExtractor(CRF4 crf, Pipe tokpipe, TokenizationFilter filter, java.lang.String backgroundTag)
           
CRFExtractor(java.io.File crfFile)
           
 
Method Summary
 Extraction extract(java.lang.Object o)
          Performs extraction given a raw object.
 Extraction extract(PipeInputIterator source)
          Performs extraction on a set of raw documents.
 Extraction extract(Tokenization spans)
          Performs extraction from an object that has already been tokenized.
 CRF4 getCrf()
           
 Pipe getFeaturePipe()
          Returns the pipe used by this extractor for feature extraction.
 Alphabet getInputAlphabet()
          Returns an alphabet of the features used by the extractor.
 LabelAlphabet getTargetAlphabet()
          Returns an alphabet of the labels used by the extractor.
 Pipe getTokenizationPipe()
          Returns the pipe used by this extractor to tokenize the input.
 Sequence pipeInput(java.lang.Object input)
           
 InstanceList pipeInstances(PipeInputIterator source)
           
 void setFeaturePipe(Pipe featurePipe)
           
 void setTokenizationPipe(Pipe tokenizationPipe)
          Sets the pipe used by this extractor for tokenization.
 void slicePipes(int num)
          Transfer some Pipes from the feature pipe to the tokenization pipe.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CRFExtractor

public CRFExtractor(CRF4 crf)

CRFExtractor

public CRFExtractor(java.io.File crfFile)
             throws java.io.IOException

CRFExtractor

public CRFExtractor(CRF4 crf,
                    Pipe tokpipe)

CRFExtractor

public CRFExtractor(CRF4 crf,
                    Pipe tokpipe,
                    TokenizationFilter filter)

CRFExtractor

public CRFExtractor(CRF4 crf,
                    Pipe tokpipe,
                    TokenizationFilter filter,
                    java.lang.String backgroundTag)
Method Detail

extract

public Extraction extract(java.lang.Object o)
Description copied from interface: Extractor
Performs extraction given a raw object. The object will be passed through the Extractor's pipe.

Specified by:
extract in interface Extractor
Parameters:
o - The document to extract from (often a String).
Returns:
An Extraction holding the results of performing extraction

extract

public Extraction extract(Tokenization spans)
Description copied from interface: Extractor
Performs extraction from an object that has already been tokenized. This method will pass spans through the extractor's pipe.

Specified by:
extract in interface Extractor
Parameters:
spans - A tokenized document
Returns:
An Extraction holding the results of performing extraction

pipeInstances

public InstanceList pipeInstances(PipeInputIterator source)

extract

public Extraction extract(PipeInputIterator source)
Description copied from interface: Extractor
Performs extraction on a set of raw documents. The Instances output from source will be passed through both the tokenization pipe and the feature extraction pipe.

Specified by:
extract in interface Extractor
Parameters:
source - A source of raw documents
Returns:
An Extraction holding the results of performing extraction
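
The two-stage flow described for this method (tokenization first, then feature extraction) can be sketched without MALLET on the classpath. The sketch below is illustrative only: TwoStagePipeSketch, TOKENIZE, FEATURIZE and pipeAll are hypothetical stand-ins written with plain java.util types, and they are not the Pipe, Instance or Tokenization classes of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

public class TwoStagePipeSketch {
    // Hypothetical stand-in for a tokenization pipe: raw text -> list of tokens.
    static final Function<String, List<String>> TOKENIZE =
        raw -> Arrays.asList(raw.trim().split("\\s+"));

    // Hypothetical stand-in for a feature pipe: tokens -> one feature string per token.
    static final Function<List<String>, List<String>> FEATURIZE = tokens -> {
        List<String> features = new ArrayList<>();
        for (String token : tokens) {
            features.add("W=" + token.toLowerCase(Locale.ROOT));
        }
        return features;
    };

    // Passes every raw document through tokenization, then through feature extraction.
    static List<List<String>> pipeAll(List<String> rawDocs) {
        Function<String, List<String>> both = TOKENIZE.andThen(FEATURIZE);
        List<List<String>> piped = new ArrayList<>();
        for (String doc : rawDocs) {
            piped.add(both.apply(doc));
        }
        return piped;
    }

    public static void main(String[] args) {
        System.out.println(pipeAll(Arrays.asList("John Smith", "works at UMass")));
    }
}
```

Keeping the two stages as separate functions mirrors why the extractor exposes a tokenization pipe and a feature pipe separately: input that is already tokenized can skip the first stage.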

getTokenizationPipe

public Pipe getTokenizationPipe()
Description copied from interface: Extractor
Returns the pipe used by this extractor to tokenize the input. The type of Instance this pipe expects is specific to the individual extractor. This pipe will return an Instance whose data is a Tokenization.

Specified by:
getTokenizationPipe in interface Extractor
Returns:
a pipe

setTokenizationPipe

public void setTokenizationPipe(Pipe tokenizationPipe)
Description copied from interface: Extractor
Sets the pipe used by this extractor for tokenization. The pipe should take a raw object and convert it into a Tokenization.

The pipe edu.umass.cs.mallet.base.pipe.CharSequence2TokenSequence is an example of a pipe that could be used here.

Specified by:
setTokenizationPipe in interface Extractor

getFeaturePipe

public Pipe getFeaturePipe()
Description copied from interface: Extractor
Returns the pipe used by this extractor for feature extraction. The pipe takes an Instance and converts it into a form usable by the particular extraction algorithm. This pipe expects the Instance's data field to be a Tokenization. For example, such pipes often perform feature extraction. The type of raw object expected by the pipe depends on the particular subclass of extractor.

Specified by:
getFeaturePipe in interface Extractor
Returns:
a pipe

setFeaturePipe

public void setFeaturePipe(Pipe featurePipe)

getInputAlphabet

public Alphabet getInputAlphabet()
Description copied from interface: Extractor
Returns an alphabet of the features used by the extractor. The alphabet maps strings describing the features to indices.

Specified by:
getInputAlphabet in interface Extractor
Returns:
the input alphabet
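
The string-to-index mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not MALLET's Alphabet class: AlphabetSketch and its methods indexOf, featureAt and size are hypothetical names, and the policy of assigning the next free index on first lookup is assumed for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AlphabetSketch {
    // Feature string -> index (hypothetical stand-in for an alphabet's forward map).
    private final Map<String, Integer> indices = new HashMap<>();
    // Index -> feature string, for the reverse lookup.
    private final List<String> entries = new ArrayList<>();

    // Returns the index of a feature, assigning the next free index the first time it is seen.
    public int indexOf(String feature) {
        Integer index = indices.get(feature);
        if (index == null) {
            index = entries.size();
            indices.put(feature, index);
            entries.add(feature);
        }
        return index;
    }

    // Returns the feature string stored at the given index.
    public String featureAt(int index) {
        return entries.get(index);
    }

    // Number of distinct features seen so far.
    public int size() {
        return entries.size();
    }
}
```

With this sketch, looking up the same feature string twice returns the same index, so feature vectors built at different times refer to features consistently.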

getTargetAlphabet

public LabelAlphabet getTargetAlphabet()
Description copied from interface: Extractor
Returns an alphabet of the labels used by the extractor. Labels include entity types (such as PERSON) and slot names (such as EMPLOYEE-OF).

Specified by:
getTargetAlphabet in interface Extractor
Returns:
the target alphabet

getCrf

public CRF4 getCrf()

slicePipes

public void slicePipes(int num)
Transfer some Pipes from the feature pipe to the tokenization pipe. The feature pipe must be a SerialPipes. This will destructively modify the CRF object of the extractor. This is useful if you have a CRF that has been trained from a single pipe, which you need to split up into feature and tokenization pipes.
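
The idea of splitting one serial pipeline into a tokenization part and a feature part can be sketched as below. The sketch rests on two assumptions that this page does not confirm: that the first num stages are the ones transferred, and that a plain java.util.List can stand in for a SerialPipes. SlicePipesSketch and sliceFront are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SlicePipesSketch {
    // Removes the first num stages from featureStages (modifying it in place)
    // and returns them as a new list, to be used as the tokenization stages.
    // Assumption: the stages transferred are the leading ones.
    static <T> List<T> sliceFront(List<T> featureStages, int num) {
        if (num < 0 || num > featureStages.size()) {
            throw new IllegalArgumentException("num must be between 0 and " + featureStages.size());
        }
        List<T> head = featureStages.subList(0, num);
        List<T> moved = new ArrayList<>(head); // copy before clearing the view
        head.clear();                          // removes the same elements from featureStages
        return moved;
    }

    public static void main(String[] args) {
        // Stage names are placeholders, not real pipe classes.
        List<String> feature = new ArrayList<>(
            Arrays.asList("tokenize", "lowercase", "features", "vectorize"));
        List<String> tokenization = sliceFront(feature, 2);
        System.out.println(tokenization + " | " + feature);
    }
}
```

As with the method documented here, the sketch is destructive: the original list loses the transferred stages, so any other holder of that list sees the change.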


pipeInput

public Sequence pipeInput(java.lang.Object input)