edu.umass.cs.mallet.base.pipe
Class Pipe

java.lang.Object
  extended by edu.umass.cs.mallet.base.pipe.Pipe
All Implemented Interfaces:
java.io.Serializable
Direct Known Subclasses:
AceTypeFeature, AcronymOf, AddClassifierTokenPredictions, AffixOfMentionPair, AllLinks, Array2FeatureVector, AugmentableFeatureVectorAddConjunctions, AugmentableFeatureVectorLogScale, AuthorLastNameEqual, AuthorPipe, AverageLink, BooktitlePipe, CharSequence2CharNGrams, CharSequence2TokenSequence, CharSequenceArray2TokenSequence, CharSequenceReplace, CharSubsequence, Classification2ConfidencePredictingFeatureVector, ClosestSingleLink, ClusterHomogeneity, ClusterSize, ConllNer2003Sentence2TokenSequence, ConllNer2003Sentence2TokenSequence, CountMatches, CountMatchesAlignedWithOffsets, CountMatchesMatching, Csv2Array, Csv2FeatureVector, DatePipe, Directory2FileIterator, EnronMessage2TokenSequence, ExactFieldMatchPipe, FarthestSingleLink, FeatureSequence2AugmentableFeatureVector, FeatureSequence2FeatureVector, FeaturesInWindow, FeaturesOfFirstMention, FeatureValueString2FeatureVector, FeatureVectorConjunctions, FeatureWindow, FieldStringDistancePipe, Filename2CharSequence, ForAll, FuchunPipe, GenderMentionPair, GlobalPipe, GlobalPipeUnSeg, HeuristicPipe, HobbsDistanceMentionPair, Input2CharSequence, InstanceListTrimFeaturesByCount, InterFieldPipe, IteratingPipe, JournalPipe, LengthBins, LexiconMembership, LinearDistanceMentionPair, LineGroupString2TokenSequence, ListMember, LongRegexMatches, MakeAmpersandXMLFriendly, MentionPair2FeatureVector, MentionPair2FeatureVectorFilter, MentionPairAntecedentPosition, MentionPairHeadIdentical, MentionPairIdentical, MentionPairNPDistance, MentionPairSentenceDistance, MentionPairSubstring, ModifierWordFeatures, NNegativeNodes, NodeClusterPair2FeatureVector, NodePair2FeatureVector, NodePairSaveSource, Noop, NormalizationPipe, NullAntecedentFeatureExtractor, OffsetConjunctions, OffsetFeatureConjunction, OffsetPropertyConjunctions, PageMatchPipe, PagesPipe, PaperClusterPrediction, ParallelPipes, PartOfSpeechMentionPair, PlainFieldPipe, POSFeaturesPipe, PrintInput, PrintInputAndTarget, PrintTokenSequenceFeatures, PublisherPipe, 
RegexMatches, RegexPipe, SaveDataInSource, SelectiveSGML2TokenSequence, SequencePrintingPipe, SerialPipes, SGML2FieldsPipe, SGML2TokenSequence, SGMLStringDistances, SimpleTagger.SimpleTaggerSentence2FeatureVectorSequence, SimpleTaggerSentence2TokenSequence, SourceLocation2TokenSequence, SplitFieldStringDistancePipe, StringAddNewLineDelimiter, StringDistances, Target2BIOFormat, Target2FeatureSequence, Target2Label, Target2LabelSequence, TargetRememberLastLabel, TechPipe, TestCRF.TestCRF2String, TestCRF.TestCRFTokenSequenceRemoveSpaces, TestCRF2.TestCRF2String, TestCRF2.TestCRFTokenSequenceRemoveSpaces, TestCRF3.TestCRF2String, TestCRF3.TestCRFTokenSequenceRemoveSpaces, TestCRF4.TestCRF2String, TestCRF4.TestCRFTokenSequenceRemoveSpaces, TestInstancePipe.Array2ArrayIterator, TestMEMM.TestMEMM2String, TestMEMM.TestMEMMTokenSequenceRemoveSpaces, TestSGML2TokenSequence.Array2ArrayIterator, ThereExists, ThereExistsMatch, TitlePipe, Token2FeatureVector, TokenFeaturesMentionPair, TokenSequence2FeatureSequence, TokenSequence2FeatureSequenceWithBigrams, TokenSequence2FeatureVectorSequence, TokenSequence2TokenIterator, TokenSequence2Tokenization, TokenSequenceDocHeader, TokenSequenceLowercase, TokenSequenceMatchDataAndTarget, TokenSequenceNGrams, TokenSequenceParseFeatureString, TokenSequenceRemoveNonAlpha, TokenSequenceRemoveStopwords, TokenText, TokenTextCharNGrams, TokenTextCharPrefix, TokenTextCharSuffix, TokenTextNGrams, TrieLexiconMembership, TUI_CorefIE.BogusClusterPipe, TUI_CorefIE.NegativeClusterFeaturePipe, TUI_CorefIE.NumAppearancesInClusterPipe, TUI_CorefIE.WordAppearsInAnyClusterPipe, TUI_CorefIE.WordOftenAppearsAsPipe, VenueAcronymPipe, VenueClusterPrediction, VenuePaperCluster2FeatureVector, VenuePipe, VolumesMatchPipe, YearPipe, YearsWithinFivePipe

public abstract class Pipe
extends java.lang.Object
implements java.io.Serializable

The abstract superclass of all Pipes, which transform one data type to another. Pipes are most often used for feature extraction.

A pipe operates on an Instance, which is a carrier of data. A pipe reads from and writes to fields in the Instance when it is requested to process the instance. It is up to the pipe which fields in the Instance it reads from and writes to, but usually a pipe will read its input from and write its output to the "data" field of an instance.

A pipe doesn't have any direct notion of input or output - it merely modifies instances that are handed to it. A set of helper classes, subclasses of AbstractPipeInputIterator, iterate over commonly encountered input data structures and feed the elements of these data structures to a pipe as instances.

A pipe is frequently used in conjunction with an InstanceList. As instances are added to the list, they are processed by the pipe associated with the instance list, and the processed Instance is kept in the list.
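
The process-on-add behaviour described above can be sketched in plain Java. This is a minimal illustration, not MALLET code: Carrier, Step, CarrierList and Length are hypothetical stand-ins for Instance, Pipe, InstanceList and a concrete pipe, and only the data and target fields are modelled.

```java
import java.util.ArrayList;
import java.util.List;

public class InstanceListSketch {
    // Hypothetical stand-in for Instance: carries only data and target.
    static class Carrier {
        Object data;
        Object target;
        Carrier(Object data, Object target) {
            this.data = data;
            this.target = target;
        }
    }

    // Hypothetical stand-in for the abstract Pipe class.
    static abstract class Step {
        abstract Carrier pipe(Carrier carrier);
    }

    // Example step: replaces a String in the data field with its length.
    static class Length extends Step {
        Carrier pipe(Carrier carrier) {
            carrier.data = Integer.valueOf(((String) carrier.data).length());
            return carrier;
        }
    }

    // Hypothetical stand-in for InstanceList: every added carrier is
    // run through the associated step first, and the result is stored.
    static class CarrierList {
        private final Step step;
        private final List<Carrier> items = new ArrayList<Carrier>();
        CarrierList(Step step) {
            this.step = step;
        }
        void add(Carrier carrier) {
            items.add(step.pipe(carrier));
        }
        List<Carrier> items() {
            return items;
        }
    }

    // Adds one carrier per text and returns the processed data values.
    static List<Object> processed(String... texts) {
        CarrierList list = new CarrierList(new Length());
        for (String text : texts) {
            list.add(new Carrier(text, null));
        }
        List<Object> result = new ArrayList<Object>();
        for (Carrier carrier : list.items()) {
            result.add(carrier.data);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(processed("ab", "cde"));
    }
}
```

The list never stores raw input: what it keeps is whatever the step left in each carrier.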

In one common usage, a FileIterator is given a list of directories to operate over. The FileIterator walks through each directory, creating an instance for each file and putting the data from the file in the data field of the instance. The directory of the file is stored in the target field of the instance. The FileIterator feeds instances to an InstanceList, which processes the instances through its associated pipe and keeps the results.

Pipes can be hierarchically composed. In a typical usage, a SerialPipes is created, which holds other pipes in an ordered list. Piping an instance through a SerialPipes means piping the instance through the child pipes in sequence.
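
Sequential composition of this kind can be sketched as follows. The classes are hypothetical stand-ins written for illustration (Carrier for Instance, Step for Pipe, Serial for the serial container); the real MALLET constructors and signatures are not shown here and may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SerialPipeSketch {
    // Hypothetical stand-in for Instance: carries only data and target.
    static class Carrier {
        Object data;
        Object target;
        Carrier(Object data, Object target) {
            this.data = data;
            this.target = target;
        }
    }

    // Hypothetical stand-in for the abstract Pipe class.
    static abstract class Step {
        abstract Carrier pipe(Carrier carrier);
    }

    // Lowercases a String held in the data field.
    static class Lowercase extends Step {
        Carrier pipe(Carrier carrier) {
            carrier.data = ((String) carrier.data).toLowerCase();
            return carrier;
        }
    }

    // Replaces a String in the data field with a list of its
    // whitespace-separated tokens.
    static class SplitOnWhitespace extends Step {
        Carrier pipe(Carrier carrier) {
            String text = ((String) carrier.data).trim();
            carrier.data = new ArrayList<String>(Arrays.asList(text.split("\\s+")));
            return carrier;
        }
    }

    // Holds child steps in an ordered list and runs them in sequence.
    static class Serial extends Step {
        private final List<Step> children;
        Serial(List<Step> children) {
            this.children = new ArrayList<Step>(children);
        }
        Carrier pipe(Carrier carrier) {
            for (Step child : children) {
                carrier = child.pipe(carrier);
            }
            return carrier;
        }
    }

    // Runs a two-step chain over the text and returns the final data field.
    static Object run(String text) {
        Step chain = new Serial(Arrays.<Step>asList(new Lowercase(), new SplitOnWhitespace()));
        return chain.pipe(new Carrier(text, "label")).data;
    }

    public static void main(String[] args) {
        System.out.println(run("Pipes Transform Data")); // prints [pipes, transform, data]
    }
}
```

Because Serial is itself a Step, a chain can be nested inside another chain, which is what makes the composition hierarchical.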

A pipe holds onto two separate Alphabets: one for the symbols (feature names) encountered in the data fields of the instances processed through the pipe, and one for the symbols encountered in the target fields.
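
To illustrate why the two alphabets are kept apart, here is a minimal sketch. It assumes an alphabet behaves as a table that assigns each new symbol the next integer index; SymbolTable, TwoTableHolder and lookupOrAdd are hypothetical names chosen for this example, not the MALLET API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class AlphabetSketch {
    // Hypothetical stand-in for Alphabet: each unseen symbol gets the
    // next free index; a known symbol keeps its existing index.
    static class SymbolTable {
        private final Map<String, Integer> indices = new LinkedHashMap<String, Integer>();
        int lookupOrAdd(String symbol) {
            Integer index = indices.get(symbol);
            if (index == null) {
                index = indices.size();
                indices.put(symbol, index);
            }
            return index;
        }
    }

    // Hypothetical pipe-like holder with separate tables for symbols
    // seen in data fields and symbols seen in target fields.
    static class TwoTableHolder {
        final SymbolTable dataSymbols = new SymbolTable();
        final SymbolTable targetSymbols = new SymbolTable();

        // Returns the feature indices followed by the label index.
        int[] index(String[] features, String label) {
            int[] out = new int[features.length + 1];
            for (int i = 0; i < features.length; i++) {
                out[i] = dataSymbols.lookupOrAdd(features[i]);
            }
            out[features.length] = targetSymbols.lookupOrAdd(label);
            return out;
        }
    }

    // Indexes two small examples through the same holder.
    static String indexTwoExamples() {
        TwoTableHolder holder = new TwoTableHolder();
        int[] first = holder.index(new String[] {"cat", "dog"}, "animal");
        int[] second = holder.index(new String[] {"dog", "rose"}, "plant");
        return Arrays.toString(first) + " " + Arrays.toString(second);
    }

    public static void main(String[] args) {
        System.out.println(indexTwoExamples()); // prints [0, 1, 0] [1, 2, 1]
    }
}
```

The label "animal" and the feature "cat" both receive index 0 without clashing, because feature names and target labels are numbered independently.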

See Also:
Serialized Form

Constructor Summary
Pipe()
          Construct a pipe with no data or target dictionaries.
Pipe(Alphabet dataDict, Alphabet targetDict)
          Construct pipe with data and target dictionaries.
Pipe(java.lang.Class dataDictClass, java.lang.Class targetDictClass)
          Construct pipe with type specifications for dictionaries.
 
Method Summary
 Alphabet getDataAlphabet()
           
 java.rmi.dgc.VMID getInstanceId()
           
 Pipe getParent()
           
 Pipe getParentRoot()
           
 Alphabet getTargetAlphabet()
           
 boolean isDataAlphabetSet()
           
 boolean isTargetProcessing()
          Return true iff this pipe expects and processes information in the target slot.
abstract  Instance pipe(Instance carrier)
          Process an Instance.
 Instance pipe(java.lang.Object data, java.lang.Object target, java.lang.Object name, java.lang.Object source, Instance parent, PropertyList properties)
          Create and process an Instance.
 java.lang.Object readResolve()
          This gets called after readObject; it lets the object decide whether to return itself or return a previously read in version.
protected  Alphabet resolveDataAlphabet()
           
protected  Alphabet resolveTargetAlphabet()
           
 void setDataAlphabet(Alphabet dDict)
           
 void setParent(Pipe p)
           
 void setTargetAlphabet(Alphabet tDict)
           
 void setTargetProcessing(boolean lookForAndProcessTarget)
          Set whether input is taken from target field of instance during processing.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Pipe

public Pipe(java.lang.Class dataDictClass,
            java.lang.Class targetDictClass)
Construct pipe with type specifications for dictionaries. Pass a non-null class if you want the corresponding dictionary created as an instance of that class.

Parameters:
dataDictClass - Class that will be used to create a data dictionary.
targetDictClass - Class that will be used to create a target dictionary.

Pipe

public Pipe()
Construct a pipe with no data or target dictionaries.


Pipe

public Pipe(Alphabet dataDict,
            Alphabet targetDict)
Construct pipe with data and target dictionaries. Because dataDictClass and targetDictClass default to null, passing null for either argument here means this pipe will never create the corresponding dictionary.

Parameters:
dataDict - Alphabet that will be used as the data dictionary.
targetDict - Alphabet that will be used as the target dictionary.

Method Detail

pipe

public abstract Instance pipe(Instance carrier)
Process an Instance. This method takes an input Instance, destructively modifies it in some way, and returns it. This is the method by which all pipes are eventually run.

One can create a new concrete subclass of Pipe simply by implementing this method.

Parameters:
carrier - Instance to be processed.

pipe

public Instance pipe(java.lang.Object data,
                     java.lang.Object target,
                     java.lang.Object name,
                     java.lang.Object source,
                     Instance parent,
                     PropertyList properties)
Create and process an Instance. An instance is created from the given arguments and then the pipe is run on the instance.

Parameters:
data - Object used to initialize data field of new instance.
target - Object used to initialize target field of new instance.
name - Object used to initialize name field of new instance.
source - Object used to initialize source field of new instance.
parent - Unused
properties - Unused
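
The relationship between the two pipe methods above can be sketched as follows. This is an illustration under stated assumptions, not the MALLET implementation: Carrier, Step and CountVowels are hypothetical stand-ins, and the convenience overload is reduced to data and target, omitting name, source, parent and properties.

```java
public class ConcretePipeSketch {
    // Hypothetical stand-in for Instance: carries only data and target.
    static class Carrier {
        Object data;
        Object target;
        Carrier(Object data, Object target) {
            this.data = data;
            this.target = target;
        }
    }

    // Hypothetical stand-in for the abstract Pipe class.
    static abstract class Step {
        // The single method a concrete subclass has to implement.
        abstract Carrier pipe(Carrier carrier);

        // Convenience overload: create a carrier, then process it.
        Carrier pipe(Object data, Object target) {
            return pipe(new Carrier(data, target));
        }
    }

    // Concrete step: overwrites a String in the data field with the
    // number of vowels it contains, then returns the same carrier.
    static class CountVowels extends Step {
        Carrier pipe(Carrier carrier) {
            String text = ((String) carrier.data).toLowerCase();
            int count = 0;
            for (char ch : text.toCharArray()) {
                if ("aeiou".indexOf(ch) >= 0) {
                    count++;
                }
            }
            carrier.data = Integer.valueOf(count);
            return carrier;
        }
    }

    // True if the step hands back the carrier it was given.
    static boolean returnsSameCarrier() {
        Carrier carrier = new Carrier("pipe", null);
        return new CountVowels().pipe(carrier) == carrier;
    }

    // Uses the convenience overload and returns the processed data field.
    static Object vowelCount(String text) {
        return new CountVowels().pipe(text, null).data;
    }

    public static void main(String[] args) {
        System.out.println(returnsSameCarrier()); // prints true
        System.out.println(vowelCount("pipe"));   // prints 2
    }
}
```

The original String is gone after processing, which is what "destructively modifies" means in practice: callers that still need the raw input have to keep their own copy.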

setTargetProcessing

public void setTargetProcessing(boolean lookForAndProcessTarget)
Set whether input is taken from the target field of the instance during processing. If the argument is false, the pipe does not expect to find input material in the target field. The default is true.


isTargetProcessing

public boolean isTargetProcessing()
Return true iff this pipe expects and processes information in the target slot.


setParent

public void setParent(Pipe p)

getParent

public Pipe getParent()

getParentRoot

public Pipe getParentRoot()

resolveDataAlphabet

protected Alphabet resolveDataAlphabet()

resolveTargetAlphabet

protected Alphabet resolveTargetAlphabet()

getDataAlphabet

public Alphabet getDataAlphabet()

getTargetAlphabet

public Alphabet getTargetAlphabet()

setDataAlphabet

public void setDataAlphabet(Alphabet dDict)

isDataAlphabetSet

public boolean isDataAlphabetSet()

setTargetAlphabet

public void setTargetAlphabet(Alphabet tDict)

getInstanceId

public java.rmi.dgc.VMID getInstanceId()

readResolve

public java.lang.Object readResolve()
                             throws java.io.ObjectStreamException
This is called after readObject; it lets the object decide whether to return itself or a previously read-in version. A HashMap of instanceIds is used to determine whether this object has already been read in.

Returns:
Throws:
java.io.ObjectStreamException
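
The readResolve pattern described above can be sketched with standard Java serialization. This is a minimal illustration, not the MALLET source: Node and its registry are hypothetical, and java.util.UUID is swapped in for java.rmi.dgc.VMID as the identifier type to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class ReadResolveSketch {
    // Hypothetical serializable object carrying a unique id.
    static class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        // Registry of objects already read in, keyed by id (static, so not serialized).
        private static final Map<UUID, Node> seen = new HashMap<UUID, Node>();
        final UUID id = UUID.randomUUID();
        final String label;

        Node(String label) {
            this.label = label;
        }

        // Invoked by Java serialization after the object has been read:
        // return the earlier copy if this id is known, else register this one.
        public Object readResolve() throws ObjectStreamException {
            synchronized (seen) {
                Node earlier = seen.get(id);
                if (earlier != null) {
                    return earlier;
                }
                seen.put(id, this);
                return this;
            }
        }
    }

    static byte[] toBytes(Object object) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buffer);
        out.writeObject(object);
        out.close();
        return buffer.toByteArray();
    }

    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
        try {
            return in.readObject();
        } finally {
            in.close();
        }
    }

    // Deserializes the same bytes twice and reports whether both reads
    // resolved to one and the same object.
    static boolean sameObjectAfterTwoReads() {
        try {
            byte[] bytes = toBytes(new Node("tokenizer"));
            Object first = fromBytes(bytes);
            Object second = fromBytes(bytes);
            return first == second;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sameObjectAfterTwoReads()); // prints true
    }
}
```

Without readResolve, each read would produce a distinct copy; with it, objects that share an id collapse to a single instance within the JVM.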