edu.umass.cs.mallet.base.types
Class InstanceList

java.lang.Object
  extended by edu.umass.cs.mallet.base.types.InstanceList
All Implemented Interfaces:
PipeOutputAccumulator, java.io.Serializable
Direct Known Subclasses:
PagedInstanceList

public class InstanceList
extends java.lang.Object
implements java.io.Serializable, PipeOutputAccumulator

A list of machine learning instances, typically used for training or testing of a machine learning algorithm.

All of the instances in the list will have been passed through the same Pipe, and thus must also share the same data and target Alphabets. InstanceList keeps a reference to the pipe and the two alphabets.

The most common way of adding instances to an InstanceList is through the add(PipeInputIterator) method. PipeInputIterators are a way of mapping general data sources into instances suitable for processing through a pipe. As each Instance is pulled from the PipeInputIterator, the InstanceList copies the instance and runs the copy through its pipe (with resultant destructive modifications) before saving the modified instance on its list. This is the usual way in which instances are transformed by pipes.
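
The copy-then-pipe behavior described above can be illustrated without MALLET on the classpath. The following is a minimal sketch only: SketchInstance, SketchPipe, and SketchInstanceList are hypothetical stand-ins invented for illustration, not the MALLET classes, and a plain java.util.Iterator stands in for PipeInputIterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for Instance, Pipe, and InstanceList; not the MALLET classes.
class SketchInstance {
    Object data;
    final Object target;

    SketchInstance(Object data, Object target) {
        this.data = data;
        this.target = target;
    }

    // Shallow copy, so that piping the copy leaves the caller's instance untouched.
    SketchInstance copy() {
        return new SketchInstance(data, target);
    }
}

interface SketchPipe {
    // May modify the instance it is given (destructively) and returns it.
    SketchInstance pipe(SketchInstance carrier);
}

class SketchInstanceList {
    private final SketchPipe pipe;
    private final List<SketchInstance> instances = new ArrayList<>();

    SketchInstanceList(SketchPipe pipe) {
        this.pipe = pipe;
    }

    // Pulls each raw instance from the source, copies it, pipes the copy,
    // and stores the piped copy on the list.
    void add(Iterator<SketchInstance> source) {
        while (source.hasNext()) {
            SketchInstance copy = source.next().copy();
            instances.add(pipe.pipe(copy));
        }
    }

    SketchInstance get(int index) {
        return instances.get(index);
    }

    int size() {
        return instances.size();
    }
}
```

In this sketch, a pipe that lowercases the data field changes only the stored copy; the instance handed out by the source keeps its original data.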

InstanceList also provides methods for randomly generating lists of feature vectors and for splitting lists into non-overlapping subsets (useful for test/train splits), as well as iterators for cross validation.

See Also:
Instance, Pipe, PipeInputIterator, Serialized Form

Nested Class Summary
 class InstanceList.CrossValidationIterator
          CrossValidationIterator iterates over pairs of InstanceLists, where each pair is a training/testing split determined by nfolds.
 class InstanceList.Iterator
           
protected static interface InstanceList.Stream
           
 
Constructor Summary
InstanceList()
          Creates a list which must have its pipe set later.
InstanceList(Alphabet dataVocab, Alphabet targetVocab)
          Creates a list which will not pass added instances through a pipe.
InstanceList(Pipe pipe)
          Creates a list with the given pipe.
InstanceList(Pipe pipe, int capacity)
          Creates a list with the given pipe and initial capacity; all added instances are passed through the specified pipe.
InstanceList(Random r, Alphabet vocab, java.lang.String[] classNames, int meanInstancesPerLabel)
           
InstanceList(Random r, Dirichlet classCentroidDistribution, double classCentroidAverageAlphaMean, double classCentroidAverageAlphaVariance, double featureVectorSizePoissonLambda, double classInstanceCountPoissonLambda, java.lang.String[] classNames)
          Creates a list consisting of randomly-generated FeatureVectors.
InstanceList(Random r, int vocabSize, int numClasses)
           
 
Method Summary
 boolean add(Instance instance)
          Appends the instance to this list.
 boolean add(Instance instance, double instanceWeight)
          Appends the instance to this list, assigning it the specified weight.
 void add(InstanceList ilist)
          Adds to this list each instance in the input list.
 boolean add(java.lang.Object data, java.lang.Object target, java.lang.Object name, java.lang.Object source)
          Constructs and appends an instance to this list, passing it through this list's pipe.
 boolean add(java.lang.Object data, java.lang.Object target, java.lang.Object name, java.lang.Object source, double instanceWeight)
          Constructs and appends an instance to this list, passing it through this list's pipe and assigning it the specified weight.
 void add(PipeInputIterator pi)
          Adds to this list every instance generated by the iterator, passing each one through this list's pipe.
 InstanceList cloneEmpty()
           
 PipeOutputAccumulator clonePipeOutputAccumulator()
           
 InstanceList.CrossValidationIterator crossValidationIterator(int nfolds)
           
 InstanceList.CrossValidationIterator crossValidationIterator(int nfolds, int seed)
           
 java.lang.Object get(int index)
          Returns the Instance at the specified index.
 Alphabet getDataAlphabet()
          Returns the Alphabet mapping features of the data to integers.
 java.lang.Class getDataClass()
          Returns the class of the object contained in the data field of the first Instance in this list.
 FeatureSelection getFeatureSelection()
           
 Instance getInstance(int index)
          Returns the Instance at the specified index.
 double getInstanceWeight(int index)
           
 FeatureSelection[] getPerLabelFeatureSelection()
           
 Pipe getPipe()
          Returns the pipe through which each added Instance is passed, which may be null.
 Alphabet getTargetAlphabet()
          Returns the Alphabet mapping target output labels to integers.
 InstanceList.Iterator iterator()
           
static InstanceList load(java.io.File file)
          Constructs a new InstanceList, deserialized from file.
 double noisify(double ratio)
           
 void pipeOutputAccumulate(Instance carrier, Pipe iteratedPipe)
           
 void removeSources()
          Sets the "source" field to null in all instances.
 void removeTargets()
          Sets the "target" field to null in all instances.
 InstanceList sampleWithInstanceWeights(java.util.Random r)
          Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the instance weights.
 InstanceList sampleWithReplacement(java.util.Random r, int numSamples)
           
 InstanceList sampleWithWeights(java.util.Random r, double[] weights)
          Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the given weights.
 void save(java.io.File file)
          Saves this InstanceList to file.
 void setFeatureSelection(FeatureSelection selectedFeatures)
           
 void setInstance(int index, Instance instance)
          Replaces the Instance at position index with a new one.
 void setInstanceWeight(int index, double weight)
           
 void setPerLabelFeatureSelection(FeatureSelection[] selectedFeatures)
           
 InstanceList shallowClone()
           
 int size()
           
 InstanceList[] split(double[] proportions)
           
 InstanceList[] split(java.util.Random r, double[] proportions)
          Shuffles the elements of this list among several smaller lists.
 InstanceList[] splitByModulo(int m)
          Returns a pair of new lists such that the first list in the pair contains every mth element of this list, starting with the first.
 InstanceList[] splitInOrder(double[] proportions)
          Chops this list into several sequential sublists.
 InstanceList subList(int start, int end)
           
 LabelVector targetLabelDistribution()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

InstanceList

public InstanceList(Pipe pipe,
                    int capacity)
Creates a list with the given pipe and initial capacity; all added instances are passed through the specified pipe.

Parameters:
pipe - The pipe through which all added instances will be passed.
capacity - The initial capacity of the list.

InstanceList

public InstanceList(Pipe pipe)
Creates a list with the given pipe.

Parameters:
pipe - The pipe through which all added instances will be passed.

InstanceList

public InstanceList(Alphabet dataVocab,
                    Alphabet targetVocab)

Creates a list which will not pass added instances through a pipe.

Use this constructor in the infrequent case where the InstanceList has no pipe and objects containing vocabularies are entered directly into the list, for example when creating a random InstanceList using Dirichlets and Multinomials.

Parameters:
dataVocab - The vocabulary for added instances' data fields
targetVocab - The vocabulary for added instances' targets

InstanceList

public InstanceList()
Creates a list which must have its pipe set later.


InstanceList

public InstanceList(Random r,
                    Dirichlet classCentroidDistribution,
                    double classCentroidAverageAlphaMean,
                    double classCentroidAverageAlphaVariance,
                    double featureVectorSizePoissonLambda,
                    double classInstanceCountPoissonLambda,
                    java.lang.String[] classNames)
Creates a list consisting of randomly-generated FeatureVectors.


InstanceList

public InstanceList(Random r,
                    Alphabet vocab,
                    java.lang.String[] classNames,
                    int meanInstancesPerLabel)

InstanceList

public InstanceList(Random r,
                    int vocabSize,
                    int numClasses)

Method Detail

subList

public InstanceList subList(int start,
                            int end)

shallowClone

public InstanceList shallowClone()

noisify

public double noisify(double ratio)

cloneEmpty

public InstanceList cloneEmpty()

split

public InstanceList[] split(java.util.Random r,
                            double[] proportions)
Shuffles the elements of this list among several smaller lists.

Parameters:
proportions - A list of numbers (not necessarily summing to 1) which, when normalized, correspond to the proportion of elements in each returned sublist.
r - The source of randomness to use in shuffling.
Returns:
one InstanceList for each element of proportions
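
The normalization of proportions can be made concrete with a small self-contained sketch. It works on a plain java.util.List rather than an InstanceList; the class name SplitSketch and the rounding rule at the cut points are assumptions made for illustration and are not taken from the MALLET source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper; illustrates a shuffled split by normalized proportions.
class SplitSketch {
    static <T> List<List<T>> split(Random r, List<T> items, double[] proportions) {
        // Normalizing by the sum means the proportions need not sum to 1.
        double total = 0.0;
        for (double p : proportions) {
            total += p;
        }

        // Shuffle a copy so the input list is left unchanged.
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, r);

        List<List<T>> parts = new ArrayList<>();
        double cumulative = 0.0;
        int start = 0;
        for (int i = 0; i < proportions.length; i++) {
            cumulative += proportions[i] / total;
            // Assumed rounding rule: cut at the rounded cumulative share;
            // the last sublist takes whatever remains so no element is dropped.
            int end = (i == proportions.length - 1)
                    ? shuffled.size()
                    : (int) Math.round(cumulative * shuffled.size());
            end = Math.max(start, Math.min(end, shuffled.size()));
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }
}
```

With ten items and proportions {8, 2}, this sketch returns sublists of sizes 8 and 2, the same as it would for {0.8, 0.2}.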

split

public InstanceList[] split(double[] proportions)

splitInOrder

public InstanceList[] splitInOrder(double[] proportions)
Chops this list into several sequential sublists.

Parameters:
proportions - A list of numbers corresponding to the proportion of elements in each returned sublist.
Returns:
one InstanceList for each element of proportions

splitByModulo

public InstanceList[] splitByModulo(int m)
Returns a pair of new lists such that the first list in the pair contains every mth element of this list, starting with the first. The second list contains all remaining elements.
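
The modulo rule can be sketched on a plain java.util.List as follows. ModuloSplitSketch is a hypothetical name used only for illustration; the sketch assumes that "every mth element, starting with the first" means the elements at indices 0, m, 2m, and so on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of MALLET.
class ModuloSplitSketch {
    // Returns two lists: the elements at indices 0, m, 2m, ... and all other elements.
    static <T> List<List<T>> splitByModulo(List<T> items, int m) {
        if (m <= 0) {
            throw new IllegalArgumentException("m must be positive");
        }
        List<T> everyMth = new ArrayList<>();
        List<T> rest = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            if (i % m == 0) {
                everyMth.add(items.get(i));
            } else {
                rest.add(items.get(i));
            }
        }
        List<List<T>> pair = new ArrayList<>();
        pair.add(everyMth);
        pair.add(rest);
        return pair;
    }
}
```

For the seven elements a through g and m = 3, the first list is [a, d, g] and the second is [b, c, e, f].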


sampleWithReplacement

public InstanceList sampleWithReplacement(java.util.Random r,
                                          int numSamples)

getInstance

public Instance getInstance(int index)
Returns the Instance at the specified index.


sampleWithInstanceWeights

public InstanceList sampleWithInstanceWeights(java.util.Random r)
Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the instance weights. The new instances all have their weights set to one.


sampleWithWeights

public InstanceList sampleWithWeights(java.util.Random r,
                                      double[] weights)
Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the given weights. The length of the weight array must be the same as the length of this list. The new instances all have their weights set to one.
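
Weighted sampling with replacement can be sketched on a plain java.util.List. This is one common way to do it (draw a uniform number and walk the cumulative weights), shown under stated assumptions; it is not taken from the MALLET source, and WeightedSampleSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper; not part of MALLET.
class WeightedSampleSketch {
    // Draws items.size() elements with replacement; element i is chosen with
    // probability weights[i] / sum(weights).
    static <T> List<T> sampleWithWeights(Random r, List<T> items, double[] weights) {
        if (weights.length != items.size()) {
            throw new IllegalArgumentException("weights.length must equal items.size()");
        }
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        if (!(total > 0.0)) {
            throw new IllegalArgumentException("weights must have a positive sum");
        }

        List<T> sample = new ArrayList<>(items.size());
        for (int n = 0; n < items.size(); n++) {
            double u = r.nextDouble() * total;   // uniform in [0, total)
            double cumulative = 0.0;
            int chosen = items.size() - 1;       // fallback in case of rounding error
            for (int i = 0; i < weights.length; i++) {
                cumulative += weights[i];
                if (u < cumulative) {
                    chosen = i;
                    break;
                }
            }
            sample.add(items.get(chosen));
        }
        return sample;
    }
}
```

The sketch returns the same number of elements as the input, and an element with weight zero is never drawn as long as some later element has a positive weight.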


setInstance

public void setInstance(int index,
                        Instance instance)
Replaces the Instance at position index with a new one.


getInstanceWeight

public double getInstanceWeight(int index)

setInstanceWeight

public void setInstanceWeight(int index,
                              double weight)

setFeatureSelection

public void setFeatureSelection(FeatureSelection selectedFeatures)

getFeatureSelection

public FeatureSelection getFeatureSelection()

setPerLabelFeatureSelection

public void setPerLabelFeatureSelection(FeatureSelection[] selectedFeatures)

getPerLabelFeatureSelection

public FeatureSelection[] getPerLabelFeatureSelection()

removeTargets

public void removeTargets()
Sets the "target" field to null in all instances. This produces unlabeled data.


removeSources

public void removeSources()
Sets the "source" field to null in all instances. This often saves memory when the raw data has been placed in that field.


get

public java.lang.Object get(int index)
Returns the Instance at the specified index.


load

public static InstanceList load(java.io.File file)
Constructs a new InstanceList, deserialized from file. If the string value of file is "-", then deserialize from System.in.


save

public void save(java.io.File file)
Saves this InstanceList to file. If the string value of file is "-", then serialize to System.out.
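
The "-" convention used by load and save can be sketched with standard Java serialization. This is an illustration under assumptions, not the MALLET code: SerializationSketch is a hypothetical name, and the sketch serializes any Serializable object rather than an InstanceList.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

// Hypothetical helper; not part of MALLET.
class SerializationSketch {
    // Writes the object to the file, or to System.out when the file is named "-".
    static void save(Serializable list, File file) throws IOException {
        boolean toStdout = file.toString().equals("-");
        OutputStream raw = toStdout ? System.out : new FileOutputStream(file);
        ObjectOutputStream out = new ObjectOutputStream(raw);
        try {
            out.writeObject(list);
            out.flush();
        } finally {
            if (!toStdout) {
                out.close();   // leave System.out open for the caller
            }
        }
    }

    // Reads one object from the file, or from System.in when the file is named "-".
    static Object load(File file) throws IOException, ClassNotFoundException {
        boolean fromStdin = file.toString().equals("-");
        InputStream raw = fromStdin ? System.in : new FileInputStream(file);
        ObjectInputStream in = new ObjectInputStream(raw);
        try {
            return in.readObject();
        } finally {
            if (!fromStdin) {
                in.close();    // leave System.in open for the caller
            }
        }
    }
}
```

Passing new File("-") routes the data through the standard streams, which lets a serialized list be piped between processes on the command line.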


size

public int size()

getDataClass

public java.lang.Class getDataClass()
Returns the class of the object contained in the data field of the first Instance in this list.


getPipe

public Pipe getPipe()
Returns the pipe through which each added Instance is passed, which may be null.


getDataAlphabet

public Alphabet getDataAlphabet()
Returns the Alphabet mapping features of the data to integers.


getTargetAlphabet

public Alphabet getTargetAlphabet()
Returns the Alphabet mapping target output labels to integers.


targetLabelDistribution

public LabelVector targetLabelDistribution()

pipeOutputAccumulate

public void pipeOutputAccumulate(Instance carrier,
                                 Pipe iteratedPipe)
Specified by:
pipeOutputAccumulate in interface PipeOutputAccumulator

clonePipeOutputAccumulator

public PipeOutputAccumulator clonePipeOutputAccumulator()
Specified by:
clonePipeOutputAccumulator in interface PipeOutputAccumulator

iterator

public InstanceList.Iterator iterator()

crossValidationIterator

public InstanceList.CrossValidationIterator crossValidationIterator(int nfolds,
                                                                    int seed)

crossValidationIterator

public InstanceList.CrossValidationIterator crossValidationIterator(int nfolds)
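
The training/testing pairs produced for cross validation can be pictured with a small sketch over a plain java.util.List. The fold assignment used here (shuffle with the seed, then hold out element i in fold i % nfolds) is an assumption chosen for illustration and may differ from what CrossValidationIterator does; CrossValidationSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper; not part of MALLET.
class CrossValidationSketch {
    // Returns nfolds pairs; pair.get(0) is the training list, pair.get(1) the testing list.
    static <T> List<List<List<T>>> folds(List<T> items, int nfolds, int seed) {
        if (nfolds < 2) {
            throw new IllegalArgumentException("nfolds must be at least 2");
        }
        // Shuffle a copy so the input list is left unchanged.
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));

        List<List<List<T>>> pairs = new ArrayList<>();
        for (int fold = 0; fold < nfolds; fold++) {
            List<T> train = new ArrayList<>();
            List<T> test = new ArrayList<>();
            for (int i = 0; i < shuffled.size(); i++) {
                // Element i is held out in exactly one fold: fold number i % nfolds.
                if (i % nfolds == fold) {
                    test.add(shuffled.get(i));
                } else {
                    train.add(shuffled.get(i));
                }
            }
            List<List<T>> pair = new ArrayList<>();
            pair.add(train);
            pair.add(test);
            pairs.add(pair);
        }
        return pairs;
    }
}
```

With ten items and five folds, each pair in this sketch holds eight training and two testing elements, and every element appears in exactly one testing list.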

add

public void add(PipeInputIterator pi)
Adds to this list every instance generated by the iterator, passing each one through this list's pipe.


add

public void add(InstanceList ilist)

Adds to this list each instance in the input list.

The lists' pipes must match, except that this list's pipe is allowed to be "not yet set", and the input list's pipe is allowed to be null.


add

public boolean add(java.lang.Object data,
                   java.lang.Object target,
                   java.lang.Object name,
                   java.lang.Object source,
                   double instanceWeight)
Constructs and appends an instance to this list, passing it through this list's pipe and assigning it the specified weight.

Returns:
true

add

public boolean add(java.lang.Object data,
                   java.lang.Object target,
                   java.lang.Object name,
                   java.lang.Object source)
Constructs and appends an instance to this list, passing it through this list's pipe. Default weight is 1.0.

Returns:
true

add

public boolean add(Instance instance)
Appends the instance to this list.

Returns:
true

add

public boolean add(Instance instance,
                   double instanceWeight)
Appends the instance to this list, assigning it the specified weight.

Returns:
true