edu.umass.cs.mallet.base.types
Class PagedInstanceList

java.lang.Object
  extended by edu.umass.cs.mallet.base.types.InstanceList
      extended by edu.umass.cs.mallet.base.types.PagedInstanceList
All Implemented Interfaces:
PipeOutputAccumulator, java.io.Serializable

public class PagedInstanceList
extends InstanceList

Note: the split() methods are still unreliable.

An InstanceList that avoids OutOfMemoryErrors by saving Instances to disk when there is not enough memory to create a new Instance. It implements a fixed-size paging scheme in which each page on disk stores instancesPerPage Instances. So, while the number of Instances per page is constant, the size in bytes of each page may vary. Using this class instead of InstanceList means the number of Instances you can store is essentially limited only by disk size (and patience).

The paging scheme is optimized for the most frequent case of looping through the InstanceList from index 0 to n. Instances are assigned to pages in index order: instances 0 to instancesPerPage-1 are stored together on page 1, instances instancesPerPage to 2*instancesPerPage-1 are on page 2, and so on. This way, Instances adjacent in the list will usually be on the same page.

The paging scheme also tries to keep only one page in memory at a time. The justification is that the page size is near the limit of the maximum number of Instances that can be kept in memory. Since we assume the frequent case is looping from instance 0 to n, keeping other Instances in memory would be a waste of resources.

About instancesPerPage: if instancesPerPage = -1, its value is set automatically as follows. When the first OutOfMemoryError is thrown, count how many Instances are currently in memory, then divide by two. This is a conservative estimate of how many Instance objects can fit in memory simultaneously. If you know this value beforehand, simply pass it to the constructor.

NOTE: The event that causes an OutOfMemoryError is the instantiation of a new Instance, not the addition of this Instance to an InstanceList. Therefore, if you want to avoid OutOfMemoryErrors, let PagedInstanceList instantiate the new Instance for you. In other words, do this:

    Pipe p = ...;
    PagedInstanceList ilist = new PagedInstanceList (p);
    ilist.add (data, target, name, source);

or this:

    PipeInputIterator iter = ...;
    Pipe p = ...;
    PagedInstanceList ilist = new PagedInstanceList (p);
    ilist.add (iter);

but not this:

    Pipe p = ...;
    PagedInstanceList ilist = new PagedInstanceList (p);
    ilist.add (new Instance (data, target, name, source));

If memory is low, the last example will throw an OutOfMemoryError before control has been passed to PagedInstanceList to catch the error.

NOTE ALSO: To save write time, we do not write the same Instance to disk more than once; i.e., there are no dirty bits or write-throughs. This assumes that after an Instance has been passed through its Pipe, it is no longer modified. One way around this is to call PagedInstanceList.setInstance (int index, Instance instance), which will overwrite an Instance that has been paged to disk.
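
As a rough illustration of the fixed-size paging described above, the sketch below maps an instance index to a page, assuming pages are filled in index order with instancesPerPage entries each. The names PageMath, pageOf and pageCount are hypothetical and are not part of MALLET; page numbers in the sketch are zero-based.

```java
/** Hypothetical helper for illustration only; not part of MALLET. */
final class PageMath {
    private PageMath() {}

    /** Zero-based page that would hold the instance at the given index. */
    static int pageOf(int index, int instancesPerPage) {
        if (index < 0 || instancesPerPage <= 0) {
            throw new IllegalArgumentException("index must be >= 0 and instancesPerPage > 0");
        }
        return index / instancesPerPage;
    }

    /** Number of pages needed for n instances; the last page may be only partly full. */
    static int pageCount(int n, int instancesPerPage) {
        if (n < 0 || instancesPerPage <= 0) {
            throw new IllegalArgumentException("n must be >= 0 and instancesPerPage > 0");
        }
        return (n + instancesPerPage - 1) / instancesPerPage;
    }
}
```

With 100 instances per page, indices 0 through 99 share one page and index 100 starts the next, which is why a front-to-back loop touches each page only once.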

See Also:
InstanceList, Serialized Form

Nested Class Summary
 
Nested classes inherited from class edu.umass.cs.mallet.base.types.InstanceList
InstanceList.CrossValidationIterator, InstanceList.Iterator, InstanceList.Stream
 
Constructor Summary
PagedInstanceList()
           
PagedInstanceList(Pipe pipe)
           
PagedInstanceList(Pipe pipe, int size)
           
PagedInstanceList(Pipe pipe, int size, int instancesPerPage, java.io.File swapDir)
          Creates a PagedInstanceList that swaps pages of "instancesPerPage" Instances to disk in directory "swapDir" when there is not enough memory to create a new Instance.
 
Method Summary
 boolean add(Instance instance)
          Appends the instance to this list.
 boolean add(java.lang.Object data, java.lang.Object target, java.lang.Object name, java.lang.Object source, double instanceWeight)
          Constructs and appends an instance to this list, passing it through this list's pipe and assigning it the specified weight.
 void add(PipeInputIterator pi)
          Adds to this list every instance generated by the iterator, passing each one through this list's pipe.
 InstanceList cloneEmpty()
           
 boolean collectGarbage()
           
 Instance getInstance(int index)
          Returns the Instance at the specified index.
static InstanceList load(java.io.File file)
          Constructs a new InstanceList, deserialized from file.
 InstanceList sampleWithReplacement(java.util.Random r, int numSamples)
          Overridden to add samples in original order to reduce thrashing.
 InstanceList sampleWithWeights(java.util.Random r, double[] weights)
          Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the given weights.
 void setCollectGarbage(boolean b)
           
 void setInstance(int index, Instance instance)
          Replaces the Instance at position index with a new one.
 InstanceList shallowClone()
           
 InstanceList[] split(double[] proportions)
           
 InstanceList[] split(java.util.Random r, double[] proportions)
          Shuffles the elements of this list among several smaller lists.
 InstanceList[] splitByModulo(int m)
          Returns a pair of new lists such that the first list in the pair contains every mth element of this list, starting with the first.
 void swapOutAll()
          Save all instances to disk and set to null to free memory.
 
Methods inherited from class edu.umass.cs.mallet.base.types.InstanceList
add, add, add, clonePipeOutputAccumulator, crossValidationIterator, crossValidationIterator, get, getDataAlphabet, getDataClass, getFeatureSelection, getInstanceWeight, getPerLabelFeatureSelection, getPipe, getTargetAlphabet, iterator, noisify, pipeOutputAccumulate, removeSources, removeTargets, sampleWithInstanceWeights, save, setFeatureSelection, setInstanceWeight, setPerLabelFeatureSelection, size, splitInOrder, subList, targetLabelDistribution
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PagedInstanceList

public PagedInstanceList(Pipe pipe,
                         int size,
                         int instancesPerPage,
                         java.io.File swapDir)
Creates a PagedInstanceList that swaps pages of "instancesPerPage" Instances to disk in directory "swapDir" when there is not enough memory to create a new Instance.

Parameters:
pipe - instance pipe
instancesPerPage - number of Instances to store in each page. If -1, the value is determined at the first call to swapOutExcept.
swapDir - where the pages on disk live.
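
The automatic choice of instancesPerPage can be sketched as follows: when memory first runs out, take half the number of Instances currently in memory. The class PageSizeEstimator and its method are hypothetical names used for illustration, and the lower bound of 1 is an assumption of this sketch rather than documented behavior.

```java
/** Hypothetical sketch of the automatic page-size heuristic; not MALLET code. */
final class PageSizeEstimator {
    private int instancesPerPage;

    PageSizeEstimator(int instancesPerPage) {
        this.instancesPerPage = instancesPerPage; // -1 means "decide at the first OutOfMemoryError"
    }

    /**
     * Called when memory first runs out. If the page size was left at -1,
     * fix it to half the number of instances currently in memory.
     */
    int onOutOfMemory(int instancesInMemory) {
        if (instancesPerPage == -1) {
            // Assumption of this sketch: never let the page size drop below 1.
            instancesPerPage = Math.max(1, instancesInMemory / 2);
        }
        return instancesPerPage;
    }
}
```

An explicitly supplied page size is left untouched; only the -1 sentinel triggers the estimate, and the estimate is made once.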

PagedInstanceList

public PagedInstanceList(Pipe pipe,
                         int size)

PagedInstanceList

public PagedInstanceList(Pipe pipe)

PagedInstanceList

public PagedInstanceList()

Method Detail

split

public InstanceList[] split(java.util.Random r,
                            double[] proportions)
Shuffles the elements of this list among several smaller lists. Overrides InstanceList.split to add instances in original order, to prevent thrashing.

Overrides:
split in class InstanceList
Parameters:
proportions - A list of numbers (not necessarily summing to 1) which, when normalized, correspond to the proportion of elements in each returned sublist.
r - The source of randomness to use in shuffling.
Returns:
one InstanceList for each element of proportions
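
One way to shuffle elements among sublists while still reading the source list in original order is to shuffle the index positions first, record which sublist each index belongs to, and then make a single front-to-back pass. The sketch below works on plain indices; OrderedSplit is a hypothetical name, and the rounding rule for sublist sizes is an assumption, not necessarily what MALLET does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: random split that visits the source in original order. */
final class OrderedSplit {
    static List<List<Integer>> split(int n, Random r, double[] proportions) {
        double total = 0;
        for (double p : proportions) {
            if (p < 0) throw new IllegalArgumentException("proportions must be non-negative");
            total += p;
        }
        if (total <= 0) throw new IllegalArgumentException("proportions must sum to a positive value");

        // Random order in which indices are handed out to the sublists.
        List<Integer> shuffled = new ArrayList<>();
        for (int i = 0; i < n; i++) shuffled.add(i);
        Collections.shuffle(shuffled, r);

        // assignment[index] = sublist that receives the element at that index.
        int[] assignment = new int[n];
        double cumulative = 0;
        int start = 0;
        for (int k = 0; k < proportions.length; k++) {
            cumulative += proportions[k];
            // Assumed rounding rule: boundaries at the rounded normalized cumulative share.
            int end = (k == proportions.length - 1)
                    ? n
                    : Math.min(n, (int) Math.round(cumulative / total * n));
            for (int j = start; j < end; j++) assignment[shuffled.get(j)] = k;
            start = Math.max(start, end);
        }

        List<List<Integer>> parts = new ArrayList<>();
        for (int k = 0; k < proportions.length; k++) parts.add(new ArrayList<>());
        // Single sequential pass, so a paged source is read one page after another.
        for (int i = 0; i < n; i++) parts.get(assignment[i]).add(i);
        return parts;
    }
}
```

Membership in each sublist is random, but every sublist ends up in ascending index order because the final pass runs from 0 to n-1.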

split

public InstanceList[] split(double[] proportions)
Overrides:
split in class InstanceList

splitByModulo

public InstanceList[] splitByModulo(int m)
Returns a pair of new lists such that the first list in the pair contains every mth element of this list, starting with the first. The second list contains all remaining elements. Overrides InstanceList.splitByModulo to use PagedInstanceLists.

Overrides:
splitByModulo in class InstanceList
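
The modulo rule described above can be sketched on plain indices as follows. ModuloSplit is a hypothetical name; the real method returns InstanceLists rather than lists of indices.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the every-mth-element split, using indices only. */
final class ModuloSplit {
    static List<List<Integer>> splitByModulo(int n, int m) {
        if (m <= 0) throw new IllegalArgumentException("m must be > 0");
        List<Integer> everyMth = new ArrayList<>(); // indices 0, m, 2m, ...
        List<Integer> rest = new ArrayList<>();     // all remaining indices
        for (int i = 0; i < n; i++) {
            if (i % m == 0) {
                everyMth.add(i);
            } else {
                rest.add(i);
            }
        }
        List<List<Integer>> pair = new ArrayList<>();
        pair.add(everyMth);
        pair.add(rest);
        return pair;
    }
}
```

For n = 7 and m = 3 the first list holds indices 0, 3, 6 and the second holds 1, 2, 4, 5.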

sampleWithReplacement

public InstanceList sampleWithReplacement(java.util.Random r,
                                          int numSamples)
Overridden to add samples in original order to reduce thrashing.

Overrides:
sampleWithReplacement in class InstanceList
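
A minimal sketch of the idea, assuming the approach is to draw all sample indices first and then sort them so the source list is read front to back. OrderedSample is a hypothetical name, and the sketch returns indices rather than Instances.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch: sample with replacement, then visit in original order. */
final class OrderedSample {
    static int[] sampleIndices(Random r, int listSize, int numSamples) {
        if (listSize <= 0 || numSamples < 0) {
            throw new IllegalArgumentException("listSize must be > 0 and numSamples >= 0");
        }
        int[] picks = new int[numSamples];
        for (int i = 0; i < numSamples; i++) {
            picks[i] = r.nextInt(listSize); // duplicates are allowed
        }
        // Sorting means consecutive reads tend to hit the same page.
        Arrays.sort(picks);
        return picks;
    }
}
```

Fetching the sampled elements in sorted order loads each page at most once, whereas fetching them in draw order could reload pages repeatedly.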

sampleWithWeights

public InstanceList sampleWithWeights(java.util.Random r,
                                      double[] weights)
Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the given weights. The length of the weight array must be the same as the length of this list. The new instances all have their weights set to one.

Overrides:
sampleWithWeights in class InstanceList
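
Weighted sampling with replacement can be sketched with a cumulative-sum lookup, as below. WeightedSample is a hypothetical name, the sketch returns indices rather than Instances, and the linear scan is chosen for clarity rather than taken from the MALLET source.

```java
import java.util.Random;

/** Hypothetical sketch of weighted sampling with replacement over indices. */
final class WeightedSample {
    /** Draws weights.length indices, each chosen with probability proportional to its weight. */
    static int[] sampleIndices(Random r, double[] weights) {
        int n = weights.length;
        double[] cumulative = new double[n];
        double total = 0;
        for (int i = 0; i < n; i++) {
            if (weights[i] < 0) throw new IllegalArgumentException("weights must be non-negative");
            total += weights[i];
            cumulative[i] = total;
        }
        if (total <= 0) throw new IllegalArgumentException("weights must sum to a positive value");

        int[] picks = new int[n];
        for (int s = 0; s < n; s++) {
            double u = r.nextDouble() * total; // uniform in [0, total)
            int idx = 0;
            // Advance to the first index whose cumulative weight exceeds u.
            while (idx < n - 1 && cumulative[idx] <= u) idx++;
            picks[s] = idx;
        }
        return picks;
    }
}
```

An index with weight zero is never selected, and the result always has the same length as the weight array.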

swapOutAll

public void swapOutAll()
Save all instances to disk and set to null to free memory.


getInstance

public Instance getInstance(int index)
Returns the Instance at the specified index. If this Instance is not in memory, swap a block of instances back into memory.

Overrides:
getInstance in class InstanceList
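
To make the swap-in behavior concrete, here is a small simulation that keeps a single page resident and counts how often a page has to be loaded. OnePageCache is a hypothetical class, and a Map stands in for the page files on disk; nothing here is taken from the MALLET source.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-resident-page store; a Map stands in for page files on disk. */
final class OnePageCache {
    private final Map<Integer, String[]> disk = new HashMap<>();
    private final int instancesPerPage;
    private final int size;
    private String[] resident;      // the one page currently in memory
    private int residentPage = -1;  // -1 means nothing is loaded yet
    int swapIns = 0;                // number of page loads, for illustration

    OnePageCache(String[] all, int instancesPerPage) {
        if (instancesPerPage <= 0) throw new IllegalArgumentException("instancesPerPage must be > 0");
        this.instancesPerPage = instancesPerPage;
        this.size = all.length;
        for (int start = 0; start < all.length; start += instancesPerPage) {
            int end = Math.min(all.length, start + instancesPerPage);
            disk.put(start / instancesPerPage, Arrays.copyOfRange(all, start, end));
        }
    }

    String get(int index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException("index: " + index);
        int page = index / instancesPerPage;
        if (page != residentPage) {
            // The requested element is not in memory: swap its whole page in.
            resident = disk.get(page);
            residentPage = page;
            swapIns++;
        }
        return resident[index % instancesPerPage];
    }
}
```

Scanning 10 elements in order with 4 per page loads 3 pages, while alternating between the first and last element loads a page on every access. This is the thrashing that the overridden split and sampling methods try to avoid.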

setInstance

public void setInstance(int index,
                        Instance instance)
Replaces the Instance at position index with a new one. Note that this is the only sanctioned way of changing an Instance.

Overrides:
setInstance in class InstanceList
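
The following simulation suggests why setInstance is the only sanctioned way to change an element when pages are written to disk only once. WriteOncePages and mutateInMemoryOnly are hypothetical names, a Map stands in for disk, and Strings stand in for Instances; it is a sketch of the write-once assumption, not MALLET's implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical write-once page store; a Map stands in for page files on disk. */
final class WriteOncePages {
    private final Map<Integer, String[]> disk = new HashMap<>();
    private final int instancesPerPage;
    private String[] resident;
    private int residentPage = -1;

    WriteOncePages(String[] all, int instancesPerPage) {
        this.instancesPerPage = instancesPerPage;
        for (int start = 0; start < all.length; start += instancesPerPage) {
            int end = Math.min(all.length, start + instancesPerPage);
            disk.put(start / instancesPerPage, Arrays.copyOfRange(all, start, end));
        }
    }

    private void load(int page) {
        if (page != residentPage) {
            // Evicting the old page writes nothing back: no dirty bits are tracked.
            resident = disk.get(page).clone();
            residentPage = page;
        }
    }

    String get(int index) {
        load(index / instancesPerPage);
        return resident[index % instancesPerPage];
    }

    /** Models a change made behind the list's back: only the in-memory copy is touched. */
    void mutateInMemoryOnly(int index, String value) {
        load(index / instancesPerPage);
        resident[index % instancesPerPage] = value;
    }

    /** Models the sanctioned path: update memory and rewrite the stored page. */
    void set(int index, String value) {
        int page = index / instancesPerPage;
        load(page);
        resident[index % instancesPerPage] = value;
        disk.put(page, resident.clone());
    }
}
```

A change made through mutateInMemoryOnly disappears as soon as another page is loaded, while a change made through set survives eviction and reload.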

add

public boolean add(Instance instance)
Appends the instance to this list. Note that since memory for the Instance has already been allocated, no check is made to catch OutOfMemoryError.

Overrides:
add in class InstanceList
Returns:
true if successful

add

public void add(PipeInputIterator pi)
Adds to this list every instance generated by the iterator, passing each one through this list's pipe. Checks are made to ensure an OutOfMemoryError is not thrown when instantiating a new Instance.

Overrides:
add in class InstanceList

add

public boolean add(java.lang.Object data,
                   java.lang.Object target,
                   java.lang.Object name,
                   java.lang.Object source,
                   double instanceWeight)
Constructs and appends an instance to this list, passing it through this list's pipe and assigning it the specified weight. Checks are made to ensure an OutOfMemoryError is not thrown when instantiating a new Instance.

Overrides:
add in class InstanceList
Returns:
true
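
The reason for constructing the Instance inside the list can be sketched with a factory callback: the list calls the factory itself, so an OutOfMemoryError raised during construction surfaces where the list can react. GuardedAdder is a hypothetical class, and the retry-once policy is an assumption of this sketch. Catching OutOfMemoryError is generally fragile in Java, so treat this as an illustration of control flow only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch: build the element inside the list so allocation failures can be handled. */
final class GuardedAdder<T> {
    private final List<T> inMemory = new ArrayList<>();
    int swapOuts = 0; // number of times memory was freed, for illustration

    boolean add(Supplier<T> factory) {
        T item;
        try {
            item = factory.get();
        } catch (OutOfMemoryError e) {
            // Assumed policy: free resident elements, then retry exactly once.
            swapOutAll();
            item = factory.get();
        }
        inMemory.add(item);
        return true;
    }

    private void swapOutAll() {
        // A real implementation would serialize the resident elements to disk first.
        inMemory.clear();
        swapOuts++;
    }

    int residentCount() {
        return inMemory.size();
    }
}
```

If the caller had constructed the element before calling add, the error would have been thrown in the caller's code and the list would have had no chance to swap anything out.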

setCollectGarbage

public void setCollectGarbage(boolean b)

collectGarbage

public boolean collectGarbage()

shallowClone

public InstanceList shallowClone()
Overrides:
shallowClone in class InstanceList

cloneEmpty

public InstanceList cloneEmpty()
Overrides:
cloneEmpty in class InstanceList

load

public static InstanceList load(java.io.File file)
Constructs a new InstanceList, deserialized from file. If the string value of file is "-", then deserialize from System.in.
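
The "-" convention can be sketched as a small helper that chooses the input stream before deserialization. LoadSource and open are hypothetical names; the sketch covers only stream selection and makes no claim about how MALLET reads the serialized list.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper: pick standard input when the file is named "-". */
final class LoadSource {
    static InputStream open(File file) {
        if ("-".equals(file.toString())) {
            return System.in; // read the serialized data from standard input
        }
        try {
            return new FileInputStream(file);
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned stream could then be wrapped in a java.io.ObjectInputStream to read the serialized object.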