edu.umass.cs.mallet.base.extract
Class StringTokenization
java.lang.Object
edu.umass.cs.mallet.base.types.TokenSequence
edu.umass.cs.mallet.base.extract.StringTokenization
- All Implemented Interfaces:
- PipeOutputAccumulator, Sequence, java.io.Serializable, Tokenization
- public class StringTokenization
- extends TokenSequence
- implements Tokenization
- See Also:
- Serialized Form
Method Summary |
java.lang.Object |
getDocument()
Returns the document of which this is a tokenization. |
Span |
getSpan(int i)
|
Span |
subspan(int firstToken, int lastToken)
Returns a span formed by concatenating the spans from firstToken (inclusive) to lastToken (exclusive). |
Methods inherited from class edu.umass.cs.mallet.base.types.TokenSequence |
add, add, addAll, addAll, addAll, clonePipeOutputAccumulator, get, getNumericProperty, getProperty, getToken, hasProperty, iterator, pipeOutputAccumulate, remove, removeLastToken, setNumericProperty, setProperty, size, toFeatureSequence, toFeatureVector, toString |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Methods inherited from interface edu.umass.cs.mallet.base.types.Sequence |
get, size |
StringTokenization
public StringTokenization(java.lang.CharSequence seq)
- Creates an empty StringTokenization (one with no tokens yet) for the given character sequence.
StringTokenization
public StringTokenization(java.lang.CharSequence string, CharSequenceLexer lexer)
- Creates a tokenization of the given string. Tokens are added from all the matches of the given lexer.
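The idea of building a tokenization from every lexer match can be sketched with the standard library alone. The following is a minimal illustration, not the MALLET implementation: `java.util.regex.Pattern` stands in for `CharSequenceLexer`, and the names `LexerTokenizationSketch`, `CharSpan`, and `tokenize` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LexerTokenizationSketch {
    /** Hypothetical stand-in for a span: [start, end) character offsets into the document. */
    static final class CharSpan {
        final int start;
        final int end;

        CharSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }

        /** Returns the text this span covers in the given document. */
        String text(CharSequence document) {
            return document.subSequence(start, end).toString();
        }
    }

    /** Collects one span per regex match, in document order. */
    static List<CharSpan> tokenize(CharSequence document, Pattern tokenPattern) {
        List<CharSpan> spans = new ArrayList<>();
        Matcher matcher = tokenPattern.matcher(document);
        while (matcher.find()) {
            spans.add(new CharSpan(matcher.start(), matcher.end()));
        }
        return spans;
    }

    public static void main(String[] args) {
        String document = "Tokens are added from all matches.";
        // Assumption: a token is a maximal run of word characters.
        for (CharSpan span : tokenize(document, Pattern.compile("\\w+"))) {
            System.out.println(span.start + "-" + span.end + " " + span.text(document));
        }
    }
}
```

Each token keeps its character offsets, which is what allows a span to be mapped back onto the original document later.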
subspan
public Span subspan(int firstToken, int lastToken)
- Description copied from interface: Tokenization
- Returns a span formed by concatenating the spans of tokens firstToken (inclusive) through lastToken (exclusive).
In more detail:
- The start of the new span is the start index of getSpan(firstToken).
- The end of the new span is the start index of getSpan(lastToken).
- Unless firstToken == lastToken, the new span completely includes getSpan(firstToken).
- The new span never intersects getSpan(lastToken).
- If firstToken == lastToken, the new span contains no text.
- Specified by: subspan in interface Tokenization
- Parameters:
firstToken
- The index of the first token in the new span (inclusive).
This is an index of a token, *not* an index into the document.
lastToken
- The index of the token just past the end of the new span (exclusive); the span ends where this token starts.
This is an index of a token, *not* an index into the document.
- Returns:
- A span into this tokenization's document
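The contract above can be illustrated with a small, self-contained sketch. It models token spans as plain `{start, end}` character offsets instead of MALLET `Span` objects. The class `SubspanSketch`, its `subspan` helper, and the treatment of a token index equal to the token count (taken here to mean "end of document") are assumptions made for illustration, not behavior confirmed by this page.

```java
class SubspanSketch {
    /**
     * Returns {start, end} character offsets for tokens firstToken (inclusive)
     * through lastToken (exclusive), following the documented rules: the span
     * starts where token firstToken starts and ends where token lastToken starts.
     */
    static int[] subspan(int[][] tokenSpans, int documentLength, int firstToken, int lastToken) {
        int start = startOf(tokenSpans, documentLength, firstToken);
        int end = startOf(tokenSpans, documentLength, lastToken);
        return new int[] { start, end };
    }

    /** Assumption: an index equal to the token count maps to the end of the document. */
    private static int startOf(int[][] tokenSpans, int documentLength, int tokenIndex) {
        return tokenIndex < tokenSpans.length ? tokenSpans[tokenIndex][0] : documentLength;
    }

    public static void main(String[] args) {
        String document = "the quick brown fox";
        // Character offsets of "the", "quick", "brown", "fox".
        int[][] tokenSpans = { {0, 3}, {4, 9}, {10, 15}, {16, 19} };

        int[] span = subspan(tokenSpans, document.length(), 1, 3);
        // Covers "quick brown " including the trailing space, because the span
        // ends at the start of token 3 rather than at the end of token 2.
        System.out.println("[" + document.substring(span[0], span[1]) + "]");

        int[] empty = subspan(tokenSpans, document.length(), 2, 2);
        // firstToken == lastToken gives a span with no text.
        System.out.println(empty[1] - empty[0]);
    }
}
```

Because the end offset is the start of token lastToken, any text between the last included token and the next one (whitespace in this example) falls inside the span.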
getSpan
public Span getSpan(int i)
- Specified by: getSpan in interface Tokenization
getDocument
public java.lang.Object getDocument()
- Description copied from interface: Tokenization
- Returns the document of which this is a tokenization.
- Specified by: getDocument in interface Tokenization
- Returns: