edu.umass.cs.mallet.base.extract
Class HierarchicalTokenizationFilter
java.lang.Object
edu.umass.cs.mallet.base.extract.HierarchicalTokenizationFilter
- All Implemented Interfaces:
- TokenizationFilter
- public class HierarchicalTokenizationFilter
- extends java.lang.Object
- implements TokenizationFilter
Tokenization filter that creates nested spans based on a hierarchical labeling of the data.
The labels should be of the form LBL1[|LBLk]*. For example,
A    A|B  A|B|C  A|B|C  A|B  A    A
w1   w2   w3     w4     w5   w6   w7
will result in LabeledSpans like
<A>w1 <B>w2 <C>w3 w4</C> w5</B> w6 w7</A>
Labels of the form <B-field> force a new instance of the field to begin,
even if one is already active. The prefix I- is ignored, so BIO labeling can be used.
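The nesting rule described above can be illustrated with a small standalone sketch. It is not the MALLET implementation and does not use any MALLET classes: the class NestedSpanSketch and its methods render, name, and closeDownTo are hypothetical names introduced only for illustration. It assumes labels are plain strings split on "|", that the B- and I- prefixes are written directly on a label part (for example B-A), and it does not model a background tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of MALLET.
final class NestedSpanSketch {

    // Renders tokens with nested tags derived from hierarchical labels.
    static String render(String[] labels, String[] tokens) {
        List<String> open = new ArrayList<>();   // currently open fields, outermost first
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            String[] path = labels[i].split("\\|");
            // Count how many open fields carry over to this token.
            // A B- prefix forces that field (and everything inside it) to restart.
            int keep = 0;
            while (keep < open.size() && keep < path.length
                    && !path[keep].startsWith("B-")
                    && open.get(keep).equals(name(path[keep]))) {
                keep++;
            }
            closeDownTo(open, keep, out);        // close fields that ended
            if (i > 0) {
                out.append(' ');
            }
            for (int k = keep; k < path.length; k++) {   // open fields that begin here
                String field = name(path[k]);
                open.add(field);
                out.append('<').append(field).append('>');
            }
            out.append(tokens[i]);
        }
        closeDownTo(open, 0, out);               // close whatever is still open
        return out.toString();
    }

    // Strips a leading B- or I- so that BIO-style labels map to the bare field name.
    static String name(String part) {
        boolean prefixed = part.startsWith("B-") || part.startsWith("I-");
        return prefixed ? part.substring(2) : part;
    }

    // Closes open fields, innermost first, until only 'depth' remain.
    static void closeDownTo(List<String> open, int depth, StringBuilder out) {
        while (open.size() > depth) {
            out.append("</").append(open.remove(open.size() - 1)).append('>');
        }
    }

    public static void main(String[] args) {
        String[] labels = {"A", "A|B", "A|B|C", "A|B|C", "A|B", "A", "A"};
        String[] tokens = {"w1", "w2", "w3", "w4", "w5", "w6", "w7"};
        // prints <A>w1 <B>w2 <C>w3 w4</C> w5</B> w6 w7</A>
        System.out.println(render(labels, tokens));
    }
}
```

Under the same assumptions, the labels B-A, I-A, B-A over tokens x, y, z render as <A>x y</A> <A>z</A>: the second B-A closes the active field and starts a new instance, while I-A simply continues it.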
Created: Nov 12, 2004
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
HierarchicalTokenizationFilter
public HierarchicalTokenizationFilter()
HierarchicalTokenizationFilter
public HierarchicalTokenizationFilter(java.util.regex.Pattern ignorePattern)
constructLabeledSpans
public LabeledSpans constructLabeledSpans(LabelAlphabet dict,
java.lang.Object document,
Label backgroundTag,
Tokenization input,
Sequence seq)
- Description copied from interface:
TokenizationFilter
- Converts the sequence of labels into a set of labeled spans. Essentially, this converts the
output of sequence labeling into an extraction output.
- Specified by:
constructLabeledSpans
in interface TokenizationFilter
- Parameters:
dict
document
backgroundTag
input
seq
- Returns: