LBJChunk: A Chunker (or "shallow parser") written with LBJ.
Cognitive Computations Group
University of Illinois at Urbana-Champaign
http://l2r.cs.uiuc.edu/~cogcomp
Nick Rizzolo <rizzolo@gmail.com>

This distribution contains the LBJ and Java source code for the LBJ chunker
(a.k.a. a "shallow parser").  To compile and/or run this software, you will
need both LBJ (version 2.1.0 or later) and the LBJ POS tagger installed on
your system and on your CLASSPATH.  While either of LBJ's distribution
packages (the source code or the jar files) will work, it will be easier to
use the POS tagger's jar distribution, since it has already been trained with
our training data.

While all the source code is included in this distribution, the training data
is not.


IF YOU DON'T HAVE TRAINING DATA:

You may be interested in downloading our trained chunker, derived from this
distribution's source files and training data from the CoNLL 2000 shared task
corpus.  Simply download LBJChunk.jar from the same webpage where you
downloaded this distribution and add it to your CLASSPATH.  Then see below for
usage instructions.


IF YOU HAVE TRAINING DATA:

If you have training data in the same format as the CoNLL 2000 shared task
corpus, you may edit the
src/edu/illinois/cs/cogcomp/lbj/chunk/Constants.java source file (it is very
simple) to point to the appropriate files.  Then simply type 'make' (or you
may need 'gmake' depending on the configuration of your system) to train and
compile the system.  Finally, put the full path to the class directory on your
CLASSPATH.  For example:

  setenv CLASSPATH ${CLASSPATH}:/home/user/LBJChunk/class

If your training data is in a different format, you'll need to replace the
CoNLL2000Parser class with one that parses the format of your data.  Then,
before following the steps above, replace the line of code in the chunk.lbj
source file that calls the parser.  If your data happens to be in the same
format as the Reuters 2003 corpus, the class Reuters2003Parser is already
provided in this distribution, and may simply be substituted for
CoNLL2000Parser in the aforementioned line of code.


USING THE CHUNKER

Testing
-------

Assuming the chunker's class files are on the $CLASSPATH, its performance can
be tested on test data labeled in the same format as the CoNLL 2000 corpus
with the following command:

  java edu.illinois.cs.cogcomp.lbj.chunk.ChunkTester <test data>

where <test data> is the path to the labeled test data.  This very simple
program makes use of the LBJ2.nlp.seg.BIOTester class, which collects
precision, recall, and F1 statistics over the _segments_ (i.e., chunks, in
this case) discovered by a "BIO" style classifier (such as this chunker; see
below for a description of the tags produced).

If your data has chunk labels but not part-of-speech (POS) tags, use the same
CoNLL 2000 corpus format with a single dash in place of each POS tag.  The
POS tags will then be computed automatically during feature extraction.
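
As a rough illustration of the dash convention, the sketch below rewrites
data that has only a word and a chunk label on each line into the three
column layout of the CoNLL 2000 corpus (word, POS tag, chunk tag, separated
by whitespace, with a blank line between sentences).  The two column input
layout and the class and method names are assumptions made for this example;
they are not part of this distribution.

```java
// Hypothetical helper, not part of LBJChunk: inserts "-" as the POS column.
class DashPosColumn {
    /**
     * Turns a "word chunkTag" line into "word - chunkTag".
     * Blank lines (sentence boundaries) are passed through as empty strings.
     */
    static String addDashPos(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] cols = trimmed.split("\\s+");
        if (cols.length != 2) {
            throw new IllegalArgumentException(
                "expected 2 columns (word, chunk tag): " + line);
        }
        return cols[0] + " - " + cols[1];
    }

    public static void main(String[] args) {
        // Assumed input: one token per line, blank line between sentences.
        String[] sample = { "He B-NP", "reckons B-VP", "", "Stocks B-NP" };
        for (String line : sample) {
            System.out.println(addDashPos(line));
        }
    }
}
```

With this sketch, the line "He B-NP" becomes "He - B-NP", which is the form
described above.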

Evaluating
----------

The LBJ runtime library contains a class that implements a general purpose
segmenter based on a word classifier that returns "BIO" style tags, such as
this chunker.  To invoke this program on a plain text file, type:

  java LBJ2.nlp.seg.SegmentTagPlain \
         edu.illinois.cs.cogcomp.lbj.chunk.Chunker <plain text file>

For more information about the SegmentTagPlain program, see the online javadoc
documentation:
http://l2r.cs.uiuc.edu/~cogcomp/software/LBJ2/library/LBJ2/nlp/seg/SegmentTagPlain.html

Importing
---------

This implementation uses the LBJ library's Token class to internally represent
the words whose chunk tags it computes.  If your Java application uses the
Token class as well, you can import the chunker and use it like so:

  // Begin Foo.java
  ...
  import edu.illinois.cs.cogcomp.lbj.chunk.Chunker;
  import LBJ2.nlp.seg.Token;
  ...
  public class Foo
  {
    ...
    void myMethod()
    {
      ...
      Chunker tagger = new Chunker();
      ...
      Token word = ...
      ...
      String tag = tagger.discreteValue(word);
      ...
    }
    ...
  }

Note that if your word object does not have its partOfSpeech field filled, the
LBJ POS tagger (which must be on your $CLASSPATH) will be loaded automatically
by the chunker to compute the tag for use as a feature.

Used as shown above, the chunker will return one of the following tags for
each word:

  Tag       Explanation: "The chunker predicts the word ..."
  B-ADJP      begins an adjective phrase.
  I-ADJP      is inside an adjective phrase.
  B-ADVP      begins an adverbial phrase.
  I-ADVP      is inside an adverbial phrase.
  B-CONJP     begins a conjunctive phrase.
  I-CONJP     is inside a conjunctive phrase.
  B-INTJ      begins an interjection.
  I-INTJ      is inside an interjection.
  B-LST       begins a list marker.
  I-LST       is inside a list marker.
  B-NP        begins a noun phrase.
  I-NP        is inside a noun phrase.
  B-PP        begins a prepositional phrase.
  I-PP        is inside a prepositional phrase.
  B-PRT       begins a particle.
  I-PRT       is inside a particle.
  B-SBAR      begins a subordinated clause.
  I-SBAR      is inside a subordinated clause.
  B-UCP       begins an unlike coordinated phrase.
  I-UCP       is inside an unlike coordinated phrase.
  B-VP        begins a verb phrase.
  I-VP        is inside a verb phrase.
  O           is outside of any chunk.
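
To show how these per-word tags combine into chunks, here is a minimal,
self-contained sketch that groups a tagged sentence into bracketed segments.
It is an illustration only, not the code used by BIOTester or
SegmentTagPlain.  The class name, the method name, and the choice to treat an
I- tag that does not continue the current chunk as the start of a new chunk
are all assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of LBJChunk or the LBJ library.
class BioChunks {
    /** Groups parallel arrays of words and BIO tags into "[TYPE w1 w2 ...]" strings. */
    static List<String> group(String[] words, String[] tags) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = null;  // chunk being built, or null if none
        String type = null;            // type of the chunk being built, e.g. "NP"
        for (int i = 0; i < words.length; i++) {
            String tag = tags[i];
            // An I- tag continues a chunk only if one of the same type is open.
            boolean continues = tag.startsWith("I-") && current != null
                && tag.substring(2).equals(type);
            if (continues) {
                current.append(' ').append(words[i]);
                continue;
            }
            // Otherwise any open chunk ends here.
            if (current != null) {
                chunks.add(current.append(']').toString());
                current = null;
                type = null;
            }
            // B- tags (and, by assumption, stray I- tags) open a new chunk.
            if (!tag.equals("O")) {
                type = tag.substring(2);
                current = new StringBuilder("[" + type + " " + words[i]);
            }
        }
        if (current != null) {
            chunks.add(current.append(']').toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] words = { "He", "reckons", "the", "deficit", "." };
        String[] tags  = { "B-NP", "B-VP", "B-NP", "I-NP", "O" };
        for (String chunk : group(words, tags)) {
            System.out.println(chunk);
        }
    }
}
```

In an application, the tags array would hold the strings returned by the
chunker for each word; words tagged O are simply left out of the result.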

