This code is currently broken. It has a subtle bug which produces invalid results, can make your code run exponentially & could use exponential memory.
DO NOT USE THIS LIBRARY!!!
Pull requests are welcome, but seeing as this is GitHub, nobody will care & the project is thus effectively abandoned. Contact [email protected] if you really need a functioning Probabilistic Earley Parser enough so that you are willing to fund it.
This is a library for parsing a sequence of tokens (like words) into tree structures, along with the probability that the particular sequence generates that tree structure. This is mainly useful for linguistic purposes, such as morphological parsing, speech recognition and generally information extraction. It also finds applications in computational biology.
For example:
- As a computational linguist, you want to derive the most probable interpretation of an English sentence
tokens | parse tree |
---|---|
[i, want, british, food] |
- As a computational biologist, you want to predict the secondary structure for an RNA sequence (which you can encode in a tree structure)
tokens | parse tree |
---|---|
GGGC``UAUU``AGCU``CAGU UGGU``UAGA``GCGC``ACCC CUGA``UAAG``GGUG``AGGU CGCU``GAUU``CGAA``UUCA GCAU``AGCC``CA |
-
As a cognitive scientist, you want to compute the cognitive load of interpreting sentences as a function of "surprises" that happen as you read a sentence
-
As a computational linguist, you want to know the most likely table of contents structure for a list of paragraphs
This library allows you to do these things efficiently, as long as you can describe the rules as a Context-free Grammar (CFG).
The innovation of this library with respect to the many other parsing libraries is that this one allows the production rules in your grammar to have a probability attached to them. That is: it parses sentences using Stochastic Context-free Grammars. This allows us to make better choices in case of ambiguous sentences, because we can order multiple interpretations of the same sentence by probability. Furthermore, this parser does not limit token types to strings.
For a theoretical grounding of this work, refer to Stolcke, An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities.
You can use this project as a library in your Java application or as a standalone command-line app.
Download the latest JAR
By default, the parser will assume that you distinguish non-terminals from terminals by capitalizing them. You can also add a custom category handler if you call the API from Java code.
Create a UTF8-encoded .cfg
file that contains your grammar, such as the following:
# grammar.cfg
S -> NP VP (0.8) # specify probability between 0 and 1 by appending between parentheses
S -> NP (0.2)
NP -> Det N # probability defaults to 1.0
NP -> Det Nom # all rules for a given category must sum to 1, so the builder normalizes probabilities to ensure they sum to 1.0
Nom -> Adj N
VP β V # Use '->' or 'β'
Det β the # probability defaults to 1.0
N β heavy (0.2)
Adj β heavy (0.8)
V β heave (0.8)
N β /heave[r]?/i (0.2) # You can specify terminals as regular expressions by enclosing them in '/'.
Execute runnable jar on the terminal:
java -jar probabilistic-earley-parser-jar-with-dependencies.jar -i grammar.cfg -goal S the heavy heave
This will give the Viterbi parse to the ambiguous sentence "the heavy heave":
0.128 (= 0.8 * 0.2 * 0.8)
βββ <start>
βββ S
βββ NP
β βββ Det
β β βββ the (the)
β βββ N
β βββ heavy (heavy)
βββ VP
βββ V
βββ heave (heave)
This is the parse with the semantic of "heavy people heave"
In contrast, the less likely parse was "a heave that is heavy":
0.032 (= 0.2 * 0.8 * 0.2)
βββ <start>
βββ S
βββ NP
βββ Det
β βββ the (the)
βββ Nom
βββ Adj
β βββ heavy (heavy)
βββ N
βββ heave (heave)
The command line interface is meant for quickly trying out a simple grammar. For actual real-life parsing stuff, you probably want to use the Java API.
Grab from Maven:
<dependencies>
<dependency>
<groupId>org.leibnizcenter</groupId>
<artifactId>probabilistic-earley-parser</artifactId>
<version>0.9.12</version>
</dependency>
</dependencies>
or Gradle:
compile 'org.leibnizcenter:probabilistic-earley-parser:0.9.12'
Or just include the the latest JAR in your project.
Most applications will want to interface with Parser
, which you instantiate with a grammar:
public class Example {
// NonTerminals are just wrappers around a string
private static final NonTerminal S = Category.nonTerminal("S");
private static final NonTerminal NP = Category.nonTerminal("NP");
private static final NonTerminal VP = Category.nonTerminal("VP");
private static final NonTerminal TV = Category.nonTerminal("TV");
private static final NonTerminal Det = Category.nonTerminal("Det");
private static final NonTerminal N = Category.nonTerminal("N");
private static final NonTerminal Mod = Category.nonTerminal("Mod");
// Terminal types are realized by implementing the Terminal interface, specifically the function hasCategory. Terminal is a functional interface.
// Note that tokens can be of multiple terminal types (homographs: "bank" as a noun or "bank" as a verb), so you can use this method to pool many words to a single terminal
private static final Terminal<String> transitiveVerb = token -> token.obj.matches("(hit|chased)");
// Some utility terminal types are pre-defined:
private static final Terminal<String> the = new CaseInsensitiveStringTerminal("the");
private static final Terminal<String> a = new CaseInsensitiveStringTerminal("a");
private static final Terminal<String> man = new ExactStringTerminal("man");
private static final Terminal<String> stick = new ExactStringTerminal("stick");
private static final Terminal<String> with = new ExactStringTerminal("with");
private static final Grammar grammar = new Grammar.Builder("test")
.setSemiring(LogSemiring.get()) // If not set, defaults to Log semiring which is probably what you want. The builder takes care of converting probabilties to semiring elements
.addRule(
1.0, // Probability between 0.0 and 1.0, defaults to 1.0
S, // Left hand side of the rule
NP, VP // Right hand side of the rule
)
.addRule(
NP,
Det, N // eg. The man
)
.addRule(
NP,
Det, N, Mod // eg. The man (with a stick)
)
.addRule(
0.4,
VP,
TV, NP, Mod // eg. (chased) (the man) (with a stick)
)
.addRule(
0.6,
VP,
TV, NP // eg. (chased) (the man with a stick)
)
.addRule(Det, a)
.addRule(Det, the)
.addRule(N, man)
.addRule(N, stick)
.addRule(TV, transitiveVerb)
.addRule(Mod, with, NP) // eg. with a stick
.build();
public static void main(String[] args) {
Parser<String> parser = new Parser<>(grammar);
System.out.println(
parser.recognize(S, Tokens.tokenize("The man chased the man \n\t with a stick")) // true
);
System.out.println(
parser.getViterbiParseWithScore(S, Tokens.tokenize("the", "stick", "chased", "the", "man")) // Most likely parse tree with probability
);
}
}
You can parse .cfg
files as follows:
Grammar<String> g = Grammar.fromString(Paths.get("path", "to", "grammar.cfg"), Charset.forName("UTF-8"));
One of the advantages of Earley parsing is the top-down control you can exert while parsing. You can pass the parser callbacks to influence the parsing process. Only use this if you really know what you're doing. It may mess up your results if you are not careful.
new ParseCallbacks.Builder()
.withOnPreScan((position, token, chart) -> System.out.println("Scan about to happen for token " token))
.withScanProbability((position, token) -> {
if (token.getCategories().contains(anUnexpectedTerminalForThisWord)) {
return grammar.semiring.fromProbability(0.5);
} else {
return grammar.semiring.one();
}
})
.withOnPostScan((position, token, chart) -> System.out.println("Scan happened for token " token))
.withOnPostComplete((position, token, chart) -> System.out.println("Complete happened for token " token))
.build()
The Earley algorithm has nice complexity properties. In particular, it can parse:
- any CFG in O(nΒ³),
- unambiguous CFGs in O(nΒ²)
- left-recursive unambiguous grammars in O(n). For example,
S β S a
is a left-recursive rule.
Note that this implementation does not apply innovations such as Joop Leo's improvement to run linearly on on right-recursive grammars as well. It might be complicated to implement these ideas and still have a probabilistic parser.
For an efficient parser that works only on non-probabilistic context-free grammars, look into Marpa. Marpa is a C library with a Perl interface, and a Lua interface is underway. It is currently painful to embed within a Java project, however.
Pull requests for these issues are welcome:
- I have not provisioned for Ξ΅-rules (empty right-hand sign). Issue.
- Rule probability estimation may be performed using the inside-outside algorithm, but is not currently implemented. Issue.
- Higher level concepts such as * and are not implemented
- Error handling / logging could be better, available as an experimental feature. Issue.
- Viterbi parsing only returns one single parse. In the case of an ambiguous sentence, the returned parse is not guaranteed the left-most parse.
- Behavior for strangely defined grammars is not defined, such as when the same rule is defined multiple times with a different probability. Issue
The probability of a parse is defined as the product of the probabilities of all the applied rules. Usually, we define probability as a number between 0 and 1 inclusive, and use common algebraic notions of addition and multiplication.
Implementing this naively can lead to problems: imagine a computation like 0.1 * 0.1 * ... * 0.1. At some point, floating point arithmetic will be unable to represent a number so small (arithmetic underflow). To counter, we map numbers between 0 and 1 to numbers between 0 and infinity with the Log semiring, which allows us to represent a lot more numbers (everything between 0 and Double.MAX_VALUE
).
This project makes it possible to use any commutative semiring that can have its elements represented as doubles. I can't really imagine a use case for using another semiring than the Log semiring, but who knows, maybe you can.
This software is licensed under a permissive MIT license.
- Stolcke, Andreas. "An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational linguistics 21.2 (1995): 165-201.
- Leo, Joop MIM. "A general context-free parsing algorithm running in linear time on every LR (k) grammar without using lookahead." Theoretical computer science 82.1 (1991): 165-176.
- Kegler, Jeffrey. "Marpa, A Practical General Parser: The Recognizer." (2012): 115.
Inquiries go to Maarten Trompper at [email protected]