The Comonad.Reader » Parsing
http://comonad.com/reader
types, (co)monads, substructural logic

A Parsec Full of Rats, Part 2
Edward Kmett, 24 September 2011
http://comonad.com/reader/2011/a-parsec-full-of-rats-part-2/

Last time, I showed that we can build a small parsec clone with packrat support.

This time I intend to implement packrat directly on top of Parsec 3.

One of the main topics of discussion around packrat parsing, ever since Bryan Ford's initial release of Pappy, has been the fact that in general you shouldn't use packrat to memoize every rule; instead you should apply Amdahl's law and look for the cases where the lookup time is paid back in terms of repetitive evaluation, computation time and the hit rate. This is great news for us, since we only want to memoize a handful of expensive combinators.

First, we'll need to import enough of Parsec to do something interesting.

 
{-# LANGUAGE RecordWildCards, ViewPatterns, FlexibleInstances, MultiParamTypeClasses #-}
 
import Text.Parsec
import qualified Text.Parsec.Token as T
import Text.Parsec.Token
    (GenLanguageDef(..), GenTokenParser(TokenParser))
import Text.Parsec.Pos (initialPos, updatePosChar)
import Data.Functor.Identity (Identity(..))
import Control.Applicative hiding ((<|>))
import Control.Monad.Fix (fix)
 

Then as before, we'll define PEG-style backtracking:

 
(</>) :: Monad m => ParsecT s u m a -> ParsecT s u m a ->
    ParsecT s u m a
p </> q = try p <|> q
infixl 3 </>
 

Now we need an analogue to our Result type from last time, which recalled whether or not we had consumed input, and what the current cursor location is. Fortunately, we can recycle the definitions from Parsec to this end.

 
type Result d a = Consumed (Reply d () a)
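-- For reference (a paraphrase of Text.Parsec.Prim, not the verbatim source;
-- strictness annotations omitted), the recycled types have roughly these shapes:
--
--   data Consumed a  = Consumed a | Empty a
--   data Reply s u a = Ok a (State s u) ParseError | Error ParseError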
 

We'll define a combinator to build a parser directly from a field accessor. Last time, this was just the use of the "Rat" constructor. Now it is a bit trickier, because we need to turn Consumed (Reply d () a) into m (Consumed (m (Reply d u a))) by wrapping it in the appropriate monad, and giving the user back his state unmolested.

 
rat :: Monad m => (d -> Result d a) -> ParsecT d u m a
rat f   = mkPT $ \s0 -> return $
    return . patch s0 <$> f (stateInput s0) where
  patch (State _ _ u) (Ok a (State s p _) err) = Ok a (State s p u) err
  patch _             (Error e)                = Error e
 

Last time we could go from a parser to a result just by applying the user stream type, but with Parsec we also have to supply its notion of a position. This leads to the following combinator. By running in the Identity monad with no user state, it should be obvious that we've duplicated the functionality of the previous 'Rat' parser (with the addition of a source position).

 
womp :: d -> SourcePos -> ParsecT d () Identity a -> Result d a
womp d pos p = fmap runIdentity . runIdentity $
    runParsecT p (State d pos ())
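-- For reference, the low-level Parsec 3 plumbing used above has roughly these
-- shapes (paraphrased, not verbatim):
--
--   runParsecT :: Monad m => ParsecT s u m a -> State s u -> m (Consumed (m (Reply s u a)))
--   mkPT       :: Monad m => (State s u -> m (Consumed (m (Reply s u a)))) -> ParsecT s u m a
--
-- With m ~ Identity and u ~ (), both layers of m collapse away, leaving exactly
-- our Result type.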
 

The combinator is so named because we needed a big space-rat rather than a little pack-rat to keep with the theme.

It's not impossible. I used to bullseye womp rats in my T-16 back home, they're not much bigger than two meters.

Now we'll write a bit of annoyingly verbose boilerplate to convince Parsec that we really want a LanguageDef for some monad other than Identity. (As an aside, why Text.Parsec.Language doesn't contain GenLanguageDefs that are parametric in their choice of Monad is beyond me.)

 
myLanguageDef :: Monad m => T.GenLanguageDef D u m
myLanguageDef = T.LanguageDef
  { commentStart    = "{-"
  , commentEnd      = "-}"
  , commentLine     = "--"
  , nestedComments  = True
  , identStart      = letter <|> char '_'
  , identLetter     = alphaNum <|> oneOf "_'"
  , opStart         = opLetter myLanguageDef
  , opLetter        = oneOf ":!#$%&*+./<=>?@\\^|-~"
  , reservedOpNames = []
  , reservedNames   = []
  , caseSensitive   = True
  }
 

As a shameless plug, trifecta offers a particularly nice solution to this problem, breaking up the monolithic Token type into separate concerns and letting you layer parser transformers that enrich the parser to deal with things like Haskell-style layout, literate comments, parsing comments in whitespace, etc.

And as one last bit of boilerplate, we'll abuse RecordWildCards once again to avoid the usual 20 lines of boilerplate that are expected of us, so we can get access to parsec's token parsers.

 
TokenParser {..} = T.makeTokenParser myLanguageDef
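-- Among other things, the record wildcard above brings 'whiteSpace',
-- 'reservedOp', 'parens' and 'natural' into scope at the top level; those are
-- the only token parsers we use below.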
 

Now we're ready to define our incredibly straightforward stream type:

 
data D = D
  { _add        :: Result D Integer
  , _mult       :: Result D Integer
  , _primary    :: Result D Integer
  , _dec        :: Result D Integer
  , _uncons     :: Maybe (Char, D)
  }
 
instance Monad m => Stream D m Char where
  uncons = return . _uncons
 

And using the general purpose rat combinator from earlier, we can write some memoized parsers:

 
add, mult, primary, dec :: Parsec D u Integer
add     = rat _add
mult    = rat _mult
primary = rat _primary
dec     = rat _dec
 

And finally, we write the code to tie the knot and build the stream:

 
prep :: SourceName -> String -> D
prep n = go (initialPos n) where
  go p s = fix $ \d -> let
    (womp d p -> _add) =
            (+) <$> mult <* reservedOp "+" <*> add
        </> mult <?> "summand"
    (womp d p -> _mult) =
            (*) <$> primary <* reservedOp "*" <*> mult
        </> primary <?> "factor"
    (womp d p -> _primary) =
            parens add
        </> dec <?> "number"
    (womp d p -> _dec) = natural
    _uncons = case s of
      (x:xs) -> Just (x, go (updatePosChar p x) xs)
      []     -> Nothing
    in D { .. }
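    -- Each D node caches the four Results computed (via 'womp') at that exact
    -- source position, and '_uncons' links to the node for the next position,
    -- so re-running a memoized rule at a position amounts to forcing a field
    -- that is computed at most once.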
 
runD :: Parsec D u a -> u -> SourceName -> String -> Either ParseError a
runD p u fn s = runParser p u fn (prep fn s)
 

and finally, let it rip:

 
eval :: String -> Integer
eval s = either (error . show) id $
    runD (whiteSpace *> add <* eof) () "-" s
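
-- A quick sanity check (the name 'testEval' is my own addition, not from the
-- original post): precedence and grouping come out as expected.
testEval :: Bool
testEval = eval "1 + 2 * 3" == 7
        && eval "(1 + 2) * 3" == 9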
 

While this approach tends to encourage memoizing fewer combinators than libraries such as frisby, this is exactly what current research suggests you probably should do with packrat parsing!

The other purported advantage of packrat parsers is that they can deal with left recursion in the grammar. However, that is not the case: as Tratt has shown, hidden left recursion under the algorithm used in the Scala parsing combinator libraries leads to incorrect, non-left-most parses.

I leave it as an exercise for the reader to extend this material with the parsec+iteratees approach from my original talk on trifecta to get packrat parsing of streaming input. Either that or you can wait until it is integrated into trifecta.

You can download the source to this (without the spurious spaces inserted by wordpress) here.

If I can find the time, I hope to address Scott and Johnstone's GLL parsers, which actually achieve the O(n^3) worst-case bounds touted for Tomita's GLR algorithm (which, as originally defined, is actually O(n^4) despite the author's claims), and how to encode them in Haskell. The aim is a memoizing parser combinator library that can parse LL(1) fragments in O(1), deal with arbitrary context-free grammars in O(n^3), and degrade reasonably gracefully in the presence of context sensitivity, while supporting hidden left recursion as long as such recursion passes through at least one memoized rule. This is important because context-free grammars are closed under extension, which is a nice property to have if you want a language where you can add new statement types easily without concerning yourself overmuch with the order in which you insert the rules or load the different extensions.

A Parsec Full of Rats, Part 1
Edward Kmett, 24 September 2011
http://comonad.com/reader/2011/a-parsec-full-of-rats/

You never heard of the Millennium Falcon? It's the ship that made the Kessel Run in 12 parsecs.

I've been working on a parser combinator library called trifecta, and so I decided I'd share some thoughts on parsing.

Packrat parsing (as provided by frisby, pappy, rats! and the Scala parsing combinators) and more traditional recursive descent parsers (like Parsec) are often held up as somehow different.

Today I'll show that you can add monadic parsing to a packrat parser, sacrificing asymptotic guarantees in exchange for the convenient context sensitivity, and conversely how you can easily add packrat parsing to a traditional monadic parser combinator library.

To keep this post self-contained, I'm going to start by defining a small packrat parsing library by hand, which acts rather like parsec in its backtracking behavior. First, some imports:

 
{-# LANGUAGE RecordWildCards, ViewPatterns, DeriveFunctor, FlexibleInstances #-}
import Control.Applicative
import Control.Monad (MonadPlus(..), guard)
import Control.Monad.Fix (fix)
import Data.Char (isDigit, digitToInt, isSpace)
import Data.List (foldl')
 

Second, we'll define a bog simple parser, which consumes an input stream of type d, yielding a possible answer and telling us whether or not it has actually consumed any input as it went.

 
newtype Rat d a = Rat { runRat :: d -> Result d a }
  deriving Functor
 
data Result d a
  = Pure a             -- didn't consume anything, can backtrack
  | Commit d a      -- consumed input
  | Fail String Bool -- failed, flagged if consumed
  deriving Functor
 

Now, we can finally implement some type classes:

 
instance Applicative (Rat d) where
  pure a = Rat $ \ _ -> Pure a
  Rat mf <*> Rat ma = Rat $ \ d -> case mf d of
    Pure f      -> fmap f (ma d)
    Fail s c    -> Fail s c
    Commit d' f -> case ma d' of
      Pure a       -> Commit d' (f a)
      Fail s _     -> Fail s True
      Commit d'' a -> Commit d'' (f a)
 

including an instance of Alternative that behaves like parsec, only backtracking on failure if no input was consumed.

 
instance Alternative (Rat d) where
  Rat ma <|> Rat mb = Rat $ \ d -> case ma d of
    Fail _ False -> mb d
    x            -> x
  empty = Rat $ \ _ -> Fail "empty" False
 

For those willing to forego the asymptotic guarantees of packrat, we'll offer a monad.

 
instance Monad (Rat d) where
  return a = Rat $ \_ -> Pure a
  Rat m >>= k = Rat $ \d -> case m d of
    Pure a -> runRat (k a) d
    Commit d' a -> case runRat (k a) d' of
      Pure b -> Commit d' b
      Fail s _ -> Fail s True
      commit -> commit
    Fail s c -> Fail s c
  fail s = Rat $ \ _ -> Fail s False
 
instance MonadPlus (Rat d) where
  mplus = (<|>)
  mzero = empty
 

and a Parsec-style "try", which rewinds on failure, so that <|> can try again.

 
try :: Rat d a -> Rat d a
try (Rat m) = Rat $ \d -> case m d of
  Fail s _ -> Fail s False
  x        -> x
 

Since we've already used <|> for parsec-style semantics, let's define a PEG-style backtracking alternative, (</>).

 
(</>) :: Rat d a -> Rat d a -> Rat d a
p </> q = try p <|> q
infixl 3 </>
 

So far nothing we have done involves packrat at all. These are all general purpose recursive descent combinators.

We can define an input stream and a number of combinators to read input.

 
class Stream d where
  anyChar :: Rat d Char
 
whiteSpace :: Stream d => Rat d ()
whiteSpace = () <$ many (satisfy isSpace)

phrase :: Stream d => Rat d a -> Rat d a
phrase m = whiteSpace *> m <* eof
 
notFollowedBy :: Rat d a -> Rat d ()
notFollowedBy (Rat m) = Rat $ \d -> case m d of
  Fail{} -> Pure ()
  _      -> Fail "unexpected" False
 
eof :: Stream d => Rat d ()
eof = notFollowedBy anyChar
 
satisfy :: Stream d => (Char -> Bool) -> Rat d Char
satisfy p = try $ do
  x <- anyChar
  x <$ guard (p x)
 
char :: Stream d => Char -> Rat d Char
char c = satisfy (c ==)
 
lexeme :: Stream d => Rat d a -> Rat d a
lexeme m = m <* whiteSpace
 
symbol :: Stream d => Char -> Rat d Char
symbol c = lexeme (char c)
 
digit :: Stream d => Rat d Int
digit = digitToInt <$> satisfy isDigit
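
-- An aside (my own sketch, not from the original post): because Rat is a
-- Monad, these combinators already give us the context-sensitive parsing the
-- introduction mentioned, at the cost of results that a packrat memo table
-- keyed only on input position could not soundly cache. Here we read a count
-- and then demand exactly that many 'x' characters:
lengthPrefixed :: Stream d => Rat d String
lengthPrefixed = do
  n <- digit                          -- a single decimal digit, e.g. '3'
  sequence (replicate n (char 'x'))   -- then exactly n literal 'x's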
 

And we can of course use a string as our input stream:

 
instance Stream [Char] where
  anyChar = Rat $ \s -> case s of
    (x:xs) -> Commit xs x
    [] -> Fail "EOF" False
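
-- A tiny demonstration of (<|>) versus (</>) (the names 'committed' and
-- 'rewinding' are mine, not from the original post). The first branch consumes
-- the 'a' of "ac" before failing on 'b'; with parsec-style (<|>) that commits
-- us and the alternative is never tried, while (</>) wraps the branch in try
-- and rewinds:
--
--   runRat committed "ac"   yields a consumed failure (Fail _ True)
--   runRat rewinding "ac"   yields Commit "" 'c'
committed, rewinding :: Rat String Char
committed = (char 'a' *> char 'b') <|> (char 'a' *> char 'c')
rewinding = (char 'a' *> char 'b') </> (char 'a' *> char 'c')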
 

Now that we've built a poor man's Parsec, let's do something more interesting. Instead of just using String as our input stream, let's include slots for use in memoizing the results from our various parsers at each location. To keep things concrete, we'll memoize the ArithPackrat.hs example that Bryan Ford used in his initial packrat presentation, enriched with some whitespace handling.

 
data D = D
  { _add        :: Result D Int
  , _mult       :: Result D Int
  , _primary    :: Result D Int
  , _decimal    :: Result D Int
  , anyCharD    :: Result D Char
  }
 

If you look at the type of each of those fields you'll see that _add :: D -> Result D Int, which is exactly what our Rat newtype expects as its argument, so we can bundle them directly:

 
 
add, mult, primary, decimal :: Rat D Int
add     = Rat _add
mult    = Rat _mult
primary = Rat _primary
decimal = Rat _decimal
 

We can similarly just use the character parse result.

 
instance Stream D where
  anyChar = Rat anyCharD
 

Now we just need to build a D from a String. I'm using view patterns and record wildcards to shrink the amount of repetitive naming.

 
parse :: String -> D
parse s = fix $ \d -> let
  Rat (dv d -> _add) =
        (+) <$> mult <* symbol '+' <*> add
    </> mult
  Rat (dv d -> _mult) =
        (*) <$> primary <* symbol '*' <*> mult
    </> primary
  Rat (dv d -> _primary) =
        symbol '(' *> add <* symbol ')'
    </> decimal
  Rat (dv d -> _decimal) =
     foldl' (\b a -> b * 10 + a) 0 <$> lexeme (some digit)
  anyCharD = case s of
    (x:xs) -> Commit (parse xs) x
    []     -> Fail "EOF" False
  in D { .. }
 
dv :: d -> (d -> b) -> b
dv d f = f d
 

Note that we didn't really bother factoring the grammar, since packrat will take care of memoizing the redundant calls!

And with that, we can define an evaluator.

 
eval :: String -> Int
eval s = case runRat (whiteSpace *> add <* eof) (parse s) of
  Pure a -> a
  Commit _ a -> a
  Fail s _ -> error s
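
-- A couple of quick checks (the name 'evalExamples' is mine, not from the
-- original post); the expected results are noted alongside:
evalExamples :: [Int]
evalExamples =
  [ eval "1 + 2 * 3"       -- 7, since '*' binds tighter than '+'
  , eval " (1 + 2) * 3 "   -- 9, with the surrounding whitespace handled
  ]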
 

Note that because the input stream D contains the results directly, and parse is the only thing that ever generates a D (it does so up front), it should be obvious that the parse results for each location can't depend on any additional information smuggled in via our monad.

Next time, we'll add a packratted Stream type directly to Parsec, which will necessitate some delicate handling of user state.

The small parser implemented here can be found on my github account, where it hasn't been adulterated with unnecessary spaces by my blog software.

P.S. To explain the quote, had I thought of it earlier, I could have named my parsing combinator library "Kessel Run" as by the time I'm done with it "it will contain at least 12 parsecs" between its different parser implementations.

Iteratees, Parsec, and Monoids, Oh My!
Edward Kmett, 16 September 2009
http://comonad.com/reader/2009/iteratees-take-2/

I'll be giving a talk tomorrow, Wednesday, September 16th, 2009 at the Boston Haskell User Group in the MIT CSAIL Reading Room (on the 8th floor of the William H. Gates tower of the Stata center) about mixing Oleg's iteratees with parsec and monoids to build practical parallel parsers and to cheaply reparse after local modifications are made to source code.

Ravi is trying to organize some time beforehand during which people can get together and work on Haskell projects, or spend some time learning Haskell, so it's not all scary academic stuff.

The meeting is scheduled from 7-9pm, and an ever growing number of us have been wandering down to the Cambridge Brewing Company afterwards to hang out and talk.

If you are curious about Haskell, or even an expert, or just happen to be interested in parallel programming and find yourself in the area, come on by.

Iteratees, Parsec and Monoids (Slides)
Edward Kmett, 20 August 2009
http://comonad.com/reader/2009/iteratees-parsec-and-monoid/
I was asked to give two talks at the Boston Area Haskell User Group this past Tuesday. The first was pitched at a more introductory level, and the second was to go deeper into what I have been using monoids for lately.

The first talk covers an introduction to the mathematical notion of a monoid, introduces some of the features of my Haskell monoids library on hackage, and starts to motivate the use of monoidal parallel/incremental parsing and the use of compression algorithms to recycle monoidal results.

The second talk covers a way to generate a locally context-sensitive parallel/incremental parser by modifying Iteratees to enable them to drive a Parsec 3 lexer, and then wrapping that in a monoid based on error productions in the grammar, before recycling these techniques at a higher level to deal with parsing seemingly stateful structures, such as Haskell layout.

1. Introduction To Monoids (PDF)
2. Iteratees, Parsec and Monoids: A Parsing Trifecta (PDF)

Due to a late start, I was unable to give the second talk. However, I did give a quick run-through to a few die-hards who stayed late and came to the Cambridge Brewing Company afterwards. As I promised some people that I would post the slides after the talk, here they are.

The current plan is to possibly give the second talk in full at either the September or October Boston Haskell User Group sessions, depending on scheduling and availability.

[ Iteratee.hs ]

Slides from Hac Phi: “All About Monoids”
Edward Kmett, 31 July 2009
http://comonad.com/reader/2009/hac-phi-slides/

Some people have requested my slides from the short talk I gave about monoids and monoidal parsing at Hac Phi. So, here they are.

There will be more to come at the next Boston Haskell User Group in August, where it looks like I'll be giving two short talks covering monoids. I may use the monoidal parsing engine from Kata as an example for the advanced talk if I have time, and will start to cover parsing larger classes of grammars in general (regular languages, CFGs/TIGs, TAGs, PEGs, LALR, attribute grammars, etc.).
