Submitted by itkovian on Thu, 07/13/2006 - 2:24am.

I've been struggling with the following issue for over three weeks. At this point, it seems like there's nothing I can improve to make certain the application runs faster or allocates memory bounded by O(n), n being the size of the file I'm processing.

The goal is to construct a tree structure from the data contained in a file. Each line in said file indicates an entry or exitpoint of a function that was called (the file contains an execution trace of a Java app). When processing an entry point, the relevant data is stored in a Method record (annotated with a pre (count) number) and the lot is pushed onto a stack. When an exit point is encoutered, the top of the stack is checked. If the functions match, the stack is popped, the Method is annotated with a post (count) number, and we store the annotated Method record in a queue. The queue then contains all the information needed to contruct a real Tree, using the pre and post numbers.

And additional complication is the fact that each file line also belongs to an execution thread, which is located in the line. We need to construct a tree for each thread. For this we use a Data.Map that maps the thread in (:: Int) to the relevant stack and queue.

Finally, we also have a mapping fro function ids to function names, also maintained in a Data.Map

I am using a State Monad with the state contained in a ParserState record. Each line with update the state via updateState, where I make a distinction between the possible kinds of line in the trace, by swithcing on the first field of each line. Fields are separated by a space. I'm using the Data.ByteString stuff to lazily read the file.

I've tried to get Haskell to evaluate each update to the State through succintly using the seq operator. Perhaps I overdid that.

The question remains where and how I need to change my code to

  • Make things faster
  • Use less memory.

A profiling run (after -O2) showing retained sets seem to indicate super-linear behaviour in retaining data.

All comments are appreciated.

I'm pasting the Main module contents here.

import Debug.Trace
import System.Environment
import System.IO
import Data.Map as M
import Data.Maybe
import Data.List (sortBy)
import Control.Monad.State

import qualified Data.ByteString.Lazy.Char8 as B

import Method
import Tree
import Algorithms

type Preorder  = Integer
type Postorder = Integer
type MethodMap = M.Map Integer B.ByteString
type ThreadMap = M.Map Integer [(Preorder, Postorder, Method)]

data ParserState = ParserState { methodStack :: !ThreadMap
                               , methodQueue :: !ThreadMap
                               , pre         :: !Integer
                               , post        :: !Integer
                               , methodMap   :: !MethodMap
                               , currentThread :: !Integer
                               } deriving (Show)

initialParserState :: ParserState
initialParserState = ParserState e e 0 0 e 0
  where e = M.empty :: Map Integer a

readInteger :: B.ByteString -> Integer
readInteger = fromIntegral . fst . fromJust . B.readInt

parseTraceMonadic :: [B.ByteString] -> ParserState
parseTraceMonadic ss = state { methodQueue = reverse (methodQueue state) }
  where state = execState (mapM_ (\x -> modify (updateState x) ) ss) initialParserState
  {- I've pushed this through a >> get >>= (`seq` return()) too -} 

updateState :: B.ByteString -> ParserState -> ParserState
updateState s state = case (B.unpack $ head fields) of
  "M" -> updateStateMethod     fields state
  "E" -> updateStateException  fields state
  "C" -> updateStateEntry      fields state
  "R" -> updateStateExit       fields state
  where fields = B.splitWith (== ' ') s

updateStateMethod :: [B.ByteString] -> ParserState -> ParserState
updateStateMethod (_:methodId:methodName:_) state = 
  let methodMap' = M.insert (readInteger methodId) methodName (methodMap state)
  in methodMap' `seq` state { methodMap = methodMap' }

updateStateException :: [B.ByteString] -> ParserState -> ParserState
updateStateException _ state = state

updateStateEntry :: [B.ByteString] -> ParserState -> ParserState
updateStateEntry (_:ss) state = 
  let methodStack' = updateMap thread (methodStack state) (\x y -> Just (x:y)) (pre state, 0, method)
  in let pre' = ((+1) $! (pre state))
  in methodAtack' `seq` pre' `seq`
  state { methodStack = methodStack'
        , pre = pre' }
  where method = mkMethod ( B.unpack ss)
        thread = Method.thread method

updateStateExit :: [B.ByteString] -> ParserState -> ParserState
updateStateExit (_:ss) state = 
  case updateMethod m ( B.unpack ss) of
     Just um -> let methodStack' = updateMap thread 
                                   (methodStack state) 
                                   (\x y -> Just (tail y)) 
                                   (pre_, post state, um)
                in let methodQueue' = updateMap thread 
                                      (methodQueue state) 
                                      (\x y -> Just (x:y)) 
                                      (pre_, post state, um)
                in let post' = ((+1) $! (post state))
                in methodStack' `seq` methodQueue' `seq` post' `seq`
                state { methodStack = methodStack'
                      , methodQueue = methodQueue'
                      , post = post' }
     Nothing -> error $    "Top of the stack is mismatching! Expected " 
                        ++ (show m) ++ " yet got " ++ (show ss) 
                        ++ "\n" ++ (show state)
  where method = mkMethod ( B.unpack ss)
        thread = Method.thread method    
        (pre_, _, m) = let x = M.lookup thread (methodStack state) 
                       in x `seq` case x of
                          Just stack -> head stack
                          Nothing    -> error $    "Method stack has not been found for thread " 
                                                ++ (show thread) ++ " -> fields: " ++ (show ss)

updateMap key map f value = case M.member key map of
                              True  -> M.update (f value) key map
                              False -> M.insert key [value] map

main = do
          [filename] <- System.Environment.getArgs
          file       <- System.IO.openFile filename System.IO.ReadMode
          contents   <- B.hGetContents file
          let parserState = parseTraceMonadic . B.lines $ contents
          print (methodQueue parserState)
          print (methodStack parserState)
          print (methodMap parserState)
          print (pre parserState)
          print (post parserState)

I think taking a peek at the Method module can;t hurt either, so:

data Method = Method { mid :: Integer
                     , thread :: Integer
                     , instruction_entry :: Integer
                     , instruction_exit :: Integer
                     } deriving (Eq, Show)

eM = Method 0 0 0 0

mkMethod :: [String] -> Method
mkMethod s = let [_thread, _id, _entry] = take 3 $ map (read :: String -> Integer) s 
             in [_thread, _id, _entry] `seq` Method { mid = _id
                                                    , thread = _thread
                                                    , instruction_entry = _entry
                                                    , instruction_exit = 0

updateMethod :: Method -> [String] -> Maybe Method
updateMethod (Method mid thread instruction_entry instruction_exit ) s
  | thread == _thread && mid == _id = _exit `seq` Just Method { mid = mid
                                                              , thread = thread
                                                              , instruction_entry = instruction_entry
                                                              , instruction_exit = _exit
  | otherwise = Nothing
  where [_thread, _id, _exit] = take 3 $ map (read :: String -> Integer) s
Submitted by itkovian on Fri, 07/14/2006 - 4:41am.

Lemmih showed me how to speed up things a few notches. Using a fold instead of the state, bringing the ByteString stuff into the Method module instead of passing it through Strings first ... It seems like the heap usage is better too.

However, I'm still not down to the speed Ocaml yields, sadly, because I don't like programming in Ocaml. Of course, that's a personal thing.

Submitted by itkovian on Tue, 07/25/2006 - 1:54am.

I have extended the implementation with Lemmih's suggestions to handle a trace that contains data for different threads, using a Map Int State thingie. Turns out it slows down the speed of the program again.

Furthermore, its still not up to par with a quick-and-dirty ocaml implementation I wrote (I suck at writing ocaml). So, I'm thinking that I should rethink the implementation in Haskell, although I see no immediate opportunity. FYI, the ocaml takes 13 seconds to parse a 3.5 million line trace, while the Haskell (-O2) takes at least 11 seconds on a 40K line trace.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.