Following on from yesterday’s post, here is a similar tool implemented in Haskell. This one processes a log format we use for various applications, which is
<timestamp>: key1=value1 key2=value2 ... keyn=valuen
There may be any number of keys, but within a log file all entries will have the same keys. The end result will be a SQLite database with a timestamp column + one column per key.
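For example (values illustrative, matching the sample line in the parser’s comment below), a couple of entries might look like:

Tue Jul  6 05:15:02 2010: sweep=27 broadcast=194
Tue Jul  6 05:16:02 2010: sweep=27 broadcast=195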
module Main where

import Data.Time (LocalTime)
import Data.Time.Format (readTime)
import System.Locale (defaultTimeLocale)
import IO (readFile, getContents, hPutStrLn, stderr)
import Data.List (intersperse, lines)
import Data.List.Split (splitOn)
import Database.HDBC.Sqlite3 (connectSqlite3, Connection)
import Database.HDBC (run, commit, disconnect, handleSqlError, toSql,
                      prepare, executeMany, getTables)
import System.Console.GetOpt
import System
As with OCaml, the first thing is to open or import the modules. I am fully qualifying all the functions I’m using just to help familiarize myself with where things live. Haskell “knows” which of many source files is the entry point by where module Main is declared (I am not actually sure how OCaml does it; just from where a let _ construct is? Something it can evaluate immediately rather than just declaring a function?). GetOpt seems to need to be imported in its entirety in order to get the constructors for its datatypes.
-- | Input line of the form "Tue Jul  6 05:15:02 2010: sweep=27 broadcast=194..."
--   to a LocalTime and a list of key-value pairs
parseLine :: String -> (LocalTime, [(String, Int)])
parseLine s = (log_date, kvs)
    where
        log_date  = (readTime defaultTimeLocale "%a %b %e %T %Y" log_date') :: LocalTime
        log_date' = s' !! 0
        kvs       = map toPair (map (splitOn "=") (splitOn " " (s' !! 1)))
        s'        = splitOn ": " s
        toPair [x, y] = (x, read y :: Int)
This is the function that actually parses a line of the file. I’m sure there’s a much better way to do this with Parsec, but it very simply first splits the line into a timestamp and the data on the ": ", then splits the data into key=value pairs on whitespace, then finally splits each of those on =. It returns the timestamp and a list of tuples, each of which is one key and its associated (integer) value.
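To make that concrete, here is roughly what it gives back in GHCi (the exact Show rendering of LocalTime may differ slightly):

*Main> parseLine "Tue Jul  6 05:15:02 2010: sweep=27 broadcast=194"
(2010-07-06 05:15:02,[("sweep",27),("broadcast",194)])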
processLogFile :: String -> IO String -> FilePath -> IO ()
processLogFile tabname infile outfile = do
    conn    <- connectSqlite3 outfile
    logfile <- infile
This is the function that processes an entire logfile. It takes a table name (a pure String), an input stream (impure, hence being extracted from the IO monad with <-), and an output filename (a pure String; it is not actually a file at this stage, just a name. The IO happens when it is passed as a parameter to connectSqlite3, hence the other use of <-), and returns a result in IO.
    -- if the table already exists, drop it. return 0 the right way to say "do nothing"?
    tables <- getTables conn
    case tabname `elem` tables of
        True  -> run conn (dropTable tabname) []
        False -> return 0
In this case I am recreating the table rather than appending new data into it, using a similar technique to ignore the error if it occurs (i.e. the table did not already exist). dropTable is defined below; it is just a function that returns the SQL.
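Incidentally, one answer to the question in the code comment is Control.Monad.when, which supplies the do-nothing branch itself. A sketch of the same step using it (same names as in processLogFile, assuming import Control.Monad (when) is added to the imports):

    -- drop the table only if it already exists; 'when' does nothing otherwise
    tables <- getTables conn
    when (tabname `elem` tables) $ do
        _ <- run conn (dropTable tabname) []
        return ()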
    -- read the first line of the logfile, extract the keys and create a table
    -- this will obviously fail if more keys are added later
    let (_, kvs) = parseLine $ head $ lines logfile
    run conn (createTable tabname kvs) []
Next, read the first line of the log file to get the list of new column names. This Haskell syntax is equivalent to parseLine (head (lines logfile)): read the logfile, split it on newlines, take the first one, and call the parse function on it. Haskell’s laziness means that this doesn’t read the entire file just to get one line of it. createTable is a function defined below that returns SQL.
    stmt <- prepare conn (insertInto tabname kvs)
    executeMany stmt (map toSqlList (lines logfile))
    commit conn
    disconnect conn
Now a statement is prepared (this does IO; SQLite checks the statement against the underlying table structure) and the statement is applied to every line in the log file using map. If all the underlying infrastructure were set up for it, I guess this could be massively parallelized, as each line is independent of the others and the toSqlList function is pure. Not much point when SQLite would take a table lock for each one tho’! Haskell’s HDBC disables autocommit, so it must be done explicitly.
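Alternatively, HDBC provides withTransaction, which commits on success and rolls back if an exception escapes, at the cost of the explicit control shown here. A sketch of the insert step using it, assuming withTransaction is added to the Database.HDBC import:

    -- withTransaction commits for us, or rolls back on an exception
    withTransaction conn $ \c -> do
        stmt <- prepare c (insertInto tabname kvs)
        executeMany stmt (map toSqlList (lines logfile))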
    where
        -- SQL statements
        createTable tabname kvs =
            "create table " ++ tabname ++ " (log_date datetime primary key, " ++
            (concat $ intersperse "," (map (++ " number") (colNames kvs))) ++ ");"

        dropTable tabname = "drop table " ++ tabname ++ ";"

        -- length of list of keys +1 for the log_date column
        insertInto tabname kvs =
            "insert into " ++ tabname ++ " values (" ++
            (concat $ intersperse "," $ replicate ((length kvs) + 1) "?") ++ ")"

        colNames kvs = map fst kvs

        -- convert the data from the parsed structure to a list of SQL types
        -- for inserting in a prepared statement
        toSqlList logline = toSql log_date : colValues kvs
            where
                (log_date, kvs) = parseLine logline
                colValues kvs = map toSql (map snd kvs)
These are the local functions referenced in the main body of processLogFile. Mainly they are straightforward templates for SQL statements. toSqlList creates a list of all the values that will be bound into the prepared statement, converted using the toSql function from Database.HDBC. It does this by mapping snd (second) over the (key, value) pairs returned by parseLine to extract the values. insertInto simply produces a string of the form ?,?,?...?.
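To see what those templates expand to, suppose the first line of the log yields kvs = [("sweep",27),("broadcast",194)] and the table is the default "log"; the generated SQL (exactly as the code concatenates it, including the missing space after the comma) is:

create table log (log_date datetime primary key, sweep number,broadcast number);
insert into log values (?,?,?)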
-- handle command line options
-- default is to read STDIN, create/use a database file called log.db
-- and load into a table named log
data Options = Options
    { optTabName :: String
    , optInput   :: IO String
    , optOutput  :: FilePath
    }

defaultOptions = Options
    { optTabName = "log"
    , optInput   = getContents
    , optOutput  = "log.db"
    }

options :: [OptDescr (Options -> IO Options)]
options =
    [ Option "i" ["input"]
        (ReqArg (\arg opt -> return opt { optInput = readFile arg }) "FILE")
        "Input file"
    , Option "o" ["output"]
        (ReqArg (\arg opt -> return opt { optOutput = arg }) "FILE")
        "Output file"
    , Option "t" ["table"]
        (ReqArg (\arg opt -> return opt { optTabName = arg }) "TABLE")
        "Table name"
    , Option "v" ["version"]
        (NoArg (\_ -> do
            argv0 <- getProgName
            hPutStrLn stderr (argv0 ++ " version 0.1")
            exitWith ExitSuccess))
        "Print version"
    , Option "h" ["help"]
        (NoArg (\_ -> do
            argv0 <- getProgName
            hPutStrLn stderr (usageInfo argv0 options)
            exitWith ExitSuccess))
        "Show help"
    ]
This code handles both parsing the command line arguments and setting up the defaults in case they are not specified. First a data structure (record) for the options is defined, then this is populated with the defaults, then these are tied together with the actual command line options and their help text. This is more sophisticated than the OCaml Arg module I have been using so far, but a similar facility is available for it.
main = handleSqlError $ do
    args <- getArgs
    let (actions, nonOptions, errors) = getOpt RequireOrder options args
    opts <- foldl (>>=) (return defaultOptions) actions
    let Options { optTabName = table
                , optInput   = input
                , optOutput  = output } = opts
    processLogFile table input output
-- End of file
And finally, the entry point of the program. This gets the command line arguments (which is IO), overlays them onto the contents of defaultOptions (so anything not specified keeps its default), performs a pattern match against the resulting data structure to bind the variables, then calls processLogFile to do the work.
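The foldl (>>=) line deserves a second look: each parsed option is a function of type Options -> IO Options, and the fold chains them left to right starting from the defaults. For a hypothetical actions = [a1, a2, a3] it is equivalent to:

    foldl (>>=) (return defaultOptions) [a1, a2, a3]
        == ((return defaultOptions >>= a1) >>= a2) >>= a3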
So which is better for developing this particular type of tool, OCaml or Haskell? There is very little to choose between them in my opinion. I’m sure I will refine that once I have used them for more substantial projects. Both give me what I’m after: a high-level, type-safe, mainly functional, compiled language. Haskell does have the advantage of better out-of-the-box date handling and better bindings to SQLite: it can INSERT with a prepared statement. The former is a non-issue once the appropriate OCaml library is identified and installed, of course, but Haskell is more Pythonic in this sense: there is an obvious way to do it, whereas OCaml requires an inexperienced programmer (i.e. me) to make decisions that will only be obvious in retrospect, e.g. which of the three (so far!) date libraries to use: Unix, CalendarLib or Ocamlnet. I don’t want to complain about OCaml’s SQLite bindings too much; after all, someone has done the work and given it away. Eventually I’ll be at the point where I can just implement the missing feature and submit a patch myself…
OCaml/Haskell or Python is the next question. There’s no denying Python is an incredibly productive language, with great libraries, and much more maintainable than Perl. But a throwaway script never is thrown away; you will always use it again, tweak it, add options, add error handling, give it to someone else to use… And before long, it’s a real tool. In some cases today’s 10-line file munger is tomorrow’s 1000-line ETL batch job on which the business depends, and which no-one dares touch in case it breaks. Of course people do do large-scale projects in Python. I just don’t see the point in writing unit tests for things the type system could already have caught!
FWIW I have written a minimalistic syntax extension for OCaml (ocaml-sqlexpr) that simplifies the use of SQLite in a type-safe way that resembles OCaml’s built-in (type-safe) Printf.printf, allowing you to do things like

It uses prepared statements (which will be cached if you use sqlc"…"), and handles parameter binding and data conversion when it makes sense (e.g. promoting INT to a float). It also features (nestable) transaction support and the usual higher-level functions (fold, iter) with automatic retries when the DB is locked. Just have to get around to documenting it.
Intriguing! I will definitely check that out. Thanks 🙂