One powerful technique I often use is to quickly load application or system logs into throwaway SQLite databases for analysis. This means I can easily run queries such as, how many times has event X occurred broken down by hour, or how how does the average of this metric on that box compare with its peers. Other approaches I have seen people use for this type of analysis include Awk, Excel, and printing out every Orca graph, laying them out on the boardroom table and just staring at them. These approaches rapidly become unmanageable once you need to correlate many machines, different types of logfiles, etc. In this example I will be loading sar
output but it can obviously be any regularly-structured file.
Here is the project directory:
gaius@debian:~/Projects/Sar2db$ ls Makefile sar2db.ml sqlite3_overlay.ml
And a slightly more sophisticated Makefile:
# Makefile TARGET=sar2db EXTRAS=sqlite3_overlay.ml $(TARGET): $(TARGET).ml $(EXTRAS) ocamlfind ocamlopt -package sqlite3 -package unix -package calendar -linkpkg $(EXTRAS) $< -annot -o $(TARGET) clean: rm -f $(TARGET) *.o *.cm* *~ *.db # End of file
The OCaml code starts as usual, opening the main modules I am using to move their functions into the default namespace:
(* A program to load sar output into SQLite for analysis *) open Unix open Sqlite3 open Sqlite3_overlay open Scanf open Printf (* Parse command line arguments *) let (infile, dbname, tabname) = ((ref ""), (ref ""), (ref "")) let _ = Arg.parse [("-i", Arg.Set_string infile, "Input file name (default STDIN)"); ("-o", Arg.Set_string dbname, "New SQLite database name (default sar.db)"); ("-t", Arg.Set_string dbname, "Table name (default sardata)")] (fun x -> raise (Arg.Bad (sprintf "%s: unexpected argument '%s'" Sys.argv.(0) x))) (sprintf "Usage: %s [-o database]" Sys.argv.(0));
Scanf
is a module that is the opposite of Printf
: rather than producing output based on a format string, it instead uses a format string to parse a line of input. Note that I am overlaying some Sqlite3
functions with my own; I cannot for the life of me figure out how to do INSERT
with bind variables with this library. I know it is possible in the underlying C API. So I am going to fake it for now. Obviously that abandons the performance enhancement of using prepared statements in the first place, but it keeps the main body of the code nice and clean, and I’m sure I’ll figure it out eventually.
(* main *) try (* Open the database (implicit create) and create the table (if it does not already exist *) let dbname = (match !dbname with |"" -> "sar.db" |_ -> !dbname) in let chan = (match !infile with |"" -> Pervasives.stdin |_ -> open_in !infile) in let lda = db_open dbname in let tabname = (match !tabname with |"" -> "sardata"|_ -> !tabname) in
This part of the code actually implements the defaults for the command line arguments. If the database name is not specified in defaults to sar.db
, similarly the table name defaults to sardata
. This flexibility allows many different datasets to coexist for cross referencing.
let sql_create = sprintf "create table %s ( sample_time datetime, sample_host varchar2(32), sample_cpu varchar2(4), pc_user number, pc_nice number, pc_system number, pc_iowait number, pc_steal number, pc_idle number)" tabname in let sql_index = sprintf "create unique index ix_sar on %s (sample_time, sample_host)" tabname in List.iter (fun stmt -> ignore (exec lda stmt) ) [sql_create; sql_index];
Next create the table and an index on it for performance. If either already exists it’s fine, the new data will simply be appended to what is there, so any errors executing the SQL are ignored. Before using this technique it is important to verify that the SQL is actually valid! The most obvious flaw is that I am performing no verification that the table name passed in on the command line is a valid one. However I have tested that provided that it is, this block of code will perform as intended.
(* Read the header to get the hostname and the start date *) let (sample_host, start_date) = let first_line = input_line chan in (sscanf first_line "%_s %_s (%s@) %02d/%02d/%04d" (fun host mon dd yyyy -> (host, fst (mktime {tm_year = yyyy - 1900; tm_mday = dd; tm_mon = mon - 1; tm_sec=0; tm_min=0; tm_hour=0; tm_wday=0; tm_yday=0; tm_isdst=false}) ))) in
Before going on, it would perhaps be useful to note the format of sar output:
gaius@debian:~/Projects/Sar2db$ sar|head -5 Linux 2.6.32-5-686 (debian) 09/12/2010 12:00:01 AM CPU %user %nice %system %iowait %steal %idle 12:01:01 AM all 51.33 0.00 6.83 15.54 0.00 26.30 12:02:01 AM all 30.36 0.00 6.11 21.08 0.00 42.45 12:03:01 AM all 43.72 0.00 8.41 21.31 0.00 26.57
There is a date in the header, and the individual rows only have the time. Therefore to support gathering data over several days, it will be necessary to add the date and the time together. Also in the header is the name of the machine. The )
character is special in the format string, hence escaping it with the @
in order to perform a match. There are more concise methods available for date parsing, but in this program times and dates are mixed in with other data, so it makes sense to parse them all together. It takes as arguments a string to match against, a format string to tell it what to expect and a function to execute with the values. Format specifiers with an underscore are matched but then discarded and not passed to the function, or alternately I could have declared the function (fun _ _ host mon dd yyyy -> ...)
. The function returns a tuple containing the host name and the start date.
The final block of code in the module implements an “infinite” loop, broken out of when an exception is raised. Note (line 18) that we are already in a try
block; this is the block that will exit the program when the end of the input is reached, either a file or standard input. The try
block here is to loop again when the format string is not matched (periodically there will be a repeat of the header in the sar
output, which we will ignore, as the format will not match). When nesting like this it is helpful to annotate which with
belongs to which try
. Indentation is not significant in OCaml as it is in Haskell and Python.
(* Infinite loop to read the file, exiting when End_of_file *) let ins_stmt = prepare lda (sprintf "insert into %s values (?,?,?,?,?,?,?,?,?)" tabname) in while true do let next_line = input_line chan in try sscanf next_line "%02d:%02d:%02d %s %s %f %f %f %f %f %f" (fun hh mi ss ampm cpu pc_user pc_nice pc_sys pc_io pc_steal pc_idle -> (* Add the start_date to the sample_time *) let hh24 = match hh with |12 -> 0. |_ -> float hh in let hh' = match ampm with |"AM" -> 0. |_ -> 12. in let sample_time = start_date +. ((hh24 +. hh') *. 3600.) +. (float mi *. 60.) +. float ss in bind ins_stmt 1 (to_date sample_time); bind ins_stmt 2 (Data.TEXT sample_host); bind ins_stmt 3 (Data.TEXT cpu); Array.iteri (fun i x -> bind ins_stmt (i+4) (Data.FLOAT x)) [|pc_user; pc_nice; pc_sys; pc_io; pc_steal; pc_idle|]; match (execP ins_stmt) with |Rc.OK -> () |_ -> print_endline (errmsg lda); ) with Scan_failure e -> () |End_of_file -> () (* sscanf *) done with Sys_error e -> eprintf "%s: %s\n" Sys.argv.(0) e |End_of_file -> () (* input_line *) (* end of file *)
The addition of a time to a date is based on the example in PLEAC. I have not yet decided whether to use CalendarLib or OCamlnet’s Netdate in future. Probably the latter as it cuts down on the dependencies. Similarly, I’ll look into Pcre as it’s a prereq for Ocamlnet anyway, and I am very familiar with regexps from Python (that is how I do it usually).
Note that each different data type is bound slightly differently. Since there are many consecutive values of the same type, they can be bound in a loop with iteri
. They are cast using constructors from the Sqlite3.Data
module to Data.t
. Finally the SQL can be executed (execP
is my own function for prepared statements, as noted earlier).
It’s possible to very quickly and easily gather data from a variety of machines, using SSH key-pair and/or NFS storage. For example:
for n in `seq 1 10` do ssh host${n} "sar|/shared/sar2db -o /shared/sar.db" done
Now using SQL’s GROUP BY
syntax, it is easy to analyze the performance of a cluster.
Finally here is the code for my “fake” prepared statements.
(* Overlays on Sqlite module. Highly inefficient! *) open Sqlite3 open CalendarLib (* expose only the interface *) module type SQLITE3_OVERLAY = sig type prepared val prepare : Sqlite3.db -> string -> prepared val execP : prepared -> Sqlite3.Rc.t val bind : prepared -> int -> Sqlite3.Data.t -> unit val to_date : float -> Sqlite3.Data.t end (* a "fake" prepared statement. contains original sql and values bound into it *) type prepared = {lda :db; (* database handle *) sql :string; (* original SQL *) bindvars :Data.t array; (* variables to bind *) positions :int list} (* indexes of ? in sql *) (** return a list containing all the indexes of ? in a string *) let findq sql = let rec findq' sql offset acc = try let pos = String.index_from sql offset '?' in findq' sql (pos+1) (pos::acc) with Not_found -> List.rev acc in findq' sql 0 [] (** construct a prepared statement record *) let prepare lda sql = { lda = lda; sql=sql; positions=(findq sql); bindvars=Array.make (List.length (findq sql)) (Data.NULL)} (** insert a string at pos into a another string *) let splice str pos x = let before = String.sub str 0 pos in let after = String.sub str (pos + 1) ((String.length str) - (pos + 1)) in before ^ x ^ after (** substitute the bind vars of a prepared into it's sql and execute it *) let execP stmt = let sql' = ref "" in sql' := String.copy stmt.sql; for i = (Array.length stmt.bindvars) - 1 downto 0 do (* backwards to not invalidate findq results *) sql' := splice !sql' (List.nth stmt.positions i) ("'" ^ (Data.to_string stmt.bindvars.(i)) ^ "'") done; exec stmt.lda !sql' (** bind a value into the statement at pos - indexes start from 1 *) let bind stmt pos value = stmt.bindvars.(pos - 1) <- value (** Convert Epoch time to Data.TEXT in ISO 8601 format *) let to_date e = Data.TEXT (Printer.CalendarPrinter.sprint "%Y%m%dT%H:%M:%S" (Calendar.from_unixfloat e)) (* End of file *)
It simply builds literal SQL by substituting a string in position of a ? indexed from 1 (see line 46 in sar2db.ml
). I am overlaying the prepare
and bind
functions and adding execP
and to_date
(as there is no Data.DATETIME
constructor). findq
and splice
are only used internally so they are hidden. Sqlite implicitly handles the type conversions from strings, but I am fully type safe until the last step.
Pingback: Dictionary format strings | So I decided to take my work back underground
Pingback: Scope in database code (2) | So I decided to take my work back underground
Pingback: Calling Unix libraries from OCaml | So I decided to take my work back underground
Pingback: Import 2 column CSV files to some sort of DB… preliminary planning | CL-UAT