Continuing with my fixation on reducing the number of network roundtrips in OCI*ML, I have enhanced the `orabindexec` function to perform bulk DML, that is to say, batching many similar `INSERT`, `UPDATE` or `DELETE` statements on the same table into a single OCI call. The function signature is unchanged, taking a list of arrays, each of which represents one row to bind into a statement prepared with `oraparse`, so existing code will get a boost with no modification. The size of the batch is simply the size of the list; effectively this function is the opposite of `orafetchall`.
```
$ ./ocimlsh
        Objective Caml version 3.12.0

not connected > orabindexec;;
- : Ociml.meta_statement -> Ociml.col_value array list -> unit = <fun>
not connected > orafetchall;;
- : Ociml.meta_statement -> Ociml.col_value array list = <fun>
not connected >
```
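To illustrate, here is a minimal usage sketch. The connection calls (`oralogon`, `oraopen`, `oracommit`) and the `Integer`/`Varchar` constructors are my assumptions about the OCI*ML API for the purposes of the example; only `oraparse` and `orabindexec` are taken from the signatures above:

```ocaml
(* Sketch: insert three rows in a single OCI call via orabindexec.
   oralogon/oraopen/oracommit and Integer/Varchar are assumed here;
   adapt to the actual OCI*ML API as needed. *)
open Ociml

let () =
  let lda = oralogon "scott/tiger" in
  let sth = oraopen lda in
  oraparse sth "INSERT INTO mytable VALUES (:1, :2)";
  (* One array per row; the whole list goes over the wire in one call. *)
  orabindexec sth [ [| Integer 1; Varchar "apple"  |];
                    [| Integer 2; Varchar "banana" |];
                    [| Integer 3; Varchar "cherry" |] ];
  oracommit lda
```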
Using the same system from the previous benchmark on prefetching, a 1G client VM with a ~35ms ping/~70ms TNS ping time over a WAN/VPN connection to the server, loading a 10,000-line CSV file with a `COMMIT` at the end shows the following results:
| Batch size | Time (s) | Rows/sec | Speedup factor |
|---|---|---|---|
| 1 | 855 | 11.70 | 1 |
| 10 | 93.53 | 106.92 | 9.14 |
| 100 | 20.45 | 489.00 | 41.79 |
| 1000 | 17.33 | 577.03 | 49.32 |
| 2000 | 17.85 | 560.22 | 47.88 |
| 5000 | 25.01 | 399.84 | 34.17 |
| 10000 | 36.28 | 275.63 | 23.56 |
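For reference, the driver loop for such a load might look like the following sketch: the rows are parsed up front into a `col_value array list`, then inserted in chunks with one `orabindexec` (and hence one network roundtrip) per chunk, and a single `COMMIT` at the end. The `load_in_batches` function, its chunking helper and `oracommit` are my own illustration, not code from OCI*ML itself:

```ocaml
(* Sketch of a batched loader. [rows] is the whole file already parsed
   into col_value arrays; [batch_size] is assumed > 0. *)
open Ociml

let load_in_batches lda sth batch_size rows =
  (* Split off at most [n] elements from the front of a list. *)
  let rec take n acc = function
    | [] -> (List.rev acc, [])
    | rest when n = 0 -> (List.rev acc, rest)
    | x :: rest -> take (n - 1) (x :: acc) rest
  in
  let rec go = function
    | [] -> ()
    | rows ->
        let batch, rest = take batch_size [] rows in
        orabindexec sth batch;  (* one OCI call per batch *)
        go rest
  in
  go rows;
  oracommit lda  (* single COMMIT at the end, as in the benchmark *)
```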
Repeating the same test locally on the server shows just what a brutal effect network latency has on SQL*Net, and an even greater improvement from bulk DML:
| Batch size | Time (s) | Rows/sec | Speedup factor |
|---|---|---|---|
| 1 | 11.75 | 851.06 | 1 |
| 10 | 1.27 | 7874.02 | 9.25 |
| 100 | 0.22 | 45454.55 | 53.41 |
| 1000 | 0.11 | 90909.09 | 106.82 |
| 2000 | 0.12 | 83333.33 | 97.92 |
| 5000 | 0.15 | 66666.67 | 78.33 |
| 10000 | 0.19 | 52631.58 | 61.84 |
It is interesting to see that the optimal batch size, 1,000 rows, is the same in both environments for this particular workload.
While this was somewhat more work to implement than simply switching on OCI's built-in prefetching mechanism, the numbers show that it was definitely worthwhile. Although the rows/sec over the WAN/VPN were lower than for a `SELECT` statement, as is to be expected with a pre-warmed block buffer cache, the speedup from tuning is nearly 50×, and over 100× locally. OCI*ML is definitely fast enough now, without even any tricks like `INSERT APPEND` or `NOLOGGING`, to be used as an ETL tool in place of Perl, as well as for monitoring and report generation.
It also demonstrates the value of experimental over theoretical tuning: bigger batches are not always better. In part this will be because two copies of the batch exist in memory at once: one on the OCaml side as the input is parsed and the list built, and another on the OCI side as it is being reconstructed there in a suitable format.