Thursday, May 14, 2015

Benchmarking Elixir File Hashing Step 1. Creating test data

In my previous post I outlined how to get a cryptographic hash of a file using Elixir. For large files, you want to chunk the file rather than read the whole thing into memory, so it would be good to determine whether there is an optimal chunk size to pass into the hash functions.
 iex> File.stream!("./known_hosts.txt",[],2048)
 |> Enum.reduce(:crypto.hash_init(:sha256),fn(line, acc) -> :crypto.hash_update(acc,line) end )
 |> :crypto.hash_final |> Base.encode16
 "97368E46417DF00CB833C73457D2BE0509C9A404B255D4C70BBDC792D248B4A2"
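
Since the benchmark will vary the chunk size, it helps to have it as a parameter. The following is a minimal sketch, not the code from the previous post: the module name HashSketch and the function sha256/2 are made up for illustration, and it reads through File.open! and IO.binstream rather than File.stream! because the argument order of File.stream! differs between Elixir versions.

```elixir
defmodule HashSketch do
  # Hypothetical helper: SHA-256 of the file at `path`, read
  # `chunk_size` bytes at a time, returned as an uppercase hex string.
  def sha256(path, chunk_size) when is_integer(chunk_size) and chunk_size > 0 do
    File.open!(path, [:read, :binary], fn file ->
      file
      |> IO.binstream(chunk_size)
      |> Enum.reduce(:crypto.hash_init(:sha256), fn chunk, acc ->
        :crypto.hash_update(acc, chunk)
      end)
      |> :crypto.hash_final()
      |> Base.encode16()
    end)
  end
end

# Usage: the digest should not depend on the chunk size.
sample = Path.join(System.tmp_dir!(), "hash_sketch_sample.data")
File.write!(sample, String.duplicate("abc", 1000))
IO.puts(HashSketch.sha256(sample, 2048))
IO.puts(HashSketch.sha256(sample, 65_536))
```

Both calls should print the same digest; only the time taken is expected to change with the chunk size.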

The first step in doing this is to create test data files of a specific length. An obvious approach is to simply open /dev/random and read until you've got enough sample data. Here is a simple shell example:
 head -c 1024 < /dev/random > test.data 

Translating that to Elixir looks like:
  iex(1)> File.stream!("/dev/random",[],1024) |> Enum.take(1)
  ** (File.Error) could not stream /dev/random: illegal operation on a directory
      (elixir) lib/file/stream.ex:81: anonymous fn/2 in Enumerable.File.Stream.reduce/3
      (elixir) lib/stream.ex:1012: anonymous fn/5 in Stream.resource/3
      (elixir) lib/enum.ex:1740: Enum.take/2
That sure looks like a bug: /dev/random is not a directory but a character special device file. While the error message is misleading, there is no actual bug. Erlang will not open files for reading that it considers dangerous to the overall scheduler. Since character special device files usually block on I/O, Erlang errs on the side of caution and refuses to open them. There is an exception in the Erlang code for /dev/null, since that is considered safe for the scheduler. This post goes into the details.

Reading Device Files in Erlang
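
To see that the file really is a device and not a directory, you can ask for its metadata instead of opening it. This is a small sketch that assumes a Unix-like system where these three paths exist. File.stat does not open the file, so it is not subject to the restriction described above.

```elixir
# Sketch: report the file type for a few paths without opening them.
for path <- ["/dev/null", "/dev/random", "/tmp"] do
  case File.stat(path) do
    {:ok, %File.Stat{type: type}} -> IO.puts("#{path}: #{type}")
    {:error, reason} -> IO.puts("#{path}: #{inspect(reason)}")
  end
end
```

On a typical Linux or OS X machine the two /dev entries should be reported with type device and /tmp with type directory.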

There are several ways to get around this problem. The first solution that springs to mind is actually one of the more difficult ones to do in Elixir. In many languages there is a system call that you can use to execute shell commands. Elixir has System.cmd, but it is relatively limited: you can specify the command to execute and its argument list, but you cannot use shell-based I/O redirection.
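
For this particular case the lack of redirection can be sidestepped, because head accepts a file name as an argument and System.cmd returns the command's output as a binary. The sketch below assumes a Unix-like system with head on the PATH, and it reads /dev/urandom rather than /dev/random so that the call does not block waiting for entropy. The output path is just a placeholder in the temp directory.

```elixir
# Sketch: let `head` read the device and capture its stdout in Elixir.
# Assumes `head` is on the PATH and /dev/urandom is readable.
{data, 0} = System.cmd("head", ["-c", "1024", "/dev/urandom"])

# Placeholder output location; any writable path would do.
out = Path.join(System.tmp_dir!(), "test.data")
File.write!(out, data)
IO.puts("wrote #{byte_size(data)} bytes to #{out}")
```

This works only because the command can do the reading itself. It is not a general substitute for pipes and redirection.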

The most straightforward Elixir solution is to use the rand_bytes function from the Erlang crypto library.
  iex(1)> File.write("test.data",:crypto.rand_bytes(1024))
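
For the benchmark we will want files of several sizes. Here is a minimal sketch of that idea. The module name TestData, the function write_random/2, the sizes and the file names are all made up for illustration. It uses :crypto.strong_rand_bytes/1 instead of rand_bytes, because rand_bytes was deprecated and then removed in later Erlang/OTP releases. Note that it builds each file in memory before writing, which is fine for small test files but not for very large ones.

```elixir
defmodule TestData do
  # Hypothetical helper: write `size` random bytes to `path`.
  # The whole binary is generated in memory before it is written.
  def write_random(path, size) when is_integer(size) and size > 0 do
    File.write!(path, :crypto.strong_rand_bytes(size))
  end
end

# Usage: create 1 KB, 64 KB and 1 MB files in the temp directory.
for kb <- [1, 64, 1024] do
  path = Path.join(System.tmp_dir!(), "test_#{kb}k.data")
  TestData.write_random(path, kb * 1024)
  IO.puts("#{path}: #{File.stat!(path).size} bytes")
end
```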


But that isn't much fun, and while it solves this problem it doesn't give us a tool for interacting with external programs. We'll look at more general solutions in the next post...