While learning how to get a cryptographic hash of a file in Elixir, I found myself switching back and forth between iex and erl -man to better understand how
to use the available Erlang functions from Elixir. After all, there are over 500 modules in the standard Erlang libraries that are available in Elixir. Wouldn't it be nice, I thought, to just be able to type
iex> h :crypto.hash
and get some basic information about the function. Elixir already supports listing all the functions available in both Elixir and Erlang modules, using
the module.TAB sequence
iex> h :crypto.TAB
If we have Erlang installed, the man pages must be there somewhere. And of course one of the tenets of IEx is that any functionality should be available in the Windows version if at all possible. While creating a command to shell out to erl -man would be simple, it wouldn't be portable.
Erlang installs its man pages in a separate path to avoid conflicts with the standard Unix man
pages, but puts them in the standard man/man3 locations under that path. The convention is that the
man page for an Erlang module is the Erlang module name followed by the section number. The
standard Erlang functions are documented in the erlang.3 man page.
So my next step was to reimplement erl -man in pure Elixir code. The first part went really
quickly: finding the man page for a given module was relatively straightforward. We just find
where the erl executable is, work backwards from that to find where the Erlang man pages
are, and then search for the module name.
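# Returns the directory holding Erlang's man pages, or nil if it cannot be found.
def manpath do
  case System.find_executable("erl") do
    nil -> nil
    erl_path -> find_manpath(erl_path)
  end
end

# Builds every ancestor directory of the erl executable and keeps the deepest
# one that contains man/man3/ets.3.
defp find_manpath(erl_path) do
  mpath =
    erl_path
    |> Path.split()
    |> Stream.scan(&Path.join(&2, &1))
    |> Enum.filter(fn p -> File.exists?(Path.join([p, "man", "man3", "ets.3"])) end)
    |> List.last()

  if mpath, do: Path.join(mpath, "man"), else: nil
end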
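The last step, searching for the module name, can be sketched along the same lines. This is only an illustration: ManPageSketch and find_manpage/2 are hypothetical names, and the code assumes the unmodified name-plus-section layout described above (an uncompressed <module>.3 file under man3).

```elixir
defmodule ManPageSketch do
  # Hypothetical helper: returns the path of the section 3 man page for an
  # Erlang module under `manpath`, or nil when no such file exists.
  def find_manpage(manpath, module) when is_atom(module) do
    page = Path.join([manpath, "man3", Atom.to_string(module) <> ".3"])
    if File.regular?(page), do: page, else: nil
  end
end
```

Given the result of manpath above, ManPageSketch.find_manpage(manpath, :crypto) would return the path to crypto.3 when that file is installed, and nil otherwise.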
After that I implemented a very simple-minded translator from man nroff macros to markdown and was able
to display an Erlang man page from the iex prompt. During this process I noticed that the Erlang
man pages use a very consistent, small subset of the man nroff macro package. This led me to believe
that it would be possible to extract the documentation for a specific function from the man pages.
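To give a feel for what such a simple-minded translator can look like, here is a minimal sketch. It is not the code I ended up with: NroffSketch is a hypothetical name, and the handful of requests it handles (.TH, .SH, .LP, .B and the \fB, \fI, \fR, \- escapes) are general man macros chosen for illustration, not a verified list of what the Erlang pages use.

```elixir
defmodule NroffSketch do
  # Converts a few common man macros to markdown, one line at a time.
  def to_markdown(text) do
    text
    |> String.split("\n")
    |> Enum.map(&convert_line/1)
    |> Enum.reject(&is_nil/1)
    |> Enum.join("\n")
  end

  # Title line: nothing worth showing, so drop it.
  defp convert_line(".TH" <> _rest), do: nil
  # Section heading becomes a markdown heading.
  defp convert_line(".SH " <> title), do: "## " <> String.trim(title)
  # Paragraph break becomes a blank line.
  defp convert_line(".LP" <> _rest), do: ""
  # Bold request with its text on the same line.
  defp convert_line(".B " <> rest), do: "**" <> String.trim(rest) <> "**"
  # Every other request is ignored in this sketch.
  defp convert_line("." <> _rest), do: nil
  # Plain text: only the inline escapes need rewriting.
  defp convert_line(line), do: inline(line)

  defp inline(line) do
    line
    |> String.replace(~r/\\fB(.*?)\\f[RP]/, "**\\1**")
    |> String.replace(~r/\\fI(.*?)\\f[RP]/, "*\\1*")
    |> String.replace("\\-", "-")
  end
end
```

Because each request maps to one function clause, supporting another macro is a matter of adding one more clause above the catch-all.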
It turns out that the Erlang documentation is maintained in XML and translated by an xmerl-based program
into both man pages and HTML documentation. Unfortunately, this XML source is not included in the default
binary distributions of Erlang. Unix versions get the man pages (and usually the HTML), but the Windows
distributions only get the HTML. Parsing the HTML is a lot more complicated than dealing with the nroff,
and there are no HTML parsers in the standard Elixir libs.
At this point, I decided to dig a bit deeper and figure out just how
iex> h Atom.to_char
really works.
Friday, June 19, 2015
Friday, June 5, 2015
Micro Benchmarking in Elixir using Benchfella
When I finished the last post, I had every intention of writing a post about how to use the Port module in Elixir to interact with the system shell to do I/O redirection. However, I now understand why people complain about Erlang documentation. Ports in Elixir aren't documented other than by a reference to the underlying Erlang libraries, and the Erlang documentation for ports is very incomplete. (For example, what are the default drivers available in Erlang?)
Ports are gateways between the Erlang VM and external code and processes. The Erlang VM is sensitive to code "hanging" in any fashion, and ports need drivers that are aware of how to interact with the Erlang VM safely. However, I was unable to find any clear documentation of just what drivers are available, and the examples I found were inconsistent.
UPDATE:
The types and drivers are documented in the man page for erlang in the open_port section.
erl -man erlang
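For reference, a minimal sketch of talking to an external program through a port looks something like the following. PortSketch and run/2 are hypothetical names; the sketch assumes the program can be found on the PATH and uses the :spawn_executable form with the :binary, :exit_status and :args options from the open_port documentation.

```elixir
defmodule PortSketch do
  # Runs an external program through a port and returns {output, exit_status}.
  def run(command, args \\ []) do
    path = System.find_executable(command) || raise "#{command} not found"

    port =
      Port.open({:spawn_executable, path}, [:binary, :exit_status, args: args])

    collect(port, "")
  end

  # Gathers data messages from the port until it reports an exit status.
  defp collect(port, acc) do
    receive do
      {^port, {:data, data}} -> collect(port, acc <> data)
      {^port, {:exit_status, status}} -> {acc, status}
    end
  end
end
```

On a Unix-like system, PortSketch.run("echo", ["hello"]) should hand back the text echo printed together with exit status 0. Note that this does no shell redirection at all, which is the part I found hard to pin down.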
I did, however, find the Porcelain Elixir module, which is both well documented and very straightforward to use. Having taken this long to get back to this, I'd just as soon go on to the actual benchmarks.
Benchfella is a micro benchmarking framework that works much like ExUnit does for testing and
you can use the same Dave Thomas hack for creating many tests that iterate over a list of values.
defmodule Hash do
  use Benchfella

  # Chunk sizes in bytes: every power of two from 2**10 to 2**24.
  @lengths [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288,
            1048576, 2097152, 4194304, 8388608, 16777216]

  # Each comprehension runs at compile time and defines one benchmark per chunk size.
  for chunk <- @lengths do
    @chunk chunk
    bench "Hash 2**24 file by #{@chunk}" do
      hash_test("./bench/data_2_24", @chunk)
    end
  end

  for chunk <- @lengths do
    @chunk chunk
    bench "Hash 2**26 file by #{@chunk}" do
      hash_test("./bench/data_2_26", @chunk)
    end
  end

  for chunk <- @lengths do
    @chunk chunk
    bench "Hash 2**28 file by #{@chunk}" do
      hash_test("./bench/data_2_28", @chunk)
    end
  end

  # Streams the file in chunk-byte pieces, folds each piece into a SHA-256
  # context, and returns the digest as an uppercase hex string.
  # Newer Elixir releases prefer the argument order File.stream!(file, chunk).
  def hash_test(file, chunk) do
    File.stream!(file, [], chunk)
    |> Enum.reduce(:crypto.hash_init(:sha256), fn data, acc -> :crypto.hash_update(acc, data) end)
    |> :crypto.hash_final()
    |> Base.encode16()
  end
end
Benchfella runs each test as many times as possible in a given interval (the default is one second) and returns the average time per test over that interval. Data from each run is stored on the filesystem so you can compare runs. The plot below shows the results from using chunk sizes in powers of 2 from 2**10 to 2**24 to hash a file of size 2**24.
The results are similar for files of size 2**26 and 2**28. As you can see, there is a significant advantage to using a large chunk size (with an odd bump at 2**23). This test
was done on a MacBook Pro with 16 GB of memory and an SSD.
This shows that using large binaries in Elixir (and Erlang) is generally the fastest way to deal with large data sets. Of course, you need to trade off total available memory against the number of binaries you want to process at a time.
The other benchmark compared the "chunk" method of hashing the file, using a chunk size larger than the file, with simply reading the entire file into a string and
computing its hash. The simple read method was consistently twice as fast as the single-chunk method.
So for my resulting application I chose a chunk size that allows the code to process multiple files at a time, and the code picks a method for computing the hash based on the
size of the file.
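A rough sketch of that size-based choice is shown here. HashSketch and its function names are hypothetical, and the threshold and chunk size are placeholder values, not the ones from my application.

```elixir
defmodule HashSketch do
  # Placeholder values: tune both for the memory you have available.
  @threshold 16 * 1024 * 1024
  @chunk 1024 * 1024

  # Small files are read in one go; larger ones are streamed in chunks.
  def hash_file(file, threshold \\ @threshold, chunk \\ @chunk) do
    %File.Stat{size: size} = File.stat!(file)
    if size <= threshold, do: hash_whole(file), else: hash_chunked(file, chunk)
  end

  # Reads the entire file into one binary and hashes it in a single call.
  def hash_whole(file) do
    :crypto.hash(:sha256, File.read!(file)) |> Base.encode16()
  end

  # Same approach as hash_test above: fold fixed-size chunks into a hash context.
  # Newer Elixir releases prefer the argument order File.stream!(file, chunk).
  def hash_chunked(file, chunk) do
    file
    |> File.stream!([], chunk)
    |> Enum.reduce(:crypto.hash_init(:sha256), fn data, acc -> :crypto.hash_update(acc, data) end)
    |> :crypto.hash_final()
    |> Base.encode16()
  end
end
```

Both paths produce the same digest for the same file, so the threshold only affects speed and memory use.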
Labels: benchfella, benchmark, cryptographic hash, elixir, erlang, file hashing