Deploying deeply quantized neural networks on FPGA devices can be a time-consuming
task. This has led to research on tools that automate this procedure, specifically for the case of fast
machine learning: a specialized field concerned with very-low-latency processing of machine
learning algorithms, as opposed to the more common setting where throughput is the primary goal. Existing
automated solutions are mainly based on high-level synthesis, which tends to be inefficient for larger
neural networks due to its use of polynomial-time algorithms. In this paper, we present chisel4ml, a
tool for generating fully parallel and very low latency hardware implementations of deeply quantized
neural networks for FPGA devices. The circuits generated by chisel4ml use up to 40% fewer look-up
tables for a reasonably sized neural network compared to fully parallel implementations based on
high-level synthesis. Because chisel4ml uses structural descriptions of deeply quantized neural networks
in the form of Chisel generators, it can generate the hardware two orders of magnitude faster
than solutions based on high-level synthesis.