This article presents a high-performance, hardware-efficient Softmax Function (SF) for a deep neural network accelerator. Softmax is used in the classification layer of deep learning models and also in the hidden layers of advanced neural networks such as Transformer and Capsule networks. The major challenge in designing an efficient hardware architecture for the SF is its complex exponential and division sub-blocks. Exploiting the mutual exclusivity of the COordinate Rotation DIgital Computer (CORDIC) algorithm, a hardware-optimized, pipelined CORDIC-based architecture is adopted to reduce area and power and to increase throughput. To maintain good accuracy in deep learning models, the proposed SF design undergoes a Pareto study of accuracy versus the number of pipeline stages. The proposed design is quantized to 16-bit precision, and its inference accuracy is validated on various datasets. The SF is prototyped on a Xilinx Zynq FPGA and can be operated at 685 MHz. An ASIC implementation at the 45 nm technology node reaches a maximum operating frequency of 5 GHz. The design incurs a validation accuracy loss of less than 2% in exchange for reduced silicon area and Energy-Delay Product (EDP) (by 12×). Post-synthesis simulation results show that the proposed design achieves 3× better performance in terms of area, power, and logic delay compared to state-of-the-art architectures.
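
The abstract does not spell out the datapath, so the following is only a behavioral sketch of how a CORDIC-based softmax might be organized, not the authors' architecture. It assumes that the exponential is computed with hyperbolic rotation-mode CORDIC and the division with linear vectoring-mode CORDIC, and that the inputs are range-reduced by subtracting the maximum and factoring out powers of two. The function names (`cordic_exp`, `cordic_div`, `cordic_softmax`) and the `n_iter` parameter are hypothetical. The sketch uses floating-point arithmetic and models neither the 16-bit quantization nor the pipelining.

```python
import math

LN2 = math.log(2.0)

def cordic_exp(r, n_iter=16):
    """Approximate exp(r) for small |r| (about |r| < 1.1) with
    hyperbolic rotation-mode CORDIC, using exp(r) = cosh(r) + sinh(r)."""
    # Shift sequence 1, 2, 3, 4, 4, 5, ...: shifts 4 and 13 are repeated,
    # which hyperbolic CORDIC needs in order to converge.
    shifts = []
    i = 1
    while len(shifts) < n_iter:
        shifts.append(i)
        if i in (4, 13) and len(shifts) < n_iter:
            shifts.append(i)
        i += 1
    # Accumulated gain of the hyperbolic micro-rotations.
    gain = 1.0
    for s in shifts:
        gain *= math.sqrt(1.0 - 2.0 ** (-2 * s))
    # Pre-scale x by 1/gain so the result needs no final correction.
    x, y, z = 1.0 / gain, 0.0, r
    for s in shifts:
        d = 1.0 if z >= 0 else -1.0
        t = 2.0 ** -s
        x, y = x + d * y * t, y + d * x * t  # shift-and-add in hardware
        z -= d * math.atanh(t)               # table lookup in hardware
    return x + y

def cordic_div(num, den, n_iter=16):
    """Approximate num / den (den > 0, |num / den| < 2) with
    linear vectoring-mode CORDIC."""
    y, z = num, 0.0
    for i in range(n_iter):
        d = 1.0 if y >= 0 else -1.0
        t = 2.0 ** -i
        y -= d * den * t  # drive the remainder toward zero
        z += d * t        # accumulate the quotient bit by bit
    return z

def cordic_softmax(logits, n_iter=16):
    """Softmax built from the two CORDIC kernels above (assumed structure)."""
    m = max(logits)
    exps = []
    for v in logits:
        x = v - m              # x <= 0, which avoids overflow
        q = round(x / LN2)     # integer power of two to factor out
        r = x - q * LN2        # residual, |r| <= ln(2) / 2
        exps.append(math.ldexp(cordic_exp(r, n_iter), q))
    total = sum(exps)
    return [cordic_div(e, total, n_iter) for e in exps]
```

In this sketch `n_iter` plays a role loosely analogous to the number of pipeline stages: fewer iterations mean less hardware per result but a larger approximation error, which is the kind of trade-off a Pareto study would explore.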