https://doi.org/10.5573/JSTS.2026.26.3.228
Hyunji Kim; Ji-Hoon Kim
Sparse matrix-vector multiplication (SpMV) is a memory-bound kernel widely used in scientific computing, machine learning, and graph analytics, where bandwidth utilization (BU) serves as a key performance metric for hardware accelerators. In our prior work, we presented a DRAM bandwidth-scalable SpMV accelerator whose processing element (PE) line count scales with DRAM bandwidth, combined with offline pre-processing to eliminate bank conflicts and data dependencies, achieving an average BU of 89% of the theoretical maximum. However, its column-index-based equal partitioning can cause significant load imbalance among PE lines when nonzero elements are unevenly distributed across columns: PE lines that finish early sit idle, which degrades BU, and the effect intensifies as the PE line count increases with higher off-chip bandwidth. In this paper, we analyze the relationship between column-wise nonzero distribution and PE line imbalance using 29 SuiteSparse benchmark matrices and propose a nonzero-aware partitioning strategy that assigns column ranges based on the actual number of nonzero elements to equalize workloads across PE lines. While the offline pre-processing in our prior work focused on conflict-free data rearrangement, the proposed method targets the preceding partitioning stage and requires no modification to the on-chip accelerator hardware. Cycle-accurate simulation of the 29 matrices with 2, 4, and 8 PE lines shows average BU improvements of 17.2%, 65.4%, and 113.7%, respectively, with no degradation for any matrix or configuration, confirming that the method preserves the bandwidth scalability of the baseline accelerator.
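The contrast between the two partitioning schemes can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this work: the function names, the prefix-sum cut rule, and the max-over-mean imbalance measure are assumptions chosen for illustration, and the column counts are made-up toy data rather than a SuiteSparse matrix.

```python
from bisect import bisect_left
from itertools import accumulate


def equal_column_partition(nnz_per_col, num_lines):
    """Baseline: contiguous column ranges of (nearly) equal width."""
    n = len(nnz_per_col)
    bounds = [i * n // num_lines for i in range(num_lines + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def nnz_aware_partition(nnz_per_col, num_lines):
    """Assumed rule: cut where the running nonzero count first reaches
    i/num_lines of the total, so each range holds a similar nonzero count."""
    n = len(nnz_per_col)
    prefix = list(accumulate(nnz_per_col))  # prefix[j] = nonzeros in columns 0..j
    total = prefix[-1] if prefix else 0
    bounds = [0]
    for i in range(1, num_lines):
        target = i * total / num_lines
        cut = bisect_left(prefix, target) + 1  # end (exclusive) of range i-1
        cut = min(max(cut, bounds[-1]), n)     # keep boundaries ordered and in range
        bounds.append(cut)
    bounds.append(n)
    return list(zip(bounds[:-1], bounds[1:]))


def line_loads(nnz_per_col, parts):
    """Nonzeros assigned to each PE line (half-open column ranges)."""
    return [sum(nnz_per_col[a:b]) for a, b in parts]


def imbalance(loads):
    """Heaviest load over mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


# Toy column histogram (hypothetical): one dense column, then sparse ones.
nnz_per_col = [8, 1, 1, 1, 1, 1, 1, 2]

equal_loads = line_loads(nnz_per_col, equal_column_partition(nnz_per_col, 2))
aware_loads = line_loads(nnz_per_col, nnz_aware_partition(nnz_per_col, 2))
print(equal_loads, imbalance(equal_loads))  # [11, 5] 1.375
print(aware_loads, imbalance(aware_loads))  # [8, 8] 1.0
```

Because this sketch cuts only at column boundaries, a single column holding more than its share of nonzeros cannot be split, and a range may come out empty; how such columns are handled in the proposed method is not specified here.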