BLAS3 optimization for the Godson3B1500
 Ming Zhang^{1},
 Naijie Gu^{1} and
 Kaixin Ren^{1}
Received: 24 March 2016
Accepted: 17 November 2016
Published: 25 November 2016
Abstract
This paper proposes a performance model for general matrix multiplication (GEMM) on decoupled access/execute (DAE) architecture platforms, in order to guide improvements of the GEMM performance in the Godson3B1500. This model focuses on the features of access processors (APs) and execute processors (EPs). To reduce the synchronization overhead between APs and EPs, a synchronization module selection mechanism (SMSM) is presented. Furthermore, two optimized algorithms of GEMM for DAE platforms based on the performance model are proposed for ideal performance. In the proposed algorithms, the kernel functions are optimized with single instruction multiple data (SIMD) vector instructions, and the overhead of AP is almost overlapped with EP by taking full advantage of the features of the architecture. Moreover, the synchronization overhead can be reduced according to the SMSM. In the end, the proposed algorithms are tested on the Godson3B1500. The experimental results demonstrate that the computing performance of dGEMM reaches 91.9% of the theoretical peak performance and that zGEMM can reach 93% of the theoretical peak performance.
Introduction
Basic linear algebra subprograms (BLAS) (Netlib 2016a) are basic and significant mathematics kernels that provide key functions for high-performance computing (HPC) applications. General matrix multiplication (GEMM), the kernel of level-3 BLAS, is vital for the numerical software Lapack (Netlib 2016b) and the performance benchmark Linpack (Netlib 2016c). In Linpack in particular, GEMM accounts for 93% of the entire execution time when it is unoptimized (Zhang et al. 2004). Moreover, GEMM is representative of applications in which both computation and memory access are in high demand. Therefore, optimizing the performance of GEMM is significant for guiding performance improvements in other applications. Additionally, optimizing computing-intensive applications such as GEMM can expose potential problems and help to find bugs in newly developed hardware platforms.
Recently, numerous studies have been conducted to improve the performance of BLAS. Many libraries such as Intel MKL, AMD ACML, ATLAS and GotoBLAS (Goto and Van De Geijn 2008a, b) have been supplied by CPU vendors or HPC researchers. These libraries aim at the highest level of performance on various hardware platforms. Additionally, Allen et al. (2009) described autotuning and optimized GEMM techniques for GPUs. Wang et al. (2013) presented AUGEM, a template-based optimization framework that can automatically generate fully optimized assembly DLA kernels. The DLA kernels generated by their template-based approach surpass the implementations in the MKL and ACML libraries. Moreover, Gu et al. (2008) conducted extensive work on BLAS3 optimization on the Godson2F platform. He et al. (2012) carried out a study on the optimization of BLAS3 on the Godson3A. Zhang et al. (2012) released a new library, OpenBLAS, which greatly improves the BLAS3 performance on the Godson3A.
The optimized algorithms and models described above can efficiently enhance performance and guide users in designing optimized frameworks. However, with advancements in the peak computing capability of processors, conventional memory access methods cannot satisfy computational requirements, and traditional optimization methods become limited. To overcome the memory wall, hardware that uses asynchronous memory access technologies has been developed. As a representative, the decoupled access/execute (DAE) architecture (Smith 1982, 1984) was proposed by Smith in 1982. Generally, there are several access processors (APs) and execute processors (EPs) on DAE platforms. APs are responsible for memory access, and EPs are responsible for computations. These functional units are independent and can work in parallel. DAE has now become a valued architecture for HPC applications such as BLAS and FFTW due to its superior computing ability and memory access performance.
It is difficult for applications to auto-optimize performance by making full use of EPs and APs, and manual optimizations are needed for the DAE architecture. The Godson3B series consists of DAE platforms. Some studies have been conducted to improve the performance of applications on the Godson3B. Zhu (2011) designed a new dGEMM algorithm for the Godson3B, which was implemented on a simulation platform. Zhao et al. (2013, 2014, 2015) introduced several auto-optimization technologies for BLAS, and discussed the optimizations of dGEMV and dGEMM (ATGEMM) on the Godson3B1000 in detail. ATGEMM was optimized by using the L2 cache as the intermediate storage space.
However, these studies do not give thorough consideration to the performance impacts of various architectures, including DAE. Therefore, in order to facilitate the optimization of applications on the DAE architecture, a performance model of GEMM for DAE platforms is proposed in this paper. The impacts related to computation and memory access are parameterized in the proposed model, and the time needed by APs and EPs is evaluated according to the computing amount and computing power. The runtime of the computing kernels can be estimated from the features of the EPs. Taking into account the various factors in the performance model, the overall runtime can also be estimated. The GEMM performance is improved by analyzing the variables in the model that most strongly influence the overall runtime. Additionally, several optimized algorithms for ideal performance on the Godson3B1500 are proposed based on the performance model.
This paper is organized as follows. The second section describes the background, including the basic GEMM algorithm and the Godson3B1500. The third section discusses the performance model, followed by the optimization technologies. The fourth section details the proposed GEMM algorithms. The fifth section presents the numerical accuracy and the performance improvements. Finally, conclusions are drawn in the last section.
Background
This section describes the background, including the basic GEMM algorithm and the Godson3B1500. To introduce the Godson3B1500, we mainly focus on memory access methods and vectorization instructions.
Basic GEMM algorithm
GEMM is called dGEMM when the elements of the matrices are double-precision floating-point numbers. GEMM becomes zGEMM for double-precision floating-point complex numbers. A complex number consists of a real part and an imaginary part. Unlike for real numbers, a complex multiply-add in GEMM consists of four multiplication operations and four addition/subtraction operations (two inside the product and two for the accumulation). Assuming that complex number x is \(a+i\times b\) (\(i=\sqrt{-1}\)), the result is \((ac-bd)+i\times (ad+bc)\) when x is multiplied by \(c+i\times d\). In BLAS, the real and imaginary parts of complex numbers are stored interleaved in memory.
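The complex multiply-add and the interleaved layout described above can be illustrated with a minimal sketch. The helper names `complex_muladd_interleaved` and `to_interleaved` are hypothetical and are used only for illustration; they are not part of BLAS.

```python
def complex_muladd_interleaved(acc, x, y):
    """Return acc + x*y for complex values stored as [real, imag] pairs."""
    a, b = x  # x = a + i*b
    c, d = y  # y = c + i*d
    real = acc[0] + a * c - b * d  # two multiplications, two additions/subtractions
    imag = acc[1] + a * d + b * c  # two multiplications, two additions
    return [real, imag]


def to_interleaved(values):
    """Flatten complex numbers into [re0, im0, re1, im1, ...]."""
    flat = []
    for z in values:
        flat.extend([z.real, z.imag])
    return flat


x, y = 1 + 2j, 3 + 4j
result = complex_muladd_interleaved([0.0, 0.0], [x.real, x.imag], [y.real, y.imag])
print(result)                  # [-5.0, 10.0]
print(to_interleaved([x, y]))  # [1.0, 2.0, 3.0, 4.0]
```

Each call performs four multiplications and four additions/subtractions, which is the operation count assumed for zGEMM in the text.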
Godson3B1500
The Godson3B1500 supports a cache lock mechanism. When the lock window is configured, the cache blocks located in the locked L3 cache space cannot be replaced until they are updated manually. In computation-intensive applications, many data need to be accessed and computed multiple times. Cache misses introduce considerable extra overhead, which notably degrades performance. The impact is even more severe on platforms with a random cache replacement policy. The lock mechanism keeps frequently required data in the locked cache, which greatly reduces the influence of cache misses on computation-intensive applications and enhances application performance.
The Godson3B1500 is a DAE platform. VectDMA and TransDMA work as APs, and the vector function units work as EPs. The GS464V core can issue two floating-point vector instructions per cycle, and each instruction can launch four multiply-add operations. Each multiply-add operation comprises two floating-point operations. Moreover, the EP comprises 8 cores. When the CPU frequency is 1.5 GHz, the theoretical peak computing capacity (\(Perf_{peak}\)) reaches \(2\times 4\times 2\times 8\times 1.5\) (192.0) GFlops. Generally, the frequency is configured at 800 MHz, and the peak performance is 102.4 GFlops.
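The peak figures quoted above follow directly from the product of the issue width, the vector width, the operations per multiply-add, the core count and the frequency. The short sketch below reproduces this arithmetic; the function name and its parameter names are illustrative assumptions.

```python
def peak_gflops(freq_ghz, cores=8, vector_issue=2, lanes=4, flops_per_muladd=2):
    """Theoretical peak: instructions/cycle x multiply-adds x flops x cores x GHz."""
    return vector_issue * lanes * flops_per_muladd * cores * freq_ghz


print(peak_gflops(1.5))           # 192.0 (8 cores at 1.5 GHz)
print(peak_gflops(0.8))           # 102.4 (8 cores at 800 MHz)
print(peak_gflops(0.8, cores=4))  # 51.2  (one four-core node at 800 MHz)
```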
Performance optimizations
These performance optimizations mainly target the MIPS architecture. MIPS is a streamlined and highly scalable reduced instruction set computer (RISC) architecture. It can support SIMD vector instructions and visible pipeline delay slots. The MIPS architecture is characterized by a large number of registers and by the number and character of its instructions. Some implementations also support different memory access methods, such as normal CPU access and DMA access. Generally, MIPS platforms have multilevel caches.
This section presents the performance model of GEMM on the DAE architecture. The model is developed according to the features of the DAE architecture and briefly describes the relationship between the overall performance and the EP and AP times for all DAE architectures. The most important part of the model is the relationship between performance and architecture, which covers the number of functional units, their computational ability, the instruction pipeline structure and the capacity of the APs, and which focuses on the Godson3B1500. Moreover, based on the performance model, some optimization technologies are discussed for improving the GEMM performance on the Godson3B1500.
Performance model

\(T_{mem}\) represents the time for data transfer between different storage hierarchies (e.g., memory, caches and register files) using normal CPU memory access instructions. It can be defined as in (2):
$$T_{mem}=\mathop {\sum }\limits _{j=1}^{p}{\mathop {\sum }\limits _{k=1}^{h}{\frac{l(j,k)}{\omega (k)}}} =\mathop {\sum }\limits _{k=1}^h{\frac{L(k)}{\omega (k)}}$$(2)
where l(j, k) denotes the amount of data accessed from the kth layer of memory in the jth computing stage, L(k) denotes the total amount of data accessed from the kth layer over all p computing stages, h is the number of memory layers, and \(\omega (k)\) denotes the bandwidth for accessing the kth layer of memory.

\(T_{shuffle}\) denotes the time for reorganizing data in the vector registers, including changing the position of the data and negating a number. In some cases, \(T_{shuffle}\) can be partially minimized or avoided by optimizing the multiplication of complex numbers and by integrating the shuffle function into SIMD vector instructions on the Godson3B1500.

\(T_{EP}\) represents the time required for the kernel computation of GEMM. It mainly depends on the size of GEMM and the computing capacity of the EPs. For GEMM(N, N, N), \(T_{EP}\) can be defined as in (3):
$$T_{EP}=\frac{N^3\times size_{op}}{s\times v \times g \times f} + T_{shuffle} +\lambda _{1}T_{mem}$$(3)
In (3), \(size_{op}\) denotes the number of operations contributed by each of the \(N^3\) elementary operations, s denotes the number of function units, v denotes the average degree of parallelism of each instruction, g denotes the number of operations performed by each function unit in one cycle, and f denotes the frequency of the CPU. \(\lambda _{1}\) denotes the overlapping factor of the time for memory access by the EPs. The parameters g, s and f are determined by the hardware and are fixed for a given platform.

\({T_{APi}}\) represents the data transfer time of the ith AP. It can be defined as in (4):
$$T_{APi}=\mathop {\sum }\limits _{j=1}^p{\frac{Count_{i,j}}{Speed_{AP_{i,j}}}}.$$(4)
where \(Count_{i,j}\) denotes the size of the data transferred by the ith AP in the jth stage, \(Speed_{AP_{i,j}}\) denotes the memory access speed of \(AP_i\) in the jth stage, and p denotes the number of stages. The GEMM process has two stages. The first stage is the transfer of data from the memory to the locked L3 cache, and the second stage is the transfer of data from the L3 cache to the vector registers.

\(T_{sync}\) denotes the overhead of the synchronization between APs and EPs, such as the waiting time between computation and DMA in some architectures.

\(T_{extra}\) denotes the extra overhead of other processes, such as the computation of positions for data prefetching and data storing.
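To make the roles of the parameters in (3) and (4) concrete, the following minimal sketch evaluates both expressions. All input values are hypothetical: the choice s = 8 assumes two vector units in each of the four cores of a node, and the AP data sizes and speeds are invented for illustration rather than measured on the Godson3B1500.

```python
def t_ep(n, size_op, s, v, g, f_hz, t_shuffle=0.0, lambda1=0.0, t_mem=0.0):
    """Equation (3): EP time for GEMM(N, N, N), in seconds."""
    compute = n ** 3 * size_op / (s * v * g * f_hz)
    return compute + t_shuffle + lambda1 * t_mem


def t_ap(counts, speeds):
    """Equation (4): AP time as the sum over stages of Count / Speed."""
    if len(counts) != len(speeds):
        raise ValueError("one speed is needed per stage")
    return sum(count / speed for count, speed in zip(counts, speeds))


# Hypothetical dGEMM inputs: N = 4000 at 800 MHz with no shuffle or memory stall.
ep_seconds = t_ep(n=4000, size_op=2, s=8, v=4, g=2, f_hz=0.8e9)
# Hypothetical two-stage transfer: memory -> locked L3, then L3 -> vector registers.
ap_seconds = t_ap(counts=[4.0e6, 4.0e6], speeds=[8.0e9, 32.0e9])
print(f"T_EP ~ {ep_seconds:.3f} s, T_AP ~ {ap_seconds:.6f} s")
```

With these illustrative inputs the compute term dominates, which is the situation in which the AP time can be hidden behind the EP time.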
As shown in (8), in order to enhance the performance P, the variables \(T_{shuffle}\), \(T_{sync}\), \(T_{extra}\) and \(\lambda _{1}\) should be reduced, while \(\varrho\) and v should be increased. In the DAE architecture, APs and EPs can work in parallel. To reduce the memory access overhead, the APs carry out most of the memory access tasks, and the normal memory access unit is responsible for the remaining ones. Most GEMM tasks are computations, so the extra overhead makes little difference to the overall runtime. \(\varrho\) is influenced by the ratio of computation to memory access overhead, and it is mainly determined by the features of the algorithm and hardware. The variables \(\varrho\) and \(T_{extra}\) are not discussed in this paper. The following subsections mainly discuss the optimization of v, \(T_{shuffle}\), \(\lambda _{1}\) and \(T_{sync}\).
Vectorization
Computations of dGEMM are operations between real numbers. All computations are multiply-add operations. Figure 4 shows the operations between matrices A and B. The sizes of matrices A and B are \(4\times 1\) and \(1\times 4\), respectively, for which there exist 16 multiply-add operations. Normal instructions operate on the normal registers and can launch one operation per cycle. As shown in Fig. 4, when the normal multiply-add instruction madd is used, 16 madd instructions are required, and the original value of v equals 1. In the BLAS library, the data in a matrix are arranged in column-major order. When blockB is preloaded to the L3 cache, a matrix transposition is needed to match VectDMA. Compared to the column-major order of the original matrix B, the data of blockB in the L3 cache can be regarded as being in row-major order. Every four neighboring numbers in matrices A and B can then be loaded into the same vector register, and the instruction VBCMULADDPD is called. At the end of the computations, the results are stored in four corresponding vector registers. In total, four vector instructions are needed for the kernel computation. After vectorization, the number of kernel instructions decreases from 16 to 4, and v changes from 1 to 4. There are no shuffle operations in dGEMM, so the value of \(T_{shuffle}\) is 0.
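The reduction from 16 scalar instructions to 4 vector instructions can be simulated in a few lines. This is only a sketch: `bc_muladd` models the assumed behavior of VBCMULADDPD as a broadcast multiply-add (one element of B is broadcast to all four lanes, multiplied by a vector of A, and added to an accumulator), which is an interpretation of the description above and not a statement of the instruction's documented semantics.

```python
LANES = 4  # four double-precision elements per 256-bit vector register


def scalar_kernel(a, b, c):
    """One madd per element of the 4-by-4 result; returns the instruction count."""
    count = 0
    for j in range(LANES):
        for i in range(LANES):
            c[j][i] += a[i] * b[j]
            count += 1
    return count


def bc_muladd(acc, va, vb, lane):
    """Assumed broadcast multiply-add: acc[i] += va[i] * vb[lane] on every lane."""
    return [acc[i] + va[i] * vb[lane] for i in range(LANES)]


def vector_kernel(va, vb, acc_regs):
    """One broadcast multiply-add per result column; returns the instruction count."""
    count = 0
    for j in range(LANES):
        acc_regs[j] = bc_muladd(acc_regs[j], va, vb, j)
        count += 1
    return count


a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
c_scalar = [[0.0] * LANES for _ in range(LANES)]
c_vector = [[0.0] * LANES for _ in range(LANES)]
n_scalar = scalar_kernel(a, b, c_scalar)
n_vector = vector_kernel(a, b, c_vector)
print(n_scalar, n_vector, c_scalar == c_vector)  # 16 4 True
```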
Computations of zGEMM are operations between complex numbers. Unlike dGEMM, the operations of zGEMM consist of multiply-add and multiply-subtract operations. Figure 5 shows the kernel multiplication of zGEMM(2, 1, 2), whose result is a 2-by-2 matrix. When the kernel is implemented with normal instructions, such as the multiply-add instruction madd and the multiply-subtract instruction msub, the instructions operate on the normal floating-point registers. Normal registers \(r_{x}\) and \(i_{x}\) are used to store the real and imaginary parts of complex numbers, respectively. As shown in the middle subfigure of Fig. 5, 16 normal instructions are needed for zGEMM(2, 1, 2), and the original value of v equals 1.
As shown in the right subfigure of Fig. 5, 5 vector instructions are needed to vectorize zGEMM(2, 1, 2). First, the data of matrices A and B are loaded into vector registers \(V_A\) and \(V_B\), respectively, by using VectDMA. Then, the results are updated with \(V_A\) and the real parts of \(V_B\) by calling VBCMULADDPD. Next, the data in the vector register \(V_A\) are reorganized and shuffled. Finally, the results are updated with the shuffled \(V_A\) and the imaginary parts of \(V_B\) by calling VBCMULADDPD again. After vectorization, the value of v rises from 1 to 4. When the computing kernel is zGEMM(m, k, n), there are \(k\times n/2\) shuffle instructions, and the total number of vector-computing instructions is \(m\times k\times n\). The ratio of the number of shuffle instructions to the number of overall instructions is therefore 1/(2m + 1). In other words, the shuffle operation takes up approximately 1/(2m + 1) of the overall processing time.
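The five-instruction sequence and the 1/(2m + 1) ratio can be checked with a small simulation. The sketch makes two assumptions that go beyond the text: the shuffle is modeled as swapping the real and imaginary parts of each element of \(V_A\) and negating the new real part, and the broadcast multiply-add takes a single scalar operand. All function names are hypothetical.

```python
from fractions import Fraction


def shuffle_neg_swap(va):
    """Assumed shuffle: [re0, im0, re1, im1] -> [-im0, re0, -im1, re1]."""
    return [-va[1], va[0], -va[3], va[2]]


def bc_muladd(acc, va, scalar):
    """Assumed broadcast multiply-add: acc[i] += va[i] * scalar on every lane."""
    return [x + y * scalar for x, y in zip(acc, va)]


def zgemm_2x1x2(a, b):
    """Multiply a 2-by-1 complex column by a 1-by-2 complex row; count instructions."""
    va = [a[0].real, a[0].imag, a[1].real, a[1].imag]
    acc = [[0.0] * 4 for _ in b]
    count = 0
    for j, bj in enumerate(b):   # update with V_A and the real parts of V_B
        acc[j] = bc_muladd(acc[j], va, bj.real)
        count += 1
    va_shuffled = shuffle_neg_swap(va)
    count += 1                   # one shuffle of V_A
    for j, bj in enumerate(b):   # update with shuffled V_A and the imaginary parts
        acc[j] = bc_muladd(acc[j], va_shuffled, bj.imag)
        count += 1
    return acc, count


def shuffle_ratio(m, k, n):
    """Shuffle instructions divided by all instructions for zGEMM(m, k, n)."""
    shuffles = Fraction(k * n, 2)
    return shuffles / (shuffles + m * k * n)


a = [1 + 2j, 3 - 1j]
b = [2 + 1j, -1 + 4j]
acc, count = zgemm_2x1x2(a, b)
print(count)                   # 5
print(shuffle_ratio(2, 1, 2))  # 1/5
print(shuffle_ratio(8, 4, 8))  # 1/17
```

Each accumulator register holds one column of the 2-by-2 result in interleaved form, so the output can be compared directly with ordinary complex multiplication.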
Mechanism for issuing multiple instructions
Assuming that the memory access time \(T_{mem}\) is fixed, \(\lambda _{1}\) should be reduced to enhance the overall performance. The Godson3B1500 supports a mechanism for issuing multiple instructions, and this mechanism can decrease \(\lambda _{1}\). Each core has two vector floating-point operation units, two fixed-point operation units and one memory access unit. Four instructions can be issued simultaneously in one cycle, including two floating-point instructions, one memory access instruction and one fixed-point instruction. In the GEMM kernel, most instructions are computing instructions. To improve the performance, the floating-point operation units should be kept busy. Non-blocking cache access instructions can be used for data preloading without influencing the efficiency of the computing instruction sequence.
Synchronization module selection mechanism (SMSM)
When the time of EPs is less than that of APs, the synchronization module is needed. However, the EP and AP times are not fixed and change slightly with the CPU execution state. If \(\widetilde{R}\) is estimated carelessly and the synchronization module is not deployed, unfavorable scheduling may lead to wrong computing results. To avoid this potential problem, \(\widetilde{R}\) should be determined cautiously. To ensure correct results, the EP and AP times for kernel computing are measured repeatedly, and the results are recorded. \(\widetilde{R}\) is calculated from the minimal EP time and the maximal AP time among the recorded results. When \(\widetilde{R}\) is larger than 1, there is no need for EPs to wait for APs.
 (a) Run the computation kernel n times and measure the time of each run.
 (b) Use (9) to calculate \(\widetilde{R}_i\) for the ith test and record the result as \(x_i\).
 (c) Calculate the mathematical expectation \(\bar{X}\) with \(\bar{X}=\frac{1}{n}\mathop {\sum }\nolimits _{i=1}^{n}{x_i}\) and the standard deviation \(\sigma\) with \(\sigma =\sqrt{\mathop {\sum }\nolimits _{i=1}^{n}{(x_i-\bar{X})}^2/n}\).
 (d) Use a one-tailed test to determine whether \(\widetilde{R}>1\) (or \(\widetilde{R}\le 1\)) can be established at the 95% confidence level.
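Steps (a) to (d) can be sketched as follows. Several details are assumptions rather than the authors' procedure: \(\widetilde{R}_i\) is taken to be the ratio of EP time to AP time for the ith run (equation (9) is not reproduced here), the one-tailed test is a z-test with the standard normal critical value 1.645, and the decision labels are hypothetical. The sketch is conservative in that synchronization is skipped only when \(\widetilde{R}>1\) is established.

```python
import math

Z_95_ONE_TAILED = 1.645  # standard normal critical value for a one-tailed 95% test


def smsm_decision(ep_times, ap_times):
    """Return (decision, mean, sigma) from paired EP and AP timings."""
    ratios = [ep / ap for ep, ap in zip(ep_times, ap_times)]  # assumed form of (9)
    n = len(ratios)
    mean = sum(ratios) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in ratios) / n)
    if sigma == 0.0:
        # No spread in the samples: compare the mean with 1 directly.
        return ("skip_sync" if mean > 1.0 else "use_sync"), mean, sigma
    z = (mean - 1.0) / (sigma / math.sqrt(n))
    decision = "skip_sync" if z > Z_95_ONE_TAILED else "use_sync"
    return decision, mean, sigma


# Hypothetical timings in milliseconds.
fast_ap = smsm_decision([1.20, 1.22, 1.19, 1.21] * 5, [1.00] * 20)
slow_ap = smsm_decision([0.95, 1.02, 0.98, 1.01] * 5, [1.00] * 20)
print(fast_ap[0], slow_ap[0])  # skip_sync use_sync
```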
Optimized algorithm based on DMA
Classic block matrix multiplications form the essential basis of our algorithms. GEMM consists of multilevel matrix partitions, and every level follows the rules for block matrix multiplication, which are discussed in Goto and Van De Geijn (2008b). When the matrix is divided, many small block matrix multiplications result. These smaller matrix multiplications can in turn be divided iteratively. If every matrix partition is correct, the GEMM algorithm can be proved to be correct.
 (1) When the matrix A is divided into submatrix blocks of dimension M-by-\(k_0\) and B is divided into submatrix blocks of dimension \(k_0\)-by-N, GEMM(M, K, N) can be described as \(\sum\nolimits _{i=1}^{K/k_0}{GEMM_i(M,k_0,N)}\).
 (2) When the matrix B is divided into submatrix blocks of dimension K-by-\(n_0\) and A is not divided, GEMM(M, K, N) can be described as \((GEMM_1(M,K,n_0),\ldots ,GEMM_{N/n_0}(M,K,n_0))\).
 (3) When the matrix A is divided into submatrix blocks of dimension \(m_0\)-by-K and B is not divided, GEMM(M, K, N) can be described as \((GEMM_1(m_0,K,N),\ldots ,GEMM_{M/m_0}(m_0,K,N))^T\).
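The three partition rules above can be verified on a small example. The following sketch uses plain Python lists and hypothetical helper names (`split_k`, `split_n`, `split_m`); it assumes that the block sizes divide the matrix dimensions evenly.

```python
def matmul(a, b):
    """Reference product of two matrices stored as lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def split_k(a, b, k0):
    """Rule (1): sum of K/k0 products of M-by-k0 and k0-by-N blocks."""
    c = [[0] * len(b[0]) for _ in a]
    for p in range(0, len(b), k0):
        part = matmul([row[p:p + k0] for row in a], b[p:p + k0])
        c = [[x + y for x, y in zip(c_row, p_row)] for c_row, p_row in zip(c, part)]
    return c


def split_n(a, b, n0):
    """Rule (2): N/n0 column panels of the result placed side by side."""
    c = [[] for _ in a]
    for q in range(0, len(b[0]), n0):
        part = matmul(a, [row[q:q + n0] for row in b])
        for c_row, p_row in zip(c, part):
            c_row.extend(p_row)
    return c


def split_m(a, b, m0):
    """Rule (3): M/m0 row panels of the result stacked vertically."""
    c = []
    for r in range(0, len(a), m0):
        c.extend(matmul(a[r:r + m0], b))
    return c


a = [[i + j for j in range(6)] for i in range(4)]      # 4-by-6
b = [[i * j + 1 for j in range(4)] for i in range(6)]  # 6-by-4
reference = matmul(a, b)
print(split_k(a, b, 2) == reference,
      split_n(a, b, 2) == reference,
      split_m(a, b, 2) == reference)  # True True True
```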
The ratio of computation to memory access is N:4 for basic dGEMM(N, N, N) and N:2 for zGEMM(N, N, N). Compared with the amount of computation, the amount of memory access is very small. However, because computational power is high while memory access is slow, the memory wall is still the bottleneck of GEMM performance. Many attempts have been made to optimize BLAS3 with conventional optimization technologies such as loop unrolling, software pipelining and processor data prefetching. Loop unrolling is used to enhance the reuse of data in the caches and thereby reduce the amount of memory access. Software pipelining is used to eliminate the dependence between execution and memory access, so that the execution and memory access units can progress in parallel. However, the theoretical peak performance is so high that the memory access time of the processor cannot be concealed by the execution time. Only approximately 35% of the theoretical peak performance can be obtained. Moreover, even after the loop unrolling parameters are adjusted, the performance remains very low. Therefore, the bottleneck cannot be removed by using conventional optimization technologies.
In order to solve the memory wall and guarantee data supply, two novel algorithms based on DAE architecture are proposed, as shown in Algorithms 1 and 2. The computing kernels of these two algorithms utilize optimization technologies such as the vectorization and mechanism for issuing multiple instructions. SMSM is used to reduce the synchronization overhead for these two algorithms. The proposed performance model guides the overall design of algorithms.
Algorithm 1
The algorithm mainly addresses the multiplication of an M-by-\(k_0\) blockA and a \(k_0\)-by-N blockB. First, blockB is divided into blocks of dimension \(k_0\times n_0\). The processor loads blockB into the L3 cache with normal memory access instructions. At the same time, TransDMA is configured to preload the first \(m_0\)-by-\(k_0\) blockA to the locked L3 cache. Then, blockB, which is stored in the locked L3 cache, is successively multiplied by many \(m_0\)-by-\(k_0\) blockAs in the locked L3 cache. The outer-loop is responsible for the data transfer of blockA and blockB from the memory to the locked L3 cache. The middle-loop distributes blockB in the locked L3 cache evenly among the four threads in the node. Each thread processes the corresponding \(k_{0}\)-by-\((n_0/4)\) part of blockB.
VectDMA is responsible for data transfer between the vector registers and the memory/L3 cache. Channels a and b preload block\(A_s\) and block\(B_s\), respectively. Channel c is responsible for preloading blockC from the memory to the vector registers, while channel d is in charge of writing blockC back to the memory. When the \(m_0\)-by-\(k_0\) blockA and the \(k_0\)-by-\(n_0\) blockB have been preloaded to the locked L3 cache, the kernel-loop starts to execute. Channel c preloads the \(m_0\)-by-\(n_{00}\) blockC from the memory to the vector registers. Channel a preloads the \(m_0\)-by-\(k_{00}\) block\(A_s\), and channel b preloads the \(k_{00}\)-by-\(n_{00}\) block\(B_s\). At the same time, TransDMA starts to preload the next \(m_0\)-by-\(k_0\) block\(A_{next}\) to the locked L3 cache. When the computing kernel begins, channel c simultaneously preloads the next block\(C_{next}\) to the vector registers.
When the kernel function of multiplication of block\(A_s\) and block\(B_s\) is called, channels a and b begin to preload the next block\(A_{snext}\) and block\(B_{snext}\), respectively. After the kernel ends, the computing kernel of the next block\(A_{snext}\) and block\(B_{snext}\) is called successively. When multiplication of blockA and blockB ends, channels d and c begin to write blockC back and preload the next block\(C_{next}\), respectively. At the same time, the multiplication of the next block\(A_{next}\) and block\(B_{next}\) begins to execute.
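The overlap of preloading and computing described above follows a double-buffering pattern, which the sketch below simulates for the row blocks of A only. It is a sequential illustration under stated assumptions, not the authors' Algorithm 1: `load_block` stands in for a TransDMA preload, the two buffer slots stand in for locked L3 cache space, and on real hardware the preload of the next block would run concurrently with the computation on the current one.

```python
def load_block(a, row0, m0):
    """Copy an m0-row block of A; stands in for a TransDMA preload."""
    return [row[:] for row in a[row0:row0 + m0]]


def multiply_block(c, a_blk, b, row0):
    """Accumulate a_blk * b into the rows of c that start at row0."""
    for i, a_row in enumerate(a_blk):
        for p, a_val in enumerate(a_row):
            for j, b_val in enumerate(b[p]):
                c[row0 + i][j] += a_val * b_val


def gemm_double_buffered(a, b, m0):
    """Blocked GEMM that issues the next preload before each compute step."""
    c = [[0.0] * len(b[0]) for _ in a]
    starts = list(range(0, len(a), m0))
    buffers = [load_block(a, starts[0], m0), None]  # initial preload of the first blockA
    trace = []
    for idx, row0 in enumerate(starts):
        cur = idx % 2
        if idx + 1 < len(starts):
            # Preload the next blockA into the other slot before computing.
            buffers[1 - cur] = load_block(a, starts[idx + 1], m0)
            trace.append(("preload", starts[idx + 1]))
        multiply_block(c, buffers[cur], b, row0)
        trace.append(("compute", row0))
    return c, trace


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
c, trace = gemm_double_buffered(a, b, m0=2)
print(c)
print(trace)
```

The trace shows that the request for each next block is issued before the computation on the current block starts, which is the ordering that lets the transfer time be hidden when the two run in parallel.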
The memory access latency of the Godson3B1500 is very long, and the cache miss rate greatly influences the performance of GEMM. The Godson3B1500 uses a random cache replacement strategy, under which the cache miss rate for GEMM is significantly higher than under other strategies. For ideal performance, data that are frequently reused should not be evicted from the cache. A cache locking mechanism is therefore used to keep some data in the cache. Experiments demonstrate that if more than half of the cache space is locked, the system may be paralyzed due to a system deadlock.
Algorithm 2
In Algorithm 1, most overheads of memory access are concealed by the computing time. However, the time of loading matrix B to the locked L3 cache cannot be concealed. The overhead of loading matrix B will influence overall performance. To solve this problem, an optimized algorithm is proposed, in which the time of preloading the matrix B can be concealed, as shown in Algorithm 2.
Experimental results
To validate the correctness and effectiveness of the proposed algorithms, several experiments were conducted. In this section, we present the experimental testbed and detail the experiments and results.
Experimental testbed
The kernel functions of the algorithms are mainly implemented in MIPS64 assembly language. The hardware of the testbed is the Godson3B1500 clocked at 800 MHz. The peak performance of one node is 51.2 GFlops. Experiments were run on the LoongsonServer Multilibs system. The software is the GNU Compiler Collection for Godson, and the compile options are “-march=mips64 -mabi=64 -O2”. The compiler supports the SIMD vector instructions of the Godson3B1500. According to (10), (11) and (12), the parameters that produce the best performance are shown in Table 1. VectReg denotes the number of vector registers used by the algorithms. There are 128 vector registers of 256 bits each. In all, 128 vector registers are used for zGEMM and 120 vector registers are used for dGEMM.
Table 1 Parameters for GEMM
Algorithm  \(m_0\)  \(k_0\)  \(n_0\)  \(k_{00}\)  \(n_{00}\)  ρ  sizeof(element)  VectReg
dGEMM\(_{algo1}\)  12  512  992  4  12  2  8 B  120
zGEMM\(_{algo1}\)  8  512  480  4  8  2  16 B  128
dGEMM\(_{algo2}\)  12  512  480  4  12  2  8 B  120
zGEMM\(_{algo2}\)  8  512  240  4  8  2  16 B  128
Results analysis
The results are analyzed from two aspects, namely, numerical accuracy and performance. Numerical accuracy analysis is used to verify the correctness and accuracy of the algorithms, while performance analysis is used to calculate the improvements in efficiency of the proposed algorithms.
Numerical accuracy
Performance
Algorithms 1 and 2 were tested on the Godson3B1500. The algorithm proposed in Zhu (2011) was implemented on a simulation platform (single core) and did not work on real chips. Therefore, Zhu (2011) is cited only as a representative study on the DAE architecture and was not tested. For comparison, two standard versions of GEMM, ATLAS and OpenBLAS, were tested. Moreover, Algorithm 2 without SMSM was tested for zGEMM. These tests were performed on one node (four cores) of the Godson3B1500, where the theoretical peak computing capacity is 51.2 GFlops (\(Perf_{peak}\)/2).
In Fig. 10, Algorithm 2 performs better than the other algorithms for dGEMM when the size is larger than 2000. When the size is less than 4000, each core has very few tasks, and the power of the hardware cannot be fully exploited. With increasing size and ratio of computation, the optimized algorithms show promising performance. The size of GEMM indicates the total memory needed in the GEMM, which includes the sizes of matrices A, B and C. Due to cache misses, there are some oscillations in ATLAS and OpenBLAS with increasing size. In the proposed algorithms, however, the APs access the data via the locked L3 cache rather than the L2 and L1 caches. Therefore, the proposed algorithms perform stably, and their performance does not change when the size of dGEMM exceeds the cache size.
ATGEMM is an automatically optimized algorithm for dGEMM on the Godson3B1000 that uses the L2 cache as the intermediate storage space. It only optimized dGEMM for the Godson3B1000, and its optimization methods do not fit the Godson3B1500. Algorithms 1 and 2 instead use the L3 cache as the intermediate storage space. The L3 cache in the Godson3B1500 is twice the size of the L2 cache in the Godson3B1000. Moreover, ATGEMM optimized the high levels of the blocked GEMM, and its kernel based on the DAE processor was divided into 4 levels. Several levels of ATGEMM are capable of self-adjusting, and their parameters are generated by the automatic optimization algorithm. Manual adjustment of the parameters is still needed on the Godson3B1500. These parameters include the main matrix block sizes for the outer loop, the kernel block sizes and other parameters that influence the performance. After manual adjustment based on experience, the performance improves a little, but ATGEMM still performs poorly compared with our approaches. In Fig. 10, only the performance of the original ATGEMM is shown.
Moreover, the parameters of Algorithms 1 and 2 have been adjusted for the Godson3B1500, and these algorithms perform better than ATGEMM on the Godson3B1500. Compared with Algorithm 1, the time of loading matrix B in Algorithm 2 is concealed by using the mechanism for issuing multiple instructions. Algorithm 2 performs approximately 2.5% better when the size is larger than 5200. Since the AP time cannot be completely concealed by the EP time, Algorithm 2 cannot reach the theoretical peak performance. Its best performance is 47.07 GFlops, reaching 91.9% of the theoretical peak.
As shown in Fig. 11, Algorithm 2 performs better than ATLAS and OpenBLAS for zGEMM when the size is larger than 1400. Due to cache misses, there are some oscillations in ATLAS and OpenBLAS with increasing size. In the proposed algorithms, however, the APs access the data via the locked L3 cache rather than the L2 and L1 caches. Therefore, the proposed algorithms perform stably, and their performance does not change when the size of zGEMM exceeds the cache size. Compared with Algorithm 1, the data preloading of matrix B in Algorithm 2 is concealed by using non-blocking cache access instructions and the mechanism for issuing multiple instructions. Algorithm 2 performs approximately 3.2% better when the size is larger than 6000. In addition, the data shuffle is required for zGEMM and cannot be optimized away on the Godson3B1500. The overhead of the data shuffle occupies 6% (\(1/(2m_0+1)\)) of the total runtime. Therefore, Algorithm 2 cannot reach the theoretical peak performance. Its best performance is 47.64 GFlops, reaching 93% of the theoretical computing peak for zGEMM.
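The reported efficiencies and the shuffle share can be reproduced from the figures in the text with the following arithmetic sketch. The helper names are illustrative; the value \(m_0=8\) is the zGEMM setting from Table 1.

```python
NODE_PEAK_GFLOPS = 51.2  # one four-core node at 800 MHz


def fraction_of_peak(measured_gflops, peak_gflops=NODE_PEAK_GFLOPS):
    """Measured performance as a fraction of the theoretical peak."""
    return measured_gflops / peak_gflops


def shuffle_share(m0):
    """Share of instructions that are shuffles for zGEMM: 1 / (2 * m0 + 1)."""
    return 1.0 / (2 * m0 + 1)


print(round(100 * fraction_of_peak(47.07), 1))  # 91.9 (dGEMM)
print(round(100 * fraction_of_peak(47.64), 1))  # 93.0 (zGEMM)
print(round(100 * shuffle_share(8), 1))         # 5.9, reported as about 6%
```

If the shuffle overhead were the only loss, which is an assumption made here purely for illustration, the zGEMM efficiency would be bounded by about 94.1% of peak, slightly above the measured 93%.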
Conclusion
By virtue of the significance of BLAS, the performance optimization of BLAS has attracted attention from scholars and experts. In this paper, a GEMM performance model for DAE is proposed to analyze the impacts of parameters. Additionally, two optimized algorithms of GEMM are proposed in the Godson3B1500 based on the performance model. Experiments demonstrate that these two algorithms perform better than other versions of GEMM. The optimized algorithm reaches 93% of the theoretical peak performance for zGEMM and reaches 91.9% of the theoretical peak performance for dGEMM.
However, the performance of GEMM cannot reach the peak performance of the Godson3B1500. The memory wall is still the bottleneck for HPC applications. It is necessary to investigate how to enhance the performance of memory access in future work. Furthermore, a generic model based on a DAE architecture for BLAS will be designed.
Declarations
Authors' contributions
MZ and NG designed the study and carried out the analysis. MZ and KR implemented the experiments and contributed to writing and revising the paper. All authors read and approved the final manuscript.
Acknowledgements
Grateful acknowledgement is made to Mr. Yunkai Du who offered us the technologies to implement the experiments and helped us to revise the paper. Moreover, we thank American Journal Experts (AJE) for English language editing. Funding was provided by Natural Science Foundation of Anhui Province (Grant No. 1408085MKL06), and Project funding for academic innovation in Colleges and Universities (Grant No. B07033).
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 Allen G et al (2009) A note on autotuning GEMM for GPUs. In: 9th international conference, Baton Rouge, LA, USA, pp 884–892
 Gao X, Chen YJ, Wang HD, Tang D, Hu WW (2010) System architecture of Godson3 multicore processors. J Comput Sci Technol 25(2):181–191
 Goto K, Van De Geijn R (2008a) High-performance implementation of the level-3 BLAS. ACM Trans Math Softw (TOMS) 35(1):1–14
 Goto K, Van De Geijn R (2008b) Anatomy of high-performance matrix multiplication. ACM Trans Math Softw (TOMS) 34(3):1–25
 Gu NJ, Li K, Chen GL, Wu C (2008) Optimization of BLAS based on Loongson 2F architecture. Univ Sci Technol China 38:854–859
 He SS, Gu NJ, Zhu HT (2012) Optimization of BLAS for Loongson3A architecture. J Chin Comput Syst 33(3):571–575
 Hu WW, Wang R, Chen YJ et al (2011) Godson3B: a 1 GHz 40 W 8-core 128 GFlops processor in 65 nm CMOS. In: 2011 IEEE international solid-state circuits conference digest of technical papers (ISSCC), San Francisco, USA, pp 76–78
 Hu WW, Gao YP, Chen TS, Xiao JH (2011) The Godson processors: its research, development, and contributions. J Comput Sci Technol 26(3):363–372
 Netlib (2016a) Homepage of BLAS. http://www.netlib.org/blas/. Accessed 15 Mar 2016
 Netlib (2016b) Homepage of Lapack. http://www.netlib.org/lapack/. Accessed 15 Mar 2016
 Netlib (2016c) Homepage of Linpack. http://www.netlib.org/linpack/. Accessed 15 Mar 2016
 Smith JE (1982) Decoupled access/execute computer architectures. In: The 9th annual symposium on computer architecture (ISCA ’82), CA, USA, pp 112–119
 Smith JE (1984) Decoupled access/execute computer architectures. ACM Trans Comput Syst 2(4):289–308
 Wang Q, Zhang XY, Zhang YQ, Yi Q (2013) AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs. In: The international conference on high performance computing, networking, storage and analysis (SC ’13), New York, USA, pp 1–12
 Zhang WL et al (2004) Analysis and optimization discussion on parallel Linpack. In: Institute of computing technology chinese academy of sciences eighth graduate symposium on computer science and technology, DaLian, China, 2004
 Zhang XY, Wang Q, Zhang YQ (2012) Model-driven level 3 BLAS performance optimization on Loongson 3A processor. In: The 2012 IEEE 18th international conference on parallel and distributed systems, Singapore, pp 684–691
 Zhao Z, Gu NJ, Yang YZ, Ren KX (2014) Optimizing BLAS2 for a decoupled access/execute architecture processor. J Comput Inf Syst 10(3):1231–1241
 Zhao Z, Gu NJ, Yang YZ (2015) GEMM optimization for a decoupled access/execute architecture processor. Int J Hybrid Inf Technol 8(7):375–388
 Zhao Z, Gu NJ, Yang YZ (2013) Autotuning GEMM kernels for a decoupled access/execute architecture processor. In: 2013 first international symposium on computing and networking (CANDAR), Japan, pp 233–239
 Zhu HT (2011) High-density multicore processor architecture research. Dissertation, University of Science and Technology of China