TY - STD
TI - Allen G et al (2009) A note on auto-tuning GEMM for GPUs. In: 9th international conference, Baton Rouge, LA, USA, pp 884–892
ID - ref1
ER -

TY - JOUR
AU - Gao, X.
AU - Chen, Y. J.
AU - Wang, H. D.
AU - Tang, D.
AU - Hu, W. W.
PY - 2010
DA - 2010//
TI - System architecture of Godson-3 multi-core processors
JO - J Comput Sci Technol
VL - 25
UR - https://doi.org/10.1007/s11390-010-9315-3
DO - 10.1007/s11390-010-9315-3
ID - Gao2010
ER -

TY - JOUR
AU - Goto, K.
AU - Geijn, R.
PY - 2008
DA - 2008//
TI - High-performance implementation of the level-3 BLAS
JO - ACM Trans Math Softw (TOMS)
VL - 35
UR - https://doi.org/10.1145/1377603.1377607
DO - 10.1145/1377603.1377607
ID - Goto2008
ER -

TY - JOUR
AU - Goto, K.
AU - Geijn, R.
PY - 2008
DA - 2008//
TI - Anatomy of high-performance matrix multiplication
JO - ACM Trans Math Softw (TOMS)
VL - 34
UR - https://doi.org/10.1145/1356052.1356053
DO - 10.1145/1356052.1356053
ID - Goto2008
ER -

TY - JOUR
AU - Gu, N. J.
AU - Li, K.
AU - Chen, G. L.
AU - Wu, C.
PY - 2008
DA - 2008//
TI - Optimization of BLAS based on Loongson 2F architecture
JO - Univ Sci Technol China
VL - 38
ID - Gu2008
ER -

TY - JOUR
AU - He, S. S.
AU - Gu, N. J.
AU - Zhu, H. T.
PY - 2012
DA - 2012//
TI - Optimization of BLAS for Loongson-3A architecture
JO - J Chin Comput Syst
VL - 33
ID - He2012
ER -

TY - STD
TI - Hu WW, Wang R, Chen YJ et al (2011) Godson-3B: a 1 GHz 40 W 8-core 128 GFlops processor in 65 nm CMOS. In: 2011 IEEE international solid-state circuits conference digest of technical papers (ISSCC). San Francisco, USA, pp 76–78
ID - ref7
ER -

TY - JOUR
AU - Hu, W. W.
AU - Gao, Y. P.
AU - Chen, T. S.
AU - Xiao, J. H.
PY - 2011
DA - 2011//
TI - The Godson processors: its research, development, and contributions
JO - J Comput Sci Technol
VL - 26
UR - https://doi.org/10.1007/s11390-011-1139-2
DO - 10.1007/s11390-011-1139-2
ID - Hu2011
ER -

TY - STD
TI - Netlib (2016a) Homepage of BLAS. http://www.netlib.org/blas/. Accessed 15 Mar 2016
UR - http://www.netlib.org/blas/
ID - ref9
ER -

TY - STD
TI - Netlib (2016b) Homepage of Lapack. http://www.netlib.org/lapack/. Accessed 15 Mar 2016
UR - http://www.netlib.org/lapack/
ID - ref10
ER -

TY - STD
TI - Netlib (2016c) Homepage of Linpack. http://www.netlib.org/linpack/. Accessed 15 Mar 2016
UR - http://www.netlib.org/linpack/
ID - ref11
ER -

TY - STD
TI - Smith JE (1982) Decoupled access/execute computer architectures. In: The 9th annual symposium on computer architecture (ISCA ’82), CA, USA, pp 112–119
ID - ref12
ER -

TY - JOUR
AU - Smith, J. E.
PY - 1984
DA - 1984//
TI - Decoupled access/execute computer architectures
JO - ACM Trans Comput Syst
VL - 2
UR - https://doi.org/10.1145/357401.357403
DO - 10.1145/357401.357403
ID - Smith1984
ER -

TY - STD
TI - Wang Q, Zhang XY, Zhang YQ, Yi Q (2013) AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs. In: The international conference on high performance computing, networking, storage and analysis (SC ’13), New York, USA, pp 1–12
ID - ref14
ER -

TY - STD
TI - Zhang WL et al (2004) Analysis and optimization discussion on parallel Linpack. In: Institute of Computing Technology, Chinese Academy of Sciences eighth graduate symposium on computer science and technology, Dalian, China, 2004
ID - ref15
ER -

TY - STD
TI - Zhang XY, Wang Q, Zhang YQ (2012) Model-driven level 3 BLAS performance optimization on Loongson 3A processor. In: The 2012 IEEE 18th international conference on parallel and distributed systems, Singapore, pp 684–691
ID - ref16
ER -

TY - JOUR
AU - Zhao, Z.
AU - Gu, N. J.
AU - Yang, Y. Z.
AU - Ren, K. X.
PY - 2014
DA - 2014//
TI - Optimizing BLAS2 for a decoupled access/execute architecture processor
JO - J Comput Inf Syst
VL - 10
ID - Zhao2014
ER -

TY - JOUR
AU - Zhao, Z.
AU - Gu, N. J.
AU - Yang, Y. Z.
PY - 2015
DA - 2015//
TI - GEMM optimization for a decoupled access/execute architecture processor
JO - Int J Hybrid Inf Technol
VL - 8
UR - https://doi.org/10.14257/ijhit.2015.8.7.34
DO - 10.14257/ijhit.2015.8.7.34
ID - Zhao2015
ER -

TY - STD
TI - Zhao Z, Gu NJ, Yang YZ (2013) Auto-tuning GEMM kernels for a decoupled access/execute architecture processor. In: 2013 first international symposium on computing and networking (CANDAR), Japan, pp 233–239
ID - ref19
ER -

TY - STD
TI - Zhu HT (2011) High-density multi-core processor architecture research. Dissertation, University of Science and Technology of China
ID - ref20
ER -