Mobile clusters of single board computers: an option for providing resources to student projects and researchers

Clusters usually consist of servers, workstations or personal computers as nodes. But especially for academic purposes like student projects or scientific projects, the cost of purchase and operation can be a challenge. Single board computers cannot compete with the performance or energy-efficiency of higher-value systems, but they are an option for building inexpensive cluster systems. Because of their compact design and modest energy consumption, it is possible to build clusters of single board computers in a way that makes them mobile and easy for the users to transport. This paper describes the construction of such a cluster, useful applications and the performance of the single nodes. Furthermore, the cluster's performance and energy-efficiency are analyzed by executing the High Performance Linpack benchmark with different numbers of nodes and different proportions of the system's total main memory utilized.

Section "Cluster of Raspberry Pi nodes" presents a list of components of a mobile cluster of single board computers and a calculation of the energy costs.
Useful application scenarios for clusters of single board computers are analyzed in section "Useful applications".
Section "Performance of the single board computers and the network infrastructure" contains an analysis of the performance of the CPU, storage and network interface of the single nodes.
In section "Analyzing the clusters' performance with the HPL", the performance and speedup of the entire cluster of single board computers is analyzed by using the High Performance Linpack (HPL).
Section "Analysis of the energy-efficiency" contains an analysis of the energy-efficiency of the cluster.
Finally, section "Conclusions and future work" presents conclusions and directions for future work.

Options for resource provisioning
Clusters usually consist of servers, workstations or personal computers as nodes. Since the mid-1990s, especially at universities and research institutions, cluster systems have been assembled from commodity hardware computers and Ethernet local area networks. Sterling et al. (1995) built such a cluster with the Linux operating system in 1994 and called it a Beowulf cluster (Gropp et al. 2002); it became a blueprint for numerous scientific projects afterwards.
The purchase cost of physical server resources can be challenging for student projects or scientific projects. Decommissioned hardware can be acquired for little money, but it requires much space and its maintenance is labor intensive.
Costs that arise from running physical computer resources, such as electricity costs, must also be taken into account.
If it is not mandatory to operate hardware resources in-house, outsourcing them or using services instead is an option. Dedicated server offerings and public cloud infrastructure service offerings can be used to provide compute resources to students or researchers.

Obstacles against public cloud and dedicated server offerings
In some countries, it is not common for students to have a credit card. This can be an obstacle to using public cloud infrastructure services for student projects.
In contrast to cloud infrastructure service offerings, which can be used on an on-demand basis according to the pay-as-you-go principle, dedicated server offerings usually have a minimum rental period of at least a month. Depending on the number of systems required to realize a specific distributed system, using dedicated server offerings may be an expensive choice.
A drawback of both dedicated servers and public cloud infrastructure services is the lack of a physical representation, which, e.g. for students, complicates understanding the functioning of distributed systems.
Building clusters of single board computers is another option for providing compute and storage resources to students and researchers for running and implementing distributed systems. Table 1 contains a selection of single board computers, which provide sufficient computing power and main memory capacities to operate a Linux operating system and the required server daemons to implement clusters or private cloud services.

Related work
Clusters of single board computers have already been implemented. In the project Iridis-pi at the University of Southampton, Cox et al. (2009) assembled a cluster of 64 Raspberry Pi nodes with 256 MB main memory per node. The aims of this project were, among others, to implement an affordable cluster for running MPI 1 applications and to evaluate the computational performance, network performance and storage performance of a cluster of single board computers.
Similar work was done by Kiepert (2013), who assembled a Beowulf cluster of 32 Raspberry Pi nodes with 512 MB main memory per node at Boise State University. He created a solid structure out of plexiglass, in which the cluster and its network and electric power infrastructure are housed. To power the single board computers, he used two standard PC power supplies and attached them via one of the 5 V pins of the I/O header that each Raspberry Pi node provides. For the cooling of the cluster, and mainly of the power supplies, the cluster is equipped with 120 mm fans. The performance and speedup of this cluster were measured with an MPI program that calculates π using the Monte Carlo method. Because this program can be parallelized very well, the speedup of the cluster is close to the theoretical maximum. Abrahamsson et al. (2013) presented their work of building an affordable and energy-efficient cluster of 300 Raspberry Pi nodes, as well as several challenges and a number of opportunities. Tso et al. (2013) presented a scale model of a data center, composed of clusters of 56 Raspberry Pi Model B nodes, that emulates the layers of a cloud stack (from resource virtualisation to network management) by using Linux containers and the supporting LXC suite. The work compares the acquisition cost, electricity costs and cooling requirements of the cluster of single board computers with a testbed of 56 commodity hardware servers.
1 Message Passing Interface (MPI) is the de facto standard for distributed scientific computing. The MPI standard defines the syntax and semantics of the MPI functions. Several implementations of the standard exist. One example is MPICH.

Jamie Whitehorn presented 2 a Hadoop cluster of five Raspberry Pi Model B nodes in 2013. For productive usage, the performance of the Hadoop cluster is considered too slow. Especially the small memory capacity is problematic for using Hadoop, but for educational purposes regarding Hadoop itself, the system is well suited.
Also in 2013, the developers of the Cubieboard single board computer presented 3 a Hadoop cluster of eight nodes. In contrast to the Raspberry Pi Model A and B, the Cubieboard computers used are equipped with a faster CPU (1 GHz clock rate) and a bigger main memory of 1 GB. Because of these resource characteristics, the Cubieboards were better suited for deploying Hadoop in 2013. Kaewkas and Srisuruk (2014) built a cluster of 22 Cubieboards at the Suranaree University of Technology, running Hadoop and Apache Spark, which are both open source frameworks for big data processing. The focus of their work is the I/O performance and the power consumption of the cluster.
Benefits of the single board computer cluster used for this project are its weight of only 7.8 kg, its reduced energy consumption (see section "Cluster of Raspberry Pi nodes") and the fact that the entire cluster occupies little space and can easily be transported by the users, because all components are stored inside an aluminum hard case. This is an important feature because it allows the cluster to be used for practical exercises close to the students, and the cluster can easily be borrowed.
A further positive characteristic is that the cluster does not contain moving parts, such as fans or hard disk drives. The lack of moving parts makes the cluster less error prone.

Cluster of Raspberry Pi nodes
In order to examine the performance and power consumption, a cluster (see Figs. 1, 2) of the following components was constructed:

• 8x Raspberry Pi Model B
• 8x SD flash memory card (16 GB each)
• 10/100 network switch with 16 ports
• 8x network cable CAT 5e U/UTP
• 2x USB power supply 40 W (5 V, 8 A)
• 8x USB 2.0 cable USB-A/Micro-USB
• Aluminum hard case 450x330x150 mm
• Power strip with at least 3 ports
• Various wooden boards, screws, cable ties, angles, wing nuts, etc.

The purchase cost for all components was approximately 500 €. The cluster is stored in an aluminum hard case to facilitate transporting and storing the system. A 100 Mbit Ethernet switch is sufficient for Raspberry Pi nodes with their 100 Mbit Ethernet interface.
Initially, the power supply was realized via individual USB power supplies (Samsung ETA-U90EWE, 5.0 V, 2.0 A) for each node, which were later replaced by two 5-port USB power supplies (Anker Model 71AN7105, 40 W, 5.0 V, 8.0 A). Table 2 shows the power consumption of the cluster in idle operation mode and in stress mode 4. Using just two power supplies causes less energy loss, which results in a reduced energy consumption.

The energy costs per year (C_Y) for 24/7 usage at a specific power consumption E (in kW) during operation can be calculated with Eq. (1), in which energy costs of 0.25 € per kWh are assumed:

C_Y = E × 24 h × 365 × 0.25 €/kWh    (1)

4 The nodes were put into stress mode by using the command-line tool stress. Further information about this tool is provided on the web page http://people.seas.harvard.edu/~apw/stress/.
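The yearly cost calculation can be sketched in Python. The 30 W (0.030 kW) figure below is only a placeholder for a measured value from Table 2, not a number taken from the measurements:

```python
def energy_cost_per_year(power_kw, price_per_kwh=0.25):
    """Eq. (1): energy costs per year (C_Y) for 24/7 operation
    at a constant power consumption E given in kW."""
    hours_per_year = 24 * 365
    return power_kw * hours_per_year * price_per_kwh

# Hypothetical example: a cluster drawing 30 W around the clock
# costs about 65.70 EUR per year at 0.25 EUR/kWh.
print(round(energy_cost_per_year(0.030), 2))
```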

Useful applications
A cluster of single board computers has very limited resources and cannot compete with the performance of higher-value systems. But despite these drawbacks, useful application scenarios exist where clusters of single board computers are a promising option. This applies in particular to small and medium-sized enterprises as well as to academic purposes like student projects or research projects with limited financial resources.

Private cloud infrastructure services
Different sorts of cloud services exist that belong to the Infrastructure as a Service (IaaS) delivery model. One group of services allows the operation of virtual server instances and the management of network resources. Popular public cloud IaaS offerings are, among others, Amazon EC2, Google Compute Engine, GoGrid and Rackspace Cloud. Examples of private cloud IaaS solutions are OpenStack Nova, Eucalyptus (Nurmi et al. 2008), Nimbus (Keahey et al. 2009) and Apache CloudStack. These services have in common that they require a hypervisor like KVM (Kivity et al. 2007) or Xen (Barham et al. 2003). All evaluated single board computers (see Table 1) implement the ARM architecture, and despite the fact that numerous efforts like Dall and Nieh (2014) and Hwang et al. (2008) have been made to port KVM and Xen to this architecture, the computing power and main memory resources of the tested single board computers are not sufficient for server virtualization on a useful scale.

Further services that belong to the IaaS family are object-based storage services like the public cloud offerings Simple Storage Service and Google Cloud Storage. Examples of private cloud solutions are Eucalyptus Walrus (Nurmi et al. 2009), Nimbus Cumulus (Bresnahan et al. 2011), OpenStack Swift and Riak S2. These service solutions can be executed on single board computers. In a cluster of single board computers, each request to an object-based storage service creates only a little workload on a node. Eucalyptus Walrus, OpenStack Swift and Riak S2 even implement replication over multiple nodes.

Private cloud platform services
A Platform as a Service (PaaS) implements a scalable application runtime environment for programming languages. The target audience are software developers and end users who want to provide and consume services in a corresponding marketplace. A PaaS allows scaling from a single service instance to many, depending on the actual demand (Armbrust et al. 2009). Prominent instances of public cloud PaaS offerings are Google App Engine, Microsoft Azure Platform and Engine Yard. In some cases, it might be desirable to avoid public cloud offerings, for example for privacy or legal reasons. Fortunately, private cloud solutions exist. Examples are AppScale (Chohan et al. 2009; Bunch et al. 2010), Apache Stratos and OpenShift. Running these services is potentially possible in a cluster of single board computers as long as no virtualization layer (hypervisor) is required.

Distributed file systems
Two different types of distributed file systems exist:

1. Shared storage file systems, which are also called shared disk file systems
2. Distributed file systems with distributed memory

Clusters of single board computers are an inexpensive option for testing and developing distributed file systems with distributed memory. Examples of such file systems are Ceph (Weil et al. 2006), GlusterFS 5, HDFS (Borthakur 2008), PVFS2/OrangeFS (Carns et al. 2000) and XtreemFS (Hupfeld et al. 2008).
In order to use shared storage file systems like OCFS2 (Fasheh 2006) and GFS2, all nodes must have direct access to the storage via a storage area network (SAN), e.g. implemented via Fibre Channel or iSCSI. Connecting the nodes of a cluster of single board computers with a SAN is an option with two major drawbacks. First, the purchase cost of a SAN infrastructure would in most cases be higher than the sum of all other cluster components (including the nodes themselves). Second, the Fibre Channel interface cards, often called host bus adapters (HBA), are usually connected via PCI Express or Thunderbolt. Neither of these interfaces is provided by any of the evaluated single board computers. Using iSCSI via Ethernet is not a recommendable option for single board computers (see Table 1), which provide just a single Ethernet interface with a maximum data rate of 100 Mbit/s.

Distributed database systems
Numerous free database systems support cluster operation to provide a higher level of availability and a better performance for query and data modification operations compared with single node operation. Examples of distributed database systems that have been successfully tested on clusters of single board computers are the document-oriented database Apache CouchDB 6, the column-oriented database Apache Cassandra 7, the key/value database Riak 8, as well as the relational database management system MySQL 9.
Further relational database management systems and NoSQL database systems support cluster operation mode and should be deployable in a cluster of single board computers in principle.

High-throughput-clustering
Single board computers provide sufficient resources for running network services like HTTP servers, mail servers or FTP servers. Each request to such a service creates only a little load on a node.
To realize e.g. a High-Throughput-Cluster of HTTP servers, only server software and load balancer functionality are required. As HTTP server software, the Apache HTTP Server or a resource-saving alternative like Nginx or Lighttpd can be deployed. The Apache server software provides the load balancer module mod_proxy_balancer 10 and the Nginx server software implements load balancing functionality too. Another option is a load balancing solution like Ultra Monkey, which can be operated in a redundant way by running a stand-by instance.
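The distribution strategy behind such a setup can be illustrated with a minimal round-robin sketch in Python. The node addresses are purely hypothetical, and the real modules (mod_proxy_balancer, Nginx upstreams) implement this logic internally:

```python
from itertools import cycle

# Hypothetical addresses of the eight HTTP server nodes of the cluster.
BACKENDS = ["10.0.0.%d" % i for i in range(1, 9)]

_next_backend = cycle(BACKENDS)

def pick_backend():
    """Return the next node in round-robin order, the default
    strategy of most HTTP load balancers."""
    return next(_next_backend)

# Each incoming request is handed to the next node in turn.
first_round = [pick_backend() for _ in range(8)]
```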
Detailed monitoring of the state and load of the single nodes can be implemented with monitoring tools like Nagios and Ganglia.

High-performance-clustering and parallel data processing
Clusters of nodes with physical hardware are the ideal environment to test and develop parallel applications for High-Performance-Clusters, because no virtualization layer and additional guest operating systems influence the performance of the nodes.
For the distributed storage and parallel processing of data, the Apache Hadoop framework, which implements the MapReduce (Dean and Ghemawat 2004) programming model, can be used.
Solutions for implementing parallel applications are, among others, MPI, PVM 11, OpenSHMEM and Unified Parallel C (Dinan et al. 2010), all of which can be used in clusters of single board computers.
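A sketch of how such a parallel program is typically decomposed: each rank computes a partial result independently and the results are then reduced, which corresponds to MPI_Reduce in an MPI program. The Monte Carlo π calculation mentioned in the related work section is the classic example. The sketch below runs the ranks serially in plain Python purely for illustration; the rank count and sample size are arbitrary:

```python
import random

def partial_count(rank, samples):
    """Work of one (hypothetical) rank: count random points of the
    unit square that fall inside the quarter circle."""
    rng = random.Random(rank)  # per-rank seed for reproducibility
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

def estimate_pi(ranks=8, samples_per_rank=50_000):
    """Reduce step: sum the partial counts of all ranks."""
    total = sum(partial_count(r, samples_per_rank) for r in range(ranks))
    return 4.0 * total / (ranks * samples_per_rank)
```

In an actual MPI version, each node would execute `partial_count` for its own rank and the summation would be a single collective reduce operation.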

Performance of the single board computers and the network infrastructure
The performance of the CPU, the local storage and the cluster's network infrastructure was measured in order to get an impression of the performance capability of a single node of the cluster.

CPU performance
The benchmark suite SysBench was used to measure the CPU performance. Table 3 shows the total execution time of the benchmark while testing whether each number up to 10,000 is a prime number.
For comparison, not only the CPU performance of the Raspberry Pi Model B used in the cluster was measured, but also that of the BananaPi and the ODROID-U3. Furthermore, the benchmark was executed on a Lenovo ThinkPad X240 notebook with an Intel i7-4600U CPU (two cores, four threads).
The benchmark scales well over multiple cores. The measurement results in Table 3 show that doubling the number of utilized cores nearly halves the execution time.
Increasing the clock rate of the Raspberry Pi from 700 to 800 MHz does not require overvolting the CPU and results in a noticeable increase of the processing power. For this reason, the Raspberry Pi nodes of the cluster are overclocked to 800 MHz.
The measurement results in Table 3 also show that for the BananaPi and the ODROID-U3, using more threads than CPU cores available does not result in a significant performance gain. For the Raspberry Pi, the execution time even increases because of the additional overhead that results from the increased number of context switches.
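The kind of work SysBench's CPU test performs can be sketched with a simple trial-division primality check. The limit of 10,000 matches the benchmark configuration described above, while the function names are of course not SysBench's own:

```python
def is_prime(n):
    """Trial division up to sqrt(n), similar in spirit to the
    arithmetic work of the SysBench CPU test."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def count_primes(limit):
    """Test every number up to `limit` for primality, as the
    benchmark run in Table 3 does."""
    return sum(1 for n in range(2, limit + 1) if is_prime(n))
```

This workload is trivially divisible into independent ranges, which is why it parallelizes almost linearly over cores.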

Storage performance
Only a limited number of options exist to attach local storage to the cluster nodes. The Raspberry Pi provides (depending on the model) two or four USB interfaces, connected via an internal USB 2.0 hub and an interface for secure digital cards (SD cards). By using an appropriate passive adapter, microSD flash cards can be used as well. Each Raspberry Pi node in the cluster is equipped with a 16 GB flash storage card, which stores the operating system and provides storage capacity for applications and data.

Sequential read/write performance
The sequential read/write performance of the local storage of a single cluster node was measured with the tool dd, while reading and writing a single file of 300 MB in size. Several (micro)SD storage cards of different manufacturers and speed classes 12 were tested. To avoid interference caused by the page cache, it was dropped before each read performance test, and the flag oflag=sync was used for write performance tests with the dd command. The file system used was ext4 with 4 kB block size and journaling mode ordered, which implies that all file data is written directly into the file system before its metadata is committed to the journal. The values in Table 4 are averages of five measurement rounds.

When used with the Raspberry Pi, most tested class 6 and class 10 flash storage drives provide a significantly better data rate for sequential write compared with the class 4 drives. The sequential read performance of all tested drives is limited by the maximum data rate of the storage card interface the Raspberry Pi Model B is equipped with. The SD card interface of the Raspberry Pi implements a 4-bit data bus and a 50 MHz clock rate. Therefore, the theoretical maximum data rate is 25 MB/s, which cannot be reached in practice. For comparison, the data rate of the flash storage drives for sequential read and write was also measured with the internal Realtek RTS5227 PCI Express card reader of a Lenovo ThinkPad X240 notebook. The results in Table 4 show that the maximum data rate of the tested class 10 drives for sequential read is more than double the data rate the drives provide in the Raspberry Pi.
Measuring the sequential read and write performance is a procedure that is quite common for benchmark purposes, but its significance for practical applications is limited, because reading and writing large amounts of data in a row is carried out quite seldom on many systems. Use cases are e.g. streaming media and the upload and download of objects that are at least several MB in size. More common in root file systems and during the operation of e.g. HTTP servers and database systems are random read and write operations.
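A rough Python analogue of the dd-based measurement (sequential writes followed by a sync) could look like the sketch below; the sizes are scaled down from the 300 MB used in the actual tests:

```python
import os
import tempfile
import time

def sequential_write_mb_per_s(size_mb=16, block_kb=1024):
    """Write size_mb MB in sequential blocks and fsync at the end,
    roughly analogous to `dd ... oflag=sync` on a smaller scale."""
    block = b"\0" * (block_kb * 1024)
    blocks = size_mb * 1024 // block_kb
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # force the data to the device
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return size_mb / elapsed
```

Note that, unlike the measurements above, this writes to whatever device holds the temporary directory, so absolute numbers are not comparable to the SD card results.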

Random read/write performance
For measuring the random read and write performance, the benchmark tool iozone v3.430 was used. The results show that all tested SD cards provide a random read performance of 3-5 MB/s even for record size 4 kB (see Table 5). Increasing the record size increases the data rate until the performance of sequential read is reached. The performance for random write (see Table 6) is significantly lower compared with sequential write.

The poor random write performance of SD cards is caused by the internal architecture of this type of flash storage. Memory cells of NAND flash storage are grouped into pages and so-called erase blocks. Typical page sizes are 4, 8 or 16 kB. Although it is possible for the controller to write single pages, the data cannot be overwritten without being erased first, and an erase block is the smallest unit that a NAND flash storage can erase. The size of the erase blocks of SD cards is typically 64 or 128 kB. In modern SD cards, small numbers of erase blocks are combined into larger units of equal size, which are called allocation groups, allocation units or segments. The usual segment size is 4 MB.

The controllers of SD cards implement a translation layer. For any I/O operation, a translation from virtual to physical address is carried out by the controller. If data inside a segment shall be overwritten, the translation layer remaps the virtual address of the segment to another, already erased physical address. The old physical segment is marked dirty and queued for an erase. Later, when it is erased, it can be reused. The controllers of SD cards usually cache one or more segments to increase the performance of random write operations. If an SD card stores a root file system, it is beneficial if the controller of the card can cache the segment(s) where the write operations take place, the segments which store the metadata of the file system and (if available) the journal of the file system.

Consequently, the random write performance of an SD card depends, among other factors, on the erase block size, the segment size and the number of segments the controller caches. A significant performance advantage of the tested class 10 flash storage cards compared with the tested class 4 or class 6 flash storage cards is not visible.
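The iozone-style access pattern (small records written at random offsets inside a preallocated file) can be sketched analogously to the sequential case; the parameters here are illustrative and much smaller than in the measurements above:

```python
import os
import random
import tempfile
import time

def random_write_mb_per_s(file_mb=8, record_kb=4, writes=256):
    """Write `writes` records of record_kb kB at random offsets,
    the access pattern that stresses an SD card's translation layer."""
    record = b"\0" * (record_kb * 1024)
    file_bytes = file_mb * 1024 * 1024
    rng = random.Random(0)  # fixed seed for a repeatable offset pattern
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            f.truncate(file_bytes)  # preallocate the target file
            start = time.perf_counter()
            for _ in range(writes):
                f.seek(rng.randrange(0, file_bytes - len(record)))
                f.write(record)
            f.flush()
            os.fsync(f.fileno())
            elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return (writes * record_kb / 1024) / elapsed
```

On an SD card, each of these scattered 4 kB writes may force the controller to remap and later erase a whole segment, which is why the measured rates in Table 6 fall so far below the sequential figures.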

Further options for local storage
The available USB interfaces provide further options to implement a local storage. It is possible to attach a single or multiple hard disk drives (HDD) or flash storage drives via USB. Patterson et al. (1988) described that if multiple drives are attached, the performance and availability can be improved by combining them to redundant arrays of independent disks (RAID).
No matter what storage technology is used, the USB 2.0 interface limits the possible throughput. Solid state drives (SSD) and HDDs provide enough read and write performance to utilize the entire transfer capacity of the USB 2.0 interface. Drawbacks of attaching SSDs or HDDs to each cluster node are a higher purchase cost for the cluster and an increased energy consumption. Mordvinova et al. (2009) showed that RAID arrays of USB flash storage drives can be purchased at lower cost than SSDs or HDDs, but like SD cards they usually provide a poor performance for random write operations. Therefore, USB flash storage drives are a useful option mainly for read-mostly applications like storing the content of web servers and for CPU bound applications.
For these reasons, only SD cards are used in the cluster of single board computers.

Network performance
The network performance between the nodes was measured with the command-line tool iperf v2.0.5 and with the NetPIPE v3.7.2 benchmark. According to iperf, the network performance between the nodes of the cluster is 76-77 Mbit per second.
A more detailed analysis of the network performance is possible with the NetPIPE benchmark. It tests the latency and throughput over a range of message sizes between two processes. The benchmark was executed inside the cluster once by just using TCP as the end-to-end (transport layer) protocol and once by using the Open MPI message passing library. The results in Fig. 3 show that the smaller a message is, the more the transfer time is dominated by the communication layer overhead. For larger messages, the communication rate becomes bandwidth limited by a component in the communication subsystem. Examples are the data rate of the network link, the utilization of the transmission medium or a specific device between sender and receiver, like the network switch inside the mobile cluster.
As described by Snell et al. (1996) and clearly evident in Fig. 3, using MPI (which also uses TCP as the transport layer protocol) introduces overhead that has a negative impact on throughput and latency. The best measured throughput when using MPI is 65 Mbit per second. When using just TCP, the throughput reaches up to 85 Mbit per second.
As long as the payload fits into a single TCP segment, the latency when using MPI is approximately ten times worse compared with using just TCP. The maximum transmission unit (MTU), which specifies the maximum payload inside an Ethernet frame, is 1500 Bytes in our cluster. Consequently, the maximum segment size (MSS), which specifies the maximum payload inside a TCP segment, is 1460 Bytes (the MTU minus 20 Bytes each for the IP and TCP headers).
The drop of the data rate when using MPI at around 64 kB payload size is caused by the MPI library, which implements the asynchronous eager protocol and the synchronous rendezvous protocol. While the eager protocol does not await an acknowledgement before starting a send operation, the rendezvous protocol does. The eager protocol rests on the assumption that the receiver process can store small messages in its receive buffer at any time. The default size limit up to which the installed Open MPI library sends messages via the eager protocol is 64 kB.
Further drops of the data rate when using MPI, especially around 4 MB payload size, are probably caused by the limited CPU resources of the Raspberry Pi nodes. When executing the MPI benchmark with such a message size, the CPUs of the nodes are almost entirely utilized.
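The NetPIPE measurement principle, a ping-pong between two processes over a range of message sizes, can be sketched with a local socket pair. This of course measures loopback between two threads rather than the cluster's Ethernet, so only the structure, not the numbers, carries over:

```python
import socket
import threading
import time

def _recv_exact(sock, size):
    """Receive exactly `size` bytes from a stream socket."""
    chunks = []
    while size > 0:
        data = sock.recv(size)
        if not data:
            raise ConnectionError("peer closed connection")
        chunks.append(data)
        size -= len(data)
    return b"".join(chunks)

def pingpong_rtt(msg_size, rounds=20):
    """NetPIPE-style ping-pong: average round-trip time in seconds
    for messages of msg_size bytes between two endpoints."""
    a, b = socket.socketpair()

    def echo():
        # The "remote" side returns every message unchanged.
        for _ in range(rounds):
            b.sendall(_recv_exact(b, msg_size))

    t = threading.Thread(target=echo)
    t.start()
    payload = b"x" * msg_size
    start = time.perf_counter()
    for _ in range(rounds):
        a.sendall(payload)
        assert _recv_exact(a, msg_size) == payload
    elapsed = time.perf_counter() - start
    t.join()
    a.close()
    b.close()
    return elapsed / rounds
```

Plotting `msg_size / pingpong_rtt(msg_size)` over a range of sizes reproduces the characteristic shape of Fig. 3: overhead-dominated for small messages, bandwidth-limited for large ones.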
The poor overall Ethernet performance of the Raspberry Pi nodes is probably caused by the fact that the 10/100 Ethernet controller is a component of the LAN9512 controller 13. This chip contains the USB 2.0 hub and a built-in 10/100 Mbit Ethernet controller, which is internally connected (see Fig. 4) with the USB hub.

13 Further information about the LAN9512 controller, which contains an USB 2.0 hub and a 10/100 Ethernet controller, is provided in the technical specification from the manufacturer. This document is accessible via the URL http://ww1.microchip.com/downloads/en/DeviceDoc/9512.pdf.

Fig. 3 Analysis of the network performance inside the cluster by using the NetPIPE benchmark

Analyzing the clusters' performance with the HPL
Besides analyzing the performance of individual nodes and their resources, it is interesting to examine the performance of the cluster as a whole.
The High Performance Linpack (HPL) 14 benchmark is an established approach for investigating the performance of cluster systems. It is used, among others, by the Top500 project, which maintains a list of the 500 most powerful computer systems. As described by Luszczek et al. (2005) and Dunlop et al. (2010), the benchmark solves a linear system of equations of order n, distributed over a process grid of size P × Q, by using double-precision (8 Bytes) floating-point arithmetic (Gaussian elimination with partial pivoting) on computer systems with distributed memory. The execution of the HPL can be specified manually in the config file HPL.dat with several parameters. Some tools like the Top500 HPL Calculator 15 are helpful for finding initial settings, but finding the most appropriate settings for a specific system is not a simple task and takes some time.
P × Q is equal to the number of processor cores used. The developers of the HPL recommend 16 that P (the number of process rows) and Q (the number of process columns) should be approximately equal, with Q slightly larger than P. Consequently, in the cluster of eight Raspberry Pi single board computers, the values P = 1, Q = 1 were used for benchmarking just a single node, P = 1, Q = 2 for two nodes, P = 1, Q = 4 for four nodes and P = 2, Q = 4 for the entire cluster with eight nodes.
The parameter N specifies the problem size, i.e. the order of the coefficient matrix. It is challenging to find the largest problem size that fits into the main memory of the specific system. Therefore, the main memory capacity for storing double precision (8 Bytes) numbers needs to be calculated. Utilizing the entire main memory capacity for the benchmark is impossible because the operating system and the running processes still consume memory, and using the swap memory has a negative impact on the performance. Thus, it is promising 17, 18 to set N to a value that utilizes 80-90 % of the available main memory.

14 Further information about the High-Performance Linpack (HPL) benchmark is provided on the web page http://www.netlib.org/benchmark/hpl/.
The problem size N can be calculated with Eq. (2). It depends on the number of nodes in the system, the reduction coefficient R, which specifies what proportion of the entire main memory of the cluster shall be utilized by the benchmark, and the main memory capacity M of a single node in GB:

N = √((nodes × M × R × 1024³) / 8)    (2)

The Raspberry Pi cluster nodes used for this project are equipped with 512 MB main memory. A part of the main memory is assigned as video memory to the GPU, which lacks its own dedicated memory. Because the GPUs are not used at all in the cluster, the minimal GPU memory of 16 MB was set. This leaves 496 MB main memory for the operating system and the applications on each node. After the operating system Raspbian and the daemon and client for the distributed file system are started, approximately 400-430 MB main memory remains available on each node.

If, for example, the value of N shall be big enough to fill around 80 % of the memory capacity of four nodes (P = 1, Q = 4) of the cluster system, Eq. (2) with nodes = 4, R = 0.8 and M ≈ 406 MB of free memory per node yields N ≈ 13,054. A further important parameter is the block size NB. As an optimization, N should be NB aligned 18. For this example, if we consider NB = 32, we calculate 13,054 / 32 = 407.9375 ≈ 407 and next 407 × 32 = 13,024 = N. For this work, the HPL benchmark was executed with different parameters in the cluster of single board computers. Figure 5 shows the Gflops when executing the benchmark with different values for the parameter NB in the cluster system when using all eight nodes and utilizing different proportions of the system's total main memory. These tests were carried out to find a recommendable value for NB.
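The calculation of N and its NB alignment can be sketched as follows. The 0.4 GB memory value in the usage example is an assumption, rounded from the roughly 400-430 MB of free memory per node mentioned in the text, so the result differs slightly from the example value:

```python
import math

def hpl_problem_size(nodes, mem_gb_per_node, fraction, nb):
    """Largest NB-aligned problem size N such that the N x N matrix
    of double-precision (8 Byte) numbers fills `fraction` of the
    cluster's aggregate main memory."""
    budget_bytes = nodes * mem_gb_per_node * fraction * 1024**3
    n = math.sqrt(budget_bytes / 8)   # Eq. (2)
    return int(n // nb) * nb          # align N down to a multiple of NB
```

With four nodes, R = 0.8, NB = 32 and an assumed 0.4 GB of free memory per node, this yields a value close to the N = 13,024 used in the example above.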
For NB = 16 and NB = 32, a performance drop is observed when utilizing 95 % of the system's main memory. This is caused by the heavy use of swap memory.
The results in Fig. 5 show that from the tested values, NB = 32 causes the best performance. For this reason, further performance investigations with the HPL benchmark were carried out with this value for the parameter NB.
17 Further information provides the document Frequently Asked Questions on the Linpack Benchmark and Top500 from Jack Dongarra, which provides the web page http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html. 18 Further information provides the document HowTo-High Performance Linpack (HPL) from Mohamad Sindi, which is accessible via the URL http://www.crc.nd.edu/~rich/CRC_Summer_Scholars_2014/HPL-HowTo.pdf. Analysis of the speedup Table 7 shows the values of the parameters N, P, Q and NB, as well as the runtimes, required to solve the linear system and the resulting Gflops 19 . The benchmark was executed in the cluster with just a single node, two nodes, four nodes and eight nodes to investigate the speedup. The speedup S P , that can be achieved when running a program on P processors is defined as where F 1 is the Gflops on a single-processor system and F P is the Gflops on a multiprocessor system.
The theoretical maximum speedup is equal to the number of single-processor nodes: 2 for two nodes, 4 for four nodes, 8 for eight nodes, etc.
The results in Table 7 show that increasing the number of nodes also increases the speedup significantly. The best benchmark results were obtained when N is set to a value that utilizes 80-90 % of the available main memory.
The low speedup when utilizing 95 % of the system's main memory is caused by the heavy use of swap memory. Figure 6 highlights this observation.

Analysis of the efficiency
Especially for the previously mentioned Top500 list, two performance indicators are considered important. These are:
• Rpeak, which is the theoretical peak performance of the system. It is determined by counting the number of floating-point additions and multiplications (in double precision) that can be completed during a period of time, usually the cycle time of the machine (see footnote 17). The ARM 11, which is used by the Raspberry Pi computers, can process a floating-point addition in one cycle and requires two cycles for a floating-point multiplication 20. The calculation of Rpeak of a system is as follows:

Rpeak [Gflops] = Clock speed per core [GHz] × Number of cores × Operations per cycle

Thus, the Rpeak of a cluster of eight Raspberry Pi nodes (with 800 MHz clock speed) is 6.4 Gflops for floating-point addition operations and 3.2 Gflops for floating-point multiplication operations.
• Rmax, which is the maximal performance that was achieved with the HPL. In the case of our cluster, Rmax has the value 1.367 Gflops (see Fig. 5; Table 7).
• The efficiency of a specific system in percent is calculated via Rmax / Rpeak × 100. In the case of our cluster, the efficiency depends on the executed operations and is only between ≈ 21 % and ≈ 42 %. The exact reason for this low efficiency was not evaluated. But as described by Luszczek et al. (2005), the HPC Challenge benchmark test suite stresses not only the processors, but the memory system and the interconnect too. Therefore, it is likely that the low network performance (see Network performance section), as well as the memory performance of the Raspberry Pi computers, have a negative impact here.

19 Flops is an acronym for floating-point operations per second.

Fig. 5 Analysis of the cluster's Gflops performance, when using all eight nodes, by using the HPL benchmark with different values for the parameter NB and different proportions of the system's total main memory utilized. The concrete values for the problem size N can be seen in Table 7.
Fig. 6 Analysis of the cluster's speedup by using the HPL benchmark with NB = 32, different numbers of nodes and different proportions of the system's total main memory utilized. The concrete values for the problem size N can be seen in Table 7.
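A minimal sketch, using only the values stated in the text (800 MHz clock, eight single-core nodes, one addition per cycle, one multiplication per two cycles, Rmax = 1.367 Gflops), reproduces the Rpeak and efficiency figures:

```python
def rpeak_gflops(clock_ghz, cores, ops_per_cycle):
    """Rpeak = clock speed per core [GHz] x number of cores x operations per cycle."""
    return clock_ghz * cores * ops_per_cycle

def efficiency_percent(rmax, rpeak):
    """Efficiency in percent: Rmax / Rpeak * 100."""
    return rmax / rpeak * 100

rpeak_add = rpeak_gflops(0.8, 8, 1.0)  # FP addition: 1 op/cycle -> 6.4 Gflops
rpeak_mul = rpeak_gflops(0.8, 8, 0.5)  # FP multiplication: 2 cycles/op -> 3.2 Gflops
rmax = 1.367                           # Gflops measured with the HPL

print(efficiency_percent(rmax, rpeak_add))  # → ~21.4 (percent, vs. addition peak)
print(efficiency_percent(rmax, rpeak_mul))  # → ~42.7 (percent, vs. multiplication peak)
```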

Analysis of the energy-efficiency
Knowing the cluster's electric energy consumption (see Table 2) and its performance when executing the HPL benchmark (see Analyzing the clusters' performance with the HPL section) are the preconditions for analyzing the cluster's energy-efficiency.
The Green500 list, which is a complement to the Top500 list, uses the flops per Watt metric (Sharma et al. 2006) to rank the energy efficiency of supercomputers 21. The metric is defined as

flops per Watt = Rmax / P(Rmax)

where P(Rmax) is the average system power consumption while executing the HPL with a problem size that delivers Rmax. When executing the HPL benchmark, the power consumption of the cluster depends on the number of nodes used for the benchmark. The average system power consumption while executing the HPL is approximately 26 W when using all eight nodes. With Rmax = 1.367 Gflops, the cluster provides approximately 52.57 Mflops per Watt.
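The flops-per-Watt figure can be checked with the values from the text (Rmax = 1.367 Gflops, ≈ 26 W average power during the benchmark):

```python
def mflops_per_watt(rmax_gflops, avg_power_watt):
    """Green500 metric: sustained HPL performance per Watt,
    here converted from Gflops to Mflops."""
    return rmax_gflops * 1000 / avg_power_watt

print(mflops_per_watt(1.367, 26))  # → ~52.6 Mflops per Watt
```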

Conclusions and future work
The performance of single board computers cannot compete with higher-value systems because of the performance of their components, especially the CPU, main memory and network interface. The same applies to clusters of single board computers. The maximum observed performance Rmax of the cluster system implemented for this work is

20 Further information is provided in the Technical Reference Manual-Components of the processor-Vector Floating-Point (VFP) for the ARM1176JZF-S processor, which can be accessed at the URL http://infocenter.arm.com/help/topic/com.arm.doc.ddi0301h/Cegdejjh.html.

21 Further information is provided in the Power Measurement Tutorial for the Green500 List, which can be accessed at the URL http://www.green500.org/sites/default/files/tutorial.pdf.