[Novalug] a system hw / engineering question

John Franklin franklin at elfie.org
Tue Oct 9 22:56:55 EDT 2007


On Oct 9, 2007, at 1:26 PM, Megan Larko wrote:

> Happy Tuesday Folks,
>
> Although I have built my own assemblage of hardware for spec I have  
> never actually designed/engineered any computer hw components.   
> That said, please be patient with me if you choose to respond to  
> this query:
>
> At my job we have several computers running RHEL5 and Fedora Core6.  
> There is a Lustre file system with user data connected to the  
> master/head nodes as well as to the compute nodes (accessed by  
> users via the torque scheduler).  The network interconnect is  
> InfiniBand.
>
> A user submits a job via torque to the compute nodes (cn#)  
> requesting 8 processors and 16Gb of RAM.  If the requested  
> processors are all on a single cn, the job fails because it states  
> that it does not have sufficient memory resources.  Each cn has  
> 32Gb of RAM in it.  If the exact same script/code is submitted to  
> two cns, requesting 4 CPUs per cn and 16Gb memory, the job runs,  
> but it uses more wallclock time per step.
>
> The CPUs involved are Dual Core Opterons, Dual Core Xeon.  All of  
> which are between 2.0GHz and 2.4GHz.
>
> ASCII Diagram of motherboard (Tyan for AMD, Asus for Intel) layout:
>
>                                       --------
>                                       --------
>                                       --------
>                                       --------
>    | | | |                             XX
>    | | | |                             XX
>    | | | |  XX
>    | | | |  XX
>
>             XX                    XX  | | | |
>             XX                    XX  | | | |
>       --------                        | | | |
>       --------                        | | | |
>       --------
>       --------
>
> ...where...
>  dashed lines indicate memory slots (fully populated)
>  XX symbol indicates CPU hw
>
> I am guessing (really genuinely guessing) that if the user's job is  
> using both parts of a dual-core CPU then its access to memory is  
> coupled to those DIMMS positioned close to that CPU unit and as  
> such each virtual CPU of a dual-core, for example, would have to  
> share that memory resource.  If the user's job accesses only 1 core  
> of the CPU then that one core (assuming no other jobs on the box at  
> the time) would have access to the full population of memory seated  
> next to it.  IOW one core has access to all of my dashed lines and  
> running dual-core on a single CPU has to share (split??) that  
> dashed line memory access.  So the user is better off using only  
> part---one core---of the CPU and extending the job over more cn's  
> than trying to run on a multi-core CPU for an apparently memory- 
> intensive job.
>
> Is this a reasonable guess?   Could the problem perhaps lie  
> elsewhere such as shared L1 and L2 cache on the physical CPUs?
>
> We are considering purchasing a Quad-core Intel 5335 (771 socket)  
> in the very near future.   If I am going to see many job failures  
> because of insufficient memory errors I may push for no more than  
> dual-core in our system.   Would changing to 2Gb DIMMS and giving a  
> cn 64Gb (max the boards can recognize) be a reasonable action to  
> pair with the purchase of a Quad-core processor?
>
> Okay hardware and engineering gurus, strut your stuff!!
>


You're right in saying each AMD CPU has direct access to the memory  
coupled with its socket.  Each CPU package (regardless of core count)  
has an on-board memory controller that manages the RAM nearest it at  
chip speeds.  However, each CPU package also has three HyperTransport  
links leaving it.  Two go to two other processors, the last goes to a  
PCI bridge or some other peripheral interconnect.  Each CPU can  
access the memory controlled by another CPU over the HT bridge, or  
access the PCI bridge connected to another CPU.  Sometimes this means  
jumping across multiple HT links.
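
To make the "multiple HT links" point concrete, here is a rough sketch  
that counts how many HT links a CPU has to cross to reach the memory  
behind each socket.  The four-socket ring in HT_LINKS is purely an  
assumption for illustration (each socket using its two CPU-facing links  
to reach two neighbours); your boards may be wired differently, and the  
function name is made up.

```python
from collections import deque

# Hypothetical 4-socket layout: each socket links to two neighbours (a ring).
# This is an illustrative assumption, not the wiring of any specific board.
HT_LINKS = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

def ht_hops(links, start):
    """Count HT links crossed from `start` to reach each socket's memory."""
    hops = {start: 0}          # local memory costs zero hops
    queue = deque([start])
    while queue:               # plain breadth-first search over the links
        socket = queue.popleft()
        for neighbour in links[socket]:
            if neighbour not in hops:
                hops[neighbour] = hops[socket] + 1
                queue.append(neighbour)
    return hops

print(ht_hops(HT_LINKS, 0))
```

In this made-up layout, socket 0 reaches its own RAM in zero hops, its  
two neighbours in one hop, and the socket diagonally opposite in two.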

The Intel chips have very fast memory going to a single Northbridge,  
and from there dual-independent busses to each of the sockets (if  
dual or quad-core is supported.)  All CPUs have to use these two  
busses and the Northbridge to get to memory, so while individual  
memory accesses are faster on an Intel system, the contention between  
CPUs makes it slower overall than the Opteron systems.

Because each Opteron CPU can access its own memory very fast, and  
independent of the other CPUs, the Opteron system scales better than  
Intel systems to 4 and 8 socket configurations.  In two socket  
configurations, the Intel systems tend to win.  When you get to one- 
socket, cost or other factors tend to sway the decisions more than  
performance.

Getting back to your setup, it sounds like the two-CN request is  
running on two distinct systems, shared over IB, which succeeds  
because the cluster can find two systems with 4CPU/8GB free, but no  
single system with 8CPU/16GB free.  The two-job approach is slower  
because the two jobs have to communicate over the InfiniBand (which is  
slower than the HyperTransport interconnect.)  Would dropping in 64GB of RAM  
help?  Almost certainly.  However, if time is really more important  
than budget, I'd consider picking up an 8 socket AMD system like a  
Sun x4600 and load it up to 128GB of memory with 2GB DIMMs (or 256GB  
with 4GB DIMMs if you've got the budget.)  If budget matters more,  
encouraging your customers to submit more, but smaller, jobs is a win.   
The downside of an 8 socket system is that cache coherency becomes  
increasingly expensive as the number of independent caches increases.
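
Here is a rough sketch of why the single-node request can fail while  
the two-node one succeeds.  It is only a toy fit check, not how torque  
actually decides; the node names, the free CPU/memory numbers in FREE,  
and the function name are all invented for illustration.

```python
# Hypothetical snapshot of what is currently free on each compute node.
# The figures are made up; they are not measurements from any real cluster.
FREE = {
    "cn1": {"cpus": 4, "mem_gb": 20},
    "cn2": {"cpus": 6, "mem_gb": 12},
    "cn3": {"cpus": 8, "mem_gb": 10},
}

def nodes_that_fit(free, cpus, mem_gb):
    """Return the nodes that have enough free CPUs AND enough free memory."""
    return sorted(
        name
        for name, res in free.items()
        if res["cpus"] >= cpus and res["mem_gb"] >= mem_gb
    )

# One node with 8 CPUs and 16GB: nothing qualifies in this snapshot.
print(nodes_that_fit(FREE, 8, 16))

# Two nodes with 4 CPUs and 8GB each: several candidates qualify.
print(nodes_that_fit(FREE, 4, 8))
```

A node has to satisfy both limits at once, so a box with plenty of RAM  
installed can still be turned down if other jobs hold some of its cores  
or memory.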

Honestly, even if you have 80 sockets and decicore chips, there's a  
scientist out there who will oversubscribe it.  Geeks, by their  
nature, tend to consume all available resources plus one.  Reworking  
the cluster manager to automatically break up jobs so they span  
multiple systems, or finding ways to consolidate jobs onto fewer  
systems with advanced load-balancing so you can have a large pool of  
resources on-demand may be a better plan in the long run.

AMD has quad-core chips coming out Real Soon Now™.  Depending on what  
socket they have, there's a good chance that your current Tyan will  
accept AMD quad-cores with a BIOS upgrade.

jf

