[Novalug] a system hw / engineering question
John Franklin
franklin at elfie.org
Tue Oct 9 22:56:55 EDT 2007
On Oct 9, 2007, at 1:26 PM, Megan Larko wrote:
> Happy Tuesday Folks,
>
> Although I have built my own assemblage of hardware for spec I have
> never actually designed/engineered any computer hw components.
> That said, please be patient with me if you choose to respond to
> this query:
>
> At my job we have several computers running RHEL5 and Fedora Core6.
> There is a Lustre file system with user data connected to the
> master/head nodes as well as to the compute nodes (accessed by
> users via the torque scheduler). The network interconnect is
> InfiniBand.
>
> A user submits a job via torque to the compute nodes (cn#)
> requesting 8 processors and 16GB of RAM. If the requested
> processors are all on a single cn, the job fails, stating
> that it does not have sufficient memory resources. Each cn has
> 32GB of RAM in it. If the exact same script/code is submitted to
> two cns, requesting 4 CPUs per cn and 16GB memory, the job runs,
> but it uses more wallclock time per step.
>
> The CPUs involved are Dual Core Opterons and Dual Core Xeons, all
> of which run between 2.0GHz and 2.4GHz.
>
> ASCII Diagram of motherboard (Tyan for AMD, Asus for Intel) layout:
>
> --------
> --------
> --------
> --------
> | | | | XX
> | | | | XX
> | | | | XX
> | | | | XX
>
> XX XX | | | |
> XX XX | | | |
> -------- | | | |
> -------- | | | |
> --------
> --------
>
> ...where...
> dashed lines indicate memory slots (fully populated)
> XX symbol indicates CPU hw
>
> I am guessing (really genuinely guessing) that if the user's job is
> using both parts of a dual-core CPU then its access to memory is
> coupled to those DIMMs positioned close to that CPU unit and as
> such each virtual CPU of a dual-core, for example, would have to
> share that memory resource. If the user's job accesses only 1 core
> of the CPU then that one core (assuming no other jobs on the box at
> the time) would have access to the full population of memory seated
> next to it. IOW one core has access to all of my dashed lines and
> running dual-core on a single CPU has to share (split??) that
> dashed line memory access. So the user is better off using only
> part (one core) of the CPU and extending the job over more cn's
> than trying to run on a multi-core CPU for an apparently
> memory-intensive job.
>
> Is this a reasonable guess? Could the problem perhaps lie
> elsewhere such as shared L1 and L2 cache on the physical CPUs?
>
> We are considering purchasing a Quad-core Intel 5335 (771 socket)
> in the very near future. If I am going to see many job failures
> because of insufficient memory errors I may push for no more than
> dual-core in our system. Would changing to 2GB DIMMs and giving a
> cn 64GB (max the boards can recognize) be a reasonable action to
> pair with the purchase of a Quad-core processor?
>
> Okay hardware and engineering gurus, strut your stuff!!
>
You're right in saying each AMD CPU has direct access to the memory
coupled with its socket. Each CPU package (regardless of core count)
has an on-board memory controller that manages the RAM nearest it at
chip speeds. However, each CPU package also has three HyperTransport
links leaving it. Two go to two other processors; the third goes to a
PCI bridge or some other peripheral interconnect. Each CPU can
access the memory controlled by another CPU over an HT link, or
access the PCI bridge connected to another CPU. Sometimes this means
jumping across multiple HT links.

The Intel chips have very fast memory going to a single Northbridge,
and from there dual-independent busses to each of the sockets (if
dual or quad-core is supported). All CPUs have to use these two
busses and the Northbridge to get to memory, so while individual
memory accesses are faster on an Intel system, the contention between
CPUs makes it slower overall than the Opteron systems.

Because each Opteron CPU can access its own memory very fast, and
independently of the other CPUs, the Opteron system scales better than
Intel systems to 4 and 8 socket configurations. In two socket
configurations, the Intel systems tend to win. When you get to
one socket, cost or other factors tend to sway the decisions more than
performance.
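
To make the "jumping across multiple HT links" point concrete, here is a
rough sketch in Python that counts how many links a CPU has to cross to
reach memory owned by another socket. The four-socket square wiring in
HT_LINKS is an assumption made up for illustration, not the layout of any
particular Tyan board, and the function names are hypothetical. It only
counts hops; it says nothing about actual latency.

```python
from collections import deque

# Hypothetical 4-socket layout: each socket links to two neighbours
# over HyperTransport, forming a square. Real boards may be wired
# differently; this wiring is an assumption for illustration only.
HT_LINKS = {
    0: [1, 2],
    1: [0, 3],
    2: [0, 3],
    3: [1, 2],
}


def ht_hops(links, src, dst):
    """Return the minimum number of HT links between two sockets (BFS)."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        socket, hops = queue.popleft()
        if socket == dst:
            return hops
        for neighbour in links[socket]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    raise ValueError("sockets are not connected")


def hop_table(links):
    """Hops from every socket to the memory attached to every socket."""
    return {s: {d: ht_hops(links, s, d) for d in links} for s in links}


if __name__ == "__main__":
    for src, row in hop_table(HT_LINKS).items():
        print(src, row)
```

Zero hops means the memory hangs off the CPU's own controller; in this
made-up square, the socket in the opposite corner is two links away.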
Getting back to your setup, it sounds like the two-cn request is
running on two distinct systems, connected over IB, which succeeds
because the cluster can find two systems with 4 CPUs/8GB free, but no
single system with 8 CPUs/16GB free. The two-node approach is slower
because the two halves of the job have to communicate over the
InfiniBand (which is slower than the HyperTransport interconnect).
Would dropping in 64GB of RAM help? Almost certainly. However, if
time is really more important than budget, I'd consider picking up an
8 socket AMD system like a Sun x4600 and loading it up to 128GB of
memory with 2GB DIMMs (or 256GB with 4GB DIMMs if you've got the
budget). If it's not, then encouraging your customers to use more,
but smaller, jobs is a win.
The downside of an 8 socket system is that cache coherency becomes
increasingly expensive as the number of independent caches increases.
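
The placement argument above can be sketched in a few lines of Python.
This is not how Torque actually schedules anything; the Node class, the
function names, and the free-resource figures in EXAMPLE_NODES are all
invented for illustration. It just shows how a request for 8 CPUs/16GB
can fail on one node yet succeed when split evenly across two.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpus: int
    free_mem_gb: int


def nodes_that_fit(nodes, cpus, mem_gb):
    """Names of nodes with at least this many free CPUs and free GB."""
    return [n.name for n in nodes
            if n.free_cpus >= cpus and n.free_mem_gb >= mem_gb]


def can_place(nodes, total_cpus, total_mem_gb, node_count):
    """True if the job, split evenly over node_count nodes, fits."""
    per_node_cpus = -(-total_cpus // node_count)    # ceiling division
    per_node_mem = -(-total_mem_gb // node_count)   # ceiling division
    fitting = nodes_that_fit(nodes, per_node_cpus, per_node_mem)
    return len(fitting) >= node_count


# Made-up snapshot of what is free on three busy compute nodes.
EXAMPLE_NODES = [
    Node("cn1", free_cpus=4, free_mem_gb=20),
    Node("cn2", free_cpus=4, free_mem_gb=12),
    Node("cn3", free_cpus=8, free_mem_gb=10),
]

if __name__ == "__main__":
    print("one node: ", can_place(EXAMPLE_NODES, 8, 16, 1))
    print("two nodes:", can_place(EXAMPLE_NODES, 8, 16, 2))
```

In this invented snapshot no single node has both 8 free CPUs and 16GB
free, so the one-node request is rejected, while every node can host a
4 CPU/8GB half.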
Honestly, even if you have 80 sockets and decicore chips, there's a
scientist out there who will oversubscribe it. Geeks, by their
nature, tend to consume all available resources plus one. Reworking
the cluster manager to automatically break up jobs so they span
multiple systems, or finding ways to consolidate jobs onto fewer
systems with advanced load-balancing so you can have a large pool of
resources on demand, may be a better plan in the long run.

AMD has quad-core chips coming out Real Soon Now™. Depending on what
socket they use, there's a good chance that your current Tyan will
accept AMD quad-cores with a BIOS upgrade.
jf