Making Sense of the Intel Haswell Transactional Synchronization eXtensions

Name: Making Sense of the Intel Haswell Transactional Synchronization eXtensions
Item: Making Sense of the Intel Haswell Transactional Synchronization eXtensions
Author: Johan De Gelas

by Johan De Gelas on September 20, 2012 12:15 AM EST

29 Comments | Add A Comment

29 Comments

Easy to Use?

As we mentioned, HLE is backwards compatible. How does that work? If developers rely on libraries like glibc for locking, Intel claims only those libraries have to be changed. Take a look at the slide below:

Notice the rectangle with the pointer "application". For the developer of such an application that uses dynamic linking nothing changes. The library simply has to place xacquire in front of each lock (here a spinlock) and xrelease at the end of each lock. So (dynamic) linking to a new library and running it on an Intel Haswell CPU should let your application use TSX. If you run the same application plus a TSX enabled library on a previous generation CPU, it will still run but will not use TSX (HLE).

Haswell's TSX TSX Performance

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

29 Comments

View All Comments

1008anan - Thursday, September 20, 2012 - link
Good summary Johan De Gelas. Look forward to future articles that further elaborate on how exactly Transactional Synchronization technology (TSX) achieves hardware accelerated transactional memory in a scalable generalized way.

Part of the solution in my opinion is for a core that can have 2 or more simultaneous thread to only execute 1 thread at a time for single threaded computationally heavy work loads. To give an example, a single Haswell Corps can execute 32 single precision (32 bit) operations per clock. At 2 gigahertz SoC speed, a single Corps can execute 64 billion operations per second with single threaded code. Unfortunately this can only help so much. To make major performance gains an efficient generalized scalable way needs to be found to distribute single threaded computational work loads to multiple different cores. Much easier said than done.
extide - Saturday, September 22, 2012 - link
32 single precision ops is wrong. How many ops they can do per clock has to do with the front end and how many ports it has in it, not how many bits a particular instruction is.
extide - Saturday, September 22, 2012 - link
to further that, Ivy Bridge has 6 ports and Haswell 8, but each of the ports don't necessarily have the same capabilities.
twotwotwo - Thursday, September 20, 2012 - link
Very smart idea, very clever to implement it with backwards compatibility, and it's good that Intel's working out uses for die area that aren't just multiplying cores or cache.

But a sad thing is, when this feature really helps your app, like doubles or triples throughput, then it means you must be leaving speed on the table on _other_ platforms--manycore Ivy Bridge or AMD or whatever--because you didn't take fine-grained locks. If the difference matters, then for many years you'll have to go to the effort to do fine-grained locks and make it fast on Ivy Bridge too.

The other thing is, the problem of parallelizing tasks is deeper than just fine-grained locks being tricky. If you want to make, say, a Web browser faster by taking advantage of multiple cores, you still have to do deep thinking and hard work to find tasks that can be split off, deal with dependencies, etc. Smart folks are trying to do that, but they _still_ have hard work when they're on systems with transactional memory.

That may be overly pessimistic. It's likely apps today _are_ leaving performance on the table by taking coarse locks for simplicity's sake, and they'll get zippier when TSX is dropped in. Or maybe transactional memory will be everywhere, or on all the massively multicore devices anyway, faster than I think. Or, finally, maybe Intel knows TSX won't have its greatest effects for a long time but is making a plan for the very long term like only Intel can.
Paul Tarnowski - Thursday, September 20, 2012 - link
I think it's the last one more than anything else. It really seems to be about setting up architecture for the future. Right now with four and eight cores the losses aren't that high, and effects won't be seen on anything but servers. While it is a big deal, it will be even less important to the consumer market than most other architecture extensions.
USER8000 - Thursday, September 20, 2012 - link
This an article from them nearly three years ago:

http://blogs.amd.com/developer/2009/11/17/the-velo...
NeBlackCat - Thursday, September 20, 2012 - link
I stopped reading at the end of the 2nd para after the bar chart on the first page:

"The root of the locking problems is that locking is a trade-off. Suppose you have a shared data structure such as a small table or list. If the developer is time constrained, which is often the case, the easiest way to guarantee consistency is to let a certain thread lock the entire data structure (e.g. the table lock of MySQL MyISAM). The thread acquires the lock and then executes its code (the critical section). Meanwhile all other threads are blocked until the thread releases the lock on the structure (*). However, locking a piece of data is only necessary if two threads want to write to it. (**)"

(*) all other threads arent locked, only those that also need access to the same data.
(**) locking a piece of data is only *one* thread wants to write it (else you risk a reader reading it before the writer has finished writing it)

And if the first locker is doing something that may take (in CPU terms) considerable time, likely in the database scenario, then the OS will probably schedule something else (there are always other processes and threads wanting to run) on a core/hyperthread running a blocked thread, so it wont sit idle anyway.

Unless things have changed since I was a real time software engineer, anyway
Wardrop - Thursday, September 20, 2012 - link
I think your nitpicking. We all understand the locking problem (those of us that it's relevant to anyway). There's no point in the author going into more detail to clarify what we all should already know - he gives us enough information for us to at least know what he's talking about, and that was the point of that paragraph.
JohanAnandtech - Thursday, September 20, 2012 - link
"all other threads arent locked, only those that also need access to the same data."

With a bit of goodwill, you would accept that this is implied.

" then the OS will probably schedule something else "

Right. But it is no free lunch. In the case of a spinlock, there will several attempts to lock and then a context switch. That is thousands of cycles lost without doing any useful work. BTW if you really want to see some in depth cases, we linked to

http://www.anandtech.com/show/5279/the-opteron-627...
http://www.anandtech.com/show/5279/the-opteron-627...

which goes in a lot more detail.
gchernis - Thursday, September 20, 2012 - link
OP did not explain why software-transactional memory (STM) is good, before diving into Intel's hardware solution. Current breed of databases use STM, but so can other applications. The prerequisite is the assumption that there's much more reading than there's writing going on.

Making Sense of the Intel Haswell Transactional Synchronization eXtensions

Post Your Comment

29 Comments

View All Comments

1008anan - Thursday, September 20, 2012 - link

extide - Saturday, September 22, 2012 - link

extide - Saturday, September 22, 2012 - link

twotwotwo - Thursday, September 20, 2012 - link

Paul Tarnowski - Thursday, September 20, 2012 - link

USER8000 - Thursday, September 20, 2012 - link

NeBlackCat - Thursday, September 20, 2012 - link

Wardrop - Thursday, September 20, 2012 - link

JohanAnandtech - Thursday, September 20, 2012 - link

gchernis - Thursday, September 20, 2012 - link

Log in

Don't have an account? Sign up now