Invisible Engines

June 4, 2014

I first read Invisible Engines back in 2007. I still rate it highly now, and I’m pleased to see you can download the whole book as a PDF from the MIT site. It’s topic is the economic and technical aspects of software platforms. Anyone who’s followed the fortunes of IBM, Microsoft, Apple and Sun, and their respective software and hardware platforms should get a lot from this book. I had high expectations when I originally read it, and I wasn’t disappointed. Looking at the download again today, I can see it’s stood the test of time. The book goes well beyond operating systems as platforms and has excellent material on gaming consoles and the three way tension between mobile handsets, their OSes and network operators. It came out in 2006, so pre dates the rise of social software platforms. But the principles it elucidates on multi sided markets and APIs are obviously applicable to Facebook, Google and Twitter. And in the financial sector they apply equally to industry giants like Bloomberg who are a classic multi sided play. Bloomberg as a platform derives revenue from clients paying $1000/month for a terminal. The other sides of its market are dealers contributing quotes and liquidity to Bloomberg as an ECN, and software developers using Bloomberg APIs.

When you’re building a server product that will support C++ APIs you need to consider your ABI – the binary interface. Typically, C++ APIs are distributed as headers and libs. If the functions your API exports include parameters that use, for instance, std::string, you immediately have a problem, as you’re requiring client code to use the same STL implementation as you did to build the lib. That’s OK if client code has access to the source, and can rebuild. But commercial, proprietary products, tend not to distribute source. So how to avoid forcing dependencies on API client code? I went searching for some resources, and found two especially good ones I had to flag.

Here’s  Thiago Macieira on binary compatibility: an excellent presentation with guidelines for library authors. Here’s a summary of Thiago’s recommendations…

 

  1. Use pimpl idiom to hide object size
  2. Use plain old data types in function signatures
  3. Don’t hand out ptrs or refs to internals
  4. No inline funcs
  5. All classes need one non inline virtual; probably the dtor
  6. Avoid virtuals in template classes
  7. Do not reorder or remove public members, or change access levels

2 means no STL or Boost types in function parameters. I’ll address 6 by avoiding templates in my API.

This article by Agner Fog is a superb detailed survey of data sizes & alignments,  stack alignment, call conventions for register usage and parameter handling, name mangling schemes and object file formats inc COFF for all the major x86 & x64 OSes and C++ compilers. Strongly recommended.

I’ve been getting into windbg while working on my POC. I’ve long been a fan of Microsoft’s Visual Studio debugger. Even back in the late 90s, when I got a serious case of Open Source Religion after falling under the spell of Eric Raymond’s Cathedral and the Bazaar, and went through an anti Microsoft phase, I never stopped rating the debugger. Visual Studio debugger is great, but windbg is a whole ‘nother thing. It’s the debugger MS use themselves for debugging Windows. Yes, the interface is a little clunky compared to the VS debugger, but the power of the command set more than compensates. It’s got it’s own scripting system built in, so you can construct custom breakpoints: for instance, break the 5th time round the loop when this int is greater than 100. And it has an API too, so the debug engine can be driven from other languages like Python. Personally, I’m really enjoying discovering the power of windbg. While I do so I’m capturing tips and tricks here.

Python DAGs

May 5, 2014

Tales flags some interesting developments in the Python world. The demand for Python developers in finance does seem to be building. Both Man and Getco are big users, and as Tales points out, JP Morgan and BAML both use Python as the primary programming languages in their Athena and Quartz systems, both of which are inspired by Goldman’s SecDB/Slang. Tales wonders if Washington Square Tech will the fourth implementation of this paradigm; I believe it may be the fifth, as Morgan Stanley had an Athena like project called Pioneer during Jay Dweck’s ill fated tenure. Apparently that project is now defunct. The Athena paradigm is technically a very powerful solution for trading businesses that have run on ad hoc solutions using Excel for pricing and risk. Partly because they seek to replace Excel based pricing and risk, and partly because it’s compute efficient, all implementations of the Athena SecDB/Slang paradigm implement Directed Acyclic Graphs. I’m guessing that Washington Square Tech will think this could be very appealing for buy side firms that don’t have big in house tech stacks, together with incumbent tech teams defending them against replacements.

Directed Acyclic Graphs (DAGs) are a powerful implementation technique for minimising the load of compute intensive tasks like pricing and risk, and this is one good reason why Excel has been so successful in this area, as Ben Lerner of DataNitro explains persuasively here. So it should be no surprise to see that Man Groups own open source Python codebases include a DAG implementation, MDF. The MDF docs include a very good illustration of the power of the DAG approach.

tornado & websockets

April 26, 2014

The core of my POC prototype is a server engine, which doesn’t really make for a great demo. Generally, people grasp concepts quicker if they can see a tangible realization. So I needed a realistic way to show live ticking data getting cranked out by the server. A browser GUI seemed a natural candidate. And being a Pythonista I wanted to do the server coding in Python. Until recently getting live ticking data pushed up to a browser was a big deal, requiring sophisticated server products like the Caplin Liberator, and rich GUI toolkits like Caplin Trader. Fortunately, it’s now possible to hack some demoware in the form of a live, ticking webpage using some really simple jQuery & websockets in the browser, and tornado on the server side. JavaScript and browser GUIs are not my forte, so I won’t comment any further, except to note how much easier it seems than five years ago. On the server side, though, I do have more experience. Back in 2000 I was doing server side web dev in Python using Zope. Zope is a very powerful system, featuring a built in Object DB and an inheritance by instance rather than class mechanism called acquisition. Consequently it has a rather steep learning curve. In recent years Plone has had some traction as a CMS built on top of Zope. In 2001/2 I discovered Twisted Matrix, a general networking toolkit you can use to build any IP based networking functionality. Again there’s a steep learning curve, but it’s much lighter than Zope, and is now very mature. I will be using Twisted to build a general socket server capability for my core product: I’ve got C++ and Python APIs, but I’ll need a socket server for Java support. But what I needed for my demo purpose was real time server push to the browser. And tornado proved to be a good choice. Simple, lightweight, lots of worked examples and focused entirely on websockets. It didn’t take long to get ticking data into a webpage. Recommended!

Nodally Xenograte

April 10, 2014

Well, xenograte from nodally sounds pretty cool: loosely coupled software components in the cloud. Brad Cox’s dream of snap together building blocks, finally realised. Yahoo Pipes, anyone? I’ve even hacked around similarly motivated code myself, but never got so far of course. The problem with these new paradigms is that they ask you to throw away all your old software assets so you can rebuild them again in the new framework. A bit like media companies asking us to buy the same content over and over on different formats: LPs, tapes, CDs, audio DVDs, downloads, pono, VHS, DVD, bluray….  Why can’t someone find a way to breath fresh life into existing assets without reengineering them. Why not, indeed?

POC M1 running

February 16, 2014

I’ve been building a proof of concept for a new product since last August. I’ve just got the first milestone running, which is a big, big step forward. When I’ve got M2 done it will be time to come out of stealth mode, get the message out, do some demos, and start fund raising so I can put a team together…

procmon

November 9, 2013

I’ve long been a fan of Mark Russinovich’s sysinternals utilities, especially procmon and procexp. Today I discovered that in procmon, when browsing filtered file system events, you can get a stack trace by right clicking on the event. Wow!  I didn’t know that. Very powerful diagnostic technique.

LNK1120

April 5, 2013

Just fixed one of those infuriating Visual Studio link errors. There’s lots of good stuff on stackoverflow, of course, but this one was a bit more subtle. It’s still probably quite common though, if you’re integrating code with differing build systems. The root of the problem is linking code built with Visual C++’s own STL with code using STLport. There are various tools and options in Visual Studio that will help you find the root of the error.

dumpbin: command line utility to dump all the symbols defined and referenced by a binary. You can use it on .exe, .lib and .obj files.

undname: command line utility to undecorate C++ names. Given a mangled name it will demangle it.

/P: compiler option to write pre-processor output to file as a .i file. Useful for seeing which headers are actually pulled into each unit of compilation.

/VERBOSE: linker option which will tell you which libraries and objects are searched, and which symbols they resolve.

 

CPU cache size

September 21, 2012

Here’s some code I wrote back in 2009 to figure out the L1 and L2 cache sizes on a Windows host. It works by populating an area of memory with randomized pointers that point back into that same area. The randomization defeats stride based attempts by CPUs to predict the next cache access. The size of the memory region is stepped up in powers of two, and timings taken. The timings should show when the region size exhausts the L1 and L2 caches. This code illustrates why control of memory locality is so important in writing cache friendly code.

#include <iostream>
#include <cmath>
#include <windows.h>

typedef unsigned __int64 ui64;
typedef int* iptr;

const int _1K = 1 << 10;
const int _16M = 1 << 24;

iptr aray[_16M/sizeof(iptr)];

inline ui64 RDTSC( ) {
   __asm {
      XOR eax, eax        ; Input 0 for CPUID, faster than mv
      CPUID            ; CPUID flushes pipelined instructions
      RDTSC            ; Get RDTSC counter in edx:eax matching VC calling conventions
   }
}


int main( int argc, char** argv)
{
    // Ensure we're running on only one core
    DWORD proc_affinity_mask = 0;
    DWORD sys_affinity_mask = 0;
    HANDLE proc = GetCurrentProcess( );
    GetProcessAffinityMask( proc, &proc_affinity_mask, &sys_affinity_mask);
    std::cout << "proc_affinity_mask=" << proc_affinity_mask << ", sys_affinity_mask=" << sys_affinity_mask << std::endl;
    if ( proc_affinity_mask > 1) {
        if ( proc_affinity_mask & 2)
            proc_affinity_mask = 2;
        else
            proc_affinity_mask = 1;
        SetProcessAffinityMask( proc, proc_affinity_mask);
    }

    // avoid interrupts 
    SetPriorityClass( proc, REALTIME_PRIORITY_CLASS);
    
    // stepping up thru the candidate cache sizes
    for ( int bytes = _1K; bytes <= _16M; bytes *= 2) {

        // populate the array with ptrs back into the array
        int    slots = bytes/sizeof(iptr);
        int    slot = 0;
        iptr    start_addr = reinterpret_cast<iptr>( &aray[0]);
        iptr    end_addr = start_addr + slots;
        iptr    rand_addr = 0;
        iptr    addr = 0;

        std::cout << "slots=" << std::dec << slots << ", start_addr=" 
            << std::hex << start_addr << ", end_addr=" << end_addr << std::endl;

        // clear memory first so we can spot unpopulated slots below
        for ( addr = start_addr; addr < end_addr; addr++)
            *addr = 0;

        for ( addr = start_addr; addr < end_addr; addr++) {
            // pick a random slot in the array
            slot = int( float( slots) * float( rand( ))/ float( RAND_MAX));
            rand_addr = start_addr + ( slot == slots ? 0 : slot);

            // look for the next empty slot - nb we may need to wrap around
            while ( *rand_addr) {
                rand_addr++;
                if ( rand_addr >= end_addr)
                    rand_addr = start_addr;
            }
            *rand_addr = reinterpret_cast<int>( addr);
        }

        // sanity check
        for ( addr = start_addr; addr < end_addr; addr++) {
            if ( !*addr)
                std::cout << "empty slot at " << std::hex << addr << std::endl;
        }

        // now we're ready to ptr chase thru the array
        int accesses = int( 1e6);
        addr = aray[0];
        ui64 start_time = RDTSC( );
        for ( int i = 0; i < accesses; i++)
            addr = reinterpret_cast<iptr>( *addr);
        ui64 end_time = RDTSC( );

        ui64 cycles = end_time - start_time;
        float rw_cost = float( cycles)/float( accesses);
        std::cout << "size=" << std::dec << bytes << ", cycles=" 
                    << cycles << ", cost=" << rw_cost << std::endl;
    }

    return 0;
}