I’ve been getting into windbg while working on my POC. I’ve long been a fan of Microsoft’s Visual Studio debugger. Even back in the late 90s, when I got a serious case of Open Source Religion after falling under the spell of Eric Raymond’s Cathedral and the Bazaar, and went through an anti Microsoft phase, I never stopped rating the debugger. Visual Studio debugger is great, but windbg is a whole ‘nother thing. It’s the debugger MS use themselves for debugging Windows. Yes, the interface is a little clunky compared to the VS debugger, but the power of the command set more than compensates. It’s got it’s own scripting system built in, so you can construct custom breakpoints: for instance, break the 5th time round the loop when this int is greater than 100. And it has an API too, so the debug engine can be driven from other languages like Python. Personally, I’m really enjoying discovering the power of windbg. While I do so I’m capturing tips and tricks here.

Python DAGs

May 5, 2014

Tales flags some interesting developments in the Python world. The demand for Python developers in finance does seem to be building. Both Man and Getco are big users, and as Tales points out, JP Morgan and BAML both use Python as the primary programming languages in their Athena and Quartz systems, both of which are inspired by Goldman’s SecDB/Slang. Tales wonders if Washington Square Tech will the fourth implementation of this paradigm; I believe it may be the fifth, as Morgan Stanley had an Athena like project called Pioneer during Jay Dweck’s ill fated tenure. Apparently that project is now defunct. The Athena paradigm is technically a very powerful solution for trading businesses that have run on ad hoc solutions using Excel for pricing and risk. Partly because they seek to replace Excel based pricing and risk, and partly because it’s compute efficient, all implementations of the Athena SecDB/Slang paradigm implement Directed Acyclic Graphs. I’m guessing that Washington Square Tech will think this could be very appealing for buy side firms that don’t have big in house tech stacks, together with incumbent tech teams defending them against replacements.

Directed Acyclic Graphs (DAGs) are a powerful implementation technique for minimising the load of compute intensive tasks like pricing and risk, and this is one good reason why Excel has been so successful in this area, as Ben Lerner of DataNitro explains persuasively here. So it should be no surprise to see that Man Groups own open source Python codebases include a DAG implementation, MDF. The MDF docs include a very good illustration of the power of the DAG approach.

tornado & websockets

April 26, 2014

The core of my POC prototype is a server engine, which doesn’t really make for a great demo. Generally, people grasp concepts quicker if they can see a tangible realization. So I needed a realistic way to show live ticking data getting cranked out by the server. A browser GUI seemed a natural candidate. And being a Pythonista I wanted to do the server coding in Python. Until recently getting live ticking data pushed up to a browser was a big deal, requiring sophisticated server products like the Caplin Liberator, and rich GUI toolkits like Caplin Trader. Fortunately, it’s now possible to hack some demoware in the form of a live, ticking webpage using some really simple jQuery & websockets in the browser, and tornado on the server side. JavaScript and browser GUIs are not my forte, so I won’t comment any further, except to note how much easier it seems than five years ago. On the server side, though, I do have more experience. Back in 2000 I was doing server side web dev in Python using Zope. Zope is a very powerful system, featuring a built in Object DB and an inheritance by instance rather than class mechanism called acquisition. Consequently it has a rather steep learning curve. In recent years Plone has had some traction as a CMS built on top of Zope. In 2001/2 I discovered Twisted Matrix, a general networking toolkit you can use to build any IP based networking functionality. Again there’s a steep learning curve, but it’s much lighter than Zope, and is now very mature. I will be using Twisted to build a general socket server capability for my core product: I’ve got C++ and Python APIs, but I’ll need a socket server for Java support. But what I needed for my demo purpose was real time server push to the browser. And tornado proved to be a good choice. Simple, lightweight, lots of worked examples and focused entirely on websockets. It didn’t take long to get ticking data into a webpage. Recommended!

Nodally Xenograte

April 10, 2014

Well, xenograte from nodally sounds pretty cool: loosely coupled software components in the cloud. Brad Cox’s dream of snap together building blocks, finally realised. Yahoo Pipes, anyone? I’ve even hacked around similarly motivated code myself, but never got so far of course. The problem with these new paradigms is that they ask you to throw away all your old software assets so you can rebuild them again in the new framework. A bit like media companies asking us to buy the same content over and over on different formats: LPs, tapes, CDs, audio DVDs, downloads, pono, VHS, DVD, bluray….  Why can’t someone find a way to breath fresh life into existing assets without reengineering them. Why not, indeed?

POC M1 running

February 16, 2014

I’ve been building a proof of concept for a new product since last August. I’ve just got the first milestone running, which is a big, big step forward. When I’ve got M2 done it will be time to come out of stealth mode, get the message out, do some demos, and start fund raising so I can put a team together…

procmon

November 9, 2013

I’ve long been a fan of Mark Russinovich’s sysinternals utilities, especially procmon and procexp. Today I discovered that in procmon, when browsing filtered file system events, you can get a stack trace by right clicking on the event. Wow!  I didn’t know that. Very powerful diagnostic technique.

LNK1120

April 5, 2013

Just fixed one of those infuriating Visual Studio link errors. There’s lots of good stuff on stackoverflow, of course, but this one was a bit more subtle. It’s still probably quite common though, if you’re integrating code with differing build systems. The root of the problem is linking code built with Visual C++’s own STL with code using STLport. There are various tools and options in Visual Studio that will help you find the root of the error.

dumpbin: command line utility to dump all the symbols defined and referenced by a binary. You can use it on .exe, .lib and .obj files.

undname: command line utility to undecorate C++ names. Given a mangled name it will demangle it.

/P: compiler option to write pre-processor output to file as a .i file. Useful for seeing which headers are actually pulled into each unit of compilation.

/VERBOSE: linker option which will tell you which libraries and objects are searched, and which symbols they resolve.

 

Going off topic here, as this isn’t etrading related, but I’ll blog it as I suspect others might be having problems with Windows 7 and 8 PCs dropping connections to BT HomHub 2 modem routers. If you’re connection only lasts for a few minutes, and then shows as “limited connectivity” it could be because recent Windows does DHCP differently than Homehub. I found an MS KB article that suggests adding a registry flag in HKEY_LOCAL_MACHINE/SYSTEM/CurrentControlSet/Services/Tcpip/Parameters/Interfaces/{GUID}. The flag is called DhcpConnEnableBcastFlagToggle, it should be a DWORD, and be set to 1. There may be several GUIDs under Interfaces, depending on how many Wifi hubs your PC talks to. Look inside each and examine IP addresses to figure out which is the Homehub.

CPU cache size

September 21, 2012

Here’s some code I wrote back in 2009 to figure out the L1 and L2 cache sizes on a Windows host. It works by populating an area of memory with randomized pointers that point back into that same area. The randomization defeats stride based attempts by CPUs to predict the next cache access. The size of the memory region is stepped up in powers of two, and timings taken. The timings should show when the region size exhausts the L1 and L2 caches. This code illustrates why control of memory locality is so important in writing cache friendly code.

#include <iostream>
#include <cmath>
#include <windows.h>

typedef unsigned __int64 ui64;
typedef int* iptr;

const int _1K = 1 << 10;
const int _16M = 1 << 24;

iptr aray[_16M/sizeof(iptr)];

inline ui64 RDTSC( ) {
   __asm {
      XOR eax, eax        ; Input 0 for CPUID, faster than mv
      CPUID            ; CPUID flushes pipelined instructions
      RDTSC            ; Get RDTSC counter in edx:eax matching VC calling conventions
   }
}


int main( int argc, char** argv)
{
    // Ensure we're running on only one core
    DWORD proc_affinity_mask = 0;
    DWORD sys_affinity_mask = 0;
    HANDLE proc = GetCurrentProcess( );
    GetProcessAffinityMask( proc, &proc_affinity_mask, &sys_affinity_mask);
    std::cout << "proc_affinity_mask=" << proc_affinity_mask << ", sys_affinity_mask=" << sys_affinity_mask << std::endl;
    if ( proc_affinity_mask > 1) {
        if ( proc_affinity_mask & 2)
            proc_affinity_mask = 2;
        else
            proc_affinity_mask = 1;
        SetProcessAffinityMask( proc, proc_affinity_mask);
    }

    // avoid interrupts 
    SetPriorityClass( proc, REALTIME_PRIORITY_CLASS);
    
    // stepping up thru the candidate cache sizes
    for ( int bytes = _1K; bytes <= _16M; bytes *= 2) {

        // populate the array with ptrs back into the array
        int    slots = bytes/sizeof(iptr);
        int    slot = 0;
        iptr    start_addr = reinterpret_cast<iptr>( &aray[0]);
        iptr    end_addr = start_addr + slots;
        iptr    rand_addr = 0;
        iptr    addr = 0;

        std::cout << "slots=" << std::dec << slots << ", start_addr=" 
            << std::hex << start_addr << ", end_addr=" << end_addr << std::endl;

        // clear memory first so we can spot unpopulated slots below
        for ( addr = start_addr; addr < end_addr; addr++)
            *addr = 0;

        for ( addr = start_addr; addr < end_addr; addr++) {
            // pick a random slot in the array
            slot = int( float( slots) * float( rand( ))/ float( RAND_MAX));
            rand_addr = start_addr + ( slot == slots ? 0 : slot);

            // look for the next empty slot - nb we may need to wrap around
            while ( *rand_addr) {
                rand_addr++;
                if ( rand_addr >= end_addr)
                    rand_addr = start_addr;
            }
            *rand_addr = reinterpret_cast<int>( addr);
        }

        // sanity check
        for ( addr = start_addr; addr < end_addr; addr++) {
            if ( !*addr)
                std::cout << "empty slot at " << std::hex << addr << std::endl;
        }

        // now we're ready to ptr chase thru the array
        int accesses = int( 1e6);
        addr = aray[0];
        ui64 start_time = RDTSC( );
        for ( int i = 0; i < accesses; i++)
            addr = reinterpret_cast<iptr>( *addr);
        ui64 end_time = RDTSC( );

        ui64 cycles = end_time - start_time;
        float rw_cost = float( cycles)/float( accesses);
        std::cout << "size=" << std::dec << bytes << ", cycles=" 
                    << cycles << ", cost=" << rw_cost << std::endl;
    }

    return 0;
}

C++ Disruptor

September 20, 2012

I’ve collected some handy links on the Disruptor and lock free programming in general here. I’m coding up a Windows specific C++ Disruptor implementation at the moment, using the Interlocked* family of Win32 API functions. I’m using the volatile keyword, in the Microsoft sense to enforce the code ordering intent on the compiler, and the Interlocked functions to address cache coherence when there are multiple writers to, for instance, the RingBuffer cursor.