SpreadServe AMI part II

January 17, 2017

The core component in a SpreadServe deployment is the SpreadServeEngine, a headless C++ server binary that implements the Excel compatible calculation engine. The engine discovers its hostname through the win32 API using GetComputerNameExA( ComputerNameDnsFullyQualified, …). On AWS this was giving me hostnames like WIN-THU4IQNRN6F, when what I wanted was the fully qualified domain name, like ec2-54-186-184-85.us-west-2.compute.amazonaws.com. Harry Johnston helpfully advised on StackExchange that, since the host is not joined to a domain, GetComputerNameExA will only return the FQDN if I explicitly set it via Control Panel. Naturally I want to avoid manual fixes on a SpreadServe AMI, so I settled on using Amazon’s EC2 instance metadata. The FQDN can be discovered with an HTTP GET on this URL from any EC2 host: http://169.254.169.254/latest/meta-data/public-hostname. I built a small helper server process to query the instance metadata using Tornado’s AsyncHTTPClient and write it to the local file system, where the SpreadServeEngine can read it. Result: any new SpreadServe AMI will automatically discover its public DNS.
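
For the record, here’s a minimal sketch of that helper, assuming Tornado 4.x: it fetches the public hostname from the instance metadata endpoint and writes it to a local file for the engine to pick up. The output filename is illustrative, not the actual SpreadServe layout.

from tornado import gen, httpclient, ioloop

METADATA_URL = "http://169.254.169.254/latest/meta-data/public-hostname"
OUT_FILE = "public_hostname.txt"   # illustrative; the real engine reads its own config location

@gen.coroutine
def write_public_hostname():
    client = httpclient.AsyncHTTPClient()
    response = yield client.fetch(METADATA_URL)
    with open(OUT_FILE, "w") as f:
        f.write(response.body.decode("utf-8"))
    ioloop.IOLoop.current().stop()

if __name__ == "__main__":
    loop = ioloop.IOLoop.current()
    loop.add_callback(write_public_hostname)
    loop.start()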

SpreadServe AMI part I

January 16, 2017

Recently I’ve been working on building an EC2 AMI for SpreadServe, so deployment becomes a one click operation for Amazon AWS users. I ran into an interesting snag so I thought I’d capture it here. My aim was to deploy SpreadServe as a Windows Service on an AWS Windows Server 2012 R2 image, so I used pywin32’s excellent win32service module. Here’s my github boilerplate project for a Windows Service in Python. On my AWS host my SpreadServe Windows Service was failing to start, and leaving no trace in the system or application event logs. The pywin32 service framework has a debug mode; when I tried that I got a Windows 0xc000007b error, which indicates a mix of 32 and 64 bit binaries. SpreadServe is 32 bit all the way, so something was wrong. I turned to procmon to try and figure out what was failing. procmon showed that my 32 bit pythonservice.exe was loading a 64 bit python27.dll, instead of the 32 bit python27.dll that’s part of the SpreadServe install tree. The 64 bit DLL was coming from the C:\Program Files\Amazon\cfn-bootstrap directory, which is added to the standard Windows 2012 R2 image by Amazon to support CloudFormation, and is on the system path. After much experimenting I couldn’t find a way to stop the Windows Service Host from using the system path, so I had to edit the path, replacing the cfn-bootstrap entry with the SpreadServe directories. Problem solved…
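
For anyone heading down the same road, here’s a minimal sketch of a pywin32 Windows Service along the lines of that boilerplate project. The service name and logging are illustrative, not the actual SpreadServe service; running the script with the debug argument gives the console debug mode mentioned above.

import win32serviceutil
import win32service
import win32event
import servicemanager

class SpreadServeService(win32serviceutil.ServiceFramework):
    _svc_name_ = "SpreadServeSvc"                 # illustrative service name
    _svc_display_name_ = "SpreadServe Service"

    def __init__(self, args):
        win32serviceutil.ServiceFramework.__init__(self, args)
        # Signalled by SvcStop to end the main loop
        self.stop_event = win32event.CreateEvent(None, 0, 0, None)

    def SvcStop(self):
        self.ReportServiceStatus(win32service.SERVICE_STOP_PENDING)
        win32event.SetEvent(self.stop_event)

    def SvcDoRun(self):
        servicemanager.LogInfoMsg("SpreadServeSvc starting")
        # Real work would be kicked off here; this skeleton just waits for stop
        win32event.WaitForSingleObject(self.stop_event, win32event.INFINITE)
        servicemanager.LogInfoMsg("SpreadServeSvc stopped")

if __name__ == "__main__":
    # "python thisfile.py install" registers the service,
    # "python thisfile.py debug" runs it in a console for troubleshooting
    win32serviceutil.HandleCommandLine(SpreadServeService)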

Recently I’ve been rediscovering the fact that threading is hard. I’ve been extending the SpreadServe Addin to support Tiingo’s IEX market data feed. Real live ticking market data is usually only found inside investment banks, brokers and big hedge funds, as it takes a lot of cash and infrastructure to connect to exchanges directly or to subscribe via Reuters. Even newer internet contenders like xignite are very expensive. Tiingo’s IEX feed provides live ticking equity top of book data at an unprecedented price point. That is an exciting new development that I want to support in SSAddin. Coding it up has renewed my appreciation of how tricky multithreaded code can be. The SSAddin is implemented in C# and packaged as an XLL using ExcelDNA. As with any Excel XLL, the worksheet functions it defines are executed on the main Excel thread. If they are long running, they’ll block the GUI. So the worksheet functions pass off their work to a background thread. This means that SSAddin can do quandl and tiingo historical data queries without blocking the main Excel thread. Query results are cached, and there’s a set of worksheet functions to pull results out of the cache. So far so good. However, adding subscriptions to Tiingo’s IEX market data adds more complexity. In .NET, callbacks for web socket events are dispatched on pool threads. Ticking data is pushed back into Excel via RTD. So lots of lock statements are needed: to guard the queue that passes work from the Excel thread to the background thread, and to coordinate access to the subscription management data structures and the RTDServer between the background thread and the pool threads that dispatch the socket callbacks. All good fun, and it has prompted a few thoughts. Firstly, threading is hard! Secondly, I must get round to learning Rust and understanding the borrow checker. Thirdly, thank heavens for lock reentrancy in .NET!
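
The addin itself is C# on ExcelDNA, but the hand-off pattern is language neutral. A rough sketch in Python, with illustrative names, looks something like this: worksheet functions enqueue work and return immediately, a single background thread services the queue, and results land in a cache guarded by the same lock.

import collections
import threading

class BackgroundWorker:
    """Illustrative sketch of the Excel-thread / background-thread hand-off."""
    def __init__(self):
        self._lock = threading.Lock()
        self._work = collections.deque()
        self._cache = {}
        self._wakeup = threading.Event()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, key, func, *args):
        # Called on the "Excel" thread: enqueue and return, never block on the query
        with self._lock:
            self._work.append((key, func, args))
        self._wakeup.set()

    def result(self, key):
        # Called by the cache-reading worksheet functions
        with self._lock:
            return self._cache.get(key)

    def _run(self):
        while True:
            self._wakeup.wait()
            with self._lock:
                if not self._work:
                    self._wakeup.clear()
                    continue
                key, func, args = self._work.popleft()
            value = func(*args)          # long running query, off the GUI thread
            with self._lock:
                self._cache[key] = value

Add the web socket callbacks arriving on pool threads and the RTD pushes back into Excel, and the number of lock protected touch points multiplies quickly.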

I’ve been wanting to use RethinkDB for the cloud based SpreadServe service offering for some time, so when I heard the Windows version had gone into beta there was no excuse for further delay. NoSQL DBs are all the rage now, with Mongo, Cassandra, Couch and Redis to choose from. For me, RethinkDB stood out from the crowd for several reasons. Firstly, its changefeeds. All distributed systems have to resolve the challenge of keeping process caches in sync with the DB. I’ve seen two quality hand rolled solutions to this at major banks in the past few years: one based on SQL Server triggers that drove pub sub broadcasts of XML formatted updated or inserted rows, and, on JP Morgan’s Athena project, Twisted object serialisation used to update socket subscribers with recently changed objects. Both approaches scaled up well. What makes RethinkDB special is that it solves that problem for you out of the box with changefeeds. The second appealing feature of RethinkDB for me is that the Python API is a first class citizen and not an afterthought. The Python API’s event handling and coroutine implementation style is neatly integrated with Tornado, which I’m also using in SpreadServe. And thirdly, I liked the fact that RethinkDB’s core implementation is in C++, and is open source. Like RethinkDB, and like JP’s Athena for that matter, SpreadServe is C++ on the inside with Python APIs.
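
To make the changefeed point concrete, here’s a minimal sketch using the blocking Python driver; the database and table names are illustrative.

import rethinkdb as r

conn = r.connect(host="localhost", port=28015, db="test")

if "quotes" not in r.table_list().run(conn):     # illustrative table
    r.table_create("quotes").run(conn)

# The server pushes every insert, update and delete on the table down this
# cursor, so downstream caches stay in sync without polling or triggers.
for change in r.table("quotes").changes().run(conn):
    print(change["old_val"], "->", change["new_val"])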

So I’ve been working with the RethinkDB 2.3.0 beta build for Windows for a few days now. I’ve been delighted by several aspects of Rethink, and I’ve hit a few gotchas. I’ve also realised that the shift to coroutine based coding is a big, big deal. So let me lay that out here, for the record. First, the things that have delighted me…

And here are the gotchas that I hit…

  • When coding in Python, don’t forget a .run( ) on the end of your r.table( ).get( ) or r.table( ).insert( )
  • Method names aren’t consistent across APIs. For instance getAll( ) in JS is get_all( ) in Python. Even if you’re coding in Python, as I am, you’ll still find yourself using JS in the admin GUI’s data explorer, so this is an irritation.
  • You need Tornado 4 or better, as RethinkDB’s Tornado integration imports tornado.tcpclient, which isn’t in 3.x. It took me a while to track down because the server process I was connecting to RethinkDB from was exiting silently, with no trace of an import error in log or console. However, the Python docs do say that imp.load_module( ), as used in r.set_loop_type( ), can throw ImportError. Once I got a try/except clause around r.set_loop_type( ) I caught the exception and realised I needed to upgrade from Tornado 3.2 to 4.2.1. There’s a sketch of this and the first gotcha just after this list.
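
A short sketch of the first and third gotchas, with an illustrative table name:

import rethinkdb as r

# Gotcha 1: a ReQL expression is only a query object until .run( ) is called
conn = r.connect(host="localhost", port=28015, db="test")
query = r.table("quotes").insert({"sym": "AAPL", "px": 120.5})
query.run(conn)                  # forget .run(conn) and nothing reaches the server

# Gotcha 3: guard r.set_loop_type( ), which imports tornado.tcpclient under the
# covers and so raises ImportError on Tornado 3.x
try:
    r.set_loop_type("tornado")
except ImportError:
    print("RethinkDB's Tornado integration needs Tornado 4.x or better")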

Once I was past the gotchas I realised I needed to upgrade my coding style to embrace coroutines. They’ve been in Python since 2.7, and Tornado has adopted them. They’re all over the RethinkDB examples. I’ve been coding in a single or low threaded async callback style for at least ten years now, having realised that the multiple blocking worker thread approach is horribly inefficient and prone to deadlocks and races. But all my code has been very callback oriented, and coroutines are a big shift away from that. One of my big challenges over the last few days has been figuring out how to combine the two styles. I have my own C++ & Python framework which uses a single threaded async style. And I’ve got a load of Tornado based code in the same style. Now I need to combine that with RethinkDB code written in a coroutine style. I found this fantastic blog post with detailed commentary on refactoring a bunch of callback style Tornado code to use coroutines: https://emptysqua.re/blog/refactoring-tornado-coroutines/

It’s been invaluable. One mistake I’ve made is thinking that Rethink/Tornado coroutines can be invoked directly like generators. They can’t: you must use loop.add_callback( ) to schedule them. I’ll be back with more as I explore RethinkDB and coroutines further, and I aim to post more code samples like this gist of a minimal, complete Tornado web server with a RethinkDB changefeed.
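
In the meantime, here’s a minimal sketch of the add_callback( ) pattern: a RethinkDB changefeed coroutine, using the Tornado loop type, scheduled from callback style code. The table name is illustrative.

import rethinkdb as r
from tornado import gen, ioloop

r.set_loop_type("tornado")

@gen.coroutine
def watch_quotes():
    conn = yield r.connect(host="localhost", port=28015, db="test")
    feed = yield r.table("quotes").changes().run(conn)
    while (yield feed.fetch_next()):
        change = yield feed.next()
        print(change)

if __name__ == "__main__":
    loop = ioloop.IOLoop.current()
    # A coroutine can't be called like a plain function from callback style
    # code; hand it to the IOLoop and let it schedule the generator
    loop.add_callback(watch_quotes)
    loop.start()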

There are a couple of spreadsheets in the SpreadServe beta that illustrate point 3 (component reuse) from my recent Spreadsheets are code post. One of them – ycb_quandl_pub.xls – is running on the AWS host, and a recent post explained in detail how it uses Quandl data to drive QuantLib’s yield curve bootstrapping functions. ycb_quandl_pub.xls is paired with ycb_quandl_sub.xls. You can download both of them from here, and as their names suggest, ycb_quandl_pub.xls is a publisher, and ycb_quandl_sub.xls is a subscriber. ycb_quandl_pub.xls will run equally happily in Excel or SpreadServe, but it only becomes a reusable component when it’s running in SpreadServe. Try downloading ycb_quandl_sub.xls and running it in Excel on your desktop. You’ll need to install SSAddin to make it work. Then you’ll see that ycb_quandl_sub.xls is updated with the dates and rates of the bootstrapped curve calculated by ycb_quandl_pub.xls. You may see #N/A in the cells for a few minutes until the first tick arrives from the server, which recalcs every five minutes.

The s2cfg sheet in ycb_quandl_sub.xls configures the SSAddin to use its s2websock function to subscribe to the rates published by the RealTimeWebServer every time the ycb_quandl_pub.xls sheet hosted in a SpreadServeEngine instance recalculates. The RealTimeWebServer can support many subscribers, so all the logic in ycb_quandl_pub.xls, from Quandl, QuantLib and the worksheet formulae, is shared by all the subscribers. A user with edit permission could change some aspect of the model on the publisher side, the Interpolator or TermStructureCalendar perhaps, and all the subscribers would get the same updated data as a result. Those familiar with typical pricing engine architectures in investment banks will recognise the makings of a graph of pricing engines here. But the major difference is that no server side C++, C# or Java coding is necessary to make it happen. Graphs of quant or trader developed spreadsheets can be strung together very rapidly. The benefit of the spreadsheet level component reuse that SpreadServe makes possible should be apparent.

SpreadServe resources

August 7, 2015

In preparation for the launch of SpreadServe’s beta program I’ve added a page of resources to this blog. I’ve just finished moving the documentation on to readthedocs.org. It’s very cool to be able to edit the docs on my laptop, push the changes to github, and have them appear automatically, via webhook, on readthedocs. The source reStructuredText docs are on the SpreadServe github repository. Also on github is the SpreadServe Addin, which extends Excel with background thread quandl queries and cron like scheduled triggers. And there’s a link to the Amazon hosted instance running a yield curve bootstrapping sheet that automatically pulls depo, futures and swap rates from quandl. More on that in another post. Finally, there’s a link to the Google Group for SpreadServe. Please join the group if you’d like to download the SpreadServe beta and kick the tyres.

Python 3 & PyCharm

July 1, 2015

I’ve been coding in Python since 2000, and for a long time my dev env preferences didn’t change. Like many I used Python 1.5.2 with a basic text editor, often vim. Once the 2.x series of Python releases started I held off and stuck with 1.5.2; I never used 1.6.x. I can’t remember whether I made the jump to 2.1 or 2.2, but I’ve been on 2.x for years now, usually with notepad++ as my editor. Part of the reason is that it takes time for the extensive Python ecosystem to catch up and port all the libraries and frameworks. Anyway, I’ve just finished a contract where I used Python 3.3 and the PyCharm IDE, and it was a breath of fresh air. I’d never consider development in Java or C++ without an IDE, and my preferences are IntelliJ & MS Visual C++ respectively. Previously I’d felt an IDE was unnecessary in Python, mainly because the edit and test cycle is so quick. Unlike C++ the cycle is not edit, compile, link, test. In Python one just edits and tests, which makes the printf style of debugging far more effective. PyCharm turbocharges the debugging process with breakpoints and visual object graph traversal. And during coding it interactively highlights syntax errors and variable references. That’s a big time saver too, since code is far more likely to run at the first attempt without throwing syntax errors. +1 for PyCharm!

So what about the shift from Python 2.x to 3.x? For me the important point has been the move to more iterator based coding. The iteritems( )/iterkeys( )/itervalues( ) methods no longer exist, as items( )/keys( )/values( ) no longer return lists; they return iterable view objects. Those view objects are not stand-in replacements for lists. I also had to get used to using the next( ) builtin with generators. And, of course, print is now a function and no longer a statement. But apart from that it was straightforward.
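
A few lines capture most of those differences:

d = {"a": 1, "b": 2}

# 2.x iteritems( )/iterkeys( )/itervalues( ) are gone; items( )/keys( )/values( )
# now return view objects rather than lists
for k, v in d.items():
    print(k, v)                  # print is a function now, not a statement

keys = list(d.keys())            # materialise the view when a real list is needed

g = (n * n for n in range(5))
first = next(g)                  # the next( ) builtin replaces the old g.next( ) method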

Update 2015-07-01: I’ve just been pinged by an old coding compadre who downloaded PyCharm on my recommendation, and needs a tip on fixing up interpreter paths to pick up libs. I had to read a couple of StackOverflow articles to figure this out too, so I thought I’d document it here. I’m using PyCharm Community Edition 4.5.2, and to add libraries to my interpreter search path I go to the File/Settings dialog. In the left hand tree control, under the Project: <myproj> node, I select Project Interpreter. Then I click on the cog icon in the top right, next to the selected interpreter, and choose the More… option. This throws up another dialog: Project Interpreters. On the right are several icons. The bottom one is a mini tree control that shows a pop up tooltip saying “show paths for the selected interpreter”. Click on that, and finally you get the Interpreter Paths dialog, and you can add your library. Phew!! Could this config be buried any deeper? IntelliJ: sort it out! PyCharm is very, very good, but this is quite a usability flaw…

I’ve been doing a lot of Excel RTD addin coding recently, as I’ve been adding RTD support to SpreadServe. As part of that work I’ve developed two new addins, both of which I’ve posted on github. Of course, both addins work in Excel and SpreadServe. The first, SSAddin, supports quandl.com queries and Unix cron style timer events on background threads. Both these things can be done with VBA of course, and that’s how quandl’s existing Excel addin does it. However, SSAddin gives you the means to achieve automated, scheduled downloads from quandl with no Visual Basic and no manual keystrokes into a GUI. The second addin, kkaddin, is based on Kenny Kerr’s example C# RTD code. While I was researching RTD I read Kenny’s excellent material on the topic. John Greenan has some quality content on his blog too. However, I wasn’t able to find a single, simple download with C# boilerplate code that would build and run, so that’s what kkaddin addresses.

quandl badly formed URL

April 20, 2015

I’ve started working on some new code that pulls data from quandl, and I was getting this error…

 { "error":"Unknown api route."}

I was using the first example from quandl’s own API page

https://www.quandl.com/api/v1/WIKI/AAPL.csv

and googling didn’t turn up any answers. Fortunately the quandl folk responded on Twitter, and all’s well. The URL should be…

https://www.quandl.com/api/v1/datasets/WIKI/AAPL.csv

So I’m recording the issue here for any others that get stuck. Looks like “unknown api route”==”badly formed url”.
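
A quick sketch of pulling that dataset with the corrected URL, using the requests library:

import requests

url = "https://www.quandl.com/api/v1/datasets/WIKI/AAPL.csv"
resp = requests.get(url)
resp.raise_for_status()             # raises on HTTP errors such as the bad route above
print(resp.text.splitlines()[0])    # first line of the CSV: the header row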

Excel industrialisation

April 3, 2015

John Greenan has produced an excellent series of posts on Excel VBA Industrialisation on his blog. It’s a topic dear to me, so I figured I’d better respond. In his posts JG presents a series of VB Extensions based techniques to enable the export of embedded VB from a spreadsheet, so it can be version controlled, as well as techniques for error logging and reporting. The code is out there on github, and it’s a valuable addition to the public domain, especially since there are several commercial offerings addressing this space. For instance, spreadgit, ClusterSeven and Finsbury Solutions. JG kicks off his discussion in part one by observing that VBA is in the doldrums, and that the cool kids are using MEAN, Scala, OCaml or Haskell. Sure, the cool kids are never going to use VBA. But that’s not just because other languages are cooler, it’s because VBA and the latest programming languages are aimed at completely different audiences. Scala, OCaml & Haskell are for developers, and Excel is for non developers, end users, business users. The very reason for Excel’s phenomenal success and ubiquity is because it enables end users to create software solutions. Apparently there are eleven million professional software developers in the world. But even those eleven million can’t meet the world’s demand for software, so end users have to generate their own solutions, and they use Excel to do it. The result is, as JG points out in the comments to part six in his series: “In many cases the requirement for Excel Industrialisation is for a firm with an existing portfolio of ‘000s of spreadsheets that cannot all, in a cost-effective manner, be manually rewritten to conform to a coding standard.”

A version control system is an important part of controlling those portfolios of end user developed spreadsheets. However, it solves only part of the problem. Another major underlying factor that causes so many spreadsheet problems is their manual, desktop operation. Since Excel is a desktop application, Excel spreadsheets must be manually operated by their users. Users have to start up Excel, load the sheet, key in unvalidated data, hit F9, and then copy & paste or email the results out. All of that is error prone. And all of this manual operation is a major factor preventing any organised, systematic testing. All of these problems were writ large with the London Whale. All these problems could be resolved if we could decouple Excel as a development environment from Excel as a runtime. It’s great that end users can develop their own solutions in Excel, but it’s burdensome and error prone for those solutions to be operated manually on desktop PCs. Those solutions should be automated, resilient and scalable, and hosted by a server side runtime. That, of course, is SpreadServe.