November 21, 2016
The forthcoming cloud version of SpreadServe uses a Tornado-based server to persist a breakdown of all formulae used in a spreadsheet loaded by SpreadServe. For complex sheets I found that inserting many formulae into the formula table could be time-consuming. In one test scenario a multi-formula insert took 5 minutes. So I checked out RethinkDB’s troubleshooting page, where there are some useful performance tips. Batching the insertions with the recommended batch size of 200 brought the insert time down from 5 mins to 21 secs. Further improvements came from using soft durability and noreply, bringing the insert time down to ~3.5 secs. However, I found that my Tornado server couldn’t respond to incoming HTTP GETs while the insert coroutine was looping on the insert batches. I figured that noreply meant that the yield in the loop resumed immediately, without waiting for the reply IO from the DB, so the IO loop got no chance to service other requests. Taking out noreply allowed the single-threaded server to handle HTTP GETs in the middle of an insert. If improved performance is necessary in future, splitting the Tornado server into two processes may be the way to go, but for current test scenarios performance is acceptable.
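The batching described above can be sketched roughly as follows. This is a minimal illustration and not the SpreadServe code: the `formula` table name, the helper names and the blocking call style are assumptions, and the driver module `r` is passed in because the way it is imported differs between driver versions. In a Tornado coroutine each `run( )` would be yielded instead of called directly.

```python
def batches(rows, size=200):
    """Yield successive lists of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def insert_in_batches(r, conn, rows, table="formula", size=200):
    """Insert rows batch by batch with soft durability.

    `r` is the RethinkDB driver module and `conn` an open connection.
    The table name "formula" is a hypothetical placeholder.
    """
    for batch in batches(rows, size):
        # durability="soft" acknowledges the write before it reaches disk.
        # noreply is deliberately left out so each batch waits for the DB.
        r.table(table).insert(batch, durability="soft").run(conn)
```

With 450 rows, `batches( )` yields chunks of 200, 200 and 50, so the insert loop makes three round trips rather than 450.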
May 26, 2016
Recently I’ve been rediscovering the fact that threading is hard. I’ve been extending the SpreadServe Addin to support Tiingo’s IEX market data feed. Real live ticking market data is usually only found inside investment banks, brokers and big hedge funds, as it takes a lot of cash and infrastructure to connect to exchanges directly or to subscribe via Reuters. Even newer internet contenders like xignite are very expensive. Tiingo’s IEX feed provides live ticking equity top of book data at an unprecedented price point. That is an exciting new development that I want to support in SSAddin. Coding it up has renewed my appreciation of how tricky multithreaded code can be. The SSAddin is implemented in C# and packaged as an XLL using ExcelDNA. As with any Excel XLL, the worksheet functions it defines are executed on the main Excel thread. If they are long running, then they’ll block the GUI. So the worksheet functions pass off their work to a background thread. This means that SSAddin can do Quandl and Tiingo historical data queries without blocking the main Excel thread. Query results are cached, and there’s a set of worksheet functions to pull results out of the cache. So far so good. However, adding subscriptions to Tiingo’s IEX market data adds more complexity. In .NET, callbacks for web socket events are dispatched on pool threads. Ticking data is pushed back into Excel via RTD. So lots of lock statements are necessary to coordinate access to the queue for passing work from the Excel thread to the background thread, and for coordinating access to subscription management data structures and the RTDServer between the background thread and the pool threads that dispatch the socket callbacks. All good fun, which has prompted a few thoughts. Firstly, threading is hard! Secondly, I must get round to learning Rust and understanding the borrow checker. Thirdly, thank heavens for lock reentrancy in .NET!
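The hand-off from the Excel thread to the background thread can be sketched in Python as a stand-in for the C# design. Everything here is illustrative: the class and method names are hypothetical, and the real SSAddin also has to coordinate with pool threads and the RTD server, which this sketch leaves out.

```python
import queue
import threading


class BackgroundWorker:
    """Runs queued jobs on one background thread and caches their results."""

    def __init__(self):
        self._work = queue.Queue()      # thread-safe hand-off queue
        self._cache = {}                # results keyed by query id
        self._lock = threading.RLock()  # reentrant, like a C# lock statement
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, key, job):
        """Called on the GUI thread: enqueue the job and return at once."""
        self._work.put((key, job))

    def result(self, key, default=None):
        """Called on the GUI thread: read a cached result without blocking on the job."""
        with self._lock:
            return self._cache.get(key, default)

    def drain(self):
        """Block until every queued job has finished."""
        self._work.join()

    def stop(self):
        """Ask the background thread to exit and wait for it."""
        self._work.put(None)
        self._thread.join()

    def _run(self):
        while True:
            item = self._work.get()
            if item is None:            # sentinel: shut down
                self._work.task_done()
                break
            key, job = item
            value = job()               # the slow part runs off the GUI thread
            with self._lock:
                self._cache[key] = value
            self._work.task_done()
```

A caller would `submit( )` a query from the worksheet function, return immediately, and later pull the value out of the cache with `result( )`.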
April 21, 2016
I’ve been wanting to use RethinkDB for the cloud-based SpreadServe service offering for some time, so when I heard the Windows version had gone into beta there was no excuse for further delay. NoSQL DBs are all the rage now, with Mongo, Cassandra, Couch and Redis to choose from. For me, RethinkDB stood out from the crowd for several reasons. Firstly, its changefeeds. All distributed systems have to resolve the challenge of keeping process caches in sync with the DB. I’ve seen two quality hand-rolled solutions to this at major banks in the past few years. One was based on SQL Server triggers that caused pub sub broadcasts of XML formatted updated or inserted rows. And on JP Morgan’s Athena project I saw Twisted object serialisation used to update socket subscribers with recently changed objects. Both approaches scaled up well. What makes RethinkDB special is that it solves that problem for you out of the box with changefeeds. The second appealing feature of RethinkDB for me is that the Python API is a first class citizen and not an afterthought. The Python API’s event handling and coroutine implementation style is neatly integrated with Tornado, which I’m also using in SpreadServe. And thirdly, I liked the fact that RethinkDB’s core implementation is in C++, and is open source. Like RethinkDB, and like JP’s Athena for that matter, SpreadServe is C++ on the inside with Python APIs.
So I’ve been working with the RethinkDB 2.3.0 beta build for Windows for a few days now. I’ve been delighted by several aspects of Rethink, and I’ve hit a few gotchas. I’ve also realised that the shift to coroutine based coding is a big, big deal. So let me lay that out here, for the record. First, the things that have delighted me…
- Very simple install process
- Nice docs
- Great admin UI: the data explorer is very good.
And here are the gotchas that I hit…
- When coding in Python, don’t forget a .run( ) on the end of your r.table( ).get( ) or r.table( ).insert( )
- Method names aren’t consistent across APIs. For instance getAll( ) in JS is get_all( ) in Python. Even if you’re coding in Python, as I am, you’ll still find yourself using JS in the admin GUI’s data explorer, so this is an irritation.
- You need tornado 4 or better as RethinkDB’s Tornado integration imports tornado.tcpclient, which isn’t in 3.x. It took me a while to track down as the server process which I was connecting to RethinkDB was exiting silently, with no trace of an import error in log or console. However, Python docs do say that imp.load_module( ), as used in r.set_loop_type( ) can throw ImportError. Once I got a try/except clause around r.set_loop_type( ) I caught the exception and realised I needed to upgrade from Tornado 3.2 to 4.2.1.
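The last gotcha can be guarded against with a small wrapper like the one below. It is a sketch only: the function and logger names are hypothetical, and the driver module is passed in as `r` so the wrapper does not depend on how the driver is imported.

```python
import logging

log = logging.getLogger("rethink_setup")  # hypothetical logger name


def enable_tornado_loop(r):
    """Switch the RethinkDB driver to its Tornado integration.

    Returns True on success. Logs and returns False if the integration
    cannot be imported, for example because the installed Tornado is
    too old to provide tornado.tcpclient.
    """
    try:
        r.set_loop_type("tornado")
    except ImportError as exc:
        log.error("RethinkDB Tornado integration failed to import: %s", exc)
        return False
    return True
```

Checking the return value at startup turns a silent exit into a logged, diagnosable failure.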
Once I was past the gotchas I realised I needed to upgrade my coding style to embrace coroutines. Generator-based coroutines have been in Python since 2.5, and Tornado has adopted them. They’re all over the RethinkDB examples. I’ve been coding in a single or low threaded async callback style for at least ten years now, having realised that the multiple blocking worker thread approach is horribly inefficient and prone to deadlocks and races. But all my code has been very callback oriented, and coroutines are a big shift away from that. One of my big challenges over the last few days has been figuring out how to combine the two styles. I have my own C++ & Python framework which uses a single threaded async style. And I’ve got a load of Tornado-based code in the same style. Now I need to combine that with RethinkDB code written in a coroutine style. I found this fantastic blog post with detailed commentary on refactoring a bunch of callback style Tornado code to use coroutines: https://emptysqua.re/blog/refactoring-tornado-coroutines/
It’s been invaluable. One mistake I’ve made is thinking that Rethink/Tornado coroutines can be invoked directly like generators. They can’t; you must use loop.add_callback( ) to schedule them. I’ll be back with more as I explore RethinkDB and coroutines further, and I aim to post more code samples like this gist of a minimal, complete Tornado Web Server with RethinkDB changefeed.
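One way to combine the callback and coroutine styles is to wrap a callback-style call in a future that a coroutine can wait on. The sketch below uses the standard library’s asyncio as a stand-in for Tornado’s coroutines, and requires Python 3.7 or later. `fetch_with_callback` is a hypothetical callback-style API invented for the illustration.

```python
import asyncio


def fetch_with_callback(key, callback):
    """Hypothetical callback-style API: delivers its result later via callback."""
    loop = asyncio.get_running_loop()
    loop.call_later(0.01, callback, key.upper())


async def fetch(key):
    """Coroutine-style wrapper around the callback API."""
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    # The callback resolves the future, which resumes the awaiting coroutine.
    fetch_with_callback(key, future.set_result)
    return await future


async def main():
    # Reads top to bottom, with no nested callbacks.
    first = await fetch("a")
    second = await fetch("b")
    return [first, second]


result = asyncio.run(main())
```

The existing callback code stays untouched, and only the thin `fetch( )` wrapper knows about both styles.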
October 13, 2015
Recently Microsoft has added support for containers to Windows Server. It’s available on the Azure cloud on VMs running Windows Server 2016 Tech Preview 3. I’ve been playing with it and I’ve got SpreadServe running inside a container. There’s much more detail here. But to summarise I found three workarounds were necessary…
- A two step process to build images, as Windows containers don’t like SpreadServe’s NSIS installer
- Web server inside the container should be on port 80 only internally
- A one line launch script that sets up environment variables is necessary
October 5, 2015
Recently I’ve been using the excellent Very Sleepy profiler to performance tune the SpreadServeEngine’s loader, compiler and interpreter. One of our beta users had helpfully supplied a very large spreadsheet which was causing very long load and calc cycles. Back in the 90s you had to spend serious money on licenses for Pure Software’s Purify and Quantify tools for this kind of work. Now tools like Dr Memory and Very Sleepy are free and OSS. It’s a while since I did a real, systematic performance profiling and tuning exercise, and I was soon reminded of how quickly preconceptions about which parts of the code might be CPU hogs can be shattered. It wasn’t long before I was nose to nose with one of the eternal truths of C++ development, or indeed development in any language: malloc & free are expensive. That’s why the LMAX team coded their own Java Collections. Printf isn’t cheap either. The answer was to introduce memory pooling for many of the most heavily used compiler and interpreter classes, and to set up config switches for the interpreter tracing. Interpreter tracing needs to be available in release builds as it is a tremendously useful way of looking inside the execution of your spreadsheet. The result was a thirty fold improvement in load time, and much snappier calc cycles on very large sheets.
September 19, 2015
September 10, 2015
August 14, 2015
There are a couple of spreadsheets in the SpreadServe beta that illustrate point 3 (component reuse) from my recent Spreadsheets are code post. One of them, ycb_quandl_pub.xls, is running on the AWS host, and a recent post explained in detail how it uses Quandl data to drive QuantLib’s yield curve bootstrapping functions. ycb_quandl_pub.xls is paired with ycb_quandl_sub.xls. You can download both of them from here, and as their names suggest, ycb_quandl_pub.xls is a publisher, and ycb_quandl_sub.xls is a subscriber. ycb_quandl_pub.xls will run equally happily in Excel or SpreadServe, but it only becomes a reusable component when it’s running in SpreadServe. Try downloading ycb_quandl_sub.xls and running it in Excel on your desktop. You’ll need to install SSAddin to make it work. Then you’ll see that ycb_quandl_sub.xls is updated with the dates and rates of the bootstrapped curve calculated by ycb_quandl_pub.xls. You may see #N/A in the cells for a few minutes until the first tick arrives from the server, which recalcs every five minutes. The s2cfg sheet in ycb_quandl_sub.xls configures the SSAddin to use its s2websock function to subscribe to the rates published by the RealTimeWebServer every time the ycb_quandl_pub.xls sheet hosted in a SpreadServeEngine instance recalculates. The RealTimeWebServer can support many subscribers, so all the logic in ycb_quandl_pub.xls from Quandl, QuantLib and the worksheet formula is shared by all the subscribers. A user with edit permission could change some aspect of the model on the publisher side, the Interpolator or TermStructureCalendar perhaps, and all the subscribers would get the same updated data as a result. Those familiar with typical pricing engine architectures in investment banks will recognise the makings of a graph of pricing engines here. But the major difference is that no server side C++, C# or Java coding is necessary to make it happen.
Graphs of quant or trader developer spreadsheets can be strung together very rapidly. The benefit of the spreadsheet level component reuse that SpreadServe makes possible should be apparent.
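The publisher and subscriber relationship can be pictured with a toy in-process sketch. It is purely illustrative and says nothing about how the RealTimeWebServer or s2websock are implemented: the class and method names are hypothetical, and a plain function call stands in for the web socket.

```python
class Publisher:
    """Holds the model logic once and fans each recalculation out to subscribers."""

    def __init__(self, calc):
        self._calc = calc          # the shared model, e.g. a curve bootstrap
        self._subscribers = []

    def subscribe(self, callback):
        """Register a callable that receives every published result."""
        self._subscribers.append(callback)

    def recalc(self, inputs):
        """Run the model once and push the same result to every subscriber."""
        result = self._calc(inputs)
        for callback in self._subscribers:
            callback(result)
        return result
```

Changing the `calc` function on the publisher side changes what every subscriber receives, without touching any subscriber.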
August 13, 2015
Felienne Hermans has made it her mission to point out that “spreadsheets are code”. She’s most definitely right about that, and a whole host of the other consequences that she draws from that insight, specifically that we should apply the techniques developed by mainstream software engineering to spreadsheets: version control, testing and design guidelines for clean structure, like the FAST standard. Whenever you create a sheet with formulae in it you’re programming. Ignoring that fact is one of the reasons spreadsheet disasters keep happening. I couldn’t agree more with Prof Hermans on that score. But I think we need to go further in the comparison of spreadsheets with code, and point out some major differences.
- Conventional code, when deployed to its production runtime environment does not come with an IDE that enables any user to change the implementation! A trader can’t reach inside his Bloomberg or TradeWeb terminal and change its implementation. But Excel allows any user to change any formula in a financial model.
- Conventional code enables reuse through components. Each Excel spreadsheet is like an island, and monolithic. How can spreadsheets be composed together to draw input and feed output to each other? Only with manual, error prone operations.
- Unit testing: the unit testing philosophy calls for any significant component to have a set of separate test code that proves compliance with pre and post conditions as well as yielding specified results. Also required is the ability to run a set of tests automatically and record the results. All of that is a capability that Excel simply doesn’t have.
To realise points 1 to 4 for spreadsheets we need an alternate run time that can host spreadsheets on a server, and decouple the financial logic expressed in worksheet formulae, VBA & XLLs from the user interface. In the next post I’ll give more detail on how SpreadServe solves all the issues raised above.
August 8, 2015
In yesterday’s post I promised to give more detail on the Yield Curve Bootstrapping sheet running on the Amazon hosted SpreadServe instance. If you’d like to try running the sheet on your own desktop you can download it from the repository; just click on ycb_quandl_pub.xls. To run the sheet in your own Excel you’ll need to download the QuantLib and SpreadServe addins. ycb_quandl_pub.xls is based on one of QuantLibXL’s example spreadsheets, YieldCurveBootstrapping.xls, which gives a sample QuantLib Excel solution to a common fixed income rates maths problem: bootstrapping a yield curve. If you look at the original sheet you’ll see that all input data is present as simple cell values. To change it you must rekey it. Ideally this would be automated, so that deposit, futures and swap rates could be regularly pulled from a clean data source, and the bootstrapping results recalculated and published. ycb_quandl_pub.xls uses the SpreadServe Addin to pull the depo, futures and swap rates from quandl. Look at the top left block on the Quandl sheet within the ycb_quandl_pub workbook to see the invocations of the s2quandl function that pull the rates into the sheet from quandl.com. Lower down on the same sheet you can see the s2cron invocation that schedules a timer to go off every 5 minutes and trigger a new download of the same data. The same trigger is used as input to QuantLib’s qlPieceWiseYieldCurve function on the Bootstrapping sheet to force a recalculation when freshly downloaded data arrives. All that is great for automating an Excel spreadsheet. With SpreadServe we can take it one step further and get the sheet off the desktop and onto a server. The whole process is then automated, centralised and freed from possible manual disruption on the desktop.
NB QuantLib date calcs mean the results of this sheet are only good on weekdays, Mon-Fri, and not Sat or Sun.
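For readers unfamiliar with the term, the idea behind bootstrapping can be shown with a deliberately simplified sketch. This is not QuantLib’s algorithm and not what the sheet does: it assumes annual-pay par swap rates at consecutive yearly maturities, ignores day counts, calendars, deposits and futures, and all function names are hypothetical.

```python
def bootstrap_discount_factors(par_rates):
    """Solve for discount factors one maturity at a time.

    par_rates[i] is the par swap rate for maturity i + 1 years. A par swap
    satisfies rate * sum(DF_1..DF_n) + DF_n = 1, so each new discount factor
    follows from the ones already solved.
    """
    dfs = []
    for rate in par_rates:
        annuity = sum(dfs)  # discount factors of the earlier coupon dates
        dfs.append((1.0 - rate * annuity) / (1.0 + rate))
    return dfs


def zero_rates(dfs):
    """Convert discount factors to annually compounded zero rates."""
    return [df ** (-1.0 / (n + 1)) - 1.0 for n, df in enumerate(dfs)]
```

As a sanity check, a flat par curve should bootstrap to a flat zero curve at the same rate.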