
Ultra Low Latency (ULL) Market Data – Current State and Future

Still a DRAFT

All points I make are exclusively my own, not my employer's.  I mention vendor solutions but do not endorse any vendor.  This white paper will be used in my NYU Summer 2017 class, Ultra Low Latency Architectures for Electronic Trading.

Ted Hruzd


Introduction

 

An increasing number of electronic trading (ET) firms rely on market data architectures that are integral to single-microsecond (µs) Tick-to-Trade (T2T) processing times.  Some architectures now deliver T2T times of under 1 µs.

 

How do we define T2T?

T2T is the elapsed time from T0 to T1, where T0 is the time a trading application receives a market data quote:

  • directly from a Trading Venue (TV),
  • or from a Securities Information Processor (SIP) that creates consolidated quotes (the Consolidated Quote System (CQS) and the UTP Quote Data Feed (UQDF)) from TV direct feeds,

and T1 is the time the trading application is at the point of sending an order to a TV.
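
To make the definition concrete, here is a minimal Python sketch of the T2T arithmetic (T2T = T1 - T0).  It is illustrative only – a production ULL system timestamps in hardware, not in an interpreted language – and the order-building logic is hypothetical:

    import time

    def build_order(quote):
        # hypothetical trading logic + pre-trade risk checks
        return {"side": "BUY", "px": quote["ask"], "qty": 100}

    def on_quote(quote, send_order):
        t0 = time.perf_counter_ns()   # T0: quote received by the trading app
        order = build_order(quote)
        t1 = time.perf_counter_ns()   # T1: order at the point of being sent to the TV
        send_order(order)
        return t1 - t0                # T2T in nanoseconds

    t2t_ns = on_quote({"bid": 99.99, "ask": 100.01}, send_order=lambda o: None)
    print(f"T2T = {t2t_ns} ns")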

 

In CoLo environments, multi-layer switches with FPGAs connect to trading venues and fan out market data in 5 ns directly to FPGA devices that handle not only market data normalization but also all order and risk-flow tasks leading to order generation (T2T).  Other CoLo environments fan out the market data to multi-processor Intel Linux servers, tuned to the maximum and programmed with deep vectors (Intel AVX-512) to increase instructions per clock cycle and with Intel TBB to maximize concurrent multi-threaded processing, up to the point of order generation (T2T).

 

All serious ULL trading apps strive for deterministic latencies in order to attain the best trades even during high-volume periods.  FPGA processing yields deterministic latencies, processing at line speed.  Hence, most ULL applications rely at least in part on FPGAs.

 

A major driver for deterministic ULL is the significant enhancement of Big Data (BD) / Machine Learning (ML), creating alpha-seeking opportunities that at times last only microseconds.  FPGAs, GPUs, high-speed memory, and robust new ML APIs are the root cause.  Aggressive firms increase their odds of catching alpha by architecting their infrastructure and application software for deterministic latencies, with the market data component key not only to single-digit-µs T2T but also to acting on timely market data with confidence in a high percentage of successful fills.  This new competitive landscape has led to a new speed concept we refer to as Meta-Speed (information about speed, its relevance, and how to act on it – ex: yes/no for specific trades).

 

All major ET applications track their trading-partner fill rates and latencies with OLAP-style, multi-dimensional metrics (time bucketed to a second or less, symbol, bid-ask spreads, volume, price and volume volatility, etc.), both real-time and historical.  Decisions on whether to trade, based on internal market data latencies and the expected order-ack and execution times of trading venues, are most critical to market makers, given their risky, ULL-sensitive drive to generate revenue while making markets.  Key themes and details behind these trends follow, including an appendix for reference and deeper research.
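
As a hedged illustration of such multi-dimensional roll-ups, the following pandas sketch computes fill rate and 99th-percentile order-ack latency per venue, symbol, and second; the column names and sample rows are invented:

    import pandas as pd

    # hypothetical execution log: one row per order sent
    df = pd.DataFrame({
        "venue":  ["NYSE", "NYSE", "BATS", "BATS"],
        "symbol": ["IBM", "IBM", "IBM", "IBM"],
        "ts": pd.to_datetime(["2017-01-27 09:30:00.1", "2017-01-27 09:30:00.9",
                              "2017-01-27 09:30:01.2", "2017-01-27 09:30:01.7"]),
        "filled": [1, 0, 1, 1],                  # 1 = order filled
        "ack_us": [85.0, 410.0, 95.0, 102.0],    # order-ack latency in microseconds
    })

    metrics = (df.groupby(["venue", "symbol", df["ts"].dt.floor("1s")])
                 .agg(fill_rate=("filled", "mean"),
                      p99_ack_us=("ack_us", lambda s: s.quantile(0.99))))
    print(metrics)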

 

 

Tick to Trade (T2T) core components after the trading application receives a quote:

 

  • Market data feed processing / normalization that produces consolidated data (distributed right away to subscribers) or direct-feed data (likewise distributed right away to subscribers, and/or used to build multi-level order books that subscribers can use in multiple ways – see the order book sketch below)
  • Concurrently (multi-processing, using FPGAs and/or Intel cores):
    1. FIX (and non-FIX protocol) order message processing that includes trading logic (ex: algos or simple direct orders), pre-trade risk checks, identifying the TVs to send to, and connecting to 'exchange connectivity' software & infrastructure, with the order(s) ready to be routed over the network to a TV.
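
For illustration, a minimal Python sketch of the multi-level book build mentioned above; in a ULL deployment this logic increasingly lives in an FPGA, and the update message layout here is hypothetical:

    # price level -> aggregate size, one book per symbol
    bids, asks = {}, {}

    def apply_update(side, price, size):
        """Apply a direct-feed depth update; size 0 deletes the level."""
        book = bids if side == "B" else asks
        if size == 0:
            book.pop(price, None)
        else:
            book[price] = size

    def top_of_book():
        best_bid = max(bids) if bids else None
        best_ask = min(asks) if asks else None
        return best_bid, best_ask

    apply_update("B", 100.00, 500)
    apply_update("S", 100.02, 300)
    print(top_of_book())   # (100.0, 100.02)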

A leading T2T solution (under 1 µs) is from Algo-Logic.

[Figure: Algo-Logic sub-microsecond Tick-to-Trade architecture]

Details of a sub-microsecond T2T for CME market data / order flow: [1]

KEY THEMES

 

Meta-Speed

Speed1 is raw speed; Speed2 revolves around "meta-speed" (information about speed).  An important aspect of Speed2 is deterministic speed.  Speed2 is more important today than Speed1, with deterministic market data a key component.

 

Importance of Metrics

Metrics regarding speed (including market data speed, and especially tick-to-trade or "T2T") are critical to maximizing fill rates and thus trading revenues:

  • a large global investment bank has stated that every millisecond lost results in $100m per annum in lost opportunity. [1]

 

Alpha-Seeking

Alpha trading opportunities can last only a few milliseconds, and at times only microseconds, even nanoseconds now, due in part to faster real-time market data and Big Data analytics:

  • Real-time Big Data (Machine Learning), some of it in the cloud, can now send more time-sensitive alpha trading signals over high-speed interconnects to trading apps.

 

 

The Pool for non-'Low Latency' Trading is Shrinking

Infrastructure and application development continue to improve, speed up, and become deterministic across all aspects of electronic trading (ET), including market data.  This includes Nasdaq's recent SIP implementation, with market data latency now (per FIF November stats) at approximately 20 microseconds, down from a prior 350 microseconds.  At present, buy-side and execution brokers (sell-side) with 'deterministic' market data latencies (exchange market data egress → network → market data ticker → app) of 1-2 milliseconds may still be relevant, but less so, with decreasing opportunities.

Key Questions

 

What does optimal access to high-speed / low-latency market data look like? Why is it important to achieve this?

A concise answer:

  • Leaders tend to be collocated (CoLo) with trading venues (TVs).  This significantly decreases market data latency from the TVs to your infrastructure.  Therefore, set up trading applications collocated with trading venues for ULL market data (direct feeds) and order flow.
  • A fully FPGA-based market data / order flow infrastructure allows Tick-to-Trade latencies in single-digit µs, even just under 1 µs.  The flow:
    • Exchange market data egress → CoLo network → App (can be 1 µs or less)

 

  • Commonly used CoLo managed services include Options-IT and Pico Trading.

 

  • An innovative ULL network featuring a proprietary software-defined networking (SDN) solution for connecting trading clients/applications to TVs is available from Lucera.  One can consider this a ULL extranet.  Other extranets are available; negotiate quality of service with them very meticulously.
  • FPGAs (in switches, appliances, NICs) are ideal for data transformation – hence a popular choice for feed handlers.  FPGAs in appliances, switches, and NICs are increasingly used for ingress and egress of market data, data normalization, order book builds, and analytics, as FPGAs can be programmed for less market data jitter than multi-threaded software on server CPU cores.
  • Use direct feeds over consolidated feeds; build your own BBO (ex: UBBO or User BBO vs NBBO), as you will then receive direct feeds at the same time the consolidated-feed vendors receive them (see the UBBO sketch after this list).  Exegy appliances calculate a UBBO before forwarding the NBBO; Exegy and all market data tickers receive the NBBO via an extra hop at the SIP infrastructure, which calculates the NBBO (CTA and UTP) for subscribers.

 

  • Leaders use multi-layer, FPGA-based L1-L3 switches.  These are ideal in CoLo.  A leading vendor is Metamako: their MetaMux switches aggregate market data and order flow at Layer 2 in 70 ns, and market data fan-outs are done at Layer 1 in 5 ns (ideal for subscribers who can then further normalize market data via FPGAs).  Multi-layer (L1-L3) switches can also contribute significantly to deterministic latencies, critical for competitive trading advantage.

 

  • Some ULL infrastructures utilize appliances with integrated L1-L3 switches, FPGAs, and Intel cores.  Such appliances are available from Metamako and ExaBlaze as another alternative for consolidated market data and order flow processing, via multi-threaded FPGA and Intel-core programming.
  • ULL architectures tend to reside on flatter networks with fewer hops; this includes a trend toward single-tier networks.  Therefore, strongly consider a single-tier network design with software-defined networking (SDN) to optimize bandwidth and dynamically change and optimize routes for deterministic performance (this may work well for market data analytics).
  • ULL architectures track latency metrics in real time, addressing Speed2 or Meta-Speed via appliances such as Corvil.  What is meta-speed?
    • Corvil and Tabb coined 'Speed2' or 'meta-speed'.  Briefly, it references the decision point of whether or not to send an order.  If the order is acting on market data whose latency implies a high probability of a fill, the firm will tend to send it; otherwise it holds off.  This is most critical to market-making firms.  We detail meta-speed further later.
    • ULL ET firms use historical and real-time Big Data analytics / Machine Learning to address the challenges of meta-speed.
  • If FPGA solutions are not utilized, then strict attention to the following best practices is required:
    • Kernel bypass must be implemented.
    • NICs must be segmented (separate NICs for market data & order flow).
    • Market data infrastructures must be performance optimized / tuned (ex: a multi-threaded TREP hub and servers).
      • Ex: tune the Reuters Elektron/TREP configuration with the optimal number of threads for TREP hubs (ADH) and servers (ADS), maximizing market data throughput.
    • Conduct capacity planning (all infrastructure: bandwidth, routers, switches, servers, middleware, client apps, caches); eliminate all multicast broadcast storms.  This is good advice for any infrastructure.
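
As a concrete illustration of the 'build your own BBO' point above, a minimal Python sketch that derives a user BBO (UBBO) from per-venue top-of-book quotes received over direct feeds; the quote layout is hypothetical:

    # latest top of book per venue, maintained from direct feeds
    books = {
        "NYSE": {"bid": 100.00, "ask": 100.03},
        "BATS": {"bid": 100.01, "ask": 100.02},
    }

    def ubbo(books):
        """User BBO: best bid/ask across venues, computed locally
        without waiting for the SIP's extra hop."""
        venue_bid, best_bid = max(books.items(), key=lambda kv: kv[1]["bid"])
        venue_ask, best_ask = min(books.items(), key=lambda kv: kv[1]["ask"])
        return (venue_bid, best_bid["bid"]), (venue_ask, best_ask["ask"])

    print(ubbo(books))   # (('BATS', 100.01), ('BATS', 100.02))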

 

The following is a quote from a vendor that is not the fastest market data provider:

“whether you are running Alpha-seeking strategies or seeking best Ex for your agency business, consistently fast market data is a foundational requirement.”

 

The diagram below depicts a CoLo ULL market data ingress/fan-out plus (market data normalization / order flow) solution.  The latter can be fully FPGA-based (ex: Algo-Logic's CME infrastructure) or a part-FPGA, part-Intel-core solution optimized for multi-threaded programming (TBB) and deep instruction vectors (AVX-512, allowing more instructions per clock cycle than the prior 256-bit AVX).

Metamako MetaMux switches with (market data normalization / order flow) in the middle (diagram below): [2]

 

[Figure: Metamako MetaMux-based CoLo market data fan-out, with market data normalization / order flow processing in the middle]

Advances in Kernel Bypass:

A full TCP stack implemented on an FPGA-based NIC bypasses both kernel (system) and user space, enabling deterministic sub-microsecond network I/O.  Note the progression from the 1st to the 2nd to the 3rd diagram, and the comments by Enyx's Laurent de Barry:

[Figure: progression of kernel bypass – from the standard kernel stack, to a user-space stack, to a full TCP stack on an FPGA-based NIC]

 

Who are some of the leading vendors in this space?

  • Metamako
  • Intel/Altera
  • Algo-Logic
  • NovaSparks
  • Exegy
  • SolarFlare
  • ExaBlaze
  • Enyx
  • Mellanox
  • Fixnetix
  • Arista
  • Cisco
  • Corvil
  • Options-IT
  • PicoTrading
  • Plexxi Networks

 

 

Why is it important to maintain deterministic latencies, with market data a key component?

 

End-to-end connectivity, infrastructure, and software must be architected not only for speed (ex: tick-to-trade or T2T times in micro-, not milliseconds) but also for deterministic speed.  Deterministic speed matters because serious traders have, for over a decade, tracked latencies and fill rates with their trading partners.  Technology is now available to largely eliminate latency peaks at times of market data spikes.  Serious traders know which sell-side firms, execution brokers, dark pools, and trading venues exhibit deterministic speed (and low latencies), and they trade with them, especially in volatile times.  The best opportunities for most competent algo traders lie in taking advantage of volatile, high-volume periods; their trading signals have a short life, often microseconds, and they know where to route orders.  The most successful buy-side traders will likewise route to the fastest and most deterministic sell-side firms.  The sell side will optimize its dark pools for deterministic speed and know when to send to other dark pools and lit exchanges, thanks in large part to very significant advances in Big Data analytics.

 

What are the factors that disrupt this optimal scenario?

 

Optimal infrastructure may be fully or partially FPGA-, Layer 1 switch-, and CoLo-based.  Infrastructures that are not can still be relevant for speed (T2T in microseconds) and deterministic latencies if they follow known best practices that reduce market data latency jitter.  Failing to follow these best practices will disrupt not only an optimal scenario but also the firm's market share and business.  The practices include:

 

  • Never configure market data and order flow on the same network interface (NIC).  A spike in one will negatively impact the latencies of the other.
  • Configure kernel bypass on all of your NICs.  Several vendors offer out-of-the-box kernel bypass for much lower I/O latencies – down to about 1 microsecond from 12 or more.
  • One can attain additional NIC speed with optimized APIs for publishers and subscribers of the NIC data.  The fastest and most deterministic kernel bypass devices are FPGA-based.  Full TCP stacks can be implemented in FPGA-based NICs; hence not only system space but also user space is bypassed, freeing up CPU-core processing power.  Kernel bypass has been successful for over 6 years; upgrade immediately if you do not use it.
  • Optimize your Linux server kernels – ex: the RHEL 7.2 "network-latency" profile prioritizes speed over bandwidth and energy savings.  Speed up your CPU-core processing and decrease your jitter with this profile. [3]
  • Continuously monitor and update metrics – your KPIs such as latencies and fill rates per trading partner, by symbol, by time of day, per economic events, per variance from expected price and volume, etc.  This real-time, OLAP-style analytics is part of the new Big Data, which can now provide significant alpha-seeking advantages.

 

Please explain what Tabb and Corvil refer to as 'Speed2' or 'Meta-Speed' (Sept 2016).

Also, why have an increasing number of trading firms been addressing Speed2 opportunities for increased revenue?

 

Speed (raw, pure speed, which we can refer to as 'Speed1') is relative; no single speed guarantees success.  This is where ULL analytics can pay huge dividends, acting on data relating to speed; hence, aspects of speed must be continually measured.

 

Recent advances in technology, together with current and evolving market structures, mandate that traders view more nuanced aspects of speed.  This brings us into a new era that Tabb and Corvil refer to as "Speed II," which revolves around meta-speed (information about speed). [4]

 

This pertains to understanding the dynamics, timeliness, measurability, auditability and transparency of speed and latency. Hence, the information around speed and latency is as important, if not more important, than speed itself. A critical aspect of Speed2 is deterministic latency.

 

What good is raw speed if firms do not understand whether the sell-side systems, exchanges, markets, and clients they are connecting to are fully functional? Or if a trading partner is in the beginning stages of failure, or in the process of being degraded?

 

A sell-side execution broker's app is NOT fully functional if its T2T or order-ack times increase exponentially within a few milliseconds of a Fed interest rate increase or a Donald Trump tweet (because, for example, the broker erred the weekend before by leaving NIC kernel bypass off after an upgrade, delaying ingress of bursty market data).  This is one example of many that can disrupt deterministic latencies and performance, and hence fill rates.  However, technologies are available whereby a buy-side firm can learn of order-ack latency spikes within milliseconds and immediately route to alternate sell-side execution brokers.  *** This is where high-speed, in-memory Big Data analytics can provide a significant competitive advantage.

 

Many firms are developing internal business process management (BPM) benchmarks to fully understand the status of their own systems.  A selection of those metrics increasingly should be provided to clients so they can better judge the trading ecosystem; this builds goodwill among trading partners.  But technology exists today whereby every trading client can very precisely measure and analyze its trading ecosystem, and thus alert on and then alter real-time trading apps for competitive advantage.

 

Tabb and Corvil have stated that:

Today’s markets require microseconds and tomorrow’s will require nanoseconds

 

Why must market makers be keenly aware of 'Meta-Speed'?

 

Speed2 includes addressing the decision point of whether or not to send an order.  If the order is acting on an internal trading-system market data latency that points to a high confidence of fill per the most recent market data, then the firm will tend to send it; otherwise it holds off.  This is most critical to market-making firms, as they need to make markets in a timely way to stay in business, let alone generate profits.  Real-time analytics are key to such trading decisions.
To trade and maintain market share, market makers not only need to connect with the largest exchanges, they increasingly need to link to the majority of, if not all, trading venues.  Further, to manage adverse selection risk (the risk of being picked off), market makers need to be increasingly fast.  This means buying direct market data feeds, exchange access, colocation services, and very fast intermarket connectivity.
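
A minimal sketch of that send/hold decision point; the staleness budget and fill-probability threshold are hypothetical values a market maker would calibrate from its own analytics:

    MAX_MD_AGE_US = 50.0    # hypothetical market data staleness budget (µs)
    MIN_FILL_PROB = 0.90    # hypothetical confidence-of-fill threshold

    def should_send(md_age_us, fill_prob):
        """Meta-speed gate: send only if the quote is fresh enough that the
        projected fill probability still justifies the order."""
        return md_age_us <= MAX_MD_AGE_US and fill_prob >= MIN_FILL_PROB

    print(should_send(md_age_us=12.0, fill_prob=0.95))    # True  -> send
    print(should_send(md_age_us=220.0, fill_prob=0.95))   # False -> hold off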

 

How does a deterministic approach to low-latency connectivity help create an optimal situation? How does this work, and how is it accomplished?

An optimal situation includes a pertinent infrastructure, with application software and Linux kernels configured and tuned for deterministic latencies.  This includes the infrastructure configurations addressed earlier, plus software design principles that take advantage of multi-threading across all available cores and deep vector processing, thereby executing more instructions per CPU-core clock cycle. [5]

To accomplish this successfully, 2 points are critical:

  • ROI

Trading firms vary in their electronic trading (ET) priorities, expertise, commitments, and expectations of ET's impact on revenue and net income.  Those that aspire to be leaders and to be successful (profitable) in this space especially must meticulously construct ROI projections for their proposed investments and commitments, to be most competitive and to increase market share.

 

To accomplish this, such firms must rely extensively on Big Data analytics, with metrics such as:

  • Continuously extracted fill-rate stats at low, median, and 99th-percentile peak latencies.
  • Fill-rate stats tracked over time.
  • What is the fill % and revenue gain if T2T decreases by 100 microseconds?
  • What is the cost of the infrastructure to accomplish this and to maintain it?
  • What is the cost of not upgrading (loss of market share and revenue as competitors increase Speed1 (raw) and their use of Speed2)?

 

Use these metrics to project capacity upgrades.

 

Machine Learning / neural networks (part of the evolving Big Data landscape for ET) can project with significant confidence the multi-variable factors that drive latencies and fill rates, and hence revenues.  *** Again, this is another excellent capability now available (and not too expensive) in Big Data analytics, impacting the bottom line and major financial decisions.
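
A hedged sketch of that kind of multi-variable projection, using a scikit-learn logistic regression; the features and training rows are invented purely for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # hypothetical features per order: [T2T_us, spread_bps, queue_depth]
    X = np.array([[1.2, 0.5, 10], [80.0, 1.5, 400],
                  [5.0, 0.7, 60], [150.0, 2.0, 900]])
    y = np.array([1, 0, 1, 0])    # 1 = filled, 0 = missed

    model = LogisticRegression().fit(X, y)
    # projected fill probability for a candidate order
    print(model.predict_proba([[20.0, 1.0, 200]])[0, 1])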

Examples of firms A and B, opting for different strategies:

  1. Firm A's ROI analytics point to very significant market share growth for very complex trading algos; hence it may opt to upgrade its market data infrastructure (and maybe its order flow too) to 100% pure FPGA – for SIP and Level 2 feeds, with vendor-certified FPGA-based normalization IP, and with a high-speed interconnect (5 ns) forwarding market data to its proprietary algo trading system, which combines FPGAs for relatively simple algos and combined FPGAs / Intel cores for more complex algos (Firm A has talented FPGA developers). Algo-Logic + Metamako provide such a solution.
  2. Firm B is smaller, with less expertise, and projects modest market share improvements from lower market data latencies and overall deterministic T2T latencies. Its analytics point to modest expenditures, so it pursues a predominantly managed CoLo service with lower and more deterministic latencies.  Another option may be a carefully configured/tuned vendor-distributed solution; an example is Thomson Reuters TREP.

 


  • Test, Profile, Analyze, Project

Trading firms must deploy infrastructure architects/engineers, developers, and QA staff to determine which new infrastructures project to a positive ROI, and then meticulously engineer, configure, profile, tune, and validate the expected latencies, fill rates, and thus revenue (positive ROI).

 

In the end, such metrics may play a significant role in whether a trading firm stays in ET or exits the business. [6]

 

 

How can in-memory and other high performance databases and feed handlers support key functions like best execution, transaction cost analysis, market surveillance and algo back-testing?

Best execution, TCA, and market surveillance can now be done much faster and more comprehensively via Big Data analytics / Machine Learning.  The data includes historical and real-time, both structured and unstructured.  HPC and exabyte-scale in-memory databases, with faster and larger memory, are available, along with more cores, GPUs, FPGAs, and high-speed interconnects to forward the products of the analytics – ex: a TCA result or alpha-seeking event sent over a high-speed interconnect to alter a real-time trading app.

  • FPGA-based feed handlers carrying the relevant data can accelerate the analytics.
  • Algo back-testing with high-speed memory allows one to test many more variances and combinations of variables' impacts on seeking alpha.
  • Products such as OneTick can receive the gamut of 'reduced' or 'final' data for correlation analytics, even machine learning, acting on this historical (and possibly real-time) data in memory to address all of the above areas.
  • The speed of analytics, including generated events whose value has a short life, is increasingly important – especially in seeking alpha and in determining whether an order sent for execution has a high probability of fill and profit.

 

Hence, the most deterministic-latency market data can be a major revenue maker as input to TCA.  OneTick can be a ULL subscriber of market data just as algo trading apps are, and thus have the data ASAP for TCA.  For pure and deterministic speed, raw direct feeds with Layer 1 switch market data fan-out are optimal.

 

Timely and deterministic historical and real-time tick market data are required for TCA.  To ensure TCA analytics do not impact ULL real-time market data for trading apps and subscribers, offload the analytics, then transmit the alpha trading signals asynchronously via high-speed interconnects and/or messaging middleware such as 60East Technologies' AMPS.  The OneTick analytics and stream-processing product can analyze and provide alpha alerts via its OneTick Cloud offering, utilizing large, high-speed memory.
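
A minimal sketch of that offload pattern using a standard-library queue and a worker thread; in production the publish step would go over a high-speed interconnect or middleware such as AMPS, and the signal format here is hypothetical:

    import queue, threading

    ticks = queue.Queue()

    def analytics_worker():
        # runs off the hot path: consumes ticks, emits alpha signals asynchronously
        while True:
            tick = ticks.get()
            if tick is None:
                break
            # ... TCA / alpha analytics here, then publish to subscribers ...
            print("alpha signal for", tick["sym"])

    worker = threading.Thread(target=analytics_worker, daemon=True)
    worker.start()

    # the ULL path only enqueues (non-blocking) and returns immediately
    ticks.put({"sym": "IBM", "px": 100.01})
    ticks.put(None)        # shut the worker down for this demo
    worker.join()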

 

The sell side is forced to up the ante, offering expanded services to better understand and manage the trade lifecycle.  Complex event processing (CEP) and tick data management are the consummate tools, easily recast and molded to unearth trade performance – a goal central to the investment process as liquidity continues to be fragmented and fleeting.  Uncovering the performance of trading behavior through customized, personalized transaction cost analysis is now a critical component of any investor's profitability.

 

Real-time TCA provides traders with information on costs, quality and potential market impact as it happens, where analytics become actionable information at the point of trade. Determining these metrics on an internalized basis offers the ability to adjust an algorithm’s behaviour in real time. Execution strategies and routing logic can be adjusted intelligently in response to outlier conditions; either aggressively or passively in reaction to market conditions or broker behaviour.

In Sum:

Measure TCA results in real time as you’re trading and make adjustments and changes accordingly

 

Players in real time TCA:

 

OneTick (may be possible now; if not, likely soon)

  • Built-in high precision analytical library for TCA analytics and market price benchmarks
  • Disparate data is normalized and cleansed (Deals, Quotes, Books, Orders, Executions)
  • Real-time stream processing engine, historical time-series database and visual dashboards
  • Flexible choices for building analytical models for execution/broker performance, market impact, etc

 

TabbMetrics Clarity Tool

Partners with TabbMetrics include Bloomberg, Cowen, Sanford Bernstein, and Weeden.

Pragma

Instinet

How do functions such as execution, transaction cost analysis (TCA), market surveillance & algo back testing impact market data system latency?

Market data processing for transaction cost analysis, market surveillance & algo back-testing must NOT impact T2T, order-ack, or trade execution times.  Offload the analytics and the above functions asynchronously from the T2T real-time order flow.  Proper software can replay market data through multiple algos at the original rates and latencies, or alter them (ex: speed up, change dynamics, algo goals, etc.).
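
For illustration, a minimal Python sketch of rate-faithful replay: sleep for the original inter-tick gaps (or a scaled fraction of them) before handing each tick to the algo under test.  The tick layout and speed-up factor are hypothetical:

    import time

    def replay(ticks, on_tick, speedup=1.0):
        """ticks: list of (timestamp_seconds, payload) in original order."""
        prev_ts = None
        for ts, payload in ticks:
            if prev_ts is not None:
                time.sleep((ts - prev_ts) / speedup)   # reproduce original gaps
            prev_ts = ts
            on_tick(payload)

    ticks = [(0.000, "quote A"), (0.250, "quote B"), (0.255, "quote C")]
    replay(ticks, on_tick=print, speedup=2.0)   # replay at 2x original speed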

Execution

For order execution, T2T and order-ack times are critical performance metrics; hence, market data infrastructures are optimized for speed and deterministic latencies.

 

TCA

From Tabb:

TCA is increasingly being used in real time.  TCA can therefore generate alpha by projecting and exposing lower costs and the specific trading venues at which to buy or sell securities.  This is referred to as "opportunity cost" and is very time sensitive: the first trading firm(s) to identify it (a window that may be micro-, not milliseconds) stand to benefit.  This is another example of 'Speed2'.
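
For illustration, a minimal sketch of the cost arithmetic behind real-time TCA: slippage of the achieved fill price versus the arrival-time mid, in basis points.  The prices are invented:

    def slippage_bps(arrival_mid, avg_fill_px, side):
        """Implementation shortfall vs. the arrival price, in basis points."""
        signed = (avg_fill_px - arrival_mid) if side == "BUY" else (arrival_mid - avg_fill_px)
        return 1e4 * signed / arrival_mid

    # buying: mid was 100.00 on arrival, fills averaged 100.03 -> 3 bps of cost
    print(slippage_bps(100.00, 100.03, "BUY"))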

 

Hence, the most deterministic-latency market data can be a major revenue maker as input to TCA.

 

 

 

Market surveillance

For the same reasons as TCA – plus order flow data – fast and more deterministic market data will speed up market surveillance, post-trade processing, and compliance.

 

Algo Back Testing

Note the speed-ups mentioned for TCA, and understand that a tick DB will hold all the data needed for algo back-testing replays.

 

How can high-performance technologies like hardware acceleration and in-memory databases improve trade executions and TCA?

Also – are these ‘apples & oranges’? Are both capabilities or technologies needed for higher performance in market data management?

 

Both are required, as hardware acceleration speeds up (and makes deterministic) the delivery of market data into high-speed memory regions for analytics, which may include timely alpha-seeking strategies.

 

 Are best practices for accessing high-speed market information widely agreed upon? Are they much the same across all types of firms? Do they change according to specialization/function/size?

Are they agreed upon? No.  Some feel that a few milliseconds of market data latency still suffices; others strive for nanoseconds.  The key is to identify your business goals and the ROI of your speed-up spending.  Specialization and function are significant factors: the goals of prop traders, market makers, HFT traders, and arbitrageurs are much more latency sensitive than those of most buy-side firms and asset managers.

Have they changed in the past two years? How will they change over the next two years?

 

Last 2 years:

  • Already mentioned: the impact of what Tabb/Corvil call Speed2 – taking advantage of recent advances in Big Data analytics (infrastructure and methodologies such as Machine Learning) to act on market data and all aspects of T2T.

 

  • Layer 1 switches (which need no packet examination) plus FPGAs for speed are now part of ULL CoLo market data fan-outs at 5 ns, contributing significantly to the decrease in latencies.

 

  • Big-buffer switches hold (rather than drop) spikes in real-time data for Big Data analytics.  While such switches are not ideal for market data or order flow T2T latencies, they are an excellent choice for ULL Big Data analytics, which can quickly forward alpha-seeking trade signals over a high-speed interconnect to a real-time trading app.

 

  • Very significant enhancements in FPGA processing are significantly lowering T2T latencies.  2016 marked the year when some simple algo-trading apps became 100% FPGA-based with T2T of approximately 1 microsecond.  One can now implement order book builds and FIX engines on FPGA-based NICs.

 

  • Raw speed continues to increase:
  • At a STAC event on Nov 7, 2016 in NYC, one participant noted that industry-leading tick-to-trade latencies have decreased by a factor of 10 every 3 years.  I recalled where I was in each period and what the leading technologies for the lowest-latency ET were, and I find these figures accurate – I have always tracked the competition, and worked to beat it!

[Chart: industry-leading tick-to-trade latencies, decreasing by a factor of 10 every 3 years]

 

[Table: tick-to-trade latency by period, with the leading technologies of each]

The chart above is mine, created from my attendance at the Nov 7 STAC event in NYC.

 

Some contributing factors (entirely my own view):

 

  • The key point is that the proportion of "ULL" trading continues to increase; hence one would project that trading firms that choose not to "speed up" or to engineer more deterministic latencies will lose market share.

 

  • More firms are categorizing news sentiment analytics as "market data" and receiving feeds of these analytics from vendors; they are now part of more complex but possibly more reliable trade recommendations (it is up to the firm to take advantage of this).  Key vendors: RavenPack, Thomson Reuters, Bloomberg.

 

  • Real-time Big Data (Machine Learning), some of it in the cloud, now sends more time-sensitive alpha trading signals over high-speed interconnects to trading apps.

 

  • Kernel bypass is now ubiquitous; I/Os are at approx. 1 µs, down from 12 µs
    • Increased use of FPGA-based kernel bypass brings I/O latencies below 1 µs

 

  • GPUs remain an important speed factor, but more for risk (ex: Monte Carlo) and analytics

 

 

 

FUTURE

  • Intel's synergies from the Altera acquisition may result in further lower and more deterministic latencies for all aspects of ET, including market data, algo trade design, risk analytics, TCA, and order flow.

 

  • The ability to program FPGAs more easily:
    • New libraries from Intel, along with a C-like language ('A++'), may supplement OpenCL for programming complex trading algos in FPGAs

 

  • High-speed interconnects such as Intel's Omni-Path Architecture (OPA) may be faster and more deterministic than InfiniBand (IB), due to the lack of a bus adapter and IB's requirement to fill a buffer before sending

 

  • Binary FIX Protocol

 

 

REFERENCE

 

The link below lists virtually all of the expertise I have in ULL electronic trading (ET) architectures.  I will teach a ULL ET course at NYU in Summer 2017.

https://homerunfitness.wordpress.com/2016/12/28/update-ull-ultra-low-latency-architectures-for-electronic-trading-nyu-summer-2017/

Also: on Dec 8, 2016, I was a panelist for an A-Team webinar on perspectives on strategic ULL market data architectures and how trading firms can realize ROI, seek alpha, expand market share, and address risks and compliance.  Access the webinar recording here: http://bit.ly/2fXujEo.

[1] www.algo-logic.com

 

[2] www.metamako.com

 

[3] https://access.redhat.com/sites/default/files/attachments/201501-perf-brief-low-latency-tuning-rhel7-v1.1.pdf

[4] http://tabbforum.com/opinions/rethinking-speed-in-financial-markets-part-3-the-speed-rule?print_preview=true&single=true

 

[5] http://tabbforum.com/opinions/redefining-and-reimagining-speed-in-the-capital-markets?ticket=ST-14852041740069-qgIDZjoi2HWsc6MNhDsNrNzvNIlTfGDkJWUQjb1b

[6] http://tabbforum.com/opinions/welcome-to-the-jungle-understanding-speed-in-the-capital-markets

 

Update-ULL (Ultra Low Latency) Architectures for Electronic Trading @ NYU Summer 2017


ULL (Ultra Low Latency) Architectures for Electronic Trading
NYU SPS Summer 2017 – 9 sessions (approx. 2½ hours each) – Ted Hruzd
Est: Tuesdays May 30, Jun 6, 13, 20, 27; July 6, 11, 18, 25; 6:15-8:45 pm

On-Line Registration to begin in February 2017 (approx cost $650-$675):

https://www.sps.nyu.edu/professional-pathways/topics.html

Course Objectives

Develop advanced skills in architecting electronic trading (ET) and market data applications for ultra-low latency (ULL), for competitive advantage, and for positive ROI.  By the end of the course you will have developed expertise in end-to-end architecture of ET applications and infrastructure, including:

  • Tick-to-Trade applications with single-digit-microsecond, even sub-microsecond, latencies
  • How to architect for deterministic latencies even in times of volume spikes
  • Why 'Meta-Speed' (information on how to use speed) is more important than pure speed
  • Proper use of multi-layer 48-port $18K ULL switches, FPGAs, GPUs, and microwave technologies
  • Integration of FPGAs and Intel cores via high-speed caches, eventually FPGAs and cores on the same die (Intel-Altera current and upcoming enhancements)
  • When and how to architect market data order books and FIX engines in FPGA-based NICs
  • Multi-core, high-speed-cache Intel-based servers
  • Linux (RHEL) 7.2 kernel and NIC tuning
  • Kernel bypass technologies including RDMA and LDMA
  • Leading FPGA-based NICs – from SolarFlare, ExaBlaze, Enyx
  • Single-tier networks (or simplified spine-leaf); ex: from Plexxi Networks
  • Layer 1 network switches (Metamako & ExaBlaze)
  • SDN (Software Defined Networks) – when applicable for ULL trading applications
  • The new binary FIX protocol for ULL order routing
  • ULL messaging middleware (29West LBM/UME and 60East Technologies AMPS)
  • ULL software design (deep vectors, ex: Intel's AVX-512, & multi-threading – OpenMP, TBB)
  • Storage, including NVME Flash
  • Tools (some free) to attain performance optimization insights
  • Network appliances – detailed timings/analytics – network, market data, and order routing
  • Big Data and Event Stream processing, real time analytics for seeking alpha (trade opportunities)
  • Fundamentals of FPGA design and programming
  • Network performance analysis via WireShark and potentially also via Corvil
  • Programming trading algos via basic Python
  • Machine learning / neural networks for seeking alpha via basic R programming
  • ROI analysis

PreReq – (for most, expecting basic to intermediate expertise, unless noted)

  • Most important: at least 2 years working with electronic trading applications/infrastructures as Developer, SA, network admin/engineer, Architect, QA analyst, tech project mgr, operations engineer, manager, CTO, CIO, CEO, vendor or consultant providing technology to Wall Street IT,
  • TCP/IP, UDP, multicast (basic knowledge),
  • Linux OS and shell or scripting (ex: bash, perl); at minimum, basic familiarity with the output and usefulness of core Linux commands such as sysctl -a, ethtool, ifconfig, top, ls, grep, awk, sed, and others listed later in this syllabus
  • Intel servers, cores, sockets, GHz clock speed, NUMA
  • Network routers, switches
  • 1 or more network protocols from BGP, OSPF, EIGRP, MPLS, IB
  • FIX protocol
  • Market Data, at minimum contents of equities consolidated feeds
  • Visio (will use for homework assignments)
  • Python (very basic will be fine – a 2 hour reading assignment will be arranged for beginners). We will use a text written for traders with zero programming experience that quickly trains them how to use small set of Python for creating trading algo’s
  • R programming (nice to have. Will use basics that one can learn in 1-2 hours),

Course Logistics

  • 8 or 9 sessions, 2½ hours each (ex: 6:30-9:00 pm or 6:15-8:45 pm), starting May 30 (Tue) or 31 (Wed), once a week
  • Tech book(s) to download to kindle
    • Architects of Electronic Trading, Stephanie Hammer, Wiley 2013
    • Ultimate Algorithmic Trading Systems ToolBox, George Pruitt, Wiley, 2016
    • (optional) Trading and Electronic Markets: What Investment Professionals Need to Know, Larry Harris, CFA, 2015
  • Multiple web site links to technical white papers and tech analyses -ex: nextplatform.com, http://intelligenttradingtechnology.com/, http://datamanagementreview.com, www.tabbforum.com,  www.tradersmagazine.com
  • Visio (some homework assignments)
  • Extensive use of white board by instructor and students. Sessions will present students with few infrastructures to architect per specific business success criteria
  • Grading:
    • 1/3 class participation in in-class architecture designs (white-board sessions)
    • 1/3 quizzes / tests
    • 1/3 – Homework – visio, wireshark analysis, basic python algo programming

Session 1 – Tue May 30

ULL components: CoLo, switches, FPGA, servers, OS, networks, software & middleware, market data

  • Will present a Visio diagram of a CoLo ULL architecture that generates orders destined for trading venues, utilizing Layer 1 switching + FPGAs for market data & order flow, with a target of sub-1-microsecond Tick-to-Trade (T2T) latencies.  Latencies will be deterministic, even at peak loads, as long as the switch and FPGAs process at line speed
  • Will briefly present an alternative architecture utilizing a Single Tier network (Plexxi)
  • We will periodically revisit the co-lo ULL architecture throughout this course when we cover specific architecture components in depth (Algo Trading and/or SOR that feeds this architecture, use of FPGA, and Layer 1 switching)
  • Partial FPGA and non FPGA alternative architectures
  • Key advantages of FPGA (Ted’s A-Team Doc)
  • Why speed of processing (& ULL market data) still matter & will for next several years at least
  • Why Meta-Speed is more important than pure speed (reference Corvil-Tabb Doc)
    • Meta-Speed Deep Dive (10 minutes)
  • Why Layer 1 switches
  • Layer 1 switch with integrated cores and FPGA for risk checks
  • High speed real time analytics for seeking alpha (trade opportunities) & infrastructure analytics
  • Exchange (Trading Venue) connectivity
  • Layer 2/3 aggregation in new switch appliances
  • Some leading ULL vendors:
    • Metamako
    • Algo-Logic
    • Nova-Sparks, with Nova-Link product
    • Corvil
    • Intel / Lenovo
    • SolarFlare
    • ExaBlaze
    • Enyx
  • Role of Linux kernel tuning for ULL – use network-latency profile & common Linux best practices
  • Present some Linux configurations to critique (ex: no K bypass, same NIC for mkt data & order flow)
  • What electronic trading organizations will prosper in space of ULL ET now & in future? Which may very well fail, even disappear?  Why is role of ROI critical?  Difficulties of proper ROI analysis

Class Exercise (at the end of class we will do this together): Given a few server/Linux configurations with flaws, respond with measures to optimize performance & lower latencies

Session 2 – Tue June 6

Deep Dive into Red Hat Linux 7.2 low latency configuration & tuning, kernel bypass, PTP & NTP, then more details regarding ULL architectures from Class 1

We will review last week, plus questions & discussion of the assigned readings.

Next:

I will present/explain the following best practices regarding Linux tuning in a way that will lead to some white-boarding designs.  To a large extent, the outline below follows the Linux reading assignment.  **** Do we have access to NYU Linux server(s) for review of Linux configurations and to run some basic commands – or is a dev CoLo available?  Alternatively, Ted will bring the same Linux and server configurations with flaws, open for optimization

  • Deep dive into Linux 7.2 network-latency configuration
    • Base config includes (perf over power saving):
      • tcp_fastopen=3 (2-way handshake – the client's cookie is encrypted at init, so a reconnect is 2-way, using the cookie)
      • Enable intel_pstate & min_perf_pct=100 (steady GHz; disable fluctuations)
      • Disable THP (Transparent Huge Pages of 2 MB under kernel control)
      • cpu_dma_latency
        • At C-states, keeps cores from sleeping; part of QoS
      • busy_read 50 µs (100 µs for a large # of pkts) & busy_poll 50 µs (socket polls the NIC recvQ, disabling the net interrupt); cores stay "active"
        • BUT – kernel bypass is much better (we discuss 3 methods of kernel bypass)
      • numa_balancing=0 (no automatic NUMA management)
    • Disable unnecessary daemons and services (ex: firewalld & iptables)
    • Maximize the ring buffer size
      • The device driver drains the buffer via soft IRQ (other tasks are not interrupted, vs. a hard interrupt)
    • Set RFS (Receive Flow Steering) – increases CPU cache hits; forwards pkts to the consuming app
    • TCP SACK (retransmits only the missed bytes) – tcp_sack=1
    • TCP window scaling – up to 1 GB
    • sysctl -w net.ipv4.tcp_low_latency=1
    • Timing and scheduling:
      • sched_latency_ns (20 ms default; increase!!)
      • sched_min_granularity (4 ms default; increase!)
        • Increasing the # of procs/threads – the formula may lower this 4 ms
      • Some applications may benefit from a tickless kernel
        • (ex: a small # of procs/threads, at no more than the # of cores)
      • sched_migration_cost_ns (default 500 µs; increase!)
        • This pertains to the period of a "hot" cache, preventing premature task migration
      • Basic Linux and Server measures and utilities for performance analytics:
        • BIOS updates and tuning
        • Turbostat
        • Lstopo
        • Lscpu
        • Numactl
        • Numastat
        • Tuned
        • tuned-adm network-latency configuration (set the profile)
        • Isolcpus
        • Interrupt affinity or isolation
        • Irqbalance
        • Busy_poll
        • Check gamut of process (pid) info, much pertaining to performance in /proc/<pid>; for ex: files numa_maps, stat, syscall
        • Tuna – control processor and scheduler affinity
          • Options: Isolate sockets from user space, push to socket 0
        • VTune Amplifier 2016
          • CPU, GPU, threads, BW, cache, locks, spin time, function calls, serial + parallel time
          • Identify code sections for parallelization; ex: TBB – more control than OpenMP
          • MPI analysis, ex: locks; MCDRAM analysis
        • Intel's PCM (Performance Counter Monitor) – major enhancements
          • Ex: times when specific threads hit/miss the L1/L2/L3 caches; measures cache times and the impacts of misses; helps identify priority procs/threads for the cache
        • Transaction profilers – Wily, VisualVM, BEA WLS, valgrind, custom/free – troubleshooting, ESP correlation, ML
        • Perf: perf top -g (functions)
          • Perf counters in hardware (cpu) , with kernel trace points (ex: cache miss, cpu-migration, softirq’s)
        • strace
        • Ftrace – in-kernel tracer of system calls and kernel functions
          • syscalls of procs and threads
          • Dynamic kernel function tracing, including latencies (ex: how long until a proc wakes/starts)
          • /debug/tracing
          • trace_clock
        • Dtrace for Linux:
          • Dynamic – cpu, fs, net resources by active procs, can be quite specific
          • Log of args /fx
          • Procs accessing specific files
          • # New processes with arguments
          • dtrace -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'

          • # Pages paged in by process
            dtrace -n 'vminfo:::pgpgin { @pg[execname] = sum(arg0); }'
          • # Syscall count by process (or a specific syscall count per process or thread)
            dtrace -n 'syscall:::entry { @num[pid,execname] = count(); }'
          • Also 'canned' scripts for the processes with the top TCP and UDP traffic, and for ranking processes by bandwidth
        • SystemTap – ex: probe tcp.setsockopt.return
          • Uses trace points for kernel and user probes
          • Script thief.stp – histogram of interrupts by process
          • Dynamically instruments running, production Linux kernel-based operating systems.  System administrators can use SystemTap to extract, filter and summarize data in order to enable diagnosis of complex performance or functional problems.
        • SysDig tool – syscalls only; dump for post-processing / scripting

 

  • Oprofile uses hw counters, tracks mem access and L2 cache, hw interrupts
    • Mpstat, vmstat, iostat, nicstat, free, top, netstat, ss [filter/script for analytics]
  • VM (Virtual Memory) and page flushes, optimize market data caches
    • Slab allocation = memory management for kernel objects; eliminates fragmentation
  • Slow network connections and packet drops
  • Intro to NetPerf tool
  • NIC tuning
  • Kernel bypass, LDMA, RDMA
  • Kernel bypass with NIC vendors (SolarFlare, Mellanox, ExaBlaze,) – description how each work
    • SolarFlare OpenOnLoad sets up all socket calls in user space instead of kernel space, with dedicated socket connection & data handled in NIC memory
    • Mellanox: the VMA library, linked into user space, also sets up user-space calls to the NIC; the Connect-IB NIC allows non-contiguous memory transfers app-to-app; RV offload speeds up MC; MLNX OFED provides Open Fabrics verbs for IB and Ethernet; PCIe switch & NVMe over Fabrics; MPI offloads; 2 ports at 100 Gbps; IB & Ethernet connections < 600 ns latency
    • Enyx NICs: differ from SF and MX, whose network stacks run in user space (which can be CPU intensive)
      • Enyx places the full TCP stack in hardware (FPGA), reducing jitter
    • Network appliances:
      • ExaBlaze Fusion
      • Metamako MetaApp
      • Fixnetix ZeroLatency
    • Precision Timing – PTP and NTP
    • PTP Symmetricom Sync Server S300s – NTP & PTP, owned by Microsemi GM
    • GPS Satellite satisfies UTC Req.
    • MiFID II and PTP (software (sw) + hardware (hw) critical for accuracy; requirement: 100 µs to UTC)
      • Symmetricom PTP GM 6 ports +/- 4 ns, <25ns to UTC
      • GPS -> GM-Spectracom->B-Clock(Arista7150s-FPGA timing+NAT)->servers-PTP-sw with FPGA based NIC’s ex: Exablaze ExaNIC models) – or SolarFlare NIC’s with HW timestamps
        • linuxptp – ptp4l & phc2sys (can act as a B-Clk); syncs the PTP hardware clock on the client, including VLAN-tagged and bonded interfaces, to the master (GM), but via the kernel; its daemons can't consume MC; the kernel delivers pkts to the bonded interface.  SF's sfptpd does everything in HW and can sync every SF adapter; ptpd – multiple platforms, but software-only.
          • Timemaster – on start, reads NTP & PTP time servers, starts daemons, can sync sys clock to all time servers in multiple PTP domains
        • Master-slave time sync (ex:
        • PTP Timing within 6 ns –
        • consider disable tickless kernel :  nohz=off (for accuracy) BUT test this and app impact
        • PTP in hardware best but costs; do ROI
        • If multiple interfaces in diff networks, set reverse FWD mode to loose mode
        • Cmd: ethtool -T <interface> – verify timestamping capabilities (for hw)
        • “timemaster” reads config of PTP time source
        • Cmd: systemctl start timemaster
        • ExaNIC FPGA can be programmed for extra analytics; some base programs available
        • MC if sync msg from Master but UDP unicast delay msg from slave to Master
        • PTP assumptions:
          • Network path symmetry (hence switch, router, FW, OS impact this)
          • Master and slave accurately measure when at pt of send/receive
          • Every hop can reduce PTP accuracy
        • PTP options:
          • Each slave clock direct cables to master .. but complexity. Cost …
          • Dedicate PTP switch infrastructure; switch PTP aware & eliminate switch delay or act as PTP M B-Clk; do not mix traffic
          • In dedicated LAN, PTP thru switch L2 Bcast to PTP bridge (server as B-Clk & bonded interface mgr), sends MC to FW (–if no SF; FW has list MC groups, IGMPv3 config ), MC to PTP clients for Time Sync, best if clients have with SF <add PICTURE>
            • FW configured for IGMP3, has necessary config allowing PTP-Bridge & clients to join std PTP MC group 224.0.1.129
            • Sfptpd can work on bonded interfaces so PTP clients need specify mgt interface to get PTP TS (from PTP bridge)
          • Hardware time stamps at every point
        • More PTP details:
          • Slaves periodically send messages back to Master (sync)
          • sfptpd → file or syslog; ptp4l → stdout
          • Offset: amt Slave Clk off from Master
          • Freq Adjustment: how much clock oscillator adjusts to run at same rate as Mstr
          • Path Delay: how long master to slave, and vice versa
          • Metrics – collectd, applies RegEx
        • NTP – selects accurate time servers from multiple candidates (ex: 3); polls 3 or more servers
          • Keep stratum levels to no more than 2
          • Keep 3 clock sources nearby for sync
          • Use switches with light or no queuing
          • Use "timekeeper" – transforms any server into a timing appliance
        • Class Exercise (we will do together in class) – Explain the different approaches to kernel bypass of the following: ExaBlaze, SolarFlare, Mellanox, Enyx.  Explain the strengths and advantages of each; advise which specific electronic trading applications would best benefit from each.
          • Explain how the following Linux tuning options impact latencies:
            • swappiness=0
            • dirty_ratio=10
            • dirty_background_ratio=10
            • NIC interrupt coalescing (pre kernel-bypass)
            • Ring buffer increase
            • UDP receive buffer at 32 MB
            • netdev_max_backlog=1000000 (traffic stored before TCP/IP processing; one queue per core)
          • Explain what the following commands produce for latency analysis:
            • ifconfig
            • netstat -s (send/recv queues)
            • the ss utility
          • Detail major benefits of VTune and DTrace and when you would use either

HOMEWORK – complete the prior week's reading assignments; prepare for a 30-minute quiz next class regarding:

  • Optimal Linux kernel tuning per application requirements (Red Hat Doc + class notes)
  • Benefits of FPGA’s and GPU’s (Text book + Algo-Logic doc)
  • How multi layer switches work (Metamako Doc)
  • Differences in tuning to Speed 1 (raw) vs Meta-Speed (Corvil & Tabb Doc)
  • **** IN CLASS — I will spend 15-20 minutes detailing what is most important from the above.

Session 3 – Tue June 13

Quiz, then FPGAs, Multicast, Market Data

 QUIZ (30 minutes)

 *** we will immediately review the quiz over the next 30 minutes

Remaining 1½ hours of class:

FPGA’s & Market Data

  • Hardware accelerated appliances for ULL and deterministic performance
  • Ted's FPGA hand-out: intro to FPGAs, including FPGA design & programming (I/O blocks + logic blocks; OpenCL for creating "kernels" + synchronization for parallelism)
  • Why performance tends to be very deterministic with FPGA’s & why deterministic performance (latencies) are critical for HFT and algo traders
  • Pitfalls of FPGA’s
  • FPGA’s vs GPU’s, Intel Phi (Intel Doc), and multi cores
  • Feeds in FPGA –architecture, performance, design, support
  • Switch crossbars or caches for fan out with TCP distribution
  • Ted’s MC hand-out
  • Multicast (MC) performance considerations (see the subscriber sketch after this list)
    • Turn on IGMP Snooping on Switch
      • Switch listens to IGMP conversations between hosts/routers; maps links that require MC streams; Routers periodically query; 1 member per MC group per subnet reports.
    • Clients issue IGMP join requests to MC groups
    • Routers solicit group member requests from direct connect hosts
    • PIM-SM (Sparse Mode …low % MC) requires a Rendezvous Point (RP) router
    • Routers in PIM domain provide mappings to RP (exchange info for other routers)
      • PIM domain: enable PIM on each router
      • Enable PIM sparse mode on each interface
    • After RP, forward to receivers down shared distribution tree
    • When receiver’s 1st hop router learns source, it sends join message directly to source
    • Protocol Independent Multicast (PIM) is used between the local and remote MC routers, to direct MC traffic from the MC server to many MC clients.
  • Message based appliances, including FPGA based
  • Direct feed normalization
  • Conflation to conserve bandwidth
  • NBBO
  • Levels 1 and 2 market data
  • Depth of book builds (in FPGA’s or new multi core servers)
  • Smart order routers
  • Doc regarding leading vendor solutions
  • Exablaze NICs and switches v Metamako switches for market data
  • ENYX FPGA NICs and Appliances for market data and order flow
  • Nova Sparks FPGA based market data ticker
  • Fixnetix ZeroLatency – multi-threaded risk checks in FPGA, with order processing in parallel on a core
  • Other products — Exegy, Algo logic, Redline, SR labs
  • Consolidated feed vendors Bloomberg and Thomson Reuters
  • Use of new Intel Technologies (hardware & software) for alpha seeking strategies
  • Class Ex – (1) White-board sessions where students design ULL market data and multicast architectures per specific business/application criteria. (2) Given a Visio of a large network with only a few MC groups and subscribers, identify the likely path(s) to the few sources.  Include the choice of router as RP.
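
To ground the IGMP mechanics above, a minimal Python multicast subscriber: the IP_ADD_MEMBERSHIP setsockopt triggers the IGMP membership report that IGMP-snooping switches and PIM routers act on.  The group and port are hypothetical:

    import socket, struct

    MCAST_GRP, MCAST_PORT = "239.1.1.1", 30001   # hypothetical feed group/port

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))

    # join the group: the OS sends the IGMP join that the network acts on
    mreq = struct.pack("4s4s", socket.inet_aton(MCAST_GRP), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, addr = sock.recvfrom(65535)
        # a feed handler would normalize / book-build here
        print(len(data), "bytes from", addr)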

HW – Mkt Data white paper.

Week 4 will have Visio assignments:

HOMEWORK – 2 Visio designs: one for a 1 µs T2T, a second for a more modest (including internal alpha-seeking) 10 µs T2T

 Session 4 – Tue June 20

Review of Visio assignments, then intros to Python for algo trading, the FIX protocol, Wireshark (maybe Corvil), and R / neural networks

Quick intro Python – reference — Ultimate Algorithmic Trading Systems ToolBox, George Pruitt, Wiley, 2016

  • Python algo trading examples
  • Intro to Wireshark
  • Intro to FIX Protocol
  • Intro to Wireshark with FIX protocol “Plug-in”
  • TCP, UDP, multicast (MC), then analysis via WireShark, Corvil
  • Intro to R / Neural Networks

Last hour of class – FPGA and 1 µs T2T deep dive; slot open for FPGA expert John Lockwood of Algo-Logic.  If John is not available, we will start Class 5 content and hope John can join us later.

 HOMEWORK – additional Visio design TBD + basic Python algo trading program to code + reading assignments – Python, Wireshark, R-Neural Networks

 Session 5 – Tue June 27

Review of Visio assignment, then continue with Python for algo trading, Wireshark (maybe Corvil), and R / neural networks; then cover the latest ULL Intel technologies, other server/memory/Flash devices, ULL messaging architectures, network protocols, and SDN

Programming with multiple cores, multi-threading, and parallelism

  • Vectorize application code
  • Design – Internal loops with deep vector instructions, outer loops with parallelization (threads)
  • Servers, sockets, cores, caches. MCDRAM (Intel Phi)
  • Core speeds GHz vs more cores, larger and faster caches
  • Over clocked servers – features and what applications can benefit
  • Linux, Solaris, Windows, other ex SmartOS, Mesosphere DC OS
  • How to benchmark performance, analyze, tune
  • NUMA aware processes and threads
  • Optimize cache assignments per high priority threads
  • Intel technologies including …
  • AVX-512 deep vector instructions (speeds up FP ops)
    • 6-8 registers; more ops/instruction; less power
  • TBB thread Building blocks (limit oversubscription of threads)
    • OpenMP- explosion of threads
  • Omni-Path high speed / bandwidth interconnect (no HBA, fabric QoS, MTU to 10K, OFA verbs,105 ns thru switch ports, 50 GB/s bi ) & QPI
    • Uses Silicon Photonics (constant light beam, lower latencies and deterministic)
  • QuickPath: mult pairs serial links 25.6 GB/s (prior to Omni-Path)
    • Mem controllers integrated with microprocessors
    • Replaced legacy bus technology
    • Cache coherent
  • Shared memory is faster than memory maps; it allows multiple procs to read/write memory shared among them – without OS read/write calls.  Each proc simply accesses the part of shared memory of interest (see the sketch after this list).
    • Discuss the example of a server proc sending an HTML file to a client: the file is read into memory, then a network function copies that memory to OS memory; the client calls an OS function which copies it into its own memory.  Contrast this with shared memory.
  • PCIE
  • C ++ vs Java for ULL
  • Lists vs vectors
  • Iterate lists
  • Role of FPGA, GPU, MicroWave networks for ULL
  • C/C++, Java, Python, CUDA, FPGA – OpenCL: programming design considerations
  • Java 8 new streams API and lambda expressions – for analytics
  • Class Ex – Explain how Quick Path & Omni Path both improve latencies and advise which is preferred for ULL and why
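
To illustrate the shared-memory point in the list above, a minimal Python sketch (requires Python 3.8+ for multiprocessing.shared_memory): the reader attaches to the same buffer and reads it directly, with no OS read/write copies between the processes:

    from multiprocessing import Process, shared_memory

    def reader(name):
        shm = shared_memory.SharedMemory(name=name)   # attach to the existing segment
        print(bytes(shm.buf[:5]))                     # read directly; no OS copy calls
        shm.close()

    if __name__ == "__main__":
        shm = shared_memory.SharedMemory(create=True, size=64)
        shm.buf[:5] = b"ticks"                        # writer places the data once
        p = Process(target=reader, args=(shm.name,))
        p.start(); p.join()
        shm.close(); shm.unlink()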

 

  • New age networks – Spine leaf to single tier
  • SDN (Software Defined Networks)
    • Cisco ACI + Tetration
    • Cloudistics
    • Plexxi
    • NSX
  • Pico – ULL SDN vendor
  • Options-IT – colo managed ULL infrastructure
  • Cisco and Arista switches for ULL
  • Cisco ACI and Cisco Tetration – deep machine learning to automatically optimize large networks
  • Switches with deep buffers, great for Big Data Analytics
  • Configure Routers for ULL – LLDP, MLAG, VRRP, VARP (active-active L3 gateway)
  • Network protocols – BGP, OSPF, HSRP
  • Arista 7124FX with  EOS
  • Plexxi switches – a disruptive technology – single tier
  • Plexxi optimal bandwidth via its SDN
  • Optimal VLANs configuration for analytics
    • Use trunks from one switch to another after defining a VLAN, or use a router
  • VPLS (Virtual Private LAN Service) also for analytics
    • Enet based multiPt-multiPt over IP or MPLS
  • Decrease network hops for speed
    • (ex: Slim-Fly, a low-diameter network architecture, if not ready for single tier)
  • Network protocols:
    • eBGP: external BGP; path-vector routing using AS paths, network policies, rule sets, and a finite state machine; BGP peering between autonomous systems (AS)
    • MP-BGP: multiprotocol BGP – adds IPv6 plus unicast & multicast address families; used for MPLS-VPN
    • OSPF: interior protocol within an AS; link-state routing with metrics such as RTT, data volume through specific links, and link reliability
    • MOSPF: uses group membership info from IGMP plus the OSPF database to build multicast trees
    • EIGRP: interior protocol with a richer composite metric than OSPF: latencies, effective bandwidth, delays, MTU
    • MPLS: short path labels between network nodes avoid complex lookups in the routing table; multiprotocol – carries ATM, Frame Relay, DSL
    • IB (InfiniBand): hardware-offloaded, lightweight, no packet reordering; link-level flow control, lossless, QoS virtual lanes 0–14, RDMA verbs (add latency), UFM tool, 4000-byte MTU (adds latency, as this MTU must fill before transmission)
    • OPA (Intel Omni-Path Architecture): 100 Gbps; new 48-port switch silicon; silicon photonics; no HBA; ~50% less infrastructure vs IB; 100–110 ns per port; congestion control that reroutes traffic; MTU up to 10K
    • FC (Fibre Channel): optical packets in units of four 10-bit codes (4 codes = a transmission word); metadata sets up links and sequences; tools: Agilent, CATC, Finisar, Xyratex; FC frames are similar to Ethernet packets (ex: multiple frames assembled with source and destination)
    • IGMP: used by hosts and adjacent routers on IPv4 networks to establish multicast group memberships
  • Next Gen Firewalls (ex: Fortinet)
    • One platform end-to-end covering multiple security aspects: anti-virus, malware protection, intrusion detection, database and OS access controls, web filtering, web app security, user ID awareness, standard access rules, and internal segmentation (into functional security zones, which limits the spread of malware and mischief, identifies mischief, and quarantines infected devices); shares all info via its fabric with the whole network; Zero Trust Policy – places the FW in the network center, in front of data
    • Empow – Security Orchestration product
  • Kerberos
    • Client/server network authentication protocol using secret-key cryptography; stronger than traditional firewalls in one respect: firewalls focus on external threats, whereas Kerberos addresses internal ones
    • Each KDC has a copy of the Kerberos DB; the master KDC holds the realm DB, which is replicated to slave KDCs at regular intervals; DB password changes are made on the master; slaves grant Kerberos tickets for servers/services (time synchronization is critical) and create ACLs too; Kerberos daemons are started on the master, which assigns hostnames to Kerberos realms, ports, and slaves
    • Opportunity for Docker container security:
      • Kerberos for access to multiple levels of container types (ex: checking-account KYC vs withdrawal; account manager vs authenticated client)
    • IPTables – may opt to disable for ULL, rely on external FW
      • Sets up and inspects tables of IP packet-filter rules; each table has built-in “chains” and user-defined chains; chains list rules that match sets of packets
      • Required when servers act as NAT-aware routers
        • The router intercepts packets and determines the NAT address
      • (end of optional material)
  • Class Exercise – Determine whether single-tier networks improve ULL versus spine-leaf; if so, explain why. Several scenarios will be presented and students will architect networks on white boards (see the latency-budget sketch below).
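
A back-of-the-envelope sketch for the class exercise above; every figure here is an assumption for illustration, not a vendor spec. Fewer hops shrink the per-hop switching component of the one-way budget, which is the core argument for single-tier designs.

```python
# One-way latency budget; all figures are assumptions for illustration
SWITCH_HOP_NS = 350       # assumed cut-through latency per switch hop
FIBER_NS_PER_M = 5        # ~5 ns per meter of propagation in fiber
SERIALIZE_NS = 120        # assumed serialization of a small frame at 10 GbE

def one_way_ns(hops: int, fiber_m: float) -> float:
    # Cut-through switching: pay serialization once, switching latency per hop
    return hops * SWITCH_HOP_NS + fiber_m * FIBER_NS_PER_M + SERIALIZE_NS

spine_leaf = one_way_ns(hops=3, fiber_m=60)     # leaf -> spine -> leaf
single_tier = one_way_ns(hops=1, fiber_m=40)    # direct single-tier link

print(f"spine-leaf : {spine_leaf:,.0f} ns")
print(f"single-tier: {single_tier:,.0f} ns")
```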

 

Session 6 – Thu July 6 (1 ½ hour class, location TBD)

  • Special lab session: Python for algo trading, extra Wireshark training, and extra R / neural networks

 

 

Session 7 – Tue July 11

Complete and review topics from last week; start with the role of Big Data in alpha seeking and as input into ULL trading algos

 

 

Middleware, Analytics, Machine Learning – leading to end-to-end ULL Architectures

 Analytics  & Machine Learning:  to seek alpha and for infrastructure analytics

  • Intro to Big Data Analytics & Machine Learning (focus on neural networks)
  • Role of Java 8 new streams API
    • Speeds up extracting insight from large collections via methods such as:
      • filter, sorted, max, map, flatMap, reduce, collect
    • Use with ArrayLists, HashMaps (complements rather than replaces them)
    • A stream is a one-time-use object
  • Intro to Complex Event Processing (CEP) and Event Stream Processing (ESP)
  • Databases – Never in the path of ULL
  • Column based (contiguous memory) vs relational
  • kdb+ and OneTick – leading players in high-speed market data tick databases
  • Event Stream Processing (ESP) – use ESP to seek alpha
  • Combine market data with news sentiment analytics to seek alpha
  • Intro to RavenPack news sentiment analytics
  • Intro to Spark
  • Role of new storage technology (ex NVMe Flash drives)
  • In-mem analytics ex HANA, Spark
  • Corvil – intro to configuring Corvil appliances and analyzing FIX order flow with them
  • Machine learning / neural networks in R or Python – create equations to project latencies (see the sketch after this list)
  • Machine learning for Latency analysis, tuning insight, seeking alpha -trade opportunities
  • Programming for multi threaded trading risk analytics
  • Class Ex – output from Corvil streams will be provided; students will analyze it and determine how latencies can be projected using neural networks (design only – no programming), or we may do this together in class; Ted to obtain sample data as input to neural networks for latency predictions, plus sample data for alpha seeking
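
A minimal sketch of the latency-projection idea referenced above, using synthetic stand-in data rather than real Corvil output; the features, coefficients, and network shape are all assumptions.

```python
# Sketch: project latency from load features with a small neural network
# (synthetic stand-in data; all features and coefficients are assumptions)
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-ins for Corvil-style measurements: msgs/sec, burstiness, queue depth
X = rng.uniform([1e3, 1.0, 0.0], [5e5, 8.0, 64.0], size=(2000, 3))
# Assumed "true" relationship producing latency in microseconds, plus noise
y = (2.0 + 1e-5 * X[:, 0] + 0.4 * X[:, 1] + 0.05 * X[:, 2]
     + rng.normal(0.0, 0.2, 2000))

scaler = StandardScaler().fit(X)
model = MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
model.fit(scaler.transform(X), y)

# Project latency at a hypothetical load point
load = np.array([[2.5e5, 3.0, 12.0]])
print(f"projected latency: {model.predict(scaler.transform(load))[0]:.2f} uS")
```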

 

HOMEWORK – prepare for a 30-minute quiz at the start of Session 8 on Big Data analytics for ULL trading; key points will be stressed in class. Also: 2 weeks to complete a Visio integrating alpha-seeking opportunities into a ULL trading architecture – full details in class

 

 Session 8 – Tue July 18

Quiz, then review of the quiz; then cover high-speed messaging & middleware, infrastructure ROI, and cloud technologies for ULL, plus application-specific details regarding ULL apps

 Middleware, High Speed Messaging

  • 60East AMPS
  • 29West LBM (UME)
  • New FIX binary protocol, in beta, promises to lower latencies
  • Importance of High Speed messaging for Algo Trading
  • Intro to basic algos for trading equities (ex: VWAP, volume participation, use of AVX and RSI) – see the VWAP sketch after this list
  • How to back-test trading algos
  • How to conduct an ROI analysis for new ULL architectures
  • Why traditional cloud architectures fall short for ULL
  • Cloud for analytics – pitfalls vs best practices
  • Microservices potential
  • Class Ex – output from application logs will be provided; students will analyze it and determine how AMPS can be configured for both high-speed middleware and event stream processing for analytics
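
As referenced in the VWAP bullet above, a minimal Python sketch (window size and trades are illustrative) of the rolling VWAP benchmark that participation algos track.

```python
# Sketch: rolling VWAP over the last N trades (illustrative values)
from collections import deque

class RollingVWAP:
    """Volume-weighted average price over the last `window` trades."""
    def __init__(self, window: int = 1000):
        self.trades = deque(maxlen=window)    # (price, size) pairs
        self.notional = 0.0
        self.volume = 0

    def on_trade(self, price: float, size: int) -> float:
        if len(self.trades) == self.trades.maxlen:
            # Oldest trade will be evicted by the append; back out its stats
            old_price, old_size = self.trades[0]
            self.notional -= old_price * old_size
            self.volume -= old_size
        self.trades.append((price, size))
        self.notional += price * size
        self.volume += size
        return self.notional / self.volume

vwap = RollingVWAP(window=3)
for price, size in [(100.0, 200), (100.2, 100), (99.9, 300), (100.1, 100)]:
    print(f"trade {price} x {size} -> VWAP {vwap.on_trade(price, size):.4f}")
```

A participation algo would compare its own fills against this benchmark tick by tick.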

End-End ULL Architectures

  • Co-Lo with 500 ns order ack times (revisited with our new knowledge)
  • Dark pools
  • Algo Trading (servers, appliances, or FPGA’s) in the architecture
  • Smart Order routers
  • Prop Trading
  • Exchanges
  • Class Ex – Visio or white-boarding of a new trading system (TBD), applying everything learned in the course

 

Session 9 – Tue July 25

Review of the Visio assignment, then cover future ULL architectures

 Futures, including Cloud Architectures for ULL

Ted’s A-Team strategic projections for the next 2 years

  • Projections on new technologies’ impacts on ULL – may include:
    • new Intel cores & software
    • adoption of single-tier networks
    • impact of in-memory machine learning for alpha generation of trading signals
    • integration of deep machine learning from the cloud into live trading networks via high-speed interconnects, to an asynchronous queue, with NO latency impact
    • applicability of blockchains
    • site reliability engineering (SRE)

<open slot to catch up on past material + students’ questions pertaining to the future of ULL for ET>