"Sundry" lecture:
=================
* Cover a few different aspects of how computer systems interact with finance
* Mostly from the technical side
* A few financial aspects for color

3 topics in particular:
=================
* Distributed Systems: Consensus
* Programming Languages: OCaml
* Generative AI: BloombergGPT
* Not exhaustive in any way, just a particular set of examples
* Idea is to whet your appetite so that you can dig deeper as your interests take you
* All of these involve computer systems at the bleeding edge:
--> High-performance consensus, functional programming, and generative AI are all topics of mainstream research in the CS research world

Distributed Systems and Consensus:
=================
* Heart of many large-scale services today
* You may have heard of protocols like Raft and Multi-Paxos
* Typically give you reasonable performance
--> (~10K requests per second, a few ms latency)
* What if you want an order of magnitude better performance?
* One way to do it:
--> Build a highly customized network fabric
--> Use "bare-metal" technologies like RDMA
* Nezha's approach: use clock synchronization instead
* Much more in the Nezha slide deck

Programming Languages and OCaml:
================
* Paper: Caml Trading: experiences with functional programming on Wall Street
* Describes Jane Street's experience with OCaml
* Unusual for its time (and even now):
--> Most finance firms use C++ for performance in production or Python for productivity in research
* What does JS use OCaml for?
--> Live monitoring of risks and positions
--> Trading systems
--> Order management and transmission systems
--> Historical data management
--> Quantitative research

But, first, what is OCaml?
=================
* Functional programming language
* Functions are first-class citizens
* Declarative rather than imperative
* Typically associated with a few different features

Features of functional programming:
===============
* Higher-order functions (functions that take functions as arguments)
--> filter, map, fold, etc.
* Algebraic data types
* Strong static type systems
* Immutability
* Lambda functions
* Many other mainstream languages have functional features now:
--> Python, C++, Go, Java
* Functional programming is now more of a style of programming, rather than an attribute of specific programming languages

Important point that keeps coming up in the paper:
================
* The ability of a system "not to trade"
* cf. safety incidents like Knight Capital Group
* "one of the easiest ways that a trading company can put itself out of business is through faulty software"

What JS values about OCaml:
================
* Readability
* Performance
* Macros

Readability:
================
* Readability is critical for code that makes trading decisions
* Important for code reviews and catching errors before production
* Common practice in most of the software industry, but esp. important in trading
* Terseness: contrast a for loop vs.
a map function
* Immutability: arguments to functions are not mutable by default
* Pattern matching: algebraic data types: give examples of expressions and parsing
* Labeled arguments: so that you don't swap arguments
* Type systems: make illegal states unrepresentable:
--> Make the compiler work for you
* Polymorphic variants: avoid exceptions; instead, define erroneous states as part of the possible values of a data type
* Modularity

Performance:
=================
* They are mostly discussing performance _predictability_
* Easy to determine how fast a piece of code will run and how much space it is going to use
* Important for systems where responsiveness and scalability matter
* Can move garbage-collection work off the critical path
* Foreign function interface (FFI) to interact with C libraries, much like numpy does in Python
--> Need to do FFI to interact with certain native libs
--> Alternative is to use poorer FFI interfaces from other managed langs like C#
* Compiler is simple and straightforward to understand

Macros:
================
* Modify the language at the syntactic level
* camlp4 is a macro system that understands the OCaml AST and can be used to add new syntax to the system or change the meaning of existing syntax
* Allows you to modify/rewrite parts of a program into other parts using a rewrite engine
* Similar in spirit to the C pre-processor

OCaml drawbacks:
================
* Generic operations, e.g., generic printers
* Objects in OCaml can hamper productivity, esp. for programmers from other langs
* Lack of optimizations in the compiler
* Lack of parallelism
* Cathedral development model, cf. Eric Raymond's Cathedral vs. Bazaar
* Programming in the large: ecosystem, build tools, package manager, stdlibs

Some counterintuitive benefits:
================
* Hiring
* Easier for others to become productive in the language

What's unclear:
===============
* Could these benefits have accrued from other functional langs?
* Could we do this using functional programming styles in other mainstream langs?

Why it may have succeeded:
==============
* Stringent requirements for correctness
* Early success in OCaml for research made it easier to adopt as the primary language
* Small team size
* Specialized in-house software

BloombergGPT paper:
=================
* An LLM tailored to finance

What's interesting about it:
================
* A (smaller) LLM tailored to finance: only 50 billion parameters (GPT-4 is rumored to have 1 trillion+)
* Outperforms prior approaches on finance tasks
* Competitive on general-purpose language tasks

Dataset that went into training:
================
* Carefully curated
* Table 1: About half of the tokens are fintech-specific AND the remaining half are public: 363B vs. 345B

Financial datasets:
================
* WEB: crawl of high-quality websites that have financially relevant information, not a general crawl of the web
* NEWS: news sources, excluding news articles written by Bloomberg journalists
* FILINGS: financial statements made by companies and made available to the general public
* PRESS: company press releases
* BLOOMBERG: Bloomberg news, opinion, and analysis; real-time news

Public datasets:
===============
* The Pile (includes GitHub and FreeLaw)
* C4 (includes patents)
* Wikipedia

Tokenization:
==============
* Unigram tokenizer algorithm to decide how to split words into tokens
* Parallel tokenizer training

Training:
=============
* Tried to leverage recent results on Chinchilla scaling laws to decide how big the model should be (parameters) and how big the dataset should be (tokens)
* They arrive at 50B parameters and ~1,000B tokens
* They only have about 700B tokens, limited by the amount of domain-specific data
* This gives them some headroom for failures, restarts, etc.
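The Chinchilla-style sizing above can be sketched as back-of-the-envelope arithmetic. This is a toy illustration, not the paper's exact accounting: it assumes the common approximation that training compute C ≈ 6·N·D FLOPs (N = parameters, D = tokens) plus the Chinchilla rule of thumb D ≈ 20·N, so a compute budget pins down both N and D. The budget value below is chosen purely for illustration.

```python
import math

def chinchilla_sizing(flops_budget: float) -> tuple[float, float]:
    """Toy Chinchilla-style sizing (illustrative, not the paper's numbers).

    Assumes training FLOPs C ~= 6*N*D and the rule of thumb D ~= 20*N,
    so C ~= 120*N**2 and N = sqrt(C / 120).
    """
    n_params = math.sqrt(flops_budget / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# A hypothetical budget of ~3e23 FLOPs lands near the 50B-parameter /
# ~1,000B-token regime discussed above.
n, d = chinchilla_sizing(3e23)
print(f"params ~= {n / 1e9:.0f}B, tokens ~= {d / 1e9:.0f}B")
# -> params ~= 50B, tokens ~= 1000B
```

Note the gap this exposes: the compute-optimal token count (~1,000B) exceeds the ~700B tokens they actually have, which is one reason they stop short of the optimum and keep headroom in the budget.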
* Essentially: answers the question of how best to make use of a compute budget
* Used SageMaker for training
* Training chronicles document their various attempts along the way to getting to their final model

Evaluation:
==============
* Public fintech benchmarks
* Internal benchmarks
* General language benchmarks
* Main takeaway: competitive with much larger models on general tasks, but performs better on financial tasks

Use cases enabled by it:
===============
* Generating Bloomberg Query Language
* Suggesting news headlines given section content
* Financial question answering, e.g., who is the CEO of company X?

Openness:
=============
* Decided not to release the model
* Worries that public model weights could eventually lead to leaks of training data

Takeaways:
============
* Interesting domain-specific use of LLMs
* Not totally clear how it does both (1) better perf on fintech tasks AND (2) competitive perf on general tasks
* Likely because of all the attention that went into data cleaning
* Nicely integrates many cutting-edge AI techniques along with domain-specific data:
--> unigram tokenizer
--> Chinchilla scaling
--> public datasets
--> private datasets, ...
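To make the unigram tokenizer in the takeaways concrete: it assigns each vocabulary piece a probability and, for each word, keeps the segmentation that maximizes the product of piece probabilities (found with a Viterbi search). A minimal sketch with a hypothetical toy vocabulary — in a real tokenizer the pieces and their probabilities are learned during tokenizer training:

```python
import math

# Hypothetical toy vocabulary: piece -> probability. A real unigram
# tokenizer learns tens of thousands of pieces from the training corpus.
VOCAB = {
    "fin": 0.04, "tech": 0.05, "fintech": 0.001,
    "f": 0.01, "i": 0.01, "n": 0.01, "t": 0.01,
    "e": 0.01, "c": 0.01, "h": 0.01,
}

def unigram_tokenize(word: str) -> list[str]:
    """Viterbi search for the segmentation maximizing the sum of log-probs."""
    n = len(word)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-prob for word[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last piece
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in VOCAB:
                score = best[start] + math.log(VOCAB[piece])
                if score > best[end]:
                    best[end], back[end] = score, start
    # Walk the back-pointers to recover the winning pieces.
    pieces, i = [], n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return list(reversed(pieces))

print(unigram_tokenize("fintech"))  # -> ['fin', 'tech']
```

Here "fin" + "tech" beats both the rare whole-word piece and the character-by-character fallback, which is the core idea: frequent domain-specific strings end up as single high-probability pieces, so financial text tokenizes into fewer, more meaningful units.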