From Speech Technology to Banking



By Paul Taylor, Chief Executive Officer

June, 2018

A question I am frequently asked is how on earth I got from speech technology and AI into banking. Well, here goes.

Text-to-Speech

The high point of my previous career was the launch and subsequent success of the Google text-to-speech system. This went live in June 2012 as part of the new Google Now service, which enabled users to do complicated voice searches and to ask the service simple questions. The launch was a success and the service is now used for all of Google’s spoken output, including driving directions, interactive voice search and voice-assisted technologies. The text-to-speech system works over the internet and on-device, and as of 2015 has been installed on over 1 billion Android devices.

Our team’s main contribution was to provide a highly natural, pleasant-sounding voice. Users do not tolerate bad speech synthesis: it grates on the ear and they generally turn it off if they can. Only by developing something of very high quality would the service be deemed usable.

So how was it done? The text-to-speech system used the unit-selection framework, a technology with a long pedigree, developed over the years by world-class labs and researchers. Essentially, one records a speech database from a single speaker. Each sentence is split into phonemes and these form the “units” of the algorithm’s name. The sound pattern of a particular unit of speech is highly context dependent: for example, in “top” the /t/ sound has plenty of aspiration, whereas in “stop” the /t/ is quite different in nature and is more of a short (50 ms) section of silence. Because of these context effects, recording one /t/ and using it everywhere would sound jarring and unnatural. In fact, one has to record thousands and sometimes tens of thousands of examples of each phoneme to get enough units to cover the space of required sounds properly.

Once a database has been collected and analysed, it is compiled into an efficient run-time format. When a sentence comes in, it is converted into phonemes and then the database is searched for the “best” sequence, meaning the sequence whose units are the most appropriate for their contexts and which also fit together with the least jarring joins. Doing this the dumb way would require hours of processing time for a single sentence. But a combination of dynamic programming, clever indexing, customised data models, caching, pruning, and search optimisation means that a sentence can be generated about a hundred times faster than the time it then takes to play the audio.
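To give a flavour of the dynamic programming part, here is a minimal sketch of a unit-selection search. It is an illustration only, not the Google code: the `Unit` class, the single `pitch` feature and the two cost functions are hypothetical stand-ins for the much richer features and costs a real system would use, and none of the indexing, caching or pruning is shown. The point is that the best sequence is found in one left-to-right pass, keeping the cheapest path to each candidate, rather than by scoring every possible combination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded instance of a phoneme (hypothetical, heavily simplified)."""
    phoneme: str
    pitch: float  # stand-in for the many acoustic features of a real unit


def target_cost(unit, wanted_pitch):
    """How poorly a unit matches what the sentence calls for (assumed cost)."""
    return abs(unit.pitch - wanted_pitch)


def join_cost(left, right):
    """How jarring it sounds to play `right` straight after `left` (assumed cost)."""
    return abs(left.pitch - right.pitch)


def select_units(targets, database):
    """Return (best unit sequence, total cost) for a list of (phoneme, wanted_pitch).

    `database` maps each phoneme to its list of recorded candidate units.
    """
    candidates = [database[phoneme] for phoneme, _ in targets]

    # costs[i][j]: cheapest total cost of any sequence ending in candidate j at position i
    # back[i][j]:  index of the previous candidate on that cheapest sequence
    costs = [[target_cost(u, targets[0][1]) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]

    for i in range(1, len(targets)):
        row_costs, row_back = [], []
        for unit in candidates[i]:
            # Only the best way of arriving at this unit needs to be remembered.
            best_k = min(
                range(len(candidates[i - 1])),
                key=lambda k: costs[i - 1][k] + join_cost(candidates[i - 1][k], unit),
            )
            arrive = costs[i - 1][best_k] + join_cost(candidates[i - 1][best_k], unit)
            row_costs.append(arrive + target_cost(unit, targets[i][1]))
            row_back.append(best_k)
        costs.append(row_costs)
        back.append(row_back)

    # Pick the cheapest final unit and follow the back-pointers to the start.
    j = min(range(len(costs[-1])), key=lambda k: costs[-1][k])
    total = costs[-1][j]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


# Toy database for the word "top": two /t/ units, two /o/ units, one /p/ unit.
database = {
    "t": [Unit("t", 100.0), Unit("t", 120.0)],
    "o": [Unit("o", 118.0), Unit("o", 90.0)],
    "p": [Unit("p", 121.0)],
}
targets = [("t", 110.0), ("o", 110.0), ("p", 110.0)]
best_path, best_total = select_units(targets, database)
```

With N phonemes and K candidates per phoneme, this pass does on the order of N × K² work, whereas trying every combination would cost K to the power N, which is why the dumb way takes hours.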

None of this was built overnight. In fact, by the time we got to launch the Google system in 2012, I had been working on this problem since 1989. I once quipped (and later regretted) in an interview that “it only took Michelangelo seven years to paint the Sistine Chapel”. The formative years were spent at Edinburgh University, where I completed my PhD in 1992. I spent the next 8 years there, culminating in becoming director of the University’s Centre for Speech Technology Research in 1999. For most of this time, I was working on text-to-speech, and the Festival System that I co-authored with Alan Black and Richard Caley is still used as a teaching and research tool. (And if you want some proof I could once code, check out the GitHub archive.)

The 1990s at Edinburgh University were a great time. The University had a vibrant AI, machine learning, speech technology, linguistics, cognitive science and computer science community and it was a wonderful learning environment. When I see the attention AI gets today I honestly don’t quite know how to react - in the 1990s there were perhaps a few hundred of us in the UK, we mostly knew each other, and we kept ourselves to ourselves, mainly because, from a commercial perspective, nothing much really worked. It was tremendous fun though. The conceptual problems we debated then were the same ones as today - should machines have a moral code? Could they be dangerous? What about people’s jobs? And so on. But these were debates among an introverted community. I am still dumbfounded that I see an article on AI in the press every day, with even prime ministers and presidents chipping in.

Startup Life

I co-founded my first tech company, Rhetorical Systems, in 2000, and from almost the first moment I realised that this was the life for me. Apart from an interlude at Cambridge in 2004-2006, I left the academic life behind.

Life at Rhetorical was the usual roller coaster of tech startup life that has been documented plenty of times, so I won’t bother here, except to say there was very little guidance then as to how to do it. Business books were geared towards large corporations, and the few startup books that existed were about opening shops or restaurants or other “normal” businesses. Anyone starting a company now at least has a great community and a breadth of reading material to fall back upon.

Rhetorical developed the text-to-speech technology I had previously invented in the university. It was a great success in terms of quality but less so commercially. The reason was that even once the technology gains acceptance, it is difficult to sell it into the wider market. People might want to use it but are reluctant to pay for it. This problem is still pervasive in the wider AI market: Alexa, Siri and Google Home make extensive use of AI and speech technology, but these are provided by the large tech companies, not standalone startups. Often the technologies the large tech companies deploy were invented in startups and came to the tech companies through acquisition, so it is clear both types of company need each other. Inevitably, Rhetorical was acquired in 2004, after which I spent two years in Cambridge University’s Engineering Department.

In 2006 I co-founded Phonetic Arts, this time as CEO. We did a pretty similar thing, but this time the market was the entertainment industry. Again, the technology was excellent, but finding a paying customer was still hard. In 2010 history repeated itself, this time when Google phoned me one day and asked if we wanted to do a deal. The 5 months of tortuous negotiation that followed could fill a book, but the ending was happy and at the end of 2010 Phonetic Arts was acquired and the team joined Google and became the Google text-to-speech team.

Google

Google was great fun, and it was a relief not to be worrying about making payroll each month. The culture at Google was great, but what really fascinated me wasn’t the free lunches and back rubs but how Google had built such a scalable company. While Google today is pretty big, it wasn’t always so. But even in the early days a few hundred engineers could build products that came to be used by tens of millions. When I joined the Android team it had only 300 engineers but had already shipped to 500 million phones.

There are many secrets here. First, only hire excellent engineers and let them be the heart of the tech company. While many companies might aspire to this, very few really do it. At best, they hire a few stars, but for the more “boring” or “regular” engineering jobs, stars aren’t needed. The Google approach was to hire stars at all levels, and eliminate boring jobs by writing code that does the job automatically. There are very few manual processes at Google at all.

Second, the continuous integration, deployment and monorepo system. It was incredible to see Google having billions of lines of code in a single repo, all building and testing continuously and automatically. This means that any engineer can reuse another one’s code without having to ask for permission, or having to do anything special at all: you just set up the dependency and it all compiles and runs. Google does not have any “cruft” in its code base - everything compiles, runs, is documented and is tested. The code is written to an exceptionally high standard and is simple to read and understand. (Contrast this with the banking world, where it is common to find systems where no-one has any real understanding of how components written years ago work anymore, nor any idea how they would be removed or replaced.)

Finally, one should note that often the technology deployed in Google isn’t particularly original in itself; it’s just that it is done exceptionally well, in terms of speed, stability and availability (think of Gmail). But doing simple things well isn’t necessarily easy - a system like Gmail handles huge volumes of email globally and always has to work quickly and effectively. If one thinks of Gmail as a transactional system, where each email is a transaction, one can start to see how this overlaps with banking.

While Google was great, once an entrepreneur, always one, and once my “time” was up, I left at the end of 2013. Like a sous chef who has learned in a great restaurant, I felt that after leaving Google, I was leaving with the recipe of how to build a great tech company. And with this in mind, I founded Thought Machine.

Building the Team

My top priority was to create a genuinely world-class team, and infuse it with a spirit of engineering excellence. The key to doing this is to have a vision of a company that makes people want to join. This is a combination of a great work environment, an interesting domain, a huge market and IPO potential. I wasn’t going to build a company for an early acquisition again. I first turned to former colleagues who I knew had engineering excellence, and who would want the same company goals as me. In this way, I hired Fabian, Pebers and Will, all from Google.

Eventually you run out of candidates you know, so I hired an internal recruiter, Hung Lee, who would work full time on building the team. He hit on the idea of holding an event, where candidates would come and mingle with the team and I would give a talk about the company. This was an instant success and we have continued hiring this way ever since, with more than 1,000 candidates coming to Thought Machine hiring events over the last 4 years.

I thought it more important to build a great team before deciding on the precise issue…

Picking the Problem

The general area was easy enough: with London fast becoming the fintech capital of the world, fintech was the obvious choice. For the first year of Thought Machine, we worked with banks in various areas of consumer finance. Banks’ appetite for doing something new was clear, and we weren’t short of choices. But it was clear that most in fintech were operating around the periphery, and this was because touching the core - the mythical core banking platform - was deemed to be beyond reach. Interfacing with core banking engines was hard as they had often been built decades before in COBOL and ran on mainframes. They were a long way from a cloud-hosted, API-driven microservice architecture.

I saw the business case clearly when I talked to challenger banks. It would take them years to build their own banking systems, and buying from existing third parties was difficult and expensive as none used modern architectures. There was a clear need, or gap, in the market.

Nearly everyone told us not to do this: mainframe core banking is a constant, banks would never change, and the problem was too difficult and intractable. Regarding the last issue (is core banking difficult?), I refused to back down. Why would calculating payments and interest, and recording the results, be so hard? And of course having so many people tell you how hard it is just encourages the ambitious engineer!

So Will Montgomery and I had a series of long chats, and Will started building what became Vault. After a few months we had something, and demoed it to some banks. The feedback was immediately encouraging, which led us, gradually over the course of 2016, to dedicate the whole company to this product. As we built more and more, the commercial interest grew rapidly, and we then saw that the opportunity was much bigger than we had originally thought: not only would we build a core banking engine in the cloud, but in fact a full retail bank, with treasury systems, financial reporting, CRM, credit risk decisioning and dozens of other components.

If one wanted to be critical, one could say there is nothing particularly magic or original in what we are doing: we are simply applying the best practices of agile cloud computing to the retail banking problem. But that is exactly the point - you have to start with the mindset of cloud computing and build the specific system from scratch, staying pure to that methodology. You can’t evolve from the way the banking industry has been doing it for decades.

And that is the story of how I got from text-to-speech to cloud-based banking platforms.