Hacker News
by asdasdsddd 695 days ago
I've interviewed many people on Alexa before. From what I gather, it's just a giant switch statement: each individual "path" takes a bunch of effort to support, and there are thousands of paths for music, ordering, commands, etc. It's peak "AI == if statement" architecture.
5 comments

Throwaway, used to work at the NLU unit of Alexa about 5 years ago. There is some ML going on, but as with all ML projects I have worked on, people want control. This means you add rules for the "important" stuff. You also add test cases to make sure the ML works. But if you already have those test cases, why not just match on them directly? There are also advanced techniques for generating examples (FSTs, for example).

What this culminated in is a platform where 80% of requests, and pretty much 99% of "commands", are served by rules built with a team of linguists.

Is this not actually kind of powerful? Having linguists write up a bunch of rules seems a lot more predictable than "rolling a bunch of dice and hoping that some LLM spits out a coherent set of steps".

It feels very fractal but on the other hand if Alexa has only a specific gamut of responses it's not exactly a limitless state space right?

Very curious about what those rules look like, though.

The problem is it's completely undiscoverable. You can tell Alexa "play some music" because you're pretty sure one of these linguists added a rule for that. But can you tell it "play me a song that lasts longer than 5 minutes"? Doubtful. The only way to know is to try it.

The problem is the space of possible commands is waaaaay bigger than the space of commands you can manually handle, which means if you just randomly try stuff 95% of the time it won't work. Users learn that very quickly and end up sticking to the few commands they know work.

The one exception is "search" queries - "how tall is Everest" and so on, but that only really works well on Google's platform because they've done all the work for that already.

Contrast that with LLMs, which basically at least understand everything you're asking of them. If you give them a simple API to carry out actions, they can do really complex commands like "send a WhatsApp to my wife telling her when I'll get home if I start cycling in 10 minutes". That's impossible without LLMs but pretty trivial with them.

Obviously the downside is they are prone to bullshitting and might do completely the wrong thing.

It’s worse than that. These systems can be adapted by looking at failed user commands, but people don’t really sit around trying out fun things and watch it fall on its face for longer than the first day or so. After that, the novelty wears off, so you’ve trained your users to accept the device’s limitations. Then, even when you do improve the functionality, your users won’t know! They won’t try it, and those commands will never get traction in the system or get more testing beyond the initial launch criteria. It’s a death spiral. The same thing happens with the tone of voice people use.
> people don’t really sit around trying out fun things and watch it fall on its face for longer than the first day or so.

Isn't this what thumbs-up/down RL is for? To improve the quality of the results.

That’s the intention, but very few users enjoy being unpaid QA for trillion-dollar corps.
I am confused as to why it's more undiscoverable than, say, some LLM.

> The problem is the space of possible commands is waaaaay bigger than the space of commands you can manually handle, which means if you just randomly try stuff 95% of the time it won't work. Users learn that very quickly and end up sticking to the few commands they know work.

This is not strictly true. Context-free grammars can be written to handle (finite) sentences of arbitrary length! If you have a rule like "play me <song>", then <song> can be "a song that lasts longer than X" or "a song by <artist>" (and then <artist> can be "<some name>" or "some German singer" or whatever...). You can just keep on going.
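To make the grammar idea above concrete, here is a minimal Python sketch of a recognizer for that kind of rule set. Everything in it is an assumption for illustration: the `GRAMMAR` table, the `<command>` and `<number>` nonterminals, and the specific phrasings are made up and are not Alexa's actual rules.

```python
from typing import Iterator, List

# Hypothetical toy grammar: each nonterminal maps to a list of alternatives,
# and each alternative is a sequence of terminals or <nonterminals>.
GRAMMAR = {
    "<command>": [["play", "me", "<song>"]],
    "<song>": [
        ["a", "song", "that", "lasts", "longer", "than", "<number>", "minutes"],
        ["a", "song", "by", "<artist>"],
    ],
    "<artist>": [["taylor", "swift"], ["some", "german", "singer"]],
    "<number>": [["3"], ["5"], ["10"]],
}


def match(symbols: List[str], tokens: List[str], pos: int) -> Iterator[int]:
    """Yield every token position reachable after matching `symbols` from `pos`."""
    if not symbols:
        yield pos
        return
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:
        for alternative in GRAMMAR[head]:
            # Expand the nonterminal, then continue with the remaining symbols.
            yield from match(alternative + rest, tokens, pos)
    elif pos < len(tokens) and tokens[pos] == head:
        yield from match(rest, tokens, pos + 1)


def accepts(sentence: str) -> bool:
    """Return True if the whole sentence can be derived from <command>."""
    tokens = sentence.lower().split()
    return any(end == len(tokens) for end in match(["<command>"], tokens, 0))
```

With this toy grammar, `accepts("play me a song that lasts longer than 5 minutes")` succeeds, while `accepts("play me a podcast")` does not. Adding alternatives extends coverage, but only to phrasings somebody wrote down.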

> The one exception is "search" queries - "how tall is Everest" and so on, but that only really works well on Google's platform because they've done all the work for that already.

Had a small Google Assistant thingy for years, and that search stuff works great, until it doesn't, and completely misses the mark. This immediately kills trust and reduces it to a gadget that I will only use for non-critical stuff, always expecting it to break anyway.

> But can you tell it "play me a song that lasts longer than 5 minutes"?

I don't think even pre-LLM technology allows you to do this.

I can't do something as basic as go to Spotify's search page and filter "only genres I like"; neither a smart version of that filter nor a manual version is possible.

Honestly, the FSTs themselves were actually really cool; it's very much GOFAI. They automatically create lots of permutations, e.g. `play taylor swift`, `please play taylor swift`, `play taylor swift now`, etc. And once the FST is built, it always works deterministically. It's compiled to a graph, and an incoming command is pushed through the state machine; if you get to an end state, it "matched the FST" and some specific behaviour is triggered.
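A rough Python sketch of that "compile to a graph, push the command through" idea, under loud assumptions: this is a plain finite-state acceptor with optional tokens, not a full transducer, and `PATTERN`, `compile_pattern`, and `run` are hypothetical names, not anything from the Alexa codebase.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical pattern: a sequence of (token, is_optional) slots.
PATTERN = [("please", True), ("play", False), ("taylor", False),
           ("swift", False), ("now", True)]


def compile_pattern(
    pattern: List[Tuple[str, bool]],
) -> Tuple[Dict[int, Dict[str, int]], Set[int]]:
    """Build transitions[state][token] -> next state, plus the accepting states.

    State i means "the first i pattern slots have been matched or skipped".
    """
    n = len(pattern)
    transitions: Dict[int, Dict[str, int]] = {i: {} for i in range(n + 1)}
    accepting: Set[int] = set()
    for state in range(n + 1):
        j = state
        while j < n:
            token, optional = pattern[j]
            # From `state` we may consume slot j directly, skipping optional slots before it.
            transitions[state].setdefault(token, j + 1)
            if not optional:
                break
            j += 1
        else:
            # Loop ended without hitting a required slot: only optional slots remain.
            accepting.add(state)
    return transitions, accepting


def run(transitions: Dict[int, Dict[str, int]], accepting: Set[int], utterance: str) -> bool:
    """Push an utterance through the machine; True means it reached an end state."""
    state = 0
    for token in utterance.lower().split():
        if token not in transitions[state]:
            return False
        state = transitions[state][token]
    return state in accepting


TRANSITIONS, ACCEPTING = compile_pattern(PATTERN)
```

The one pattern covers all of the permutations listed above (`play taylor swift`, `please play taylor swift`, `play taylor swift now`), and the same input always takes the same path through the graph.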

The rules were really just strings, and we had efficient matching against them. I didn't work on that; I would assume some sort of LHS.

What do all these acronyms mean?
Wouldn't that just be some kind of NLP https://en.wikipedia.org/wiki/Natural_language_processing?

It may just be a long list of if/else and/or switch statements, or something isomorphic to that.

Was there a knowledge engine of some sort in the past? I could ask it some questions like “what color is a light red flower” and I would get back “a pink flower is pink.” Asking what color a purple cat was would get back purple… but asking what color a blue bird was would get back “a blue bird is blue, red, and brown.”

“Who has birthdays today?” And I would get a list of famous people with birthdays today. I could also ask if Alice and Bob (two names in the list) had the same birthday and I would get an answer (one time I think I got back some internal query language for it instead… but that’s lost in old bug reports).

Now any interesting question starts its answer with “according to an Alexa answers contributor…”

Later on, there were methods to generate FSTs themselves without manual human curation.
> You also add test cases to make sure the ML works. But if you already have those test cases, why not just match on them directly?

Kind of says it all.

At the end of the day if you have a complex product but don't have comprehensive test cases, it's just a matter of time until your users notice your product sucks.

This is exactly how Cortana and Google Assistant have been built as well.
IMO, given my experience with Siri being AI-style unreliable in many ways (like bit flips, where saying "turn off the lights" makes the dimmer go to 100%), I think it's better to use the switch statement for the dozen or so query types that probably represent 90% of traffic (weather, music, home control, unit conversions, etc.) in exchange for way more reliability.
I think describing the NLU Engine as a switch statement is underselling it a bit. Determining domain and intent alone requires more than that (frequently, at least).
Do you mean actual coded if statements, as in human-written code like

    if (question.match(/^what is (.*)/)) return wikipedia.search(question)
or something more automated?
Not like that, but once you get to the first-party Alexa Skills themselves, there's a bunch of match rules that make up a big chunk of the traffic. Don't know exactly how much. The long tail is done through ML means.
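As a hedged illustration of "match rules first, ML for the long tail", here is a small Python sketch. The `RULES` table, the intent names, and `ml_fallback` are all hypothetical stand-ins; the real system's rule format and models are not described in this thread.

```python
import re
from typing import Dict, Tuple

# Hypothetical hand-written rules: (compiled pattern, intent name).
RULES = [
    (re.compile(r"^(?:please )?play (?P<query>.+?)(?: now)?$"), "PlayMusic"),
    (re.compile(r"^what(?:'s| is) the weather(?: in (?P<query>.+))?$"), "GetWeather"),
]


def ml_fallback(utterance: str) -> Tuple[str, Dict[str, str]]:
    """Stand-in for a statistical model; a real system would call a classifier here."""
    return "Unknown", {}


def resolve(utterance: str) -> Tuple[str, Dict[str, str], str]:
    """Return (intent, slots, source), trying deterministic rules before the model."""
    text = utterance.lower().strip()
    for pattern, intent in RULES:
        m = pattern.match(text)
        if m:
            # Keep only the named groups that actually captured something.
            slots = {k: v for k, v in m.groupdict().items() if v is not None}
            return intent, slots, "rule"
    intent, slots = ml_fallback(text)
    return intent, slots, "ml"
```

For example, `resolve("Please play Taylor Swift now")` is served by a rule, while anything the rules miss falls through to the (stubbed) model.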
> I've interviewed many people on Alexa before.

I got a recruiting message once for an ML engineer position in Alexa; ignored it.