| I made a brief, aborted attempt at a restaurant recommendation service. We wanted to hydrate our data with existing pictures of dishes from the restaurants, sourced from Yelp and/or Google Image Search. After looking at that data, I realized that a human touch in picking the right images would make a huge difference to the service. We're talking thousands of restaurants that we wanted food pictures from, and each restaurant had dozens of images we could pull, so tens of thousands of images needed to be sifted through. I figured that with the right tooling, my cofounder and I could put together something really nice that would only need an hour or so of maintenance a day to keep up. So I built a pipeline that used very basic, easy to build and maintain 'dumb' Rails asset pipeline pages to present data for sifting. Go to the endpoint, and it shows you the name of the restaurant and a bunch of images; you select one, type in a name for the dish, and it saves that to the database and puts up another page of images. It took me bitching up a storm to get him to even look at it. He complained about how long he thought it would take, while I just got to work. It took maybe three weeks to prototype our app. One thing I learned in the process is that if you're looking at a bunch of Southern food, for some reason the picture of shrimp and grits always looks the most appetizing. I was well on my way to classifying the data and figuring out novel ways to present it when I had to make the determination that there wasn't good cofounder fit. So now I work with CNN. But now all my side projects revolve around ways to get human attention to improve automated tasks. I suppose one of these days I'll get the right idea and/or the right cofounder and I'll give it another go. There's a wealth of usable information out there on the web that one can build businesses on top of, if one is only willing to apply a little elbow grease to clean it and turn it into data. 
It's far easier to scrape data using a regular web browser and a custom browser extension than to try to build out headless infrastructure. But no one wants to do it. |
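For anyone curious what that kind of sifting queue boils down to, here is a minimal sketch in Python with SQLite rather than the Rails pages described above. The table names, column names, and function names are all made up for illustration; the only parts taken from the description are the flow itself: show one restaurant with its candidate images, record the chosen image and a dish name, then move on to the next restaurant.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Create the two hypothetical tables: scraped candidates and human picks."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS candidates (
            id INTEGER PRIMARY KEY,
            restaurant TEXT NOT NULL,
            image_url TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS picks (
            restaurant TEXT PRIMARY KEY,
            image_url TEXT NOT NULL,
            dish TEXT NOT NULL
        );
    """)
    return db


def next_page(db):
    """Return (restaurant, [image_urls]) for the next unreviewed restaurant, or None."""
    row = db.execute(
        "SELECT restaurant FROM candidates "
        "WHERE restaurant NOT IN (SELECT restaurant FROM picks) "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    urls = [
        url
        for (url,) in db.execute(
            "SELECT image_url FROM candidates WHERE restaurant = ? ORDER BY id",
            (row[0],),
        )
    ]
    return row[0], urls


def save_pick(db, restaurant, image_url, dish):
    """Store the reviewer's chosen image and the dish name they typed in."""
    db.execute(
        "INSERT INTO picks (restaurant, image_url, dish) VALUES (?, ?, ?)",
        (restaurant, image_url, dish),
    )
    db.commit()
```

A web endpoint would just call `next_page` to render the form and `save_pick` on submit, which is why the pages themselves can stay 'dumb'.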
Our main initiative was creating a heuristic-based classifier (think lots of regex). On my own initiative, I trained ML classifiers while we worked on it. As development went on, the ML classifiers were rapidly catching up with the heuristic-based one. Unfortunately it was kind of a one-off data processing task, and when time ran out the regex machine was still in the lead.
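To make "think lots of regex" concrete, here is a rough Python sketch of a first-match-wins rule classifier plus the accuracy score you would use to race it against an ML model on the same labeled set. The labels and patterns are invented for illustration (assuming legal-document categories); they are not the actual rules from that project.

```python
import re

# Hypothetical rules, checked in order; the first pattern that matches wins.
RULES = [
    ("lease", re.compile(r"\b(landlord|tenant|lessee|lessor)\b", re.I)),
    ("nda", re.compile(r"\bconfidential(ity)?\b|\bnon-disclosure\b", re.I)),
]


def classify(text, default="unknown"):
    """Return the label of the first rule whose pattern matches the text."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return default


def accuracy(predict, labeled):
    """Fraction of (text, label) pairs that the predict function gets right."""
    hits = sum(1 for text, label in labeled if predict(text) == label)
    return hits / len(labeled)
```

Any ML classifier exposing a `predict(text)` style function can be passed to the same `accuracy` helper, so both approaches are scored on identical data.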
I was modestly proud of the legalese DSL generator I wrote up. The lawyers didn't even know they were writing CoffeeScript as they typed out what documents were, what key dates were, etc. :D
That CoffeeScript formed the basis of our accuracy testing suite. It was as fundamental as it was huge. That team ended up creating a couple thousand tests in less than a month.
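The general shape of that idea, declarative expectations written by domain experts and replayed as accuracy tests, looks something like the Python sketch below. This is not the CoffeeScript DSL itself; the field names (`doc`, `kind`, `key_dates`), the sample file name, and the `extract` callback are all assumptions made up for illustration.

```python
from datetime import date

# Hypothetical expectations, one entry per document, as an expert might declare them.
EXPECTED = [
    {
        "doc": "lease_001.txt",
        "kind": "lease",
        "key_dates": {"start": date(2014, 1, 1)},
    },
]


def check(expected, extract):
    """Compare extractor output to the declared expectations; return failure messages."""
    failures = []
    for case in expected:
        got = extract(case["doc"])
        if got.get("kind") != case["kind"]:
            failures.append(
                f'{case["doc"]}: kind {got.get("kind")!r} != {case["kind"]!r}'
            )
        for name, want in case["key_dates"].items():
            have = got.get("key_dates", {}).get(name)
            if have != want:
                failures.append(f'{case["doc"]}: {name} {have!r} != {want!r}')
    return failures
```

Because each expectation is just data, adding a test means adding an entry, which is how a team of non-programmers can pile up thousands of them quickly.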