Hacker News
by 11thEarlOfMar 3266 days ago
I ran across this article while researching a stock and as I read, I kept thinking, "This was not written by a person. This was written by software." [0]

I checked the attribution, and there is a person's name on it. Sure, any hack can write and publish and this is probably just another example. But the odd style doesn't even strike me as 'writing the way I think' or writing and publishing quickly without editing. For example, from the 2nd paragraph, "The corresponding low also paints a picture and suggests that the low is nothing but a 97.89% since 11/14/16." I can't gather any meaning from that statement, yet it has oddly specific details.

I am not glad to see this trend and not glad that Google is embarking on this path. I suppose it is inevitable, but unless there is expertise built into this AI that can extract meaning from data on my behalf and present it in a way that is more insightful and interesting than I am, it will become yet another source of chaff I'll have to filter.

Can we at least, please, flag AI generated prose as such?

[0] https://www.nystocknews.com/2017/07/05/tesla-inc-tsla-showca...

10 comments

That "author" published 87 such articles on July 7 2017[0], and a total of over 8,500 so far this year[1].

[0] https://www.nystocknews.com/author/mack-tyler/page/9/

[1] https://www.nystocknews.com/author/mack-tyler/page/854/

As far as I can tell:

None of the authors are real people, the website is registered behind an anonymization service, there is no company registered under their name, the address of their office doesn't exist, and the phone number connects to a 'subscriber not in service' message...

If you look at their Google Ad ID, it was used in the past on the now defunct "TheSportsTruth.com" -- which looks like it primarily existed to shuffle people to a supplement site. From there, there are a ton of links to other random affiliate schemes with sports, 'internet marketing', etc. No sense outing anyone, but I believe I figured out who's behind a few dozen of these shitty sites. The NYStockNews site seems to make its money by referrals to some penny stock scam sites.

It's crazy how much 'content' on the internet exists solely to get people to click on links to supplements & penny stock scams.

How did you find out that stuff about the Google Ad ID? Especially associating it with past websites.
In the source of every page with Google ads, there's a "Publisher ID" which is a unique identifier for the account. In the case of NYStockNews.com, it shows up as:

> google_ad_client: "ca-pub-6009540024781990"

From there, there are specialty services that keep track of these IDs over time; otherwise, you can just search for the trailing digits on Google.
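As a rough sketch of the first step described above, the snippet below pulls `ca-pub-` IDs out of saved page source and groups sites that share one. The site names and HTML fragments are hypothetical placeholders; only the ID itself is taken from the comment, and the function names are made up for illustration.

```python
import re

# AdSense publisher IDs appear in page source as "ca-pub-" followed by digits.
PUBLISHER_ID_RE = re.compile(r"ca-pub-(\d+)")


def find_publisher_ids(html: str) -> set[str]:
    """Return the distinct numeric publisher IDs found in a page's HTML."""
    return set(PUBLISHER_ID_RE.findall(html))


def sites_by_publisher(pages: dict[str, str]) -> dict[str, set[str]]:
    """Group site names by the publisher IDs found in their HTML."""
    groups: dict[str, set[str]] = {}
    for site, html in pages.items():
        for pub_id in find_publisher_ids(html):
            groups.setdefault(pub_id, set()).add(site)
    return groups


# Hypothetical saved page sources; only the ID comes from the thread.
pages = {
    "site-a.example": '<script>google_ad_client: "ca-pub-6009540024781990"</script>',
    "site-b.example": '<ins data-ad-client="ca-pub-6009540024781990"></ins>',
    "site-c.example": "<p>no ads here</p>",
}

for pub_id, sites in sites_by_publisher(pages).items():
    print(pub_id, sorted(sites))
```

This only covers pages you have already fetched; finding which other domains ever carried the same ID still needs a search engine or one of the tracking services mentioned.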

When doing so for "6009540024781990", a few sites come up, GDPInsider.com - Another stock bot-written site, and then a dead link with a Google Cache:

https://webcache.googleusercontent.com/search?q=cache:Ccudn6...

Using various other tools, you can see the domain registration information over time, identify which servers hosted it, or just find out who was linking to a domain earliest. Reddit is a great site for the latter. Oftentimes, when a 'marketer' sets up a site like these, they immediately run to social media to try and promote it. If you can find the first time it's linked publicly, you can often find out who posted it.

That last part is actually one of the ways they tied Ross Ulbricht to The Silk Road -- They found the first public mention of The Silk Road online (The post: https://www.shroomery.org/forums/showflat.php/Number/1386099...) which was written by 'Altoid' and directed users to a Wordpress page that had been set up a few days earlier. They then found a series of posts on BitcoinTalk by 'Altoid' looking for an IT Pro in the Bitcoin community with instructions to email Ross.Ulbricht@gmail.com if they were interested in a job... He was doing deeply illegal stuff and couldn't be bothered to mask his ID, imagine how easy it is to find rando affiliate marketers.

I work for Automated Insights, a company that makes a SaaS platform very similar to what's in the article. Here's an example "in the wild" of the content we produce - http://www.thenewstribune.com/news/business/article158774809...

Many of your criticisms are totally valid. Lots of the phrasing is awkward - even the lede is really bad ("Tesla, Inc. (TSLA) has been having a set of eventful trading activity"...wat). And it feels really deceptive to put a human byline on an automated article.

We're pretty open about the fact that our solution to this problem is not "magical" at all [1, 2] - it's good, old-fashioned automation. This approach allows our customers to QA their content heavily before pushing it to production, which eliminates many of the problems with awkward/incorrect phrasing that people who rely more heavily on machine learning tend to run into. And the news articles we publish always have a note at the end saying that they were generated by Automated Insights, and don't include a human byline.

There is real value in this type of reporting - a recent study [3] found that the articles we produce for less well-known publicly-traded companies have increased the trading volume for those companies. The idea is that, yes, the content is fairly formulaic, but there's now reporting on companies that had very little coverage before we existed. There are similar arguments for mass personalization work we've done for companies like Activision and Yahoo - having prose that describes raw data (even if it is formulaic to an extent) is often better than not having prose.

[1] https://automatedinsights.com/blog/the-state-of-artificial-i...

[2] https://automatedinsights.com/blog/creating-great-automated-...

[3] https://insights.ap.org/industry-trends/study-news-automatio...

I don't understand what value the prose provides over spending the same amount of effort producing clear, easy to read infographics.

Instead of producing awkward and difficult-to-read English sentences, why not use the same content generator to produce completely accurate and easier to read dynamic data visualizations?

If you do automated content well, it's not awkward and difficult to read ;)

As far as visuals vs prose, I see it as "both-and" rather than "either-or". And in addition to our journalism and personalization work, we also integrate with interactive visualization tools like Tableau.

Increased the trading volume? Seriously, you call that value? What are you smoking? That's called pump and dump and it's illegal, my friend, and even if it isn't actually being dumped, it's downright sleazy car-salesman behavior to me. Was that recent study also automated?
Increased trading volume generally just means better price discovery. Why do you think increased trading volume means it's "pump and dump?" When I trade SPY, I increase the trading volume in the underlying S&P 500 components -- am I pumping and dumping then?
Increased trading volume driven by bot-written blogspam produced by the company PR department with the express intent of pumping their share price definitely isn't "better price discovery".
It's a good idea to follow the links before accusing someone of illegal behavior.

Here's the link to the study again: [1]. This is specifically in reference to the reporting on quarterly earnings reports that we automate for the Associated Press. It's an objective summary of the financial performance of these companies that appears in news outlets across the country (for example, [2,3,4,5,6]). The companies being reported on have no influence over the content of the articles.

From the summary:

>These articles synthesize information from firms’ press releases, analyst reports and stock performance, and are widely disseminated by major news outlets within hours of publication...This study found a positive effect between the public dissemination of objective information and market efficiency.

[1] https://insights.ap.org/industry-trends/study-news-automatio...

[2] http://www.thenewstribune.com/news/business/article158779784...

[3] http://wtop.com/business-finance/2017/05/yum-beats-street-1q...

[4] http://www.foxbusiness.com/markets/2017/04/27/dominos-sales-...

[5] http://www.businessinsider.com/ap-fedex-beats-street-1q-fore...

[6] https://www.usnews.com/news/business/articles/2017-04-26/her...

Fair response and I withdraw and apologise for the implicit accusation against your company specifically. I'm sure you would agree less benign actors exist, which is why I'm reflexively sceptical of the idea that a link between more reporting and more trading volume is an indication of its merit.

I guess if I'd taken the time to read your original link we could have a more interesting discussion on whether pretty basic earnings information in a format more friendly and available to non-professional investors was adding noise to the market or providing a useful counterweight to the amount of free publicity that more prominent companies' earnings get. But I've probably already poisoned the well on this one.

> "The corresponding low also paints a picture and suggests that the low is nothing but a 97.89% since 11/14/16." I can't gather any meaning from that statement, yet it has oddly specific details.

Maybe it was a horribly sleep-deprived person in Wall Street at 2am and made a cut-and-paste error while half-asleep.

I've also noticed this on financial/stock news articles. I've seen a few of them use full 4-word names for corporations ("The Coca-Cola Company") dozens of times in one article, and multiple times per sentence.
That's where companies like Arria https://www.arria.com/ doing natural language generation step in so you can't tell the difference if it's machine or human generated. Well that's the promise.
Yes, finance and sports and weather and other metrics reporting is already highly automated.
Automated Insights https://automatedinsights.com - writes all the routine corporate earnings stories for Associated Press so they can cover more companies and let journalists actually investigate and write more nuanced pieces. Some earnings reports start with the software writing and get augmented by human journalists.
These are indeed frequently generated automatically.
I'd guess that about a third to half of the "financial news" articles I see on finance.google.com are generated from some sort of template, with a script filling in the details. I'd love to see Google identify and remove these sites from their search results (and any sites that link to them), but I think they either don't think it's a high priority or they don't know how to solve the problem.
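To make "a template, with a script filling in the details" concrete, here is a minimal sketch using Python's `string.Template`. The sentence wording, field names, and values are all invented for illustration; nothing here is claimed to be how any of the sites in this thread actually work. It also shows one plausible way such a pipeline can emit confident nonsense: the script has no idea what the fields mean.

```python
from string import Template

# Hypothetical sentence template; the wording is invented for illustration.
LOW_TEMPLATE = Template(
    "$ticker has traded $pct_above_low% above its low since $since_date."
)


def render(template: Template, fields: dict[str, str]) -> str:
    """Fill every $slot in the template; raises KeyError if a field is missing."""
    return template.substitute(fields)


# Made-up values, not real market data.
good = {"ticker": "XYZ", "pct_above_low": "12.50", "since_date": "01/02/17"}
print(render(LOW_TEMPLATE, good))

# The script has no notion of meaning: mislabeled fields still render.
swapped = {"ticker": "XYZ", "pct_above_low": "01/02/17", "since_date": "12.50"}
print(render(LOW_TEMPLATE, swapped))
```

The second call runs without error and produces a grammatical-looking sentence with oddly specific details and no meaning, which is why output like this needs QA before it is published.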

What will be creepy is if the auto-generated story algorithms get good enough that you can't tell what's written by a human and what isn't; there will no longer be a human filter between what some powerful institution wants a news article to say and what makes it into print. Most journalists have a sense of journalistic ethics or at least a reputation to defend; an algorithm has neither of those.

Bloomberg has these articles as well. If a service that costs $20K a year is promoting these (and not removing), it's a safe bet "free" Google Finance will be showing more rather than less of these in the future.
That's a great example, because I much prefer the robot version of the Denny article to the human created version.
For me, neither was worth reading and since the machine generated version was shorter it was less not worth reading than the longer one...but the longer one contained the logical possibility of being worth reading whereas the machine written one was as good as it could possibly be.

Or to put it another way, while it was not worth the human effort to write the story, it wasn't worth the CPU cycles to write the machine generated version either. The story was not worth writing or publishing or reading at all because nobody cares including the author (which is why a machine can write it).

> The story was not worth writing or publishing or reading

Of course, these are the wrong metrics.

The correct metrics are: ad impressions vs. cost to generate content.

That's only for the publishing side. The reader's utility calculus matters too, and I agree with the other poster that both stories are garbage.
Ad impressions read by whom? Triggering which action?
In that vein and thinking about click-bots, I'd favor revenue over ad impressions.
They both had their advantages. The human one wasted words on being cutesy but had the Las Vegas analysis.
You call it analysis, I call it bullshit.

Doesn't every bit of research we have show how hopeless all this analysis is?

https://duckduckgo.com/?q=%22When+you+combine+the+technical+...

Seems to be at least some sort of copy and paste going on...

edit: This is so bizarre, one of the sites has a section with editor "bios" but they read like some sort of very poor odesk/fiverr profiles, wouldn't be surprised if that's what they are...

https://nystocknews.com/our-staff/

This line from those "bios" caught my attention:

I’d affection to help you with your written work, altering and substance needs!

...because I mentally "autocorrected" the latter half to "and mind-altering substance needs"... Looks like they used a "thesauriser" on it. Not hard to see love->affection, editing->altering, and content->substance.

Of course, if you are under the influence of a mind-altering substance, you would probably not notice anything wrong with that page. ...and unfortunately, so would many people who aren't.
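A "thesauriser" of the kind guessed at above can be sketched as blind word-for-word substitution. The three mappings are the ones named in the comment; the input sentence is a hypothetical stand-in, not the actual profile text, and real article spinners presumably use much larger synonym tables.

```python
import re

# The three substitutions named in the comment above.
SYNONYMS = {"love": "affection", "editing": "altering", "content": "substance"}

WORD_RE = re.compile(r"[A-Za-z]+")


def thesaurise(text: str, synonyms: dict[str, str]) -> str:
    """Replace each whole word that has an entry, ignoring part of speech."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        return synonyms.get(word.lower(), word)

    return WORD_RE.sub(swap, text)


# Hypothetical input sentence, not the actual profile text.
print(thesaurise("I'd love to help with your editing and content needs!", SYNONYMS))
```

Because the lookup ignores grammar, the verb "love" becomes the noun "affection", which is exactly the kind of tell that makes the spun text recognizable.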

Ha, and the "de-thesaurised" version of that phrase is indeed from this upwork profile!

https://www.upwork.com/o/profiles/users/_~012388b8f7c8ed8aa2...

The whole thing is full of rehashed phrases:

(link to google for "A deeper exploration of the setup is sure to yield a clear picture"):

https://www.google.com/search?q=%22A+deeper+exploration+of+t...
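For anyone wanting to repeat this check on other stock phrases, an exact-phrase search URL of the shape linked above can be built with the standard library. This is only a sketch: the helper names are made up, and it constructs the URL without fetching anything.

```python
from urllib.parse import quote_plus


def exact_phrase_query(phrase: str) -> str:
    """URL-encode a phrase wrapped in double quotes for an exact-match search."""
    return quote_plus(f'"{phrase}"')


def google_search_url(phrase: str) -> str:
    """Build a search URL for the exact phrase (no request is made)."""
    return "https://www.google.com/search?q=" + exact_phrase_query(phrase)


print(google_search_url(
    "A deeper exploration of the setup is sure to yield a clear picture"
))
```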

Craziness. Auto-generated soup to farm SEO?

I am curious though, will these systems have pen names? A simple name easy enough to recognize as machine written without the need for disclaimer? Could competition be more easily obtained between different companies based on which pen name attracts the most viewers?

The one concern I have is that someone has to give this system enough information to create a story, so what prevents a fake news machine?

I vote for them all to be called Writey McWriterson.
My vote is Botty McBotface :)
From the article (mistakes highlighted):

> Human news writers regularly point out that AIs tend to lack nuance and a _flare_ for language in the stories they churn out. That’s probably a _fare_ criticism [...]

Maybe they used speech-to-text transcription for this, given that the mistakes are homophones? It seems very unlikely that either a human typing this, or a computerized system would make these mistakes (if it learns word associations from a corpus).

PS: the article also claims to be human generated:

> This story was not generated by an AI, but to be fair, I haven’t had my coffee yet.

EDIT: Oops, I might have misunderstood which article you were referring to, since the reference was not placed next to "this".

> It seems very unlikely that either a human typing this, or a computerized system would make these mistakes (if it learns word associations from a corpus).

You underestimate people's ability to make language errors, including spelling ones. Every time I see somebody I suspect is a native English speaker using "it's" for "its", I grind my teeth. (Another instance is somebody using a phrase like "as a programmer, the data bus should be written..." to mean "I, as a programmer, think that..."; this phrasing makes me simply furious.) With those errors they make reading my second language so much harder, and I can't even point their bad spelling or writing style out, because I'm seen as being nitpicking or something.

> and I can't even point their bad spelling or writing style out

There is a certain delicious irony that you managed to contrive such a perfect example of a dangling preposition in the very next sentence after your complaint about a dangling modifier.

Skitt's Law in effect once again!

I believe you overlooked the "out" word at the end of the sentence ("[...] I can't even point [it] out"). Or am I mistaken and you meant something else? How should I have written the sentence?

Remember that English is not my native language, and having an already established career, I don't have many opportunities to learn the grammar more. I'm bound to make errors and not even know about them, because there's nobody who would point them out.

It should be "I can't even point out their bad spelling or writing style". When the preposition gets separated from its object, it's referred to as "dangling". There's a famous (probably apocryphal) example where Winston Churchill humorously wrote, "This is a situation up with which I will not put." - the humour being, of course, that the arguably more grammatical phrasing sounds absurdly unidiomatic.

If you cc me on all your work correspondence, I'll be happy to point out any grammatical errors I find (for a fee, obviously).

OK, Wikipedia has a nice article about it. https://en.wikipedia.org/wiki/Preposition_stranding

My sentence was grammatically and semantically correct, though, while the sentence I was complaining about was semantically invalid, so it was a little too much of you to point that out (stranding here fully intended).

> If you cc me on all your work correspondence, I'll be happy to point out any grammatical errors I find (for a fee, obviously).

A "nice" offer, but I'll pass. First, I'm not in a position to copy my work correspondence to a random dude from the internets. Second, my work correspondence is mainly in my native language.

Not with that attitude!
FWIW, I absolutely believe that a human would make those kinds of typos while typing... I myself didn't realize "flair for criticism" was spelled like that (and the top hit searching on Google for that, without quotes, is actually a book title using the other spelling, though it may very well have been a purposeful pun...). It would be one thing if those weren't themselves "correctly spelled words" (and so a text editor might catch it), but both "flare" and "fare" could easily slip by unnoticed. I will often even make much more interesting typos, where the word just "sounds sort of like the other word but no one would ever confuse the two", as I tend to speak to myself in my head as I type (and as I read) and I swear all language in my brain is at some point represented as audio... I'm not coming up with any examples right now, but trust me that when they come up they are incredibly strange.
Only slightly tangential, have you come across the term eggcorn[1] before?

I always get a chuckle out of thinking about it.

1. In linguistics, an eggcorn is an idiosyncratic substitution of a word or phrase for a word or words that sound similar or identical in the speaker's dialect (sometimes called oronyms). The new phrase introduces a meaning that is different from the original but plausible in the same context, such as "old-timers' disease" for "Alzheimer's disease". - https://en.wikipedia.org/wiki/Eggcorn

"AP's robot journalists are writing their own stories now"

https://www.theverge.com/2015/1/29/7939067/ap-journalism-aut...