I often think AI is mostly crap, wasting a lot of energy for very questionable benefits. But could/should this repetitive task of reminding submitters to follow the submission guidelines and add the year to submissions of old articles be automated?
Even simpler, just check if the URL or title has been submitted before. That would also take care of all the duplicate entries that pop up once per day for a week after a viral story emerges.
In this instance, the URL is slightly different from previous submissions, so some more clever fuzzy matching, or matching on the title only, would be needed.
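To make the "slightly different URL" case concrete, here is a rough sketch of one way such a check could work. It is not how any existing site does it: `normalize_url`, `titles_similar`, the `utm_` prefix list and the 0.85 threshold are all assumptions picked for illustration.

```python
from difflib import SequenceMatcher
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: query parameters with these prefixes are tracking noise.
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url: str) -> str:
    """Reduce a URL to a comparison key: no scheme, no 'www.', no
    trailing slash, no fragment, no tracking parameters."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if not key.lower().startswith(TRACKING_PREFIXES)
    )
    query = urlencode(kept)
    return host + path + ("?" + query if query else "")


def titles_similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Case-insensitive fuzzy comparison of two titles."""
    ratio = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
    return ratio >= threshold


def is_probable_duplicate(url_a: str, title_a: str, url_b: str, title_b: str) -> bool:
    """Flag a pair as a likely duplicate if either the URLs or the titles match."""
    return normalize_url(url_a) == normalize_url(url_b) or titles_similar(title_a, title_b)
```

Matching on title alone will produce false positives for generic titles, so a real checker would probably only warn the submitter rather than reject the submission.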
Yes, I have always wondered why a simple duplicate checker covering the same couple of days does not exist. Or does it exist, and the duplicates are actually slight variations of the URL?
What algorithm would you suggest to find the year in an arbitrary submission? Of course AI is not a very clearly defined term, and more difficult problems certainly exist. I was just thinking of the case where the submission contains several dates, or none at all, and the several hints a human would take into consideration still get checked.
Of course some minimal implementation without AI techniques could already handle many cases. My AI suggestion was not dead serious ;)
Google's research blog does not seem to provide this, but many blogs include Open Graph metadata[0] about when the article was published or modified:
article:published_time - datetime - When the article was first published.
article:modified_time - datetime - When the article was last changed.
For example, I pulled up a random article on another website, and found these <meta> tags in the <head>:
For pages that contain this metadata, it would be a cheaper/faster implementation than using an LLM, but using an LLM as a fallback could easily provide you with the publication date of this Google article.
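A minimal sketch of the metadata route, using only the Python standard library. The class and function names (`ArticleTimeParser`, `published_year`) are made up for illustration, and it assumes the page uses the `property`/`content` attribute pair and an ISO 8601 timestamp, which not every site does.

```python
from datetime import datetime
from html.parser import HTMLParser
from typing import Optional


class ArticleTimeParser(HTMLParser):
    """Collects article:published_time / article:modified_time meta tags."""

    WANTED = ("article:published_time", "article:modified_time")

    def __init__(self) -> None:
        super().__init__()
        self.times = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property")
        content = attr.get("content")
        if prop in self.WANTED and content:
            # Keep the first occurrence of each property.
            self.times.setdefault(prop, content)


def published_year(html: str) -> Optional[int]:
    """Return the publication year from the metadata, or None if absent/unparseable."""
    parser = ArticleTimeParser()
    parser.feed(html)
    raw = parser.times.get("article:published_time")
    if raw is None:
        return None
    try:
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        return datetime.fromisoformat(raw.replace("Z", "+00:00")).year
    except ValueError:
        return None
```

When `published_year` returns `None`, that is the point where a fallback (per-domain parser, LLM, or a human) would take over.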
>What algorithm would you suggest to find the year in an arbitrary submission?
In the submission title, a simple regex for the presence of a date with a standard format (e.g. %Y) would suffice.
Matching it to the article might or might not be possible, but that would already be enough (assuming having the date is a good thing, which I'm not at all certain of).
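Something like the following would cover the title check. It is only a sketch: the pattern assumes a four-digit year in the 1900s or 2000s, and `year_in_title` is a name invented here.

```python
import re
from typing import Optional

# A standalone four-digit year starting with 19 or 20, not embedded in a longer number.
YEAR_RE = re.compile(r"(?<!\d)(19\d{2}|20\d{2})(?!\d)")


def year_in_title(title: str) -> Optional[int]:
    """Return the last year-like number in the title, or None if there is none."""
    matches = YEAR_RE.findall(title)
    return int(matches[-1]) if matches else None
```

Taking the last match favours the trailing "(2006)" style tag over a year that happens to appear earlier in the title, but it will still be fooled by titles like "Predictions for 2030".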
As another comment suggested, you can scan for previous submissions by URL -- Algolia is very helpful with that.
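For reference, a rough sketch of what that lookup could look like against the public HN Search API at hn.algolia.com. The `restrictSearchableAttributes=url` and `tags=story` parameters and the `hits` / `objectID` / `title` / `created_at` response fields are written from memory of that API, so treat them as assumptions to verify; the function names are invented.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://hn.algolia.com/api/v1/search"


def build_search_url(article_url: str) -> str:
    """Build a search request that matches stories by their URL field only."""
    params = {
        "query": article_url,
        "restrictSearchableAttributes": "url",
        "tags": "story",
    }
    return API + "?" + urlencode(params)


def previous_submissions(article_url: str, timeout: float = 10.0):
    """Return (id, title, created_at) tuples for earlier submissions of the URL.

    Performs a network request; errors are left to the caller.
    """
    with urlopen(build_search_url(article_url), timeout=timeout) as resp:
        data = json.load(resp)
    return [
        (hit.get("objectID"), hit.get("title"), hit.get("created_at"))
        for hit in data.get("hits", [])
    ]
```

The search is not an exact match, so the results would still need to be compared against a normalized form of the submitted URL.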
Outside that, no clue, been a long time since I last wrote crawlers, admittedly. Though it can't be too difficult to crowd-source origin date parsers per domain?
But hey, if any LLM's free tier can achieve it, then why not. My point was that many people worked on that particular problem historically. It would be a shame if we can't use any of their hard work.
I think adding the year is mostly crap. What information exactly would it give, except perhaps the false impression that this article is "antiquated information", when it pretty much still holds true and describes a perennial issue?
It gives a cue about how many times I've probably seen the article before. Quite useful, IMO. I read this particular article when it came out in 2006... it's convenient to know we're not discussing a novel finding on the same topic.