by yantrams 819 days ago
Congrats on the launch. This is something I spent some time on a few years ago. I hacked together something similar for my use case by reverse engineering. No ML model though: it uses nearest neighbours and Tversky similarity measures in Julia, with the same taxonomy that you are using.
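
A minimal sketch of the general idea in Python, for readers unfamiliar with the Tversky index. This is not the Julia code behind the endpoint below. The function names, the token-set representation, and the tiny keyword sets per category are all hypothetical and chosen only for illustration.

```python
def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky index between two collections, treated as sets.

    S(A, B) = |A & B| / (|A & B| + alpha * |A - B| + beta * |B - A|)
    """
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    # Two empty sets have no defined similarity; return 0.0 by convention here.
    return common / denom if denom else 0.0


def nearest_categories(tokens, taxonomy, k=3, alpha=0.5, beta=0.5):
    """Return the k categories whose keyword sets are most similar to tokens.

    taxonomy maps a category path to a set of keywords (hypothetical data).
    """
    scored = [
        (tversky(tokens, keywords, alpha, beta), category)
        for category, keywords in taxonomy.items()
    ]
    # Highest similarity first; ties keep the taxonomy's insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(category, score) for score, category in scored[:k]]


# Hypothetical keyword sets; a real taxonomy would be far larger.
taxonomy = {
    "/Finance/Investing": {"stock", "bond", "fund"},
    "/Internet & Telecom/Web Services": {"api", "cloud", "hosting"},
    "/Jobs & Education/Jobs": {"resume", "hiring", "salary"},
}

top = nearest_categories({"api", "cloud", "pricing"}, taxonomy, k=1)
```

With alpha = beta = 1 the index reduces to Jaccard similarity, and with alpha = beta = 0.5 it is the Dice coefficient. Unequal weights make the measure asymmetric, which lets a short text match a category with a long keyword list without being penalised for the keywords it lacks.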

Tested with one of the comments from this thread.

        import requests

        requests.post(
            "https://x2vud9xfq0.execute-api.ap-south-1.amazonaws.com/api/text/classify",
            json={
                "text": """
                And, to be frank, I can't see why I'd send my confidential information to you when I can send it to Google. (Ahem!)
                But the problem with theirs and yours is the OOTB categories are for a global topic set, something like Yahoo directory, rather than for a given discipline. And what's generally needed is a set of disciplines, or several topic trees. (Think Amazon.com instead of Yahoo.)
                I've found the general lists, like LCM[^1] (what you really want is LCSH[^2] subject headings, not LCM), too broad for my business or personal content, while something like ACM[^3] is more what's needed for, say, computing related content.
                For a firmwide knowledge base at a {field}-tech firm, you have a mix of the firm's focus field, and computing, and a broad scope fallback like you're starting with. Even libraries have their own topic hierarchy! [^4]. Plenty fields have controlled vocabularies[^6], and if you can't find one for a field, you can usually generate one by finding someone who is already classifying that field, and looking at their TOC. All of which is to say, to be generally useful, you have to let people BYOT (bring your own topics) for this.
                For instance, we built our topic list based on combining a reference taxonomy for our field, a reference taxonomy for computing, a reference taxonomy for business books, and the Google NLP tool mentioned above.
                There are occasional tools that try to match arbitrary documents to arbitrary hierarchies such as clerk [^5] but they are challenging for various reasons.
                You have a note to contact you for different topics, but raising this here since so far (6 hours) you had no feedback, and I'm a big fan of what you're doing and the niche is underserved.
                A couple other thoughts:
                """,
                'key': 'HACKERNEWS'
            }
        ).json()
        
        
        {
            'genres': {'Technology': 24, 'Finance': 16, 'Education': 11},
            'tags': {'/Business & Industrial/Small Business/MLM & Business Opportunities': 5.094265117745211,
            '/Internet & Telecom/Web Services': 5.51434499612552,
            '/Finance/Investing': 5.72584536853734,
            '/Business & Industrial/Business Operations': 5.888633926463297,
            '/Jobs & Education/Education/Standardized & Admissions Tests': 6.0132143106028435,
            '/Business & Industrial/Business Services': 6.100261915913882,
            '/Jobs & Education/Jobs': 6.126547614437338,
            '/Science/Earth Sciences/Atmospheric Science': 6.1553064528175545,
            '/Finance': 6.249046550441405,
            '/Business & Industrial': 6.333431648078183},
            'id': '65f891a111ec14ddd4b56bda'
        }
        
        
Your result:

        {
            "result": [
                [
                "/Arts & Entertainment/Books & Literature/Reference",
                0.138976
                ],
                [
                "/Jobs & Education/Job Listings",
                0.138976
                ],
                [
                "/Computers & Technology/Networking/Distributed & Cloud Computing",
                0.069488
                ],
                [
                "/Jobs & Education/Online Learning",
                0.069488
                ],
                [
                "/Arts & Entertainment/Music & Audio/Music Reference",
                0.046325
                ]
            ]
        }