Hacker News
by casual-dev 1419 days ago
Thanks for the kind words. Slowly going into the water is my approach as well, but sometimes it just gets to me. My learning projects die on the hill, because of the frustration I have on the job with these techniques. Plus, overwhelming ecosystem.

About my regex problem: this is a structural mess. JSON/XML with HTML code in the data fields. We process them and send them to multiple job boards. Our clients mainly use HRM software or some CMS, some of which can only spit out whatever HTML is displayed on their career sites. This code often does not even have classes or IDs. Most of the time we are cobbling together whatever is between two headlines, praying those won't change. But they do, because the recruiters put fields where they do not belong. I call myself a code cleaner, not a web dev, nowadays. We are not able to use APIs, because either the receiving job boards don't offer one, the client doesn't, or it's just not worth it financially.

I will take a step back and reevaluate my situation.

2 comments

I spent a decade parsing text-with-angle-brackets with regexes, and it sucks. It’s always tempting to try an HTML parser, but if the code is written by a human (or worse, a mixture of human and machine, especially if the machine involves MS Word), it just doesn’t work.

I’d suggest rather than attempting to do big regexes that capture a bunch of stuff in one call, break it down to a bunch of smaller, more targeted calls - one call to capture the text of the whole record, another with 3 variants to get the title, another with 2 variants to pick up a tag line, etc.
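A rough sketch of that "several small calls" idea, in Python only because it is quick to try out; the same shape works in JS. Every pattern here is made up for illustration (the `<article>` wrapper, the heading levels, the `tagline` class), so treat them as placeholders for whatever your feeds actually contain.

```python
import re

FLAGS = re.S | re.I

# One call to capture the whole record (hypothetical wrapper element).
RECORD = re.compile(r"<article\b[^>]*>(.*?)</article>", FLAGS)

# Three title variants, tried in order (all assumed, not from a real feed).
TITLE_VARIANTS = [
    re.compile(r"<h1\b[^>]*>(.*?)</h1>", FLAGS),
    re.compile(r"<h2\b[^>]*>(.*?)</h2>", FLAGS),
    re.compile(r"<(?:b|strong)\b[^>]*>(.*?)</(?:b|strong)>", FLAGS),
]

# Two tag line variants (also assumed).
TAGLINE_VARIANTS = [
    re.compile(r"<em\b[^>]*>(.*?)</em>", FLAGS),
    re.compile(r'<p\b[^>]*class="tagline"[^>]*>(.*?)</p>', FLAGS),
]


def first_match(patterns, text):
    """Return group 1 of the first pattern that matches, or None."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return match.group(1).strip()
    return None


def parse_record(html):
    """Extract one record, then run the small targeted patterns on it."""
    record = RECORD.search(html)
    if record is None:
        return None
    body = record.group(1)
    return {
        "title": first_match(TITLE_VARIANTS, body),
        "tagline": first_match(TAGLINE_VARIANTS, body),
    }


sample = "<article><H2>Backend Developer</H2><em>Remote, full time</em><p>Details</p></article>"
parsed = parse_record(sample)
```

A field that matches none of its variants comes back as `None`, so you can see which rule broke instead of having one giant regex fail silently.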

Essentially, this is what I do: first matching with a broader regex ruleset, then working down to the next one, and so on. But with more complexity of code comes more breakage down the line. I went into full maze mode yesterday and questioned everything after that, so this is what my sanity looked like this morning.

Regex isn't really the problem though (even though it technically should not be the solution in this case either, but I cannot dictate the tech stack). It was just the last straw in my frustration with the situation and with myself for not being able to do what my colleague does, even though I want to. I felt the need for help, and I got it. Awesome community around here.

Thank you for the context! What you're doing is actually much harder than regular web dev. It's a specialized kind of data processing, often called an "extract, transform, load" (ETL) workflow.

Most web devs don't need to do that, and that you're willing to tackle it at all just shows how willing to learn you are, despite the frustration.

If you hate this situation, it's totally understandable lol. That kind of work has all the tedium of dealing with someone else's arcane data format, and none of the joy of seeing your creativity come to life. Some people love that sort of work, and specialize in it, becoming backend people or DB engineers or data scientists or the like, but it's not usually what web devs are known for (who tend to focus instead on UIs and some level of design and interactive stateful apps). Nothing wrong if ETL just isn't your cup of tea. I'd go crazy if I had to do that often, too.

Anyhow, if I'm understanding you right, you have HTML embedded in JSON and/or XML. Do you know what "escaping" is in the text embedding sense? Like if you have quotes inside quotes, or tag brackets inside tags, how to separate each layer of embedding? If your JSON and XML files are cleanly escaped, you should be able to (as a first step) just iterate through the files and get the HTML parts out (without regex).

Like if the HTML is just a data string inside JSON, you can transform the parsed JSON into an array of HTML strings using array.map() or Object.values(obj).map().
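Something like this, sketched in Python (a list comprehension plays the role of map()). The feed layout and the `jobs` / `description` names are invented for the example; the point is that the JSON parser undoes the escaping for you, so no regex is needed to get the HTML out.

```python
import json

# Hypothetical feed. The raw string keeps the backslashes, so the JSON
# text really contains the escapes \" and \/ as a file on disk would.
raw = r'''
{"jobs": [
  {"id": 1, "description": "<h2>Tasks</h2><p>Say \"hello\" to users</p>"},
  {"id": 2, "description": "<p>Line one<br\/>Line two</p>"}
]}
'''


def extract_html(raw_json, list_key="jobs", field="description"):
    """Parse the JSON text and return the HTML strings, one per record.

    list_key and field are assumed names, not part of any real feed.
    """
    feed = json.loads(raw_json)
    return [record[field] for record in feed[list_key]]


html_parts = extract_html(raw)
```

After parsing, the first string contains plain `"hello"` quotes and the second a plain `<br/>`, i.e. the JSON layer of embedding is gone and only HTML is left.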

In the XML, if the HTML is stored in CDATA fields, you can access it using an "XPath" selector... you know how CSS has selectors that let you say headings should be styled one way, paragraphs another? XML has its own selector language that lets you directly target a certain node inside the document, without using regex, by specifying the hierarchical path that takes you there (like a CDATA inside a description inside a job inside a company, or whatever). Although there is a learning curve to XPath, it is much better suited to the task than regex, because regex can't easily account for the complexity within XML (especially when there are nested layers).
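A small sketch of the path idea using Python's built-in ElementTree, which only supports a limited subset of XPath but enough for this. The company / job / description structure is just the made-up example from above, not anyone's real feed. Note that the parser hands you the CDATA content as the element's text, so the path stops at the description element.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed with HTML stored in CDATA sections.
raw = """<company>
  <job id="1">
    <title>Backend Developer</title>
    <description><![CDATA[<h2>Tasks</h2><p>Build & ship</p>]]></description>
  </job>
  <job id="2">
    <title>Designer</title>
    <description><![CDATA[<p>Draw <b>things</b></p>]]></description>
  </job>
</company>"""

root = ET.fromstring(raw)

# Path relative to the root element: every description inside every job.
descriptions = [node.text for node in root.findall("./job/description")]

# A predicate targets one specific job by attribute value.
designer_html = root.findtext(".//job[@id='2']/description")

# Keep the job id next to its HTML for the later steps.
html_by_id = {job.get("id"): job.findtext("description") for job in root.findall("./job")}
```

If you need full XPath (functions, axes and so on), a dedicated XML library would be the next step; the paths themselves look the same.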

It would help if you can post some example snippets, but that might be better suited for Stack than HN (though feel free to link to it here).

Once you have the HTML out, then you can run it through a sanitizer -- that's an optional step, but would let you strip out unnecessary divs, old font tags, whatever, keeping only basic formatting (headers, paragraphs, links, bold, etc.) which should be much cleaner to hand off to your clients. That would be much easier to embed on someone else's site vs a scraped page with all the HTML mess from someone else's framework.
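To make the sanitizer step concrete, here is a minimal allowlist sketch on top of Python's built-in html.parser. The tag list and the link rule are my own assumptions, and this is not a security-grade sanitizer (it does not rebalance broken markup, for one); for production a maintained sanitizer library is the safer choice.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlist: basic formatting only.
ALLOWED_TAGS = {"h1", "h2", "h3", "h4", "p", "br", "ul", "ol", "li",
                "a", "b", "strong", "i", "em"}
VOID_TAGS = {"br"}                  # tags that never get a closing tag
DROP_CONTENT = {"script", "style"}  # drop these tags and their contents
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag is dropped, its text content is kept
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(SAFE_SCHEMES):
                self.out.append(f'<a href="{escape(href, quote=True)}">')
            else:
                self.out.append("<a>")  # link target removed
        else:
            self.out.append(f"<{tag}>")  # all attributes removed

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


dirty = ('<div class="x"><font face="Arial"><h2 style="color:red">Tasks</h2></font>'
         '<p>Build &amp; <b onclick="x()">ship</b></p><script>alert(1)</script>'
         '<a href="https://example.com/apply">Apply</a></div>')
clean = sanitize(dirty)
```

The divs, font tags, inline styles and event handlers are gone, while the heading, paragraph, bold and link survive.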

I know there is a lot of complexity in each of those steps, but there are great tools and documentation for each step of the way. That's just to get you started.

At the end of the day what you're doing isn't really a JavaScript issue at all, it's just a different kind of work that JavaScript happens to be able to handle if you really need it to (but so can Python or Java or specialized command line tools like jq). It's a different body of work, which is why your casual web dev skills aren't providing easy answers. It's OK! You can learn it once and make it work (and then decide never to do that again, like I did lol). Or switch tracks, totally up to you :)

But feel free to ask here or on Stack if you have followups!

You are much appreciated. I didn't even know there is a term for this part of my work.

Down the line, we do everything you cautiously described. We extract single fields with pointers (for lack of a better term; English is not my main language) to the XML/JSON fields we want to extract. Our software then lets us use JS snippets to manipulate the contents. Problem is, once you define a rule, it may get 80-90% right across hundreds of datasets. But breakage is not an option most of the time. It's Pareto principle work: 80% of the work in 20% of the time, the remaining 20% in 80% of the time. In the end, they are just snippets, then a giant gap, then the projects my colleague does.

I get where you are coming from, regarding "never to do that again". This is not the only work I do. I also build HTML from customer demands, many of which are PDFs meant for print use, but not for the web. I like it, but I only scratch the surface of what might be. Thanks to the resources in this thread, I have a good idea of what is to come. So, thanks again.