by querez 553 days ago
Some very weird things in this.

1. The title makes it sound like the author spent a lot of time on this project. But really, this mostly consisted of noting down a couple of URLs per day. So maybe 5 min / day = ~130h spent on the project. Let's say 200h to be on the safe side.

2. "Get first analyses results out quickly based on a small dataset and don’t just collect data up front to “analyse it later”" => I think this actually killed the project. Collecting data for several years w/o actually doing anything with it is not a sound project.

3. "If I would have finished the project, this dataset would then have been released" ==> There is literally nothing stopping OP from still doing this. It costs maybe 2h of work and would potentially give a substantial benefit to others, i.e., turn this project into a win after all. I'm very puzzled why OP didn't do this.


yep I spent more time on duolingo for 600+ day streak and can barely speak spanish.
Duolingo is a pretty bad tool for learning a language, it's good to make you feel like you're learning though.
At this point it's more about being scared of the bird.
Just to give a nuanced perspective on duolingo.

My wife only did 50 hours of duolingo in total the past 2 years. Combine that with me teasing her in Dutch and she’s actually making progress.

Duolingo is a chill tool to learn some vocab. That vocab then gets acquired by talking to me. We talk 2 minutes Dutch per day at most. So about 11 hours in total per year.

She is 67% done with duolingo. So we bought the first real book to learn Dutch (De Opmaat).

That book is IMO not for pure beginners. But for the level my wife was at, it seems perfect.

Human speech is around 150-200 words per minute; even going slow, 2 minutes a day of real talk is probably more vocab than 10 minutes of Duo. And with better feedback, a human rather than a cartoon casino.
Do you think it would be good for Flemish too or speaking standard Dutch in Belgium?
I don't know how one would learn Flemish from books. I think you'd need to go to Belgium and speak Dutch there and then see what the differences are.

Dutch and Flemish are interchangeable though. Sometimes it falls apart based on accent, but not on language.

I finished the whole tree in French and had nothing to show for it either. It really is a fun way to feel like you're learning, without connecting you to the language or culture in any significant way.
It's a useful tool if you're immersed in the language, it's not key to your learning but it can tremendously help.
For me - nothing beats in-person classes in lieu of a native speaker whom you can interact with. Being forced to actually speak the language in “mock settings” makes all the difference.

And even if you don’t get your grammar completely right, you will learn enough to survive in a real-life setting.

I learned Spanish through a combination of both - I took Spanish classes after I started dating my Mexican wife, enough to get conversational. Then I started interacting in Spanish with her family, which helps me now maintain the language without needing the classes.

I feel this whilst learning (trying to learn) German: when I think "how would I say this in German?" I get nothing but a blank in my mind. I'm a good "speaker" though, and sadly, I feel I'm not going anywhere either...
Watch Dark on Netflix in original German on repeat, great way to subconsciously make note of tones and pronunciation while also watching an awesome show. Be very intentional about it though.
Surround yourself in the language. In Germany we have almost everything dubbed, so you can watch pretty much any popular movie or TV series in German or read any popular book in German. Besides that there are also quite a lot of German productions.
Indeed.

For learners, I'd also currently recommend "Easy German" podcasts and YouTube videos, as they come in all skill levels, are free, and are well made.

https://youtube.com/@easygerman?si=EQdZPHMZ0lPNEl6V

That seems to be a pattern
It is because you never really practice talking with Duolingo. I am quite good at reading French now, though.
> I am quite good at reading French now, though.

If you are, that's actually quite an achievement and good. If you're talking about French outside of Duolingo, that is.

I do not normally hear of people getting to reading fluency through Duolingo.

Duolingo used to have a really good feature where you read through and collaboratively translated texts, but they shut it down years back.
Wow I forgot about that! When I was using it for French many years ago, I imagined they were using it as a way to generate free translations, but still found it enjoyable and useful.

Wonder why they took it away.

Well you can't practice producing unconstrained sentences. Only with their very narrow training-wheels.
Anki is the way, especially with their new FSRS algo.
Yep, any good textbook or course with Anki for aiding raw memorisation. By far the best way to go
Likewise, about the same with Arabic on Duolingo, and I never even mastered the alphabet.
Point number 2. is super important for non-hobby projects. Collect a bit of data, even if you have to do it manually at first, and do a "dry run" / first cut of whatever analysis you're thinking of doing so you confirm you're actually collecting what you need and what you're doing is even going to work. Seeing a pipeline get built, run for like two months and then the data scientist come along and say "this isn't what we needed" was a complete goddamn shitshow. I'm just glad I was only a spectator to it.
They touch on something relevant here and it's a great point to emphasise

> The emphasis on preserving raw HTML proved vital when Tagesschau repeatedly altered their newsticker DOM structure throughout Q2 2020. This experience underscored a fundamental data engineering principle: raw data is king. While parsers can be rewritten, lost data is irretrievable.

I've done this before, keeping full, timestamped, versioned raw HTML. That still risks shifts to JavaScript-based things, but keeping your collection and processing as distinct as you can, so you can rerun things later, is incredibly helpful.

Usually, processing raw data is cheap. Recovering raw data is expensive or impossible.

As a bonus, collecting raw data is usually easier than collecting and processing it, so you might as well start there. Maybe you'll find out you were missing something, but it's no worse than if you'd tied things together.
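
A minimal sketch of that split, assuming a plain local directory as the archive. The names `archive_raw` and `reprocess` and the file naming scheme are made up for illustration; nothing here is taken from the article's actual pipeline.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterator, Tuple


def archive_raw(content: bytes, url: str, root: Path, fetched_at: datetime) -> Path:
    """Store the untouched response body under a timestamped, content-hashed name."""
    # fetched_at should be timezone-aware; it is normalised to UTC for the filename.
    stamp = fetched_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    url_key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    body_key = hashlib.sha256(content).hexdigest()[:12]
    target_dir = root / url_key  # one directory per URL (hypothetical layout)
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{stamp}_{body_key}.html"
    path.write_bytes(content)  # raw bytes only, no parsing at collection time
    return path


def reprocess(root: Path, parser: Callable[[bytes], object]) -> Iterator[Tuple[Path, object]]:
    """Re-run any parser over everything collected so far, oldest first per URL."""
    for path in sorted(root.glob("*/*.html")):
        yield path, parser(path.read_bytes())
```

If the site changes its DOM, only the function passed to `reprocess` has to change; the stored files stay as they were fetched.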

edit

> Huh? To find the specific dates new item corresponding to a given topic? Why not just predict the date-range e.g. "Apr-Aug 2022"

They say they had to manually find the links to the right liveblog subpage. So they had to go to the main page, find the link and then store it.

While I understand the points I think it's worth being kinder about someone coming out to write about how they failed with a project.

> 1. The title makes it sound like the author spent a lot of time on this project. But really, this mostly consisted of noting down a couple of URLs per day. So maybe 5 min / day = ~130h spent on the project. Let's say 200h to be on the safe side.

Consistent work over multiple years shouldn't be looked down on like this. If you've done something every day for years it's still a lot of time in your life. We're not econs and so I don't think summing up the time really captures it either.

> 3. "If I would have finished the project, this dataset would then have been released" ==> There is literally nothing stopping OP from still doing this. It costs maybe 2h of work and would potentially give a substantial benefit to others, i.e., turn this project into a win after all. I'm very puzzled why OP didn't do this.

They might not realise how to do this sustainably, they might be mentally just done with it. It may be harder for them to think about.

I'd recommend also that they release the data. If they put it on either Zenodo or Figshare it'll be hosted for free and referenceable by others.

> 2. "Get first analyses results out quickly based on a small dataset and don’t just collect data up front to “analyse it later”" => I think this actually killed the project.

I agree, but again on the kinder side (because they also agree I think) there are multiple reasons for doing this and focusing on why might be more productive.

1. It gets you to actually process the data in some useful form. So many times I've seen things fail late on because people didn't realise something like "how are dates formatted" or whether some field was often missing or you just didn't capture something that turns out to be pretty key (e.g. scrape times then realise that at some point they changed it to "two weeks ago" and you didn't realise).

This can be as simple as just plotting some data, counting uniques, anything. The automated system will fall over when things go wrong and you can check it.

2. What do people care about? What do you care about? Sometimes I've had a great idea for an analysis only to realise later maybe I'm the only one that cares or worse, the result is so obvious it's not even interesting to me.

3. Keeping interest. Keeping interest in a multi-year project that's giving you something back can be easier than something that's just taking.

4. Guilt. If I spend a long time on something, I feel it should be better. So I want to make it more polished, which takes time, which I don't have. So I don't add to it, then I'm not adding anything, then nothing happens. It shouldn't matter, but I've long realised that just wishing my mind worked differently isn't a good plan and instead I should just plan for reality. For that, doing something fast feels much better - I am happier releasing something that's taken me half a day and looks kinda-ok because

5. Get it out before something changes. COVID had or has no upfront endpoint.

6. Ensure you've actually got a plan. Unless you've got a very good reason, you can probably build what you need to analyse things and release it earlier. You can't run an analysis on an upcoming election, but even then you could do it on a previous year and see things working. This can help with motivation because at the end you don't have "oh right now I need to write and run loads of things" you just need to hit go again.
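
For point 1, the first cut really can be tiny. A rough sketch, assuming the scraped rows are plain dicts; `first_cut_report`, the field names and the date format are hypothetical and only there to show the idea of counting rows, missing fields, uniques, and dates that don't parse.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, List, Optional, Sequence


def first_cut_report(records: Iterable[Dict[str, Optional[str]]],
                     fields: Sequence[str],
                     date_field: str,
                     date_format: str) -> Dict[str, object]:
    """Count rows, missing fields, unique values, and dates in an unexpected format."""
    total = 0
    missing: Counter = Counter()
    seen: Dict[str, set] = {field: set() for field in fields}
    bad_dates: List[str] = []
    for record in records:
        total += 1
        for field in fields:
            value = record.get(field)
            if value is None or value == "":
                missing[field] += 1  # absent, None, or empty all count as missing
            else:
                seen[field].add(value)
        raw_date = record.get(date_field)
        if raw_date:
            try:
                datetime.strptime(raw_date, date_format)
            except ValueError:
                bad_dates.append(raw_date)  # e.g. a relative "two weeks ago"
    return {
        "rows": total,
        "missing": dict(missing),
        "unique": {field: len(values) for field, values in seen.items()},
        "bad_dates": bad_dates,
    }
```

Run something like this on the first week of data and a non-empty `bad_dates` shows up right away, rather than years later.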