Hacker News
by abhaga 4938 days ago
There is a ton of data on government websites in India, and the only way to get at it is by scraping the sites. Knowing that you can scrape it and be on your way feels very liberating at times. Examples: rates for Indian postal department services, and minutes of the houses of parliament. (The latter are great for building machine translation systems. A lot of MT research has benefited from the availability of parallel corpora consisting of parliamentary proceedings in two or more languages, such as the Hansard corpus from Canada and the European Parliament corpus. No such luck in India.)