Hacker News
by hathawayp 3190 days ago
Hey HN, Sitebulb is a desktop crawler for Windows and Mac, specifically designed for SEO consultants/agencies. Me and my business partner have been building it for the last 2 years or so, and we're finally launching it today.

Its main differentiating factors: 1. Scale – it can comfortably crawl 500,000+ page websites despite being a desktop program. 2. Reporting – it does a lot of data manipulation and processing so you don't have to. 3. Visualization – it has tons of useful graphs, including the Crawl Maps, which help you visualise site structure.

Our aim was to give it the reporting capability of a SaaS crawler, with the convenience of a desktop crawler.

Looking forward to hearing your feedback on our new product. Thanks, HN community!

3 comments

Some feedback:

- Why require the email confirmation before using the software? Not really necessary, is it?

- No Umlauts in project name?

- Standard/advanced settings switcher is confusing

- Crawl Maps is not linked from the "Product" dropdown

- "Recent audits" shows finished and queued ones, but not running ones (which also have no menu point)

- Super simple option to limit crawl to "internal" URLs would be nice (or did I miss it?)

- "Filtered URL Lists" is a strange navigation option, above the main selection especially

- Why no endless scrolling in tables? This is what a desktop app should do better than browsers

Nice tool!

Thanks for all the detail! Here you go:

- Email confirmation is required for the username/password, which is how free and trial licenses are controlled, and ultimately how paid licenses are doled out. So we need it for the licensing.

- No special characters at all! Except periods. Sorry!

- Agreed, we need to improve the settings switcher.

- Crawl Maps is not linked - you mean on the website right? I'll fix that.

- Running audits show on the main Dashboard, seemed kinda overkill to put it on Recent Audits as well. No?

- You can switch off 'Check external' in the Advanced Settings. Kinda 'hidden away' to keep the main settings UI cleaner (otherwise where does it end?!)

- "Filtered URL Lists" - they are there because people want them ('a big list of all the URLs') and kept missing them in our usability tests!

- Why no endless scrolling in tables? It's not easy to do because the data is written to disk, rather than stored in RAM (which is the reason it can typically crawl more pages), so it needs to go and fetch/filter/etc... every time.

All makes sense, thanks for the reply.

> Email confirmation is required for the username/password, which is how free and trial licenses are controlled, and ultimately how paid licenses are doled out. So we need it for the licensing.

If you are interested in getting more free users into the app to try it, I would suggest reworking the licensing a bit to enable usage without email, or at least without confirmation. It should be worth the effort, and you can still require login when switching from free/trial to paid.

> Crawl Maps is not linked - you mean on the website right? I'll fix that.

Yep, no link in the feature dropdown.

> Running audits show on the main Dashboard, seemed kinda overkill to put it on Recent Audits as well. No?

Maybe. I like structure, so was expecting it to be shown a level down from the Dashboard somewhere as well.

> You can switch off 'Check external' in the Advanced Settings. Kinda 'hidden away' to keep the main settings UI cleaner (otherwise where does it end?!)

Ok, I think I am biased because I usually use a tool that is "internal only" by default.

> - "Filtered URL Lists" - they are there because people want them ('a big list of all the URLs') and kept missing them in our usability tests!

Umm, ok. "Crawled URLs" maybe?

> Why no endless scrolling in tables? It's not easy to do because the data is written to disk, rather than stored in RAM (which is the reason it can typically crawl more pages), so it needs to go and fetch/filter/etc... every time.

If some websites can do it with a request to the server each time the next results are loaded, I am sure you can also do that with whatever local database you use ;)
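To make that concrete, here is a minimal sketch of endless scrolling over a disk-backed store using keyset pagination. Everything in it is an assumption for illustration: the thread never says which local database Sitebulb uses, so SQLite stands in for it, and the `urls` table, its columns, and the helper names are hypothetical. The demo uses an in-memory database, but the same queries would run against a file on disk.

```python
import sqlite3


def open_demo_db(n_rows=1000):
    """Create a stand-in crawl database (hypothetical schema, not Sitebulb's)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT, status INTEGER)"
    )
    conn.executemany(
        "INSERT INTO urls (id, url, status) VALUES (?, ?, ?)",
        ((i, f"https://example.com/page/{i}", 200) for i in range(1, n_rows + 1)),
    )
    conn.commit()
    return conn


def fetch_page(conn, after_id=0, page_size=50):
    """Return the next `page_size` rows whose id is greater than `after_id`."""
    cur = conn.execute(
        "SELECT id, url, status FROM urls WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    )
    return cur.fetchall()


def scroll_all(conn, page_size=50):
    """Yield one page at a time, as a table view might while the user scrolls."""
    after_id = 0
    while True:
        rows = fetch_page(conn, after_id, page_size)
        if not rows:
            return
        yield rows
        after_id = rows[-1][0]  # remember the last id seen as the next cursor
```

Each scroll step only reads `page_size` rows via the primary key, so the cost per page does not grow with how far down the table the user has scrolled. Filtering or sorting by other columns would need matching indexes, which is presumably where the real-world effort lies.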

The problem is that for big crawls (and 500k is not large) you probably don't want to use your desktop. For example, my home ADSL is only 3.5 as we are 6k yards from the exchange.

And I would not want to get my work's 100Mb line banned by Google. This is where services like DeepCrawl come into play: I can set up my sites to be crawled at night and look at the reports in the morning.

Another problem I found is that desktop crawlers are very resource hungry. At one small agency we had two stripped-down dedicated machines just to run crawls, as the risk of causing a crash was too high.

Yeah, for really big crawls you're probably better off sticking it on a server or AWS, as much as anything so you don't need to leave your computer on for ages.

But Sitebulb is not resource hungry in the same way as other desktop crawlers. It saves to disk instead of using RAM, so you don't experience the same limitations.

I'm not sure what you mean about Google. There is no link between Sitebulb and Google - it doesn't visit Google at all, so there is no risk of banning. Using it on your 100 Mb work line would be ideal.

Interesting that it’s all a desktop app. What problem do you think this solves compared to something that runs in the cloud? Apart from the cost structure, I can’t think of anything myself.

The other big thing, in comparison to cloud software, is convenience. You can set up a crawl and start it running - and see URLs being crawled - within a minute.

On cloud software that's simply not possible, due to the way that everything is scheduled.

There are a few other small things, such as being able to view Audits offline (what we call 'train mode').

The cost structure can be a big limiting factor though, especially for smaller companies. Sitebulb effectively removes all limitations around number of domains, number of projects, total number of URLs crawled etc...

> The other big thing, in comparison to cloud software, is convenience. You can set up a crawl and start it running - and see URLs being crawled - within a minute.

This depends on implementation. If the architecture is modern and well thought out, using dynamic scaling or even AWS Lambda, the result should be available much faster on the cloud software due to the ability to parallelize. You can only have so much network bandwidth / CPU power locally, and if you need to crawl hundreds of pages to get your result, it matters a lot.

Disclaimer: I'm building a SaaS tool for SEO which also involves page crawling.
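
As a rough illustration of the parallelization point, the sketch below compares fetching a batch of pages with one worker versus several. It is a toy under stated assumptions: `fake_fetch` is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the worker counts are arbitrary. It says nothing about how any actual crawler, desktop or cloud, is built.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, latency=0.02):
    """Hypothetical stand-in for an HTTP request: sleep to simulate latency."""
    time.sleep(latency)
    return url, 200


def crawl(urls, workers):
    """Fetch every URL with `workers` threads; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with `urls`
        results = list(pool.map(fake_fetch, urls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(20)]
    _, sequential = crawl(urls, workers=1)
    _, parallel = crawl(urls, workers=10)
    print(f"1 worker: {sequential:.2f}s, 10 workers: {parallel:.2f}s")
```

Because the work is dominated by waiting on the network, adding workers shortens the wall-clock time until bandwidth, CPU, or the target site's rate limits become the bottleneck, which is the local constraint described above.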

> On cloud software that's simply not possible, due to the way that everything is scheduled.

As someone who works in cloud software this makes me cringe a little.

I have no doubt this is how existing cloud SEO crawlers work, but with elastic scaling, web sockets, and serverless there is no reason why this has to be true.

It is not a limitation of cloud software. It is a sign of devs and/or product owners deciding that instant results are not a priority for the product.

Edit: I hear that a lot from industries that are not intimately familiar with web apps. "You can't do that on the cloud"... a typical web software engineer will not be able to do it, but there are people out there who can. They are more expensive than your typical developer, but depending on your product they are worth it.

Sorry, maybe I misread, but I kind of read the comment as 'what separates this from other cloud products on the market?'

So I wasn't trying to argue what is and isn't possible with cloud architecture, simply what is and isn't possible with (our) cloud-based competitors.

The process is along the lines of: 'Click Start', get taken to a screen which says 'Initializing' or similar, then maybe 2-3 minutes later you'll see something start to happen. But there is little to no data on which URLs are actually being crawled.

Sitebulb, and desktop crawlers in general, have a much quicker feedback loop.

> Sitebulb, and desktop crawlers in general, have a much quicker feedback loop.

I wasn't denying that. I'm sure it does. I am confident this is way better than most (if not all) current cloud solutions.

I just think it is unfortunate because there is no technical limitation of the cloud that prevents it from being instant on the cloud as well.

The cloud can't handle spikes well (1,000 customers all unexpectedly try to scan at once), but if the load is predictable, linear, or easily done in parallel, which I suspect it is for this use case, then it is perfectly doable with no delay on the cloud.

DeepCrawl does that, but again it entirely depends on the website type and infrastructure you have got.

The one advantage that I can think of is this can run on websites that are in development and not accessible to something running in the cloud. A lot of enterprise websites have their dev/stage behind a VPN, and being able to run this against those without having to find out how to jump through hoops would be really nice (which it looks like this is capable of doing, since you just need to feed it a URL). On top of that you also don't have to worry about what they're doing with the data output by the program on their server.

> On top of that you also don't have to worry about what they're doing with the data output by the program on their server.

Why would you care if the site is available on the Internet and you can't control who is browsing it?

This is a valid argument only for internal websites, which are not a target for SEO anyway.

But it can run on sites that will be public but are not yet deployed.

Yep exactly. All crawls are stored locally so there is no data issue to worry about.