Hacker News new | ask | show | jobs
Show HN: GS-Calc – a spreadsheet with 12M rows (citadel5.com)
12 points by jpiech 1119 days ago
Loading a 4GB+ workbook with over 500 million cells with random floating-point numbers and text strings - 8 worksheets, ~12 million rows each should take less than 30s on a computer with 16GB RAM.

Updating 12 million vlookups searching 12 million rows takes seconds.

Effective use of up to 64 processor cores for calculations applying any functions, including any number of user-created functions that can both extend and replace the default built-in set of ~390 functions.

Filtering, regular expressions, Monte Carlo simulations, pivot tables with up to 12 million rows, scripting.

GS-Calc can be installed on any portable storage device and used without performing any registry modifications. The installation folder requires ~8.5MB.

https://citadel5.com/gs-calc.htm (incl. YT videos with examples)

Any suggestions, especially regarding the ETL-related functions, are welcome.

Thanks

5 comments

Very cool! For anyone wondering, before 2003 Excel only supported 65,536 rows. It now has a hard limit 1,048,576 rows.

Google sheets is a bit harder to figure out -- it doesn't have an explicit row limit, but does have a cell limit of 10M cells - so depending on your number of columns you can back out to the max rows.

If you want to see some practical effects of these row limits, check out the time that England misreported thousands of covid cases - because their Excel file ran out of space silently [1]. Oops.

This looks like a pretty sweet engineering effort! Anything cool you learned while building this that you can share here that helped you achieve this performance? What language did you use to build this application?

I'm asking for interest and for selfish reasons -- fellow spreadsheet builder of Mito [2] - which is pretty much just a UI wrapper for Pandas dataframes (and so can support 10M+ rows as well).

[1] https://www.bbc.com/news/technology-54423988 [1] https://trymito.io

Thanks. It's C++, mostly with custom template classes instead of the STL.
I have just tested it for a few minutes.

What I really like is that it opens much faster than Excel and, as advertised, can handle large files. I will certainly consider buying it after the trial.

A couple of improvements that you could consider are: 1) implement row/col freezing like in Excel and not via split view because the split view is not as space-efficient (you see fewer rows on the screen), 2) add an option to add GS-Calc to "Open with" context menu, 3) a more modern UI.

Congratulations on shipping a very promising product!

Thanks. Actually, it's easy to overlook this at first but splitting views also maximizes the working area, e.g.: (1) click the table toolbar and chose to hide it, (2) right click a given split view and hide row and/or column headings, (3) right click that view and hide the vertical and/or horizontal scrollbars. In the "View" menu you can also minimize the width of the splitter line. What's left in the view are only cells. You can have up to 99 individually resized and individually synch. views of various regions of the same worksheet or any other worksheet in the same workbook (View > Choose Worksheet) per window.

Re: the "Open with" menu for other file types, it should be done automatically once you set the file association (Windows might require the installation folder not to be account-restricted).

This is great, except it seems Windows only. I'm guessing because of the VB scripting compatibility? It seems scary that Microsoft still has such dominance in spreadsheets still. OSX has Numbers? Linux has.... LibreOffice?
As a matter of fact, if someone owns the system, competing on equal terms (or at all) is impossible. Re: the Linux, it's tempting for a couple of reasons, but potential user base seems to be a problem (at the moment).
12M sound bit weird number for limit, I wonder where it stems from. Similarly I wonder why row count is limited at all, or at least why the limit is not something huge like 2^32 (or available RAM).
For various internal indexing purposes, it's more convenient to use powers of 2 or their multiplicities. While "no limits" is doable it would have some impact on the performance and would require a bit more complex UI for the column-oriented editing actions (which might not necessarily be worth it).
BTW, data tables exceeding 12 million rows can be handled by a companion database program GS-Base ( https://citadel5.com/gs-base.htm ). It's a database with spreadsheet functions. Uses up to 256 million records/rows including data types same as in GS-Calc and binary fields: Long Text, Images/Files, Code (for code snippets with syntax highlighting for 16 languages). It uses the same type of calculations within one and more tables but in this case, they record/row based.
Why is there a hard row limit at all? For a sufficiently large worksheet could you not page data in/out from disk?
Unfortunately, even for fast NVMe SSDs random access to cells (which determines the performance) can't compare to RAM. In the example above these 0.5 billion cells can still be fully loaded to 16GB RAM, but adding further such worksheets will cause spooling and noticeable slow down.