Hacker News new | ask | show | jobs
by greggyb 1888 days ago
Power Query's streaming semantics for tables and lists can lead to severe performance issues with even modest data volumes. Table.Buffer and List.Buffer offer some small amount of control, but it's likely that you have a pipeline that creates a series of intermediate table and/or list values. Every single table and list function (with the exception of the buffer functions mentioned above) creates a new lazy stream.

Accumulation patterns perform abysmally even with data in the 100Ks of elements. Say you have a table of inventory movements and want instead a snapshot table of inventory at point in time. You can do an O(n^2) self-join of a table with itself to all records with a lesser date, summing all movements to derive a total quantity at that time.

If you want to use an accumulation pattern, you can sort and cast your table to a list of records and then use List.Accumulate to iterate over each list element, deriving a new field with the running total of inventory amount. If you do this, you will find that it falls right over even with 1Ks or 10Ks of records. This is because the intermediate list that you're appending to through the accumulation is itself a lazy stream. Thus, you have to use List.Buffer at each step. Even with List.Buffer at each step, this solution falls over at high 10Ks or low 100Ks of records.

Incredibly unintuitively, you can use List.Generate with an already-buffered input list to derive a new list that can then be cast back to a table, though this still struggles with 100Ks of records.

If your snapshots can be aggregates, then you can happily throw out the idea of such an accumulation pattern and just join to a date table at the appropriate grain with all movement records less than or equal to the date in that date table.

I'll note that I regularly speak with several of the people whose blogs you will inevitably come across when performance tuning Power Query. The approaches above are the current state of the art in PQ for iteration and accumulation patterns. This is not an appeal to authority or a brag. This is to highlight the difference with the Excel spreadsheet formula approach below, which even beginners can derive from first principals.

In an Excel spreadsheet, for the same challenge, you just define a new column with a special first row formula, and each subsequent cell referencing the row above. This will happily run right up to the spreadsheet row limit with no performance concerns. If you really want, you can spill over to multiple spreadsheets, which is clunky to manage, but still performs just fine, and degrades slowly. The M approaches above hit a cliff and start hanging.

Excel formulas make it trivial to reference arbitrary cells. M is a nearly-general purpose language. PQ uses M, but as a framework for writing M, it has a strong emphasis on a query/table paradigm. A table-based operation model cuts against the grain of a spreadsheet, because a spreadsheet is a collection of arbitrary cells. A tabular approach is a collection of similarly shaped records stacked one upon the other. These two paradigms have a fair amount of overlap, but are not isomorphic. There are things trivial to express in one that become difficult bordering on impossible in the other.

2 comments

As someone developing essentially a competitor to Excel-and-PowerQuery/M, I find all this very interesting.

My language is strict and statically typed. However, after arrays (tables are arrays of records conceptually) exceed a certain length, rather than processing them in-memory as arrays, they will be offloaded to storage and processed (transparently) in a streaming fashion.

I’m surprised that this doesn’t work well in PowerQuery. I would have thought that 100K would be peanuts for it.

Mine is a SaaS however, so the user’s laptop isn’t a constraint, and I can transparently throw a million records in BigQuery or some other data warehouse and use its aggregates if needed. Although at the 100K scale you can use SQLite and it can handle that scale of data trivially on commodity laptops.

So your experience is interesting indeed.

Feel free to reach out via email if you want to follow up. My address is in my profile.

I'll note, as I did to a sibling reply of yours, I made observations about a specific pattern that showcases performance issues in PQ/M. PQ/M easily scales beyond 100Ks of records, but not for arbitrary processing patterns.

I'm skeptical. I want an example, because my experience differs.
Power Query falls over with thousands of items?

That's not my experience. At work, the data usually isn't very large, but I have experimented on my own time with, for instance, a public covid data file that I think was several GB.

I also thought lazy semantics is a good thing, not a fundamental flaw.

Rather than debate, I would be interested enough to spend some time on a sample problem, if you could provide one, where you believe Power Query inadequate, and at the same time have an alternate solution to provide a benchmark of what is adequate.

Not "PQ falls over with 1Ks of items," but rather "the M language does not do well with accumulation patterns on tables; naive approaches can hit significant performance issues in the 1Ks of records and sophisticated approaches struggle with 100Ks."

These are two very different statements. I've happily used PQ to ingest GBs of data. Its streaming semantics are fine to great for some types of processing and introduce performance cliffs for others. There's no binary judgment to be made here. Laziness is neither a fundamental flaw not an unmitigated good.

I've already shared one specific pattern above. I can share some mocked up data if you need me to, but that might be a day or two. Also, feel free to reach out via email (in my profile).

>I've already shared one specific pattern above

If you mean this:

"Say you have a table of inventory movements and want instead a snapshot table of inventory at point in time"

Then I can make my own data to play with - I only want to be clear about the constraints. Would 500K records be enough to obviate the distinction between naive and non-naive approaches? Can you quantify (not precisely) "struggle"?

I have used Table.Buffer, but I probably don't thoroughly understand its use yet.

(I belatedly realized your problem is something I've done with Sharepoint list history recently, but not that many records, so I'm going to look for a public dataset to try)

P.P.S. I guess it also makes me think - I frequently am getting my data from an Oracle database, so if something is easier done there, I'd put it in the SQL. Analytic functions are convenient.

P.P.P.S. Aha! I found a file of parking meter transactions for 2020 in San Diego, which is about 140MB and almost 2 million records. This seems like a good test because not only is it well over the number you said was problematic, but it's well over the number of rows you can have directly in one Excel sheet.

https://data.sandiego.gov/datasets/parking-meters-transactio...

Ok, I agree that PQ is slow. It is possible to calculate a running total of a column in a million row table before the sun burns out though.

I am very not an algorithm person, but I got a huge speedup from a "parallel prefix sum" instead of the obvious sequential approach or the even worse N^2.

I translated this to M by rote and trial and error (page 2): https://www.cs.utexas.edu/~plaxton/c/337/05f/slides/Parallel...

Implementing the parallel, recursive solution got me a million rows in about three and a half minutes.

Fill down (which I had to do anyway to compare) was about 10 seconds.

So...probably not the first choice in this scenario but could be worse?