| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by codetrotter 2692 days ago

> I think DB's could definitely do more to expose what the cost of various operations are, it would be great if you could "explain" a migration before you run it like you can with a query and it would calculate a rough cost, how many rows need to be touched, what resources need to be locked, even how likely the required locks are to cause contention with other frequently-taken locks based on system statistics, etc.

As part of my development process for a project I am working on currently I have spent quite a bit of time writing a Python 3 program to generate sample data.

It's nothing groundbreaking but I haven't seen anyone talk about this so I think it might be of interest to others maybe?

I'll explain how that is relevant to the comment I am replying to in a moment but first I would like to talk a little about said program.

So first of all you have other existing tools for generating sample data. For example in DBeaver (https://dbeaver.io/) you have functionality for generating mock data. However, while DBeaver as a whole is a nice tool that I am happy to have learned about, the mock data generating functionality is from what I have seen severely limited.

In particular, what I wanted to do was to generate data that would adhere to arbitrary statistical distribution of my liking.

In the first version of my program it took about 50 minutes (unacceptable!) to generate and insert 10,000,000 records on my laptop, and I was consuming so much RAM that my laptop started swapping which hurt the performance as well (to be fair my laptop only has 8GB of RAM but still). That however was just the initial starting point -- it was even only doing statistical distribution for one property (age), a few were assigned random values and all of the other were assigned a single value common to all.

After a bit of thinking I decided that the next step would be to construct a tree structure with frequencies for the different values. This tree is cheap for memory and fast to build.

First I specify a set of "distribution templates" that instruct the frequencies for values or groups of values of each of the properties that will have a desired statistical distribution.

For example, I create a demographics model (simplified here for brevity):

    distribution_templates = \
    {
      'age':
      {
        '18-24': 150,
        '25-34': 206,
        '35-44': 185,
        '45-54': 177,
        '55-64': 175,
        '65+':   106,
      },
      'gender':
      {
        'female':         80,
        'male':           15,
        'other':           3,
        'rather not say':  2,
      },
    }

From this I create at runtime what I refer to as a "combinatorial tree" with calculated target frequencies, and then I distribute a population count over it that is specified at runtime as well. Let's say that we want to generate 10,000,000 sample users. The resulting tree looks like this:

    -- root -- 'all' (10000000) -- age -- '18-24' (1501501) -- gender -- 'female' (1201201)
                                       |                              |- 'male' (225225)
                                       |                              |- 'other' (45045)
                                       |                              `- 'rather not say' (30030)
                                       |- '25-34' (2062062) -- gender -- 'female' (1649650)
                                       |                              |- 'male' (309309)
                                       |                              |- 'other' (61862)
                                       |                              `- 'rather not say' (41241)
                                       |- '35-44' (1851852) -- gender -- 'female' (1481482)
                                       |                              |- 'male' (277778)
                                       |                              |- 'other' (55555)
                                       |                              `- 'rather not say' (37037)
                                       |- '45-54' (1771772) -- gender -- 'female' (1417418)
                                       |                              |- 'male' (265766)
                                       |                              |- 'other' (53153)
                                       |                              `- 'rather not say' (35435)
                                       |- '55-64' (1751752) -- gender -- 'female' (1401402)
                                       |                              |- 'male' (262763)
                                       |                              |- 'other' (52552)
                                       |                              `- 'rather not say' (35035)
                                       `- '65+' (1061061) ---- gender -- 'female' (848849)
                                                                      |- 'male' (159159)
                                                                      |- 'other' (31832)
                                                                      `- 'rather not say' (21221)

In addition to that I have created functions for things like algorithmically generating unique usernames.

Then I have a tree walker that yields all of the combinations represented by the tree. From the tree above we would yield { 'age': '18-24', 'gender': 'female' } a total of 1,201,201 times, { 'age': '18-24', 'gender': 'male' } a total of 225,225 times and so on. (Again, the real tree is deeper than this -- there's more than just age and gender in the actual model. Also, age is specified at a more fine-grained level but the concept is the same.)

I retrieve these values from the generator function in batches and generate sample user profiles based on that.

So within a span like '18-24' I linearly distribute dates of birth by calculating (upper date bound - lower date bound) / count and using that as the delta to step the date of births by within the span starting from the lower date bound. (And of course the more fine-grained your spans are, the lesser the impact of linearly distributing values within each span.)

And I generate additional properties that are pseudo-random like usernames.

The script execution time is presently down to about 6 and a half minutes for generating and copying (using copy instead of insert was another optimization I made to the script along with a few others from https://www.postgresql.org/docs/current/populate.html after meticulously measuring that said optimizations had a significant positive impact on the time it took to put the data into the db), memory usage is very reasonable, and with the most recent commit I made which makes the generated usernames guaranteed to be unique (for up to a set number of generated user above the 10,000,000 I am doing), the stage is set for running the script on as many cores as your computer has to offer, further cutting down the script execution time.

Anyway, now on to how any of this is relevant to your comment.

I wrote this tool as part of my development process because a lot of the value that the project will offer to its users is tied directly to segmenting the users by various facets, and so I need sample data that allows me to explore the user experience while developing the project.

But I also think that this sort of tool could be useful in the situation that you guys are talking about here.

So when you are doing a migration you could generate a smaller test set of data that realistically reflects the real data of your users and you could then run the migration on that test set to get a very good idea about the cost of the various operations like you wanted. Agree?

Also, if anyone else knows of similar tools to mine I am always interested in knowing about them. Discovering what exists already can be hard and I have gotten to learn about many interesting and useful tools through discussions with others.