Hacker News new | ask | show | jobs
Show HN: CSV file merging (mightymerge.io)
29 points by jampoole 2382 days ago
3 comments

I just use `xsv` for this. `xsv concatenate` and `xsv join` do a pretty good job. Blazing fast: https://github.com/BurntSushi/xsv
That tool looks pretty cool.

Funny thing is several years ago I made a very similar tool to what you pointed out above written entirely in bash/awk and is also blazing fast.

That's where I got the idea for this online tool. I wanted to create the same exact concept of working with files locally without having to open up a terminal/console.

Is the code open source? How would one know with certainty their data isn't being surreptitiously copied?

(You can also recreate this functionality locally with a few lines of Python):

import pandas as pd

import os

list_of_files = [f for f in os.listdir('.') if os.path.isfile(f) and ".csv" in f]

print(f"Files that will be concatenated are: {list_of_files}")

original_df = pd.DataFrame()

for file in list_of_files:

  df = pd.read_csv(file, keep_default_na=False)
 
  original_df = pd.concat([original_df, df], axis=0, ignore_index=True)
original_df.to_csv(f"Output file.csv", index=False)
The code is not open source. However, If you inspect the network traffic from Chrome developer tools you can verify there are 0 post events occurring.

True, you can do this with many languages like python (quite easily with pandas as you pointed out) or with awk, perl, even cat (minus ordering) etc. For more power users Excel Power Query is your friend, but as you know these require script setup and memory considerations.

This tool is geared for ease of use anyone can do without the fuss of custom scripts and/or setups and for the most part almost zero memory concerns as the output is being written on the fly.

your snippet just concatenates the content of 2 CSVs (while adjusting headers)?

Here's how it's done in JS: https://repl.it/@caub/csv-merge

I like glob for getting the list of files:

    import glob
    list_of_files = glob.glob("*.csv")
You can also use Pathlib:

    from pathlib import Path
    dir = Path(".").glob('*.csv')
Glob is great but on some OS (Linux maybe?) it doesn't guarantee maintaining the order of files. If order matters, it would make sense to sort the results prior to processing.
Obligatory plug for csvkit which allows you to do that and much more without having to write any code > https://csvkit.readthedocs.io/en/latest/
I tried vertical merge with 2 files that were each a single column of items (no header). The resulting file was 3 columns wide, 1 file in first column with second and third columns blank and then below that the items from the second file were in the 3rd column with columns 1 and 2 blank

I was expecting a 1 column result file with both files items stacked

This is expected as merging currently requires a header row for each file so that it can automatically line up the data accurately.

For example if you had two csv files...

[File1.csv]

    Id,ProductId
    1,200
    2,300
[File2.csv]

    ProductId,Name,Id
    400,NameFor400,3
    500,NameFor500,4
Final output would be...

[Output.csv]

    Id,ProductId,Name
    1,200,""
    2,300,""
    3,400,NameFor400
    4,500,NameFor500
With that said, thanks for commenting! I will add an option for "No Header" files, or maybe even auto detect for these situations and do a straight merge:)