Are there any tools that can infer data types from strings?
11 points by LucianSpec 1328 days ago
I wonder if there is a tool that can do this. For example: given a CSV file, guess the schema.
8 comments

Python's pandas library essentially does this when creating a DataFrame object from a CSV file. Of course, it's not always correct, but may be a starting point for your use case.
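
A minimal sketch of what that looks like, assuming pandas is installed. The column names and values are made up for illustration, and the dtype reported for text columns depends on the pandas version:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a real CSV file.
csv_text = "id,price,joined\n1,9.99,2022-01-05\n2,12.50,2022-02-10\n"

# By default, read_csv infers numeric dtypes but leaves dates as text.
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)

# Date parsing has to be requested explicitly per column.
df_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["joined"])
print(df_dates.dtypes)
```

The inferred dtypes are only a starting point; a column can still be overridden with the `dtype` argument when the guess is wrong.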
pandas is good if you are in Python.

If you happen to be looking for something in Java, you can try: https://github.com/deephaven/deephaven-csv

Benchmarks, along with other Java alternatives, are discussed here: https://deephaven.io/blog/2022/02/23/csv-reader/

I used pandas, but its type guessing isn't good, especially for some string and time types. Extending pandas's types is also hard; its design is more that of a numerical computation tool.
Any sort of office suite would be able to do it, but the data types would be very limited. CSV itself is not standardized, and even one exported from a well-constructed spreadsheet wouldn't contain information such as a currency symbol that would make more advanced data types possible.
> CSV itself is not standardized

Actually, CSV is formalized in RFC 4180. Adherence to that spec is another story, of course. :-)

https://www.rfc-editor.org/rfc/rfc4180

Over here, commas are used as the decimal separator, so MS helpfully decided to use the semicolon as the separator for CSV files if your locale is set to German, rather than wrapping the values in quotes. It's a PITA.
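
For anyone who has to read such files, here is a small sketch using the standard library's `csv.Sniffer` to detect the separator. The sample rows are made up, and the sniffer is only a heuristic that can guess wrong on short or irregular input:

```python
import csv
import io

# Hypothetical export from a German-locale spreadsheet:
# semicolon as field separator, comma as decimal separator.
sample = "name;price\nApfel;1,50\nBirne;2,25\n"

# Restrict the candidates to the two separators we expect.
dialect = csv.Sniffer().sniff(sample, delimiters=";,")

rows = list(csv.reader(io.StringIO(sample), dialect))

# The decimal comma still has to be handled by hand.
prices = [float(row[1].replace(",", ".")) for row in rows[1:]]
print(dialect.delimiter, prices)
```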
Not exactly that, but your first step should be to use magic to determine the file type: https://en.wikipedia.org/wiki/File_(command)
ETL tools and similar things will generally have a go at this.

e.g. the Import Flat File wizard in SQL Server

Delimited File metadata tool in Talend

Basically they look at the first N rows and take a guess. If the guess turns out to be wrong, it's usually worse than not guessing at all.
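
A toy sketch of how the first-N-rows approach goes wrong. This is not how either tool is implemented; `guess_type` and the postal-code column are hypothetical:

```python
def guess_type(values):
    """Return 'int', 'float', or 'str' for a list of raw CSV strings."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue  # at least one value did not fit; try the next type
    return "str"


# Hypothetical postal-code column: the first rows look numeric.
column = ["10115", "80331", "20095", "SW1A 1AA"]

sampled_guess = guess_type(column[:3])  # guess from the first N = 3 rows
full_guess = guess_type(column)         # guess from every row

print(sampled_guess, full_guess)
```

The sampled guess says integer, so the load fails (or silently mangles data) once the fourth row arrives, whereas a full scan would have settled on string.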

quite interesting package~~
In F# you could use type providers for that.

In C# there are source generators.

I am sure there are other options for other languages.

Yes. I built several of these before. The trick is to have a lattice of types at the ready.
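
One way to read the lattice idea, as a minimal sketch rather than the commenter's actual implementation: classify each value as its most specific type, then combine the results with a least-upper-bound ("join") operation. The type names and the shape of the lattice are assumptions made for illustration.

```python
from datetime import date
from functools import reduce

# Hypothetical lattice: each type maps to the types directly above it.
# "empty" is the bottom element and "str" is the top.
PARENTS = {
    "empty": {"bool", "int", "date"},
    "bool": {"str"},
    "int": {"float"},
    "float": {"str"},
    "date": {"str"},
    "str": set(),
}


def ancestors(t):
    """All types at or above t in the lattice."""
    seen, stack = {t}, [t]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def join(a, b):
    """Least upper bound: the most specific type that covers both a and b."""
    common = ancestors(a) & ancestors(b)
    return max(common, key=lambda t: len(ancestors(t)))


def classify(text):
    """Most specific type for a single raw string."""
    text = text.strip()
    if text == "":
        return "empty"
    if text.lower() in ("true", "false"):
        return "bool"
    for name, parse in (("int", int), ("float", float), ("date", date.fromisoformat)):
        try:
            parse(text)
            return name
        except ValueError:
            pass
    return "str"


def infer(values):
    """Fold the per-value types into one column type."""
    return reduce(join, map(classify, values), "empty")


print(infer(["1", "2", "3.5"]))      # widens int to float
print(infer(["2022-02-23", "7"]))    # no common type below str
```

Because the join is associative and commutative, rows can be scanned in any order or in parallel chunks, and the column type only ever widens.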
If a 'guess' is sufficient, just guess that all the columns are strings.

If that's not correct or good enough, then guessing is not a sufficient solution for this problem.

Yeah, CSV type guessing is just one case. I'm really looking for a metadata management solution that goes beyond the data lake. Maybe Ontotext Refine's goal is what I want, but its RDF-based approach really sucks.