Are there any tools that can infer a data type from a string?

	11 points by LucianSpec 1328 days ago
	I wonder if there is a tool that can do this. For example: given a CSV file, guess the schema.

8 comments

hackarama 1328 days ago

Python's pandas library essentially does this when creating a DataFrame object from a CSV file. Of course, it's not always correct, but may be a starting point for your use case.

dsmmcken 1327 days ago

pandas is good if you are in Python.

If you happen to be looking for something in Java, you can try: https://github.com/deephaven/deephaven-csv

Benchmarks, along with other Java alternatives, are discussed here: https://deephaven.io/blog/2022/02/23/csv-reader/

LucianSpec 1327 days ago

I used pandas, but its type guessing isn't great, especially for some string and time types. Extending pandas types is also hard; pandas is designed more as a numerical computation tool.

janosdebugs 1328 days ago

Any sort of office suite would be able to do it, but the data types would be very limited. CSV itself is not standardized, and a well-constructed spreadsheet wouldn't contain information such as a currency symbol that would make more advanced data types possible.

sidpatil 1328 days ago

> CSV itself is not standardized

Actually, CSV is formalized in RFC 4180. Adherence to that spec is another story, of course. :-)

https://www.rfc-editor.org/rfc/rfc4180

Akronymus 1328 days ago

Over here, commas are used as the decimal separator. So MS decided to helpfully use the semicolon as the separator for CSV files if your locale is set to German, rather than wrapping the values in quotes. It's a PITA.
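
For what it's worth, a minimal sketch of coping with that kind of export using only Python's standard library. The sample data and the `parse_decimal_comma` helper are made up for illustration, and the helper assumes there are no thousands separators in the values.

```python
import csv
import io

# Hypothetical sample: a semicolon-delimited export with decimal commas.
SAMPLE = "name;price;qty\nApfel;1,50;3\nBirne;2,25;10\n"

def detect_dialect(text):
    """Guess the delimiter with csv.Sniffer, restricted to common candidates."""
    return csv.Sniffer().sniff(text, delimiters=";,\t|")

def parse_decimal_comma(value):
    """Parse '1,50' as 1.5; assumes no thousands separators are present."""
    return float(value.replace(",", "."))

dialect = detect_dialect(SAMPLE)
rows = list(csv.DictReader(io.StringIO(SAMPLE), dialect=dialect))
prices = [parse_decimal_comma(row["price"]) for row in rows]
print(dialect.delimiter, prices)
```

The sniffer is itself a heuristic, so on messier files it can pick the wrong delimiter; passing the delimiter explicitly is safer when you know the source locale.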

dvh 1328 days ago

Not exactly that but your first step should be to use magic to determine file type: https://en.wikipedia.org/wiki/File_(command)

codeulike 1328 days ago

ETL tools and similar things will generally have a go at this.

e.g. the Import Flat File wizard in SQL Server

Delimited File metadata tool in Talend

Basically they look at the first N rows and take a guess. If the guess turns out to be wrong, it's usually worse than not guessing at all.
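
To make the "first N rows" idea concrete, here is a rough sketch in plain Python. It is not how any of the tools above actually work; the type names, the parser order, and the `guess_schema` function are all assumptions for illustration.

```python
import csv
import io
from datetime import datetime
from itertools import islice

# Parsers tried in order, narrowest first. The set of types is an assumption.
PARSERS = (("int", int), ("float", float), ("datetime", datetime.fromisoformat))

def guess_value_type(value):
    """Return the name of the first parser that accepts the string."""
    if value == "":
        return "null"
    for name, parser in PARSERS:
        try:
            parser(value)
            return name
        except ValueError:
            pass
    return "str"

def guess_schema(text, sample_rows=100):
    """Guess one type per column from at most `sample_rows` data rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    seen = {name: set() for name in header}
    for row in islice(reader, sample_rows):
        for name, value in zip(header, row):
            seen[name].add(guess_value_type(value))
    schema = {}
    for name, kinds in seen.items():
        kinds.discard("null")  # empty cells do not constrain the type
        if kinds and kinds <= {"int"}:
            schema[name] = "int"
        elif kinds and kinds <= {"int", "float"}:
            schema[name] = "float"
        elif kinds == {"datetime"}:
            schema[name] = "datetime"
        else:
            schema[name] = "str"  # mixed or empty column: fall back to string
    return schema

# Hypothetical data: the third row breaks the numeric guess for "amount".
SAMPLE = "id,amount,created\n1,10,2022-02-23\n2,12.5,2022-02-24\n3,n/a,2022-02-25\n"
print(guess_schema(SAMPLE, sample_rows=2))
print(guess_schema(SAMPLE))
```

With `sample_rows=2` the `amount` column is guessed as `float`, and the `n/a` in row 3 would then blow up at load time; scanning all rows falls back to `str`. That is the failure mode described above.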

Vohlenzer 1328 days ago

https://duckdb.org/docs/data/csv

LucianSpec 1327 days ago

quite interesting package~~

Akronymus 1328 days ago

In F# you could use type providers for that.

In C# there are source generators.

I am sure there are other options for other languages.

chewxy 1328 days ago

Yes. I built several of these before. The trick is to have a lattice of types at the ready.
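
One way to read "a lattice of types", sketched in Python. This is a guess at the idea, not the commenter's actual implementation: the five type names and the `WIDENS_TO` ordering are assumptions. Each observed value gets a type, and the column type is the least upper bound (join) of everything seen.

```python
from functools import reduce

# Hypothetical lattice: each type maps to the types it widens to directly.
# "null" is the bottom element and "str" is the top.
WIDENS_TO = {
    "null": ("int", "datetime"),
    "int": ("float",),
    "float": ("str",),
    "datetime": ("str",),
    "str": (),
}

def upper_set(t):
    """All types that t can be widened to, including t itself."""
    result = {t}
    for parent in WIDENS_TO[t]:
        result |= upper_set(parent)
    return result

def join(a, b):
    """Least upper bound: the most specific type that covers both a and b."""
    common = upper_set(a) & upper_set(b)
    # The least element of `common` is the one with the largest upper set.
    return max(common, key=lambda t: len(upper_set(t)))

def column_type(value_types):
    """Fold the per-value types of a column into a single column type."""
    return reduce(join, value_types, "null")

print(column_type(["int", "null", "float"]))
print(column_type(["int", "datetime"]))
```

Because `join` only ever moves upward, the result does not depend on row order, and a column can be typed in one streaming pass.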

tacostakohashi 1328 days ago

If a 'guess' is sufficient, just guess that all the columns are strings.

If that's not correct or good enough, then guessing is not a sufficient solution for this problem.

LucianSpec 1327 days ago

Yeah, CSV type guessing is just one case. I'd more likely want a metadata management solution beyond the data lake. Maybe Ontotext Refine's goal is what I want, but its RDF-based approach really sucks.
