Hacker News new | ask | show | jobs
by qwhelan 2337 days ago
>Want to join two dataframes together like you'd join two database tables? df.join(other=df2, on='some_column') does the wrong thing, silently, what you really wanted was df.merge(right=df2, on='some_column')

Simply a matter of default type of join - join defaults to left while merge defaults to inner. They use the exact same internal join logic.

>What if they're optional? pd.DataFrame({'foo': [1,2,3,None]}) will silently change your integers to floating point values.

This was a long standing issue but is no longer true.

>Want to check if a dataframe is empty? Unlike lists or dicts, trying to turn a dataframe into a truth value will throw ValueError.

Those are 1D types where that's simple to reason about. It's not as straightforward in higher dimensions (what's the truth value of a (0, N) array?), which is why .empty exists

1 comments

> Simply a matter of default type of join - join defaults to left while merge defaults to inner.

No, join does an index merge. For example, if you try to join with string keys, it'll throw an error (because strings and numeric indexes aren't compatible).

  left = pd.DataFrame({"abcd": ["a", "b", "c", "d"], "something": [1,2,3,4]})
  right = pd.DataFrame({"abcd": ["d", "c", "a", "b"], "something_else": [4,3,1,2]})
  left.join(other=right, on="abcd")
  
  ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
If you try to join with numeric keys:

  left = pd.DataFrame({"abcd": ["a", "b", "c", "d"], "something": [10,20,30,40]})
  right = pd.DataFrame({"abcd": ["d", "c", "a", "b"], "something": [40,30,10,20]})
  
  left.join(other=right, on="something", rsuffix="_r")
  
    abcd  something abcd_r  something_r
  0    a         10    NaN          NaN
  1    b         20    NaN          NaN
  2    c         30    NaN          NaN
  3    d         40    NaN          NaN
Or even worse if your numeric values are within the range for indexes, which kind of looks right if you're not paying attention:

  left = pd.DataFrame({"abcd": ["a", "b", "c", "d"], "something": [1,2,3,4]})
  right = pd.DataFrame({"abcd": ["d", "c", "a", "b"], "something": [4,3,1,2]})
  left.join(other=right, on="something", rsuffix="_r")
  
    abcd  something abcd_r  something_r
  0    a          1      c          3.0
  1    b          2      a          1.0
  2    c          3      b          2.0
  3    d          4    NaN          NaN
Whereas merge does what one would expect:

  left.merge(right=right, on="something", suffixes=['', '_r'])
  
    abcd  something abcd_r
  0    a         10      a
  1    b         20      b
  2    c         30      c
  3    d         40      d
>> What if they're optional? pd.DataFrame({'foo': [1,2,3,None]}) will silently change your integers to floating point values.

> This was a long standing issue but is no longer true.

Occurs in pandas 0.25.1 (and the release notes for 0.25.2 and 0.25.3 don't mention such a change), so that would likely be still the case in the latest stable release.

  pd.DataFrame({"foo": [1,2,3,4,None,9223372036854775807]})
  
              foo
  0  1.000000e+00
  1  2.000000e+00
  2  3.000000e+00
  3  4.000000e+00
  4           NaN
  5  9.223372e+18
It's also a lossy conversion if the integer values are large enough:

  df = pd.DataFrame({"foo": [1,2,3,4,None,9223372036854775807,9223372036854775806]})
  
              foo
  0  1.000000e+00
  1  2.000000e+00
  2  3.000000e+00
  3  4.000000e+00
  4           NaN
  5  9.223372e+18
  6  9.223372e+18
  
  df["foo"].unique()
  
  array([1.00000000e+00, 2.00000000e+00, 3.00000000e+00, 4.00000000e+00, nan, 9.22337204e+18])
>> Want to check if a dataframe is empty? Unlike lists or dicts, trying to turn a dataframe into a truth value will throw ValueError.

> Those are 1D types where that's simple to reason about. It's not as straightforward in higher dimensions (what's the truth value of a (0, N) array?), which is why .empty exists

It's not very pythonic, though. A definition of "all dimensions greater than 0" would've been much less surprising.

> Occurs in pandas 0.25.1 (and the release notes for 0.25.2 and 0.25.3 don't mention such a change), so that would likely be still the case in the latest stable release.

It was released in 0.24.0: https://pandas.pydata.org/pandas-docs/stable/user_guide/inte...

For example:

    pd.DataFrame({"foo": [1,2,3,4,None]}, dtype=pd.Int64Dtype())

        foo
    0     1
    1     2
    2     3
    3     4
    4  <NA>

    pd.DataFrame({"foo": [1,2,3,4,None,9223372036854775807,9223372036854775806]}, dtype=pd.Int64Dtype())

                       foo
    0                    1
    1                    2
    2                    3
    3                    4
    4                 <NA>
    5  9223372036854775807
    6  9223372036854775806
Sure, if you specify the type. It's still a gotcha because the default behavior is to upcast to floating point unless the type is defined for every integer column of every data frame, which isn't very pythonic.

The example with the (incorrect) join above shows how even other operations can cause this type conversion.

Yes, there's a lot of existing code written assuming the old behavior. But most code has only a few ingestion points, so it's pretty simple to turn on.