I'll give a very, very basic example of why two shards with "half" the data each can be less optimal than a single shard. More complicated optimizations are left as an exercise to the reader.
Let's pretend the only data structure within a Lucene shard is a Trie.
Take 4 strings, ["Hello", "World", "Help", "Thanks"], for a total of 20 chars.
With one shard, Lucene can use prefix sharing to exploit the overlap between "Hello" and "Help": the common prefix "Hel" is stored only once. Meanwhile, "World" and "Thanks" share nothing and are always fully stored. The result is a Trie of only 17 chars, i.e. a whopping (1 - 17/20) = 15% storage optimization!
With two shards, Lucene potentially loses that optimization.
If the split is ["Hello", "Help"], ["World", "Thanks"], then Lucene needs to store two Tries with 6 chars and 11 chars. That totals 17 chars, and we still get a 15% optimization.
However, if the split is ["Hello", "World"], ["Help", "Thanks"], then Lucene needs to store two Tries with 10 chars and 10 chars. That totals 20 chars, for a 0% optimization :(
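If you want to check the arithmetic yourself, here is a minimal Python sketch of the toy model above. It is only an illustration of the Trie example, not how Lucene actually stores terms; the helper names trie_size and total_size are made up for this answer.

```python
def trie_size(words):
    """Count the character nodes in a trie built from `words` (root excluded)."""
    root = {}
    count = 0
    for word in words:
        node = root
        for ch in word:
            if ch not in node:
                # A new node is only needed where no existing prefix matches.
                node[ch] = {}
                count += 1
            node = node[ch]
    return count


def total_size(shards):
    """Sum the trie sizes over all shards; each shard gets its own trie."""
    return sum(trie_size(shard) for shard in shards)


# The three layouts from the example (toy data, not real Lucene shards).
one_shard = [["Hello", "World", "Help", "Thanks"]]
good_split = [["Hello", "Help"], ["World", "Thanks"]]
bad_split = [["Hello", "World"], ["Help", "Thanks"]]

raw_chars = sum(len(w) for w in one_shard[0])  # 20 chars without any sharing

for name, shards in [("one shard", one_shard),
                     ("good split", good_split),
                     ("bad split", bad_split)]:
    size = total_size(shards)
    saving = 1 - size / raw_chars
    print(f"{name}: {size} chars stored, {saving:.0%} saved")
```

The counts come out as 17, 17 and 20 chars, matching the numbers above. Note the comparison is case-sensitive, so the "h" in "Thanks" does not share anything with "Hello" or "Help".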
Now let's get back to reality. Lucene uses a LOT of optimizations (for both storage and query performance), and the amount of data being indexed is generally so large that these optimizations are extremely powerful. At the same time, pre-processing the data to find an optimal shard placement is (for many reasons) generally not an option.
Just to make sure this comment is never used out of context: Sharding is still extremely important, and using a single shard is only recommended if you have insignificant amounts of data.
I can’t quantify it in bytes, but a shard comes with quite a bit of baggage: it has a mapping and other data stored in the cluster state, some other bookkeeping data attached to it (where it lives, whether it is a primary or a replica, in sync or not, ...), and each shard allocates a chunk of the heap for index operations. That chunk’s size depends on whether the shard has received writes or not, and ranges from 5MB to 256(?)MB. The exact maximum varies from version to version, and I don’t think it’s in the ES docs.