Hacker News new | ask | show | jobs
by Nimelrian 2622 days ago
Yup, you can implement the whole ordeal in C++ like this:

  int main(int argc, char** argv) {
    std::vector<std::string> strings{
      "apple", "cAt", "cat", "Dog", "apple", "dog", "Cat",
      "Apple", "dOg", "banana", "cat", "dog", "apple",
    };
  
    strings.erase(
      std::remove_if(
        std::begin(strings),
        std::end(strings),
        [seen = std::unordered_set<std::string>{}](std::string_view str) mutable {
          std::string lower = "";
          std::transform(std::begin(str), std::end(str), std::back_inserter(lower), ::tolower);
          if (seen.find(lower) == std::end(seen)) {
            seen.insert(lower);
            return false;
          }
          return true;
        }
      ),
      std::end(strings)
    );
  
    for (auto& str : strings) {
      std::cout << str << "\n";
    };
  }
4 comments

Yeah, this is definitely the downside to Go. You can write more or less the same code as your C++ in Rust like this:

    use std::collections::HashSet;
    
    fn main () {
        let strings = vec![
          "apple", "cAt", "cat", "Dog", "apple", "dog", "Cat",
          "Apple", "dOg", "banana", "cat", "dog", "apple",
        ];
        
        let mut set : HashSet<&'static str> = HashSet::new();
        
        let strings : Vec<&str> = strings
            .into_iter()
            .filter(|string| set.insert(string))
            .collect();
        
        println!("{:?}", strings);
    }
(runnable snippet: https://play.rust-lang.org/?version=stable&mode=debug&editio...)

EDIT: updated to make use of the fact that `set.insert` returns a boolean.

For completeness, Java is basically the same:

    List<String> strings = new ArrayList<>(List.of("apple", "cAt", "cat", "Dog", "apple", "dog", "Cat",
                                                   "Apple", "dOg", "banana", "cat", "dog", "apple"));

    Set<String> seen = new HashSet<>();
    strings.removeIf(s -> !seen.add(s.toLowerCase()));

    strings.forEach(System.out::println);
EDIT How much copying does each implementation do?

The Go version only copies in the cleanup pass he doesn't show, so it copies each surviving element once, and no other elements. The Rust version makes a new vector, so it does the same minimal amount of copying (well, moving). The C++ version uses remove_if, and the documentation doesn't specify how it does the copying [1], but knowing what C++ people are like, it will probably do the minimum of copying too. The Java version hits some truly hoopy code in removeIf [2], which does the same minimal copying, but does allocate a bitmap and walk the array twice, in order to support predicates which want to look at the rest of the list.

I would say all of these reflect their language's values. The Go code is efficient, but only because the programmer has to write it all out longhand, so there is nowhere for inefficiency to hide. The C++ code is (i think!) efficient, because C++ implementers value maximal efficiency in their libraries. The Java code sacrifices a little efficiency to allow its users to do silly things safely.

[1] https://en.cppreference.com/w/cpp/algorithm/remove

[2] http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/j...

>The Rust version makes a new vector, so it does the same minimal amount of copying (well, moving).

To be clear, the Rust version does not do any copying or moving of the strings. Both the set and the new Vec contains &'static str, same as the original Vec.

But a more realistic example is where the original Vec was of String elements instead of &'static str. The same point would still apply, but the result Vec of &str would borrow from the original Vec of String. If the goal was to produce a Vec of String, then it would need to copy (clone) the Strings from the original Vec.

The output form this is a little different form the article, as it additionally requires:

* that the uniqueness check is done case-insensitively,

* that the casing from the first instance of the string in the input array should be used in the output, and

* that the output should be ordered based on the order of the first instance of the string in the input array

Possibly an unusual set of requirements (I'd generally expect that if removing duplicates from a list, that the output order wouldn't matter) but hey, different problems have different requirements.

The desired output is:

    apple
    cAt
    Dog
    banana
while that code snippet produces:

    apple
    cAt
    cat
    Dog
    dog
    Cat
    Apple
    dOg
    banana
I've tried to fix the Rust code, but I'm getting some lifetime errors. Can anyone spot the problem? (Note: I've used C++ for most of the time and am still uncomfortable with Rust)

  use std::collections::HashSet;

  fn main () {
    let strings = vec![
      "apple", "cAt", "cat", "Dog", "apple", "dog", "Cat",
      "Apple", "dOg", "banana", "cat", "dog", "apple",
    ];

    let mut set : HashSet<&'static str> = HashSet::new();
    
    let strings : Vec<&str> = strings
        .into_iter()
        .map(|string| (string, string.to_lowercase()))
        .filter(|(_, lower)| set.insert(lower))
        .map(|(string, _)| string)
        .collect();
    
    println!("{:?}", strings);
  }
The lower value is of type &String because the argument to filter is of type &(String, String) and the pattern match needs to propagate the &. The reference is only valid for the body of the filter closure (that is, it is &'short String, for some anonymous/unnamed lifetime 'short), but it is being put into a set that expects &'static str (the explicit type on the let line). The &String to &str type coercion is fine, but it's not correct to treat the &'short as a &'static (the short one is potentially invalid after filter's closure returns).

Changing the type annotations won't make this compile, because it is actually catching a real bug. The iterator is lazy and the full pipeline is executed for each element at once: lowercasing, inserting into the set, inserting into the Vec that's the result of the collect, and deallocating the lower string. This last step is the key/danger: if the HashSet held references/slices to the lower strings (instead of owning them), those references would become dangling immediately and future look-ups into the set won't work right/will trigger undefined behaviour.

The problem is a little clearer (and mostly fixed) if you simplify the code slightly by removing the two map calls, and instead call to_lowercase in the filter directly:

        .into_iter()
        .filter(|string| set.insert(string.to_lowercase()))
        .collect();
This form is a type error, that can be corrected by changing the type annotation to be HashSet<String>, or even removing it entirely and letting type inference handle it. The HashSet owning the strings is the key, so they only disappear after the entire iteration is complete, not after each element.
Even with the hash set solution, it would be nice to not copy the string contents, which unfortunately is needed if the hash set contains lowercased versions of the strings. On the other hand, converting all strings to lowercase then comparing them isn't a great way to case-insensitively compare strings anyway, as Unicode doesn't guarantee that will work.

The UniCase crate defines a wrapper around strings with a case-insensitive Eq implementation, so this works:

    use std::collections::HashSet;
    use unicase::UniCase;

    fn main () {
        let strings = vec![
            "apple", "cAt", "cat", "Dog", "apple", "dog", "Cat",
            "Apple", "dOg", "banana", "cat", "dog", "apple",
        ];
        
        let mut set = HashSet::new();
        
        let strings: Vec<_> = strings
            .iter()
            .filter(|&string| set.insert(UniCase::new(string)))
            .collect();
        
        println!("{:?}", strings);
    }
https://play.rust-lang.org/?version=stable&mode=debug&editio...
This is pretty cool. People might think it's cheating to pull in a crate. But IMO the fact that its super easy to pull in a crate in Rust is one of the best things about the language.
You don't even need the filter, all you need is:

    let dedup = strings.into_iter().map(|s| s.to_lowercase()).collect::<HashSet<_>>().into_iter().collect::<Vec<_>>();
Unfortunately that doesn't preserve case nor order of the input strings.
Similar to the C++, all of those are addressed by inserting lower-cased into the set, which can be achieved via an extra method call:

   set.insert(string.to_lowercase())
It also requires adjusting/deleting the type annotation on the set.

https://play.rust-lang.org/?version=stable&mode=debug&editio...

Just like C++, your filter can be simplified like,

filter(|string| {set.insert(string)}

as insert return a bool.

Oo, I didn't realise that. Super handy. I've updated my snippet to make use of your suggestion :)
if (seen.find(lower) == std::end(seen)) { seen.insert(lower); return false; }

You can just return

seen.insert(lower).second

as insert function return a pair<iterator, bool>

And today I learned something new. Been a long time since I wrote my last lines of C++ anyway.
Bonus points for using emplace() instead of insert
This is good enough for "small" arrays, but I wouldn't wanna run it on really large ones. You double the space with those lowercased copies + some extra space for the set itself. And you'll also spend a lot of time lowercasing the strings and std::hash'ing them for the set.
You need to be careful with C's tolower. It takes a signed integer, and negative values are undefined behaviour. It would be safer to cast to unsigned char before calling tolower. https://en.cppreference.com/w/cpp/string/byte/tolower