Tuesday, 25 September 2012

Performance of Apache Commons StringUtils split and Google Guava Splitter

Last Friday I had a discussion with a colleague at SEITENBAU about the semantics of the split methods of the Apache Commons Lang StringUtils class. In the end we compared the Google Guava Splitter API with the Apache Commons Lang StringUtils split methods. Our conclusion was that source code based on Guava is clearer and easier to understand.

After comparing the APIs, we wondered which of the two has the faster split implementation, so I built a simple performance test for the two String split implementations. The result surprised me: in my test case the StringUtils split method is much faster than the Guava Splitter split method.

Test setup: I generate 5000 random strings, each with a length of 10000. The test strings contain commas, which serve as the separator in the test. I invoke the Apache Commons split method and the Guava Splitter with the same test data; the results are shown in the table below.

Test run                                1        2        3        4
Apache Commons StringUtils.split(…)     126 ms   122 ms   121 ms   122 ms
Google Guava splitter.split(…)          352 ms   350 ms   346 ms   349 ms
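The setup described above can be sketched roughly as follows. This is not the original test source, only a minimal JDK-only illustration: the class name `SplitBenchmarkSketch`, the helpers `generateTestData` and `timeMillis`, the alphabet, and the seed are all assumptions made for this example. `String.split` stands in as the splitter so the sketch runs without extra dependencies; the Apache Commons and Guava calls would be passed in as further lambdas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

final class SplitBenchmarkSketch {

    // Characters for the random strings (an assumption); the comma is the separator.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz,";

    // Accumulates part lengths so the JIT cannot discard the loop body as dead code.
    static long sink;

    /** Generates `count` random strings of the given length that contain commas. */
    static List<String> generateTestData(int count, int length, long seed) {
        Random random = new Random(seed);
        List<String> data = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder(length);
            for (int j = 0; j < length; j++) {
                sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
            }
            data.add(sb.toString());
        }
        return data;
    }

    /** Applies the splitter to every input, touches every part, and returns elapsed milliseconds. */
    static long timeMillis(List<String> data, Function<String, Iterable<String>> splitter) {
        long start = System.nanoTime();
        for (String input : data) {
            for (String part : splitter.apply(input)) {
                sink += part.length(); // consume each part so lazy splitters do real work
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // 5000 strings of length 10000, as in the post.
        List<String> data = generateTestData(5000, 10000, 42L);
        // JDK stand-in for the library calls under test.
        long ms = timeMillis(data, input -> Arrays.asList(input.split(",")));
        System.out.println("String.split: " + ms + " ms");
    }
}
```

Iterating over every part inside the timed loop matters here, because a splitter that returns a lazy `Iterable` would otherwise be measured without doing any splitting.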


Here is the source of my simple performance test:
Why
Does anybody have an idea why the Guava API is slower than the StringUtils split method in my test? I have read that the Guava Splitter performance should be very good, so I am surprised by the result.

Here are the dependencies I used for the performance test:

Comments:

  1. I think the answer to your question lies in the javadoc of StringUtils.split:

    * @return a two element array with index 0 being before the delimiter, and
    * index 1 being after the delimiter (neither element includes the delimiter);

    -> StringUtils always returns only 2 elements!

    Replies:
    1. https://gist.github.com/3780998/20f8c8c7ef488246953ba30658f63fc34c716e0b

    2. That's not the bug. The issue in the test was that I used the wrong StringUtils class.
      I have now fixed the test and updated the results.

      But the Apache Commons implementation is still faster. I think there is another bug in my test setup. Does anybody have an idea?

  2. I ran your test case on my own machine. First of all, I can reproduce your timings.
    But I noticed something else after changing your test code a little:
    if I remove the two inner for loops in which you access the split strings, Guava becomes faster than Apache Commons:

    -----------------------------------------
    ms % Task name
    -----------------------------------------
    00106 062% Apache Common Lang Split
    00065 038% Google Guava Splitter

    If you also increase the testDataCount to 50000, the result is:
    -----------------------------------------
    ms % Task name
    -----------------------------------------
    00875 093% Apache Common Lang Split
    00064 007% Google Guava Splitter

    I don't know why accessing the split strings performs so badly with Guava compared to Commons...

    Greez
    Clemens

    Replies:
    1. The speedup in Guava comes from the fact that it only splits on an iterator.next() call. If you remove the loops, you do not actually split anything.
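The effect described in this reply can be illustrated with a small hand-written lazy splitter. This is a simplified stand-in written for this example, not Guava's actual implementation; the class `LazySplitSketch`, the method `lazySplit`, and the counter `partsProduced` are hypothetical names.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

final class LazySplitSketch {

    /** Counts how many parts have actually been cut out of an input so far. */
    static int partsProduced = 0;

    /** Returns an Iterable that splits `input` on `separator` only while it is being iterated. */
    static Iterable<String> lazySplit(String input, char separator) {
        return () -> new Iterator<String>() {
            private int position = 0;     // start index of the next part
            private boolean done = false; // true once the last part was returned

            @Override
            public boolean hasNext() {
                return !done;
            }

            @Override
            public String next() {
                if (done) {
                    throw new NoSuchElementException();
                }
                int end = input.indexOf(separator, position);
                String part;
                if (end < 0) {
                    part = input.substring(position); // last part
                    done = true;
                } else {
                    part = input.substring(position, end);
                    position = end + 1;
                }
                partsProduced++; // the actual splitting work happens here, not in lazySplit()
                return part;
            }
        };
    }

    public static void main(String[] args) {
        Iterable<String> parts = lazySplit("a,b,c", ',');
        System.out.println(partsProduced); // 0: nothing has been split yet
        for (String part : parts) {
            // iterating drives the splitting
        }
        System.out.println(partsProduced); // 3
    }
}
```

Calling `lazySplit` alone costs almost nothing, so a benchmark that never iterates over the result measures only the creation of the `Iterable`.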

  3. okay with the fixed code, this looks better.

    my version : https://gist.github.com/3781245

    a)
    StringUtils has an optimization for separators of size 1:

    If you use fixed-size parts (a part being the sequence between two separators), then for me Guava gets faster once the separator is >= 4 characters long.

    b)
    I don't think Guava can be faster by default, because the StringUtils code looks quite optimized. Guava just seems to be faster because it only splits when next() is called.

    IMHO:

    When you need all parts anyway, use StringUtils. This should be faster because it's splitting the whole String in one loop. So you don't have the guava overhead for the Strategy/iterator calls.
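As a rough illustration of "splitting the whole String in one loop", here is a minimal eager splitter for a single-character separator. It is a sketch written for this example, not the StringUtils source; `EagerSplitSketch` and `splitEagerly` are hypothetical names. Unlike StringUtils.split, which treats adjacent separators as one, this version keeps empty parts.

```java
import java.util.ArrayList;
import java.util.List;

final class EagerSplitSketch {

    /** Splits `input` on `separator` in a single pass and returns all parts at once. */
    static List<String> splitEagerly(String input, char separator) {
        List<String> parts = new ArrayList<>();
        int start = 0; // start index of the current part
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == separator) {
                parts.add(input.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(input.substring(start)); // remainder after the last separator
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitEagerly("a,b,c", ',')); // [a, b, c]
    }
}
```

There is no iterator object and no per-part method dispatch here, which is the overhead the comment above attributes to the lazy approach.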

    Replies:
    1. I also added String.split() as reference:

      This is always faster than Guava, but can be slower than Apache, mainly because it uses a regex instead of just a CharSequence. But it gets quite fast for long separators.

      (Yes, it's not null-safe, but when parsing files I normally had to null-check beforehand anyway.)

    2. With Apache, the separator is actually a set of characters. So the performance comparison with longer separators is not valid: Guava and Java treat such a long separator string as one whole separator, not as many single characters.
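The difference between the two separator semantics can be shown with two small hand-written methods. Both are sketches for illustration only and are not taken from any of the libraries discussed; `SeparatorSemanticsSketch`, `splitOnAnyChar`, and `splitOnWholeSeparator` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class SeparatorSemanticsSketch {

    /** Character-set semantics: every character in `separatorChars` is a separator on its own. */
    static List<String> splitOnAnyChar(String input, String separatorChars) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < input.length(); i++) {
            if (separatorChars.indexOf(input.charAt(i)) >= 0) {
                if (i > start) {
                    parts.add(input.substring(start, i)); // adjacent separators yield no empty part
                }
                start = i + 1;
            }
        }
        if (start < input.length()) {
            parts.add(input.substring(start));
        }
        return parts;
    }

    /** Whole-string semantics: only the complete `separator` sequence splits the input. */
    static List<String> splitOnWholeSeparator(String input, String separator) {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        List<String> parts = new ArrayList<>();
        int start = 0;
        int hit;
        while ((hit = input.indexOf(separator, start)) >= 0) {
            parts.add(input.substring(start, hit));
            start = hit + separator.length();
        }
        parts.add(input.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitOnAnyChar("a--b-c", "--"));        // [a, b, c]
        System.out.println(splitOnWholeSeparator("a--b-c", "--")); // [a, b-c]
    }
}
```

With the same input and the same separator argument the two methods return different parts, so a benchmark with multi-character separators compares different work unless both sides use the same semantics. For whole-string splitting, Commons Lang offers StringUtils.splitByWholeSeparator.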
