Speed of Java streams

From time to time I witness or am involved in discussions about Java streams performance. The common wisdom is that streams trade some performance for brevity and convenience. I couldn’t find any comprehensive set of benchmarks showing how fast (or slow) streams really are, so I decided to prepare and run a few tests myself. Below I present the gathered results.

What and how has been tested

I measured the performance of several common tasks implemented both with plain forEach loops and with streams, like applying some function to a list of numbers and summing the results, or filtering and grouping such a list: usual programming tasks. The same tasks were implemented with different flavors of Stream and the results were compared.

The choice of benchmarking tool was easy: the battle-tested micro-benchmark toolkit JMH, maintained under the umbrella of OpenJDK. It’s very flexible and easy to use.

The complete benchmark source code can be found on GitHub. Please take a look there for details.

The benchmarks were run on a 4-core virtual machine using Java 21 with -Xms2G -Xmx2G JVM arguments, against lists of different sizes: 1 000, 10 000, 100 000 and 1 000 000 elements, to verify how each implementation performs depending on the number of elements it has to process.
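
For readers new to JMH, below is a minimal runner sketch showing how such a configuration might be launched programmatically. The class name BenchmarkRunner and the include pattern are illustrative only; the actual wiring in the repository may differ.

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunner {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include("Benchmark")        // regex matching the benchmark classes
                .jvmArgs("-Xms2G", "-Xmx2G") // heap settings used in these benchmarks
                .forks(1)
                .build();
        new Runner(opts).run();
    }
}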

The benchmark results are presented in the charts as percentages of the best result for a given sample size. The charts are normalized in such a way that for a given list size the best score among the different implementations is found, and every score is divided by it. The best result is 100 %; when you see a 50 % score on the chart, it means that benchmark was half as fast as the best one.
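
Expressed as code, the normalization is roughly this (a sketch, not part of the benchmark suite):

static Map<String, Double> normalize(Map<String, Double> scores) {
    // Find the best (highest) raw score, then express every score
    // as a percentage of it; the best implementation gets 100 %
    double best = scores.values().stream()
            .mapToDouble(Double::doubleValue)
            .max()
            .orElseThrow();
    return scores.entrySet().stream()
            .collect(Collectors.toMap(
                    Map.Entry::getKey,
                    e -> 100.0 * e.getValue() / best));
}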

Sum of integers

This is the first benchmark, comparing the performance of summing the integer numbers from a list.

// Benchmark parameters
@State(Scope.Benchmark)
public static class Params {

    // List sizes to run the benchmark with
    @Param({"1000", "10000", "100000", "1000000"})
    public int size;

    // Items to run the benchmark on
    public List<Integer> items;

    // Set up test data; runs once and does not affect the results
    @Setup
    public void setUp() {
        items = IntStream.range(0, size)
                .mapToObj(i -> i)
                .collect(Collectors.toList());
    }
}

// Plain old forEach implementation
@Benchmark
public int forEach(Params params) {
    int res = 0;

    for (Integer item : params.items) {
        res += item;
    }

    return res;
}

// Using the summing collector
@Benchmark
public int collect(Params params) {
    return params.items.stream().collect(Collectors.summingInt(i -> i));
}

// Using the summing collector on a parallel stream
@Benchmark
public int collectPar(Params params) {
    return params.items.parallelStream()
            .collect(Collectors.summingInt(i -> i));
}

// Using reduce
@Benchmark
public int reduce(Params params) {
    return params.items.stream().reduce(0, Integer::sum);
}

// Using reduce on a parallel stream
@Benchmark
public int reducePar(Params params) {
    return params.items.parallelStream().reduce(0, Integer::sum);
}
Sum of integers benchmark results

It was easy to predict. We have the obvious winner: forEach, but its advantage decreases with growing item count, especially when compared to the parallel version of collect. Sequential collect is faster at summing 1 000 and 10 000 elements, but then the parallel version starts to move ahead. reduce doesn’t perform well at all, worst in almost all cases, but again, the parallel stream with reduce does better when the sample is bigger.

If the greatest part of your program is responsible for summing a reasonably big number of integers, go for forEach.

If you think I’m a little bit sarcastic here … you are right. It’s very unlikely that this level of performance optimization matters, especially in the world of REST APIs, microservices and event-driven architectures. If we replace Stream.collect() with forEach to sum 100 000 integers obtained via some REST endpoint and save the result in a database, we don’t profit much. We gain tens of microseconds compared to the seconds required to fetch the data from a database, serialize it, transfer it over the wire and deserialize it.

Sum of calculations

Let’s take a more “realistic” task this time: perform some math operation on a list of randomly generated floating-point numbers and sum all the calculated values.

// Benchmark parameters
@State(Scope.Benchmark)
public static class Params {

    @Param({"1000", "10000", "100000", "1000000"})
    public int size;

    public List<Double> items;

    // We generate pseudo-random doubles
    @Setup
    public void setUp() {
        Random random = new Random();
        items = random.doubles(size).mapToObj(i -> i)
                .collect(Collectors.toList());
    }
}

// This is our calculation: take a double, compute its
// logarithm, then the sine, then the square root
private static double calculate(double value) {
    return Math.sqrt(Math.sin(Math.log(value)));
}

// Using forEach
@Benchmark
public Double forEach(Params params) {
    Double res = 0d;
    for (Double item : params.items) {
        res += calculate(item);
    }
    return res;
}

// Using collect with the summing collector
@Benchmark
public Double collect(Params params) {
    return params.items.stream()
            .map(DoubleCalculationBenchmark::calculate)
            .collect(Collectors.summingDouble(i -> i));
}

// Using collect with the summing collector on a parallel stream
@Benchmark
public Double collectPar(Params params) {
    return params.items.parallelStream()
            .map(DoubleCalculationBenchmark::calculate)
            .collect(Collectors.summingDouble(i -> i));
}

// Using reduce
@Benchmark
public Double reduce(Params params) {
    return params.items.stream()
            .map(DoubleCalculationBenchmark::calculate)
            .reduce(0d, Double::sum);
}

// Using reduce on a parallel stream
@Benchmark
public Double reducePar(Params params) {
    return params.items.parallelStream()
            .map(DoubleCalculationBenchmark::calculate)
            .reduce(0d, Double::sum);
}
Calculations benchmark results

Well, what a surprise! The forEach score is more or less the same as that of the sequential stream-based versions. Parallel stream implementations are about 3 to 5 times faster. Actually, such results are quite easy to explain.

There are costs related to stream creation and thread coordination that can’t be amortized when the number of operations to parallelize is small or when each operation itself is fast.

We didn’t observe such quick benefits of parallel streams in the previous benchmark. According to Amdahl’s law, which defines the theoretical limits of parallelism, the greater the part of our program that runs sequentially, the less it can be sped up by parallelism. This is exactly the case when summing a list of integers: adding two integers is a very fast operation, so the inherently sequential part responsible for thread synchronization prevailed, especially for relatively small item counts.
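
For the record, Amdahl’s law says that with a parallel fraction p of the work and n cores, the maximal speedup is 1 / ((1 - p) + p / n). A small sketch of what this means for the 4-core machine used in these benchmarks:

// Theoretical speedup limit for a program whose fraction p can be parallelized
static double amdahlSpeedup(double p, int cores) {
    return 1.0 / ((1.0 - p) + p / cores);
}

// Even with 95 % of the work parallelizable, 4 cores give at most:
// amdahlSpeedup(0.95, 4) ≈ 3.5x
// With only 50 % parallelizable, the limit drops to:
// amdahlSpeedup(0.50, 4) = 1.6x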

The same law explains the slightly better performance of the reduce operation in this benchmark. The reduce contract requires a binary, associative operation that is free from side effects and works on immutable values. Double::sum falls into this category. If we obey these constraints, no synchronization is required while merging the partial results.
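
A minimal illustration of why the contract matters (not part of the benchmarks): subtraction is binary but not associative, so a parallel reduce using it may produce a different result on every run, while Double::sum stays correct.

List<Integer> nums = List.of(1, 2, 3, 4, 5);
// Sequential result is deterministic: ((((0 - 1) - 2) - 3) - 4) - 5 = -15
int seq = nums.stream().reduce(0, (a, b) -> a - b);
// Parallel result is undefined: subtraction is not associative
// and 0 is not a true identity for it
int par = nums.parallelStream().reduce(0, (a, b) -> a - b);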

Grouping

Grouping a data set with streams is easy, concise and very descriptive, and therefore really frequently used, at least in the code I see on a daily basis. This benchmark compares that feature with a version implemented imperatively using a forEach loop. There are plenty of variations that might affect performance: the stream can be sequential or parallel, ordered or unordered, and we have several options when choosing a collector. We can use the Stream.reduce() operator instead of Stream.collect(), or different data structures as grouping containers. That’s why multiple implementation variants were tested.

The benchmarked task is: take a list of randomly generated Doubles and group them by the result of division by 100 rounded to a long. The produced result is of type Map<Long, List<Double>>.

// This is the grouping divisor value
static final double DIVISOR = 100.0;

// Benchmark parameters
@State(Scope.Benchmark)
public static class Params {

    @Param({"1000", "10000", "100000", "1000000"})
    public int size;

    public List<Double> items;

    @Setup
    public void setUp() {
        Random random = new Random();
        items = random.doubles(size)
                .mapToObj(i -> i)
                .collect(Collectors.toList());
    }
}

// Using forEach
@Benchmark
public Map<Long, List<Double>> forEach(Params params) {
    Map<Long, List<Double>> res = new HashMap<>();

    for (Double item : params.items) {
        Long key = Math.round(item / DIVISOR);
        List<Double> list = res.get(key);

        if (list != null) {
            list.add(item);
        } else {
            list = new ArrayList<>();
            list.add(item);
            res.put(key, list);
        }
    }

    return res;
}

// Using forEach and a linked list
@Benchmark
public Map<Long, List<Double>> forEachLinked(Params params) {
    Map<Long, List<Double>> res = new HashMap<>();

    for (Double item : params.items) {
        Long key = Math.round(item / DIVISOR);
        List<Double> list = res.get(key);

        if (list != null) {
            list.add(item);
        } else {
            list = new LinkedList<>();
            list.add(item);
            res.put(key, list);
        }
    }

    return res;
}

// Using forEach, a linked list and the map's compute method
@Benchmark
public Map<Long, List<Double>> forEachLinkedCompute(Params params) {
    Map<Long, List<Double>> res = new HashMap<>();

    for (Double item : params.items) {
        res.compute(Math.round(item / DIVISOR), (k, v) -> {
            if (v == null) {
                LinkedList<Double> list = new LinkedList<>();
                list.add(item);
                return list;
            }
            v.add(item);
            return v;
        });
    }

    return res;
}

// Using the grouping collector
@Benchmark
public Map<Long, List<Double>> collect(Params params) {
    return params.items.stream()
            .collect(Collectors.groupingBy(n -> Math.round(n / DIVISOR)));
}

// Using the grouping collector with a linked list
@Benchmark
public Map<Long, List<Double>> collectLinked(Params params) {
    return params.items.stream()
            .collect(Collectors.groupingBy(
                    n -> Math.round(n / DIVISOR),
                    Collectors.toCollection(LinkedList::new)));
}

// Using a parallel stream
@Benchmark
public Map<Long, List<Double>> collectPar(Params params) {
    return params.items.parallelStream()
            .collect(Collectors.groupingBy(n -> Math.round(n / DIVISOR)));
}

// Using a parallel stream with a linked list
@Benchmark
public Map<Long, List<Double>> collectParLinked(Params params) {
    return params.items.parallelStream()
            .collect(Collectors.groupingBy(
                    n -> Math.round(n / DIVISOR),
                    Collectors.toCollection(LinkedList::new)));
}

// Using a parallel unordered stream and the concurrent grouping collector
// with a linked list
@Benchmark
public Map<Long, List<Double>> collectParOpt(Params params) {
    return params.items.parallelStream()
            .unordered()
            .collect(Collectors.groupingByConcurrent(
                    n -> Math.round(n / DIVISOR),
                    Collectors.toCollection(LinkedList::new)));
}

// Using parallel stream reduction with an immutable map and list
@Benchmark
public io.vavr.collection.HashMap<Long, io.vavr.collection.List<Double>> reducePar(Params params) {
    return params.items.parallelStream()
            .reduce(
                    io.vavr.collection.HashMap.empty(),
                    (m, n) -> {
                        Long key = Math.round(n / DIVISOR);
                        return m.put(key, m.get(key)
                                .map(l -> l.prepend(n))
                                .getOrElse(() -> io.vavr.collection.List.of(n)));
                    },
                    (l, r) -> l.merge(r, io.vavr.collection.List::prependAll));
}

// Using parallel unordered stream reduction with an immutable map and list
@Benchmark
public io.vavr.collection.HashMap<Long, io.vavr.collection.List<Double>> reduceParUnord(Params params) {
    return params.items.parallelStream()
            .unordered()
            .reduce(
                    io.vavr.collection.HashMap.empty(),
                    (m, n) -> {
                        Long key = Math.round(n / DIVISOR);
                        return m.put(key, m.get(key)
                                .map(l -> l.prepend(n))
                                .getOrElse(() -> io.vavr.collection.List.of(n)));
                    },
                    (l, r) -> l.merge(r, io.vavr.collection.List::prependAll));
}
Grouping benchmark results

Among the sequential implementations, the forEach based benchmarks are slightly better than streams collected to a Map. Using LinkedList usually improves results marginally, but not for the collected parallel stream. Actually, there is a severe performance loss visible when Collectors.toCollection(LinkedList::new) is used as the downstream collector. The parallel stream optimization attempt, simply speaking, failed miserably. All that was added were Stream.unordered() and Collectors.groupingByConcurrent(), yet it caused the collectParOpt benchmark to be approximately 10 times slower in every case.

I gave Stream.reduce() on a parallel stream a try here too. To make parallel reduction work properly, we have to use immutable data structures, and Java’s HashMap is mutable. The immutable Map and List from the Vavr library came to the rescue, but with limited success. To be honest, my expectations towards parallel reduction performance were initially much higher, but I quickly realized those expectations were unreasonable. Vavr’s Map has effectively constant-time get and put operations, but we always have to call both of them, because its API offers nothing similar to plain Java’s Map.compute().
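
As a side note, plain Java’s Map.computeIfAbsent() would make the imperative grouping loop from the forEach benchmark more concise. I didn’t benchmark this variant, but a sketch looks like this:

Map<Long, List<Double>> res = new HashMap<>();
for (Double item : items) {
    // Create the bucket on first access, then append the item to it
    res.computeIfAbsent(Math.round(item / DIVISOR), k -> new ArrayList<>())
            .add(item);
}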

I wouldn’t like to draw any significant conclusions regarding the reasons behind these results; there is just not enough data gathered. The question “to stream or not to stream” seems to have the following answer: if we care about microsecond-level performance, we should use sequential forEach or parallel Stream.collect(). And we’d better skip optimization attempts without testing them in real-life scenarios.

Filtering and sorting distinct items

Other frequently used features of Stream are filtering, sorting and distinct item selection. Maybe not all together, but separately for sure. This benchmark takes random decimal numbers, filters them by predefined minimal and maximal values, takes the distinct values and sorts them in natural order.

// Minimal value used in the filter
static final double MIN = 0.0;
// Maximal value used in the filter
static final double MAX = 10.0;

// Benchmark parameters
@State(Scope.Benchmark)
public static class Params {

    @Param({"1000", "10000", "100000", "1000000"})
    public int size;

    public List<Double> items;

    @Setup
    public void setUp() {
        Random random = new Random();
        // Generate 500 random values from the range MIN <= value < MAX + 5
        List<Double> values = random.doubles(500, MIN, MAX + 5)
                .mapToObj(i -> i)
                .collect(Collectors.toList());
        // Build the benchmarked list by sampling those values at random
        items = random.ints(size, 0, values.size())
                .mapToObj(values::get)
                .collect(Collectors.toList());
    }
}

// Using forEach
@Benchmark
public Collection<Double> forEach(Params params) {
    Set<Double> set = new HashSet<>();
    for (Double item : params.items) {
        if (item > MIN && item < MAX) {
            set.add(item);
        }
    }
    List<Double> res = new ArrayList<>(set);
    Collections.sort(res);
    return res;
}

// Using forEach and a TreeSet
@Benchmark
public Collection<Double> forEachTreeSet(Params params) {
    Set<Double> set = new TreeSet<>();
    for (Double item : params.items) {
        if (item > MIN && item < MAX) {
            set.add(item);
        }
    }
    return set;
}

// Using collect
@Benchmark
public Collection<Double> collect(Params params) {
    return params.items.stream()
            .filter(n -> n > MIN && n < MAX)
            .distinct()
            .sorted()
            .collect(Collectors.toList());
}

// Using collect on a parallel stream
@Benchmark
public Collection<Double> collectPar(Params params) {
    return params.items.parallelStream()
            .filter(n -> n > MIN && n < MAX)
            .distinct()
            .sorted()
            .collect(Collectors.toList());
}
Filtering and sorting benchmark results

Stream based code is the winner, and a common pattern emerges: parallel streams do better as the list grows longer. The optimization attempt with a TreeSet in the forEachTreeSet benchmark, instead of sorting in place, didn’t succeed.

Conclusions

It looks like I found nothing revealing. A Java stream can sometimes be faster and sometimes slower than an imperative loop. I will be frank with you: I’m really glad that Java Stream is good enough in terms of performance, at least in the scope of common enterprise-class applications. I’d really miss its beautiful and descriptive abstraction if I had to drop it due to poor performance. I did my best to perform legitimate benchmarks, and I’m convinced by their results. If you are not, my dear reader, please let me know.

There is one more unknown left to discover. All parallel streams in a single JVM share one and only one common ForkJoinPool. The common knowledge is that we should be careful about this, but what are the real numbers? You can find the answers in my new post: Performance of parallel Java Streams.
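
A teaser: the well-known workaround is to run the stream inside a dedicated ForkJoinPool, which the parallel operations will then use instead of the common pool. A sketch, assuming a List<Integer> items; note this relies on an implementation detail of the Fork/Join framework, not a documented guarantee:

ForkJoinPool pool = new ForkJoinPool(4);
try {
    // The parallel stream uses 'pool' because it is invoked from a task
    // already running inside that pool
    long sum = pool.submit(() ->
            items.parallelStream()
                    .mapToLong(Integer::longValue)
                    .sum())
            .join();
} finally {
    pool.shutdown();
}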

Update

I was asked in the comments about the JIT compiler’s impact on forEach performance. The charts below show the results of all benchmarks with the JIT compiler disabled via the -Xint option.

Sum of integers with JIT disabled benchmark results
Calculations with JIT disabled benchmark results
Grouping with JIT disabled benchmark results
Filtering and sorting with JIT disabled benchmark results
