In this tutorial, you will learn how to run Go code from the JVM for fun and profit.
Error Correction for the Masses
In 2015, Backblaze released JavaReedSolomon, their Reed-Solomon erasure coding library, as open source; it represented the first production-quality Java implementation of this error-correcting code.
Without repeating too much of Backblaze’s explanation of how it works: in a nutshell, Reed-Solomon can be used to detect and recover from multiple errors in a byte sequence:
The byte sequence is first split into equal parts (data shards) and then extended by a fixed number of additional “checksum” parts (parity shards). While this makes the byte sequence slightly longer, thanks to some clever math the Reed-Solomon code needs far fewer bytes than simply duplicating the data multiple times. For example, with 17 data shards and 3 parity shards, any 3 of the 20 shards can be lost and reconstructed, at a storage overhead of just 3/17 ≈ 18%, whereas keeping two extra full copies would cost 200% overhead and only tolerate the loss of two copies.
No wonder Backblaze uses this approach for their backup solution; it certainly saves them a lot of money on hard-disk space.
JavaReedSolomon is great tech, but it feels like it could be faster — I’m barely getting 1000 MB/s throughput on my trusty MacBook Pro M1 Max.
These numbers may be good enough compared to the speed of a single hard disk (around 200 MB/s), but they can certainly become a bottleneck when working with multiple devices accessed concurrently (a typical use case), even more so with SSDs, which churn out over 15,000 MB/s these days.
Also consult smhasher to get a feeling for how fast hash functions (a related operation) can be in comparison — many modern implementations achieve 40,000 MB/s and more.
Give performance a Go
Is Reed-Solomon inherently slow? No. Have a look at Klaus Post’s Go reimplementation!
For the same configuration as in Backblaze’s demo code (17 data shards + 3 parity shards, payload size 200k bytes), I observed 12222 MB/s, a 12x speedup. When using Klaus’ default payload size of 10 MB, I observed 47178 MB/s, a 47x speedup over JavaReedSolomon.
$ git clone https://github.com/klauspost/reedsolomon.git
$ cd reedsolomon
# optionally pin the version used here: git checkout v1.12.5
$ cd ./benchmark
$ go build main.go
$ ./main -k 17 -m 3 -size $(( 10 * 1024 * 1024 ))
Benchmarking 1 block(s) of 17 data (K) and 3 parity shards (M), each 616810 bytes using 16 threads. Total 12336200 bytes.
* Encoded 462000 MiB in 10s. Speed: 46199.18 MiB/s (17+3:616810)
* Repaired 551165 MiB in 10s. Speed: 55116.19 MiB/s (17+3:616810)
CPU Features
How can that be? Is Go simply that much faster than Java?
The answer is no. It’s mostly because Klaus’ code is superior for modern CPUs (but there’s also some discrepancy in what we measure; we’ll get to that later on).
While a lot of his implementation is in vanilla Go, Klaus heavily uses optimized assembly code with operations for modern vector extensions such as NEON (as on my M1), SVE, and AVX (including AVX-512). And the benchmarks run multithreaded by default.
But even with only 1 CPU core it’s still an 11x speedup over the Java implementation. One can’t easily beat hand-optimized SIMD assembly.
In an apples-to-apples comparison — no assembly optimizations and just 1 core — Go gives us speeds comparable to the Java implementation: 1061 MB/s (4% faster) for a payload of 200k and 680 MB/s for a payload of 10MB (33% slower than JavaReedSolomon).
$ cd ./benchmark
$ go build -o main-noasm -tags noasm main.go
$ ./main-noasm -k 17 -m 3 -cpu 1
Benchmarking 1 block(s) of 17 data (K) and 3 parity shards (M), each 616810 bytes using 1 threads. Total 12336200 bytes.
* Encoded 6800 MiB in 10.006s. Speed: 679.59 MiB/s (17+3:616810)
* Repaired 12400 MiB in 10s. Speed: 1239.96 MiB/s (17+3:616810)
Surprisingly, when running JavaReedSolomon multithreaded on all of my Mac’s cylinders, it beat the no-assembly Go version slightly for 10 MB payloads (7556 MB/s vs. 7154 MB/s, 5% faster), and significantly for 200k payloads (7864 MB/s vs. 1033 MB/s = 7.6x).
With the findings above, it seems that the Go version – at least its benchmark app – performs worse with larger payloads, and generally no better than Backblaze’s Java version when no special assembly optimization is involved.
This means Go is no magic bullet. However, it’s also a bit troubling that there is so much untapped room for improvement in Backblaze’s Java implementation.
I’m no expert at all when it comes to modern CPU assembly. For that, Daniel Lemire’s blog is always a great read on SIMD, AVX-512 and NEON tricks.
I also admit that I’m not well-versed in optimizing matrix computations, maybe apart from that fun excursion into the Efficient Parallel Computation of PageRank 20 years ago. Alas, today I couldn’t even explain to you off the top of my head how to multiply two matrices, let alone how to work with Galois fields, which underlie the Reed-Solomon code.
Use it from Java
In other words, I really just want to get the assembly fast-path working from Java, without building it from scratch. So what can we do?
To some this may be obvious, to others surprising: in the end, our Reed-Solomon functions will likely run as yet another part of some POSIX process on Linux or macOS. At that point the language choice is irrelevant — as long as we can call into the Reed-Solomon implementation from our application (which in this article we assume is written in Java, running in some Java VM that is itself just part of some POSIX process).
It comes in handy that Java has had, for a long time already, a good way to interact with “native” code (with native referring to code in a shared library): JNI. More recently, the Foreign Function and Memory API (FFM) graduated from preview to a production-quality feature (finalized in Java 22), adding long-awaited memory-safety measures among other convenient improvements.
So, as long as we can craft a shared library from the Go code (a “libsomething.so” on Linux or “libsomething.dylib” on macOS, perhaps a “smthng64.dll” on Windows), we will be able to call it from Java. Thankfully, Go has a specific “c-shared” build mode just for that purpose, and we’re going to leverage it.
Shim gålore
In practice, dynamic libraries expose a binary interface that is compatible with the C calling conventions, which are also what Java expects when using JNI or FFM.
Since regular Go code usually isn’t designed with that in mind, we have to add helper functions acting as shims, converting between simple types like integers, char pointers, etc. and the more involved Go types.
While the CGo package is often used to call C code from Go, we can also have it the other way around. For that, we create a new Go file, which we place in a new sub-directory of our “reedsolomon” Go project that we checked out above. Let’s use the acronym “jagors” (Java-Go-Reed-Solomon) for brevity (./jagors/jagors.go):
package main

import (
	"C"                                // CGo
	"unsafe"                           // for pointers
	"github.com/klauspost/reedsolomon" // the Go code (in the parent directory)
)

func main() {
	// no-op, required by the go compiler
}
Let’s try and build our shared library for the first time:
$ go build -o libjagors.dylib -buildmode=c-shared ./jagors.go
./jagors.go:5:5: "unsafe" imported and not used
./jagors.go:6:5: "github.com/klauspost/reedsolomon" imported and not used
Sadly, Go is very pedantic about unused imports and variables, and apparently cannot be convinced to be more lenient. Let’s comment these imports out for now (leaving only “C”) and try again:
$ go build -o libjagors.dylib -buildmode=c-shared ./jagors.go
$ file libjagors.dylib
libjagors.dylib: Mach-O 64-bit dynamically linked shared library arm64
Now we have to add the shim functions that are to be exposed in the dynamic library. Note that there is no special namespace, so let’s make sure each function name is globally unique. We also have to repeat the exact function name in a special comment before the declaration, which Go interprets as an annotation (do not add whitespace between // and export; Go is not forgiving here either). Let’s start with a simple example:
//export jagors_hello_world
func jagors_hello_world(input C.int) C.int {
	return input + 333
}
Compiling the library again, we now see that go build created a C header file for us in the same directory, libjagors.h, containing the following C declaration (among some boilerplate):
extern int jagors_hello_world(int input);
Let’s quickly test if we can call our code from C. Create a file test.c with the following contents:
#include "libjagors.h"
#include <stdio.h>

int main() {
    printf("%i\n", jagors_hello_world(123));
    return 0;
}
$ clang -I. -L. -ljagors -o test test.c
$ ./test
456
Great, that works. Now let’s see what we need to do on the Java side.
Traditionally, with JNI you would have to build another, JNI-specific shared library that would in turn link our Go-built “libjagors.dylib”. In 2025, and with Java 24, we can surely use FFM instead.
While (for some inexplicable reason) not yet part of the JDK, jextract makes it easy to create shim code for the Java side:
$ jextract -t org.example.ffm libjagors.h
$ find org/ -type f
org/example/ffm/GoSlice.java
org/example/ffm/GoInterface.java
org/example/ffm/libjagors_h.java
org/example/ffm/_GoString_.java
org/example/ffm/GoString.java
Just like go build -buildmode=c-shared created a C header libjagors.h, jextract created Java code, containing (again among some boilerplate) the following method:
public static int jagors_hello_world(int input)
Let’s see if we can call our code from Java. Create a file Test.java with the following contents:
import java.io.File;

import org.example.ffm.libjagors_h;

public class Test {
    public static void main(String[] args) throws Exception {
        System.load(new File("./libjagors.dylib").getAbsolutePath());
        System.out.println(libjagors_h.jagors_hello_world(123));
    }
}
Then compile and run it (the --enable-native-access option suppresses a warning):
$ javac Test.java
$ java --enable-native-access=ALL-UNNAMED Test
456
Success!
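For comparison — and as a sketch of what jextract saves us from — the same call can also be made by hand using only the FFM API. The hypothetical ManualFfmTest class below is my own illustration, not part of the article’s setup; it assumes the same libjagors.dylib in the working directory:

import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.file.Path;

public class ManualFfmTest {
    public static void main(String[] args) throws Throwable {
        // Load the library for the lifetime of the global arena and look up the symbol
        Linker linker = Linker.nativeLinker();
        SymbolLookup lib = SymbolLookup.libraryLookup(Path.of("./libjagors.dylib"), Arena.global());

        // Describe the C signature: int jagors_hello_world(int)
        MethodHandle helloWorld = linker.downcallHandle(
            lib.find("jagors_hello_world").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));

        System.out.println((int) helloWorld.invokeExact(123)); // prints 456
    }
}

With more functions and more involved signatures this quickly becomes tedious, which is exactly the boilerplate jextract automates for us.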
Java calling
Now that we’re able to run trivial Go code from Java, we can look into the Go code specific to reedsolomon.
Thankfully, Klaus has thoroughly documented reedsolomon.go and also provided example code (simple-encoder.go and simple-decoder.go).
The API expects us to create a new reedsolomon encoder object, enc, with the required number of data and parity shards, and then call enc.Encode (for encoding) and enc.Verify/enc.Reconstruct (for decoding) with an array of byte arrays, one byte array per shard. Except for Reconstruct, the byte arrays must all be of the same length.
For Reconstruct, it is acceptable that some shards are missing — that’s the whole point of error correction. The documentation states that these missing shards should either be marked as nil byte arrays or be Go “slices” of length 0 with a capacity equal to the length of the other shards (in Java terminology, these would be ByteBuffers with a limit of 0 and sufficient capacity).
From a C perspective, we will probably want to pass a char * (or void *) pointing to a contiguous buffer containing all shards (data and parity), which will be updated as required (for Encode the parity shards will be written, for Reconstruct the missing shards will be filled in).
While we can directly map a char * to a Go []byte that is directly backed by the underlying storage, arrays of byte arrays cannot simply be mapped to char **. This means we need to allocate the array containing the shards’ byte arrays with the Go allocator, and then get a bit creative with unsafe pointer arithmetic:
// bufToShards maps a contiguous C buffer of numShards shards, each payloadWidth
// bytes long, onto a Go slice of byte slices without copying any data.
func bufToShards(buf unsafe.Pointer, payloadWidth C.int, numShards C.int) [][]byte {
	var shards [][]byte = make([][]byte, numShards)
	for i := range shards {
		var ptr = unsafe.Add(buf, i*int(payloadWidth))
		shards[i] = (*[1 << 28]byte)(ptr)[:payloadWidth:payloadWidth]
	}
	return shards
}
The weird statement (*[1 << 28]byte)(ptr)[:payloadWidth:payloadWidth] casts the sequence of bytes starting at ptr (which we calculated in the line above) to a byte slice with length and capacity equal to the payload size (hat-tip to Sanchke Dellowar for providing this solution on Stack Overflow). 1 << 28 is just a large constant, guaranteed to be larger than our payload.
We can now create our encode shim. We’ll skip precise error handling for now, just returning 1 for success and 0 for failure:
//export jagors_encode
func jagors_encode(buf unsafe.Pointer, payloadWidth C.int, numDataShards C.int, numParityShards C.int) C.int {
	var shards = bufToShards(buf, payloadWidth, numDataShards+numParityShards)
	var enc, err = reedsolomon.New(int(numDataShards), int(numParityShards))
	if err != nil {
		return 0
	}
	err = enc.Encode(shards)
	if err != nil {
		return 0
	}
	return 1
}
For the decode shim, we need a way to specify which shards are missing — we cannot just have them be zeroes. Let’s pass another char * with marker bytes (one per shard):
//export jagors_decode
func jagors_decode(buf unsafe.Pointer, payloadWidth C.int, numDataShards C.int, numParityShards C.int, shardsMissing *C.char) C.int {
	var numShards = numDataShards + numParityShards
	var shards = bufToShards(buf, payloadWidth, numShards)
	var enc, err = reedsolomon.New(int(numDataShards), int(numParityShards))
	if err != nil {
		return 0
	}
	var ok, _ = enc.Verify(shards)
	if ok {
		// no reconstruction required
		return 1
	}
	markMissingShards(shards, shardsMissing, numShards)
	err = enc.Reconstruct(shards)
	if err != nil {
		// reconstruction failed
		return 0
	} else {
		return 1
	}
}

func markMissingShards(shards [][]byte, shardsMissing *C.char, numShards C.int) {
	var missing = unsafe.Slice(shardsMissing, numShards)
	for i, v := range missing {
		if v != 0 {
			// trim size but keep capacity
			shards[i] = shards[i][:0]
		}
	}
}
Compile your code with go build and run jextract as above, so we can complete the Java side of things (you must also add back the imports we previously removed). jextract now provides the following additional methods in libjagors_h.java:
public static int jagors_encode(MemorySegment buf, int payloadWidth, int numDataShards, int numParityShards);
public static int jagors_decode(MemorySegment buf, int payloadWidth, int numDataShards, int numParityShards, MemorySegment shardsMissing);
As you can see, both unsafe.Pointer (void * in C) and *C.char (char * in C) become MemorySegments in Java.
This leaves us with the actual code exercising the Reed-Solomon code. For this tutorial, I’ll keep it simple and work with a payload of 4 bytes using 6 data shards and 3 parity shards.
import java.io.File;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Arrays;
import java.util.Random;

import org.example.ffm.libjagors_h;

public class ReedSolomonTest {
    public static void main(String[] args) throws Exception {
        System.load(new File("./libjagors.dylib").getAbsolutePath());

        // use Arena.ofAuto() without a try-clause for GC-controlled memory management
        try (Arena arena = Arena.ofConfined()) {
            int numDataShards = 6;
            int numParityShards = 3;
            int payloadSize = 4;
            int numTotalShards = numDataShards + numParityShards;

            // Allocate working memory
            MemorySegment buf = arena.allocate(payloadSize * numTotalShards);

            // Populate data shards with example data and print all shards for demonstration purposes
            System.out.println("Filling data shards with random data...");
            Random random = new Random();
            for (int i = 0, n = payloadSize * numDataShards; i < n; i++) {
                buf.setAtIndex(ValueLayout.JAVA_BYTE, i, (byte) random.nextInt());
            }
            dumpShards(buf, payloadSize, numDataShards, numParityShards);

            // Encode the parity shards
            System.out.println("Encoding...");
            int rc = libjagors_h.jagors_encode(buf, payloadSize, numDataShards, numParityShards);
            if (rc == 0) {
                System.err.println("Encoding failed!");
            }
            dumpShards(buf, payloadSize, numDataShards, numParityShards);

            // Corrupt some data
            System.out.println("Marking some shards damaged/missing...");
            MemorySegment damagedShard3 = buf.asSlice(payloadSize * 3, payloadSize);
            damagedShard3.fill((byte) 0);
            MemorySegment damagedShard8 = buf.asSlice(payloadSize * 8, payloadSize);
            damagedShard8.fill((byte) 0);

            // Mark damaged/missing shards
            MemorySegment missingShards = arena.allocate(numTotalShards);
            missingShards.setAtIndex(ValueLayout.JAVA_BYTE, 3, (byte) 1);
            missingShards.setAtIndex(ValueLayout.JAVA_BYTE, 8, (byte) 1);
            dumpShards(buf, payloadSize, numDataShards, numParityShards);

            // Recover
            System.out.println("Decoding...");
            rc = libjagors_h.jagors_decode(buf, payloadSize, numDataShards, numParityShards,
                missingShards);
            if (rc == 0) {
                System.err.println("Decoding failed!");
            }
            dumpShards(buf, payloadSize, numDataShards, numParityShards);
        }
    }

    private static void dumpShards(MemorySegment buf, int payloadSize, int numDataShards,
        int numParityShards) {
        for (int shard = 0; shard < (numDataShards + numParityShards); shard++) {
            MemorySegment shardBuf = buf.asSlice(shard * payloadSize, payloadSize);
            System.out.println("Shard " + shard + ": " + Arrays.toString(shardBuf.toArray(
                ValueLayout.JAVA_BYTE)));
        }
        System.out.println();
    }
}
Compile and run ReedSolomonTest.java, and you will see that it works:
Filling data shards with random data...
Shard 0: [72, -90, -77, 78]
Shard 1: [-58, 74, -52, 47]
Shard 2: [48, -10, 23, 64]
Shard 3: [-85, 50, 91, -82]
Shard 4: [-110, -72, -71, 75]
Shard 5: [-122, -18, -36, -74]
Shard 6: [0, 0, 0, 0]
Shard 7: [0, 0, 0, 0]
Shard 8: [0, 0, 0, 0]
Encoding...
Shard 0: [72, -90, -77, 78]
Shard 1: [-58, 74, -52, 47]
Shard 2: [48, -10, 23, 64]
Shard 3: [-85, 50, 91, -82]
Shard 4: [-110, -72, -71, 75]
Shard 5: [-122, -18, -36, -74]
Shard 6: [-105, 33, -27, 102]
Shard 7: [-106, 95, -77, 20]
Shard 8: [-61, -14, -68, 61]
Marking some shards damaged/missing...
Shard 0: [72, -90, -77, 78]
Shard 1: [-58, 74, -52, 47]
Shard 2: [48, -10, 23, 64]
Shard 3: [0, 0, 0, 0]
Shard 4: [-110, -72, -71, 75]
Shard 5: [-122, -18, -36, -74]
Shard 6: [-105, 33, -27, 102]
Shard 7: [-106, 95, -77, 20]
Shard 8: [0, 0, 0, 0]
Decoding...
Shard 0: [72, -90, -77, 78]
Shard 1: [-58, 74, -52, 47]
Shard 2: [48, -10, 23, 64]
Shard 3: [-85, 50, 91, -82]
Shard 4: [-110, -72, -71, 75]
Shard 5: [-122, -18, -36, -74]
Shard 6: [-105, 33, -27, 102]
Shard 7: [-106, 95, -77, 20]
Shard 8: [-61, -14, -68, 61]
Making it stateful
While this does work, we’re still creating a new reedsolomon encoder instance for every call to encode/decode. This overhead is small in this case, but not insignificant in general, and we can certainly make the code feel a bit more like Java/Go than C by exposing the reedsolomon encoder as a type with encode/decode methods instead of individual functions. It’s just a bit tricky because of Go’s garbage collector…
What we need to do is hold on to the Go instances (so they don’t get prematurely garbage-collected by Go) and reference them from C by some ID.
A rather trivial way to do this is to use a sync.Map. We store instances as values and use their pointer addresses as keys. Then we export constructor, destructor and accessor shim functions, and use uintptr as a numeric ID so we can refer to a particular instance from C/Java:
var objectsMap sync.Map // needs "sync" (and, for errors.New below, "errors") in the imports section

//export jagors_new_reedsolomon
func jagors_new_reedsolomon(numDataShards C.int, numParityShards C.int) uintptr {
	enc, err := reedsolomon.New(int(numDataShards), int(numParityShards))
	if err != nil {
		return 0 // indicates error here
	}
	var ptr = unsafe.Pointer(&enc)
	objectsMap.Store(ptr, enc)
	return uintptr(ptr)
}

//export jagors_delete
func jagors_delete(objptr uintptr) C.int {
	_, deleted := objectsMap.LoadAndDelete(unsafe.Pointer(objptr))
	if deleted {
		return 1
	} else {
		return 0
	}
}

func GetReedSolomon(objptr uintptr) (reedsolomon.Encoder, error) {
	obj, ok := objectsMap.Load(unsafe.Pointer(objptr))
	if ok {
		return obj.(reedsolomon.Encoder), nil
	} else {
		return nil, errors.New("Unknown objptr")
	}
}

//export jagors_encode_rsid
func jagors_encode_rsid(rsId uintptr, buf unsafe.Pointer, payloadWidth C.int, numTotalShards C.int) C.int {
	var shards = bufToShards(buf, payloadWidth, numTotalShards)
	var enc, err = GetReedSolomon(rsId) // changed
	if err != nil {
		return 0
	}
	err = enc.Encode(shards)
	if err != nil {
		return 0
	}
	return 1
}

// ... (jagors_decode_rsid similarly)
// ... (jagors_decode_rsid similarly)
Now we clean up the Java side. We want to make sure that the Go reedsolomon instance stays valid until our corresponding Java instance is no longer used. For this, we create a new class JagoReedSolomon, which is AutoCloseable and also registers a garbage-collection Cleaner for the case that someone forgets to close() the instance. The encode/decode operations are now methods of that class, and thus take fewer parameters.
import java.io.File;
import java.io.IOException;
import java.lang.foreign.MemorySegment;
import java.lang.ref.Cleaner;

import org.example.ffm.libjagors_h;

public class JagoReedSolomon implements AutoCloseable {
    static {
        System.load(new File("./libjagors.dylib").getAbsolutePath());
    }

    private static final Cleaner cleaner = Cleaner.create();

    private final int numDataShards;
    private final int numParityShards;
    private final int numTotalShards;
    private long id;
    private final CleanerState state;
    private final Cleaner.Cleanable cleanable;

    public JagoReedSolomon(int numDataShards, int numParityShards) {
        this.id = libjagors_h.jagors_new_reedsolomon(numDataShards, numParityShards);
        if (id == 0) {
            throw new IllegalStateException("jagors_new_reedsolomon");
        }
        this.state = new CleanerState(id);
        this.cleanable = cleaner.register(this, state);
        this.numDataShards = numDataShards;
        this.numParityShards = numParityShards;
        this.numTotalShards = numDataShards + numParityShards;
    }

    @Override
    public void close() {
        this.id = 0;
        cleanable.clean();
    }

    private void checkClosed() throws IOException {
        if (id == 0) {
            throw new IOException("closed");
        }
    }

    private static final class CleanerState implements Runnable {
        private final long id;

        CleanerState(long id) {
            this.id = id;
        }

        @Override
        public void run() {
            libjagors_h.jagors_delete(id);
        }
    }

    public void encode(MemorySegment shards, int payloadSize) throws IOException {
        checkClosed();
        ensureShardSize(shards, payloadSize);
        int rc = libjagors_h.jagors_encode_rsid(id, shards, payloadSize, numTotalShards);
        if (rc == 0) {
            throw new IOException("encode");
        }
    }

    public void decode(MemorySegment shards, int payloadSize, MemorySegment shardsMissing) throws IOException {
        checkClosed();
        ensureShardSize(shards, payloadSize);
        if (shardsMissing.byteSize() < numTotalShards) {
            throw new IOException("shardsMissing too small");
        }
        int rc = libjagors_h.jagors_decode_rsid(id, shards, payloadSize, numTotalShards, shardsMissing);
        if (rc == 0) {
            throw new IOException("decode");
        }
    }

    private void ensureShardSize(MemorySegment shards, int payloadSize) throws IOException {
        if (shards.byteSize() < payloadSize * numTotalShards) {
            throw new IOException("shards too small");
        }
    }
}
I’ll leave the changed test class out here for brevity.
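For illustration only, here is a minimal, hypothetical usage sketch (not the author’s actual test class; the class name and shard parameters are made up) showing how JagoReedSolomon can be exercised with try-with-resources:

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class JagoReedSolomonDemo {
    public static void main(String[] args) throws Exception {
        int numDataShards = 6;
        int numParityShards = 3;
        int payloadSize = 4;
        int numTotalShards = numDataShards + numParityShards;

        try (JagoReedSolomon rs = new JagoReedSolomon(numDataShards, numParityShards);
             Arena arena = Arena.ofConfined()) {
            // Contiguous buffer holding all shards, data shards first
            MemorySegment buf = arena.allocate(payloadSize * numTotalShards);
            for (int i = 0, n = payloadSize * numDataShards; i < n; i++) {
                buf.setAtIndex(ValueLayout.JAVA_BYTE, i, (byte) i);
            }

            rs.encode(buf, payloadSize);

            // Simulate a damaged data shard and mark it as missing
            buf.asSlice(payloadSize * 2, payloadSize).fill((byte) 0);
            MemorySegment missingShards = arena.allocate(numTotalShards);
            missingShards.setAtIndex(ValueLayout.JAVA_BYTE, 2, (byte) 1);

            rs.decode(buf, payloadSize, missingShards);
        }
    }
}

Since JagoReedSolomon is AutoCloseable, try-with-resources releases the Go-side encoder deterministically; the Cleaner merely acts as a safety net if close() is forgotten.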
Note that for production code, you will want to move the static parts (System.load and Cleaner.create) to a separate LibraryLoader class that picks up the right shared library for your architecture, probably packed directly into a jar on the classpath. This can get hairy quickly, because you have to build and bundle native libraries for all architectures. As a limitation of dlopen(3), we also can’t load a native library directly from a jar, so logic for temporarily unpacking it has to be added too. See what junixsocket does, for example.
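As a very rough, hypothetical sketch (the resource naming scheme and error handling are made up for illustration, and you would still need to build and bundle one native library per platform), such a loader might unpack the bundled library into a temporary file before calling System.load:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class LibraryLoader {
    private static boolean loaded;

    static synchronized void loadJagors() throws Exception {
        if (loaded) {
            return;
        }
        String os = System.getProperty("os.name").toLowerCase();
        String arch = System.getProperty("os.arch").toLowerCase();
        boolean mac = os.contains("mac");
        // e.g. "/native/libjagors-macos-aarch64.dylib"; one such resource must be bundled per platform
        String resource = "/native/libjagors-" + (mac ? "macos" : "linux") + "-" + arch
            + (mac ? ".dylib" : ".so");

        // dlopen(3) cannot load a library from inside a jar, so unpack it to a temporary file first
        Path tmp = Files.createTempFile("libjagors", mac ? ".dylib" : ".so");
        tmp.toFile().deleteOnExit();
        try (InputStream in = LibraryLoader.class.getResourceAsStream(resource)) {
            if (in == null) {
                throw new UnsatisfiedLinkError("No bundled library for " + os + "/" + arch);
            }
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        System.load(tmp.toAbsolutePath().toString());
        loaded = true;
    }
}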
Speaking of production: you should also review any calls to panic on the Go side — these will most definitely take down your entire JVM unless handled correctly. One way to address this is to call a special Go panic handler at the beginning of every shim method:
func panicProtect() {
	if err := recover(); err != nil {
		log.Println("Panic: ", err) // needs "log" in the imports section
	}
}

//export jagors_something
func jagors_something(...) {
	defer panicProtect()
	// ...
}
This will cause the panic to be logged, and the function will return with a 0 value. This is why, unlike in C, we return 0 for error and non-zero for success.
Results
Not only Klaus Post’s Go code but also Backblaze’s JavaReedSolomon comes with a benchmarking class. It tests a variety of combinations for the coding loop — an internal detail that may distract anybody new to the codebase (the actual code uses the fastest variant, InputOutputByteTableCodingLoop).
For testing our Go/Java combination, we will reuse and adapt Backblaze’s benchmarking class.
Benchmarking caveats
I mentioned there’s some discrepancy in benchmarking between the Java and Go implementations, and this comes as a cautionary tale.
First of all, Backblaze’s code does not take parity shards into account when calculating throughput in MB/s, but Klaus’ Go code does (so it appears ca. 15% faster with 17+3 shards). For fairness, we should count all memory accesses toward throughput, so I amended Backblaze’s code accordingly (also see here).
More importantly, Backblaze measures validating the entire matrix (“isParityCorrect”), whereas the Go benchmark only repairs a random set of corrupted shards, as many as there are parity shards. The latter operation is simply faster because it touches fewer places in RAM (note the higher MB/s for “Repaired” at the beginning of this article). I made a simple fix so we can compare both validation operations.
In principle, validating should be about as fast as creating the parity: one just needs to compute the parity shards and verify that they match the previously stored parity.
Surprisingly, Klaus’ Go implementation yields lower throughput numbers for true verification than Backblaze’s code, most likely due to some extra allocation happening in that code path (which can easily be fixed).
Temporary allocations and copying data back and forth are two major culprits for performance degradation (even more so in garbage-collected systems) — avoid them wherever possible.
In light of these findings, my benchmarks will focus on the numbers for encoding.
Go scheduler woes
The first benchmark results I got were in fact too good.
Even when running the code single-threaded on the Java side, the Go code would run on multiple CPUs. Depending on the payload size, either only 4 or all 10 cores were busy, resulting in 26867 MB/s (200k payload) and 48415 MB/s (10M payload).
The reason for this is the Go scheduler. All calls from Java to Go are subject to Go’s scheduling policies (set via runtime.GOMAXPROCS), and Go uses all cores for goroutines by default. Conversely, if Go’s scheduler is capped to 1 core, then even 100 threads on the Java side would have to queue their calls for that single core on the Go side.
To control the scheduler, we can add the following shim method, and call it on the Java side to suit our needs:
//export jagors_set_maxprocs
func jagors_set_maxprocs(maxProcs int) {
	runtime.GOMAXPROCS(maxProcs) // needs "runtime" in the imports section
}
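On the Java side, configuring the scheduler is then a one-liner. The sketch below is my own illustration and assumes that jextract maps the exported Go int parameter (GoInt) to a Java long:

import org.example.ffm.libjagors_h;

public class SchedulerConfig {
    public static void main(String[] args) {
        System.load(new java.io.File("./libjagors.dylib").getAbsolutePath());
        // Cap Go's scheduler to a single core; pass a larger value to allow more parallelism
        libjagors_h.jagors_set_maxprocs(1);
    }
}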
Fine-tuning this parameter is probably not necessary; being able to rely on the Go scheduler simplifies the Java code. However, there are two exceptions:
First, by limiting the scheduler to 1 CPU core, we can verify that our benchmark results are actually comparable to what Klaus Post’s benchmark code reports, and yes, we’re getting practically the same numbers — without assembly: 200k payloads yield 1105 MB/s (0.89x) and 10M payloads yield 694 MB/s (1.02x); with assembly: 200k payloads yield 11532 MB/s (0.94x) and 10M payloads yield 8959 MB/s (0.8x).
Second, when we want to squeeze out the last bit of performance, we max out the number of threads on both the Java side and the Go side. This yielded 59920 MB/s for 200k payloads (4.9x faster than native Go) and 52028 MB/s for 10 MB payloads (1.1x), all with the 17+3 partitioning.
These results indicate that there is little room for improvement in our Java-Go bridge; it works well. We even see a significant performance gain over the native Go benchmarks for smaller payloads — most likely attributable to the Go scheduler not using all available CPU cores in the native benchmark.
The real numbers
OK, so we were benchmarking quite a few things. What counts most for Java users is the final improvement over Backblaze’s vanilla Java that the assembly optimization brings us (all measured on a MacBook Pro M1 Max):
Single-threaded Java/Go
For payloads of 200k: 11x faster.
For payloads of 10MB: 9x faster.
All CPUs (multi-threaded Java and Go code)
For payloads of 200k: 7.6x faster.
For payloads of 10MB: 6.9x faster.
Single-threaded Java code; full multithreading via Go scheduler
For payloads of 200k: 26x faster.
For payloads of 10MB: 53x faster.
The fastest throughput numbers I got
Around 95,000 MB/s. 10MB payload, 5+2 shards.
Are the returned numbers the same?
Regarding the actual calculations both implementations perform: Yes, they return the same results, which is quite important.
Conclusions and Outlook
You’ve learned a little about Reed-Solomon, and how to run some native Go code from Java, in the same POSIX process as the Java Virtual Machine.
This article shows that the language barrier between Go and Java is surmountable, and specifically without compromising performance.
We can now benefit from Klaus Post’s significant improvements and easily adapt his contributions for the JVM ecosystem, where it all started.
Modern CPUs have amazing vectorization capabilities that are presently underutilized by traditional programming techniques.
Looking ahead, Java’s Vector API will eventually leave the incubation stage, giving users a portable alternative to hand-optimized assembly. I’m curious how well such a modernized Java solution will perform.
What performance gains do you see with different sharding parameters and on platforms with AVX512 (AMD Zen 5 anyone?) or SVE (e.g., Apple M4)? Will better auto-vectorization save the day?
The full code (with separate commits for each step) is available in this repository.
Lastly, what will you make of all of this? Please comment below!