Looking for an efficient way to load multi-gigabyte files in Java?
I recently played around with Java’s NIO API, which was introduced in JDK 1.4. As you may know, the traditional IO classes deal with streams, whereas the NIO API takes a block-oriented approach, which can make it considerably faster because the filling and draining of buffers is handled by the OS rather than the JVM.
Anyway, what really caught my attention in the API was the java.nio.MappedByteBuffer class. Memory-mapped IO means that the contents of a file can be mapped directly into the process's virtual memory. Nothing new in the OS world, but this was previously not possible in Java. Just be aware that when writing, you are modifying the file's contents directly: there is no separate step between changing the data and saving it to disk (as there is in traditional IO), although the OS decides when the changed pages actually reach the disk unless you call force() on the buffer.
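To make the point about writing concrete, here is a minimal sketch of modifying a file through a mapping. It assumes Java 7 or later (try-with-resources and java.nio.file), and the names MappedWriteDemo and overwriteStart are made up for illustration; they are not part of the sample class discussed in this post.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of the original sample.
class MappedWriteDemo {

    // Overwrites the first replacement.length bytes of the file through a READ_WRITE mapping.
    static void overwriteStart(Path file, byte[] replacement) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, replacement.length);
            buffer.put(replacement); // changes the mapped region; there is no separate write() call
            buffer.force();          // asks the OS to flush the changed pages to the storage device
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped-write", ".txt");
        file.toFile().deleteOnExit();
        Files.write(file, "hello world".getBytes(StandardCharsets.ISO_8859_1));

        overwriteStart(file, "HELLO".getBytes(StandardCharsets.ISO_8859_1));

        // Prints "HELLO world"
        System.out.println(new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1));
    }
}
```

Note that nothing here "saves" the file: the put call is the write. The force call only controls when the change is pushed out to the disk.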
I’ve written a sample code snippet that uses MappedByteBuffer. It maps the file in increments of 1 KB (the block size), purely for demonstration purposes. If you are dealing with files smaller than ~1.6 GB on Windows, or up to 2 GB on Unix-based systems, you can comfortably make the block size equal to the file size and the entire file will be mapped at once. (Read more about the limit-related issues which will be addressed in JDK 7). I was trying to process a 10 GB file, so I had to map the file in “sliding” blocks.
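The choice between mapping everything at once and sliding a window over the file comes down to the block size. A single call to FileChannel.map accepts at most Integer.MAX_VALUE bytes, which is where the 2 GB ceiling comes from. The sketch below is only an illustration under some assumptions: it needs Java 7 or later, and SlidingMapDemo and countNewlines are hypothetical names. It counts newline bytes using whatever block size it is given.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of the original sample.
class SlidingMapDemo {

    // Counts '\n' bytes by mapping the file one block at a time.
    static long countNewlines(Path file, long blockSize) throws IOException {
        // FileChannel.map rejects sizes above Integer.MAX_VALUE, so check up front.
        if (blockSize <= 0 || blockSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("blockSize must be between 1 and Integer.MAX_VALUE");
        }
        long count = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += blockSize) {
                // The last block is shortened so it ends exactly at the end of the file.
                long length = Math.min(blockSize, size - pos);
                MappedByteBuffer block = channel.map(FileChannel.MapMode.READ_ONLY, pos, length);
                while (block.hasRemaining()) {
                    if (block.get() == '\n') {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sliding-map", ".txt");
        file.toFile().deleteOnExit();
        Files.write(file, "a\nb\nc\n".getBytes(StandardCharsets.ISO_8859_1));

        // A tiny block size forces several mappings; a huge one maps the whole file in one go.
        System.out.println(countNewlines(file, 2));
        System.out.println(countNewlines(file, Integer.MAX_VALUE));
    }
}
```

Both calls should report the same count. Only the number of map calls differs.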
Note, however, that mapping file channels directly into memory makes sense only when dealing with very large files. There is no significant performance improvement when dealing with small files. Since the release of JDK 1.4, the java.io classes are actually implemented by the java.nio classes.
Below is a sample method that maps 1 KB at a time (again, this does not make sense in practice; you can comfortably map 1 GB at once on any platform). The while loop runs as long as there are still areas of the file that need to be mapped. The key is the map method, which returns the MappedByteBuffer. To turn the contents into human-readable form, the buffer is decoded into a CharBuffer. Each line of the file gets stored in a class variable lines, which is of type List (see full source for details).
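// Imports required at the top of the class file:
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public void load() throws IOException {
    FileInputStream fis = new FileInputStream(fileName);
    try {
        FileChannel fc = fis.getChannel();
        // ISO-8859-1 is a single-byte charset, so a block boundary can never split a character
        Charset cs = Charset.forName("ISO-8859-1");
        CharsetDecoder decoder = cs.newDecoder();
        StringBuilder sb = new StringBuilder();
        long bs = 1024L;     // block size
        long fs = fc.size(); // file size
        long t = 0L;         // number of bytes mapped so far
        while (t < fs) {
            // Shrink the final block so that it does not run past the end of the file
            if (t + bs > fs) {
                bs = fs - t;
            }
            MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, t, bs);
            CharBuffer cb = decoder.decode(mbb);
            while (cb.hasRemaining()) {
                char c = cb.get();
                if (c == '\n') {
                    lines.add(sb.toString());
                    sb = new StringBuilder();
                } else if (c == '\r') { // Windows
                    continue;
                } else {
                    sb.append(c);
                }
            }
            t += bs;
        }
        // The last line may not have ended with \n
        if (sb.length() > 0) {
            lines.add(sb.toString());
        }
    } finally {
        fis.close(); // closing the stream also closes its channel
    }
}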
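One caveat about the method above: decoding each block on its own only works because ISO-8859-1 uses exactly one byte per character. With a multi-byte charset such as UTF-8, a character can straddle two blocks. The following is a rough sketch of one way to handle that, not the approach used in the sample class. It assumes Java 7 or later, and Utf8MappedLines and readLines are hypothetical names. It advances the window only by the number of bytes the decoder actually consumed, so a partial character is mapped again at the start of the next block.

```java
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original sample.
class Utf8MappedLines {

    static List<String> readLines(Path file, int blockSize) throws IOException {
        // A UTF-8 character takes up to 4 bytes; smaller blocks could never make progress.
        if (blockSize < 4) {
            throw new IllegalArgumentException("blockSize must be at least 4 bytes");
        }
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // UTF-8 never yields more chars than input bytes, so blockSize chars is enough.
        CharBuffer chars = CharBuffer.allocate(blockSize);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            long pos = 0;
            while (pos < size) {
                long length = Math.min(blockSize, size - pos);
                boolean last = pos + length == size;
                MappedByteBuffer block = channel.map(FileChannel.MapMode.READ_ONLY, pos, length);
                chars.clear();
                // Stops before an incomplete trailing character unless this is the last block.
                CoderResult result = decoder.decode(block, chars, last);
                if (result.isError()) {
                    result.throwException(); // malformed or truncated input
                }
                chars.flip();
                while (chars.hasRemaining()) {
                    char c = chars.get();
                    if (c == '\n') {
                        lines.add(current.toString());
                        current.setLength(0);
                    } else if (c != '\r') {
                        current.append(c);
                    }
                }
                // Advance by the bytes actually decoded, not by the block size.
                pos += block.position();
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("utf8-lines", ".txt");
        file.toFile().deleteOnExit();
        // With 4-byte blocks, the 3-byte euro sign straddles the first block boundary.
        Files.write(file, "a\u00e9\u20acb\nc".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLines(file, 4).size());
    }
}
```

The trade-off is that blocks may overlap by a few bytes, which is negligible at realistic block sizes.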
The complete source code of the sample class: MemoryMappingReader.java.