Handling large data files efficiently with Java
Takeaway: Reading and writing data is a common programming task, but the amount of data involved can sometimes create a big performance hit. Luckily, the java.io package provides the tools you need to meet this challenge.
Java provides a simple, standardized API for reading from and writing to external resources such as files, databases, and sockets. Although the Java I/O API covers a wide range of application needs, using it correctly is not as simple as it may seem. Inefficient I/O code can be very CPU- and memory-intensive and can seriously degrade both application and system performance. This article shows an effective approach to reading large amounts of data when both time and memory allocation matter to overall system performance.
Data access must be fast
The best way to get started with this topic is to look at an example. Let's assume that you must read a large amount of data from a binary file and store it in an array for further processing. Java I/O is based on streams, each of which represents a sequence of bytes. First, you must choose a stream type. We are working with binary data, so the FileInputStream class is the correct choice. (Consider the FileReader class instead when working with character data.) We can open a connection to the file like this:
InputStream in = new FileInputStream(fileName);
At this point, it is possible to read data from the file, but let's take a closer look at other classes from the java.io package, keeping performance issues in mind. The BufferedInputStream class is a wrapper for input streams, allowing buffering of its input and improving the reading process. You can connect to a file like this:
InputStream is = new BufferedInputStream(new FileInputStream(fileName));
When you've connected to the file, you can start reading from it. The InputStream class has two main methods for reading data: int read() and int read(byte[] b, int off, int len). The first method reads only one byte of data at a time, whereas the second reads up to len bytes from the stream into an array of bytes and returns the number of bytes actually read. For large inputs the second method needs far fewer calls and is usually much faster, so we'll use it as presented in Listing A.
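To make the difference between the two read methods concrete, here is a minimal sketch that consumes the same data both ways. The class name ReadMethodsSketch, its method names, and the 4096-byte buffer size are illustrative assumptions and do not come from the article's listings; a ByteArrayInputStream stands in for the file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class, used only to illustrate the two read methods.
class ReadMethodsSketch {

    // Reads one byte per call; read() returns -1 at the end of the stream.
    static long countWithSingleByteReads(InputStream in) throws IOException {
        long total = 0;
        while (in.read() != -1) {
            total++;
        }
        return total;
    }

    // Reads up to buf.length bytes per call; the return value is the number
    // of bytes actually read (possibly fewer than requested), or -1 at the
    // end of the stream. bufferSize must be greater than zero.
    static long countWithBulkReads(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = new byte[10_000]; // stand-in for file contents
        System.out.println(countWithSingleByteReads(new ByteArrayInputStream(sample)));
        System.out.println(countWithBulkReads(new ByteArrayInputStream(sample), 4096));
    }
}
```

Both methods see the same bytes; the bulk version simply gets there in far fewer calls, which is where the time is saved on a real file.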
This listing has several interesting aspects. First, because the file is big, we allocate a rather large buffer (20 MB) for the read call. A larger buffer generally means fewer read calls and therefore faster reading, at the cost of more memory. It is sometimes possible to learn in advance how many bytes can be read from an input stream without blocking, and to allocate a buffer of that size. This is accomplished by calling the available method.
Unfortunately, this method does not always return a correct result and can throw an exception. This is the case, for example, when reading database data such as a LONG or BLOB via a stream. Second, all arrays are initialized outside of the while loop, meaning the out, buf, and tmp arrays are reused, so fewer objects have to be garbage-collected. Third, each time the buffer is filled with part of the data, that part is copied into a growing array by calling the System.arraycopy method. Although this algorithm is quite efficient, every pass through the read loop creates a temporary array and performs two array copies.
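The growing-array approach described above might look roughly like the following. This is a reconstruction from the prose, not the article's Listing A: the class name GrowingArrayReader, the readAll method, and the small buffer size in main are assumptions, and only the variable names out, buf, and tmp are taken from the text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical class; a sketch of the "growing array" technique only.
class GrowingArrayReader {

    static byte[] readAll(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize]; // reused for every read call
        byte[] out = new byte[0];          // all data read so far
        byte[] tmp;                        // temporary array for each pass
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            // One new array per pass, sized for the old data plus the new chunk.
            tmp = new byte[out.length + n];
            System.arraycopy(out, 0, tmp, 0, out.length); // copy 1: data read earlier
            System.arraycopy(buf, 0, tmp, out.length, n); // copy 2: the new chunk
            out = tmp;
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = new byte[1_000]; // stand-in for file contents
        byte[] result = readAll(new ByteArrayInputStream(sample), 64);
        System.out.println(result.length);
    }
}
```

Note that the first copy moves everything read so far on every pass, so the cost of each pass grows with the amount of data already accumulated.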
You can reduce data copying and array allocation by modifying the while loop as shown in Listing B.
Here, instead of storing intermediate data in a big array and extending it every time data is retrieved, the data is kept in a list in which each element contains only one piece of it. When the end of the stream is reached, the pieces can be taken from the list and merged into a single array. This saves one array allocation and one copy operation per pass through the loop. If you don't immediately need all the data as a single array, you can return the list itself and save some more time and resources. Reading data with this algorithm can be significantly faster than with the first one (Listing A). The difference in speed depends on the size of the buffer array passed to the read method.
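The list-based approach can be sketched as follows. Again, this is an illustration built from the description rather than the article's Listing B; the class name ChunkListReader and the methods readChunks and merge are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class; a sketch of the "list of chunks" technique only.
class ChunkListReader {

    // Collects the stream as a list of chunks. Each chunk holds exactly the
    // bytes returned by one read call, so earlier data is never copied again.
    static List<byte[]> readChunks(InputStream in, int bufferSize) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        byte[] buf = new byte[bufferSize]; // reused for every read call
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            if (n > 0) {
                chunks.add(Arrays.copyOf(buf, n)); // one allocation, one copy
            }
        }
        return chunks;
    }

    // Merges the chunks into a single array once the total size is known.
    static byte[] merge(List<byte[]> chunks) {
        int total = 0;
        for (byte[] chunk : chunks) {
            total = Math.addExact(total, chunk.length); // fail loudly on overflow
        }
        byte[] out = new byte[total];
        int pos = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, out, pos, chunk.length);
            pos += chunk.length;
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = new byte[1_000]; // stand-in for file contents
        List<byte[]> chunks = readChunks(new ByteArrayInputStream(sample), 64);
        System.out.println(merge(chunks).length);
    }
}
```

A caller that can process the data piece by piece can use the result of readChunks directly and skip merge, which avoids the final allocation and copy altogether.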
Download the code covered in this article
BigFileReader.java
Go forth and program
Now you have a pattern to speed up data reading and boost application performance. Applying this pattern is especially useful for reading large pieces of data from a file, a database, or a socket.

