SAS and Java revisited

SAS has had the ability to utilize and execute Java code from within a DATA step for quite some time, as far back as SAS 9.1. Early use was focused on computations, simulations, data streams and visualizations where it was more efficient to use packages, libraries and utilities other than SAS.

I co-authored a paper in 2008 exploring the Java Object interface and a few common use cases around retrieving data directly from a database or through Web Services. In a few recent projects, the Java Object interface has come to the rescue once again, making a difficult task quite simple and straightforward.

I have used the Java Object interface to solve a few different challenges.

Use cases of the Java Object

The first use of the Java Object interface that I came across was connecting SAS to different systems and environments where a SAS connection product did not exist or was considered overkill for the task. The first data lake project that I was part of was one such case.

The Java Object interface can also be a tool of choice if vendors only provide a Java API for their product but the organization is centered around SAS as a skill set. Creating a custom SAS macro library using the Java Object interface as the connecting arm can provide simple tools that will allow project and/or study teams to interact with and/or bring automation to a vendor solution.

SAS Drug Development (now part of SAS Life Science Analytics Framework) included at that time (this was versions leading up to 4.4) a simple SAS macro library that implemented a limited set of features compared to what was available in the full Java API. Of course, the features we needed for the project were not available as SAS-supplied macros, so we created them. As one example, a macro would query the audit trail for the latest changes, say to the file system. The result would then drive automated processes downstream, replicating data or simply running standard programs and processes, making it seem truly event driven.

In the last few years, the Java Object has been quite useful when interacting with a vendor’s Web Service. Some Web Services have very complex and dynamic data structures that can be quite difficult to interact with via PROC HTTP, PROC JSON and the JSON libname engine. By using the Java Object interface, you can add a layer of business logic that greatly simplifies the data structures and the resulting user macro library.

A practical example

There is a simple, but quite old, example of how the Java Object interface can solve a more frequent problem, namely computing a checksum on a file. Surprisingly, SAS does not have functions to do this prior to SAS 9.4 Maintenance release 6, with the exception of the MD5 and SHA256 functions that are limited to string values only.

There are also some great examples of SAS native implementations (it is a math problem, after all), or you can rely on operating system utilities that may be available. In some cases, neither is a good fit, so we can employ a simple Java class and the Java Object interface in a DATA step to generate one of several checksum variants.

Basics of a checksum

A checksum is a mathematical computation on a sequence of characters or bits (think of a file as a binary sequence of 1’s and 0’s) used to detect errors that may occur during transfer or storage. The principle is that if you have two different sequences of bits, i.e. the files are different, the math should result in different checksum values. A common method today is to use a hashing function to generate a checksum value, also called a message digest, hash value or sometimes just hash (we will use hash to interchangeably mean the hash value or message digest going forward).

A checksum can be useful when you transfer a file from one system to another. You compute the hash on both ends of the transfer and if the result is the same value, the file transfer was most probably successful. If you get different hashes, maybe there was an error or issue in the transfer that you need to investigate further.
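
The comparison described above can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the class name TransferCheck and its methods are hypothetical, both copies of the file are assumed to be readable from the same JVM, and each file is read fully into memory, which only suits small files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only
final class TransferCheck {

    // Hash the file content with the named algorithm (reads the whole file into memory)
    static byte[] hash(String algorithm, Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        return md.digest(Files.readAllBytes(file));
    }

    // True when both files produce the same hash with the given algorithm
    static boolean sameHash(String algorithm, Path sent, Path received)
            throws IOException, NoSuchAlgorithmException {
        return MessageDigest.isEqual(hash(algorithm, sent), hash(algorithm, received));
    }

    public static void main(String[] args) throws Exception {
        Path sent = Files.createTempFile("sent", ".txt");
        Path intact = Files.createTempFile("intact", ".txt");
        Path corrupted = Files.createTempFile("corrupted", ".txt");

        Files.write(sent, "hello, world\n".getBytes(StandardCharsets.UTF_8));
        Files.write(intact, "hello, world\n".getBytes(StandardCharsets.UTF_8));
        Files.write(corrupted, "hello, w0rld\n".getBytes(StandardCharsets.UTF_8));

        System.out.println("intact copy matches:    " + sameHash("SHA-256", sent, intact));
        System.out.println("corrupted copy matches: " + sameHash("SHA-256", sent, corrupted));
    }
}
```

In a real transfer the two hashes are computed on different systems, so you would typically exchange and compare the hex strings rather than call a single method with both paths.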

There are different hash functions, i.e. mathematical algorithms, available. Message Digest 5 (MD5) is, or was, one of the most common. It turns out that there are cases where two different files (as an example) can result in the same MD5 hash, i.e. a collision. SHA (Secure Hash Algorithm) is another family of hash algorithms that comes in different variants. SHA-1 faces the same fate as MD5, in that you can construct two different data structures with the same hash. Both MD5 and SHA-1 are still in use, but are considered broken for sensitive cryptography and information security, such as digital signatures.

Even with the issue of collisions, both MD5 and SHA-1 are still widely used and perfectly valid for ensuring that information or file transfers are not corrupted in transit, with the argument that it is highly unlikely that two different hash algorithms on the same file will both result in a collision.

Although a hash value is a number, it is usually presented in a hex string format, most often delimited with spaces for readability. As an example, my name will result in the following two hashes. I have also added two variations of the SHA-2 family, i.e. SHA-256 and SHA-512, for comparison.

Algorithm   Hash
MD5         12 9f ab 93 de e5 7a ac 61 c0 b9 fd d1 4c ab f7
SHA-1       8d 1e e7 ea f6 28 ff 1a f6 ec ba de 9b 28 3e 4e 60 4b c0 7e
SHA-256     85 e3 d9 b7 56 0e 06 6b 71 f1 9a 6c 94 6e df f3 af d8 93 e5 2b 24 20 90 85 13 33 12 09 9b 74 5a
SHA-512     ff f4 a3 e8 28 c5 75 ed df 9d c7 fd 56 32 c6 ef 05 05 5f 8e fb ca 0f 6d 16 28 2c cf b8 ad 61 98 44 06 f2 3f f2 9c f0 39 d8 69 de c5 75 bd 73 83 01 3e 7e cf 2b bb b5 e4 ba d8 7b c3 7c 00 49 a3

One obvious question is why the lengths differ. Each algorithm produces a value whose size depends on the math involved. MD5 generates a 128-bit hash, while SHA-1, SHA-256 and SHA-512 generate hashes of 160, 256 and 512 bits, respectively. Eight bits is a byte and each byte can be represented by a two-character hex value, so MD5 is represented by 16 hex values or a string of 32 characters, if you do not count the spaces.
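
The arithmetic above can be checked with a few lines of Java. This is a small sketch for illustration only: the class DigestLengths and its helper methods are hypothetical names, and the input text "hello" is an arbitrary example rather than the value used for the table.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only
final class DigestLengths {

    // Format bytes as two-character lowercase hex values separated by spaces
    static String toSpacedHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hash the UTF-8 bytes of a text with the named algorithm
    static byte[] digest(String algorithm, String text) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance(algorithm).digest(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        for (String algorithm : new String[] {"MD5", "SHA-1", "SHA-256", "SHA-512"}) {
            byte[] hash = digest(algorithm, "hello");
            System.out.println(algorithm + ": " + (hash.length * 8) + " bits, "
                    + hash.length + " bytes, " + (hash.length * 2) + " hex characters");
            System.out.println("  " + toSpacedHex(hash));
        }
    }
}
```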

The question is then which algorithm to use. Over the last 3-4 years, I have commonly used both MD5 and SHA-256 as checksum references when transferring files. My argument is that the shorter MD5 is most often the standard message digest within organizations, and it is easier to read and to compare visually. I add SHA-256, or sometimes SHA-512, as a second hash used in programmatic verification, following the argument above that two different hash algorithms are rather unlikely to both produce a collision for the same files. A second algorithm also comes in handy if someone reads up on MD5 and the alarm bells go off.
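
If you want both hashes, the file only needs to be read once. The following is a minimal sketch of that idea, assuming a hypothetical class DualChecksum; the names, the 8 KB buffer size and the temporary test file are illustrative choices, not part of the example developed later in this post.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only
final class DualChecksum {

    // Read the file once and return { MD5 hex, SHA-256 hex }
    static String[] md5AndSha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];

        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                // feed the same chunk to both algorithms
                md5.update(buffer, 0, n);
                sha256.update(buffer, 0, n);
            }
        }
        return new String[] { toHex(md5.digest()), toHex(sha256.digest()) };
    }

    // Two lowercase hex characters per byte, no separators
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) sb.append(String.format("%02x", b & 0xff));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("hello", ".txt");
        Files.write(file, "hello".getBytes(StandardCharsets.UTF_8));

        String[] hashes = md5AndSha256(file);
        System.out.println("MD5:     " + hashes[0]);
        System.out.println("SHA-256: " + hashes[1]);
    }
}
```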

First step: Getting a hash

SAS, starting with SAS 9.4 M6, contains several new hashing functions, of which HASHING_FILE and HASHING_HMAC_FILE are exactly what we need to generate a hash on an entire file. However, prior to SAS 9.4 M6, we need to find an alternative method to generate a file hash.

Linux and, in certain cases, Microsoft Windows provide command line tools to generate a file hash. As an example, Linux has the commands md5sum and sha256sum, to use our previous examples. We can also in some cases use PROC GROOVY.

We can also use a simple Java class together with the Java Object interface in a SAS DATA step to generate the hash.

The Java class

We will create a Java class, say FileChecksumUtility, to contain the code generating the hash.

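package org.limelogic.blog;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileChecksumUtility {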

	private boolean has_error = false ;
	private String has_error_message = null;
	
	/** 
	 * Empty constructor
	 */
	public FileChecksumUtility() {
	}

	/**
	 * Get the file hash as a sequence of bytes
	 * 
	 * @param algorithm Hashing algorithm
	 * @param path File path
	 * @return A byte array with the message digest 
	 * @throws NoSuchAlgorithmException if the specified algorithm is not supported
	 * @throws FileNotFoundException if the file does not exist
	 * @throws IOException if reading the file failed
	 */
	public byte[] createChecksum( String algorithm, String path ) throws NoSuchAlgorithmException, FileNotFoundException, IOException {

		//  get the message digest algorithm
		//  throws NoSuchAlgorithmException if the specified algorithm is not supported
		MessageDigest md = MessageDigest.getInstance( algorithm.toUpperCase().trim() );

		byte[] buffer = new byte[1024];  // read the file in chunks of 1024 bytes, i.e. 1 KB

		//  open the file; throws FileNotFoundException if the specified file does not exist
		//  try-with-resources closes the stream even if reading fails
		try ( InputStream file_input = new FileInputStream( path ) ) {

			int numRead;

			//  read() returns -1 at end of file and throws IOException if reading fails
			while ( ( numRead = file_input.read( buffer ) ) != -1 ) {
				md.update( buffer, 0, numRead );
			}
		}

		//  return the digest as a sequence of bytes
		return md.digest();
	}
	
	
	/**
	 * Get the checksum for a file
	 *  
	 * @param algorithm Algorithm 
	 * @param path File path
	 * @return A String containing the checksum
	 */
	public String getChecksum( String algorithm, String path ) {

		//  reset the error state so that a previous failed call does not affect this one
		this.has_error = false;
		this.has_error_message = null;

		byte[] checksum = null;

		try {
			// generate checksum
			checksum = this.createChecksum( algorithm, path );

		} catch ( NoSuchAlgorithmException ex ) {
			//  algorithm is not correct

			this.has_error = true;
			this.has_error_message = "The algorithm " + algorithm.toUpperCase().trim() + " is not supported";

		} catch ( FileNotFoundException ex ) {
			//  file does not exist

			this.has_error = true;
			this.has_error_message = "The file " + path.trim() + " does not exist";

		} catch ( IOException ex ) {
			// error reading file

			this.has_error = true;
			this.has_error_message = "Error reading the file " + path.trim();
		}

		//  when in error ... return empty string
		if ( this.has_error )
			return "";

		//  convert to a readable hex format, two lowercase hex characters per byte
		StringBuilder result = new StringBuilder();

		for ( int i = 0; i < checksum.length; i++ )
			result.append( Integer.toString( ( checksum[i] & 0xff ) + 0x100, 16 ).substring( 1 ) );

		return result.toString();
	}
	
	
	/**
	 * Return error state
	 * 
	 * @return true if an error has occurred, false otherwise (SAS receives these as 1 and 0)
	 */
	public boolean hasError() {
		return this.has_error ;
	}
	
	
	/**
	 * Get the error message 
	 * 
	 * @return A String containing the error message if recorded. Empty String otherwise
	 */
	public String getErrorMessage() {
		
		// if no error or no error message, return empty String
		if ( ( ! has_error ) ||
				( this.has_error_message == null ) ) return "";

		return this.has_error_message ;
	}
}

There are a few topics and lessons in the above code to make note of. The above class is really two classes in one, a wrapper class and the class that actually does all the work.

The most important lesson is probably that the SAS Java Object interface has traditionally not managed Java exceptions (i.e. when Java has an error) very well, so we need to ensure that the Java class called from the SAS DATA step does not throw an exception. That is the role of the wrapper class, or the wrapper method in this case, that I call from SAS and that manages all the exceptions. There is the EXCEPTIONCHECK method that can be used in the DATA step, but I have found that controlling the failure within the Java code is much easier to test and later validate.

The wrapper part of the class includes a state flag that I can query using a method, say something like public boolean hasError(). Just knowing there is an error may not be enough, so the method public String getErrorMessage() can be used to get a user friendly error message. I can later use the returned message as part of a SAS Error entry in the log.

The DATA step

The Java code is called using the Java Object in a DATA step. Our class FileChecksumUtility is located in a Java package called org.limelogic.blog.

data _null_;
   /* --  define reference to the Java class  -- */
   declare javaobj j("org/limelogic/blog/FileChecksumUtility");

   length str $ 1024 ;

   /* --  get the MD5 checksum of a test file  -- */
   j.callStringMethod( "getChecksum", "MD5", "hello.txt", str );
   put "MD5: " str ;

run;

The DATA step code is quite simple. We define a reference j that represents the Java class. There are several methods for interacting with the Java class methods, depending on whether there is a return value and what its type is. In our case, we return a String, so the Java object method is callStringMethod.

The first argument to j.callStringMethod() is the Java class method that we call, in this case it refers to public String getChecksum( String algorithm, String path ).

The second and third arguments to j.callStringMethod() are our Java method parameters, which correspond to algorithm and path in our method. If you are on Windows, keep in mind that the backslash (‘\’) is an escape character only in string literals within Java source code. A path passed from a SAS character value should reach Java unchanged, so C:\User\me\Documents\hello.txt should not need doubled backslashes. Java on Windows also accepts forward slashes, e.g. C:/User/me/Documents/hello.txt.

The fourth and last argument is the SAS variable that the return value is written to. Remember to declare this variable as a character variable, as otherwise you will get a SAS Error that a Character type is expected. This is simply because SAS defaults to a numeric variable if it has not previously been declared.

Let us now expand the DATA step to include error management as well.

data _null_;

   /* --  define reference to the Java class  -- */
   declare javaobj j("org/limelogic/blog/FileChecksumUtility");

   length str str_error $1024 ;

   /* --  get the MD5 checksum of a test file  -- */
   j.callStringMethod( "getChecksum", "MD5s", "hello.txt", str );

   /* --  error checking  -- */
   has_error = 0 ;  /* <--  initialize as no error */

   j.callBooleanMethod( "hasError", has_error );  /* <--  get error state  */

   if ( has_error = 0 ) then do;
       put "MD5: " str ;
   end; else do;
       /* --  there was an error  -- */
       j.callStringMethod( "getErrorMessage", str_error );
       put "ERROR: " str_error ;
   end;

run;

The Java class method public boolean hasError() is used to check if there is an error. You may notice my Java object method is j.callBooleanMethod(). SAS does not have a boolean type, i.e. True or False, so SAS converts this to a numeric: False becomes 0 and True becomes 1. Be mindful that SAS treats any non-zero, non-missing value as True, so just testing for equality with 1 may be insufficient.

If we identify an error, we use the Java class method public String getErrorMessage() to retrieve the error message. As the above DATA step requests a checksum using an unknown algorithm MD5s, this results in the SAS error below.

86
87   data _null_;
88
89      /* --  define reference to the Java class  -- */
90      declare javaobj j("org/limelogic/blog/FileChecksumUtility");
91
92      length str str_error $1024 ;
93
94      /* --  get the MD5 checkum of a test file  -- */
95      j.callStringMethod( "getChecksum", "MD5s", "hello.txt", str );
96
97
98      /* --  error checking  -- */
99      has_error = 0 ;  /* <--  initialize as no error */
100
101     j.callBooleanMethod( "hasError", has_error );  /* <--  get error state  */
102
103     if ( has_error = 0 ) then do;
104         put "MD5: " str ;
105     end; else do;
106         /* --  there was an error  -- */
107         j.callStringMethod( "getErrorMessage", str_error );
108         put "ERROR: " str_error ;
109     end;
110
111  run;

ERROR: The algorithm MD5S is not supported
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.03 seconds

Providing user-friendly error messages goes a long way toward resolving issues easily. Most of your users are probably not Java programmers, and decoding a Java stack trace is something they would vigorously try to avoid.

The rather simple DATA step above represents most of the scenarios that you may encounter with the Java object interface. The source of any additional complexity is either using a Java class without a wrapper class or interactions with complex data sources and structures. In most cases though, complexity is no more than a long sequence of object.calltypeMethod() statements followed by a single OUTPUT statement to generate each retrieved record.
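
The record-by-record pattern in the last sentence can be sketched from the Java side. The class below is hypothetical and not part of the example above: RecordCursor, its in-memory data and its method names are illustrative assumptions. It exposes one record at a time through simple typed getters, so that a DATA step could loop on callBooleanMethod for next, read each field with callStringMethod or callDoubleMethod, and then issue OUTPUT. For use from SAS it would typically be declared public in its own source file, like FileChecksumUtility.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical cursor-style wrapper, for illustration only
final class RecordCursor {

    // In-memory records standing in for a vendor API or Web Service result
    private static final class Record {
        final String name;
        final double value;

        Record(String name, double value) {
            this.name = name;
            this.value = value;
        }
    }

    private final Iterator<Record> iterator;
    private Record current = null;

    public RecordCursor() {
        List<Record> records = Arrays.asList(new Record("alpha", 1.5), new Record("beta", 2.0));
        this.iterator = records.iterator();
    }

    // Advance to the next record; returns false when there are no more records
    public boolean next() {
        if (!iterator.hasNext()) {
            current = null;
            return false;
        }
        current = iterator.next();
        return true;
    }

    // Getters return neutral values instead of throwing when no record is selected
    public String getName() {
        return current == null ? "" : current.name;
    }

    public double getValue() {
        return current == null ? Double.NaN : current.value;
    }

    public static void main(String[] args) {
        RecordCursor cursor = new RecordCursor();
        while (cursor.next()) {
            System.out.println(cursor.getName() + " = " + cursor.getValue());
        }
    }
}
```

As with the checksum wrapper, none of the methods throw an exception, which keeps the failure handling on the Java side.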

Magnus Mengelbier
