There are utilities that you seem to come across all the time. SAS has functions to create a directory, to test whether a directory, folder or file exists, and to create all sorts of different files, but I have yet to come across a function or procedure that simply lists the contents of a directory. There are different approaches, but let's look at a simple single DATA step approach that we can extend to include recursive processing of sub-directories and sub-folders.
A quick note on convention. When we use the term directory, we also imply a folder, even though the idea of a folder suggests that some metadata properties may be associated with it. To the file system, it is most often still a directory.
Paths and SAS
The path is probably one of the few atomic elements that SAS cannot function without. Whether it is finding SAS macros or accessing data sets or files, the path is central to how SAS functions.
SAS can use the Windows style backward slash (‘\’) interchangeably with the forward slash (‘/’), even within the same value, for statements, functions and procedures. The path C:/Users/me is just as valid as C:\Users\me or even C:\Users/me.
The case where caution is warranted is when using the CALL SYSTEM or X statements to execute an operating system command. Then the operating system rules apply, but only for the command you are executing. Even in Windows, there are certain commands that accept a forward slash in the path. Others will interpret the forward slash as the beginning of a command option, for example dir c:/Users/me will consider /Users an invalid switch or command option.
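As a small illustration of the point above, here is a minimal sketch of converting a forward slash path to backward slashes before handing it to a Windows command. The variable names path and winpath and the example directory are assumptions made for this sketch only.

```sas
data _null_ ;
  length path winpath $ 512 ;

  * -- hypothetical path written with forward slashes -- ;
  path = "C:/Users/me/Documents" ;

  * -- TRANSLATE replaces every forward slash with a backward slash -- ;
  winpath = translate( path, "\", "/" ) ;

  put winpath= ;
run ;
```

The converted value can then be used to build the command text for CALL SYSTEM or the X statement, while the original value remains usable everywhere else in SAS.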
If we want to be totally correct, we can query the path delimiter for the Operating System with a simple DATA step, the Java Object interface and the standard Java class java.io.File.
options set=CLASSPATH "%sysfunc(pathname(work))";

data _null_;
  declare JavaObj j( 'java.io.File', '' );
  length delim $ 5 ;
  j.getStaticStringField("separator", delim) ;
  call symput( 'path_delim', strip(delim) );
run;
The DATA step will return the path delimiter as the macro variable path_delim.
The OPTIONS statement at the beginning is just a trick to ensure that the CLASSPATH is initialized with a non-empty value to avoid a SAS Warning message. If you already have CLASSPATH initialized, this step is not necessary.
The most common approach is to use a FILENAME statement with a PIPE device type and a DATA step. If we use the Microsoft Windows dir command as an example, we get the following DATA step.
filename dirlist pipe 'dir C:\Users\me\Documents' ;

data work.list;
  length line $ 200 ;
  infile dirlist ;
  input ;
  line = strip(_infile_);
run;

filename dirlist clear ;
The _INFILE_ automatic variable contains the input record buffer, that is, the entire line read by the INPUT statement from the pipe defined by the INFILE and FILENAME statements. Each line shows which attributes are available with this technique. The character string would then be parsed into the different components of the directory inventory.
The same technique can be applied with the Linux/Unix command ls. To get a comparable list, you will need to use the command ls -alF, or a shell alias for it where one is defined. A SAS installation seldom has both the dir and ls commands available, but nonetheless there are great examples of simple macros that will check the underlying Operating System and select the appropriate command to use.
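To give a flavour of that idea, the following is a minimal sketch and not a reproduction of any particular macro. It assumes that the automatic macro variable SYSSCP starts with WIN on Windows hosts, that the SAS session allows operating system commands (the XCMD option), and that the example directory exists. The names path, cmd and dirlist are chosen for this sketch only.

```sas
data _null_ ;
  length path cmd $ 1024 ;

  * -- hypothetical directory, use a path valid for your host -- ;
  path = "C:/Users/me/Documents" ;

  if ( upcase( substr( "&sysscp", 1, 3 ) ) = "WIN" ) then do ;
    * -- dir reads a forward slash as a switch so convert to backward slashes -- ;
    cmd = catx( " ", "dir", quote( translate( strip( path ), "\", "/" ) ) ) ;
  end ;
  else do ;
    cmd = catx( " ", "ls -alF", quote( strip( path ) ) ) ;
  end ;

  * -- assign the file reference DIRLIST with the PIPE device type -- ;
  rc = filename( "dirlist", cmd, "pipe" ) ;
  if ( rc ^= 0 ) then put "Could not assign DIRLIST for " cmd= ;
run ;
```

A subsequent DATA step can then read from the file reference dirlist with an INFILE statement, regardless of which command was selected.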
The other option is to perform the directory inventory entirely in SAS using the functions DOPEN, DNUM and DREAD, with some help from DCLOSE and the FILENAME statement.
filename root 'C:/Users/me/Documents' ;

data work.list ;

  * -- return variables -- ;
  length name $ 512 ;

  * -- directory to inventory -- ;
  did = dopen( 'root' );

  do i = 1 to dnum(did) ;
    * -- directory item name -- ;
    name = dread( did, i );
    output;
  end;

  rc = dclose( did );
  did = 0;
run;

filename root clear ;
The DATA step above is fairly straightforward. We iterate through all items in the directory using DNUM for the number of items and DREAD for the name of each item.
Please make note that when assigning the file reference root, the FILENAME statement uses a forward slash (‘/’) just to make the point from our discussion above.
The DREAD function cannot distinguish between a directory and a file, but we can take advantage of DOPEN, which will return the value 0 if it fails to open the file system item as a directory, such as when it is a file. It is a simple trick to distinguish a directory from a file.
if ( filename( 'item', path ) ^= 0 ) then continue; * <-- could not assign a filename so next item ;

ditem = dopen( 'item' );

if ( ditem > 0 ) then do;
  type = "DIR";
  rc = dclose( ditem );
end;
else do;
  type = "FILE";
end;

* -- clear the file reference -- ;
rc = filename( 'item', '' ) ;
Make note that the DATA step and the above code snippet explicitly clear the file references for both the directory we inventory and each individual item. The directory or file otherwise remains locked to the SAS session, and any subsequent attempt to assign a file reference will fail with a SAS Error. To get out of situations like that, filename _all_ clear is a great escape.
Adding file information
Most directory listing macros will also include some information about files, such as the last modified date that most Operating System directory list commands report. We can obtain additional information about the file using the SAS functions FOPEN, FINFO and FCLOSE.
* -- open file -- ;
fid = fopen( 'item' );

* -- get properties -- ;
created  = finfo( fid, "create time" );
modified = finfo( fid, "last modified" );
size     = finfo( fid, "file size (bytes)" );

* -- close file -- ;
rc = fclose( fid ) ;
fid = 0 ;
The above code assumes that the additional information is available for create and last modified date/time as well as the file size in bytes. To obtain the names of the information properties that are available for a file, use the SAS functions FOPTNUM and FOPTNAME.
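Since the available information properties differ between hosts, it can be useful to list them for a file you know exists. The following is a minimal sketch using FOPTNUM to get the number of properties and FOPTNAME to get each property name. The file example.txt, the file reference item and the data set work.file_info are assumptions for this sketch.

```sas
* -- hypothetical file, point this to any existing file -- ;
filename item 'C:/Users/me/Documents/example.txt' ;

data work.file_info ;
  length name $ 64 value $ 256 ;
  keep name value ;

  fid = fopen( 'item' ) ;

  * -- FOPEN returns 0 when the file cannot be opened -- ;
  if ( fid > 0 ) then do ;

    do i = 1 to foptnum( fid ) ;
      * -- property name and its current value -- ;
      name  = foptname( fid, i ) ;
      value = finfo( fid, name ) ;
      output ;
    end ;

    rc = fclose( fid ) ;
  end ;
run ;

filename item clear ;
```

The resulting data set has one row per information property, which makes it easy to see the exact names to pass to FINFO on a given host.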
The FINFO function returns the value as character, so converting to numeric variables with a date/time format, or expressing the file size in units other than bytes, requires some data type conversions as additional steps.
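A minimal sketch of such conversions could look like the following. The two character values are illustrative only and are not taken from a real file, since the text returned by FINFO varies by host and locale. The sketch assumes the date/time text can be read with the DATETIME informat, and the ?? modifier keeps the log quiet and returns a missing value when it cannot.

```sas
data _null_ ;
  length modified size $ 200 ;
  format modified_dt datetime20. size_kb 12.1 ;

  * -- illustrative values standing in for what FINFO returned -- ;
  modified = "15Nov2019:10:23:45" ;
  size     = "20480" ;

  * -- character to numeric date/time, missing if the text does not fit the informat -- ;
  modified_dt = input( modified, ?? datetime20. ) ;

  * -- character to numeric, then bytes to kilobytes -- ;
  size_kb = input( size, ?? 32. ) / 1024 ;

  put modified_dt= size_kb= ;
run ;
```

If your host returns the date/time in a different layout, the informat is the only piece that needs to change.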
One additional nice feature of the FINFO function is that a missing value is returned if an information property with the specified name does not exist.
There are the corresponding functions DOPTNUM, DOPTNAME and DINFO for directories as well, but I usually elect not to use them simply because the only information item available on Linux/Unix is the directory name.
Recursive directory listings
The above examples are all well and good for listing the contents of a single directory. Both the dir and ls Operating System commands support recursively listing directory content, and so shall we.
There are a few different approaches to recursion. We can wrap the above DATA step in a macro loop, but that would slightly stretch the idea of a single DATA step. We could use arrays or a simple delimited list in a long character variable as the recursion queue. Another is to implement a simple First-In-First-Out queue using Hash Tables, a topic for a future post.
The approach to implement recursion using a queue in a long character variable is quite straightforward using the CALL CATX routine, a DO WHILE loop and the SCAN function. The SCAN function does not require a predefined fixed number of entries, so its flexibility is quite convenient for our case.
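Before bringing in the file system, the queue mechanics can be sketched on their own. In the minimal example below, the entries and the variable names queue, entry and k are made up for illustration. CALL CATX appends entries to the end of the queue and SCAN reads the entry at position k.

```sas
data _null_ ;
  length queue $ 200 entry $ 50 ;

  * -- seed the queue and append two made-up entries -- ;
  queue = "." ;
  call catx( "|", queue, "reports" ) ;
  call catx( "|", queue, "reports/2023 final" ) ;

  * -- walk the queue until SCAN returns a blank value -- ;
  k = 1 ;
  entry = scan( queue, k, "|" ) ;

  do while ( not missing( entry ) ) ;
    put k= entry= ;
    k = k + 1 ;
    entry = scan( queue, k, "|" ) ;
  end ;
run ;
```

New entries can be appended while the loop is running, which is what lets the queue grow as sub-directories are discovered.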
The initial, or seed, value of the queue character variable is simply our root directory to inventory. We use a single dot to represent the root directory as our seed. Its purpose is also to give SCAN a value to return for the first entry, as SCAN on an empty string returns an empty string, i.e. missing, and the loop would never start.
data work.list ;

  length type $ 10 name $ 512 relpath q_entry $ 1024 path $ 2048 queue $ 10240 root $ 512 ;

  * -- root directory to inventory -- ;
  root = "C:/Users/me/Documents" ;

  * -- initialise our queue with the current directory -- ;
  queue = "." ;

  k = 1 ;
  q_entry = scan( strip(queue), k, "|" );

  do while ( not missing( q_entry ) );

    * -- reset return variables and any references -- ;
    call missing( name, relpath, path, type, did );

    * -- assign reference to the current entry and open it as a directory -- ;
    * -- no CONTINUE here as it would skip the queue increment below and loop forever -- ;
    if ( filename( 'ditem', catx( "/", root, q_entry ) ) = 0 ) then did = dopen( 'ditem' );

    * -- DID is missing or 0 if the directory could not be opened and the item loop is skipped -- ;
    if ( did > 0 ) then do i = 1 to dnum( did ) ;

      * -- common reference details to both directory and files -- ;

      * -- get name -- ;
      name = dread( did, i );

      * -- get relative path -- ;
      relpath = strip(tranwrd( catx( "/", q_entry, name ), './', ''));

      * -- get absolute path -- ;
      path = catx( "/", root, relpath );

      * -- determine directory or file -- ;
      if ( filename( 'item', path ) ^= 0 ) then continue; * <-- could not assign a filename so next item ;

      ditem = dopen( 'item' );

      if ( ditem > 0 ) then do; * a directory ;
        type = "DIR";
        output ; * add directory to the output data set ;

        rc = dclose( ditem );
        rc = filename( 'item', '' );

        * -- add it to the queue -- ;
        call catx( "|", queue, relpath ) ;

        continue ; * <-- next item ;
      end;

      * -- if we are here .. it is a file -- ;
      type = "FILE";

      * ... derive any file properties here ... ;

      * -- clear the file reference -- ;
      rc = filename( 'item', '' ) ;

      output ; * add file to the output data set ;
    end;

    * -- close directory and clear its reference -- ;
    if ( did > 0 ) then rc = dclose( did ) ;
    rc = filename( 'ditem', '' ) ;

    * -- get next directory in the queue -- ;
    k = k + 1 ;
    q_entry = scan( strip(queue), k, "|" );

  end;

run;
The DO WHILE loop works well simply because the SCAN function returns a blank, i.e. missing, value once there are no more entries in the queue, which ends the loop. We use a pipe (‘|’) delimiter in the queue to allow for directory and file names to include spaces. The pipe is usually a safe delimiter because it has a special role in most Operating Systems and therefore rarely appears in directory or file names.
The length of the queue variable is arbitrary, but long. A length of 10240 is 10 KB, which in most cases is more than sufficient, even with multi-byte character sets. Bear with me for a short but rather important aside. For character variables, most users think of the LENGTH statement as defining the number of characters, which happens to hold when the SAS session uses a single byte character set. In actuality, the LENGTH statement defines the number of bytes allocated to store a variable value, which translates 1-to-1 to the number of characters only for a single byte character set. This is also the case for SUBSTR and many other common functions. They count bytes and not characters, which is why the use of the SAS K-functions can be quite important.
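The difference between bytes and characters is easy to see with a minimal sketch that compares LENGTH with its K-function counterpart KLENGTH. The text value is just an example, and the result depends on the session encoding. In a session using a single byte encoding the two values are the same, while in a UTF-8 session they differ because the accented characters take more than one byte each.

```sas
data _null_ ;
  length text $ 20 ;

  * -- example text with two characters outside the ASCII range -- ;
  text = "Grüße" ;

  * -- LENGTH counts bytes while KLENGTH counts characters -- ;
  bytes = length( text ) ;
  chars = klength( text ) ;

  put bytes= chars= ;
run ;
```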
We also use the relative path as queue entries to save space. Using the absolute path would just replicate the root path for every entry.
The TRANWRD function is used when deriving the relative path because the seed value in the queue is the single period, i.e. the current directory, and I made the cosmetic decision that relative paths should start with the directory item name and not ‘./’. The TRANWRD function can be replaced with an IF-THEN-ELSE construct if that makes it clearer.
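A minimal sketch of the IF-THEN-ELSE alternative is shown below, using made-up values for the queue entry and item name rather than reading a real directory. The variable names mirror the DATA step above.

```sas
data _null_ ;
  length q_entry name relpath $ 1024 ;

  * -- made-up item name and two queue entries, the seed and a sub-directory -- ;
  name = "summary.txt" ;

  do q_entry = ".", "reports" ;

    * -- items in the root directory get no prefix -- ;
    if ( q_entry = "." ) then relpath = name ;
    else relpath = catx( "/", q_entry, name ) ;

    put q_entry= relpath= ;
  end ;
run ;
```

One side effect worth noting is that TRANWRD replaces every occurrence of ‘./’ in the value, so a directory name that ends with a period would be altered as well. The explicit test on the seed value leaves such names untouched.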
Just another macro
We could use this DATA step as a template and add it to any macro or function that needs a directory list, or we can wrap it into a simple utility macro.
%utils_dir_list( path = , recursive = N, out = work.list);
The additional pieces in the macro are the standard parameter error checks and simple CREATE TABLE and INSERT INTO statements with PROC SQL to standardize the output data set structure.
An evolved variant of this macro will be part of the cx Library when it is released in the near future.