Directory Listings in SAS

There are utilities that you seem to need all the time. SAS has functions to create a directory, to test whether a directory, folder or file exists, and to create all sorts of different files, but I have yet to come across a function or procedure that simply lists the contents of a directory. There are different approaches, but let's look at a simple single DATA step approach that we can extend to include recursive processing of sub-directories and sub-folders.

A quick note on convention. When we use the term directory, we also mean a folder, even though the idea of a folder implies that some metadata properties may be associated with it. To the file system, it is most often still a directory.

Paths and SAS

The path is probably one of the few atomic elements that SAS cannot function without. Whether it is finding SAS macros or accessing data sets or files, the path is central to how SAS functions.

SAS can use the Windows style backward slash (‘\’) interchangeably with the forward slash (‘/’), even within the same value, for statements, functions and procedures. The path C:/Users/me is just as valid as C:\Users\me or even C:\Users/me.

Caution is warranted when using the CALL SYSTEM routine or the X statement to execute an operating system command. Then the operating system rules apply, but only for the command you are executing. Even in Windows, there are certain commands that accept a forward slash in the path. Others will interpret the forward slash as the beginning of a command option. For example, dir c:/Users/me will treat /Users as an invalid switch or command option.

If we want to be totally correct, we can query the path delimiter for the Operating System with a simple DATA step, the Java Object interface and the standard Java class java.io.File.

options set=CLASSPATH "%sysfunc(pathname(work))";

data _null_;
  declare JavaObj j( 'java.io.File', '' );

  length delim $ 5 ;
  j.getStaticStringField("separator", delim) ;

  call symput( 'path_delim', strip(delim) );
run;

The DATA step will return the path delimiter as the macro variable path_delim.

The OPTIONS statement at the beginning is just a trick to ensure that the CLASSPATH is initialized with a non-empty value to avoid a SAS Warning message. If you already have CLASSPATH initialized, this step is not necessary.

The basics

The most common approach is to use a FILENAME statement with a PIPE device type and a DATA step. If we use the Microsoft Windows dir command as an example, we get the following DATA step:

filename dirlist pipe 'dir C:\Users\me\Documents' ;

data work.list;
    length line $ 200 ;
    
    infile dirlist ;
    input ;

    line = strip(_infile_);
run;

filename dirlist clear ;

The _INFILE_ automatic variable contains the input record buffer, that is, the entire line read by the INPUT statement from the pipe defined in the FILENAME and INFILE statements. Each line shows which attributes are available with this technique. The character string would then be parsed into the different components of the directory inventory.

The same technique can be applied with the Linux/Unix command ls. To get a comparable list, you will need to use the command ls -alF or the common alias ll.

Few installations of SAS have both the dir and ls commands available, but there are nonetheless great examples of simple macros that check the underlying Operating System and select the appropriate command to use.

The other option is to perform the directory inventory entirely in SAS using the functions DOPEN and DREAD, with some help from DNUM.
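As a minimal sketch of such a macro, the example below uses the automatic macro variable SYSSCP to choose between dir and ls. The macro name dir_pipe and its parameters are hypothetical, and the sketch assumes that any host other than Windows is Linux/Unix.

```sas
/* Hypothetical helper: assign a PIPE file reference with the      */
/* directory list command that suits the host Operating System.    */
%macro dir_pipe( fileref = dirlist, path = ) ;

    %if ( &sysscp = WIN ) %then %do;
        /* Windows: doubled quotes yield dir "path" */
        filename &fileref pipe "dir ""&path""" ;
    %end;
    %else %do;
        /* assumed Linux/Unix: long listing with type indicators */
        filename &fileref pipe "ls -alF '&path'" ;
    %end;

%mend ;

%dir_pipe( fileref = dirlist, path = C:\Users\me\Documents ) ;
```

The file reference can then be read with the same INFILE and INPUT statements as in the DATA step above.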

filename root 'C:/Users/me/Documents' ;

data work.list ;

    * --  return variables  -- ; 
    length name $ 512 ;

    * --  directory to inventory  -- ; 
    did = dopen( 'root' );

    * --  DOPEN returns 0 if the directory could not be opened  -- ;
    if ( did > 0 ) then do;
        do i = 1 to dnum(did) ;
            * --  directory item name  -- ;
            name = dread( did, i );
            output;
        end;

        rc = dclose( did );
    end;
run;

filename root clear ;

The DATA step above is fairly straightforward. We iterate through all items in the directory using DNUM and DREAD.

Note that the FILENAME statement that assigns the file reference root uses a forward slash (‘/’), just to make the point from our discussion above.

The DREAD function cannot distinguish between a directory and a file, but we can take advantage of DOPEN, which returns the value 0 if it fails to open the file system item as a directory, such as when the item is a file. It is a simple trick to tell a directory from a file.

    if ( filename( 'item', path ) ^= 0 ) then 
        continue;  * <-- could not assign a filename so next item ;

    ditem = dopen( 'item' );

    if ( ditem > 0 ) then do;
        type = "DIR";
        rc = dclose( ditem );
    end; else do;  
        type = "FILE";
    end;

    * --  clear the file reference  -- ;
    rc = filename( 'item', '' ) ;

Note that the DATA step and the above code snippet explicitly clear the file references for both the directory we inventory and each individual item. Otherwise the directory or file remains locked to the SAS session, and any subsequent attempt to assign a file reference will fail with a SAS Error. To get out of situations like that, filename _all_ clear is a great escape.

Adding file information

Most directory listing macros also include some information about files, such as the last modified date, which most Operating System directory list commands report. We can obtain additional information about a file using the SAS functions FOPEN and FINFO.

    * --  open file  -- ;
    fid = fopen( 'item' );

    * --  get properties  -- ;
    created = finfo( fid, "create time" );
    modified = finfo( fid, "last modified" );
    size = finfo( fid, "file size (bytes)" );

    * --  close file   -- ;
    rc = fclose( fid ) ;
    fid = 0 ;

The above code assumes that the additional information is available for create and last modified date/time as well as the file size in bytes. To obtain the names of the information properties available for a file, use the SAS functions FOPTNUM and FOPTNAME.

The FINFO function returns the value as character, so converting to numeric variables with a date/time format, or to a file size in units other than bytes, requires some data type conversions as additional steps.
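A minimal sketch of that lookup is shown below. The file path and the data set name work.file_props are placeholders for illustration only, and the properties returned depend on the Operating System.

```sas
filename item 'C:/Users/me/Documents/example.txt' ;  /* hypothetical file */

data work.file_props ;

    length property $ 64 value $ 512 ;

    fid = fopen( 'item' );

    /* FOPEN returns 0 if the file could not be opened */
    if ( fid > 0 ) then do;

        /* one row per information property available for the file */
        do i = 1 to foptnum( fid ) ;
            property = foptname( fid, i );
            value = finfo( fid, property );
            output;
        end;

        rc = fclose( fid );
    end;

    keep property value ;
run;

filename item clear ;
```

Printing the resulting data set gives the exact property names to pass to FINFO on the current system.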
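The conversions could look something like the sketch below. It assumes the property names used above and a date/time value in a form such as 05Feb2015:14:03:22, which may differ between Operating Systems and SAS versions. The file path, data set name and the variables size_bytes, size_kb and modified_dt are hypothetical.

```sas
filename item 'C:/Users/me/Documents/example.txt' ;  /* hypothetical file */

data work.file_info ;

    length modified size $ 64 ;
    format modified_dt datetime20. ;

    fid = fopen( 'item' );

    if ( fid > 0 ) then do;
        modified = finfo( fid, "last modified" );
        size = finfo( fid, "file size (bytes)" );
        rc = fclose( fid );
    end;

    /* the ?? modifier suppresses log messages for unreadable values */
    size_bytes = input( size, ?? 32. );
    size_kb = size_bytes / 1024 ;

    /* assumes the value can be read with the DATETIME informat */
    modified_dt = input( modified, ?? datetime20. );

    drop fid rc ;
run;

filename item clear ;
```

If the date/time value has a different layout on your system, the numeric variable is simply missing and another informat will be needed.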

One additional nice feature with the FINFO function is that a missing value is returned if an information property with the specified name does not exist.

There are the corresponding functions DOPTNUM and DINFO for directories as well, but I usually elect to not use them simply because the only information item available on Linux/Unix is the directory name.

Recursive directory listings

The above examples are all well and good for listing the contents of a single directory. Both the dir and ls Operating System commands support recursively listing directory content, and so shall we.

There are a few different approaches to recursion. We can wrap the above DATA step in a macro loop, but that would slightly stretch the idea of a single DATA step. We could use arrays or a simple delimited list in a long character variable as the recursion queue. Another is to implement a simple First-In-First-Out queue using Hash Tables, a topic for a future post.

The approach to implement recursion using a queue in a long character variable is quite straightforward using the CALL CATX routine, a DO WHILE loop and the SCAN function. The SCAN function does not require a predefined, fixed number of entries, so its flexibility is quite convenient for our case.

The initial, or seed, value of the queue character variable is simply our root directory to inventory. We use a single dot to represent the root directory as our seed. Its purpose is also to give SCAN a return value, since SCAN on an empty string returns an empty string, i.e. missing.

data work.list ;

    * --  q_entry is sized explicitly as SCAN otherwise defaults to 200  -- ;
    length type $ 10 name $ 512 relpath $ 1024 path $ 2048  
           queue $ 10240 q_entry $ 1024 root $ 512 ;

    * --  root directory to inventory  -- ;
    root = "C:/Users/me/Documents" ;

    * --  initialise our queue with the current directory  -- ;
    queue = "." ;

    k = 1 ;
    q_entry = scan( queue, k, "|" );

    do while ( not missing( q_entry ) );

        * --  reset return variables and any references  -- ;
        call missing( name, relpath, path, type, did ); 

        * --  assign reference to the current entry and open it  -- ;
        if ( filename( 'ditem', catx( "/", root, q_entry ) ) = 0 ) then 
            did = dopen( 'ditem' );

        * --  skip the entry if the directory could not be opened  -- ;
        if ( did > 0 ) then do;

            do i = 1 to dnum( did ) ;

                * --  common reference details to both directory and files  -- ;
                * --  get name  -- ; 
                name = dread( did, i );     

                * --  get relative path  -- ; 
                relpath = strip(tranwrd( catx( "/", q_entry, name ), './', ''));   

                * --  get absolute path  -- ;  
                path = catx( "/", root, relpath );                                

                * --  determine directory or file   -- ;
                if ( filename( 'item', path ) ^= 0 ) then 
                    continue;  * <-- could not assign a filename so next item ;

                ditem = dopen( 'item' );

                if ( ditem > 0 ) then do;
                    *  a directory  ;

                    type = "DIR";
                    output ;  *  add directory to the output data set  ;

                    rc = dclose( ditem );
                    rc = filename( 'item', '' );

                    * --  add it to the queue  -- ;
                    call catx( "|", queue, relpath ) ;

                    continue ;  * <-- next item ;
                end; 

                * --  if we are here .. it is a file   -- ;
                type = "FILE";

                * ...  derive any file properties here  ... ;

                * --  clear the file reference  -- ;
                rc = filename( 'item', '' ) ;

                output ;  *  add file to the output data set  ;
            end;

            * --  close directory reference  -- ;
            rc = dclose( did ) ;
        end;

        * --  clear the directory file reference  -- ;
        rc = filename( 'ditem', '' ) ;

        * --  always advance to the next directory in the queue  -- ;
        k = k + 1 ;
        q_entry = scan( queue, k, "|" );
    end;
run;

The DO WHILE loop works well simply because the SCAN function returns a missing value once no more entries exist, which ends the loop. We use a pipe (‘|’) delimiter in the queue to allow directory and file names to include spaces. The pipe is usually a safe delimiter because it has a special role in most Operating Systems and therefore rarely appears in directory or file names.

The length of the queue variable is arbitrary, but long. A length of 10240 is 10 KB, which in most cases is more than sufficient, even with multi-byte character sets. Bear with me for a short but rather important aside. Most users think of the LENGTH statement as defining the number of characters in a character variable, which happens to hold when the SAS session uses a single-byte character set. In actuality, the LENGTH statement defines the number of bytes allocated to store a variable value, which translates 1-to-1 to the number of characters only for a single-byte character set. The same is true for LENGTH, SUBSTR and many other common functions. They count bytes and not characters, which is why the SAS K-functions can be quite important.
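The difference between bytes and characters can be seen with a minimal sketch that compares the LENGTH function with its K-function counterpart KLENGTH. The value is only an illustration, and the result depends on the session encoding: in a UTF-8 session the two counts differ because the letter ö occupies more than one byte, while in a single-byte session they are the same.

```sas
data _null_ ;

    length name $ 20 ;

    * --  illustrative value with one non-ASCII character  -- ;
    name = "Malmö" ;

    * --  LENGTH counts bytes and KLENGTH counts characters  -- ;
    n_bytes = length( name );
    n_chars = klength( name );

    put n_bytes= n_chars= ;
run;
```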

We also use the relative path as queue entries to save space. Using the absolute path would just replicate the root path for every entry.

The TRANWRD function is used when deriving the relative path because the seed value in the queue is the single period, i.e. the current directory, and I made the cosmetic decision that relative paths should start with the directory item name and not ‘./’. The TRANWRD function can be replaced with an IF-THEN-ELSE construct if that makes it clearer.
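A minimal sketch of the IF-THEN-ELSE alternative is shown below, using illustrative values for the queue entry and the item name. Unlike TRANWRD, it only treats the seed value itself as special and leaves any other ‘./’ sequence in a path untouched.

```sas
data _null_ ;

    length q_entry name relpath $ 1024 ;

    * --  illustrative values  -- ;
    q_entry = "." ;
    name = "report.sas" ;

    * --  items in the root directory get no prefix  -- ;
    if ( q_entry = "." ) then 
        relpath = name ;
    else 
        relpath = catx( "/", q_entry, name );

    put relpath= ;
run;
```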

Just another macro

We could use this DATA step as a template and add it to any macro or function that needs a directory list, or we can wrap it into a simple utility macro.

%utils_dir_list( path = , recursive = N, out = work.list);

The additional pieces in the macro are the standard parameter error checks and simple CREATE TABLE and INSERT INTO statements with PROC SQL to standardize the output data set structure.

An evolved variant of this macro will be part of the cx Library when it is released in the near future.

Magnus Mengelbier
