File System Scanner

Overview

The File System Scanner is a collection of tools that records the history of a file system in a relational database. It has three parts:

Python and shell scripts stored and executed on each system that needs file system scanning.
Web services hosted on the warehouse web server (views.cira.colostate.edu/tsdw/).
SQL Server stored procedures.

Code Description

Scan the file system to a file

/var/DWUtils/FileSystemScanner/doScan.py
This is the top-level script, which runs scans of several directory trees. The script still needs to be parameterized. Its main job is to connect each directory tree to be scanned to an instance of ScannerToFile, described below.
/var/DWUtils/FilesystemScanner/pyFSScan/fsScan.py
This file contains several classes that scan directory tree contents into various forms. The class used here is ScannerToFile, whose methods are called from doScan.py. This file does the work of scanning the file system and storing the results. It also contains the logic for parsing the .log or .lst files that list the contents of adjacent .tar.gz archive files. The scan results are stored in a binary file whose format is defined within the ScannerToFile methods.
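
The kind of walk that a scanner like ScannerToFile performs can be sketched as follows. This is a minimal illustration in Python 3, not the actual fsScan.py code: the function name scan_tree and the dictionary keys are hypothetical, chosen to mirror the CreateOrUpdateFileItems parameters described later in this document, and the binary output format is deliberately left out because it is defined only inside ScannerToFile.

```python
import os
from typing import Dict, Iterator


def scan_tree(root: str) -> Iterator[Dict[str, object]]:
    """Yield one record per directory and per file under root.

    Hypothetical record layout, for illustration only.
    """
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        depth = dirpath.rstrip(os.sep).count(os.sep) - base_depth
        file_records = []
        total_size = 0
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # vanished or unreadable entry: skip it
            total_size += st.st_size
            file_records.append({
                "is_file": True,
                "path": path,
                "depth": depth + 1,
                "file_count": 0,
                "total_size_bytes": st.st_size,
                "mod_time": int(st.st_mtime),
            })
        # Directory totals cover only the files directly inside it,
        # not the contents of subdirectories.
        yield {
            "is_file": False,
            "path": dirpath,
            "depth": depth,
            "file_count": len(file_records),
            "total_size_bytes": total_size,
            "mod_time": int(os.stat(dirpath).st_mtime),
        }
        yield from file_records
```

Calling list(scan_tree("/some/dir")) returns the directory record for each directory followed by the records of the files it directly contains.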

Process the file to the database

/var/DWUtils/FilesystemScanner/doIngest.sh
This shell script publishes the binary scan result files generated in the previous step to the web services. It holds the arguments that are passed to buildHistoryWeb.py.
/var/DWUtils/FilesystemScanner/pyFSScan/buildHistoryWeb.py
This file parses a binary file system scan result file and publishes its contents to the TSDW web services. The process has three stages: calling the preliminary web services / stored procedures that set up for a scan, iterating over the file scan records in batches and publishing each batch to the web services, and finally calling the methods for post-scan processing.
/var/DWUtils/FileSystemScanner/TSDWConnection.py
This file contains a class which represents the connection to TSDW web services.
\TSDW\TSDWAPP\DataRequest\DataCatalog.asmx.vb
This is the server-side code for the web services. The file contains many web service methods, but only a few are used for file system scanning. The services used here all map directly to SQL Server stored procedures with the same name. The web services / stored procedures are described below.

SetfileSystemScanTime
@Timestamp bigint - Epoch timestamp representing the time a scan was performed
Alters an insert trigger on the FileSystemItem table such that insert/update times are set as the epoch time given in @Timestamp
StartFileSystemScan
Sets the Scan bit of all file system item records to 0. As file system items are encountered in the scan result the db record is updated and the scan bit is set to 1. After all scan results have been processed any records which still have a 0 scan bit were not encountered when processing the file scan results.
CreateOrUpdateFileItems
@IsFile bit - 1 if this is a file 0 if it is a directory
@IsDirectory bit - 1 if this is a directory, 0 if it is a file
@FTPPath varchar(400) = null - The path for the file system item
@Depth int - The depth of the item from root
@FileCount int - For a directory, the number of files in this directory (does not include subdirectories)
@TotalFileSizeBytes bigint - Total number of bytes in this directory (does not include subdirectories)
@ModTime bigint - Epoch timestamp representing file modification time.
This plural web service (Items) maps to a singular stored procedure. The stored procedure receives information about the item and sets the item's scan bit to 1. Insert/update times are managed by triggers.
CompleteFileSystemScan
@Timestamp bigint = null - Timestamp for completion of the file system scan. If given, it is used as the deletion time for unscanned file system items. If not given, the current time is used.
This method completes the scan by treating unscanned items as deleted. Any file system item record which has a scan bit of 0 has its IsDeleted bit set to 1 and its deletion time set to @Timestamp if given, or otherwise to the current system time.
UpdateFileSystemHIDs
Sets the hierarchy id field for each file system item record which has a null hid and IsDeleted == 0.
UpdateFileSystemTreeTotals
Updates the TreeFileCount and TreeTotalSizeBytes fields in the FileSystemItem table. This applies only to directories. The updated fields hold the count and total size of the files in the entire directory tree rooted at the directory's path.
ResetFileSystemScanTime
Resets the insert/update triggers on the FileSystemItem table to use the current system time instead of a time which may have been set in SetfileSystemScanTime.
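
The scan-bit bookkeeping performed by StartFileSystemScan, CreateOrUpdateFileItems and CompleteFileSystemScan, together with the batched publishing done by buildHistoryWeb.py, can be sketched in Python 3 as below. This is only an in-memory illustration under stated assumptions: the class FileSystemItemTable, the helpers batched and run_ingest, and the simplified (path, mod_time) record are all hypothetical stand-ins for the real table, stored procedures and web service calls. How the real procedures treat a previously deleted item that reappears is not described here, so the sketch leaves its IsDeleted flag unchanged.

```python
import time
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, int]  # (path, mod_time): simplified, hypothetical record


class FileSystemItemTable:
    """In-memory stand-in for the FileSystemItem table (illustration only)."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, Optional[int]]] = {}

    def start_scan(self) -> None:
        # Like StartFileSystemScan: clear the scan bit on every record.
        for row in self.rows.values():
            row["scan"] = 0

    def create_or_update(self, path: str, mod_time: int) -> None:
        # Like CreateOrUpdateFileItems: upsert the item and set its scan bit.
        row = self.rows.setdefault(path, {"is_deleted": 0, "deleted_time": None})
        row["mod_time"] = mod_time
        row["scan"] = 1

    def complete_scan(self, timestamp: Optional[int] = None) -> int:
        # Like CompleteFileSystemScan: records never seen are marked deleted.
        when = int(time.time()) if timestamp is None else timestamp
        marked = 0
        for row in self.rows.values():
            if row["scan"] == 0:
                row["is_deleted"] = 1
                row["deleted_time"] = when
                marked += 1
        return marked


def batched(records: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Yield successive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def run_ingest(table: FileSystemItemTable, records: Iterable[Record],
               batch_size: int = 1000, timestamp: Optional[int] = None) -> int:
    """Run one scan cycle and return how many records were marked deleted."""
    table.start_scan()
    for batch in batched(records, batch_size):
        # In the real flow each batch would be one web service call.
        for path, mod_time in batch:
            table.create_or_update(path, mod_time)
    return table.complete_scan(timestamp)
```

Running run_ingest twice against the same table, with a path missing from the second set of records, leaves that path's row flagged as deleted with the second run's timestamp.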

Disk Mounting Policies

The primary data partitions should be mounted under a directory named after the specific machine, e.g. /valkyr/data1/ or /viking/data1/.

Archive disks should be mounted in a directory structure like /archive-disks/[PlatformName]/[Disk ID]. As an example, the WestJump disks all have ID labels naming them WJ01, WJ02, etc. To scan these disks they should be mounted at /archive-disks/WestJump/WJ01/ and so on. This tells IWDW admins that the data resides on an offline archive disk and indicates which disk the scanned files belong to.
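
The archive disk naming convention can be expressed as a pair of small helpers. This is a sketch in Python 3; the names ARCHIVE_ROOT, ArchiveLocation, archive_mount_point and parse_archive_path are hypothetical and are not part of the scanner scripts.

```python
import posixpath
from typing import NamedTuple, Optional

ARCHIVE_ROOT = "/archive-disks"  # root of the convention described above


class ArchiveLocation(NamedTuple):
    platform: str
    disk_id: str


def archive_mount_point(platform: str, disk_id: str) -> str:
    """Build the mount point for one archive disk, with a trailing slash."""
    return posixpath.join(ARCHIVE_ROOT, platform, disk_id) + "/"


def parse_archive_path(path: str) -> Optional[ArchiveLocation]:
    """Return the platform and disk ID for a path on an archive disk.

    Returns None when the path is not under ARCHIVE_ROOT/<platform>/<disk>.
    """
    parts = posixpath.normpath(path).split("/")
    # "/archive-disks/WestJump/WJ01/x" -> ["", "archive-disks", "WestJump", "WJ01", "x"]
    if len(parts) < 4 or parts[0] != "" or parts[1] != ARCHIVE_ROOT.strip("/"):
        return None
    return ArchiveLocation(platform=parts[2], disk_id=parts[3])
```

For example, parse_archive_path applied to a scanned path under /archive-disks/WestJump/WJ01/ yields the platform WestJump and the disk ID WJ01, while a path on a primary data partition such as /valkyr/data1/ yields None.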

Configuring Automatically Scanned Directories

The directories which will be scanned on a particular system should be specified within /var/DWUtils/FileSystemScanner/doScan.sh

The doScan.sh script is executed daily by a cron job under the root user

Scanning Archive Disks

Archive disks are only intended to be scanned once and then stored offline. First the disks must be mounted in a directory which identifies the platform and specific disk. Note that in creating archive disks the utility dirsplit can be used to spread the platform across multiple disks while preserving the directory structure.

Once the archive disks have been mounted and filled with data, the script /var/DWUtils/FileSystemScan/DiskScan.py is used to scan the appropriate directories. A convenient approach is to scan the root of the platform, i.e. /archive-disks/[PlatformName], since this encompasses all member disks. This approach only works if ALL of the archive disks are mounted at once. If there are too many disks to mount simultaneously, each disk must be scanned independently, which is easily done with a shell script. Valkyr has a few examples of these scan and ingest scripts at /var/DWUtils/FileSystemScan/ArchiveScans/

After the scanning is accomplished another script is needed to perform the ingest of the scan results.

Here is an example scanning script for CARMMS 2.0

/var/DWUtils/FileSystemScan/DiskScan.py -i /archive-disks/CARMMS2/CARMMS2-01/ -o /var/DWUtils/FileSystemScan/ArchiveScans/CARMMS2/
/var/DWUtils/FileSystemScan/DiskScan.py -i /archive-disks/CARMMS2/CARMMS2-02/ -o /var/DWUtils/FileSystemScan/ArchiveScans/CARMMS2/
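
When a platform spans many disks, the per-disk commands above can be generated rather than typed by hand. The following Python 3 sketch assumes only the -i and -o options shown in the example; the function name per_disk_scan_commands is hypothetical, and the DiskScan.py path is copied from the example commands.

```python
import os
from typing import List

# Script path as it appears in the example commands above.
DISK_SCAN = "/var/DWUtils/FileSystemScan/DiskScan.py"


def per_disk_scan_commands(platform_root: str, output_dir: str) -> List[List[str]]:
    """Return one DiskScan.py argument list per disk directory.

    Every subdirectory of platform_root is treated as one mounted disk.
    """
    commands = []
    for name in sorted(os.listdir(platform_root)):
        disk_path = os.path.join(platform_root, name)
        if os.path.isdir(disk_path):
            commands.append([DISK_SCAN, "-i", disk_path + os.sep, "-o", output_dir])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) to execute the scans one disk at a time.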

Here is an example Ingest script for CARMMS 2.0

/var/3SDWPython/bin/python2.7 /var/DWUtils/FileSystemScan/pyFSScan/buildHistoryWeb.py -u http://views.cira.colostate.edu/tsdw/ -i /var/DWUtils/FileSystemScan/ArchiveScans/CARMMS2/ -l /var/DWUtils/FileSystemScan/ArchiveScans/CARMMS2/CARMMS2_Ingest.log --batchsize 10000 -t True -d True