Tuesday, January 8, 2013

VIVO Ingest in 1.5.1

Little update: about 3 months ago we moved from Gainesville, Florida to Boulder, Colorado, as I accepted a position with the University of Colorado Boulder.  I'm back in the trenches, writing code every day and working on VIVO for Faculty Affairs at CU Boulder.  Hopefully this will restart my attempts at writing more often and focus that writing on more technical and less managerial matters.

Updating Data Through Data Ingest in VIVO 1.5.1

I've had a lot of tasks at CU since starting, and I'll go over some of the things I've learned and written soon enough.  For now I wanted to talk about updating your data in VIVO 1.5.1.  In the old days, when four programmers at UF embarked on data ingest, if you wanted to get data into VIVO you had to add it to the system's "Main Model," known as KB2.  This made data ingest difficult during the update phase, because you either had to
  1. start over from scratch with a blank VIVO, or
  2. remove the previously ingested data that contained the data you wanted to change.
Semantic triple stores didn't have a key that we could use to link a row in KB2 to the data coming in from our source (in hindsight, I believe there are ways with hash keys that we probably should have done this).  Because of that, we constructed a very complicated (and time-expensive) process to compare the data you are putting into VIVO against the last data you ingested from that source.  It creates an additions file and a subtractions file, which you then apply against the KB2 model.  Basically, it was a bit like writing a remove and an insert to accomplish an update in SQL.
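To make that concrete, the compare step boils down to a set difference over the two exports.  Here's a minimal sketch of the idea in Python (this is not the actual VIVO ingest tooling, which works against Jena models; the function and output file names are made up):

def diffExports(oldExport, newExport):
    #treat each export as a set of triples, one per line
    with open(oldExport) as f:
        oldTriples = set(f.readlines())
    with open(newExport) as f:
        newTriples = set(f.readlines())

    #additions: in the new export but not the old one
    with open('additions.nt', 'w') as f:
        f.writelines(sorted(newTriples - oldTriples))

    #subtractions: in the old export but no longer in the new one
    with open('subtractions.nt', 'w') as f:
        f.writelines(sorted(oldTriples - newTriples))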

Now in VIVO 1.5.1 this is mostly the same.  However, data doesn't have to be in the main model to be indexed.  So now we can separate our data by source, or in the case of CU, by the tables that generated the data.  This allows for a shorter ingest process: we're only interested in dropping and adding to the models that have changes.  I took CU's current process (which uses Selenium scripts against the UI), ripped out the portion that loaded data into KB2 (downloading the data, then using the add/remove screen to load the export into the main graph), and added a method to drop the import graph that we used.  This cut my time for an entire ingest from 3-4 hours down to 1 hour.
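To give a feel for the named-graph idea, here's a rough local illustration using rdflib (just a sketch of the concept, not the actual mechanics against VIVO's store; the graph URI and file name are made up):

from rdflib import ConjunctiveGraph, URIRef

#each source gets its own named graph, so an update only touches
#the graphs whose source tables changed (this URI is hypothetical)
store = ConjunctiveGraph()
positionsGraph = URIRef('http://vivo.uccs.edu/graph/fis_positions')

#drop just the positions graph, leaving every other source alone...
store.remove_context(store.get_context(positionsGraph))

#...then re-load the fresh export into the same named graph
store.get_context(positionsGraph).parse('fis_positions.nt', format='nt')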

We were still rebuilding from scratch with each ingest, and now that we're heading to production I wanted to make this process a little faster.  So I wrote the first of a couple of scripts toward automating the entire process.  This first script reviews the .dat files for changes, allowing me to drop only the graphs that have changes and re-run their ingest scripts.

The process was fairly simple, and I've included a couple of sites and blogs I used to figure out what to do.  By hand (which will become another script soon), I copied down the data from the previous run, ran a new export, and copied down the new data.  I then pass my new little script the two paths for the old data and the new data.
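The entry point is nothing fancy; it's roughly along these lines (a sketch, with the script name made up):

import sys

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print "Usage: reviewChanges.py <old-data-path> <new-data-path>"
        sys.exit(1)
    #both paths are directories full of .dat files from the SQL export
    reviewFolderForChanges(sys.argv[1], sys.argv[2])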

The first step in the script is to review all of the file names and see what is missing and what's new.  Since the export process is an SQL script that always uses the same file names for each of its methods, we don't have to worry about misspellings, just new .dat files or the loss of a .dat file.
import os

def reviewFolderForChanges(oldFilePath, newFilePath):
    #list the .dat files in each export directory
    oldFiles = os.listdir(oldFilePath)
    newFiles = os.listdir(newFilePath)

    #files present in both runs get diffed; files only in the
    #new run are flagged as new
    for newFile in newFiles:
        if newFile in oldFiles:
            reviewFileForChanges(os.path.join(oldFilePath, newFile),
                                 os.path.join(newFilePath, newFile))
        else:
            print "New File Found: " + newFile

    #files that were in the old run but vanished from the new one
    for oldFile in oldFiles:
        if oldFile not in newFiles:
            print "File Missing: " + oldFile
The next and final step in the process was to review the files themselves.  I could have used Python's difflib, but I wanted more information from the files: I wanted to know what was new and what had been removed.  Plus I found a nice little reference post by Frankie Bagnardi that I wanted to implement myself.  The result has greatly increased the ability of the ingest operator (me today, probably Alex and Vance at other times) to make sure that the changes coming in are reflected in VIVO.
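Here's a minimal sketch of what reviewFileForChanges boils down to (the real version follows Bagnardi's approach; at its core it's a set difference over lines):

def reviewFileForChanges(oldFile, newFile):
    #read both versions of the file into sets of lines
    with open(oldFile) as f:
        oldLines = set(f.readlines())
    with open(newFile) as f:
        newLines = set(f.readlines())

    #only report on files that actually changed
    if oldLines == newLines:
        return

    print "Changes in file:" + newFile
    #lines that disappeared since the last export
    for line in oldLines - newLines:
        print "- " + line.strip()
    #lines that are new in this export
    for line in newLines - oldLines:
        print "+ " + line.strip()

Running it over the old and new exports produces output like the following: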


Changes in file:/Users/stwi5210/Source/uccs-new-data/fis_faculty_member_positions.dat
- "http://vivo.uccs.edu/fisid_XXXXX","http://vivo.uccs.edu/deptid_XXXX","1435","Chair","http://vivoweb.org/ontology/core#FacultyAdministrativePosition","2","2"
- "http://vivo.uccs.edu/fisid_XXXXX","http://vivo.uccs.edu/deptid_XXXX","1419","Lecturer","http://vivoweb.org/ontology/core#FacultyPosition","5","4"


With information like this, I can go to the two individuals listed and make sure they no longer hold the Chair or Lecturer positions.  This lets me know that the ingest was successful and that it's ready to migrate to production.

All in all, the script took about an hour to construct and run, and it saved me about 40 minutes of ingesting.  Plus, I was able to review VIVO after the ingests finished for the data that should have changed, which is a big improvement over our previous methods of review.

Citation(s):

  • Compare Two Files with Python by Frankie Bagnardi - http://aboutscript.com/blog/posts/107
  • Python: iterate (and read) all files in a directory (folder) by Bogdan T - http://bogdan.org.ua/2007/08/12/python-iterate-and-read-all-files-in-a-directory-folder.html
