Tuesday, August 12, 2014

Handling USGS data for simple visualization

In yesterday's post I mentioned that I'd make a short video showing how I imported and handled the USGS earthquake data in detail.  

So I got up early this morning and put it together.  Here it is: 



But in the process of doing this, I realized that a short outline of the process might be useful.  

Here's what the video shows.... 

1. Using the USGS earthquake search tool to find earthquake data in North America and download to a CSV file.
2. Uploading the CSV into Google Spreadsheets. 
3. Good data practices (setting up the data import as an uneditable sheet, labeling the data with metadata about where it came from and when downloaded). 
4.  Copying the data into new tabs for cleaning / munging / manipulation. 
5.  Extracting the name of the state from the "location" column in the data.  (4:20 to 11:00 on the video)
6.  Extracting the year of the earthquake from the "time" column.  (14:15 to 15:00 on the video)  
7.  Using the spreadsheets "Filter" function to select data by state. 
8.  Counting the number of earthquakes by year by state (using the "countif()" function).  
9.  Pulling the California and Oklahoma data together into a sheet for creating the chart.  (21:00 to 22:00 on the video)  


Most of this is pretty straightforward stuff.  Search for the data, import into a spreadsheet, and then filter/clean/transform the data until you get it into the shape you want.  

If you think about the goal of my data transformation, what I WANT is this a data table like this: 




... where column B is the number of earthquakes of magnitude > 3.0 in California, and column C is for Oklahoma.  

Most of the video is me messing around, trying to pull out the name of the state from the location description (which is in column E).  Most of the entries look like this: 


The place column is what I got from the USGS--it's in the CSV file.  But what I NEED is state(extracted), in order to get my counts of earthquakes by state.  

The hardest part of this entire process is writing the spreadsheet function that can pull the state name out of the place column.  

I ended up using a =regexpextract (...) function.  That's not the simplest, I know, but it's the way my programmer's mind works.  (To learn more about regular expressions, which are probably the handiest tool in the programmer's toolbox, see regular expression tutorial.)  

Just after I made the video, I discovered another way to do the same thing.  This is from my Google colleague Ronald Ho, and it's so clever I have to show it to you.  (Just in case you also need to extract the last term in a string on some future spreadsheet.) 

=iferror (RIGHT(N1879, LEN(N1879)-FIND("*",SUBSTITUTE(N1879," ","*",LEN(N1879) -LEN(SUBSTITUTE(N1879," ",""))))), N1879)

This whacky expression is a clever bit of programming that basically does the following: 

   1.  It find/replaces blanks ("") for every space in the string of N1879, compacting it.

   2.  It computes how many spaces there are in N1879 by subtracting the
       full-length string
LEN(N1879) from the compacted form. 

   3.  It replaces every space by a "*" character. 

   4.  The expression returns the RIGHT part of the string, everything from the last "*" 
        position to the end of the string.  

   5.  The whole expression is wrapped in an iferror so that if the string doesn't have 
        any spaces, the FIND command will cause an error, and then iferror will return 
        just the value of the string in the cell.  

It's complicated, but cute.  And it highlights an important point:  The functions that are built-into your data handling system need to be powerful enough to allow you to do whatever alterations to the data that you need. 

In this case, the Google Spreadsheets scripting language is missing a function to "find from the end" (that is, find the last space in the string).   (IF there was a FINDBACKWARD function, I would have written  = right(N1879, findbackward(N1879, " ")) which would have given me the last word in the sentence.  

Ah well.  Every language is missing something, that's why programming is sometimes a bit tricky.  

Hope you find this video useful!  

Comments (especially about how to improve this) welcome. 

Search on! 


4 comments:

  1. I will give it a go and let you know how I make out. In the meantime, the California 2010 results were unusually high in your original Googlesheets 650+(answer page) but your video is showing different results +150. I would like to figure out why the difference so can you share the Googlesheets presented in the video? You can see on the video at 17:13 what I am referring to.

    May I recommend that if Google decides to do a MOOC on Googlesheets as they did with Fusion Tables, that they include this video which I think would be excellent because it is basically self-contained requiring very little additional information.




    ReplyDelete
    Replies
    1. I just realized why the difference. Your new googlesheet starts in August 2010 and the old had more than 500 earthquakes between Jan and Aug 2010. This unusual activity is quite interesting.

      Delete
  2. Thanks Dr. Russell for this extra post. Outline idea is fantastic, is kind of summary that allows us to give better use to video.

    I'll do the analysis with Mexico data, and post the findings.

    Amazing as always.

    ReplyDelete
  3. Hi!

    The SPLIT() function would pull off the state into a separate cell if the comma always precedes the state and does not occur elsewhere in the data.

    regards,

    ReplyDelete