7/20/2017

        Today I am working on some issues that I found earlier this week. The first one is the file name issue. I plan on adding a new column with the name of the file so that you can tell the data sets apart. The file names should also include the election name and the year of the election. While fixing that problem I also fixed the repeat problem; I just had to change the indentation of the print statement. The next issue is the directory. I want to make the code more general so that more people can use it.
        Before, it looked like this:
indir = '/home/dsmall/VA_votes/CSV_converter/Data' 
        Now it looks like this:
indir = 'Data/'
        I am not 100% sure why this works, but it does, presumably because a relative path is resolved from the directory the script is run from. The next step would be to get the scraper to work, but I do not know what is wrong with it. While double checking everything I found a new bug: some rows of data are being output when they should not be. This is a problem because these rows would cause a syntax error when you paste the INSERT statements into SQL, so I needed a new way to weed out this bad data. I created a regular expression that checks for a digit in the column meant for the number of votes. If that column holds a candidate's name or their party instead, the row is flagged and does not continue to the final array of data. The regex is a very simple one:
pattern = re.compile(r"\d")
        The '\d' means any digit.
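        To make the idea concrete, here is a minimal sketch of that filter. The rows, the column position, and the helper name has_vote_count are made up for illustration; the real script may structure its data differently.

```python
import re

# Same pattern as above: it matches if the text contains at least one digit.
pattern = re.compile(r"\d")


def has_vote_count(row, vote_col):
    """Return True when the vote column of this row contains a digit."""
    return bool(pattern.search(row[vote_col]))


# Made-up rows in the form [candidate, party, votes].
# The second row is a stray header line that should be dropped.
rows = [
    ["Jane Doe", "Democratic", "1204"],
    ["Candidate", "Party", "Votes"],
    ["John Roe", "Republican", "987"],
]

# Only rows whose vote column holds a digit reach the final list.
clean_rows = [row for row in rows if has_vote_count(row, vote_col=2)]
print(clean_rows)
```

        Because search() accepts a digit anywhere in the text, a value like "Precinct 5" would still pass. A stricter check, such as re.fullmatch with a digits-only pattern, would be an option if that turns out to matter.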
        A new issue is getting only the data we want. The website that I am getting the data from only has two options: with or without precincts. For presidential elections, it only has the data for the entire state. So if I only want the District 31 data, I have to add a layer of verification that checks each row against a select list of precinct ids. I am going to use a SQL query to look up all of the precinct ids inside of my working database. While double checking my code I stumbled upon a bug that looked like one I had already dealt with and fixed. I don't think it is the same bug, but it causes the same symptom. I was able to fix it by changing the point at which the INSERT statements are generated. My next goal is to be able to select only District 31 data from presidential elections and other elections. While working on this I found out that each election is formatted differently, so my script would need to work out what kind of election each file is and then format it accordingly. This would also mean my database needs to be changed again. There is also a problem with the political parties between elections: some elections have two and some have four or more.
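        As a rough sketch of the precinct check, the example below pulls the known precinct ids with a SQL query and keeps only the rows that match. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in; the table name precincts, the column name precinct_id, the ids, and the sample rows are all assumptions, not the real schema.

```python
import sqlite3

# Hypothetical stand-in for the working database (table and ids are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE precincts (precinct_id TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO precincts (precinct_id) VALUES (?)",
    [("101",), ("102",)],
)


def load_precinct_ids(connection):
    """Return the set of precinct ids stored in the database."""
    rows = connection.execute("SELECT precinct_id FROM precincts").fetchall()
    return {row[0] for row in rows}


def keep_known_precincts(rows, allowed_ids, id_col=0):
    """Keep only the rows whose precinct id is in allowed_ids."""
    return [row for row in rows if row[id_col] in allowed_ids]


# Made-up statewide rows in the form [precinct_id, candidate, votes].
statewide_rows = [
    ["101", "Jane Doe", "350"],
    ["555", "Jane Doe", "410"],
    ["102", "John Roe", "275"],
]

district_rows = keep_known_precincts(statewide_rows, load_precinct_ids(conn))
print(district_rows)
```

        Loading the ids into a set once keeps the per-row check cheap, which helps when a statewide file has many precincts.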




        I am now going to make a step-by-step guide on how to use my script. I got my data from http://historical.elections.virginia.gov/.
  1. Make sure you have a GitHub account that is linked to your computer by an SSH key.
  2. Open a terminal and go to the directory where you want the script to be. The next step will create a folder named "CSV_converter" with the script inside.
  3. Paste this command into terminal:
    git clone git@github.com:duncan-small/CSV_converter.git
  4. Now enter the newly made "CSV_converter" directory.
  5. Paste this command into your terminal to run the script and get the INSERT statements output:
    python3 Test-CSV.py
  6. Scroll up to the first line of output, which should start with "CREATE TABLE...", and select from there to the end of the output.
  7. Paste the selection into your SQL IDE. It will create and fill your table for you.
  8. You can add more data, either as your own .csv files or downloaded from http://historical.elections.virginia.gov/.
  9. On that site you can search for any election that you want.
  10. When you press "Download this election", make sure you also select "Precinct Results".
  11. Once the election is downloaded, drag the file into the "Data" folder inside the "CSV_converter" folder.
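        For anyone curious how files dropped into the "Data" folder could be picked up and tagged with their file name (the new column I mentioned at the top), here is a small sketch. It is not the code from Test-CSV.py; the function name rows_with_source and the sample file name are made up, and the demo writes to a temporary folder instead of the real "Data" folder.

```python
import csv
import tempfile
from pathlib import Path


def rows_with_source(indir):
    """Yield every row of every .csv file in indir, with the file name
    (minus the extension) appended as an extra last column."""
    for path in sorted(Path(indir).glob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.reader(handle):
                yield row + [path.stem]


# Demo with a throwaway folder and a made-up file name and row.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "2016_President_General.csv"
    sample.write_text("101,Jane Doe,350\n")
    tagged = list(rows_with_source(tmp))

print(tagged)
```

        With the real folder the call would be rows_with_source("Data/"), run from inside the "CSV_converter" directory so the relative path resolves correctly.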
