Second Day

         First thing I did was update my version of the scraper and try it again. Near the end of the process the program froze, so I added a new issue to the GitHub page. Next I deleted and reinstalled the entire script to make sure I was on the correct branch, then ran it again and waited. The program finished, but the first-page bug is still not fixed.
        While waiting for a solution I decided to focus on the data I did have for Prince William County. I was able to create a database with two tables, one for the 2015 House of Delegates Election and the other for the 2011 House of Delegates Election. Both of these tables only have the data for District 31, which is the district Elizabeth Guzman is working toward. I also added the 2007 Election to get a better idea of the voting trends.
        If I wanted to see the breakdown of the 2007 House of Delegates Election for only Prince William County, I would type:
SELECT * FROM District_31_2007 WHERE District_31_2007.county_city = 'Prince William County';
Then a table would pop up showing me that 4,585 people voted for the Republican Party and 4,540 voted for the Democratic Party.
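As a sketch of how that query works from Python's built-in sqlite3 module (the schema here is an assumption based on the query above, and the two sample rows just mirror the totals mentioned; the real tables likely have more columns):

```python
import sqlite3

# In-memory database for illustration; the real one lives in a file.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Assumed schema for the District 31 election table.
cur.execute("""CREATE TABLE District_31_2007 (
    county_city TEXT,
    party TEXT,
    total_votes INTEGER
)""")
cur.executemany(
    "INSERT INTO District_31_2007 VALUES (?, ?, ?)",
    [("Prince William County", "Republican", 4585),
     ("Prince William County", "Democratic", 4540)],
)

# Note the quotes around the string literal in the WHERE clause.
rows = cur.execute(
    "SELECT * FROM District_31_2007 "
    "WHERE county_city = 'Prince William County'"
).fetchall()
for row in rows:
    print(row)
```

The string literal in the WHERE clause has to be quoted, otherwise SQLite treats `Prince William County` as (invalid) identifiers instead of a value.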
        Marco got on and started to help me with the scraper. We ran into some trouble, but that was due to me having the wrong branch checked out. He walked me through how to switch it, and I think I am starting to get a better understanding of git. Once I was on the correct branch he gave me a more verbose option to run the scraper with so that it would be easier to debug any problems. Then he told me to look into piping and grep. Piping sends the output of one command to another command as its input, and grep searches that output for certain phrases. The command that I wrote to look for an id of '80871' is:
python src/multiscraper.py -vvv 2>&1 | grep '80871'
The id is '80871' because that is the id for the most recent Presidential election. That election appears first in the database, and for some reason the script skips the first page. After running this command a few times nothing showed up, which means grep did not find any instance of '80871' in the output of the scraper.
         After my lunch I tried the above command with an additional option, '-t 1', which alters the number of threads that run this process. For an unknown reason it worked, and I had all of the elections in .csv format. Since I was getting all of the data I needed, I closed that issue on GitHub. I then moved on to trying to debug what happened to me this morning when the scraper froze. I just ran it once all of the way through without a problem, so I am going to run it a few more times to make sure it was not a fluke.
        I ran the program 4 times, and only on the last run did it freeze near the end. During the 4th run my screen also turned off and locked, and I think the freeze has something to do with that; Marco explained it as the program losing its network connection, which makes sense. The only problem is that it does not happen every time.
         I spent about an hour trying to debug the issue that I filed this morning. I started piping the output into a log file so that I could search through it there instead of in the terminal. Finally I was able to capture the freeze in the log. It seems the bug is very rare and, in the grand scheme of things, very insignificant.
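Capturing the scraper's combined output in a log file can also be done from Python with subprocess; here is a minimal sketch. The real command would be `["python", "src/multiscraper.py", "-vvv"]` from earlier in the post; the one-liner below is a stand-in so the snippet can run anywhere:

```python
import subprocess
import sys

# Redirect both stdout and stderr into a log file, like `2>&1 > scraper.log`
# in the shell.  This command is a stand-in for the real scraper invocation.
cmd = [sys.executable, "-c", "print('fetched election id 80871')"]
with open("scraper.log", "w") as log:
    subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)

# Search the log afterwards instead of watching the terminal, like grep does.
with open("scraper.log") as log:
    matches = [line for line in log if "80871" in line]
print(matches)
```

Searching a saved log has the advantage that one run can be grepped for several different ids without re-running the scraper.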
        The next thing on my agenda is to try to combine the separate tables into a "Master Table" and to make tables for the most recent Presidential Elections. Then I will compare them in some sort of graph or chart to look for trends or patterns.
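A minimal sketch of that "Master Table" idea, assuming the per-election tables share the same columns (the schema and the 2011 row are placeholders, not real results; the 2007 rows mirror the totals from earlier) and using a literal year column to tell the sources apart:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Assumed shared schema; the 2011 vote total is a placeholder value.
tables = {
    "District_31_2007": [("Prince William County", "Republican", 4585),
                         ("Prince William County", "Democratic", 4540)],
    "District_31_2011": [("Prince William County", "Republican", 5000)],
}
for table, rows in tables.items():
    cur.execute(f"CREATE TABLE {table} "
                "(county_city TEXT, party TEXT, total_votes INTEGER)")
    cur.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)

# UNION ALL stacks the tables; the year column records which table
# each row came from, so trends across elections can be compared.
cur.execute("""CREATE TABLE Master_Table AS
    SELECT 2007 AS year, * FROM District_31_2007
    UNION ALL
    SELECT 2011 AS year, * FROM District_31_2011""")

count = cur.execute("SELECT COUNT(*) FROM Master_Table").fetchone()[0]
print(count)
```

UNION ALL (rather than UNION) keeps every row even if two elections happen to have identical totals, which matters when the goal is comparing turnout across years.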

Comments

  1. Great work! You made a lot of progress in several areas: submitting issues on the scraper script to the git repository so that they could be fixed, learning to use git and GitHub, and successfully obtaining the data you wanted.

    You were given this project because you expressed an interest in learning to use databases to do something real. I would say that you have spent enough time with the python script that scrapes the data. The remaining problems can be left to others in the future; since you have been able to get the data you wanted, it has already solved your current problem for you.

    Today and tomorrow, begin to structure your District 31 database. What do the records in the database look like? What fields do they have? How did you convert the CSV data into database tables? Now that you have the data, it is time to explore it to gain a better understanding of what it contains and how you would structure it to be most useful.

