Journal Reflection #2: Insights into your first ML project

Please use your reply to this blog post to detail the following:

  1. A full description of the nature of your first ML project.
  2. Some insights into what predictions you were able to make on the dataset you used for your project.
  3. What types of conclusions can you derive from the images/graphs you’ve created with your project?
  4. Did you create any formulas to examine or rate the data you parsed?

Take the time to look through the project posts of your classmates. If you saw any project or project descriptions that pique your interest, please reply or respond to their post with feedback. Constructive criticism is allowed, but please keep your comments civil.

9 Replies to “Journal Reflection #2: Insights into your first ML project”

  1. 1. For this project, I found a dataset of tweets from Democratic and Republican representatives in the U.S. House. I parsed each tweet, counted how many times each word was written by Democrats and by Republicans, and subtracted the frequency of each word’s use by Democrats from its frequency of use by Republicans. This gave each word a score based on whether a Republican or a Democrat was more likely to write it. I then computed scores for four representatives (whose tweets were left out of the training dataset) based on which words they used, with the goal of predicting their party from their tweets. Unfortunately, all of the representative scores were positive (Republican) because Republicans used fewer words more frequently than Democrats, so I was unable to draw a line between the parties.

    2. Because of substantial overlap between scores from different parties, I was unable to predict party allegiance based on a representative’s Twitter score.

    3. By creating a scatterplot of 237 representatives, with ideology scores (based on voting record) on the x-axis and my assigned Twitter scores on the y-axis, I found that within each party, there was little to no correlation between ideology scores and Twitter scores. However, the Republican Twitter scores were, on average, higher than the Democratic scores, showing that on net, there is some difference between the Twitter vocabularies of the two parties.

    4. I used this formula in order to assign scores to words:
    usesR = number of times word used by Republicans
    totalR = total number of words written by Republicans
    usesD = number of times word used by Democrats
    totalD = total number of words written by Democrats

    word score = (usesR/totalR) - (usesD/totalD)
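A minimal Python sketch of the word-score formula above. The whitespace tokenization, the function names, and the idea of summing word scores to get a representative's score are assumptions for illustration; the post does not say how the per-representative scores were aggregated.

```python
from collections import Counter


def word_scores(rep_tweets, dem_tweets):
    """Return {word: (usesR/totalR) - (usesD/totalD)} for every word seen."""
    # Count word uses per party (naive lowercase/whitespace tokenization, an assumption).
    uses_r = Counter(w for t in rep_tweets for w in t.lower().split())
    uses_d = Counter(w for t in dem_tweets for w in t.lower().split())
    total_r = sum(uses_r.values())
    total_d = sum(uses_d.values())
    if total_r == 0 or total_d == 0:
        raise ValueError("each party needs at least one word")
    # Positive score: relatively more Republican use; negative: more Democratic use.
    return {
        w: uses_r[w] / total_r - uses_d[w] / total_d
        for w in uses_r.keys() | uses_d.keys()
    }


def representative_score(tweets, scores):
    """Hypothetical aggregation: sum the scores of all words used (unseen words count as 0)."""
    return sum(scores.get(w, 0.0) for t in tweets for w in t.lower().split())
```

Because the two frequencies are each normalized by their own party's total, a party that concentrates its usage on fewer distinct words gives those words larger per-word frequencies, which is consistent with the all-positive scores described above.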

  2. 1. My first ML project used data about the different types of animal testing performed in all 50 states and ranked the states from worst to best.
    2. There are many different predictions I could make with this data. I decided to print a list of the states from worst to best for animal testing. Earlier I tried to plot a graph showing all of the states and their calculated scores, but there were too many entries and the state labels were unreadable.
    3. I was able to conclude the order of states from worst to best for animal testing.
    4. My formula to determine each state’s score for animal testing was: the total number of animals in the state put under pain during testing without any aid from drugs, multiplied by three, plus the number of animals in the state put under pain during testing but aided with drugs, multiplied by two, plus the number of animals in the state tested without any inflicted pain or any drugs used. I chose this algorithm based on my own opinion that testing with pain is more immoral than testing without it, and that using drugs when pain is inflicted is slightly more moral than not using drugs.
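The weighted score described above could be sketched in Python roughly as follows. The function name, the placeholder state names, and the counts are made up for illustration, and treating a higher score as "worse" is an assumption based on the weights.

```python
def state_score(pain_no_drugs, pain_with_drugs, no_pain):
    """Weighted count: 3x for pain without drugs, 2x for pain with drugs, 1x for no pain."""
    return 3 * pain_no_drugs + 2 * pain_with_drugs + no_pain


# Placeholder data: (pain without drugs, pain with drugs, no pain) per state.
states = {
    "State A": (10, 20, 30),
    "State B": (0, 5, 100),
}

# Assumed ordering: highest score first, i.e. "worst" to "best".
ranked = sorted(states, key=lambda name: state_score(*states[name]), reverse=True)
for name in ranked:
    print(name, state_score(*states[name]))
```

Printing a sorted list like this sidesteps the unreadable-label problem that a 50-bar graph ran into.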

  3. A full description of the nature of your first ML project.
    Some insights into what predictions you were able to make on the dataset you used for your project.
    What types of conclusions can you derive from the images/graphs you’ve created with your project?
    Did you create any formulas to examine or rate the data you parsed?

  4. Ok for real this time.

    1. The purpose of my first ML project is to predict the 2020 Senate election in North Carolina. It will do this by parsing through old elections, comparing them with trends from the elections before them using a formula, and finally using that formula to give as accurate a prediction as possible.
    2. I knew going into the project which states would lean which way on the political spectrum, so I think that influenced a lot of my bias going into the project. Once I narrowed my scope down from national to state-wide (in NC), however, I was able to infer more based on the data. Looking back at my graph now, I would predict that a Republican would win the Senate race. However, I strongly disagree based on the current state of politics surrounding the presidency. I would need different datasets to evaluate this, however.
    3. The conclusion I’ve drawn from the North Carolina graph is that it’s not a consistently partisan state, like New York is (I analyzed that graph separately and nearly all of the Senators since 1976 have been Democrats). This fits the state’s reputation as a “swing state.” When talking to Mr. Cochran, it became very clear that many of these elections are not based solely on partisanship, and that the human element does matter. For example, Kay Hagan in 2008 was able to roll over her Republican opponent since some of the arguments she made against her opponent were incredibly effective at turning voters towards her side. This really challenged my view of how partisan politics is today.
    4. I am working on a formula right now, as my project isn’t quite finished. Ms. McCain suggested that I use a piecewise function, but I might try to linearize the data to find a specific number that I can compare to election results.

  5. 1. My project analyzes UFC fighter stats and tries to predict the winner. I found an algorithm, trained it with fighter data, and got an accuracy score from test data. I also used the differences in stats between two fighters to train an algorithm and get an accuracy score.
    2. I can make predictions of the winner of a UFC fight.
    3. NA. I have no images.
    4. No. I used a model I found to analyze the data. I tried to make an algorithm to figure out what was being wrongly predicted but it did not work and was unnecessary to the function of the code.

  6. 1. The purpose of my project was to predict (via regression) the average sale price of avocados in the US based on the quantity sold and the type (organic/conventional). This was achieved using a 2-hidden-layer neural network.

    2. With enough tweaking of parameters of network size and its training, I was ultimately able to get in-test-data price prediction to an average of less than 10¢ off, and an average of less than 18¢ inaccuracy on out-of-dataset examples. For comparison, the standard deviation was around 40¢.

    3. The graphs plotting average price vs predicted average price show that although the network can identify a price trend well on training data, it tends to segment test data into two areas of “high price” and “low price” without representing an overall trend. To me, this suggests that either the network is still a bit too simplified/lacking in some property to allow it to capture the trend better, or the dataset is limiting its learning in either the number of available examples or in the feature data provided (i.e. avocado price cannot be predicted solely from purchase quantities).
    In addition, although not included with the final project, plotting actual average price vs day of the year reveals a zig-zag trend in prices in correlation with the seasons.

    4. For one, I used several formulas to make my data more usable.
    - Firstly, I rescaled it using mean normalization => (value - mean)/(range); this kept all numerical features strictly between -1 and 1.
    - I also tried, but did not ultimately use, z-score standardization => (value - mean)/(standard deviation)
    - For comparing conventional vs organic, I converted it to a numerical feature: -1 if conventional and +1 if organic (which was then rescaled the same way).
    - To get the date data, I used the datetime library to convert the dates in YYYY-MM-DD format to an integer day of the year (which was then rescaled the same way).
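The preprocessing steps in this list can be sketched in plain Python as below. The function names and the string labels `"conventional"` and `"organic"` are assumptions; only the formulas and the use of the `datetime` library come from the post.

```python
from datetime import datetime


def mean_range_scale(values):
    """Rescale with (value - mean) / (max - min); results fall between -1 and 1."""
    mean = sum(values) / len(values)
    span = max(values) - min(values)
    if span == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / span for v in values]


def encode_type(kind):
    """Map the avocado type to -1 (conventional) or +1 (organic)."""
    return {"conventional": -1.0, "organic": 1.0}[kind]


def day_of_year(date_string):
    """Convert a YYYY-MM-DD string to an integer day of the year (1 to 366)."""
    return datetime.strptime(date_string, "%Y-%m-%d").timetuple().tm_yday
```

Note that day of year treats December 31 and January 1 as far apart even though they are adjacent on the calendar, which may matter for a seasonal feature.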

    The formula used to assess the performance of my network in training was mean squared error (MSE), which was used as the loss function => avg[ (predicted - actual)**2 ]

    In testing, I expressed performance as the average absolute value difference between the predicted and actual value => avg[ abs(predicted - actual) ]
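As a rough illustration, the two metrics above can be written in a few lines of plain Python. The function names are chosen here for clarity and are not taken from the project code.

```python
def mean_squared_error(predicted, actual):
    """avg[(predicted - actual)**2], the training loss described above."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def mean_absolute_error(predicted, actual):
    """avg[abs(predicted - actual)], the test metric described above."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

MSE penalizes large misses more heavily, while the mean absolute error stays in the same units as the price, which is why it reads naturally as "cents off".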

  7. A full description of the nature of your first ML project.
    The intent of my project is to determine if a tweet is spam based on the words used. By using an embedding encoder with data from Google, I was able to encode the words of a tweet into numerical form, while preserving their meaning. This made it easy for the network to comprehend the tweets and make a decision.

    Some insights into what predictions you were able to make on the dataset you used for your project.
    I found that tweets involving crime-related words had a high probability of being marked as spam, as did short tweets and those beginning with “RT @username:”. Pro-Trump tweets also registered high, but his own tweets did not.

    What types of conclusions can you derive from the images/graphs you’ve created with your project?
    Again, words related to crime topped the charts in terms of spam probability score. This may have something to do with spam bots calling certain political candidates “criminals.”

    Did you create any formulas to examine or rate the data you parsed?
    Apart from creating the word ranking graph and embedding words into vector space, the program architecture was pretty simple, with sklearn doing most heavy lifting.

  8. 1. For my first machine learning project, I used two datasets that I found on Kaggle to predict NBA games. The first dataset had all of the games that were played, with information on which teams were playing, the date, the location, etc. Unfortunately, it did not include team record and other team-specific information like point differential. So I found another dataset that had each team’s stats on every day of the year. Combining these two files, I created a new data frame that included, for each game, the linear difference in wins between the two teams, the difference in point differential, and the difference in their wins over their last 10 games. Once I had this data frame, I used a number of models built into sklearn and found that the best algorithm was GaussianNB, with an accuracy of .658.
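A minimal sketch of how the three difference features might be built for one game. The dictionary keys, the function name, and the home-minus-away direction are all hypothetical, since the post does not name the columns or say which team is subtracted from which.

```python
def game_features(home, away):
    """Return [win difference, point-differential difference, last-10-wins difference]."""
    # Assumed convention: home team's stat minus away team's stat.
    return [
        home["wins"] - away["wins"],
        home["point_diff"] - away["point_diff"],
        home["last10_wins"] - away["last10_wins"],
    ]


# Placeholder team stats as of game day (illustrative numbers only).
home_team = {"wins": 30, "point_diff": 4.5, "last10_wins": 7}
away_team = {"wins": 25, "point_diff": -1.5, "last10_wins": 4}
print(game_features(home_team, away_team))
```

Rows built this way, one per game, could then be stacked into a feature matrix and passed with the game outcomes to a scikit-learn classifier such as `sklearn.naive_bayes.GaussianNB` via its `fit` method.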

    2. The most interesting insight that I found is that the predictor was in fact more accurate when including all of the data. I was worried that the predictions would be just as accurate using only point differential, but this shows that momentum and being able to close out tight games does lead to better future results.

    3. Graphics weren’t a large portion of my project, but I did create some box and whisker plots to see how the data was distributed. One thing that I found interesting was how spread out the point differentials were. I didn’t realize that there were so many NBA teams with average margin of victory gaps as big as 20.

    4. I did not use any formulas to determine accuracy besides the percent of correct estimates on test data.
