Finishing Up Protest Project

It is such a huge relief to be finishing up my Student Protests project. This project has taught me a lot about everything involved in DH projects. I got firsthand experience doing research, creating a database, and doing analysis.

The most challenging part of this project has been collecting the data, and I ran into many bumps in the road along the way. For one, I was using ProQuest as my only source for the articles. ProQuest is limited to articles that have been archived online through MSU’s partnership with associated press organizations, so if a press organization doesn’t update its own database with its most recent articles, ProQuest does not update either. To get a holistic sample that included current articles, I searched sites such as the Lansing State Journal and The State News for recent articles about student protests. Another thing I do not like about ProQuest is that search results are so specific to the search term. For example, there is a huge difference between searching Michigan State University and “Michigan State University”, and between Protest and Protests.
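To show why the quoted and unquoted searches behave so differently, here is a small Python sketch of all-words matching versus exact-phrase matching. It is only an illustration of the general idea and is not how ProQuest's search actually works; the headlines and function names are invented.

```python
import re

# Hypothetical headlines, invented for illustration only.
HEADLINES = [
    "Michigan State University students protest tuition hike",
    "State of Michigan funds new university program",
    "Protests continue at Michigan State University",
]


def words(text):
    """Lowercase a string and split it into plain words."""
    return re.findall(r"[a-z]+", text.lower())


def match_all_words(query, text):
    """Unquoted-style search: every query word appears somewhere in the text."""
    text_words = words(text)
    return all(w in text_words for w in words(query))


def match_phrase(query, text):
    """Quoted-style search: the query words appear together and in order."""
    q, t = words(query), words(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


loose = [h for h in HEADLINES if match_all_words("Michigan State University", h)]
exact = [h for h in HEADLINES if match_phrase("Michigan State University", h)]
singular = [h for h in HEADLINES if match_phrase("protest", h)]
plural = [h for h in HEADLINES if match_phrase("protests", h)]
```

In this toy example the unquoted search also picks up the headline about the State of Michigan, while the quoted search does not, and "protest" and "protests" each match a different headline.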

I am also limited by what is actually archived online. There are many sections in the MSU library that include news articles, but they are not archived online. The librarians informed me that the library has advanced OCR technology for archival purposes, but the waiting list is long and it is usually reserved for faculty and people with research grants. There is a whole chunk of data missing from the corpus for the years before the 1990s, including the women’s rights and Vietnam War protests. If I were to continue growing this project, I would take it upon myself to go through some of those protest articles and summarize them in my database. I am especially curious about how the Iraq War protests compare to the Vietnam War protests, since Vietnam was the last war in which we drafted soldiers.

I am using Google’s Fusion Tables for my analysis so I can filter the data and report only what pertains to my research questions, for example the number of people involved in only Anti-war protests. I also really like that Fusion Tables comes with many different charting options, such as pie charts, line graphs, and even network graphs. Using Fusion Tables I am also able to create a geographical map of where the different protests were located. For my project I am exploring how location affects the outcome of protests, or whether it does at all. To do this I am comparing the protests with the most people involved to their locations.
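The same kind of filtering and comparison can be sketched outside Fusion Tables. The Python below is a minimal illustration, not my actual workflow: the rows, field names, and numbers are all made up.

```python
from collections import defaultdict

# Hypothetical rows; locations, reasons, and counts are invented.
PROTESTS = [
    {"location": "Location A", "reason": "Anti-war", "people": 200},
    {"location": "Location B", "reason": "Anti-war", "people": 50},
    {"location": "Location A", "reason": "Tuition", "people": 120},
    {"location": "Location A", "reason": "Anti-war", "people": 75},
]


def filter_by_reason(rows, reason):
    """Keep only the protests logged with the given reason."""
    return [r for r in rows if r["reason"] == reason]


def people_by_location(rows):
    """Total the number of people involved at each location."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["location"]] += r["people"]
    return dict(totals)


anti_war = filter_by_reason(PROTESTS, "Anti-war")
totals = people_by_location(anti_war)
largest = max(PROTESTS, key=lambda r: r["people"])
```

Here `totals` gives the number of people at each location for Anti-war protests only, and `largest` is the single protest with the most people involved, which can then be compared against its location.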

Overall I think that this project would be a great one to include in my portfolio. It will be a great display of all the concepts we have learned through the semester, such as researching, data analysis, database creation, mapping, and summarization of results. I am also learning a lot by making my results public. Because other people will interpret the data, I must make it as clear as possible and set clear definitions for anything that could be misconstrued.

Continuing With the Protests; Limited Data

As I am building my database for the protests on MSU’s campus, there is a lot that I have to consider. As I was sifting through the articles I noticed that there were no articles from before the 1980s. This excludes many important protests, including all the Vietnam War protests. Doing some more research, I found that the Vietnam War was an extremely hot topic on campus. There were dozens of protests and demonstrations on campus in solidarity with the Anti-War movement. In fact, the one major factor that separated MSU and UoM at the time was their stance on the war, MSU being pro and UoM being anti.

I found that there was a large section in the library that included artifacts and articles about MSU’s involvement in the war, but unfortunately these are not archived electronically. I am finding information about these protests in more recent articles that refer to the protests, but I am hesitant to add these to the database without the original source.

Also to my disappointment, ProQuest includes articles that are archived, but current articles are not added to the database, so the most recent protests are not included, such as the Ferguson protests and the protests about consent on campus. Because these articles are very important to my research question (what students protest today vs. what they protested in the past), I might have to just collect them from different sources. In that case, I would probably search through the Lansing State Journal website, knowing it would have the most relevant information.

I am also finding that some of my definitions might be vague, for example the difference between protests, demonstrations, and pickets. To be clear about my classifications, I decided to define them as I go and include these definitions in the description of my project. I am also choosing to make the database public so that if the project goes further in the future, students will be able to add to it.

I am definitely getting firsthand experience with collecting data, and with the lack thereof. It has become the most challenging part of the class. It makes me think about all the projects and research questions that could be pursued if all the information were already available. I hope that one day, with the advancement of technology, everything will inherently be archived electronically.

Collecting Data and Building a Database

I am really excited to get going on my final project. I decided to do something a little different than the mission statements. I am hoping this will be a good project to showcase my research, digital mapping, database, and analysis skills. I think it will be a lot of fun to research something I have an interest in and can inform people about the history of important events concerning students on campus.

I want to research protests and demonstrations that occurred on campus. Using articles found on ProQuest, I am building a database of the protests that I will eventually map using Google’s Fusion Tables. By mapping out the protests we will be able to see where student protests begin and whether there are places on campus that have a bigger impact on the message being delivered. I can also use data summarizations to research what kinds of protests were the biggest and what kinds of protests turn violent.

Going into it, I have a couple of research questions in mind such as:

What kinds of events did students protest in our past versus what students protest now?
What kinds of protests turn violent?
Where do students convene to convey their message?
Does location have an effect on the outcome of people involved?

Having my research questions in mind made it a lot easier to create a database, because I had an idea of what kind of data I’ll need to collect. So far my database categories include:

Location of protest
Date of protest
Reason for protest
Event (speaker, university decision, to raise awareness, etc.)
Type of protest (picket-line, riot, demonstration, etc.)
People Involved
People Arrested
Link to article citation
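The categories above could be written down as a simple record type. This Python sketch is just one way to picture a row of the database; the field names, the helper, and the example values are all made up and do not come from a real article.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class ProtestRecord:
    """One row of the protest database, mirroring the categories listed above."""
    location: str                    # point of origin of the protest
    protest_date: date
    reason: str
    event: str                       # speaker, university decision, etc.
    protest_type: str                # picket-line, riot, demonstration, etc.
    people_involved: Optional[int]   # None when the article gives no number
    people_arrested: Optional[int]
    citation_link: str


def missing_fields(record: ProtestRecord) -> List[str]:
    """List the fields an article did not provide (stored as None or empty)."""
    return [f.name for f in fields(record)
            if getattr(record, f.name) in (None, "")]


# Invented example values, for illustration only.
example = ProtestRecord(
    location="Example Hall",
    protest_date=date(2000, 1, 1),
    reason="Example reason",
    event="university decision",
    protest_type="demonstration",
    people_involved=None,
    people_arrested=0,
    citation_link="https://example.com/article",
)
```

Allowing `None` for the counts leaves room for articles that never say how many people took part, and `missing_fields(example)` makes those gaps easy to spot.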

As much as I would love to have a full database I think it would be impossible to read all the articles in time. This is also something I have to keep in mind when choosing which events to include. I am hoping that I have a corpus of at least 40 articles. I have a little over 20 articles logged so far. I am trying to have a representative sample by using a variety of dates and types of protests. I believe that it will be a good start to the project.

I have already run into some obstacles with the data. Some of the articles aren’t as specific as others and do not include specific locations. I also have to decide whether to include the location where the protests originated or where they ended up; for consistency I decided to stick with the point of origin. I also decided to include only protests that involved Michigan State University students.

So far I think I am off to a good start and am excited to see the outcomes!

Planning for Final Project

Planning for the final project has proven difficult. This is not because of a lack of inspiration or research questions, but because of the available data. I was really surprised to find that the University did not have an electronic database of all its past mission statements. I am sure that after some more digging we would be able to find them dust-covered in the library, but that would require us to archive them all by hand. So we were left with the data made available to us online. Using the Wayback Machine I was able to find the mission statements dating back to 1997, but since then our mission statement has only changed once, giving us only two mission statements to work with. It really made me think about all the information we have available about our past and how much is getting lost by not being archived electronically. My understanding of Digital Humanities is that it involves the analysis of a large collection of data: the difference between close reading a small sample and distant reading a large sample to draw comprehensive conclusions, as Matthew Jockers points out in Macroanalysis. So it did not really make sense to use topic modeling or related tools on such a small corpus.

My group and I then had the idea to use macroanalytic tools with a larger corpus, such as articles. We then explored the databases of articles that were available. These articles were all on databases hosted by ProQuest. The way ProQuest is set up, you have to choose each article individually to export it, and even then it exports as a PDF, meaning that to topic model a set of articles we would also have to convert each PDF into a .txt file, another extremely tedious and time-consuming task.

All the difficulty of trying to find a corpus has proven the importance of the librarians and digital humanists who make the effort to open source and archive information about our past electronically for anyone to use and ask questions about. I definitely have a newfound respect for those who painstakingly archive every word found in old newspapers. I can only hope that future OCR and related technology will make it easier for students and scholars to ask questions about our past.

As for the final project, my group and I have more brainstorming to do and hopefully we can come up with a good question to ask related to the mission statements.


Digital Mapping With Fusion Tables

I chose to explore Google’s Fusion Tables when doing my research. The corpus of data included information about Wifi in and around New York City. I really enjoyed learning about Google’s digital mapping tool and found it really easy to use and get started with.

I didn’t start with a research question in mind, so I decided to play around with the data and tools to see what kind of features I would be able to work with. I was able to pull the data right from my Drive and import it into Fusion Tables. I first ran into a problem when Google only geolocated my ‘City’ column and not the actual addresses of the Wifi locations. Without the addresses, I wouldn’t be able to fully map the corpus. I figured out that I had to use the column settings to change the ‘Address’ column from text to location. This worked, but it took a long time to geocode without the cities included in the addresses. This can be avoided in the future by using text functions in Excel to combine the two columns.
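Combining the two columns before importing can also be sketched in a few lines of Python instead of Excel. This is only an illustration: the sample rows and the ‘Full Address’ column name are made up, and the real file may have different columns.

```python
import csv
import io

# Hypothetical sample of the Wifi data; names and addresses are invented.
SAMPLE_CSV = """Name,Address,City
Example Cafe,123 Example St,New York
Example Books,45 Sample Ave,Brooklyn
"""


def add_full_address(rows, address_col="Address", city_col="City"):
    """Return copies of the rows with the address and city joined in one column."""
    combined_rows = []
    for row in rows:
        combined = dict(row)
        parts = [row[address_col].strip(), row[city_col].strip()]
        # Skip empty parts so a missing city does not leave a dangling comma.
        combined["Full Address"] = ", ".join(p for p in parts if p)
        combined_rows.append(combined)
    return combined_rows


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
geocodable = add_full_address(rows)
```

The new column holds values like "123 Example St, New York", which gives a geocoder the city along with the street address.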

I first noticed that I was able to filter the data and sort it by value or by count. I decided to filter by City first, starting with New York because it had the most locations logged. I was curious what the difference was between ‘fee-based’ and ‘free’ Wifi. I noticed that there were corporations in both categories, and I was curious why some corporations had fees while others did not; my assumption was that the fee was included in a purchase at the store. My research question is:

Which major corporations offer free Wifi in the major New York cities, as opposed to those that offer Wifi for a fee?

The major cities I chose to examine were the ones with the highest count: New York, Brooklyn, and the Bronx.

The corporations in New York that offered the majority of the free Wifi were Cosi (15), Barnes and Noble (14), and Whole Foods (5).

[Screenshot: free Wifi counts by corporation in New York]


In Brooklyn the only corporation that offered free Wifi was Barnes and Noble (3). I decided to use the search feature to look for Cosi and Whole Foods. There were no Cosi locations, but I was able to see that there was one Whole Foods store that offered Wifi, although it was fee-based. That led me to wonder why Whole Foods stores offered free Wifi in New York but fee-based Wifi in other places.

And in the Bronx there was no one corporation that offered the majority of the free Wifi; every provider had a count of one.
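The city-by-city counts above came from filtering in Fusion Tables, but the same tally can be sketched in Python. The rows and provider names below are invented and are not the real Wifi data.

```python
from collections import Counter

# Hypothetical hotspot rows; providers and types are invented.
HOTSPOTS = [
    {"city": "New York", "provider": "Provider A", "type": "Free"},
    {"city": "New York", "provider": "Provider A", "type": "Free"},
    {"city": "New York", "provider": "Provider B", "type": "Free"},
    {"city": "Brooklyn", "provider": "Provider B", "type": "Free"},
    {"city": "Brooklyn", "provider": "Provider C", "type": "Fee-based"},
    {"city": "New York", "provider": "Provider C", "type": "Free"},
]


def free_counts(rows, city):
    """Count free Wifi locations per provider within one city."""
    return Counter(r["provider"] for r in rows
                   if r["city"] == city and r["type"] == "Free")


def providers_with_both(rows):
    """Providers that are free in some places and fee-based in others."""
    free = {r["provider"] for r in rows if r["type"] == "Free"}
    paid = {r["provider"] for r in rows if r["type"] == "Fee-based"}
    return sorted(free & paid)


new_york = free_counts(HOTSPOTS, "New York")
brooklyn = free_counts(HOTSPOTS, "Brooklyn")
mixed = providers_with_both(HOTSPOTS)
```

The `providers_with_both` helper is aimed at the kind of question raised by the fee-based Whole Foods: it lists any provider that shows up as free in one place and fee-based in another.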

Although I cannot make any definitive conclusions, I think this is a good start to a larger question about which companies offer free Wifi and how location affects those companies’ decisions.


Topic Modeling

I loved getting hands-on experience doing topic modeling. I have seen examples of topic modeling, but I had not done any myself. I thought that the Grange Visitor corpus was a good example to use because we knew very little about it, leaving the topic modeling to speak for itself. Although we did not know the Grange’s content in detail, by examining the files for consistencies we were able to draw some conclusions up front, which made the end analysis easier to understand. By examining the files beforehand and following the advice of Jennifer Vinopal in her article Biases and errors in our tools, we were better able to sort the files after concluding that the file names were indicative of the publishing date and that the articles’ OCR was not spot on, leaving some words out of the analysis due to error. Knowing these things can affect the conclusions we draw.

Although the corpus is not 100% accurate, our sample is large enough that the OCR errors should not have a great effect on our topic model. Because I did not know much about the Grange newspaper, I decided to start by analyzing the results, looking for patterns that would lead me to some conclusions. I started by opening the CSV files in Excel and using the ‘countif’ function to get a total count of each topic ID. I noticed that the order in which the topics were listed was not exactly indicative of their frequency. Since I was totaling the topics in the TopicsInDocs file, I could see which topics were used in the majority of the documents. From this I was able to see that the top three topics were 1, 2, and 7. Some words that jumped out at me were home, farm, business, sister, brother and money. These words seem sort of opposing to me. On one hand, the most frequent topic includes words that have to do with farm business, profits, and committees, while coming up close behind are topic words that have to do with families and homes. To me this means that the newspaper could have included articles about the business of the farm as well as articles that were focused on homemaking and family. That makes me wonder what kind of audience would have read the newspaper; back in the 70s and 80s I would imagine that farmers and housewives had differing interests. This also makes me wonder whether the topics in the articles could have shifted as time progressed.
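The ‘countif’ step can be sketched in Python with a counter. The sample below is a simplified stand-in with one topic ID per document; the real TopicsInDocs file has its own layout, so the column names here are assumptions.

```python
import csv
import io
from collections import Counter

# Hypothetical, simplified stand-in for the TopicsInDocs CSV.
SAMPLE = """filename,topic_id
doc_a.txt,1
doc_b.txt,2
doc_c.txt,1
doc_d.txt,7
doc_e.txt,1
"""


def count_topics(csv_text, topic_col="topic_id"):
    """Count how many documents list each topic ID (like COUNTIF per topic)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(int(row[topic_col]) for row in reader)


topic_counts = count_topics(SAMPLE)
ranked = topic_counts.most_common()  # topics ordered from most to least frequent
```

Ranking by count rather than by topic ID is what shows that the order the topics are listed in does not match how often they appear.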

I then decided to use text functions in Excel to separate the year from the file name, then used a pivot table to sort the data into a readable form. I sorted the pivot table so I could see the count of each topic in each year.
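Here is a rough Python sketch of the same two steps: pulling the year out of each file name and then counting topics per year. The file names and years are invented, and the pattern assumes the year is the first four-digit number in the name, which may not match the real files.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (file name, topic ID) pairs; names and years are invented.
ROWS = [
    ("sample_1875_01.txt", 1),
    ("sample_1875_02.txt", 1),
    ("sample_1875_03.txt", 6),
    ("sample_1882_01.txt", 2),
    ("sample_1882_02.txt", 6),
    ("no_year_here.txt", 8),
]

# Assumption: the year is the first standalone four-digit number in the name.
YEAR_PATTERN = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def extract_year(filename):
    """Return the year found in the file name, or None if there is none."""
    match = YEAR_PATTERN.search(filename)
    return int(match.group(1)) if match else None


def pivot_topics_by_year(rows):
    """Build a year -> {topic ID: count} table, like a pivot table."""
    table = defaultdict(Counter)
    for filename, topic_id in rows:
        year = extract_year(filename)
        if year is not None:  # skip files whose name has no year
            table[year][topic_id] += 1
    return {year: dict(counts) for year, counts in sorted(table.items())}


pivot = pivot_topics_by_year(ROWS)
```

Each entry in `pivot` reads like one row of the pivot table: a year followed by the count of each topic in that year.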

[Screenshot: pivot table chart of the count of each topic in each year]

This gave me great insight as to which topics were used the most in which years. From the chart I was able to see that:

Topic number one (grange state order granges members master visitor committee business secretary) was most prevalent in the 70s, dropping off completely by the late 80s.

Topic number two (good bro grange day kalamazoo home time sister cobb brother) was most prevalent in the early 80s.

Topic number seven (grange work good farm year ing money home mrs send) was used most in the 90s.

The most consistently used topics were 6 (farmers work men great farmer state good people school agricultural) and 8 (time day man make life made water world money land).


Although my research is just a rough analysis of the corpus, I was able to come to the conclusion that the Grange newspaper’s articles included content about business and farming early on, and that as time went on the topics became more about family and home, with the topics about money and agriculture staying consistent throughout.

Gathering Data

Gathering data is the first step of any research project. Whether or not you know your research question beforehand, the data contains the conclusions you are looking for. The hard part is knowing how to use the data. This was perfectly displayed when we started our research project. Although we had a data set in mind (mission statements), we found it difficult to come up with an approach to analyze the data effectively. We had access to the mission statements, but we struggled with what we wanted to know about them, what data we wanted to compare them to, and whether we had access to that data.

The explosion of analysis and criticism in the humanities is largely due to access to thousands, even millions, of data sets made available over the internet. The vast amount of data available almost hinders me when asking questions, because it can be hard to pinpoint what I want to use. Even texts from hundreds of years ago are made available online with the use of OCR. We are starting to see an age where more and more collections are archived online. And we’re starting to reach a point where information created online or on a computer is inherently archived and saved without us even having to think about it. (Google has enormous amounts of information about all searches and users that it stores indefinitely.)

Although all this information is available online, that does not mean it is in the most useful format. As our presenter pointed out, there is a difference between structured data and unstructured data. For many researchers structured data is far better to use because everything is neatly organized in a database with metadata attached. Even then, as Stephen Ramsay points out, not all databases are created equal. Some are more organized than others, and some may contain the same information but be structured in completely different ways. This is why you have to understand the database from the author’s perspective in order to use it effectively.

I did not think that the beginning stages of a research project could be so difficult, but it makes sense, since this is possibly the most important stage. In this stage we collect the data needed and learn how it is structured, elements that come into play through the rest of the project. I learned that although you may have a research question and the necessary tools to answer it, it is important to ask yourself how this information is important or relevant, and how and why a user would use it. I also think that it is important to recognize the possibilities that data analysis might hold and not to limit yourself to your early research question.