
Digital Humanities: Intro to Text Analysis & Data Mining

Text Analysis & Data Mining

Text mining, sometimes called text analytics or text data mining, is a way to analyze batches of natural-language text and gather statistical information about them.

The basic steps are:

  1. Have a research question
  2. Find the data
  3. Digitize the data (if not born digital)
  4. Clean the data
  5. Analyze the data
  6. Visualize the data

Many scholarly disciplines require deep and close analysis of text-based data, whether it be the digitized text of an 18th-century novel, a bank of blog entries, or qualitative data from a user feedback survey.

When analyzing your text-based data, what is your question? Are you analyzing for:

  • Micro/Macro Analysis
  • Visual Analysis
  • Word frequency (lists of words and their frequencies)
  • Word trends
  • Comparative analysis
  • Grounded theory (a research methodology in which hypotheses are derived from the data by coding it after it is collected, rather than stated in advance)
  • Collocation (words that commonly appear near each other)
  • Concordance (an alphabetical list of the words in a text, each shown with its surrounding context)
  • N-grams (sequences of two, three, or more consecutive words, phonemes, syllables, or letters, which can then be checked for frequency and context)
  • Entity recognition (identifying names, places, time periods, etc.)
  • Sentiment analysis
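Several of the analyses in the list above (word frequency, n-grams, and concordance) can be tried with a few lines of Python. The following is a minimal sketch using only the standard library, not a replacement for a dedicated text analysis tool. The sample sentence and the function names `tokenize`, `ngrams`, and `concordance` are illustrative assumptions and do not come from any particular package.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, width=2):
    """List each occurrence of keyword with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Made-up sample text, used only for illustration.
sample = "The cat sat on the mat. The cat saw the dog."
tokens = tokenize(sample)

word_freq = Counter(tokens)               # word frequency list
bigram_freq = Counter(ngrams(tokens, 2))  # two-word n-grams (bigrams)
kwic = concordance(tokens, "cat")         # keyword-in-context lines

print(word_freq.most_common(3))
print(bigram_freq.most_common(1))
for left, word, right in kwic:
    print(f"{left:>10} [{word}] {right}")
```

Counting bigrams is also a simple starting point for collocation, since pairs of words that appear next to each other unusually often will rise to the top of the bigram counts.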

Sources for texts already digitized:

If you are not using text that is already digitized, you can always digitize texts yourself. Be sure to check on copyright and fair use. 

Once you have found your blocks or series of text, you need to prepare them to be read by a computer.

  • We usually need to clean the text and then save it in a structured form, such as a spreadsheet, so it can be read systematically.
  • We often add metadata for each data source so that it is stored logically in the spreadsheet.
  • Each source might have many parts, such as title, author, and abstract, so we need to tell the computer, via the spreadsheet, how to read those chunks.
  • We isolate the chunks by adding metadata and code.
  • We don't have to stop there. We can also isolate different parts, styles, or structures within a text by adding further metadata and code.
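  • We also need to reduce noise (typos, discrepancies, and unnecessary information) to make the text easier to read. Typically we remove stop words (very common words such as "the" and "of") and tell the computer to treat variant forms of a word as the same word, either by reducing each word to its dictionary form (lemmatization) or by trimming word endings (stemming).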
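The preparation steps above can be sketched in Python using only the standard library. This is a rough illustration under stated assumptions: the stop-word list is deliberately tiny, `crude_stem` is a toy suffix stripper rather than a real stemming algorithm, and the two sources with their titles and authors are hypothetical. It writes to an in-memory buffer. A real project would write to a file and would probably use an established stemmer or lemmatizer.

```python
import csv
import io
import re

# Tiny illustrative stop-word list; real projects use a much longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "was", "were"}


def crude_stem(word):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def clean(text):
    """Lowercase, drop punctuation and stop words, then stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


# Hypothetical sources: metadata fields plus the raw text of each one.
sources = [
    {"title": "Sample One", "author": "A. Writer",
     "text": "The dogs were barking in the garden."},
    {"title": "Sample Two", "author": "B. Author",
     "text": "A dog barked and walked to the gardens."},
]

# Write one spreadsheet row per source, with metadata columns and cleaned text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "author", "clean_text"])
writer.writeheader()
for src in sources:
    writer.writerow({
        "title": src["title"],
        "author": src["author"],
        "clean_text": " ".join(clean(src["text"])),
    })

csv_text = buffer.getvalue()
print(csv_text)
```

After cleaning, "dogs" and "dog", or "barking" and "barked", end up as the same token, which is what lets later frequency counts treat them as one word.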

There are many free tools available to help you with scrubbing, lemmatization, stemming, etc.

http://wheatoncollege.edu/lexomics/tools/

Choose from the many free text analysis tools listed on this page. 

 

Text Analysis and Visualisation: A comparison of different tools. 

 

7 Things you should know about data visualization: A short article published by Educause, 2009.

Related Guides:

GIS Research Guide

WMU ScholarWorks Page


TED Talk on Google N-grams: What We Learned From 5 Million Books

Sentiment Analysis: Obama's Victory Speech
