Text mining, sometimes called Text Analytics or Text Data Mining, is a way to analyze batches of text in natural language and gather statistical information.
The basic steps are:
Have a research question
Find the data
Digitize the data (if not born digital)
Clean the data
Analyze the data
Visualize the data
Many scholarly disciplines require deep and close analysis of text-based data, whether it be the digitized text of an 18th-century novel, a bank of blog entries, or qualitative data from a user feedback survey.
When analyzing your text-based data, what is your question? Are you analyzing for:
Word frequency (lists of words and their frequencies)
Grounded theory (a research methodology in which a hypothesis is reverse-engineered by coding the data after it is collected)
Collocation (words that commonly appear near each other)
Concordance (alphabetical list of the words in a text, usually shown with their surrounding context)
N-grams (common sequences of two, three, or more consecutive items, such as words, phonemes, syllables, or letters, which can then be checked for frequency and context)
Entity recognition (identifying names, places, time periods, etc.)
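Several of the analyses listed above can be sketched in a few lines of Python using only the standard library. The sample sentence, the function names, and the simple regular-expression tokenizer are assumptions made for illustration, not part of any particular tool. Counting adjacent word pairs is only a rough stand-in for proper collocation measures.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concordance(tokens, keyword, width=2):
    """List each occurrence of keyword with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical sample text, used only to exercise the functions.
sample = "The cat sat on the mat. The cat saw the dog."
tokens = tokenize(sample)

freq = Counter(tokens)                    # word frequency
bigram_freq = Counter(ngrams(tokens, 2))  # two-word n-grams

print(freq.most_common(2))         # [('the', 4), ('cat', 2)]
print(bigram_freq.most_common(1))  # [(('the', 'cat'), 2)]
for left, kw, right in concordance(tokens, "cat"):
    print(f"{left:>10} [{kw}] {right}")
```

Entity recognition is left out of this sketch because it generally depends on a trained model or a gazetteer rather than simple counting.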
WordHoard contains the entire canon of Early Greek epic in the original and in translation, as well as all of Chaucer, Shakespeare, and Spenser.
Texts born digital: email, blogs, etc.
Web Scraping: There are tools available that allow you to use websites as sources of data. ProPublica and School of Data have useful 'how to' guides.
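As a rough illustration of treating a web page as a data source, the sketch below pulls paragraph text out of HTML with Python's built-in html.parser module. The page content and the ParagraphExtractor class are hypothetical. A real project would first download the page and would likely rely on a dedicated scraping tool.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> elements, ignoring all other markup."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still belongs to the open paragraph.
        if self.in_paragraph:
            self.paragraphs[-1] += data

# Hypothetical page, inlined so the example runs without network access.
page = """
<html><body>
<h1>Hypothetical blog</h1>
<p>First post about <b>text mining</b>.</p>
<p>Second post about cleaning data.</p>
</body></html>
"""

parser = ParagraphExtractor()
parser.feed(page)
parser.close()
print(parser.paragraphs)
```

The heading is skipped on purpose: only the paragraph text is kept as data, which is the kind of selection a scraping tool lets you configure.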
If you are not using text that is already digitized, you can always digitize texts yourself. Be sure to check on copyright and fair use.
Once you find your blocks of text or series of texts, you need to prep them to be read by a computer.
We usually need to clean them and then save them in some sort of spreadsheet so they can be read systematically.
We often add metadata for each data source to save it in the spreadsheet logically.
Each source might have many parts, such as title, author, and abstract, so we need to tell the computer, via the spreadsheet, how to read those chunks.
We isolate the chunks by adding metadata and code.
We don't have to stop there. We can also isolate different parts, styles, structures, and so on by adding metadata and code.
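One way to picture the spreadsheet described above is a CSV file with one row per source and one column per chunk. The sketch below uses Python's csv module. The column names, the two sample records, and the in-memory buffer standing in for a file are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical sources; each part of a source becomes its own column.
sources = [
    {"title": "Hypothetical Article A", "author": "A. Author",
     "abstract": "Text mining of novels."},
    {"title": "Hypothetical Article B", "author": "B. Writer",
     "abstract": "Survey feedback, coded by theme."},
]
fieldnames = ["id", "title", "author", "abstract"]

# An in-memory buffer stands in for a file such as
# open("corpus.csv", "w", newline="") so the example leaves nothing on disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
for i, source in enumerate(sources, start=1):
    writer.writerow({"id": i, **source})  # the id column is added metadata

# Read the rows back and isolate a single chunk (the abstract) for analysis.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
abstracts = [row["abstract"] for row in rows]
print(abstracts)
```

Because every chunk sits in a named column, later steps can analyze abstracts alone, or group results by author, without re-parsing the original documents.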
We also need to reduce noise (typos, discrepancies, and unnecessary information) to make the text easier to read. We need to remove stop words (very common words such as "the" and "of"), reduce words to their dictionary form (lemmatization), and tell the computer to ignore certain variations (stemming).
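The cleaning steps above can be sketched roughly as follows. The stop word list and the suffix-stripping rule are deliberately tiny, made-up stand-ins. Established stemmers such as the Porter stemmer apply many more rules, and lemmatization is omitted because it generally needs a dictionary of word forms.

```python
import re

# Tiny illustrative stop word list; real lists contain hundreds of words.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "were"}

# Crude suffix list for a toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def clean(text):
    """Lowercase the text, replace anything that is not a letter or space, and split."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return text.split()

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

# Hypothetical sample sentence.
raw = "The readers were reading the novels, and the novels pleased the readers!"
tokens = remove_stop_words(clean(raw))
stems = [naive_stem(t) for t in tokens]

print(tokens)  # ['readers', 'reading', 'novels', 'novels', 'pleased', 'readers']
print(stems)   # ['reader', 'read', 'novel', 'novel', 'pleas', 'reader']
```

Note that a stem such as "pleas" is not a real word. Stemming only groups variants together, whereas lemmatization would return the dictionary form "please".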
There are many free tools available to help you with scrubbing, lemmatization, stemming, etc.