About Text Analysis and Semantics
What is the content of a text? Or to be more precise: what are the core elements which must be identified in order to grasp the essential meaning of a text?
Whether it is a press article, a book, a speech, or any other sequence of language, every text contains a few key sentences conveying the ideas that make up its framework: its skeletal structure. The problem, then, is to locate this central core of the text that holds the essentials of its meaning. This is the crucial first step, before any attempt at interpretation can be made.
We can say that a text consists of various worlds in which different actors do, form or say various things in combination with other actors. And we can say that these worlds ‒ which are invariably propositional in form ‒ have different levels of importance in the structure of the text. And finally, we can say that some of these worlds ‒ a very small proportion of them ‒ constitute the foundations of the text, in that if they were removed, the “textual construction” would collapse, and the meaning would be lost.
Content analysis, then, applies a set of techniques to a given text to determine:
- the identity of the main actors,
- the relations in which they stand to each other,
- the hierarchy of these relations and how they evolve.
To sum up, content analysis consists in revealing the framework of a text, i.e. its meaning. This necessarily implies two things. First, there must be a theoretical conception of the text: this must describe both the textual organization of the things that are said and the structural organization of the thought-processes of the people who say them. Second, it implies the use of a tool that is derived from this theoretical conception and that rigorously excludes the subjectivity of the investigator – at least until the analysis is finished.
Natural Language Processing and Semantics
To analyze a text, the semantic engine operates in six stages:
- sentence and proposition hashing,
- ambiguity solving (with respect to the words of the text),
- identification of equivalent classes (senses),
- statistics, detection of bundles and episodes,
- detection of the most characteristic parts of text,
- layout and display of the result.
Words are grouped together in several main word categories. Among these, six are of interest to us:
- verbs,
- connectors (conjunctions, conjunctive phrases),
- modalities (adverbs or adverbial phrases),
- qualifying adjectives,
- personal pronouns,
- substantives and proper nouns.
An analysis is highly complex. During the process, the software will:
- assign all the significant words to the above categories;
- analyze their distribution into subcategories (word categories, equivalent classes, see below);
- examine their occurrence order, both within the propositions (relations, actants and acted) and throughout the text (distribution graph, bundles, episodes, most characteristic parts of text, see below).
Propositional Hashing (Lexical Analysis)
To simplify the analysis, the semantic engine divides the text into propositions (simple sentences). This first stage is based on a scrutiny of the punctuation, and on a complex process of syntax analysis, which will not be detailed here. This yields highly reliable co-occurrence statistics (relations), since it is not possible for two words to fit into the same grammatical proposition if they are not closely connected. Propositional hashing is bound to involve errors (propositions that are either too short or too long), but this does not affect the results.
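As a rough illustration of the punctuation-based part of this stage only, the sketch below splits a text on punctuation marks and counts the word pairs that share a proposition. The function names `split_propositions` and `cooccurrences` are hypothetical, and the syntax analysis performed by the actual engine is not reproduced here.

```python
import re
from collections import Counter
from itertools import combinations

def split_propositions(text):
    """Naively split a text into propositions on punctuation marks only."""
    parts = re.split(r"[.!?;:,]+", text)
    return [p.strip() for p in parts if p.strip()]

def cooccurrences(propositions):
    """Count unordered word pairs that appear in the same proposition."""
    counts = Counter()
    for prop in propositions:
        # Each distinct word counts once per proposition; sorting makes pairs canonical.
        words = sorted(set(re.findall(r"[a-z']+", prop.lower())))
        counts.update(combinations(words, 2))
    return counts

text = "The minister spoke, and the press listened. The minister left."
props = split_propositions(text)
pairs = cooccurrences(props)
print(props)
print(pairs[("minister", "the")])
```

Because pairs are counted per proposition rather than per sentence, "minister" and "press" are not linked here even though they occur in the same sentence.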
Ambiguity Solving (Semantic and Lexical Analysis)
The automatic interpretation of words in any living language, either written or spoken, requires the solving of numerous ambiguities:
- grammatical and syntactic (the word “pound”, for example, corresponds to a noun in “a pound of fish”, and to a verb in “Pound the spices into a paste”),
- semantic (the noun “pound”, for example, can refer to a weight or to a currency).
On average, one word out of four is deemed ambiguous.
One of the main functions of this software is to solve these ambiguities by means of a set of artificial intelligence problem-solving algorithms. Though a perfect result is impossible to achieve, the error rate is low enough to guarantee an accurate analysis of your text.
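To give a feel for semantic ambiguity solving, here is a toy sketch that picks a sense for “pound” by counting context words shared with a small cue list per sense. This is a simplified context-overlap heuristic in the spirit of the Lesk algorithm, not the method used by the software; the cue words and the name `guess_sense` are assumptions made for illustration.

```python
import re

# Hypothetical cue words for each sense of "pound"; purely illustrative.
SENSES = {
    "weight": {"fish", "flour", "weigh", "butter", "ounce"},
    "currency": {"price", "pay", "cost", "sterling", "bank"},
}

def guess_sense(sentence, senses=SENSES):
    """Return the sense whose cue words overlap most with the sentence, or None."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    overlaps = {sense: len(words & cues) for sense, cues in senses.items()}
    best = max(overlaps, key=overlaps.get)
    # With no cue word in the context, the ambiguity stays unresolved.
    return best if overlaps[best] > 0 else None

print(guess_sense("a pound of fish"))
print(guess_sense("The price is one pound"))
```

A sentence with no cue word at all returns `None`, which reflects the point made above: a perfect result is not achievable, and some cases remain undecided.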
Word Categories (Semantic Analysis)
Verbs may be:
- factive, i.e. expressing actions (“to do”, “to work”, “to walk”, etc.)
- stative, i.e. expressing states or concepts of possession (“to be”, “to stay”, etc.)
- declarative, i.e. expressing a statement about circumstances, beings, objects, etc. (“to think”, “to believe”, etc.)
- performative, i.e. expressing an act through the language and contained within the language (“I promise”, “I demand”, etc.)
Connectors (coordinating and subordinating conjunctions, conjunctive phrases) link together various parts of the discourse through concepts of:
- condition (“if”, “in as much as”, “in case”, etc.),
- cause (“because”, “in consequence of”, “therefore”, etc.),
- goal (“so that”, “in order that”, etc.),
- addition (“and”, “moreover”, “along with”, etc.),
- disjunction (“or”, “either… or”, “whether… or”, etc.),
- opposition (“but”, “nevertheless”, “however”, etc.),
- comparison (“as”, “like”, “as well as”, etc.),
- time (“when”, “while”, “since”, etc.),
- place (“whereby”, “whereupon”, etc.)
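The connector subcategories above can be pictured as a lookup table. The sketch below, which assumes a tiny dictionary built only from the examples listed here, tags the connectors found in a proposition, trying the longest phrases first so that “so that” is not read as two separate words. The name `tag_connectors` is hypothetical, discontinuous forms such as “either… or” are left out, and no ambiguity solving is attempted.

```python
import re

# Example connectors taken from the list above; a real dictionary would be far larger.
CONNECTORS = {
    "condition": ["if", "in as much as", "in case"],
    "cause": ["because", "in consequence of", "therefore"],
    "goal": ["so that", "in order that"],
    "addition": ["and", "moreover", "along with"],
    "disjunction": ["or"],
    "opposition": ["but", "nevertheless", "however"],
    "comparison": ["as", "like", "as well as"],
    "time": ["when", "while", "since"],
    "place": ["whereby", "whereupon"],
}

# Map each phrase, as a tuple of tokens, to its subcategory.
LOOKUP = {tuple(p.split()): cat for cat, phrases in CONNECTORS.items() for p in phrases}
MAX_LEN = max(len(key) for key in LOOKUP)

def tag_connectors(proposition):
    """Return (connector, subcategory) pairs in order of appearance."""
    tokens = re.findall(r"[a-z]+", proposition.lower())
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(MAX_LEN, 0, -1):
            key = tuple(tokens[i:i + n])
            if key in LOOKUP:
                found.append((" ".join(key), LOOKUP[key]))
                i += len(key)
                break
        else:
            i += 1
    return found

tags = tag_connectors("If it rains, we stay; however, we work so that the report is done.")
print(tags)
```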
Personal pronouns are displayed in person (“I”, “you”, “he”, etc.) and in number (“they”, “we”, etc.). The archaic English “thou” form is also detected here.
Modalities (adverbs or adverbial phrases) enable speakers to involve themselves in what they say, or to locate what they say in time and space, through concepts of:
- time (“now”, “yesterday”, “tomorrow”, etc.),
- place (“there”, “here”, “below”, etc.),
- manner (“directly”, “together”, etc.),
- assertion (“absolutely”, “certainly”, etc.),
- doubt (“maybe”, “probably”, etc.),
- negation (“not”, “never”, “nothing”, etc.),
- intensity (“very”, “much”, “strongly”, etc.)
Adjectives may be:
- objective, i.e. describing beings or objects, regardless of the speaker’s standpoint (color adjectives, for example),
- subjective, i.e. indicating judgment on something or on somebody, thus enabling expression of the speaker’s standpoint (“beautiful”, “small”, “nice”, etc.),
- or numeral, i.e. grouping together numbers (in letters or in figures), along with ordinal and cardinal adjectives.
Other word categories include pronouns, articles, prepositions and non-qualifying adjectives. You do not need to take these categories into account, as they are used only for ambiguity solving.
Use of Word Categories (Text Analysis)
Broadly speaking, we can say that:
- time and place connectors and modalities provide the means to locate the action,
- intensity and negation modalities provide the means to dramatize the discourse,
- cause and condition connectors provide the means to construct a chain of reasoning,
- addition connectors provide the means to enumerate facts or characteristics,
- opposition connectors more specifically provide the means to argue, to put things into perspective and to set out conflicting standpoints.
Statistical, Probabilistic and Cognitive Analyses (Semantic Analysis)
Tropes performs different types of text analyses:
- statistics on the total frequency of occurrence of the main word categories and of their subcategories,
- statistics on the co-occurrence and connection rates of equivalent classes and word categories,
- a probabilistic analysis of the words occurring in bundles,
- a geometric analysis of the bundles delimiting the episodes,
- a cognitive-discursive analysis (CDA), by which the most characteristic parts of a text can be detected.
The statistics are used for various purposes, among others to create the graphs and set out the results.
The frequent word categories and the text style are obtained by comparing the distribution of the occurrence frequency of the categories observed in the text with linguistic production norms. These norms have been derived from the study of a great number of different texts. They are stored in specific in-built tables.
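The comparison with norms can be sketched as follows. The norm values below are invented placeholders, not the software's built-in tables, and the ratio test with a `threshold` parameter is only one plausible way of deciding that a category is "frequent"; the name `frequent_categories` is likewise hypothetical.

```python
def frequent_categories(counts, norms, threshold=1.5):
    """Return categories whose share of the text is at least `threshold` times the norm."""
    total = sum(counts.values())
    if total == 0:
        return {}
    flagged = {}
    for category, count in counts.items():
        share = count / total
        norm = norms.get(category)
        # Categories without a norm are skipped rather than guessed at.
        if norm and share / norm >= threshold:
            flagged[category] = round(share / norm, 2)
    return flagged

# Placeholder norms: expected share of each connector subcategory (assumed values).
HYPOTHETICAL_NORMS = {"cause": 0.10, "opposition": 0.10, "addition": 0.50, "time": 0.30}
observed = {"cause": 30, "opposition": 10, "addition": 40, "time": 20}

result = frequent_categories(observed, HYPOTHETICAL_NORMS)
print(result)
```

In this made-up example, cause connectors take three times their expected share, which would point toward a text built around a chain of reasoning.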
References and Relations (Semantic Analysis)
The equivalent classes constitute groups of closely related meanings (common nouns, proper nouns, trademarks, etc.) appearing frequently throughout the text. For example, “father” and “mother” are grouped together into the “family” class.
The reference fields group together the words comprising the equivalent classes in order to enable the software to build up a representation of the context. To achieve this, the semantic equivalents dictionary of Tropes is composed of three different classification levels. At the lowest level are the references, which are then merged more broadly into reference fields 2, which, in turn, are merged into reference fields 1.
In the example below, the word “Lord Chancellor” belongs to the “minister” reference, included in the “government” field 2, which is part of the “politics” field 1. The “politics” field 1 includes broader concepts, such as “political system”, “foreign policy”, etc.
|Fields 1||Fields 2||References||Words|
|Politics||Government||Federal government||Federal government|
|Politics||Government||Head of Government||Head of Government|
|Politics||Government||Head of Government||Prime Minister|
|Politics||Government||Minister||Secretary of State|
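The three classification levels can be represented as a simple mapping from each word to its reference, field 2 and field 1. The sketch below uses only the rows of the table above; the dictionary layout and the name `classify` are assumptions made for illustration, not the actual format of the semantic equivalents dictionary.

```python
# word -> (reference, field 2, field 1), copied from the table rows above
SEMANTIC_EQUIVALENTS = {
    "federal government": ("Federal government", "Government", "Politics"),
    "head of government": ("Head of Government", "Government", "Politics"),
    "prime minister": ("Head of Government", "Government", "Politics"),
    "secretary of state": ("Minister", "Government", "Politics"),
}

LEVELS = {"reference": 0, "field2": 1, "field1": 2}

def classify(word, level="reference"):
    """Return the class of `word` at the requested level, or None if unknown."""
    entry = SEMANTIC_EQUIVALENTS.get(word.lower())
    return entry[LEVELS[level]] if entry else None

print(classify("Prime Minister"))
print(classify("Secretary of State", "field2"))
print(classify("Secretary of State", "field1"))
```

Moving from `reference` to `field1` trades precision for a broader view of the context: two different words may share a field 1 long before they share a reference.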
Bundles and Episodes (Chronological Analysis)
Tropes employs two different tools to study the chronology of a discourse. This analysis is based on two notions, bundles and episodes:
- a bundle groups together word occurrences (belonging to an equivalent class or to a word category) that tend to appear with remarkable density within a limited portion of the text (somewhere near the beginning, the middle or the end, but not in a regular pattern throughout the whole text).
- an episode corresponds to a part of the text in which a number of bundles have been formed and completed. These are large blocks of arguments, quite representative of the structure of the discourse studied.
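The notion of a bundle can be approximated with a sliding window. In the sketch below, the positions at which a class occurs (for example, proposition numbers) are scanned for the densest window, which is reported only if it holds clearly more occurrences than an even spread would predict. The function `find_bundle`, the window size and the `min_ratio` cut-off are assumptions; the probabilistic analysis actually used by the software is not described in this document.

```python
def find_bundle(positions, text_length, window, min_ratio=2.0):
    """Return (start, end, count) for the densest window, or None if no clear bundle."""
    if not positions or text_length <= 0:
        return None
    # Number of occurrences a window would hold if they were spread evenly.
    expected = len(positions) * window / text_length
    best = None
    for start in range(0, max(1, text_length - window + 1)):
        count = sum(start <= p < start + window for p in positions)
        if best is None or count > best[2]:
            best = (start, start + window, count)
    if best[2] >= min_ratio * expected:
        return best
    return None

clustered = [2, 40, 41, 42, 43, 44, 90]
regular = list(range(0, 100, 10))

print(find_bundle(clustered, text_length=100, window=10))
print(find_bundle(regular, text_length=100, window=10))
```

The regularly spaced occurrences yield no bundle, in line with the definition above: a bundle is a local concentration, not a pattern running through the whole text.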
Most Characteristic Parts of Text (Summarization)
The contraction of the text reveals the most characteristic parts of text. These are “propositions introducing main themes or characters, expressing events that are essential to the progression of the story (causal attributions of consequences, results, aims)”.
To extract these propositions, Tropes carries out a complex cognitive-discursive analysis process (CDA). To simplify matters, let us say that each proposition of the text is allotted a score, depending on its relative weight, its occurrence order and its argumentative role. The propositions are then sorted according to their respective scores. To enable you to control the number of displayed propositions, and to ensure that the result obtained reflects the text analyzed, Tropes provides means for adjusting the contraction rate of the text.
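The selection step can be sketched as follows, assuming the scores have already been computed. Here the contraction rate is interpreted as the share of propositions to keep, and the kept propositions are shown in their original order; both choices, like the name `contract`, are assumptions, and the scoring itself is not reproduced.

```python
def contract(propositions, scores, rate=0.3):
    """Keep the top-scoring share `rate` of the propositions, in text order."""
    if len(propositions) != len(scores):
        raise ValueError("one score is required per proposition")
    if not propositions:
        return []
    keep = max(1, round(len(propositions) * rate))
    # Rank proposition indices by decreasing score, then restore text order.
    ranked = sorted(range(len(propositions)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:keep])
    return [propositions[i] for i in chosen]

props = ["A", "B", "C", "D", "E"]
scores = [0.2, 0.9, 0.1, 0.7, 0.4]  # hypothetical scores, one per proposition

summary = contract(props, scores, rate=0.4)
print(summary)
```

Raising `rate` displays more propositions and lowering it displays fewer, which mirrors the adjustable contraction described above.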
Text Style (Semantic Analysis)
The software makes a diagnosis of the text style and of its setting according to the statistical indicators retrieved during the text analysis.
Here are the possible styles:
|Argumentative||the speaker involves himself, argues, explains or analyses in order to try to convince the interlocutor|
|Narrative||a narrator states a series of events, happening at a given time, and in a given place|
|Enunciative||the speaker and the interlocutor establish a mutual relation of influence, make their standpoints known|
|Descriptive||a narrator describes, identifies or classifies something or somebody|
Semantic Text Analysis features (Tropes)
- automatic generation of hierarchical keywords (thesaurus generation)
- automatic semantic and syntactic ambiguity solving
- automatic semantic analysis and categorization
- three-layer automatic concept classification
- text summarization
- chronological discourse analysis
- natural language ontology manager with ready-to-use classifications (scenarios)
- minimum recommended size for a text: less than 1 page (1 Kb)
- maximum recommended size for a single text: 5,000 pages (50 Mb)
- average analysis throughput: from 250 Kb to 1 Mb/second (on a single thread); analysis throughput may be slower if you analyze complex PDF or Word 2007 documents.
Semantic Search Engine & Desktop Search features (Zoom)
- automatic generation of hierarchical keywords (ontology or thesaurus generation)
- natural language search with ambiguity solving
- automatic information filtering (based on a pertinence threshold)
- similarity search with controlled vocabularies
- massive data mining analysis (text mining)
- search improvement for the references (nouns, trademarks and proper names)
- integrated Web Spider
- maximum number of indexed folder structures: unlimited
- average text indexing throughput: from 1 to 4 Gb/hour (single thread, on a local HDD); indexing throughput may be slower if you analyze complex PDF or Word 2007 documents.
- Microsoft Windows 2000 SP4, XP, Vista or 7 (32-bit)
- a recent Web browser, Internet Explorer 6+ / Firefox 2 (or above)
- Adobe Reader (or equivalent PDF reader)
- word processing software (optional)
- Microsoft Word 2007 / Office 2007 or Microsoft Filter Pack (free) are required for Word 2007 DOCX
- spreadsheet software (optional); Microsoft Excel is required for some analysis reports
- PC (Pentium IV or compatible)
- CD-ROM or DVD-ROM Drive
- XGA (1024×768) graphic display
- a free USB port
Minimum recommended configuration
|Software||Processor||RAM memory||Disk space|
|Tropes Zoom||1 GHz||1 Gb (2)||100 Mb|
(1) Under Windows Vista or 7 you must install a minimum of 1 Gb of RAM for correct performance.
(2) 2 Gb of RAM are recommended to perform decision-making analysis (on a significant number of indexed items) with Tropes Zoom.
Maximum capacity (theoretical)
|Software||Text size||Number of files||Number of words||Raw Database Size|
|Zoom||100 Mb||4,000,000||40 trillion||2 Tb|
By “unlimited”, we mean that the theoretical capacity of these software packages widely exceeds what it is possible to handle on a present-day computer. Your computer naturally has limited capacities.
For Zoom, we show the theoretical maximum capacity for a single folder. But you can index an unlimited number of folder structures.
By raw database size, we mean the maximum theoretical capacity in terms of text files indexed by the software in a single folder. Of course, it may require terabytes of disk space.
|German||unavailable (1)||unavailable (1)|
|Italian||unavailable (1)||unavailable (1)|
(1) Language currently being finalized, available under certain conditions: contact us.
- file formats supported: HTML, Microsoft Word 97-2007, Microsoft WordML (Word XML), PDF, RTF, Unicode, ASCII, ANSI, XML, SGML and Macintosh raw texts
- automatic extraction of Microsoft Outlook messages (via an external utility)
- automatic exportation of the results to other software
- runtime, or some features (for example, very large text mining), may require the use of additional statistics / data mining software and/or an RDBMS
© Semantic Knowledge, www.semantic-knowledge.com