Author profiling
Author profiling is the analysis of a given set of texts in an attempt to uncover various characteristics of the author based on stylistic- and content-based features. Characteristics analysed commonly include age and gender, though more recent studies have looked at other characteristics like personality traits and occupation.
Author profiling is one of the three major fields in Automatic Authorship Identification (AAI), the other two being authorship attribution and authorship identification. The process of AAI emerged at the end of the 19th century. Thomas Corwin Mendenhall, an American autodidact physicist and meteorologist, was the first to apply this process to the works of Francis Bacon, William Shakespeare, and Christopher Marlowe. Mendenhall sought to uncover quantitative stylistic differences between these three historic figures by inspecting word lengths.
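A word-length comparison of the kind attributed to Mendenhall above can be sketched in a few lines of Python. This is only an illustration: the function name `word_length_profile` and the tokenisation rule (runs of ASCII letters, so "don't" counts as two words) are assumptions, not a reconstruction of his procedure.

```python
import re
from collections import Counter

def word_length_profile(text):
    """Return the relative frequency of each word length found in `text`."""
    # Assumption: a "word" is a run of ASCII letters; digits and punctuation are skipped.
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return {}
    lengths = Counter(len(word) for word in words)
    total = len(words)
    return {length: count / total for length, count in sorted(lengths.items())}

# Profiles of two texts can then be compared length by length.
profile = word_length_profile("To be or not to be")
```

Comparing such profiles across authors only becomes meaningful on large samples; a single sentence, as here, merely shows the shape of the output.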
Although much progress has been made in the 21st century, the task of author profiling remains an unsolved problem due to its difficulty.
Techniques
Through the analysis of texts, various author profiling techniques can be applied to predict information about the author. For example, function words and part-of-speech analysis can be used to determine the author's gender and the truthfulness of a text.
The process of author profiling usually involves the following steps:
- Identifying specific features to be extracted from the text
- Building an adopted, standard representation for the target profile
- Building a classification model using a standard classifier for the target profile
Classifiers used for author profiling include:
- Support Vector Machines
- Naive Bayes Classifiers
- Deep averaging networks, which pass the mean of the word embeddings in a text through several neural network layers
- Long Short-Term Memory (LSTM) networks
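The three steps listed above can be sketched with a small multinomial Naive Bayes classifier written in plain Python. Everything concrete here is an assumption made for illustration: the ten-word `FUNCTION_WORDS` feature set, the toy training texts, and the neutral labels `profile_a` and `profile_b`. Real studies use far larger feature sets and corpora.

```python
import math
import re
from collections import Counter

# Step 1 (assumed feature set): a handful of English function words.
FUNCTION_WORDS = ["the", "a", "of", "and", "i", "you", "my", "we", "to", "in"]

def extract_features(text):
    """Step 2: represent a text as a vector of counts, one per chosen feature."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [counts[word] for word in FUNCTION_WORDS]

class NaiveBayes:
    """Step 3: multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, vectors, labels):
        self.classes = sorted(set(labels))
        n_features = len(vectors[0])
        class_counts = Counter(labels)
        totals = {c: [0] * n_features for c in self.classes}
        for vector, label in zip(vectors, labels):
            for i, value in enumerate(vector):
                totals[label][i] += value
        self.log_prior = {c: math.log(class_counts[c] / len(labels)) for c in self.classes}
        self.log_likelihood = {}
        for c in self.classes:
            denominator = sum(totals[c]) + n_features  # add-one smoothing
            self.log_likelihood[c] = [math.log((n + 1) / denominator) for n in totals[c]]
        return self

    def predict(self, vector):
        scores = {
            c: self.log_prior[c]
            + sum(v * ll for v, ll in zip(vector, self.log_likelihood[c]))
            for c in self.classes
        }
        return max(scores, key=scores.get)

# Toy, invented training data with neutral labels.
train_texts = [
    "I think my dog and I love you",
    "you and I and my cat",
    "the history of the city in the north",
    "a survey of the methods in the field",
]
train_labels = ["profile_a", "profile_a", "profile_b", "profile_b"]

model = NaiveBayes().fit([extract_features(t) for t in train_texts], train_labels)
prediction = model.predict(extract_features("the structure of the report"))
```

Naive Bayes is used here only because it fits in a few lines; any of the classifiers listed above could take its place behind the same `fit` and `predict` interface.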
With the advances in technology, author profiling on the Internet has become increasingly common. Digital texts, such as social media posts, blog posts and emails, are now being used. This has sparked greater research efforts because of the advantages analysing digital texts can bring to sectors like marketing and business. Author profiling on digital texts has also enabled predictions of a wider range of author characteristics such as personality, income and occupation.
The most effective attributes for author profiling on digital texts involve a combination of stylistic and content features. Research on digital texts also focuses on cross-genre author profiling, whereby one genre is used for training data and another genre is used for testing data, though both need to be relatively similar for good results.
However, there are some problems when applying author profiling techniques to online texts. These problems include:
- Wide variation in lengths of texts used
- Class imbalance in data
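Two common mitigations for the problems listed above can be sketched as follows: scaling feature counts by text length, and weighting classes inversely to their frequency. The helper names are hypothetical, and the weighting formula N / (K * n_c) is one widely used heuristic rather than a method taken from the studies discussed here.

```python
from collections import Counter

def relative_frequencies(counts, n_tokens):
    """Divide raw feature counts by text length so short and long texts are comparable."""
    if n_tokens == 0:
        return [0.0] * len(counts)
    return [count / n_tokens for count in counts]

def balanced_class_weights(labels):
    """Weight each class by N / (K * n_c), so that rarer classes weigh more in training."""
    class_counts = Counter(labels)
    n_samples, n_classes = len(labels), len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}

# Three texts from one class and one from another: the rare class gets the larger weight.
weights = balanced_class_weights(["x", "x", "x", "y"])
```

The resulting weights can be passed to any classifier that accepts per-class or per-sample weights.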
Author Profiling and the Internet
Social Media
The increased integration of social media into people’s daily lives has made it a rich source of textual data for author profiling. This is mainly because users frequently upload and share content for various purposes, including self-expression, socialisation and personal business. Social bots are also a frequent feature of social media platforms, especially Twitter, and generate content that may be analysed for author profiling. While different platforms contain similar data, they may also contain different features depending on the format and structure of the particular platform.
There are still limitations in using social media as a data source for author profiling, because the data obtained may not always be reliable or accurate. Users sometimes provide false information about themselves or withhold information. As a result, the training of algorithms for author profiling may be impeded by less accurate data. Another limitation is the irregularity of text on social media. Such irregularity includes deviations from normal linguistic standards, such as spelling errors, non-standard transliteration (for example, the substitution of letters with numbers), shorthand and user-created abbreviations for phrases, all of which may pose a challenge to author profiling. Researchers have adopted methods to overcome these limitations in training their algorithms for author profiling.
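As a rough illustration of how irregular text of the kind described above might be regularised before feature extraction, the sketch below lower-cases the text, reverses a few letter-for-digit substitutions and expands abbreviations. The lookup tables `DIGIT_TO_LETTER` and `ABBREVIATIONS` are invented for the example and do not come from any particular study; real systems would rely on much larger, curated resources.

```python
import re

# Hypothetical lookup tables, kept tiny for illustration.
DIGIT_TO_LETTER = str.maketrans({"0": "o", "1": "l", "3": "e", "4": "a", "5": "s"})
ABBREVIATIONS = {"brb": "be right back", "idk": "i do not know", "u": "you"}

def normalise(text):
    """Lower-case, undo letter-for-digit substitutions and expand known abbreviations."""
    normalised = []
    for token in re.findall(r"\w+", text.lower()):
        has_letter = any(ch.isalpha() for ch in token)
        has_digit = any(ch.isdigit() for ch in token)
        # Only rewrite digits in mixed tokens, so that genuine numbers are left alone.
        if has_letter and has_digit:
            token = token.translate(DIGIT_TO_LETTER)
        normalised.append(ABBREVIATIONS.get(token, token))
    return " ".join(normalised)

cleaned = normalise("gr34t w0rk idk")
```

Whether to normalise at all is a design choice: irregular spellings can themselves be informative about an author, so some pipelines keep them as features instead.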
In the context of Facebook, author profiling mainly involves English textual data, but has also been applied to non-English languages including Roman Urdu, Arabic, Brazilian Portuguese and Spanish. While author profiling studies on Facebook have predominantly addressed gender and age-group identification, there have been attempts to derive attributes that predict religiosity, the IT background of users and even basic emotions, among others.
Author profiling for Weibo content requires algorithms different from those used for other social media platforms, mainly because of the linguistic differences between Mandarin Chinese and Western languages. For example, Chinese emoticons consist of Chinese characters, enclosed in brackets, that describe a gesture or facial expression, such as ‘laughter’, ‘tears’, ‘giggle’, ‘love’ and ‘heart’. This differs from the use of punctuation symbols for emoticons in Western languages, and from the common use of Unicode emojis on other platforms such as Facebook and Instagram. Further, while there are around 161 Western emoticons, around 2900 emoticons are regularly used in Mainland China for web content such as Weibo posts. To tackle these differences, author profiling algorithms have been trained on Chinese emoticons and linguistic features. For example, algorithms have been designed to detect Chinese stylistic expressions of formality and sentiment, in place of English linguistic features such as capital letters.
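Bracketed emoticons of this kind can be extracted with a regular expression and counted as features. The sketch below assumes that the brackets are square brackets and that labels are at most eight characters long; both are assumptions made for illustration, and the sample post is invented.

```python
import re
from collections import Counter

# Assumption: an emoticon is a short label without spaces inside square brackets, e.g. "[哈哈]".
EMOTICON = re.compile(r"\[([^\[\]\s]{1,8})\]")

def emoticon_counts(post):
    """Count each bracketed emoticon label that appears in a post."""
    return Counter(EMOTICON.findall(post))

counts = emoticon_counts("今天真开心[哈哈][哈哈] 谢谢[心]")
```

The resulting counts can be appended to the stylistic and content features used by a classifier.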
Compared with other more popular, globalised platforms, texts on Weibo are less commonly used for author profiling. This is likely because Weibo's user base is concentrated in Mainland China, limiting its usage predominantly to Chinese nationals. Studies of this platform have used bots and machine learning algorithms to identify authors’ age and gender. Data is acquired from the Weibo microblog posts of willing participants, analysed, and used to train algorithms that build concept-based profiles of users to a certain accuracy.
Chat Logs
Chat logs have been studied for author profiling because they contain a large amount of textual discourse, the analysis of which has contributed to applied studies in areas including social trends and forensic science. Sources of data for author profiling from chat logs include platforms such as Yahoo!, AIM and WhatsApp. Computational systems have been devised to produce concept-based profiles listing the chat topics discussed in a single chat room or by independent users.
Blogs
Author profiling can be used to identify characteristics of blog writers, such as their age, gender and geographical location, based on their different writing styles. This is especially useful when it comes to anonymous blogs. The choice of content words, style-based features and topic-based features are analysed in order to discover characteristics of the author.
In general, features that frequently occur in blogs include a high proportion of verbs per text and a relatively high use of pronouns. The frequencies of verbs, pronouns and other word classes are used to profile and classify emotions in the writings of authors, as well as their gender and age. Classification models that were used on physical documents in the past, such as Support Vector Machines, have also been tested on blogs. However, they have been found unsuitable for blogs because of their low performance.
The machine learning algorithms that work well for author profiling on blogs include:
- Instance-based learning
- Random Decision Forests
Email
In author profiling for email, content is processed for important textual data, while unimportant features such as metadata and redundant HyperText Markup Language (HTML) markup are excluded. The Multipurpose Internet Mail Extensions (MIME) parts that contain the content of the emails are also included in the analysis. The data obtained is often parsed into various sections of content, including author text, signature text, advertisement, quoted text and reply lines. Further analysis of email textual content in author profiling tasks involves extracting tone of voice, sentiment, semantics and other linguistic features for processing.
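The parsing step described above can be sketched with Python's standard `email` package. The section rules used here are simplifying assumptions: a line consisting of "--" starts the signature, lines beginning with ">" are quoted text, and lines of the form "On ... wrote:" are reply lines. Advertisement detection is left out, and the function name `split_email_body` is hypothetical.

```python
from email import message_from_string
from email.policy import default

def split_email_body(raw_message):
    """Extract the plain-text MIME part and sort its lines into rough content sections."""
    message = message_from_string(raw_message, policy=default)
    # Headers (metadata) are skipped; only the text/plain body part is kept.
    part = message.get_body(preferencelist=("plain",))
    body = part.get_content() if part is not None else ""

    sections = {"author_text": [], "quoted_text": [], "reply_lines": [], "signature_text": []}
    in_signature = False
    for line in body.splitlines():
        if line.rstrip() == "--":  # assumed signature delimiter
            in_signature = True
        elif in_signature:
            sections["signature_text"].append(line)
        elif line.startswith(">"):
            sections["quoted_text"].append(line)
        elif line.startswith("On ") and line.rstrip().endswith("wrote:"):
            sections["reply_lines"].append(line)
        elif line.strip():
            sections["author_text"].append(line)
    return sections

raw = (
    "From: alice@example.com\n"
    "Subject: Re: hi\n"
    "\n"
    "Thanks, that works.\n"
    "\n"
    "On Monday, Bob wrote:\n"
    "> Does it work?\n"
    "-- \n"
    "Alice\n"
)
sections = split_email_body(raw)
```

Only the `author_text` section would normally be passed on to feature extraction, since quoted text and signatures are not necessarily written by the author being profiled.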
Applications
Author profiling has applications in various fields where there is a need to identify specific characteristics of the author of a text, with growing importance in fields like forensics and marketing. Depending on its application, the task of author profiling can vary in terms of the characteristics to be identified, the number of authors studied and the number of texts available for analysis.
Although its applications have traditionally been limited to written texts, such as literary works, they have extended to online texts with the advancement of the computer and the Internet.
Forensic Linguistics
In the context of forensic linguistics, author profiling is used to identify characteristics of the author of an anonymous, pseudonymous or forged text, based on the author’s use of language. Through linguistic analysis, forensic linguists seek to identify the suspect’s motivation and ideology, along with other class features, such as the suspect’s ethnicity or profession. While this does not always lead to decisive author identification, such information can help law enforcement narrow the pool of suspects.
In most cases, author profiling in the context of forensic linguistics involves a single-text problem, in which there are few or no comparison texts available and no external evidence that points to the author. Examples of texts analysed by forensic linguists include blackmail letters, confessions, testaments, suicide letters and plagiarised writing. With the increasing number of cybercrimes committed on the Internet, this has extended to online texts, such as sexually explicit online chat logs between middle-aged men and underaged girls.
One of the earliest and best-known examples of author profiling is the work of Roger Shuy, who was asked to examine a ransom note linked to a notorious kidnapping case in 1979. Based on his analysis of the kidnapper’s idiolect, Shuy identified crucial elements of the kidnapper’s identity from his misspellings and a dialect item, namely that the kidnapper was well-educated and from Akron, Ohio. This eventually led to a successful arrest and a confession by the suspect.
However, there are criticisms that author profiling methods lack objectivity, since these methods are reliant on a forensic linguist’s subjective identification of crucial sociolinguistic markers. These methods, such as those adopted by literary critic Donald Wayne Foster, are said to be speculative and based entirely on one’s subjective experience, and therefore cannot be tested empirically.
Bot Detection
Author profiling is used in the identification of social bots, the most common being Twitter bots. Social bots have been deemed a threat given their commercial, political and ideological influence, as in the 2016 United States presidential election, during which they polarised political conversations and spread misinformation and unverified information. In the context of marketing, social bots can artificially inflate the popularity of a product by posting positive reviews, and undermine the reputation of competing products with unfavourable reviews. Therefore, bot detection from an author profiling perspective is a task of high importance.
Because bots are made to appear as human accounts, they can mostly be identified by information on their profiles, like their username, profile photo and time of posting. However, the task of identifying bots solely from textual data is significantly more challenging and requires author profiling techniques. This usually involves a classification task based on semantic and syntactic features.
Bot and gender profiling was one of four shared tasks organised in 2019 by PAN, which runs a series of scientific events and shared tasks on digital text forensics and stylometry. Participating teams achieved considerable success, with the best bot detection results for English and Spanish tweets at 95.95% and 93.33% respectively.
Marketing
Author profiling is also useful from a marketing viewpoint, as it allows businesses to identify the demographics of people who like or dislike their products based on an analysis of blogs, online product reviews and social media content. This is important since most individuals post their product reviews anonymously. Author profiling techniques help business experts make better-informed strategic decisions based on the demographics of their target group. In addition, businesses can target their marketing campaigns at groups of consumers who match the demographics and profile of current customers.
Literary Works
Author profiling techniques are used to study traditional media and literature to identify the writing styles of various authors as well as the topics they write about. Author profiling for literature has also been done to deduce the social networks of authors and their literary influence based on their bibliographic records of co-authorship.
Some examples of author profiling studies on literature and traditional media include studies on the following:
- The Bible
- Gospels of the New Testament
- Shakespeare’s works
- The Federalist Papers, in studies from the 1960s and 1990s
- Lithuanian literary texts
Library Cataloguing
In using author profiling for library cataloguing, researchers have used machine learning algorithms, such as Support Vector Machines (SVMs), to automate library processes. With the use of SVMs for author profiling, bibliographic records of authors within existing databases may be identified, tracked and updated, so that an author can be identified from the topics of literary content and expertise indicated in their bibliographic records. In this case, author profiling uses the social structures of authors that may be derived from physical copies of published media to catalogue library resources.