Multimodal rank

It is generally assumed that segmentation plays an important role in visual communication as part of the socially shared communicative resources employed both by text producers and by text receivers (Kress 2010: 44–5).

Rank in the description of language

In Systemic Functional Linguistics (e.g. Martin 1992; Halliday and Matthiessen 2004; Martin and Rose 2003) the concept of rank is used to identify and describe units that display grammatical patterns. The units (clause, group/phrase, word) are hierarchically organized, so that a clause is composed of groups/phrases that fulfill clause functions and a group/phrase is composed of words that fulfill group functions (intrastratal realization). The clause, as the largest grammatical unit, is considered a realization of the semantic categories (experiential) figure, (interpersonal) move and (textual) message (interstratal realization). The structure of a grammatical unit is described by reference to the corresponding system network (a clause is described by reference to the clause network, a group/phrase by reference to the group/phrase network). Rank scales are also used in the description of phonology and semantics.


  • One problem with the hierarchical rank scale used in SFL is that it is based partly on the (internal) structure of a unit and partly on the (external) function of the unit. Thus, ‘Hello!’ is considered a clause when it fulfills a move in an interaction, even though it cannot be analyzed by means of clause functions such as Subject, Finite, etc. It is considered a minor clause (but still a clause). On the other hand, ‘who lives in France’ (in: ‘My mother who lives in France is a beautiful woman’) is considered a clause because it is composed of clause functions (Subject, Finite, etc.), but it is treated as a rankshifted clause that does not fulfill a move in an interaction.
  • Basic question: Should we be able to identify multimodal grammatical units – and where does this leave the linguistic and visual rank scales?

Visual rank

One of O’Toole’s major contributions to visual social semiotics is his segmentation of images into the four rank scale levels of ‘Work’, ‘Episode’, ‘Figure’ and ‘Member’ (O’Toole 2011), which takes inspiration from Halliday’s segmentation of verbal language into ‘morpheme’, ‘word’, ‘group’ and ‘clause’ (Halliday and Matthiessen 2004: 32). O’Toole’s rank scale is geared to visual art, and the terms are coined to shed light on segmentation in more classical paintings. He defines Work as the overall level of the entire piece (equal to the overall text), within which the other levels play different parts. Figure is what we recognize as a complete entity in the image (for O’Toole typically a person), and Episodes are groupings of Figures that are involved in different kinds of shared processes. Members are elements of the Figures that play important roles in the overall meaning (O’Toole 2011). These could, for example, be the limbs of a person, which may perform different actions in the picture such as ‘walk’, ‘hold’, ‘point’ and so on. O’Toole argues that when viewing an image we tend to ‘home in’ on configurations of Members, Figures and Episodes, and that ‘a kind of shuttling process’ then takes place between the individual parts and the whole image (O’Toole 2011/1994: 12). Cognitive eye-tracking experiments indicate that this account is in fact close to how we actually perceive visual texts (see e.g. Holsanova 2001, 2008, 2011).

Baldry and Thibault (2006) use the term ‘clustering’ to account for the grouping of elements in visual texts. Their cluster theory builds on a multi-variable rank scale with the spatial grouping of elements as the major factor. The rank scale is dynamic in the sense that the rank scale levels are not defined a priori by certain formal features, and the elements at various levels may interact in very complex ways. According to Baldry and Thibault, the reading of the fixated structure in images, which they term ‘cluster hopping’ (2006: 26), is based on a number of mechanisms such as periodicity and variation. The strength of Baldry and Thibault’s approach is that their dynamic clusters can account for very complex structures in visual texts and are not confined to visual art. On the other hand, the approach remains rather abstract and, because it is based solely on the relative proximity of elements in the pictorial frame, it does not account for, for instance, the ideational relations between elements that O’Toole’s rank scale captures.

Boeriis (2012) has presented a dynamic functional rank scale that draws on O’Toole, Baldry and Thibault, and others. It describes basic visual ranking functions which are not given a priori by certain formal entities, as any grouping or sub-unit in an image is ‘suggested’ by the image itself in the given context. Boeriis presents the notion of ‘text zoom’ as a way of understanding how we may perceive (and create) texts as complex ‘texts in texts’ which we zoom in and out of (similarly to ‘shuttling’ or ‘cluster hopping’) as we make sense of them. At the level of the overarching ‘global text whole’, as well as at the level of any given ‘local text whole’, there are rank scale functions at play, so the different rank scales potentially apply simultaneously at many levels in complex visual texts. The dynamic rank scale functions are described as follows:

  • The Whole (both global and local) is what functions as an overall whole in the visual text, usually consisting of several Groups. More often than not the Whole corresponds to the totality of what is considered the visual multimodal text. The Whole is similar to O’Toole’s ‘Work’, but is seen as a dynamic function.
  • Groups are a number of Units functioning as one unified entity. This may, for example, be the case when Units are parts in what Kress and van Leeuwen call a classificatory process (2006). Relations within Groups can be described in terms of ranking mechanisms such as proximity, segregation, framing, joint process involvement and process fusion (see Boeriis 2012; Boeriis and Holsanova 2012). Groups are similar to O’Toole’s ‘Episodes’, but are seen as dynamic functions.
  • Units are elements that function as complete entities within the (world of the) visual text. Units may be any whole entity within a visual text such as for instance a person, a table, a house, a car or a tree. Units cannot be divided without breaking them into parts. Units are similar to O’Toole’s ‘Figures’, but are seen as dynamic functions.
  • Components are elements functioning as parts of a Unit and may play an important role in the overall meaning-making in a visual text. This is the case, for instance, when certain features of a person’s look (intensive process) or clothing (possessive process) make it possible to recognize him or her as somebody known (identificational process), or in component process fusion, when a number of action processes in the components of the face and body unite into a ‘look happy’ process at the Unit level. Components are similar to O’Toole’s ‘Members’, but are seen as dynamic functions.

Visual elements may play different united functional roles on the rank scale. Elements functioning as Components may join in certain ways and elements functioning as Units may group in certain other ways. These ‘ways of coming together’ are called ranking mechanisms (see Boeriis 2009; Boeriis 2012; Boeriis and Holsanova 2012).
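The four rank functions and the ranking mechanisms that join them can be pictured as a small tree. The following Python sketch is only an illustration under stated assumptions and is not part of Boeriis’ description: the `Element` class, the `outline` helper and the invented image (two people standing close together next to a tree) are all hypothetical. The rank function is stored as a label on each element rather than derived from the kind of entity, which is meant to echo the point that the functions are not given a priori by formal features.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Rank functions of the dynamic rank scale, from highest to lowest.
RANKS = ("Whole", "Group", "Unit", "Component")


@dataclass
class Element:
    label: str                       # hypothetical description of the element
    function: str                    # rank function it fulfils in this reading
    mechanism: Optional[str] = None  # ranking mechanism joining its parts, if any
    parts: List["Element"] = field(default_factory=list)

    def __post_init__(self):
        # Reject labels that are not one of the four rank functions.
        if self.function not in RANKS:
            raise ValueError(f"unknown rank function: {self.function}")


def outline(element: Element, depth: int = 0) -> List[str]:
    """Return one indented line per element, reading the tree top-down."""
    note = f" [{element.mechanism}]" if element.mechanism else ""
    lines = [f"{'  ' * depth}{element.function}: {element.label}{note}"]
    for part in element.parts:
        lines.extend(outline(part, depth + 1))
    return lines


# Invented example image: a couple standing next to a tree.
woman = Element("woman", "Unit", parts=[Element("face", "Component"),
                                        Element("hand", "Component")])
man = Element("man", "Unit", parts=[Element("face", "Component")])
couple = Element("couple", "Group", mechanism="proximity", parts=[woman, man])
image = Element("photograph", "Whole", parts=[couple, Element("tree", "Unit")])

print("\n".join(outline(image)))
```

In this reading the two Units are joined into a Group by proximity, while the tree remains a Unit directly under the Whole. A different context could suggest a different grouping of the same elements, which here would simply mean building a different tree.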

Multimodal rank 

It seems fair to assume that rank hierarchies play a role in all kinds of multimodal texts. One way to create a multimodal rank scale description could be to take as a point of departure the basic notion in Boeriis’ dynamic functional rank scale of the functions of Wholes, Groups, Units and Components. This, however, has not yet been examined further.

Citing this entry:

Boeriis, Morten. 2016. “Multimodal rank.” In Key Terms in Multimodality: Definitions, Issues, Discussions, edited by Nina Nørgaard.


Baldry, A.P. and Thibault, P.J. 2006. Multimodal Transcription and Text Analysis. London: Equinox.

Boeriis, M. 2009. Multimodal Socialsemiotik og Levende Billeder. Ph.D. dissertation. Odense: University of Southern Denmark.

Boeriis, M. 2012. “Tekstzoom – om en dynamisk funktionel rangstruktur i visuelle tekster”, in T. Andersen and M. Boeriis (eds), Nordisk Socialsemiotik – multimodale, pædagogiske og sprogvidenskabelige landvindinger. Odense: University Press of Southern Denmark, pp. 131–153.

Boeriis, M. and Holsanova, J. 2012. “Tracking visual segmentation: connecting semiotic and cognitive perspectives”, Visual Communication 11(3).

Halliday, M.A.K. and Matthiessen, C.M.I.M. 2004. An Introduction to Functional Grammar, 3rd edn. London: Arnold.

Holsanova, J. 2008. Discourse, Vision, and Cognition. Human Cognitive Processes 23. Amsterdam: John Benjamins.

Holsanova, J. 2011. “How We Focus Attention in Image Viewing, Image Description, and during Mental Imagery.” In K. Sachs-Hombach and R. Totzke (eds), Bilder, Sehen, Denken. Cologne: Herbert von Halem Verlag, pp. 291–313.

Holsanova, J., Rahm, H. and Holmqvist, K. 2006. “Entry Points and Reading Paths on the Newspaper Spread: Comparing Semiotic Analysis with Eye-Tracking Measurements”, Visual Communication 5(1): 65–93.

Martin, J.R. 1992. English Text. System and Structure. Philadelphia and Amsterdam: John Benjamins.

Martin, J.R. and Rose, D. 2003. Working with Discourse. Meaning Beyond the Clause. London and New York: Continuum.

O’Toole, M. 2011/1994. The Language of Displayed Art. London: Leicester University Press.