Sunday, December 9, 2007

Rapid Radiation, Borrowing and Dialect Continua in the Bantu Languages

Clare J. Holden & Russell D. Gray


1. Introduction

Despite several decades of study, several fundamental questions about Bantu linguistic relationships remain unresolved, as well as numerous questions of detail (see Chapter 4 this volume). Phylogenetic analysis has shown that Bantu languages fit a branching-tree model of evolution surprisingly well, but a tree model does not explain all the variation in the Bantu linguistic data. Moreover, several different Bantu trees appear to fit the data almost equally well. Our difficulties in resolving the Bantu tree are often ascribed to a lack of data and research, and it is true that there are many more Bantu languages than linguists. However, there are probably also more fundamental reasons why a single Bantu tree has proven elusive, arising from the historical processes under which these languages developed. In this chapter, we show how the network-building method Neighbor-Net (Bryant & Moulton 2003) can be used to distinguish between different historical reasons why some linguistic relationships are not well resolved. We test three hypotheses for why some Bantu languages might not fit a tree model well: rapid radiation, linguistic borrowing and dialect chains, all thought to have been widespread within the Bantu family.
Bantu is a large family of over 450 languages that are spoken across sub-Equatorial Africa (Fig. 2.1). We define ‘Bantu’ in the sense of Ruhlen’s (1991) ‘Narrow Bantu’. Bantoid is the larger language group to which Bantu belongs. Bantoid belongs to the Niger-Kordofanian phylum, whose deepest branches are found in West Africa (Williamson & Blench 2000). Bantu languages are identified by codes originally assigned by Guthrie (1967–71), who classified Bantu languages into 15 zones (later expanded to 16), labelled A to S, based on geographical and linguistic criteria. (Many of these zones are probably not valid genetic groups.) Guthrie also divided the whole of Bantu into two large subdivisions, West Bantu and East Bantu (see Chapter 4 this volume). In this chapter, we use a modified version of Guthrie’s codes taken from Bastin et al. (1999) (for a correspondence between the codes of Bastin et al. and Guthrie see Maho 2002). Bantu is thought to have originated in Cameroon or Nigeria, where the non-Bantu Bantoid languages are spoken today. The spread of Bantu is associated with the spread of farming; East Bantu in particular is associated with the Early Iron Age ‘Chifumbaze’ tradition in East and southeast Africa (Ehret 1998; Holden 2002; Phillipson 1993, 184–205; Vansina 1990).

Part 1: Tree approaches to Bantu language phylogeny

A number of Bantu trees have been published, constructed using different samples of languages and different tree-building methods. Distance-based lexicostatistical methods have been widely used as a heuristic device to infer Bantu relationships (Nurse 1996; see Chapter 4, this volume). Bastin et al. (1999) published the most comprehensive lexicostatistical Bantu trees, including 542 languages and dialects. Their linguistic data comprised coded information on cognates for 92 items of basic vocabulary, derived from the Swadesh 100-word list of basic vocabulary, with eight meanings (such as ‘snow’) that are not present in Bantu excluded (Bastin 1983). Subsequently, subsets of the data of Bastin et al. (1999) have been reanalyzed using phylogenetic tree-building methods, which use only innovations to define subgroups. In this respect, phylogenetic methods are comparable to the linguistic comparative method. Unlike the comparative method, however, phylogenetic methods use an explicit optimality criterion, such as maximum parsimony or likelihood, to choose among possible trees. Two further advantages of phylogenetic methods are that they let us test the fit of data on a tree using the consistency and retention indices, which measure the extent of homoplasy in the data set, and they allow us to evaluate the level of support in the data for each node, using tests such as bootstrap analysis, or, for Bayesian methods, by estimating the posterior probability of each node.
Holden (2002) reanalyzed 75 languages from Bastin et al.’s (1999) data set using maximum parsimony tree-building methods, implemented by the computer program PAUP*4.0 (Swofford 1998; see Fig. 4.2 this volume). More recently, we have used Bayesian MCMC (Markov chain Monte Carlo) methods to infer phylogeny for 95 Bantu languages from the same data set, using the computer programs MrBayes (Huelsenbeck & Ronquist 2001) and BayesPhylogenies (Pagel & Meade 2004) (Fig. 2.2; Holden et al. 2005).
Instead of searching for the best tree(s) according to an optimality criterion, Bayesian methods sample trees in proportion to their likelihood, producing a sample of trees (usually several hundred) in which both more- and less-likely trees are included. One advantage of constructing a Bayesian tree sample is that it allows us to represent phylogenetic uncertainty in the sample, so that we do not have to treat the tree as if it were known without error, when, in fact, any tree remains simply a hypothesis about phylogeny. Our ability to reconstruct the true tree is inevitably limited by our data and our models of evolution; moreover, as we have noted, linguistic borrowing cannot be represented on a single tree.
Phylogenetic analysis suggests that Bantu basic vocabulary fits a tree model at least as well as typical biological data sets (Holden 2002; Sanderson & Donoghue 1989). For the 75-language Bantu tree constructed using equally weighted parsimony, Holden (2002) reported a consistency index (CI) of 0.65 and a retention index (RI) of 0.59; using weighted parsimony the CI was 0.72 and the RI was 0.68. These results indicate that Bantu linguistic evolution was substantially tree-like, at least for the basic vocabulary (other parts of the lexicon may be more prone to borrowing).
This was at first a surprising result, since borrowing among Bantu languages is thought to be widespread, whereas gene flow and hybridization are thought to be rare among many biological taxa. However, a branching tree model does not explain all the variation among the Bantu languages; there remains considerable conflicting signal in the Bantu data, which could be due either to borrowing or parallel evolution. In comparative perspective, Indo-European is even more tree-like than Bantu (Bryant et al. 2005; Rexova et al. 2003), but Austronesian may be less so (Gray & Jordan 2000).
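The consistency and retention indices reported above can be illustrated with a short sketch. It uses the standard ensemble definitions (CI = minimum steps / observed steps; RI = (maximum steps - observed steps) / (maximum steps - minimum steps)). The class name, function names and step counts below are invented for illustration; they are not taken from Holden (2002) or from PAUP*.

```python
from dataclasses import dataclass


@dataclass
class CharacterSteps:
    """Step counts for one character (here, one meaning); values are hypothetical."""
    min_steps: int  # m: fewest changes possible on any tree (number of states - 1)
    obs_steps: int  # s: changes required on the tree being evaluated
    max_steps: int  # g: changes required on a completely unresolved (star) tree


def consistency_index(chars):
    """Ensemble CI: total minimum steps divided by total observed steps."""
    m = sum(c.min_steps for c in chars)
    s = sum(c.obs_steps for c in chars)
    return m / s


def retention_index(chars):
    """Ensemble RI: (g - s) / (g - m), summed over all characters."""
    m = sum(c.min_steps for c in chars)
    s = sum(c.obs_steps for c in chars)
    g = sum(c.max_steps for c in chars)
    if g == m:
        raise ValueError("RI is undefined when no character is informative")
    return (g - s) / (g - m)


# Three made-up characters; the middle one needs one extra change (homoplasy),
# as does the last.
chars = [
    CharacterSteps(min_steps=1, obs_steps=1, max_steps=3),
    CharacterSteps(min_steps=2, obs_steps=3, max_steps=5),
    CharacterSteps(min_steps=1, obs_steps=2, max_steps=4),
]
ci = consistency_index(chars)
ri = retention_index(chars)
print(f"CI = {ci:.2f}, RI = {ri:.2f}")
```

Both indices equal 1 when every character changes the minimum number of times on the tree, and fall as homoplasy (borrowing or parallel change) increases.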

Agreement and uncertainty among Bantu trees

Comparing Bantu trees, it is apparent that several major questions about Bantu linguistic relationships remain unresolved. Conflicts among Bantu trees are illustrated in Figures 2.2–2.4. Figure 2.2 shows a majority rule tree summarizing a Bayesian sample of 200 Bantu trees, sampled from 2 million trees, constructed using a reversible model of evolution with the computer program MrBayes (Huelsenbeck & Ronquist 2001). The majority rule tree in Figure 2.2 shows all nodes present on at least half the trees in the sample, plus all other compatible groupings. Alternative tree topologies (not shown) were also present in the sample. Node labels indicate the proportion of trees in the sample in which each node was found, which is equivalent to the posterior probability of that node. While many nodes were well supported in this analysis (being found in more than 95 per cent of trees in the sample) several nodes towards the root of the tree received much lower levels of support. An alternative summary of this tree sample, constructed using a consensus network (Holland & Moulton 2003), is shown in Figure 2.3. A consensus network allows us to represent the alternative tree topologies that were found in the sample.
This consensus network (Fig. 2.3) shows trees present in more than 18 per cent of the sample. Examining the consensus network reveals that most alternative tree topologies involve the East Bantu languages spoken in East Africa, particularly Yao P21, which has conflicting affiliations with both East and Southeast Bantu languages (among the latter, especially with the N zone languages and Makwa P31). The 18 per cent threshold was chosen to illustrate the maximum amount of conflict among trees without becoming too visually complex; if we decrease this threshold, the main effect is to reveal even more complexity among the East African languages. Figure 2.4 illustrates a range of plausible alternative Bantu trees. Figure 2.4a summarizes the Bayesian tree sample shown in more detail in Figure 2.2. Figure 2.4b shows a number of alternative Bantu tree topologies, found in other analyses that used a variety of methods including maximum parsimony (Holden 2002; see Fig. 4.2), Bayesian methods with a nonreversible model of evolution (Holden et al. 2005) and Neighbor-Joining (Saitou & Nei 1987; unpublished work by Holden). Figures 2.2–2.4 illustrate that there is significant uncertainty regarding the shape of Bantu history that we cannot currently resolve. Summarizing across Bantu trees, we have divided Bantu into four major areas: West and East Bantu, and two groups of languages in Central and Southwest Africa that seem to be intermediate between West and East Bantu. These areas are shown on Figure 2.1. In phylogenetic analyses, languages in the intermediate zone usually cluster in two groups, labelled Central and Southwest Bantu on Figures 2.1 and 2.2 (Holden 2002; Holden et al. 2005). These categories broadly agree with much previous work in Bantu linguistics (Bastin et al. 1999; Heine 1973; Nurse 1996), although under Guthrie’s traditional West and East Bantu division, our Southwest zone and parts of our Central zone would be grouped with West Bantu (see Chapter 4 this volume).
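The way node support is read off a Bayesian tree sample, as described in this section, can be sketched as follows. Each tree is reduced to its set of clades, support is the proportion of sampled trees containing a clade, the majority-rule summary keeps clades found in at least half the trees, and a lower cut-off (such as the 18 per cent used for the consensus network) admits conflicting groupings. This is a simplified illustration: a real consensus network works on splits of unrooted trees, and the taxa "w" to "z" and the five-tree sample are hypothetical.

```python
from collections import Counter


def clade_support(tree_sample):
    """Return the proportion of sampled trees that contain each clade."""
    if not tree_sample:
        raise ValueError("tree sample is empty")
    counts = Counter()
    for clades in tree_sample:
        counts.update(set(clades))  # count each clade at most once per tree
    return {clade: n / len(tree_sample) for clade, n in counts.items()}


def majority_rule_clades(support):
    """Clades present in at least half of the sampled trees."""
    return {clade for clade, p in support.items() if p >= 0.5}


def clades_over(support, threshold):
    """Clades present in more than `threshold` of the sampled trees."""
    return {clade for clade, p in support.items() if p > threshold}


def clade(names):
    """Helper: build a clade from a space-separated list of taxon names."""
    return frozenset(names.split())


# Hypothetical sample of five rooted trees on four taxa.
t1 = {clade("w x"), clade("y z")}    # ((w,x),(y,z))
t2 = {clade("w y"), clade("x z")}    # ((w,y),(x,z))
t3 = {clade("w x"), clade("w x y")}  # (((w,x),y),z)
sample = [t1, t1, t1, t2, t3]

support = clade_support(sample)
majority = majority_rule_clades(support)
shown_in_network = clades_over(support, 0.18)

for c, p in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(c), p)
```

In this toy sample only two clades pass the majority-rule cut-off, while the lower threshold also retains the three minority groupings that conflict with them, which is the kind of conflict a consensus network is designed to display.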

West Bantu

In our classification, West Bantu includes languages spoken in west-Central Africa, belonging to zones A, B, C, H and parts of D (Figs. 2.1–2.2). The most fundamental disputed question about the shape of Bantu history is whether or not West Bantu is monophyletic. In other words, do West Bantu languages share a unique common ancestor that is not also ancestral to East Bantu languages? Many published trees show the deepest splits on the Bantu tree to be within West Bantu, among the northwestern Bantu languages belonging to zones A and B (Bastin et al. 1999; Heine 1973; Holden 2002; Holden et al. 2005; Figs. 2.1 & 2.4b).
The alternative hypothesis is that the deepest split on the tree is between East and West Bantu. The Bayesian tree sample summarized in Figure 2.2 supports this alternative hypothesis. The status of West Bantu has profound consequences for reconstructing ancestral Bantu language and culture. If the earliest splits were within the northwestern Bantu languages of zones A and B, then those languages would become highly influential in reconstructing the earliest Bantu linguistic forms. Many historians, archaeologists and anthropologists also treat the Bantu language tree as a source of information about population history in this region (Ehret 1998; Holden & Mace 2003; Phillipson 1993; Schoenbrun 1998; Vansina 1984; 1990). Uncertainty in the tree has received little attention in such studies, although different researchers have used quite different trees. For example, in his classic study of political development in the Equatorial Bantu, Vansina (1990) assumed that there was a primary split between East and West Bantu; in contrast, Ehret (1998) subscribed to the alternative model that the deepest splits on the tree are within West Bantu languages. The topology of the tree chosen can have a significant effect on the conclusions of such studies; again, for example, in determining the influence of the northwestern Bantu-speaking societies for reconstructing ancestral Bantu culture.

East Bantu

In our classification, East Bantu includes languages of zones E, F, G, J, N and S, plus some languages of zone M (Figs. 2.1–2.2). East Bantu is monophyletic on all published phylogenetic Bantu trees, but this clade is not well supported: it was not recovered in a bootstrap analysis (Holden 2002) and it has a very low posterior probability in Bayesian analyses (Holden et al. 2005; see also Fig. 2.2). Within East Bantu, the languages spoken in Southeast Africa, belonging to zones N and S, form a clade on previously-published phylogenetic trees. Languages in zone S often appear to be the most divergent within East Bantu when using ultrametric distance-based methods such as UPGMA; this is seen in some of the trees published by Bastin et al. (1999). However, maximum parsimony analysis (Holden 2002; see Fig. 4.2), which can display true branch lengths, suggests that this is because there was an increased rate of evolution among these languages. The languages spoken in East Africa sometimes form a clade within East Bantu, but not always (Fig. 2.4). In the present data set, the languages spoken in East Africa comprise zones E, F, G and J (Lakes Bantu), plus the individual languages Nyakyusa M31 and Yao P21 (Fig. 2.2). Again, alternative relationships among East Bantu languages imply different historical scenarios for the spread of these languages and their speakers. Within East Bantu, was there a primary division between the languages spoken in East and Southeast Africa, as Ehret (1998) suggests? Or are the deepest splits within the East African languages, suggesting that the East Bantu originated there, perhaps in association with the Urewe archaeological tradition around Lake Victoria (Holden 2002; Phillipson 1993)?

Southwest and Central Bantu

Bantu languages that seem to be intermediate between West and East Bantu belong to zones K and R (Southwest Bantu), and to zones L and parts of M (Central Bantu). Central Bantu languages usually form a clade that is a sister-group to East Bantu (Figs. 2.2 & 2.4a). However, on some trees the Central languages split into an eastern and a western group, in which case the east-Central group usually forms a sister group to East Bantu, or occasionally to the southeastern (zones N-plus-S) clade within East Bantu, while the northwest-Central group clusters with Southwest Bantu (Fig. 2.4b). The position of the two languages Lega D25 and Binja D24 varies considerably across previously-published trees. They fall somewhere within, or as immediate outliers to, the East-plus-Central clade.
The position of Southwest Bantu is also somewhat variable among the trees that have been proposed. On some maximum parsimony trees, Southwest Bantu clusters with West Bantu languages of zones C and H (Holden 2002; Fig. 4.2). However, on the widely cited tree published by Heine (1973), and in Bayesian analyses (Holden et al. 2005; Fig. 2.2) Southwest Bantu forms a sister group to Central-plus-East Bantu. But in the Bayesian tree sample summarized in Figure 2.2, there is very strong support (posterior probability =1.0) for a clade which is a sister to Southwest Bantu and which comprises the languages of the Central Bantu and East Bantu regions of Figure 2.1 including Lega D25 and Binja D24 — regardless of how these languages are subgrouped among themselves.

Why is a single Bantu tree elusive?

It is unclear whether our difficulties in resolving the Bantu tree stem from a lack of data, or from a more fundamental mismatch between the actual process of Bantu evolution, often thought to be characterized by widespread borrowing and dialect continua, and a bifurcating tree model of language evolution. Regarding the lack of data, the linguistic data published by Bastin et al. (1999) comprised only 92 meanings. A 200-word vocabulary list would probably be preferable. All phylogenetic analyses by Holden have also used this data set, so potentially suffer from the same problem. Counteracting this limitation, we should note that most of these 92 items have numerous distinct word forms (see Chapter 15 this volume), comprising over 1600 cognates in the 95-language sample. Regarding the tree model of linguistic evolution, trees are rather simplistic models of both biological and linguistic evolution. In biology, the importance of evolutionary processes such as hybridization, lateral gene transfer and recombination, especially in bacterial and viral evolution, is increasingly recognized (Boucher et al. 2003; Stone 2000; Woese 1998). Linguistic borrowing and the formation of creole languages are analogous to lateral gene transfer and hybridization (see Ringe et al. 2002, for a discussion of these phenomena). Parallel evolution can also give rise to ambiguous relationships among taxa. Such complex relationships cannot be represented on a single tree. When placed on a tree, admixed languages are usually positioned near the root of the branch of the parent language that contributed most to the mixed language (Bryant et al. 2005; Cavalli-Sforza et al. 1994). Unlike trees, which only permit branching and divergence among taxa, networks can also have reticulations among branches, making it possible to show more than one evolutionary pathway on a single graph. For this reason, networks may be preferable for describing linguistic relationships involving creoles, or among languages with extensive borrowing, as they allow us to represent more than one ‘parent’ per language.

Part 2: Network approaches to Bantu language phylogeny

In this analysis, we used a new network-building method, Neighbor-Net, to investigate the affiliations of those Bantu languages whose position varies across different trees. A primary question concerns the earliest Bantu history — can we resolve the question of whether there was a primary split between East and West Bantu, or whether the deepest splits on the tree are within West Bantu? The position of Southwest Bantu is also unclear from previous studies — is it a sister-group to East Bantu, or does it cluster with West Bantu languages? Is Central Bantu a valid group, or should it be split into two? Within East Bantu, are the Lakes (J) languages outliers to other languages, or do all East African languages form a clade?

Constructing a Bantu network also lets us distinguish between the different linguistic processes that might underlie the weak or conflicting signals for some parts of the Bantu tree. Such processes include rapid radiation and borrowing, the latter perhaps in the context of dialect continua. Rapid radiation may be inferred from a lack of phylogenetic signal, i.e. a rake- or star-shaped phylogeny, whereas reticulation would indicate possible borrowing. Reticulations can also pinpoint those languages which may have been involved in borrowing. Complex chains of conflicting relationships involving numerous languages may indicate that borrowing occurred in the context of dialect chains. Under rapid radiation, a language diverges into several daughter languages very rapidly, so there is little time for linguistic innovations to accumulate in each branch before the subsequent further splitting of that branch. This leaves a weak phylogenetic signal that can be difficult to detect, so that the language tree appears to be star- or rake-shaped (Bellwood 1996). Borrowing is the transfer of linguistic elements from one language to another, often between neighbouring languages. Extensive borrowing can lead to conflicting affiliations (where a language shows similarities to more than one divergent language group) that cannot be represented on a single tree.
Unlike tree-building methods, constructing a network does not force the data into a bifurcating tree. If the data are truly tree-like, then the Neighbor-Net method will return an unrooted tree, but if there are conflicts within the data then it will construct a splits graph, in which conflicting relationships are represented by reticulations or joining among branches. From the shape of the Neighbor-Net network, we can infer whether either rapid radiation or borrowing occurred in different parts of the Bantu language family. On a network, we would expect rapid radiation to result in a star-shaped phylogeny, with poorly marked hierarchical structure but no evidence for conflict among language groupings. We would expect borrowing to result in reticulation among branches, and we would expect dialect chains to be indicated by complex chains of reticulation involving numerous languages.

The sample included 93 Bantu languages and two non-Bantu Bantoid (hence simply ‘Bantoid’) languages, Tiv and Ejagham. The latter were used as outgroups to root the tree where appropriate (e.g. Fig. 2.2). Figure 2.1 shows the approximate geographical locations of the languages in the analysis. The data set included all languages for which linguistic data were published in Bastin et al. (1999), and for which ethnographic data from the corresponding cultural group were published in the Ethnographic Atlas (Murdock 1967). This data set was designed to let us use our knowledge of Bantu language relationships to study other aspects of cultural evolution in these populations in the future; this is possible insofar as linguistic relationships reflect population history (Barbujani 1991; Cavalli-Sforza et al. 1988). The data set is similar to the 75-language data set used by Holden (2002; see Fig. 4.2), except that in the 75-language data set, languages with more than 5 per cent missing data were excluded, whereas they have been included here. For this analysis, the data were coded in a multistate form, i.e. with each column representing a meaning, and most meanings having several different cognate forms. (We have also run this analysis with the data coded in binary form, i.e. each cognate having its own column; this made very little difference to the results.)

Neighbor-Net (Bryant & Moulton 2003) is an agglomerative method for constructing networks that selects taxa on the basis of similarity and groups them together. The algorithms used in this method are analogous to the Neighbor-Joining method for building trees (Saitou & Nei 1987). In agglomerative tree-building methods, two taxa (or nodes) are chosen on the basis of similarity, then they are agglomerated (merged), the data matrix is reduced and we proceed to the next iteration. However, to construct a network, the Neighbor-Net method does not immediately agglomerate the selected taxa (or nodes). Instead, it waits until one of the chosen taxa (or nodes) has been grouped with a different node. Then the three nodes are reduced to two, and the process is repeated. The Neighbor-Net method represents similarities within the data set as a splits graph, constructed from a distance matrix. Splits are bipartitions of the data. An example of a splits graph for four languages, Ndebele S44, Swati S43, Ngoni S45 and Zulu S42, is shown in Figure 2.5. Sets of parallel lines indicate a single bipartition (split) in the data. The box shape at the centre of the graph is characteristic of conflicting data. Split A groups together Zulu and Ngoni on the one hand, and Ndebele and Swati on the other. Split B groups together Ndebele and Ngoni on the one hand, and Swati and Zulu on the other. Branch lengths (or edge weights) are proportional to the support for a split in the data: thus there is more evidence for split A than for split B. Distances between language pairs in the 95-language data set were calculated using PAUP v.4a (Swofford 1998), using mean character differences. Weighted splits were calculated and then represented as a splits graph using the computer program SplitsTree v4beta.06 (Huson 1998; Huson & Bryant 2006). The complete Bantu network resulting from this analysis is shown in Figure 2.6. Separate splits graphs for East and West Bantu are shown in Figures 2.7a and 2.7b. There is some evidence that Neighbor-Net may overfit the data, meaning that it produces some false splits (Nakhleh et al. 2005). However, such false splits have very small edge weights. To guard against the possibility of including false splits in our Bantu network, we also constructed a network from which edges weighted less than 0.002 were excluded; this cut-off point was essentially arbitrary. Eighty-seven of 331 edges or 26 per cent had weights less than 0.002. The network of weights greater than 0.002 is shown in Figure 2.8; it may be interpreted as a simplified and conservative estimate of the Bantu network.
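The route from multistate cognate codes to conflicting splits can be illustrated for the four-language example. The sketch below computes mean character differences (the proportion of compared meanings in which two languages fall in different cognate classes) and then weights a split with the isolation index from split decomposition. Several things here are assumptions: the cognate codes are invented, the treatment of missing data (skipping a meaning when either language lacks it) is a guess at one reasonable convention, and the isolation index is a simpler stand-in for the least-squares split weights that Neighbor-Net estimates in SplitsTree.

```python
from itertools import combinations

MISSING = None  # marks a meaning with no recorded word form


def mean_character_difference(a, b):
    """Proportion of meanings coded in both languages whose cognate classes differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not pairs:
        raise ValueError("no meanings are coded for both languages")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(data):
    """Pairwise distances keyed by an unordered pair of language names."""
    return {
        frozenset((p, q)): mean_character_difference(data[p], data[q])
        for p, q in combinations(sorted(data), 2)
    }


def quartet_split_weight(dist, side1, side2):
    """Isolation index of the split side1 | side2 among four taxa (0 if unsupported)."""
    (a, b), (c, d) = side1, side2

    def dd(x, y):
        return dist[frozenset((x, y))]

    within = dd(a, b) + dd(c, d)
    across = max(dd(a, c) + dd(b, d), dd(a, d) + dd(b, c))
    return max(0.0, (across - within) / 2)


# Invented multistate codes for ten meanings (one column per meaning).
# Meanings 1-3 support split A, meaning 4 supports split B, the rest are shared.
data = {
    "Zulu":    [1, 1, 1, 2, 1, 1, 1, 1, 1, 1],
    "Ngoni":   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "Ndebele": [2, 2, 2, 1, 1, 1, 1, 1, 1, 1],
    "Swati":   [2, 2, 2, 2, 1, 1, 1, 1, 1, 1],
}

dist = distance_matrix(data)
weight_a = quartet_split_weight(dist, ("Zulu", "Ngoni"), ("Ndebele", "Swati"))
weight_b = quartet_split_weight(dist, ("Ndebele", "Ngoni"), ("Swati", "Zulu"))
weight_c = quartet_split_weight(dist, ("Zulu", "Ndebele"), ("Ngoni", "Swati"))
print(weight_a, weight_b, weight_c)
```

With these made-up codes, splits A and B both receive positive weight (roughly 0.3 and 0.1), so the graph would contain a box whose longer side corresponds to split A, while the third possible split receives no support. Dropping edges below a cut-off such as 0.002 would then amount to filtering a list of weights of this kind.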


A splits graph of the complete sample of 95 Bantu and Bantoid languages is shown in Figure 2.6. The major groups including West, East, Southwest and Central Bantu are indicated. Figures 2.7a and 2.7b show splits graphs for East and West Bantu, respectively, allowing us to focus on relationships within those groups in more detail. The simplified splits graph of edge weights greater than 0.002 is shown in Figure 2.8. Unless specified, the following discussion refers to splits that are present on both the complete network (Fig. 2.6) and on the reduced network (Fig. 2.8). On Figure 2.6, West Bantu languages cluster together on the top left of the graph, while East Bantu languages cluster to the bottom right. Southwest and Central Bantu occupy a space intermediate between West and East Bantu, with Central Bantu being closer to East Bantu, and Southwest Bantu closer to West Bantu. This is in line with our expectations from Bantu trees (cf. Figs. 2.2 & 2.4). Although the absolute positions of the languages have been rotated on Figure 2.8, the relative positions of languages remain the same.

Central Bantu

Although Central Bantu is clearly defined by splits dividing this group from other Bantu languages, there are also conflicting splits dividing Central Bantu into (a) a northwestern group comprising Kaonde L42, Luba L33 and Songe D10S, and (b) an eastern group comprising M-zone languages. The northwestern group is more similar to West and Southwest Bantu, plus a number of East Bantu languages that border the West Bantu area (see Fig. 2.1), including subzones J5, J6, E5 and E6 plus Hima J13. The east-Central group is more similar to the other East Bantu languages. This suggests that there was an area of contact, leading to borrowing or convergence, among languages on at least one side of this divide. Linguistically, Kaonde L42 falls within the northwestern group, but is also linked by conflicting splits to the eastern group. Geographically, Kaonde is closer to the eastern group (Fig. 2.1), so it seems likely that it originated as a northwest-Central language whose speakers later migrated south, and that there was subsequently borrowing between Kaonde and the east-Central languages.

Figure 2.7. a) Network of 48 East Bantu languages; relationships among East Bantu languages spoken in East Africa are particularly complex and conflicting. b) Network of 29 West Bantu languages, plus Tiv and Ejagham (Bantoid languages).

Southwest Bantu

Southwest Bantu is very clearly separated from other languages, reflecting the robustness of this group in phylogenetic analysis (Fig. 2.2; Holden 2002). Figures 2.6 and 2.8 show a conflicting split linking Umbundu R11 to West Bantu (or alternatively, linking all the other Southwest Bantu languages to East-plus-Central Bantu). This split aside, the splits graphs (Figs. 2.6 & 2.8) are consistent with a tree in which Southwest Bantu is an outlier to East-plus-Central Bantu, rather than clustering within West Bantu.

East Bantu

The following groups are supported by splits of varying lengths:
a) zone S;
b) zone N;
c) zone J (also known as Lakes Bantu);
d) zone F (including Sukuma F21, Nyamwezi F22 and Sumbwa F23).
Some of the most complex relationships in East Bantu appear among the East African languages of zones E (excluding E5 and E6), F and G. At the centre of the graph, there are no conflict-free major groupings among these languages, suggesting that these languages developed in a condition of dialect continua with borrowing across dialects (see Ringe et al. 2002 for discussion of an analogous situation among early Indo-European ‘Satem’ languages). The early evolution of these languages appears to be the least tree-like of all the Bantu languages in this analysis. Later, some clearer groups emerge such as Kaguru G12 + Gogo G11 + Luguru G35, and zone F, but these languages also continue to be involved in extensive conflicting relationships (Figs. 2.6, 2.7a & 2.8).
Within the S-zone languages, Venda shows conflicting relationships, being grouped on the one hand with Ndau S15 plus Shona S10, and on the other hand with the S30, S40 and S50 languages (Figs. 2.6, 2.7a & 2.8). Venda is geographically adjacent to the groups it shares similarities with, suggesting that borrowing across neighbouring groups has occurred. However, it should be noted that adjacency alone does not always lead to linguistic convergence: historically specific factors must also have played a role.

West Bantu

In previous phylogenetic analyses, it has proven difficult to resolve the relationships among West Bantu languages. The deepest splits involving West Bantu languages receive very low support on trees (Fig. 2.2; Holden 2002; Holden et al. 2005). Examining the splits graphs (Figs. 2.6, 2.7b & 2.8) suggests that there was a more or less simultaneous divergence of six groups of West Bantu languages.
These include:
a) the H languages plus Madzing B80 and Teke B73;
b) Duala A24, Puku A32, Bakoko A43, Fang A75, Kota B25 and Tsogo A43;
c) Sakata C34, Mongo C61, Nkundo Mongo C61, Kela C75, Tetela C71 and Lele C84;
d) Lingala C36, Doko C40 and Ngombe C41;
e) Kumu D37 and Bira D32;
f) Likile C57 and Mbesa C51.
There is little evidence for conflict (borrowing) among West Bantu languages. The shape of the network suggests that our difficulties in resolving relationships among West Bantu languages may result from rapid early radiation of these languages, leaving little phylogenetic signal or deep hierarchical structure in the data.
We also wished to address the question of whether some northwestern Bantu languages are the most divergent relative to other Bantu languages. In the tree sample shown in Figure 2.2 the most divergent West Bantu languages are Bubi A31 and a pair consisting of Likile C57 and Mbesa C51. On the 75-language maximum parsimony tree shown in Figure 4.2, Mpongwe B11 is also highly divergent. Investigating the affiliations of these languages on the complete splits graph (Fig. 2.6) yielded the following results. Bubi, a long isolated language spoken on Bioko Island, is linked by one split to the outgroups Tiv 802 and Ejagham 800. Presumably these similarities are primitive (i.e. they result from ancestral similarities, which other Bantu languages have lost). Another split links Tiv, Ejagham, Bubi, Kumu D37 and Bira D32. These similarities (among far-flung languages) are probably also primitive. Another split links Bubi to Bira, Kumu, Doko C40 and Ngombe C41; these are all geographically peripheral West Bantu languages (Fig. 2.1). One split links together all the West Bantu languages apart from Mpongwe B11. These links are consistent with the view that Bubi A31, and possibly Mpongwe B11, diverged or became isolated early among the West Bantu languages, and lack some innovations that most other West Bantu languages share. No splits linking Likile C57 and Mbesa C51 to any particular West Bantu (or other) languages were detected.
With the exception of Bubi A31, there was no evidence that languages of zones A and B are particularly divergent, either among themselves, or in relation to other Bantu languages. Zones A and B (apart from Bubi) appear to cluster with zone H (Figs. 2.2–2.3, 2.6, 2.7b & 2.8).


The results of this analysis suggest why persistent ambiguities in the Bantu tree remain, despite several decades of study. From the shape of the network it appears that different historical processes caused the ambiguities in the East and West Bantu parts of the tree. West Bantu languages underwent a rapid early radiation (the near-simultaneous divergence of several major West Bantu branches), resulting in a star-shaped phylogeny with little internal structure. With the exception of the isolated language Bubi A31 (and possibly Mpongwe B11) we found no evidence that the greatest divergence within Bantu involved languages of zones A and B. There is little evidence for borrowing among West Bantu languages, possibly indicating that these speech communities were more isolated from one another than were early East Bantu-speaking populations. In contrast, both borrowing and dialect continua seem to have been important among East Bantu languages, especially in East Africa. There is extensive conflict at the centre of the graph, involving all East Bantu languages and also the eastern-Central Bantu languages. The fact that so many languages are involved suggests that the borrowing among these languages occurred early in their history, when their proto-languages were geographically closer together, possibly in the context of dialect continua. There also seems to have been ongoing borrowing among the Bantu languages of zones E (excluding E5 and E6), F and G in East Africa. However, there is a fairly tree-like structure among the Southeast African languages of zones N and S, indicating that subsequent divergence among these languages was substantially tree-like. There is some evidence for more recent, localized borrowing, indicated by conflict involving fewer languages, for example that associated with Venda. Another case is the borrowing between Kaonde (a northwest-Central language) and the east-Central languages (Bemba, Lala and Lamba). Ethnographic evidence suggests that some of the languages that have experienced linguistic borrowing may have borrowed other cultural attributes along with vocabulary. For example, Kaonde-speakers are predominantly matrilineal, like the Bemba, Lala and Lamba, but unlike modern speakers of the other northwest-Central languages, Songe and Luba. A parsimonious interpretation is that after Kaonde speakers migrated south, they borrowed both linguistic elements and matrilineal descent from their new neighbours.
This analysis shows how, by constructing a network, we can move beyond the question of whether, or how well, Bantu languages fit a tree model. Although previous phylogenetic analysis of Bantu languages showed that relationships among these languages are more tree-like than we might have expected, a tree model does not explain all the variation among them (Holden 2002). Moreover, several alternative trees appear to fit the Bantu linguistic data almost equally well, so that we cannot, at present, define a single best tree. A network model lets us test alternative hypotheses for why some Bantu relationships may not be tree-like, including rapid radiation, borrowing and past dialect continua, providing new insights into why parts of the Bantu language family have remained intractable to phylogenetic analysis.


Our results suggest that there was rapid early radiation among West Bantu languages; in contrast, there was extensive borrowing, partly within dialect continua, among East Bantu languages in East Africa. We propose that these are the underlying reasons why some parts of the Bantu language family have until now proven intractable to phylogenetic analysis.

C.J.H. was supported by a Research Fellowship from the AHRB Centre for the Evolutionary Analysis of Cultural Behaviour at University College London (UCL). This research received additional funding from the Marsden Fund, Royal Society of New Zealand and the UCL Graduate School Research Projects Fund.


Barbujani, G., 1991. What do languages tell us about human microevolution? Trends in Ecology and Evolution 6, 151–6.
Bastin, Y., 1983. Classification lexicostatistique des langues bantoues (214 releves). Bull. Seanc. Acad. R. Sci Outre-Mer 27, 173–99.
Bastin, Y., A. Coupez & M. Mann, 1999. Continuity and divergence in the Bantu languages: perspectives from a lexicostatistic study. Annales, Sciences humaines 162, 1–226.
Bellwood, P., 1996. The origins and spread of agriculture in the Indo-Pacific region: gradualism and diffusion or revolution and colonization?, in The Origins and Spread of Agriculture and Pastoralism in Eurasia, ed. D.R. Harris. London: UCL Press, 465–98.
Boucher, Y., C.J. Douady, R.T. Papke, D.A. Walsh, M.E.R. Boudreau, C.L. Nesbo, R.J. Case & W.F. Doolittle, 2003. Lateral gene transfer and the origins of prokaryotic groups. Annual Review of Genetics 37, 283–328.
Bryant, D. & V. Moulton, 2003. Neighbor-Net: an agglomerative method for the construction of phylogenetic networks. Molecular Biology and Evolution 21, 255–65.
Bryant, D., F. Filimon & R.D. Gray, 2005. Untangling our past: languages, trees, splits and networks, in The Evolution of Cultural Diversity: a Phylogenetic Approach, eds. R. Mace, C. Holden & S. Shennan. London: UCL Press, 69–85.
Cavalli-Sforza, L.L., E. Minch & J. Mountain, 1988. Reconstruction of human evolution: bringing together the genetic, archaeological and linguistic data. Proceedings of the National Academy of Sciences of the USA 85, 6002–6.
Cavalli-Sforza, L.L., P. Menozzi & A. Piazza, 1994. The History and Geography of Human Genes. Princeton (NJ): Princeton University Press.
Ehret, C., 1998. An African Classical Age: Eastern and Southern Africa in World History, 1000 BC to AD 400. City?? (VA): University Press of Virginia.
Gray, R. & F. Jordan, 2000. Language trees support the express-train sequence of Austronesian expansion. Nature 405, 1052–5.
Guthrie, M., 1967–71. Comparative Bantu: an Introduction to the Comparative Linguistics and Prehistory of the Bantu Languages. Farnborough: Gregg International.
Heine, B., 1973. Zur genetischen Gliederung der Bantu-Sprachen. African Language Studies 14, 82–104.
Holden, C.J., 2002. Bantu language trees reflect the spread of farming across sub-Saharan Africa: a maximum-parsimony analysis. Proceedings of the Royal Society of London Series B 269, 793–9.
Holden, C.J. & R. Mace, 2003. Spread of cattle pastoralism led to the loss of matriliny in Africa: a co-evolutionary analysis. Proceedings of the Royal Society of London Series B 270, 2425–33.
Holden, C.J., A. Meade & M. Pagel, 2005. Comparison of maximum parsimony and Bayesian Bantu language trees, in The Evolution of Cultural Diversity: a Phylogenetic Approach, eds. R. Mace, C.J. Holden & S. Shennan. London: UCL Press, 53–65.
Holland, B. & V. Moulton, 2003. Consensus networks: a method for visualising incompatibilities in collections of trees, in Algorithms in Bioinformatics, eds. G. Benson & R. Page. Berlin: Springer-Verlag, 165–76.
Huelsenbeck, J.P. & F. Ronquist, 2001. MRBAYES: Bayesian inference of phylogeny. Bioinformatics 17, 754–5.
Huson, D.H., 1998. SplitsTree: a program for analyzing and visualizing evolutionary data. Bioinformatics 14, 68–73.
Huson, D.H. & D. Bryant, 2006. Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23, 254–67.
Maho, J.F., 2002. Bantu Line-up: Comparative Overview of Three Bantu Classifications, pp. 1–59. Unpublished manuscript. Goteborg, Sweden: Goteborg University.
Available for download at
Murdock, G.P., 1967. Ethnographic Atlas. Pittsburgh (PA): University of Pittsburgh Press.
Nakhleh, L., T. Warnow, C.R. Linder & K. St John, 2005. Reconstructing reticulate evolution in species — theory and practice. Journal of Computational Biology 12, 796–811.
Nurse, D., 1996. Historical classifications of the Bantu languages, in The Growth of Farming Communities in Africa from the Equator Southwards, vol. XXIX–XXX, ed. J.E.G. Sutton. Nairobi: British Institute in Eastern Africa, 65–75.
Pagel, M. & A. Meade, 2004. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Systematic Biology 53, 571–81.
Phillipson, D.W., 1993. African Archaeology. (Cambridge World Archaeology.) Cambridge: Cambridge University Press.
Rexova, K., D. Frynta & J. Zrzavy, 2003. Cladistic analysis of languages: Indo-European classification based on lexicostatistical data. Cladistics 19, 120–27.
Ringe, D., T. Warnow & A. Taylor, 2002. Indo-European and computational cladistics. Transactions of the Philological Society 100, 59–129.
Ruhlen, M., 1991. A Guide to the World’s Languages, vol. 1: Classification. London: Edward Arnold.
Saitou, N. & M. Nei, 1987. The Neighbor-Joining method: a new method for reconstruction of phylogenetic trees. Molecular Biology and Evolution 4, 406–25.
Sanderson, M.J. & M.J. Donoghue, 1989. Patterns of variation in levels of homoplasy. Evolution 43, 1781–95.
Schoenbrun, D.L., 1998. A Green Place, a Good Place: Agrarian Change, Gender and Social Identity in the Great Lakes Region to the 15th Century. (Social History of Africa series.) Portsmouth: Heinemann.
Stone, G., 2000. Phylogeography, hybridization and speciation. Trends in Ecology and Evolution 15, 354–5.
Swofford, D.L., 1998. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sunderland (MA): Sinauer Associates.
Vansina, J., 1984. West Bantu expansion. Journal of African History 25, 129–45.
Vansina, J., 1990. Paths in the Rainforests: Toward a History of the Political Tradition in Equatorial Africa. London: James Currey.
Williamson, K. & R.M. Blench, 2000. Niger-Congo, in African Languages: an Introduction, eds. B. Heine & D. Nurse. Cambridge: Cambridge University Press, 11–42.
Woese, C., 1998. The universal ancestor. Proceedings of the National Academy of Sciences of the USA 95, 6854–9.