- Source: Birch
- Source: BIRCH
A birch is a thin-leaved deciduous hardwood tree of the genus Betula (), in the family Betulaceae, which also includes alders, hazels, and hornbeams. It is closely related to the beech-oak family Fagaceae. The genus Betula contains 30 to 60 known taxa of which 11 are on the IUCN 2011 Red List of Threatened Species. They are typically short-lived pioneer species and are widespread in the Northern Hemisphere, particularly in northern areas of temperate climates and in boreal climates. Birch wood is used for a wide range of purposes.
Description
Birch species are generally small to medium-sized trees or shrubs, mostly of northern temperate and boreal climates. The simple leaves are alternate, singly or doubly serrate, feather-veined, petiolate and stipulate. They often appear in pairs, but these pairs are really borne on spur-like, two-leaved, lateral branchlets. The fruit is a small samara, although the wings may be obscure in some species. They differ from the alders (Alnus, another genus in the family) in that the female catkins are not woody and disintegrate at maturity, falling apart to release the seeds, unlike the woody, cone-like female alder catkins.
The bark of all birches is characteristically marked with long, horizontal lenticels, and often separates into thin, papery plates, especially upon the paper birch. Distinctive colors give the common names gray, white, black, silver and yellow birch to different species.
The buds, forming early and full-grown by midsummer, are all lateral, without a terminal bud forming; the branch is prolonged by the upper lateral bud. The wood of all the species is close-grained with a satiny texture and capable of taking a fine polish; its fuel value is fair.
= Flower and fruit
=The flowers are monoecious, and open with or before the leaves. Once fully grown, these leaves are usually 3–6 millimetres (1⁄8–1⁄4 in) long on three-flowered clusters in the axils of the scales of drooping or erect catkins or aments. Staminate catkins are pendulous, clustered, or solitary in the axils of the last leaves of the branch of the year or near the ends of the short lateral branchlets of the year. They form in early autumn and remain rigid during the winter. The scales of the mature staminate catkins are broadly ovate, rounded, yellow or orange colour below the middle and dark chestnut brown at apex. Each scale bears two bractlets and three sterile flowers, each flower consisting of a sessile, membranous, usually two-lobed, calyx. Each calyx bears four short filaments with one-celled anthers or strictly, two filaments divided into two branches, each bearing a half-anther. Anther cells open longitudinally. The pistillate segments are erect or pendulous, and solitary, terminal on the two-leaved lateral spur-like branchlets of the year. The pistillate scales are oblong-ovate, three-lobed, pale yellow-green often tinged with red, becoming brown at maturity. These scales bear two or three fertile flowers, each flower consisting of a naked ovary. The ovary is compressed, two-celled, and crowned with two slender styles; the ovule is solitary. Each scale bears a single small, winged nut that is oval, with two persistent stigmas at the apex.
Taxonomy
= Subdivision
=Betula species are organised into five subgenera.
Birches native to Eurasia include
Betula albosinensis – Chinese red birch (northern + central China)
Betula alnoides – alder-leaf birch (China, Himalayas, northern Indochina)
Betula ashburneri – (Bhutan, Tibet, Sichuan, Yunnan Provinces in China)
Betula baschkirica – (eastern European Russia)
Betula bomiensis – (Tibet)
Betula browicziana – (Turkey and Georgia)
Betula buggsii – (China)
Betula calcicola – (Sichuan + Yunnan Provinces in China)
Betula celtiberica – (Spain)
Betula chichibuensis – (Chichibu region of Japan)
Betula chinensis – Chinese dwarf birch (China, Korea)
Betula coriaceifolia – (Uzbekistan)
Betula corylifolia – (Honshu Island in Japan)
Betula costata – (northeastern China, Korea, Primorye region of Russia)
Betula cylindrostachya – (Himalayas, southern China, Myanmar)
Betula dahurica – (eastern Siberia, Russian Far East, northeastern China, Mongolia, Korea, Japan)
Betula delavayi – (Tibet, southern China)
Betula ermanii – Erman's birch (eastern Siberia, Russian Far East, northeastern China, Korea, Japan)
Betula falcata – (Tajikistan)
Betula fargesii – (Chongqing + Hubei Provinces in China)
Betula fruticosa – (eastern Siberia, Russian Far East, northeastern China, Mongolia, Korea, Japan)
Betula globispica – (Honshu Island in Japan)
Betula gmelinii – (Siberia, Mongolia, northeastern China, Korea, Hokkaido Island in Japan)
Betula grossa – Japanese cherry birch (Japan)
Betula gynoterminalis – (Yunnan Province in China)
Betula honanensis – (Henan Province in China)
Betula humilis or Betula kamtschatica – Kamchatka birch platyphylla (northern + central Europe, Siberia, Kazakhstan, Xinjiang, Mongolia, Korea)
Betula insignis – (southern China)
Betula karagandensis – (Kazakhstan)
Betula klokovii – (Ukraine)
Betula kotulae – (Ukraine)
Betula luminifera – (China)
Betula maximowicziana – monarch birch (Japan, Kuril Islands)
Betula medwediewii – Caucasian birch (Turkey, Iran, Caucasus)
Betula megrelica – (Republic of Georgia)
Betula microphylla – (Siberia, Mongolia, Xinjiang, Kazakhstan, Kyrgyzstan, Uzbekistan)
Betula nana – dwarf birch (northern + central Europe, Russia, Siberia, Greenland, Northwest Territories of Canada))
Betula pendula – silver birch (widespread in Europe and northern Asia; Morocco; naturalized in New Zealand and scattered locations in US + Canada)
Betula platyphylla – (Betula pendula var. platyphylla)—Siberian silver birch (Siberia, Russian Far East, Manchuria, Korea, Japan, Alaska, western Canada)
Betula potamophila – (Tajikistan)
Betula potaninii – (southern China)
Betula psammophila – (Kazakhstan)
Betula pubescens – downy birch, also known as white, European white or hairy birch (Europe, Siberia, Greenland, Newfoundland; naturalized in scattered locations in US)
Betula raddeana – (Caucasus)
Betula saksarensis – (Khakassiya region of Siberia)
Betula saviczii – (Kazakhstan)
Betula schmidtii – (northeastern China, Korea, Japan, Primorye region of Russia)
Betula sunanensis – (Gansu Province of China)
Betula szechuanica – (Betula pendula var. szechuanica)—Sichuan birch (Tibet, southern China)
Betula tianshanica – (Kazakhstan, Kyrgyzstan, Tajikistan, Uzbekistan, Xinjiang, Mongolia)
Betula utilis – Himalayan birch (Afghanistan, Central Asia, China, China, Tibet, Himalayas)
Betula wuyiensis – (Fujian Province of China)
Betula zinserlingii – (Kyrgyzstan)
Note: many American texts have B. pendula and B. pubescens confused, though they are distinct species with different chromosome numbers.
Birches native to North America include
Betula alleghaniensis – yellow birch (B. lutea) (eastern Canada, Great Lakes, upper eastern US, Appalachians)
Betula caerulea – blue birch (northeast of North America)
Betula cordifolia – mountain paper birch (eastern Canada, Great Lakes, New England US)
Betula glandulosa – American dwarf birch (Siberia, Mongolia, Russian Far East, Alaska, Canada, Greenland, mountains of western US and New England, Adirondacks)
Betula kenaica – Kenai birch ( Alaska, northwestern North America)
Betula lenta – sweet birch, cherry birch, or black birch (Quebec, Ontario, eastern US)
Betula michauxii – Newfoundland dwarf birch (Newfoundland, Labrador, Quebec, Nova Scotia)
Betula minor – dwarf white birch (eastern Canada, mountains of northern New England and Adirondacks)
Betula murrayana – Murray's birch (Great Lakes endemic)
Betula nana – dwarf birch or bog birch (also in northern Europe and Asia)
Betula neoalaskana – Alaska paper birch also known as Alaska birch or Resin birch (Alaska and northern Canada)
Betula nigra – river birch or black birch (eastern US)
Betula occidentalis – water birch or red birch (B. fontinalis) (Alaska, Yukon, Northwest Territories, western Canada, western US)
Betula papyrifera – paper birch, canoe birch or American white birch (Alaska, most of Canada, northern US)
Betula populifolia – gray birch (eastern Canada, northeastern US)
Betula pumila – swamp birch (Alaska, Canada, northern US)
Betula uber – Virginia round-leaf birch (southwestern Virginia)
= Etymology
=The common name birch comes from Old English birce, bierce, from Proto-Germanic *berk-jōn (cf. German Birke, West Frisian bjirk), an adjectival formation from *berkōn (cf. Dutch berk, Low German Bark, Danish birk, Norwegian bjørk), itself from the Proto-Indo-European root *bʰerHǵ- ~ bʰrHǵ-, which also gave Lithuanian béržas, Latvian Bērzs, Russian берёза (berëza), Ukrainian береза (beréza), Albanian bredh 'fir', Ossetian bærz(æ), Sanskrit bhurja, Polish brzoza, Latin fraxinus 'ash (tree)'. This root is presumably derived from *bʰreh₁ǵ- 'to shine, whiten', in reference to the birch's white bark. The Proto-Germanic rune berkanan is named after the birch.
The generic name Betula is from Latin, which is a diminutive borrowed from Gaulish betua (cf. Old Irish bethe, Welsh bedw).
Evolutionary history
Within Betulaceae, birches are most closely related to alder. The oldest known birch fossils are those of Betula leopoldae from the Klondike Mountain Formation in Washington State, US, which date to the early Eocene (Ypresian) around 49 million years ago.
Ecology
Birches often form even-aged stands on light, well-drained, particularly acidic soils. They are regarded as pioneer species, rapidly colonizing open ground especially in secondary successional sequences following a disturbance or fire. Birches are early tree species to become established in primary successions, and can become a threat to heathland if the seedlings and saplings are not suppressed by grazing or periodic burning. Birches are generally lowland species, but some species, such as Betula nana, have a montane distribution. In the British Isles, there is some difference between the environments of Betula pendula and Betula pubescens, and some hybridization, though both are "opportunists in steady-state woodland systems". Mycorrhizal fungi, including sheathing (ecto)mycorrhizas, are found in some cases to be beneficial to tree growth.
A large number of lepidopteran insects feed on birch foliage.
Uses
Because of the hardness of birch, it is easier to shape it with power tools; it is quite difficult to work it with hand tools.
Birch wood is fine-grained and pale in colour, often with an attractive satin-like sheen. Ripple figuring may occur, increasing the value of the timber for veneer and furniture-making. The highly decorative Masur (or Karelian) birch, from Betula verrucosa var. carelica, has ripple textures combined with attractive dark streaks and lines.
Birch plywood is made from laminations of birch veneer. It is light but strong, and has many other good properties. It is among the strongest and dimensionally most stable plywoods, although it is unsuitable for exterior use. Birch plywood is used to make longboards (skateboard), giving it a strong yet flexible ride. It is also used (often in very thin grades with many laminations) for making model aircraft.
Extracts of birch are used for flavoring or leather oil, and in cosmetics such as soap or shampoo. In the past, commercial oil of wintergreen (methyl salicylate) was made from the sweet birch (Betula lenta).
Birch-tar or Russian oil extracted from birch bark is thermoplastic and waterproof; it was used as a glue on, for example, arrows, and also for medicinal purposes.
Fragrant twigs of wintergreen group birches are used in saunas.
Birch is also associated with the feast of Pentecost in Central and Eastern Europe and Siberia, where its branches are used as decoration for churches and homes on this day.
Ground birch bark, fermented in sea water, is used for seasoning the woolen, hemp or linen sails and hemp rope of traditional Norwegian boats.
Birch twigs bound in a bundle, also called birch, were used for birching, a form of corporal punishment.
Many Native Americans in the United States and Indigenous peoples in Canada prize the birch for its bark, which because of its light weight, flexibility, and the ease with which it can be stripped from fallen trees, is often used for the construction of strong, waterproof but lightweight canoes, bowls, and wigwams.
The Hughes H-4 Hercules was made mostly of birch wood, despite its better-known moniker, "The Spruce Goose".
Birch plywood was specified by the BBC as the only wood that can be used in making the cabinets of the long-lived LS3/5A loudspeaker.
Birch is used as firewood because of its high calorific value per unit weight and unit volume. It burns well, without popping, even when frozen and freshly hewn. The bark will burn very well even when wet because of the oils it contains. With care, it can be split into very thin sheets that will ignite from even the smallest of sparks. Birch wood can be used to smoke foods.
Birch seeds are used as leaf litter in miniature terrain models.
Birch oil is used in the manufacture of Russia leather, a water-resistant leather.
= As food
=The inner bark is considered edible as an emergency food, even when raw. It can be dried and ground into flour, as was done by Native Americans and early settlers. It can also be cut into strips and cooked like noodles.
The sap can be drunk or used to make syrup and birch beer. Tea can be made from the red inner bark of black birches.
= Cultivation
=White-barked birches in particular are cultivated as ornamental trees, largely for their appearance in winter. The Himalayan birch, Betula utilis, especially the variety or subspecies jacquemontii, is among the most widely planted for this purpose. It has been cultivated since the 1870s, and many cultivars are available, including 'Doorenbos', 'Grayswood Ghost' and 'Silver Shadow'; 'Knightshayes' has a slightly weeping habit. Other species with ornamental white bark include Betula ermanii, Betula papyrifera, Betula pendula and Betula raddeana.
= Medical
=Approved topical medicine
In the European Union, a prescription gel containing birch bark extract (commercial name Episalvan, betulae cortex dry extract (5–10 : 1); extraction solvent: n-heptane 95% (w/w)) was approved in 2016 for the topical treatment of minor skin wounds in adults. Although its mechanism of action in helping to heal injured skin is not fully understood, birch bark extract appears to stimulate the growth of keratinocytes which then fill the wound.
Research and traditional medicine
Preliminary research indicates that the phytochemicals, betulin and possibly other triterpenes, are active in Episalvan gel and wound healing properties of birch bark.
Over centuries, birch bark was used in traditional medicine practices by North American indigenous people for treating superficial wounds by applying bark directly to the skin. Splints made with birch bark were used as casts for broken limbs in the 16th century.
= Paper
=Wood pulp made from birch gives relatively long and slender fibres for a hardwood. The thin walls cause the fibre to collapse upon drying, giving a paper with low bulk and low opacity. The birch fibres are, however, easily fibrillated and give about 75% of the tensile strength of softwood. The low opacity makes it suitable for making glassine.
In India, the birch (Sanskrit: भुर्ज, bhurja) holds great historical significance in the culture of North India, where the thin bark coming off in winter was extensively used as writing paper. Birch paper (Sanskrit: भुर्ज पत्र, bhurja patra) is exceptionally durable and was the material used for many ancient Indian texts. The Roman period Vindolanda tablets also use birch as a material on which to write and birch bark was used widely in ancient Russia as notepaper (beresta) and for decorative purposes and even making footwear (lapti) and baskets.
= Use in musical instruments
=Birch wood is sometimes used as a tonewood for semiacoustic and acoustic guitar bodies, and occasionally for solid-body guitar bodies. It is also a common material used in mallets for keyboard percussion. Drum manufacturers, such as Gretsch and Yamaha, have been known to use birch wood in the construction of drum shells, owing to its strength and colour which takes stain in an appealing way, and which can also amber over very well, while also giving the drums an appealing tone which changes depending on the type of birch used.
Culture
Birches have spiritual importance in several religions, both modern and historical. In Celtic cultures, the birch symbolises growth, renewal, stability, initiation, and adaptability because it is highly adaptive and able to sustain harsh conditions with casual indifference. Proof of this adaptability is seen in its easy and eager ability to repopulate areas damaged by forest fires or clearings. Birches are also associated with Tír na nÓg, the land of the dead and the Sidhe, in Gaelic folklore, and as such frequently appear in Scottish, Irish, and English folksongs and ballads in association with death, or fairies, or returning from the grave. The leaves of the silver birch tree are used in the festival of St George, held in Novosej and other villages in Albania.
The birch is New Hampshire's state tree and the national tree of Finland and Russia. The yellow birch is the official tree of the province of Quebec (Canada). The birch is a very important element in Russian culture and represents the grace, strength, tenderness and natural beauty of Russian women as well as the closeness to nature of the Russians. It's associated with marriage and love. There are numerous folkloric Russian songs in which the birch tree occurs. The Ornäs birch is the national tree of Sweden. The Czech word for the month of March, Březen, is derived from the Czech word bříza meaning birch, as birch trees flower in March under local conditions. The silver birch tree is of special importance to the Swedish city of Umeå. In 1888, the Umeå city fire spread all over the city and nearly burnt it down to the ground, but some birches, supposedly, halted the spread of the fire. To protect the city against future fires, wide avenues were created, and these were lined with silver birch trees all over the city. Umeå later adopted the unofficial name of "City of the Birches (Björkarnas stad)". Also, the ice hockey team of Umeå is called Björklöven, translated to English "The Birch Leaves".
"Swinging" birch trees was a common game for American children in the nineteenth century. American poet Lucy Larcom's "Swinging on a Birch Tree" celebrates the game. The poem inspired Robert Frost, who pays homage to the act of climbing birch trees in his more famous poem, "Birches". Frost once told "it was almost sacrilegious climbing a birch tree till it bent, till it gave and swooped to the ground, but that's what boys did in those days".
See also
Birch bark
Birch bark manuscript
Taxonomy of Betula
References
Sources
Furlow, John J. (1997). "Betula". In Flora of North America Editorial Committee (ed.). Flora of North America North of Mexico (FNA). Vol. 3. New York and Oxford: Oxford University Press – via eFloras.org, Missouri Botanical Garden, St. Louis, MO & Harvard University Herbaria, Cambridge, MA.
Li, Pei-chun; Skvortsov, Alexei K. "Betula". Flora of China. Vol. 4 – via eFloras.org, Missouri Botanical Garden, St. Louis, MO & Harvard University Herbaria, Cambridge, MA.
Grimshaw, John (2009). New Trees, Recent introductions to cultivation. Kew Publishing, RBG Kew. pp. 163–174.
Chisholm, Hugh, ed. (1911). "Birch" . Encyclopædia Britannica. Vol. 3 (11th ed.). Cambridge University Press.
External links
Tree Family Betulaceae Diagnostic photos of many species, Morton Arboretum specimens
Eichhorn, Markus (July 2010). "The Birch Tree". Test Tube. Brady Haran for the University of Nottingham.
BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets. With modifications it can also be used to accelerate k-means clustering and Gaussian mixture modeling with the expectation–maximization algorithm. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the database.
Its inventors claim BIRCH to be the "first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively", beating DBSCAN by two months. The BIRCH algorithm received the SIGMOD 10 year test of time award in 2006.
Problem with previous methods
Previous clustering algorithms performed less effectively over very large databases and did not adequately consider the case wherein a data-set was too large to fit in main memory. As a result, there was a lot of overhead maintaining high clustering quality while minimizing the cost of additional IO (input/output) operations. Furthermore, most of BIRCH's predecessors inspect all data points (or all currently existing clusters) equally for each 'clustering decision' and do not perform heuristic weighting based on the distance between these data points.
Advantages with BIRCH
It is local in that each clustering decision is made without scanning all data points and currently existing clusters.
It exploits the observation that the data space is not usually uniformly occupied and not every data point is equally important.
It makes full use of available memory to derive the finest possible sub-clusters while minimizing I/O costs.
It is also an incremental method that does not require the whole data set in advance.
Algorithm
The BIRCH algorithm takes as input a set of N data points, represented as real-valued vectors, and a desired number of clusters K. It operates in four phases, the second of which is optional.
The first phase builds a clustering feature (
C
F
{\displaystyle CF}
) tree out of the data points, a height-balanced tree data structure, defined as follows:
Given a set of N d-dimensional data points, the clustering feature
C
F
{\displaystyle CF}
of the set is defined as the triple
C
F
=
(
N
,
L
S
→
,
S
S
)
{\displaystyle CF=(N,{\overrightarrow {LS}},SS)}
, where
L
S
→
=
∑
i
=
1
N
X
i
→
{\displaystyle {\overrightarrow {LS}}=\sum _{i=1}^{N}{\overrightarrow {X_{i}}}}
is the linear sum.
S
S
=
∑
i
=
1
N
(
X
i
→
)
2
{\displaystyle SS=\sum _{i=1}^{N}({\overrightarrow {X_{i}}})^{2}}
is the square sum of data points.
Clustering features are organized in a CF tree, a height-balanced tree with two parameters: branching factor
B
{\displaystyle B}
and threshold
T
{\displaystyle T}
. Each non-leaf node contains at most
B
{\displaystyle B}
entries of the form
[
C
F
i
,
c
h
i
l
d
i
]
{\displaystyle [CF_{i},child_{i}]}
, where
c
h
i
l
d
i
{\displaystyle child_{i}}
is a pointer to its
i
{\displaystyle i}
th child node and
C
F
i
{\displaystyle CF_{i}}
the clustering feature representing the associated subcluster. A leaf node contains at most
L
{\displaystyle L}
entries each of the form
[
C
F
i
]
{\displaystyle [CF_{i}]}
. It also has two pointers prev and next which are used to chain all leaf nodes together. The tree size depends on the parameter
T
{\displaystyle T}
. A node is required to fit in a page of size
P
{\displaystyle P}
.
B
{\displaystyle B}
and
L
{\displaystyle L}
are determined by
P
{\displaystyle P}
. So
P
{\displaystyle P}
can be varied for performance tuning. It is a very compact representation of the dataset because each entry in a leaf node is not a single data point but a subcluster.
In the second step, the algorithm scans all the leaf entries in the initial
C
F
{\displaystyle CF}
tree to rebuild a smaller
C
F
{\displaystyle CF}
tree, while removing outliers and grouping crowded subclusters into larger ones. This step is marked optional in the original presentation of BIRCH.
In step three an existing clustering algorithm is used to cluster all leaf entries. Here an agglomerative hierarchical clustering algorithm is applied directly to the subclusters represented by their
C
F
{\displaystyle CF}
vectors. It also provides the flexibility of allowing the user to specify either the desired number of clusters or the desired diameter threshold for clusters. After this step a set of clusters is obtained that captures major distribution pattern in the data. However, there might exist minor and localized inaccuracies which can be handled by an optional step 4. In step 4 the centroids of the clusters produced in step 3 are used as seeds and redistribute the data points to its closest seeds to obtain a new set of clusters. Step 4 also provides us with an option of discarding outliers. That is a point which is too far from its closest seed can be treated as an outlier.
= Calculations with the clustering features
=Given only the clustering feature
C
F
=
[
N
,
L
S
→
,
S
S
]
{\displaystyle CF=[N,{\overrightarrow {LS}},SS]}
, the same measures can be calculated without the knowledge of the underlying actual values.
Centroid:
C
→
=
∑
i
=
1
N
X
i
→
N
=
L
S
→
N
{\displaystyle {\overrightarrow {C}}={\frac {\sum _{i=1}^{N}{\overrightarrow {X_{i}}}}{N}}={\frac {\overrightarrow {LS}}{N}}}
Radius:
R
=
∑
i
=
1
N
(
X
i
→
−
C
→
)
2
N
=
N
⋅
C
→
2
+
S
S
−
2
⋅
C
→
⋅
L
S
→
N
=
S
S
N
−
(
L
S
→
N
)
2
{\displaystyle R={\sqrt {\frac {\sum _{i=1}^{N}({\overrightarrow {X_{i}}}-{\overrightarrow {C}})^{2}}{N}}}={\sqrt {\frac {N\cdot {\overrightarrow {C}}^{2}+SS-2\cdot {\overrightarrow {C}}\cdot {\overrightarrow {LS}}}{N}}}={\sqrt {{\frac {SS}{N}}-({\frac {\overrightarrow {LS}}{N}})^{2}}}}
Average Linkage Distance between clusters
C
F
1
=
[
N
1
,
L
S
1
→
,
S
S
1
]
{\displaystyle CF_{1}=[N_{1},{\overrightarrow {LS_{1}}},SS_{1}]}
and
C
F
2
=
[
N
2
,
L
S
2
→
,
S
S
2
]
{\displaystyle CF_{2}=[N_{2},{\overrightarrow {LS_{2}}},SS_{2}]}
:
D
2
=
∑
i
=
1
N
1
∑
j
=
1
N
2
(
X
i
→
−
Y
j
→
)
2
N
1
⋅
N
2
=
N
1
⋅
S
S
2
+
N
2
⋅
S
S
1
−
2
⋅
L
S
1
→
⋅
L
S
2
→
N
1
⋅
N
2
{\displaystyle D_{2}={\sqrt {\frac {\sum _{i=1}^{N_{1}}\sum _{j=1}^{N_{2}}({\overrightarrow {X_{i}}}-{\overrightarrow {Y_{j}}})^{2}}{N_{1}\cdot N_{2}}}}={\sqrt {\frac {N_{1}\cdot SS_{2}+N_{2}\cdot SS_{1}-2\cdot {\overrightarrow {LS_{1}}}\cdot {\overrightarrow {LS_{2}}}}{N_{1}\cdot N_{2}}}}}
In multidimensional cases the square root should be replaced with a suitable norm.
BIRCH uses the distances DO to D3 to find the nearest leaf, then the radius R or the diameter D to decide whether to absorb the data into the existing leaf or whether to add a new leaf.
= Numerical issues in BIRCH clustering features
=Unfortunately, there are numerical issues associated with the use of the term
S
S
{\displaystyle SS}
in BIRCH. When subtracting
S
S
N
−
(
L
S
→
N
)
2
{\displaystyle {\frac {SS}{N}}-{\big (}{\frac {\vec {LS}}{N}}{\big )}^{2}}
or similar in the other distances such as
D
2
{\displaystyle D_{2}}
, catastrophic cancellation can occur and yield a poor precision, and which can in some cases even cause the result to be negative (and the square root then become undefined). This can be resolved by using BETULA cluster features
C
F
=
(
N
,
μ
,
S
)
{\displaystyle CF=(N,\mu ,S)}
instead, which store the count
N
{\displaystyle N}
, mean
μ
{\displaystyle \mu }
, and sum of squared deviations instead based on numerically more reliable online algorithms to calculate variance. For these features, a similar additivity theorem holds. When storing a vector respectively a matrix for the squared deviations, the resulting BIRCH CF-tree can also be used to accelerate Gaussian Mixture Modeling with the expectation–maximization algorithm, besides k-means clustering and hierarchical agglomerative clustering.
Instead of storing the linear sum and the sum of squares, we can instead store the mean and the squared deviation from the mean in each cluster feature
C
F
′
=
(
N
,
μ
,
S
)
{\displaystyle CF'=(N,\mu ,S)}
, where
n
{\displaystyle n}
is the node weight (number of points)
μ
{\displaystyle \mu }
is the node center vector (arithmetic mean, centroid)
S
{\displaystyle S}
is the sum of squared deviations from the mean (either a vector, or a sum to conserve memory, depending on the application)
The main difference here is that S is computed relative to the center, instead of relative to the origin.
A single point
x
{\displaystyle x}
can be cast into a cluster feature
C
F
x
=
(
1
,
x
,
0
)
{\displaystyle CF_{x}=(1,x,0)}
. In order to combine two cluster features
C
F
A
B
=
C
F
A
+
C
F
B
{\displaystyle CF_{AB}=CF_{A}+CF_{B}}
, we use
N
A
B
=
N
A
+
N
B
{\displaystyle N_{AB}=N_{A}+N_{B}}
μ
A
B
=
μ
A
+
N
B
N
A
B
(
μ
B
−
μ
A
)
{\displaystyle \mu _{AB}=\mu _{A}+{\frac {N_{B}}{N_{AB}}}(\mu _{B}-\mu _{A})}
(incremental update of the mean)
S
A
B
=
S
A
+
S
B
+
N
B
(
μ
B
−
μ
A
)
∘
(
μ
B
−
μ
A
B
)
{\displaystyle S_{AB}=S_{A}+S_{B}+N_{B}(\mu _{B}-\mu _{A})\circ (\mu _{B}-\mu _{AB})}
in vector form using the element-wise product, respectively
S
A
B
=
S
A
+
S
B
+
N
B
(
μ
B
−
μ
A
)
T
(
μ
B
−
μ
A
B
)
{\displaystyle S_{AB}=S_{A}+S_{B}+N_{B}(\mu _{B}-\mu _{A})^{T}(\mu _{B}-\mu _{AB})}
to update a scalar sum of squared deviations
These computations use numerically more reliable computations (c.f. online computation of the variance) that avoid the subtraction of two similar squared values. The centroid is simply the node center vector
μ
{\displaystyle \mu }
, and can directly be used for distance computations using, e.g., the Euclidean or Manhattan distances. The radius simplifies to
R
=
1
N
S
{\displaystyle R={\sqrt {{\frac {1}{N}}S}}}
and the diameter to
D
=
2
N
−
1
S
{\displaystyle D={\sqrt {{\frac {2}{N-1}}S}}}
.
We can now compute the different distances D0 to D4 used in the BIRCH algorithm as:
Euclidean distance
D
0
=
‖
μ
A
−
μ
B
‖
{\displaystyle D_{0}=\|\mu _{A}-\mu _{B}\|}
and Manhattan distance
D
1
=
‖
μ
A
−
μ
B
‖
1
{\displaystyle D_{1}=\|\mu _{A}-\mu _{B}\|_{1}}
are computed using the CF centers
μ
{\displaystyle \mu }
Inter-cluster distance
D
2
=
1
N
A
S
A
+
1
N
B
S
B
+
‖
μ
A
−
μ
B
‖
2
{\displaystyle D_{2}={\sqrt {{\frac {1}{N_{A}}}S_{A}+{\frac {1}{N_{B}}}S_{B}+{\big \|}\mu _{A}-\mu _{B}{\big \|}^{2}}}}
Intra-cluster distance
D
3
=
2
N
A
B
(
N
A
B
−
1
)
(
N
A
B
(
S
A
+
S
B
)
+
N
A
N
B
‖
μ
A
−
μ
B
‖
2
)
{\displaystyle D_{3}={\sqrt {{\frac {2}{N_{AB}(N_{AB}-1)}}\left(N_{AB}(S_{A}+S_{B})+N_{A}N_{B}{\big \|}\mu _{A}-\mu _{B}{\big \|}^{2}\right)}}}
Variance-increase distance
D
4
=
N
A
N
B
N
A
B
‖
μ
A
−
μ
B
‖
2
{\displaystyle D_{4}={\sqrt {{\frac {N_{A}N_{B}}{N_{AB}}}{\big \|}\mu _{A}-\mu _{B}{\big \|}^{2}}}}
These distances can also be used to initialize the distance matrix for hierarchical clustering, depending on the chosen linkage. For accurate hierarchical clustering and k-means clustering, we also need to use the node weight
N
{\displaystyle N}
.
Clustering Step
The CF-tree provides a compressed summary of the data set, but the leaves themselves only provide a very poor data clustering.
In a second step, the leaves can be clustered using, e.g.,
k-means clustering, where leaves are weighted by the numbers of points, N.
k-means++, by sampling cluster features proportional to
S
+
N
min
i
|
|
μ
−
c
i
|
|
{\displaystyle S+N\min _{i}||\mu -c_{i}||}
where the
c
i
{\displaystyle c_{i}}
are the previously chosen centers, and
(
N
,
μ
,
S
)
{\displaystyle (N,\mu ,S)}
is the BETULA cluster feature.
Gaussian mixture modeling, where also the variance S can be taken into account, and if the leaves store covariances, also the covariances.
Hierarchical agglomerative clustering, where the linkage can be initialized using the following equivalence of linkages to BIRCH distances:
Availability
ELKI contains BIRCH and BETULA.
scikit-learn contains a limited version of BIRCH, which only supports D0 distance, static thresholds, and which uses only the centroids of the leaves in the clustering step.
References
Kata Kunci Pencarian:
- Burja
- Angela Birch
- Konjektur Birch–Tate
- Bandar Udara Birch Mountain
- Evan Bayh
- Charles Birch
- Chloe Birch
- Jack W. Birch
- Herbert G. Birch
- Thora Birch
- Birch
- BIRCH
- Birching
- Thora Birch
- Betula papyrifera
- Birch (disambiguation)
- Betula pendula
- Birch bark
- Ornäs birch
- Birch wood