Author Archives: Vitaliy Kurlin

Introduction to Periodic Geometry

This post by Phil Smith and Vitaliy Kurlin will introduce the new research area of periodic geometry. Using crystals as motivation, the periodic geometry studies periodic sets of points and their equivalence classes. We will describe the classification problem of periodic sets up to rigid motions and the work we have carried out in trying to tackle this problem.

Crystals

Solid periodic crystalline materials, or simply crystals, are everywhere, ranging from basic everyday substances like table salt to the height of luxury in diamonds. Crystals are highly geometric structures, and are good examples of real-world occurrences of periodic geometry.

This is because periodic crystals are composed of a building block (called a unit cell) that is repeated in three linearly independent directions. To study crystals from a mathematical angle, we therefore need a good grasp of this periodic geometry, so let us introduce the main objects.

crystals in a phone

Lattices

Mathematically, given a set of \(n\) linearly independent vectors \(v_1, v_2, …, v_n\in \mathbb{R}^n\), a lattice \(\Lambda\) is defined to be the set of all points that are integer linear combinations of the \(v_i\). Namely, \(\Lambda = \{\sum\limits_{i = 1}^n c_iv_i \,|\, c_i \in \mathbb{Z}\}\). As a 2D example (so \(n = 2)\), if we take \(v_1 = (1, 0)\) and \(v_2 = (0.5, \frac{\sqrt{3}}{2})\), then we get the hexagonal lattice below. Try to make out the hexagons!

hexagonal lattice

The unit square lattice consists of all points with integer coordinates  \((x, y)\), where \(x, y \in \mathbb{Z}\). What pairs of vectors can generate this lattice? See all exercises at the end of the post.

Unit Cells

Let \(\Lambda\) be a lattice generated by the vectors \(v_1, …, v_n \in \mathbb{R}^n\). Then restricting the coefficients of the vectors to the interval [0, 1) yields a unit cell for the lattice. For the hexagonal lattice with vectors as given above, this would yield the unit cell \(U_2\) in the picture below. But as the picture below shows, there are infinitely many possible unit cells for a given lattice, partly explained by the fact that there are infinitely many sets of vectors that can generate the same lattice.

unit cells

In short, a unit cell is any region of the space that is spanned by some basis vectors, hence can be periodically repeated to tile the whole space.

Periodic Sets

A periodic set can be defined by a unit cell and its finite subset called a motif. The left picture below shows a unit cell for the hexagonal lattice, and a 5-point motif of the unit cell.

MotifPeriodic Set

If we repeatedly translate the motif on the left by the given basis vectors, we get a periodic set in the right hand side picture above. Mathematically, for a motif M and a lattice \(\Lambda\), a periodic set can be written as a Minkowski sum \(A = M+\Lambda = \{m + v \, | \, m \in M, v \in \Lambda\}\).

To make periodic sets more practical models of real crystals, one can add chemical types or other properties of atoms or ions as labels of points. Similarly, one can represent chemical bonds with any extra information as labeled edges and get a periodic graph.

Equivalency up to Rigid Motions

Most solid crystals are rigid bodies. Hence using periodic sets as an abstract description of a crystal, it makes sense to say that two periodic sets are equivalent if they can be related to each other by a rigid motion.

A rigid motion is an operation on a space that preserves distances as well as orientation. If the space is \(\mathbb{R}^3\), a rigid motion is a composition of translations and rotations around straight lines.

Are any of the three periodic sets below, with unit cells drawn, equivalent to any of the others? Feel free to submit your answer as a comment to this post.

Lattice1

We are interested in coming up with a solution to deciding if two periodic sets are equivalent, and also by quantifying how much they differ if they are not equivalent…

Classification Problem

Treating crystals as periodic sets, we desire a fingerprint function from the complicated space of periodic sets to a simpler space, for example \(\mathbb{R}^n\). There are trivial solutions to this problem, but for this fingerprint function to be helpful, we desire that it satisfies these three properties:

  1. Invariant under a change of unit cell and rigid motions of \(\mathbb{R}^3\);
  2. Stable under perturbations of points in a periodic set;
  3. Computable in a polynomial time in the size of a motif;
  4. Complete, meaning that non-equivalent periodic sets are mapped to different objects.
  5. Invertible in the sense that any periodic set can be reconstructed from this fingerprint.

A unit cell of a crystal is not invariant under a change of basis, but its minimal volume in \(\mathbb{R}^3\) is invariant. Can you give an example of a periodic set whose minimal cell discontinuously changes its volume under a small perturbation?

One well-known stable invariant is the density, which is the number of points in a motif divided by the volume of a corresponding unit cell. This single number contains too little information and can not be complete, because many non-equivalent sets have the same density.

Density Functions

Finally we introduce density functions, which we believe could be a solution to the above problem. For a periodic set \(A\), we denote by \(B(A;t)\) the set of all balls of radius \(t\) centred at all points \(a \in A\). If the periodic set \(A\) has a unit cell \(U\), we define the \(k\)-th density function \(\psi_k^A(t)\) as the fractional volume of the unit cell \(U\) to be covered by exactly k balls of \(B(A;t)\).

square_lattice_r=025square_lattice_r=055square_lattice_r=075square_lattice_r=1

square_lattice_densities

Above and below are two sets of images, the pictures above are for the square lattice and the pictures below are for the hexagonal lattice with distance 1 between closest points. In each set, the top row shows 4 images of the lattice with balls growing around each point of radii 0.25, 0.55, 0.75, 1. The bottom pictures show the eight density functions \(\psi_k^{\Lambda}(t)\) for each lattice \(\Lambda\).

hexagonal_lattice_r=025hexagonal_lattice_r=055hexagonal_lattice_r=075hexagonal_lattice_r=1

hexagonal_lattice_densities

Density functions are by construction invariant under rigid motions of \(\mathbb{R}^3\), and less trivially, in collaboration with Teresa Heiss, Mathijs Wintraecken and Herbert Edelsbrunner from IST Austria, we have proven the density functions are stable in work that we hope to publish soon.

Exercises on periodic sets

  • Q1. For the unit square lattice in the plane, what pairs of vectors can be a basis? Hint: there are infinitely many such pairs. Can you describe all of them?
  • Q2. Which of the 3 periodic sets with red points and green unit square cells above equivalent to each other up to rigid motions and why?
  • Q3. Give an example of a periodic set whose minimal cell discontinuously changes its volume under a small perturbation.

If you wish to solve the exercises, feel free to reply or email Philip Smith or Vitaliy Kurlin.

Mergregram of a dendrogram

This post describes the new invaiant introduced in the paper “The mergegram of a dendrogram and its stability” (arxiv:2007.11278) by Yury Elkin and Vitaliy Kurlin, which was peer-review and presented in a 25-min video at MFCS 2020 (Mathematical Foundations of Computer Science).

The mergegram is the new intermediate stable-under-noise invariant of a cloud of a points. The mergegram sits between (1) the 0-dimensional persistence of a distance-based filtration on the cloud, which is stable, but weak as an isometry invariant, and (2) the dendrogram of clustering, which is much stronger, but is combinatorially unstable under perturbations of given points. The picture below shows a dendrogram on the left and its mergegram on the right.

What is a single-linkage clustering?

A clustering is any splitting of a given dataset into subgroups or clusters so that data points within each cluster are more similar to each other to data points from other clusters.
There are many measures of similarity, hence thousands of clustering algorithms.

The single-linkage (SL) clustering is one of the simplest methods, which can be applied to any abstract data points given by pairwise distances between them. For a distance threshold s>0, two points a,b belong to the same single-linkage cluster, if they are joined by a finite sequence of points a=a1, a2, … , ak=b such that the distance between any successive points is at most 2s. Here the factor 2 is chosen only for convenience of further constructions. Instead of a distance at most 2s one can say that the closed balls of radius s centered at given points intersect.

SL clustering on a 10-point cloud

For the scale s=0, each given point form its own 1-point cluster. For any large enough s (larger than the maximum distance between given points), all points are in the same single cluster.

What is a dendrogram of clustering?

The dendrogram of clustering visualizes how cluster evolve when a threshold grows. For the cloud of 10 points in the Euclidean plane, the SL clustering starts from 10 isolated points at threshold or scale s=0.

10-point cloud at scale 1.1   cloud10_dendrogram

A denrogram is a tree whose edges represent evolving clusters and internal nodes represent mergers of clusters. Each vertical edge `spans a life’ of a cluster while a scale s is increasing. The ten vertical edges at the bottom of the dendrogram above correspond to 10 given points.

A dendrogram is well-defined not only for the single-linkage clustering, but for any hierarchical clustering algorithm. A clustering is called hierarchical if by increasing a threshold one can only merge clusters, but never split them.

Instability of a dendrogram under noise.

More than two clusters can simultaneously merge into one. The 10-point cloud in the left picture above has the dendrogram on the right, where two quadruples of points merge into two clusters at the scale of s≈1.1.

Slightly perturbing one of these points, one can transform any 4-to-1 merger into a sequence of simpler 3-to-1 or 2-to-1 mergers of clusters. Hence the combinatorial structure, e.g. the list of vertex degrees, of a dendrogram is unstable under perturbations of points.

  cloud10_perturbed_dendrogram

All real life data are noisy, hence our data descriptors should tolerate at least small noise.
Moreover, the problem of quantifying a similarity requires that any slightly different data objects have close values of their descriptors.

What is the mergegram of a dendrogram?

The mergegram resolves the instability of a dendrogram by extracting its stable pairs (birth,death) or dots in the plane diagram below. This stability will be discussed in a future post.

The mergegram of a dendrogram consists of the pairs (birth,death) in a 1-1 correspondence with all clusters that live from a scale s=birth to a scale s=death. If k clusters have identical life spans from their birth to death, then the mergegram includes the pair (birth, death) with multiplicity k.

In the example below at the scale s=0.5, the points 9,10 merge into the cluster {9,10}, so the two rightmost vertical edges merge into one (shows as joined by a horizontal edge in the dendrogram). This merger contributes two pairs (0,0.5) to the mergregram.

The cluster {9,10} lives from its birth at the scale s=0.5 until its merger with the cluster {4,.6}(death at s=1.5), hence contributes the pair (0.5,1.5) to the mergregram on the right below.

dendrogram+mergegram

Experts in Topological Data Analysis have already recognized that the mergegram of pairs (birth,death) looks similar to a persistence diagram. In the case of clustering, the corresponding 0-dimensional persistence has all births equal to 0 (points born at scale 0), hence only pairs of the form (0,death). The mergegram contains pairs with positive births. A future post will include a couple of clouds with identical persistence, but different mergegrams, see Q2 below.

Exercises on understanding the new concept of a mergegram.

  • Q1. Find a cloud C of points in the real line whose SL dendrogram is unstable under perturbation of points. Draw the mergegram of C or list its pairs in the form (birth,death).
  • Q2. Let a cloud A in the real line have ordered points a1<a2<…<an. Find another cloud B of ordered points b1<b2<…<bn such that (1) B is not related to A by a translation or reflection, (2) A has the same unordered set of distances ak+1-ak between successive points as B, (3) the mergegrams of \(A\) and \(B\) are different.

You could write a brief answer or feedback in your comment: reply.

How to join the research group

This post might help potential intern students, PhD candidates and postdoctoral fellows, who would like to join the Data Science Theory and Applications group.

data-science-areas

  • Why is a multi-stage selection needed for candidates to join our group?
  • Stage 1: applications to Materials Science, your skills and expectations.
  • Stage 2: tuition fees and potential funds for your living costs at Liverpool.
  • Stages 3-4: University PhD requirements about past degrees and English.
  • Stages 5-6: your CV and video introduction of motivations and skills.
  • Stages 7-8: problem solving and a 30-min informal Skype discussion.
  • Stages 9-10: an official online application and a formal interview.
  • Traditional exercises to engage with all readers of this blog.

Why is a multi-stage selection needed for candidates to join our group?

Every week I receive several requests for supervision, mainly from PhD candidates, and also from intern students and postdoctoral fellows. Update: in March 2021 I am supervising

  • 7 PhD students as the first supervisor,
  • 6 PhD students as the second supervisor,
  • 3 PhD students as the third supervisor.

The multi-stage selection is justified by the following reasons.

  • An objective and transparent process described below is better than ad-hoc decisions.
  • PhD admissions at the University of Liverpool follow rules from higher authorities.
  • Since I’m already overloaded, there is a physical limit of potential supervision.
  • I seriously consider any supervision, hence a selection process is also serious.
  • Mentees seem to be much more motivated after they were properly selected.

Stage 1: applications to Materials Science, your skills and expectations.

Though a few current students complete their PhDs in different areas, my main research is for the Materials Innovation Factory. Hence all Liverpool-funded PhD projects will be focused on applications of geometry and topology to Materials Science, in particular crystals. A supervision in other areas, e.g. Computer Vision, can be considered if you can come with external funds.

The previous knowledge of Chemistry is not needed, however you should be open-minded enough to learn new research topics. Mathematical crystallographers and computational chemists are welcome, because the group already has strong skills in theoretical areas.

Most relevant subjects are Mathematics (linear algebra, geometry or topology, group theory) and Computer Science (graph algorithms, computational geometry, optimisation). The key requirement is your strong programming experience: C++ (preferable) and/or Python. We use software libraries in C++ and Python APIs (Application Programming Interfaces) of databases.

Since the quality is more important than quantity, new members of the group across all levels will be more rigorously selected. Everyone can expect on average 1 hour per week of 1-1 time to discuss their research, sometimes with co-supervisors or other colleagues, because I also teach big classes of undergraduates and have important administrative commitments.

Short term project students or interns can expect maximum 1 hour (a half-hour on average) per week of 1-1 time. These students may more frequently talk to a postdoc or a good PhD student. Collaborative projects are often discussed at the weekly group seminar or in smaller groups.

Everything is possible and highly motivated students, who make regular substantial progress and can present their results in a clear concise form, may naturally get more attention (hence more time for discussions) and recognition for genuine enthusiasm and hard work. Once I’ve talked to a 3rd year undergraduate for nearly 4 hours, because the project was important.

So Stage 1 is your own evaluation of your skills and motivations. You have passed this stage if

  • you are motivated to work with other people and learn new topics;
  • you can demonstrate your programming skills in C++ and/or Python;
  • you have sufficient expertise in Mathematics and/or Computer Science;
  • (for PhD candidates) you agree with the rules of the PhD programme.

Stage 2: tuition fees and potential funds for your living costs at Liverpool.

This stage is most important for non-European PhD candidates and intern students.

All available PhD positions are funded at the level of UK tuition fees (£4500 per year from 2021) plus the standard bursary (£15609 per year from 2021), which is tax free.

The tuition fees for non-UK students are much higher (£24250 per year from 2021). A UK embassy might ask for a proof of your funds for living costs when you apply for a visa. The university recommends a minimum £9135 for a single person per year.

The recommended websites for checking prices and accommodation at Liverpool are Liverpool Student Homes, Rightmove, Zoopla. You could say in advance how you plan to cover costs.

The good news is the list of scholarships for international PhD candidates. The financial requirements are outside my control. I get no financial rewards for any supervision.

Stages 3-4: university PhD requirements about past degrees and English.

These stages are formal requirements by the university for all PhD candidates.

Stage 3 is to make sure that your past degree is equivalent to at least 2:1 (roughly grade point average 60%) in the UK classification. Please e-mail the PGR office for more details.

If you are a citizen of a country, where English is the main language, or you have completed your degree in such a country, your English is acceptable and you can go to next stage 5.

In all other cases Stage 4 is to think about required IELTS grades (overall 6.5, minimum 6 in each component) or any equivalent. These grades should be obtained not later than 2 years before a PhD start date. You could postpone passing IELTS until after receiving an offer conditional on IELTS grades. Please e-mail the PGR office for more details.

Stages 5-6: your CV and video introduction of motivations and skills.

Most candidates start from Stage 5 by sending their CV, which is ok for postdoctoral fellows or local students interested in final year projects or summer dissertations. PhD candidates could follow Stages 1-4 above. Intern candidates may think about finances at Stage 2.

Most efficient e-mails are short (as this phrase). All details can be in your CV (needed) and informal cover letter (optional, only if you would like to express your motivations and highlight relevant experience). File names could include your name, e.g. Last_name.First_name.CV.pdf.

You could include a brief description of (or give links to) your past programming projects, e.g. what software libraries have you used and what challenges have you overcome?

The most important expertise to work in our diverse group is your communication skills. You are expected to communicate well with colleagues from different research areas and industry.

If I have replied to your initial expression of interest, Stage 6 is to e-mail me your short 1-2 min video presentation to introduce yourself. Though your English will be formally checked and some candidates include photos in their CVs, your video will quickly show how you talk.

For your self-presentation, I may ask to highlight any relevant skills depending on your CV. At any stage if I haven’t replied within a week, you could e-mail me again once, please not more.

Stages 7-8: problem solving and a 30-min informal Skype discussion.

Congratulations if you have completed previous Stage 6! Depending on your level, at Stage 7 you will solve mathematical problems and programming exercises. You could have 1-2 weeks for preparing your solutions. We can discuss a suitable period if you are currently busy.

If I am relatively free, Stage 8 is to arrange an informal chat by Skype, say up to 30 min. Similarly to a video presentation, a real time discussion will help to establish our future relationships.

For intern students or postdoctoral fellows applying for external grants, this Stage 8 can be the last one. After that I usually talk to the head of department to arrange an invitation letter.

Stages 9-10: an official online application and a formal interview.

Stage 9 for PhD candidates is to submit a formal application at the university website. If you have passed Stages 1-8, you could mention me as a potential supervisor, possibly a project title with a short description (if already agreed).

Postdoctoral candidates will have another link to the application from a job advert. The advice for postdocs is not to start from this Stage 9, but e-mail me your expression of interest.

Stage 10 is a formal interview for PhD candidates and postdoctoral fellows. All PhD students at Liverpool should have at least two supervisors (the standard split is 80/20). Hence a co-supervisor is involved in an interview, often by Skype for PhD candidates outside the UK.

Postdoctoral candidates will face an interview panel of at least 3 people including a representative from HR. We try to invite European candidates for on-site interviews, though Skype interviews are also possible. If anything seems unclear, feel free to e-mail me your questions. If you contact me for the first time, I would be grateful if you say how you found me.

Traditional exercises to engage with all readers of this blog

    • Q1. How many stages will PhD candidates pass before joining the group?
    • Q2. How many people were under my overall supervision in March 2021?

You could write a brief answer or feedback in your comment: reply.

Can you detect an Atmospheric River?

This post is jointly completed by Dr Vitaliy Kurlin and his new student Grzegorz Muszynski, who has started a PhD on “Topological Analysis of the Climate System” at University of Liverpool in April 2017 funded by Intel through the Big Data Centre at the Lawrence Berkeley lab (US).

What is an Atmospheric River?

An Atmospheric River (AR) is a narrow filament of concentrated water vapour in the atmosphere, usually up to several thousand kilometers long and a few hundred kilometers wide.
These filaments were called Atmospheric Rivers in the paper “Atmospheric rivers and bombs” (pdf) in 1994, because a single filament can carry more water than the Amazon River. Hence an Atmospheric River can be informally considered as a “river” flowing in the atmosphere.

integrated_water_vapour

The picture above shows the integrated water vapour (IWV) measured in grams over a squared millimetre, formally the mass of water in the vertical column over a square 1×1 mm. Higher values of IWV correspond to the red colour, lower values are shown by the blue colour.

The red box from the picture above is zoomed in the picture below showing how an Atmospheric River hits the California coast in the US.

AR_hits_California

Why are Atmospheric Rivers important?

At any given moment there are 3-5 Atmospheric Rivers on the planet and all of them contribute over 90% to the global north-south water vapour transport. When an Atmospheric River hits a coast, this “river” flows down to the land as heavy rain, which causes severe floods.

These extreme weather events regularly happen along the West Coast of North America, Western Europe and the west coast of North Africa, e.g. read “Rivers in the Sky Are Flooding The World With Tropical Waters” (pdf).

The paper “Winter floods in Britain are connected to atmospheric rivers” (pdf) justifies that all winter floods in the UK in 2000-2010 were caused by Atmospheric Rivers including the 19th November 2009 severe flood on the River Eden in Cumbria (UK).

How are Atmospheric Rivers detected?

The input for a detection is a scalar field of the Integrated Water Vapour (IWV) over a regular grid whose lines are usually parallel to meridians and longitudes. The input can be visualised as a matrix of IWV values that are obtained from weather observations or computer simulations. So every node in the regular grid has an associated value of the Integrated Water Vapour and is connected to the four neighbours (north, west, south, east) in the grid.

High moisture regions that bring water vapour from mid-latitudes in the ocean up to the land in the north are called Atmospheric Rivers to distinguish them from other high moisture regions that don’t cause floods. A detection algorithm should identify only Atmospheric Rivers.

IWV_time261

The picture above shows a big hole in the yellow-red region that doesn’t form an elongated filament. The picture below contains the yellow high moisture region without holes, but this filament doesn’t reach the coast. Hence there are no Atmospheric Rivers in both cases.

IWV_time129

The traditional approach to detect an Atmospheric River is to fix a threshold of the Integrated Water Vapour, say 20 g/mm2, and consider all nodes with values above this threshold. If these nodes form a connected component in the regular grid that has expected geometric parameters (length and width) and also joins the mid-latitude region (the bottom line of the chosen box) with the California coast, the latest detection algorithm in the TECA software (Toolkit for Extreme Climate Analysis, pdf) says that an Atmospheric River is detected.

The state-of-the-art algorithms work only for carefully chosen parameter values. Many Climate Scientists propose different values. That is why we are now working on a parameterless approach combining ideas of Topological Data Analysis with Machine Learning.

Exercise: detect an Atmospheric River

Which of the pictures below show Atmospheric Rivers in your opinion and why?

  • Q1 IWV_time3
  • Q2 IWV_time22
  • Q3 IWV_time326

You could write a brief answer or feedback in your comment: reply.

Topological Computer Vision is a new research area

google-car

This post motivates the new research area of Topological Computer Vision.

Why don’t we have self-driving cars yet?

Here is the response by Dr. Andreas Wendel (Google) from his invited talk Self-Driving Cars at the CVPR 2015 workshop Computer Vision in Vehicle Technology: “We can’t predict all possible road accidents. In the weirdest case our car stopped and waited for an old lady in a wheelchair chasing a duck with a broomstick … in the middle of a road!”

lady-broomstick A Google car makes about 200 decisions per second. If any of these 200 decisions is wrong, there could be a fatal accident. Self-driving cars will appear on the market when the error rate is less than 0.01%. My collaborator Andrew Fitzgibbon from Microsoft Research Cambridge has predicted that we might wait for another 10 years.

Stability under noise is still a problem

blue-brain-nets The current flagship method in speech and image recognition is Deep Learning. Briefly, an algorithm is trained (often for weeks) to predict correct outputs from big labelled data. For instance, the ImageNet database has more than 14M images split into over 21K categories like cars, frogs etc. These images were manually labelled by humans, which required about 25K Amazon Mechanical Turks.

During the training, the algorithm finds features that best split all labelled images into required categories. During validation, the algorithm chooses the category whose features are closest to those of a new given image. The overall error rate, when the algorithm mis-classifies images, is about 6.7%, see page 20 in ImageNet Large Scale Visual Recognition Challenge (arXiv/1409.0575, 8M). However, exercises below how this approach fails in the presence of little noise.

First key results of Topological Computer Vision

Topological Computer Vision was introduced as a new research area within Topological Data Analysis (TDA) in the invited talk at the scoping workshop of the Alan Turing Institute at Oxford on 10th September 2015.

cloud10points       cloud10points-hopes
The first key concept is a Homologically Persistent Skeleton (HoPeS) depending only on a point cloud C without extra input parameters. HoPeS(C) is the first structure that provides a closed geometric approximation to an unknown graph given only by a noisy sample C. Details are in

The big aim is to combine the stable-under-noise persistence from TDA with the state-of-the-art tools of Deep Learning that currently suffer from noise.

Exercises on analyzing noisy images by your (!) deep learning

  • Q1. The middle image below is the difference (multiplied by 10) between 2 dog images. One image is correctly recognized by the state-of-the-art Deep Learning Net as a dog. However, another image with little added noise is misclassified as an … ostritch. Where is the noisy image: on the left or on the right?dog+noise
  • Q2. This is an example from CVPR 2015, the top conference in Computer Vision and Pattern Recognition). Can you guess how the image below is mis-classified?
    (Hint for possible answers: kid’s drawing, pedestrian, school bus, trademark).black-yellow-strips

You could write a brief answer or feedback in your comment: reply.

What is topological data analysis about?

heart-red-cloudThis post answers the following questions about topological data analysis:

  • What does the word data mean?
  • What are the practical aims?
  • What is a typical problem?

Input data in topological data analysis

The usual data input in topological data analysis is a noisy sample of points in a Euclidean or in a more general metric space. For example, a black-and-white image can be given as a finite sample of black points in the plane.
black-cloud-with-holesMore generally, data can be sampled from any topological shape (or a space). Examples of shapes below are a graph, a figure-eight shape in the plane, a 2-dimensional torus.

K5symmetric figure-eight-shape   green-torus2D

Practical aims of topological data analysis

The ultimate goal is to understand the meaning of data. The practical aims are the following:

  • Represent data in an easy way for further processing. Why?
    We need to find and easily encode extra structures to work later
    with structured mathematical objects rather than with raw data.
  • Quantify or measure given data by topological invariants. Why?
    Extra structures allow us to find topological invariants that
    depend only on the shape of data, not on extra structures.
  • Make robust statistical predictions about topology of data. Why?
    If data are huge, any algorithm inevitably works on subsamples and
    we should be able to take (say) the average of all topological outputs.

We give links to more details about each of the 3 practical aims above:

Easy example of a typical hard problem

Our trained human eye can recognize a familiar heart shape in the cloud of red points below. The red heart shape is easy enough and we could connect each point with its two nearest neighbors to get a reasonable contour.

heart-black-contourHowever, a robust contour detection in noisy images is still a hot problem in computer vision. Topological data analysis looks for methods beyond the simplest nearest neighbor search. The key idea is not to fix a scale parameter when searching for neighbors, but analyze a summary of data over all scales so that this summary is stable under noise.

In conclusion, we highlight answers to the questions posed at the beginning of this post:

  • the input data are finite clouds of points without much structure
  • the practical aims are to represent and measure data in easy ways
  • a typical problem is to reconstruct contours from a noisy sample.

Exercises on the introduction to topological data analysis

  • Q1. What topological shape can one reconstruct from the cloud of points given
    at the beginning of the post? Hint. You have seen a partial yellow shape above.
  • Q2. What can you see in the black-and-white image below? Hint. Find an animal.

dalmatian

You could write a brief solution or feedback in your comment: reply.

Simplicial complexes are discretizations of real-life shapes

yellow-ring-6triangles In this post we answer the following questions of topological data analysis:

  • what are simplicial complexes and what are they not?
  • how can a shape be represented by a simplicial complex?
  • why are simplicial complexes easy for representing shapes?

Simplices are building blocks of complexes

A simplicial complex is a high-dimensional generalization of a graph. That is why simplicial complexes are sometimes called hypergraphs. The building blocks of a simplicial complex are vertices, edges, triangles and higher-dimensional simplices like tetrahedra in \(\mathbb{R}^3\).
simplices

The standard k-dimensional simplex is the subset of points in the \(k\)-dimensional space:
\(\Delta^k=\{(x_1,\dots,x_k)\in\mathbb{R}^k \; :\; 0\leq x_i\leq 1,\; \sum\limits_{i=1}^k x_i=1\}\).

We may say that the simplex \(\Delta^k\) is spanned by its \(k+1\) vertices marked by (say) \(0,1,\dots,k\). Any subset of \(l\) vertices spans an \(l\)-dimensional face (or a subsimplex) of \(\Delta^k\).

Many geometrically different shapes are homeomorphic (topologically equivalent) to the same simplex \(\Delta^k\subset\mathbb{R}^k\). For instance, the 2-dimensional disk \(\{(x,y)\in\mathbb{R}^2\; :\; x^2+y^2\leq 1\}\) is homeomorphic to the standard triangle \(\Delta^2\). Then the combinatorial structure of \(\Delta^2\) induces a similar structure on the disk, namely 3 vertices, 3 edges and one 2-dimensional face.

curved-triangle

A triangulation is a structure on a shape

A triangulation of (or the structure of a simplicial complex on) a shape is

  • a splitting into finitely many pieces homeomorphic to simplices in such a way that
  • the intersection of any two simplices in the triangulation is their common face.

A shape is called triangulable if it has a triangulation satisfying the above conditions. A simplicial complex may contain simplices of different dimensions as shown in the sun glasses below. The dimension of a simplicial complex is the maximum dimension of its simplices.

sun-glasses-encodingRepresenting a shape as a simplicial complex allows us to introduce later topological invariants of shapes. For instance, the Euler characteristic of a shape is expressed via the number of \(l\)-dimensional simplices in a trianglation for different values of \(l\).

Encoding simplicial complexes

The condition on intersection of simplices in dimension 1 means that any two edges can meet only at a single common vertex. Hence a 1-dimensional simplicial complex can not have loops or double edges with the same endpoints. This condition seems too restrictive for graphs, but is essential for higher dimensions as explained below.

Any simplicial complex with vertices marked by (say) \(0,1,\dots,n\) can be encoded by a list of maximum simplices that are not contained in larger simplices. Any maximum simplex spanned by \((k+1)\) vertices \(i_0,\dots,i_k\) is encoded by the unordered \((k+1)\)-tuple \((i_0,\dots,i_k)\). The sun glasses with 8 vertices above are encoded by the list (012),(345),(05),(14),(26),(37).

Triangulations of 2-dimensional shapes

To encode a shape, first we should triangulate it. Let us split a round ring into 4 triangles with vertices 0,1,2,3 below. The corresponding list of triples is (012),(013),(023),(123).ring-4trianglesWe may notice that the same list of triples encodes the standard tetrahedron \(\Delta^3\) with four 2-dimensional faces. To avoid this confusion, the definition of a simplicial complex requires that any two simplices should meet along their common face. So the splitting into 4 curved triangles above is not a simplicial complex.

If our code contains triples (123) and (013), we should know that the corresponding triangles share the common edge (13) as in the tetrahedron \(\Delta^3\), not only the endpoints 1 and 3 as in the round ring. A minimum triangulation of a round ring is at the beginning of this post.

In conclusion, we highlight answers to the questions posed at the beginning of the post:

  • a simplicial complexes is glued from standard simplices along their common faces,
  • to represent a shape, we triangulate it into pieces homeomorphic to simplices,
  • simplicial complexes can be easily encoded in a computer memory, hence
    they provide a convenient representation of a shape for further analysis.

Exercises on triangulations of shapes

  • Q1. Encode the figure-eight shape below by a simplicial complex.
    Hint. Each of 3 boundary contours should have at least 3 vertices.
  • Q2. Find a minimum triangulation of the 2-dimensional torus below.
    Hint. A minimum triangulation of the torus has at least 7 vertices.

figure-eight-shape                   green-torus2D

You could write a brief solution or feedback in your comment: reply.