A New Approach to Data Analysis: Chloe Wohlgemuth Thesis Spotlight

*Chloe Wohlgemuth, senior thesis student. Photo courtesy of Chloe Wohlgemuth*.

The following was adapted from an interview between Chloe Wohlgemuth ’22 and Max Hauschildt.

Could you tell us a bit about yourself? Why did you decide to do a senior thesis?

I am a computer science and physics double-major. Aside from one AP class, I had never taken computer science before coming to Amherst. Once here, I took COSC 112 and really enjoyed the subject, making my way to data structures, AI and security as a sophomore. As I went through higher-level courses, the work became more exploratory, and I started wanting to do research with professors. For two summers, I managed to work with Professors Lee Spector and Matteo Riondato, but my research experience consisted of helping run experiments (which I still really enjoyed). Eventually, I arrived at the question, “What if I want to be there when the paper first starts?” and decided to do a thesis.

Did you always know you would write a thesis? Did you at any point consider changing your field of study?

When I was deciding to do a thesis, there weren’t any professors doing research in physics that I was interested in, so I decided to do computer science. I didn’t know what I wanted to research for my thesis initially, but I knew I wanted it to be in AI. Specifically, I wanted [my work] to carry out a cool task in a cool way. I reached out to some computer science professors and found that I was interested in Scott Alfeld’s research and decided to do [my thesis] with him.

Can you tell us about what you’re researching for your thesis? Why did you want to do a thesis on this topic in particular? Did you always know you wanted to pursue this topic for your thesis?

My thesis studies kernelized k-planes clustering within the field of machine learning. Clustering is the process of searching for natural groupings in data by grouping “similar” elements together and separating “different” elements into different groups. For example, a set of Canadian cities would be grouped under Canada and a set of French cities under France. In the field of clustering, my research builds on k-means and standard k-planes clustering, both forms of partition-based clustering:

*Figure 1.1 from Chloe’s thesis showcasing different methods of clustering the same data set. Figure provided by Chloe Wohlgemuth*.

Each of these clustering techniques partitions the data around a representative (a point in k-means, a plane in k-planes), and each data point is related to that representative by how “similar” or “distant” it is. However, k-means clustering doesn’t work for every data distribution; k-planes clustering, for instance, enables linear data to be clustered more sensibly. What if you had data that was very curvy and nonlinear? How should the data be clustered? In my thesis, I explored how one might analyze data that is not necessarily linear, given the wide variety of dimensions each data point possesses. This would enable us to determine relationships that are not necessarily linear or proportional.
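As a rough illustration of the difference between the two notions of distance (not code from the thesis), the sketch below compares how far a point is from a k-means-style centroid with how far it is from a k-planes-style plane. The function names and the sample values are assumptions made for this example; the plane is written as the set of points `z` with `normal · z = offset`.

```python
import numpy as np

def distance_to_centroid(x, centroid):
    """Euclidean distance from x to a cluster's center point (k-means style)."""
    return float(np.linalg.norm(x - centroid))

def distance_to_plane(x, normal, offset):
    """Perpendicular distance from x to the plane {z : normal . z = offset}."""
    return float(abs(normal @ x - offset) / np.linalg.norm(normal))

# A point lying on the line y = x, far from the origin.
x = np.array([3.0, 3.0])

# Hypothetical cluster representatives: a centroid at the origin,
# and the line y = x written as 1*x + (-1)*y = 0.
centroid = np.array([0.0, 0.0])
normal = np.array([1.0, -1.0])
offset = 0.0

d_point = distance_to_centroid(x, centroid)
d_plane = distance_to_plane(x, normal, offset)
print(d_point, d_plane)
```

The point sits exactly on the line, so its plane distance is zero even though it is far from the centroid. This is why data spread along a line can fit one plane well while fitting a single center point poorly.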

The algorithm represents the data in higher dimensions, where it can be well expressed with a plane. Each dimension refers to a computation or combination of the input data (e.g., a person’s height, weight, age, BMI, etc.). In k-planes, the data is only combined linearly, as opposed to non-linearly. To take the data and lift it to higher dimensions, we use non-linear combinations of the data we already have. To do so, we apply Kernel Principal Component Analysis (KPCA) to summarize the data’s various dimensions and cluster them. The kernel trick enables us to find these non-linear combinations, which are linear combinations of the lifted data, without explicitly storing or computing high-dimensional (potentially infinite-dimensional) data. My algorithm tells me which cluster each data point was assigned to and what each cluster is characterized by (i.e., a best-fit plane or manifold).
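To make the kernel trick concrete, here is a minimal sketch using a degree-2 polynomial kernel on two-dimensional inputs. The interview does not say which kernel the thesis uses, so the kernel, the feature map `lift_degree2` and the sample vectors are all assumptions chosen for illustration. The sketch shows that the kernel value equals the inner product of the explicitly lifted vectors, without ever building those vectors.

```python
import numpy as np

def lift_degree2(x):
    """Explicit lift of a 2-D point to 3-D: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed only from the original 2-D points."""
    return float(np.dot(x, y)) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = float(lift_degree2(a) @ lift_degree2(b))  # inner product after lifting
implicit = poly_kernel(a, b)                         # same value, no lifting needed
print(explicit, implicit)
```

Here the lifted space has only three dimensions, so both routes are cheap. For kernels such as the Gaussian (RBF) kernel, the lifted space is infinite-dimensional, and the kernel evaluation is the only practical route.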

The algorithm functions by alternating between two steps: updating its clusters and assigning each point to those clusters. It repeats this until the clusters converge, that is, until the algorithm can no longer find a better clustering of the data. This enables me to analyze large data sets in a novel fashion by finding non-linear trends instead of predominantly linear ones.
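The alternating structure can be sketched for standard (linear) k-planes, which is simpler than the kernelized version the thesis develops. This is an illustrative sketch under stated assumptions, not the thesis algorithm: the function names, the random initialization, the SVD-based plane fit and the handling of under-populated clusters are all choices made for this example.

```python
import numpy as np

def fit_plane(points):
    """Best-fit hyperplane through points, returned as (unit normal, offset)."""
    mean = points.mean(axis=0)
    # The normal is the direction of least variance: the last right singular vector.
    _, _, vt = np.linalg.svd(points - mean, full_matrices=True)
    normal = vt[-1]
    return normal, float(normal @ mean)

def k_planes(data, k, max_iter=100, seed=0):
    """Alternate between updating the planes and reassigning the points."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    labels = rng.integers(0, k, size=len(data))  # random initial assignment
    planes = []
    for _ in range(max_iter):
        # Update step: refit each cluster's plane to its current members.
        planes = []
        for j in range(k):
            members = data[labels == j]
            if len(members) < dim:
                # Too few points to define a plane: reseed from random points.
                members = data[rng.choice(len(data), size=dim, replace=False)]
            planes.append(fit_plane(members))
        # Assignment step: send each point to its nearest plane.
        dists = np.stack([np.abs(data @ n - b) for n, b in planes], axis=1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no assignment changed, so the clustering has converged
        labels = new_labels
    return labels, planes

# Toy data: points drawn from two different lines in the plane.
t = np.linspace(-1.0, 1.0, 20)
line_a = np.column_stack([t, 2.0 * t + 1.0])
line_b = np.column_stack([t, -t - 1.0])
data = np.vstack([line_a, line_b])

labels, planes = k_planes(data, k=2)
print(labels)
```

Like k-means, this loop is only guaranteed to stop at a local optimum, so the result can depend on the random starting assignment.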

What did your schedule look like throughout this process? What did you spend most of your time doing during the process? What complications did you encounter during the process, if any?

Originally, my research idea was to do something with k-planes and try to compute the plane in a new way. Unfortunately, I discovered that most of my ideas had already been researched by others, which sent me back to the drawing board. I wondered what I could do beyond k-planes, which led me to the thesis I have today.

After that, my time was divided into math, coding and writing. The math and programming components took up most of the time during this process. In the beginning, I spent most of [my time] reviewing literature regarding k-planes clustering and PCA. I also had to code specific scenarios to test the algorithm, exploring how I could get the clustering algorithm to handle non-linear data sets. I started with very slight variations from a linear relationship and gradually gave it more [complex] data and worked from there. At the same time, I had to figure out how to make the math work while preventing the computer from computing data in very high dimensions. I could run the algorithm on my laptop, which shows the value of the kernel trick: it enabled me to operate in a high-dimensional space without explicitly computing or storing those dimensions.

One of the biggest complications I encountered was working out how to mathematically avoid computations in infinite dimensions. Working around that actually extended into this past semester, as even when it wasn’t computing in infinite dimensions, the algorithm still wouldn’t work. I eventually found that the data needed to be centered in higher dimensions, which took even more time to solve.
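The interview does not describe how the centering was carried out. A common approach, used in kernel PCA, centers the kernel matrix directly so that the lifted data never has to be formed. The sketch below shows that standard formula and checks it with a linear kernel, where the lifted data is simply the data itself. The function name `center_kernel` and the sample matrix are assumptions made for this example.

```python
import numpy as np

def center_kernel(K):
    """Center an n x n kernel matrix as if the lifted data had zero mean.

    Uses K_c = K - 1K - K1 + 1K1, where 1 is the n x n matrix with
    every entry equal to 1/n.
    """
    n = K.shape[0]
    ones = np.full((n, n), 1.0 / n)
    return K - ones @ K - K @ ones + ones @ K @ ones

# Check with a linear kernel, where centering can also be done explicitly.
X = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]])
K = X @ X.T                    # kernel matrix of the uncentered data

X_centered = X - X.mean(axis=0)
K_expected = X_centered @ X_centered.T

K_centered = center_kernel(K)
print(np.allclose(K_centered, K_expected))
```

The same formula applies to any kernel matrix, including ones whose lifted space is infinite-dimensional, because it only uses the entries of `K`.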

What responsibilities did you have while writing the thesis?

I worked a lot with my advisor, Professor Scott Alfeld, who would give me comments and advice on how to proceed. If I reached roadblocks, he would guide me to read [specific] research papers or research other approaches. He helped a lot with the research component and gave feedback to help me make progress. He would suggest testing different toy data sets and ask what it might mean if the algorithm works or doesn’t work with that data. All of this was helpful to me while I was writing the thesis.

What impact do you hope for your thesis to have in your field? Did you foresee this outcome when you first decided to do a thesis?

According to my advisor, this could definitely be expanded upon, rigorously tested and investigated as a published paper. I had always intended to work on a new method that had never been formulated, but I didn’t see myself going into infinite dimensions. I never thought I would do this much math in my thesis: I originally thought that it would be much more empirical, but I found it really interesting. I really enjoyed clustering, and I definitely think I would keep researching it or publish a paper given the opportunity.

What advice would you give to students thinking about writing a thesis?

One, don’t be afraid to ask your advisor questions. If you’re at this point where you are writing a thesis, then you are smart and competent. So if you can ask for a quick point of clarification to save time, do it. If I was confused about something that my algorithm returned, I would spend a lot of time rigorously testing it but still ask my advisor. I would say asking [your advisor] for clarifications or questions when you’re stuck is worth it because it may prevent you from doing trivial work. Don’t be afraid to take a step back and ask your advisor if you didn’t quite understand or process their advice the first time.

Two, keep notes on where you started, where you ended up and what questions you had throughout the process. You should keep track of them even before you start writing because it’s a useful log and helps when you have to do a thesis defense.