To run K-means clustering on the t-SNE output from your LSTM model, you can follow the steps below, which I’ll show in Python using scikit-learn and matplotlib. Here’s a step-by-step guide:
Make sure your t-SNE output is a NumPy array or a Pandas DataFrame of shape (n_samples, 2), that is, one row per sample and two columns if you visualized a 2D t-SNE.
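As a quick sanity check before clustering, here is a minimal sketch of a validation helper. The name `as_tsne_array` is hypothetical (not part of any library), and the nested list is only a stand-in for real coordinates. Since it relies on `np.asarray`, it should also accept a DataFrame with two numeric columns.

```python
import numpy as np

def as_tsne_array(embedding):
    """Convert a t-SNE embedding to a float array of shape (n_samples, 2)."""
    arr = np.asarray(embedding, dtype=float)
    if arr.ndim != 2 or arr.shape[1] != 2:
        raise ValueError(f"expected shape (n_samples, 2), got {arr.shape}")
    if not np.isfinite(arr).all():
        raise ValueError("embedding contains NaN or infinite values")
    return arr

# Illustrative input: a nested list standing in for real t-SNE coordinates
tsne_output = as_tsne_array([[0.1, 2.3], [1.5, -0.7], [3.2, 0.4]])
print(tsne_output.shape)
```

Failing early on a wrong shape or on NaN values is usually easier to debug than an error raised from inside KMeans.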
You can choose the number of clusters with a heuristic such as the Elbow Method: fit K-means for a range of k values, plot the sum of squared distances from each point to its assigned cluster center against k, and look for the “elbow” where the curve flattens and adding more clusters yields only a small further decrease.
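For reference, the quantity plotted in the Elbow Method can be written as follows. The notation is my own: x_i is the i-th point, c(i) is the index of the cluster it is assigned to, and \mu_j is the center of cluster j. scikit-learn exposes this value as the `inertia_` attribute of a fitted KMeans model.

```latex
\mathrm{SSE}(k) = \sum_{i=1}^{n} \left\lVert x_i - \mu_{c(i)} \right\rVert^2
```

Because adding clusters can only lower this value at the optimum, the plot trends downward as k grows, which is why you look for a bend rather than a minimum.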
You can use sklearn’s KMeans class to perform the clustering. Here’s a sample implementation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Example data: Replace this with your t-SNE output
# Let's say tsne_output is your numpy array of shape (n_samples, 2)
tsne_output = np.random.rand(100, 2) # Dummy data for illustration
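# Determining the number of clusters (K) using the Elbow Method
sse = []  # sum of squared distances (inertia) for each k
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(tsne_output)
    sse.append(kmeans.inertia_)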
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()
# Choose the k after which the decrease in SSE becomes less pronounced
# For this example, let's say k=3
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(tsne_output)
# Plotting the clustered data
plt.scatter(tsne_output[:, 0], tsne_output[:, 1], c=clusters, cmap='viridis', marker='o', alpha=0.7)
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.title('t-SNE Output Clustered using K-Means')
plt.colorbar(label='Cluster Label')
plt.show()
Once you have your data clustered, examine the clusters to understand their characteristics and differences. This might involve looking back at the items (e.g., motion capture sequences) that fall into each cluster to discern any patterns or commonalities.
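One way to start that inspection is to collect, for each cluster, its size, its centroid, and the row indices of its members, so you can look up the original items. This is only a sketch: `summarize_clusters` is a hypothetical helper, and the toy `points` and `labels` stand in for your `tsne_output` and the labels returned by `fit_predict`.

```python
import numpy as np

def summarize_clusters(points, labels):
    """Return {label: {"size", "centroid", "indices"}} for each cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    summary = {}
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        summary[int(label)] = {
            "size": int(idx.size),
            "centroid": points[idx].mean(axis=0),
            "indices": idx,  # row indices for looking up the original sequences
        }
    return summary

# Toy stand-ins for tsne_output and the cluster labels
points = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.2]])
labels = np.array([0, 0, 1, 1])
for label, info in summarize_clusters(points, labels).items():
    print(label, info["size"], info["centroid"], info["indices"])
```

The `indices` arrays only map back to your sequences if the rows of the t-SNE output are in the same order as the inputs you fed to the LSTM.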
Finally, to ensure the clusters are meaningful:
- Cross-validate with original features: Apply clustering to the LSTM outputs (before t-SNE) and see if similar clusters emerge.
- Use internal cluster validation metrics, such as the silhouette score, to assess how compact and well separated the clusters are.
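Both checks can be sketched with scikit-learn’s `silhouette_score` and `adjusted_rand_score`. Everything about the data here is an assumption for illustration: `lstm_features` and `tsne_output` are synthetic stand-ins (three well-separated groups), and the 16 dimensions, k=3, and n_init=10 are arbitrary choices. Replace them with your real LSTM outputs and embedding.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the LSTM outputs: 3 groups of 50 samples in 16-D
centers_hd = np.zeros((3, 16))
centers_hd[0, 0] = centers_hd[1, 1] = centers_hd[2, 2] = 20.0
lstm_features = np.vstack([c + rng.normal(size=(50, 16)) for c in centers_hd])

# Synthetic stand-in for the 2-D t-SNE embedding of the same samples
centers_2d = np.array([[0.0, 0.0], [20.0, 0.0], [0.0, 20.0]])
tsne_output = np.vstack([c + rng.normal(size=(50, 2)) for c in centers_2d])

# Cluster both representations with the same k
labels_tsne = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(tsne_output)
labels_lstm = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(lstm_features)

# Silhouette: ranges from -1 to 1, higher means more compact and separated
sil = silhouette_score(tsne_output, labels_tsne)

# Adjusted Rand index: 1 means identical partitions, near 0 means chance-level
ari = adjusted_rand_score(labels_lstm, labels_tsne)

print(f"silhouette on t-SNE output: {sil:.3f}")
print(f"agreement between LSTM-space and t-SNE-space clusters (ARI): {ari:.3f}")
```

The adjusted Rand index ignores how the cluster labels are numbered, so it is a convenient way to compare the two partitions directly. A low value would suggest the clusters seen in the t-SNE plot are not well supported by the original features.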
This process will help you explore the latent structure in your LSTM model’s outputs and can provide insights into the nature of the data and the model’s behavior.