Tweet Embedding Visualization

A web application for visualizing text embeddings using UMAP dimensionality reduction and Streamlit. It takes survey data or any other text data in CSV format, generates semantic embeddings via LM Studio, and renders them as an interactive 2D visualization.

Features

CSV Data Processing: Converts survey data (such as breakfast survey responses) into a standard two-column format ready for embedding
Semantic Embeddings: Generates 768-dimensional embeddings using LM Studio with Nomic Embed Text v1.5 model
Interactive Visualization: 2D UMAP visualization with hover tooltips showing full text content
Test Tweet Support: Add custom test entries to demonstrate clustering behavior in real-time
Configurable UMAP: Adjust clustering parameters (n_neighbors, min_dist, spread) via sidebar controls
Auto-detection: Automatically finds and uses the most recent embeddings file

Demo

The visualization displays text entries as points in 2D space where:

Similar content clusters together based on semantic meaning
Test entries appear as red dots for easy identification
Original entries appear as blue dots
Hover over any point to see the full text
Points are positioned using UMAP dimensionality reduction with cosine similarity
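
To make "similar content clusters together" concrete, here is a minimal sketch of cosine similarity, the measure the points above refer to. The three-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings, and the variable names are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical, not real embeddings)
pancakes = [0.9, 0.1, 0.0]
waffles = [0.8, 0.2, 0.1]
tax_forms = [0.0, 0.1, 0.9]

print(cosine_similarity(pancakes, waffles))    # close to 1: similar direction
print(cosine_similarity(pancakes, tax_forms))  # close to 0: unrelated direction
```

Entries whose embeddings point in similar directions score near 1 and tend to land near each other in the 2D plot.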

Installation

Prerequisites

Python 3.8+
LM Studio with Nomic Embed Text v1.5 model

Setup

Clone the repository:

git clone https://github.com/yourusername/twitter-embedding-visualization.git
cd twitter-embedding-visualization

Install dependencies:

pip install -r requirements.txt

Set up LM Studio:
- Install LM Studio
- Download the Nomic Embed Text v1.5 model
- Start the server on port 1234
- Enable "Serve on Local Network" in settings
- Note: Update the IP address in scripts if needed (currently set to 10.0.0.7)

Usage

1. Prepare Your Data

Start with a CSV file containing your text data (e.g., breakfast_survey_data.csv). The text responses must all sit in a single column; the conversion script reads the third column (index 2) by default.

2. Convert Survey Data to Standard Format

python convert_breakfast_data.py

This converts your survey responses into a standardized format with two columns:

username: Identifier (e.g., "breakfast_survey")
full_text: The actual text content to analyze

The script outputs a timestamped file like: tweets_data_20250714_185858.csv
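
The conversion step can be sketched roughly as follows. This is not the actual convert_breakfast_data.py: it uses the standard-library csv module instead of pandas to stay dependency-free, and the function names and the assumption that the first row is a header are illustrative.

```python
import csv
from datetime import datetime

def convert_rows(rows, text_index=2, username="breakfast_survey"):
    """Map raw survey rows to username/full_text records, skipping blanks."""
    converted = []
    for row in rows:
        if len(row) <= text_index:
            continue  # row is too short to contain the text column
        text = row[text_index].strip()
        if text:
            converted.append({"username": username, "full_text": text})
    return converted

def timestamped_name(now=None):
    """Build an output name such as tweets_data_20250714_185858.csv."""
    now = now or datetime.now()
    return f"tweets_data_{now:%Y%m%d_%H%M%S}.csv"

def convert_file(src_path, dst_path, text_index=2):
    """Read src_path, skip its header row, and write the two-column file."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        next(reader, None)  # assumption: the first row is a header
        records = convert_rows(reader, text_index)
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=["username", "full_text"])
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

A call such as convert_file("breakfast_survey_data.csv", timestamped_name()) would then produce the two-column file described above.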

3. Generate Embeddings

Edit generate_embeddings_filtered.py to set your input file and target users:

input_file = "tweets_data_20250714_185858.csv"
target_users = ['breakfast_survey']

Then run:

python generate_embeddings_filtered.py

This processes each text entry through LM Studio and adds a 768-dimensional embedding column. Progress updates appear every 25 entries. Output file: tweets_with_embeddings_filtered_YYYYMMDD_HHMMSS.csv
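
For orientation, the request made for each entry looks roughly like the sketch below. It assumes LM Studio's OpenAI-compatible /v1/embeddings endpoint; the model identifier string is a guess and should be checked against what LM Studio shows for the loaded model. The helper names are hypothetical, and urllib is used here only to avoid extra dependencies (the project's own scripts may use a different HTTP client).

```python
import json
import urllib.request

LM_STUDIO_URL = "http://YOUR_IP:1234/v1/embeddings"  # replace YOUR_IP
MODEL_NAME = "nomic-embed-text-v1.5"  # assumed identifier; verify in LM Studio

def build_request_body(text, model=MODEL_NAME):
    """Encode the JSON body expected by an OpenAI-style embeddings endpoint."""
    return json.dumps({"model": model, "input": text}).encode("utf-8")

def parse_embedding(response_body, expected_dim=768):
    """Extract the first embedding vector and check its length."""
    payload = json.loads(response_body)
    embedding = payload["data"][0]["embedding"]
    if len(embedding) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(embedding)}")
    return embedding

def embed_text(text, url=LM_STUDIO_URL, timeout=30):
    """POST one text to LM Studio and return its embedding."""
    request = urllib.request.Request(
        url,
        data=build_request_body(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embedding(response.read())

def embed_all(texts, embed=embed_text, progress_every=25):
    """Embed a list of texts, printing progress every `progress_every` entries."""
    embeddings = []
    for count, text in enumerate(texts, start=1):
        embeddings.append(embed(text))
        if count % progress_every == 0:
            print(f"Processed {count}/{len(texts)} entries")
    return embeddings
```

Passing a stub as the embed argument makes it possible to try the loop without a running LM Studio server.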

4. Run Visualization

Auto-detect most recent embeddings file:

streamlit run streamlit_app.py

Specify a particular embeddings file:

streamlit run streamlit_app.py tweets_with_embeddings_filtered_20250629_144705.csv

Either command opens the interactive web application in your browser.
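
One plausible way to implement the file selection is shown below. This is a sketch, not the code in streamlit_app.py: it assumes "most recent" means the latest modification time, that the file pattern matches the generated names listed under File Structure, and that an explicit file name arrives as the first script argument. The function name is hypothetical.

```python
import glob
import os

PATTERN = "tweets_with_embeddings_*.csv"

def pick_embeddings_file(argv, directory="."):
    """Return the file named on the command line, else the newest match."""
    if len(argv) > 1:
        return argv[1]  # explicit file wins over auto-detection
    candidates = glob.glob(os.path.join(directory, PATTERN))
    if not candidates:
        return None  # nothing generated yet
    return max(candidates, key=os.path.getmtime)
```

Inside the app this would be called as pick_embeddings_file(sys.argv). Because the generated names embed a timestamp, sorting by name would be a reasonable alternative to modification time.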

5. Add Test Entries (Interactive)

In the Streamlit app:

Enter text in the "Enter your test tweet" sidebar box
Click "Add Test Tweet"
The app generates a real embedding via LM Studio
Test entry appears as a red dot in the visualization
Use "Remove All Test Tweets" to clear them

6. Adjust UMAP Parameters

Use the sidebar controls to adjust clustering:

n_neighbors: Controls local vs global structure (5-200)
min_dist: Minimum distance between points (0.0-1.0)
spread: Scale of embedded points (0.5-3.0)
Max tweets: Limit visualization size for performance (1000-50000)
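
The sketch below shows how the slider values could be turned into UMAP arguments. The ranges come from the list above; the helper names are hypothetical, and the two guards reflect umap-learn constraints (min_dist may not exceed spread, and n_neighbors should stay below the number of points). It only builds the keyword arguments, so it runs without umap-learn installed.

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))

def umap_kwargs(n_neighbors, min_dist, spread, n_samples):
    """Build keyword arguments for umap.UMAP from sidebar values."""
    n_neighbors = int(clamp(n_neighbors, 5, 200))
    # A point cannot have more neighbors than there are other points.
    n_neighbors = max(2, min(n_neighbors, n_samples - 1))
    spread = clamp(spread, 0.5, 3.0)
    # umap-learn rejects min_dist greater than spread.
    min_dist = min(clamp(min_dist, 0.0, 1.0), spread)
    return {
        "n_components": 2,   # 2D output for plotting
        "metric": "cosine",  # matches the metric named in this README
        "n_neighbors": n_neighbors,
        "min_dist": min_dist,
        "spread": spread,
    }

# With umap-learn installed and X an (n_samples, 768) array:
#   import umap
#   coords = umap.UMAP(**umap_kwargs(15, 0.1, 1.0, len(X))).fit_transform(X)
```

Smaller n_neighbors values emphasize local structure and larger ones global structure, while lower min_dist values pack similar points more tightly.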

Configuration

LM Studio Connection

Update the IP address in generate_embeddings_filtered.py and streamlit_app.py:

lm_studio_url = "http://YOUR_IP:1234/v1/embeddings"

Input Data

Edit convert_breakfast_data.py to specify your input CSV:

df = pd.read_csv('your_survey_data.csv')

The script uses the third column (index 2) by default for text content. Adjust if needed:

full_text = row.iloc[2]  # Change index as needed

File Structure

├── streamlit_app.py                      # Main web application
├── convert_breakfast_data.py             # Convert survey data to standard format
├── generate_embeddings_filtered.py       # Generate embeddings for entries
├── analyze_users.py                      # Analyze entry counts per user/category
├── check_columns.py                      # Check CSV column structure
├── requirements.txt                      # Python dependencies
├── README.md                             # This file
├── breakfast_survey_data.csv             # Your source data
├── tweets_data_*.csv                     # Converted data (generated)
└── tweets_with_embeddings_*.csv          # Data with embeddings (generated)

Legacy Files (Not Currently Used)

These files contain old code for different data sources and are not part of the current workflow:

collect_all_tweets.py - Old Supabase tweet collector
test_supabase_connection.py - Supabase API testing
generate_embeddings_production.py - Batch processing version

Technical Details

Embedding Model: Nomic Embed Text v1.5 (768 dimensions)
Dimensionality Reduction: UMAP with cosine similarity metric
Web Framework: Streamlit with Plotly for interactive visualizations
Data Processing: Pandas for CSV handling, NumPy for numerical operations
API: RESTful requests to LM Studio for embedding generation

Utilities

Analyze Entry Distribution

python analyze_users.py

Shows entry count per user/category in your dataset.
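
The same count can be reproduced in a few lines if you want it inside your own script. This is a sketch that assumes the converted file has a username column as described under Usage; the function name is hypothetical and analyze_users.py may be implemented differently.

```python
import csv
from collections import Counter

def count_entries(csv_path, column="username"):
    """Count rows per distinct value of `column` in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row[column] for row in csv.DictReader(handle))
```

For example, count_entries("tweets_data_20250714_185858.csv").most_common() lists users from most to fewest entries.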

Check CSV Structure

python check_columns.py

Displays available columns in your CSV files.

System Requirements

For ARM64 Windows Users

On ARM64 Windows, this project requires WSL (Windows Subsystem for Linux) because some of its libraries have compatibility issues on that platform. The visualization components work best in a Linux environment.

Supported Platforms

Linux (recommended)
macOS
Windows x64
Windows ARM64 (via WSL)

Troubleshooting

LM Studio Connection Errors

Ensure LM Studio server is running on the correct port (1234)
Enable "Serve on Local Network" in LM Studio settings
For WSL users: Use Windows IP address instead of localhost
Verify the IP address in all scripts matches your LM Studio instance

Embedding Generation Slow

Embedding generation time depends on dataset size and hardware
Progress updates appear every 25 entries
Average rate: ~60-100 entries/minute on typical hardware
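
As a quick planning aid, the quoted rate translates into a run-time estimate like this. The function name is illustrative and the rates are simply the 60-100 entries/minute figure above, so treat the result as a rough range.

```python
def estimate_minutes(n_entries, slow_rate=60, fast_rate=100):
    """Return (best_case, worst_case) minutes for embedding n_entries."""
    return n_entries / fast_rate, n_entries / slow_rate

best, worst = estimate_minutes(3000)
print(f"3000 entries: roughly {best:.0f} to {worst:.0f} minutes")
# prints "3000 entries: roughly 30 to 50 minutes"
```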

Visualization Issues

Use "Refresh Data" button to reload CSV changes
Check that CSV files have valid embedding data (768-dimensional arrays)
Ensure test entries have proper embeddings from LM Studio
Reduce "Max tweets" slider for faster rendering
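
To check whether a file holds valid embedding data, each cell of the embedding column can be parsed and measured. The sketch below assumes embeddings are stored as a JSON-style list string such as "[0.12, -0.03, ...]"; that storage format is an assumption, so adapt the parsing if your files differ. The function name is hypothetical.

```python
import json

def parse_embedding_cell(cell, expected_dim=768):
    """Parse one CSV cell into a list of floats, or return None if invalid."""
    try:
        values = json.loads(cell)
    except (TypeError, ValueError):
        return None  # not a string, or not parseable as JSON
    if not isinstance(values, list) or len(values) != expected_dim:
        return None  # wrong shape for this embedding model
    try:
        return [float(v) for v in values]
    except (TypeError, ValueError):
        return None  # list contains non-numeric items
```

Rows for which this returns None are the ones worth inspecting or regenerating.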

CSV Format Issues

Ensure your source CSV has text data in a consistent column
Check column structure with check_columns.py
Verify convert_breakfast_data.py is reading the correct column index

Workflow Summary

breakfast_survey_data.csv
    ↓ (convert_breakfast_data.py)
tweets_data_YYYYMMDD_HHMMSS.csv
    ↓ (generate_embeddings_filtered.py)
tweets_with_embeddings_filtered_YYYYMMDD_HHMMSS.csv
    ↓ (streamlit run streamlit_app.py)
Interactive Visualization

Contributing

Fork the repository
Create a feature branch
Make your changes
Test thoroughly
Submit a pull request

License

MIT License - feel free to use and modify as needed.

Acknowledgments

Built with Streamlit for rapid prototyping
Uses UMAP for dimensionality reduction
Powered by LM Studio and Nomic Embed Text v1.5
