Reposted from the original at https://blog.stephenturner.us/p/github-repo-to-text-for-llm-input.
—
If you use ChatGPT, Claude, or even some local model through Ollama or HuggingFace Assistants, you’ll know that the chat interface makes it challenging to feed in an entire repo like a Python or R package, because functions, tests, etc. can be scattered across many files throughout a repo. Here I’ll demonstrate how to turn an entire GitHub repo into a single text file for LLM-friendly input, with R and Python packages as examples.
This GitHub Repo to Text Converter can help here.
This little app takes a URL to a GitHub repo, lets you select which files in the repo directory structure to include, and creates a single plain text file you can feed into an LLM to ask for explanations, usage, or anything else you’d like to ask about the codebase. You can download or copy this text to the clipboard to later paste into an LLM of your choosing.
The tool runs entirely in the browser. You can also provide a personal access token if you want to access private repositories (again, securely, since everything runs in the browser).
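To give a feel for what a tool like this does behind the scenes, here is a minimal Python sketch of asking the GitHub REST API for a repo's file listing, optionally with a personal access token. This is not the app's actual code (which is JavaScript); the helper names are hypothetical, and the default branch name `main` is an assumption you may need to change.

```python
import json
import urllib.request
from urllib.parse import urlparse


def parse_repo_url(url):
    """Split a GitHub repo URL into (owner, repo)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"Not a repo URL: {url}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]
    return owner, repo


def build_tree_request(url, branch="main", token=None):
    """Build a request for the repo's full (recursive) git tree.

    The branch name is an assumption; the token is only needed for
    private repos.
    """
    owner, repo = parse_repo_url(url)
    api_url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/git/trees/{branch}?recursive=1"
    )
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(api_url, headers=headers)


def list_repo_files(url, branch="main", token=None):
    """Return the paths of all files (blobs) in the repo. Needs network access."""
    with urllib.request.urlopen(build_tree_request(url, branch, token)) as resp:
        tree = json.load(resp)["tree"]
    return [item["path"] for item in tree if item["type"] == "blob"]
```

Calling `list_repo_files("https://github.com/stephenturner/caffeinated")` should return the repo's file paths, provided the branch name matches.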
I recently demonstrated writing a CLI app using Click from a Cookiecutter template in Python. It’s a simple app that tells you how much caffeine remains in your system based on how much you consume and when you’ll go to bed. It’s on GitHub at stephenturner/caffeinated.
Here I’m pasting the repo URL, https://github.com/stephenturner/caffeinated, into the app. The tool is smart enough to recognize that this is a Python package, and I probably want the __init__.py, __main__.py, other package .py files, and the unit tests in tests/ that I’ll use with pytest.
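How the app decides which files to pre-select isn't described here, so the following is only a guess at the idea: a small Python sketch that keeps files by extension. The function name and the default extension list are assumptions for illustration.

```python
from pathlib import PurePosixPath


def select_files(paths, extensions=(".py",)):
    """Keep only the paths whose file extension is in `extensions`."""
    return [p for p in paths if PurePosixPath(p).suffix in extensions]


# Example listing (made up for illustration, loosely modeled on a Python package)
paths = [
    "README.md",
    "caffeinated/__init__.py",
    "caffeinated/cli.py",
    "tests/test_caffeinated.py",
]
selected = select_files(paths)
```

For an R package you might pass `extensions=(".R",)` instead.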
After hitting “Generate Text File” you’ll see that it generates a single plain text file, first listing out the directory structure of the selected files, then listing out the contents of each file separately. Here’s a preview. My __init__.py is empty (just to signal that this is a package), and the cli.py file is truncated here.
Directory Structure:
└── ./
    ├── caffeinated
    │   ├── __init__.py
    │   ├── __main__.py
    │   └── cli.py
    └── tests
        └── test_caffeinated.py

--- File: /caffeinated/__init__.py ---

--- File: /caffeinated/__main__.py ---
from .cli import caffeinated

if __name__ == "__main__":
    caffeinated()

--- File: /caffeinated/cli.py ---
import click
import math
from datetime import datetime
from importlib.metadata import version, PackageNotFoundError

### TRUNCATED ###
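If you'd rather produce a similar file from a local clone, here is a rough Python sketch that imitates the layout of the preview: a directory tree first, then each file under a `--- File: ... ---` header. It is not the app's code, the function names are hypothetical, and the exact spacing of the real output is an assumption.

```python
from pathlib import Path


def build_tree(paths):
    """Nest relative POSIX paths into a dict of dicts (files map to None)."""
    root = {}
    for path in sorted(paths):
        node = root
        parts = path.split("/")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = None
    return root


def render_tree(node, prefix=""):
    """Draw the nested dict with box-drawing characters, one line per entry."""
    lines = []
    names = list(node)
    for i, name in enumerate(names):
        last = i == len(names) - 1
        lines.append(f"{prefix}{'└── ' if last else '├── '}{name}")
        if node[name] is not None:  # a directory: recurse into it
            child_prefix = prefix + ("    " if last else "│   ")
            lines.extend(render_tree(node[name], child_prefix))
    return lines


def repo_to_text(files):
    """Combine a {relative_path: contents} dict into one text blob."""
    out = ["Directory Structure:", "└── ./"]
    out.extend(render_tree(build_tree(files), prefix="    "))
    for path in sorted(files):
        out.append(f"\n--- File: /{path} ---\n")
        out.append(files[path])
    return "\n".join(out)


def read_files(root, pattern="*.py"):
    """Read matching files under `root` into a {relative_path: contents} dict."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): p.read_text(encoding="utf-8")
        for p in root.rglob(pattern)
        if p.is_file()
    }
```

Something like `print(repo_to_text(read_files("path/to/clone")))` would then give you text to paste into a chat window; swap the pattern to `"*.R"` for an R package.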
Once you have this, you can easily paste this into an LLM of your choice to chat with the codebase. Here, I pasted this into GPT-4o with the leading prompt: “Explain this code to me, how to use it what it does.” GPT-4o returned all of the text and code below as a single response.
First, the directory structure.
Next, an explanation of __main__.py, cli.py, and the tests.
Next, a demonstration of how to use it. Note above that I did not select the README to add to my text file. GPT-4o divined this usage information from the code itself, rather than regurgitating the README.
And finally, a simple explanation of the (minimal) test suite.
This also works on an R package. In my last job I wrote a package to assist with containerizing R packages with a usethis-like interface. It’s called pracpac (practical R packaging [with Docker]), and you can read more in the paper or on the GitHub repo. Here I’ll take the code from the pracpac repo and pull out the relevant R source files.
This time I asked Claude 3.5 Sonnet to tell me more about this package.
If you look at the source code you’ll see that this is a simple implementation using just HTML and JavaScript, which means you can download the repo and open index.html locally. If you’d prefer a vanity URL, you can fork the repo and set up GitHub Pages to serve from your main branch (e.g. stephenturner.github.io/repo2txt).