Trim Strings in Vim / Organize Firefox History

I was able to trim the end of each line in Vim. Combining that with a few other Vim commands allowed me to organize my Firefox history.

Background

I have a list of URLs from my Firefox history. I want to sort and deduplicate them. I don't care about what pages I visited. I am only interested in a list of the hosts.

So a record like

https://en.wikipedia.org/wiki/Large_Hadron_Collider

would be reduced to

https://en.wikipedia.org/

And if I visited several pages

# Pages
https://en.wikipedia.org/wiki/George_R._R._Martin
https://en.wikipedia.org/wiki/Game_of_Thrones

# Hosts 
https://en.wikipedia.org/
https://en.wikipedia.org/

then I only want one example of each host to show in the output.

https://en.wikipedia.org/

I have the URLs in Vim. But I am not sure how to proceed.

I was able to load the URLs in Vim by:

:0 put +

This leaves me with a Vim window populated with my Firefox history. Specifically, there is a list of URLs. There is one URL per line.

I'm using Vim on Linux. These steps might vary on other platforms. For example, carriage returns, ^M, might not show on Windows.

Remove Carriage Returns

If you copy and paste from the Firefox History window, you will see a '^M' at the end of each line. These are explicit carriage returns which are normally non-printing characters.

You can remove the carriage returns using the :substitute command in Vim.

:%s/\r//
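As a cross-check outside Vim, the same cleanup can be sketched in Python. This is only an illustration, assuming the pasted history is available as one string. The function name strip_carriage_returns and the sample text are made up for the example.

```python
def strip_carriage_returns(text):
    """Split pasted text into lines and drop the trailing carriage return on each."""
    # Splitting on "\n" alone leaves any "\r" attached to its line,
    # which is the character Vim displays as ^M.
    return [line.rstrip("\r") for line in text.split("\n") if line.strip()]


# Hypothetical clipboard contents with Windows-style line endings.
pasted = "https://en.wikipedia.org/wiki/Game_of_Thrones\r\nhttps://www.phoronix.com/\r\n"
print(strip_carriage_returns(pasted))
# ['https://en.wikipedia.org/wiki/Game_of_Thrones', 'https://www.phoronix.com/']
```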

Match the Beginning of the Line

It is not immediately clear how to trim the end of each line.

One way to get started is to write a regular expression which captures the beginning of the string: the part of the URL I want to keep. Maybe we can use this as part of the solution.

So if we have a line like

https://www.phoronix.com/scan.php?page=news_item&px=System76-Scheduler-1.1

then we can progressively develop a regular expression that captures the beginning of the URL.

# Original URL
https://www.phoronix.com/scan.php?page=news_item&px=System76-Scheduler-1.1

# Manually delete the end of the string. 
https://www.phoronix.com/

# Anchor the beginning of the string by adding ^
^https://www.phoronix.com/

# Escape forward slashes. 
^https:\/\/www.phoronix.com\/

# Add wildcard for host name. 
^https:\/\/[^/]\+\/

We can test this regular expression by searching for it.

" Activate the Search Highlight option in Vim. 
:set hlsearch

" Search using the regular expression. 
/^https:\/\/[^/]\+\/

To avoid retyping the regular expression, you can paste it from the clipboard into the Vim command line. The search prompt counts as a command line for the purposes of Ctrl-R. First copy the pattern:

^https:\/\/[^/]\+\/

Then press / to open the search prompt, press Ctrl-R, and type + to name the clipboard register. After you type the plus symbol, the command line fills with the contents of the clipboard. Note that if you happened to copy a newline then a '^M' might show at the end of the line. You can press Backspace once to remove '^M'.

Since I turned on search highlight, you might see the highlighting take effect as you enter the regular expression. But you still have to press Enter to formally begin the search.

At this point, we can match the beginning of the URL.
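If you want to check the pattern outside Vim, here is a rough Python equivalent. It is a sketch, not a translation Vim itself provides: Python's re module spells the pattern differently, since / needs no escaping there and + is already a quantifier. The name HOST_PREFIX is made up for the example.

```python
import re

# Python spelling of the Vim pattern ^https:\/\/[^/]\+\/
HOST_PREFIX = re.compile(r"^https://[^/]+/")

url = "https://www.phoronix.com/scan.php?page=news_item&px=System76-Scheduler-1.1"
match = HOST_PREFIX.match(url)

# group(0) is the matched text: the part of the URL I want to keep.
print(match.group(0))  # https://www.phoronix.com/
```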

Start of the Match

Vim has a regular expression atom \zs which sets the start of the match.

It is not immediately obvious what this means or how it is helpful.

At this point, we have two basic functions.

One, we can substitute text for other text. So, if we can find text then we can replace it with nothing. This effectively deletes the text we are searching for.

Two, we can match the beginning of the URL.

I want to delete the end of the string.

It doesn't make sense to delete the beginning of the string which is what we can match right now.

In principle, it is possible to match the end of a URL. But that has an unknown number of path components separated by forward slashes. It is easier to match the beginning of the URL because it has a fixed form: https://wildcard/. That is why I've chosen to match that part.

What we need now is some way to combine (a match for the beginning of the string) with (the substitute command). And that combination has to help me reach my goal.

This is where \zs comes in.

\zs works in combination with another regular expression.

First we search for something.

/^https:\/\/[^/]\+\/

where the leading / begins a search in Vim and

^https:\/\/[^/]\+\/

is the regular expression we are looking for.

Then we add \zs.

Note you can press / and then the up arrow on the keyboard to recall the last search. Then you can modify the search by adding \zs. Don't forget to press Enter to begin the search.

/^https:\/\/[^/]\+\/\zs

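At this point, Vim interprets this to mean: find the beginning of the URL, but start the match immediately after it. Nothing follows \zs yet, so the match itself is empty.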

We can add a wildcard .* to match the rest of the line.

/^https:\/\/[^/]\+\/\zs.*

If you try this search in Vim with the 'hlsearch' option turned on, you will see the end of each URL get highlighted.
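Python's re module has no \zs, so the closest sketch I can offer uses two capture groups. Group 1 stands in for everything before \zs, and group 2 is the part Vim would highlight. The name PATTERN is made up for the example.

```python
import re

# Group 1: the prefix that must be present but is left out of the match.
# Group 2: the rest of the line, which is what \zs.* selects in Vim.
PATTERN = re.compile(r"^(https://[^/]+/)(.*)")

url = "https://en.wikipedia.org/wiki/Large_Hadron_Collider"
m = PATTERN.match(url)

print(m.group(1))  # https://en.wikipedia.org/
print(m.group(2))  # wiki/Large_Hadron_Collider
```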

Another way to think of \zs is as a (regular expression ignore operator) in Vim.

We search for one thing, add \zs, and then search for a second thing.

The first thing still has to be present, but it is left out of the match.

You might be interested to know there is a complementary Vim atom \ze which sets the end of the match.

Trim Strings

Now we have a way to reliably select the end of each URL. If we combine the regular expression with the substitute command then we can replace the end of the string with nothing. This will effectively trim the strings.

:%s/^https:\/\/[^/]\+\/\zs.*//

One shortcut you can take here is to reuse the regular expression from a recent search. Type :%s/ and then press Ctrl-R followed by / to insert the last search pattern into the command line. Using Ctrl-R in this way lets you experiment with different searches and then transfer a working expression to :substitute.

One consequence of using 'https' in my regular expression is that 'http' sites still have their path components. You can remove those paths by making the 's' optional.

:%s/^https\?:\/\/[^/]\+\/\zs.*//

At this point we have trimmed the strings.
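For comparison, the trimming step might look like this in Python. It is a sketch under the same assumptions as the Vim command: the captured prefix is kept and the rest of the line is dropped. The names TRIM and trim_to_host, and the example.com URL, are made up for the example.

```python
import re

# Python spelling of :%s/^https\?:\/\/[^/]\+\/\zs.*//
# The capture group keeps the prefix; .* consumes the rest of the line.
TRIM = re.compile(r"^(https?://[^/]+/).*")


def trim_to_host(url):
    """Return the URL cut off after the slash that follows the host."""
    return TRIM.sub(r"\1", url)


print(trim_to_host("https://en.wikipedia.org/wiki/Game_of_Thrones"))  # https://en.wikipedia.org/
print(trim_to_host("http://example.com/a/b?c=d"))  # http://example.com/
```

As with the Vim command, a line that has no slash after the host does not match and is left unchanged.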

Sort and Deduplicate

I'd like to sort the URLs alphabetically and remove duplicates.

Vim has a :sort command. The :sort command has an option, u, which only returns unique entries. By default, :sort applies to the entire buffer or file.

:sort u

This sorts the strings alphabetically and removes any duplicates.
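The effect of :sort u can be sketched in Python with a set and sorted(). This is only an illustration of the idea, and the list below is sample data rather than my real history.

```python
# Sample data: already trimmed to hosts, with one duplicate.
hosts = [
    "https://www.phoronix.com/",
    "https://en.wikipedia.org/",
    "https://en.wikipedia.org/",
]

# set() removes duplicates; sorted() returns the survivors in order.
unique_sorted = sorted(set(hosts))
print(unique_sorted)
# ['https://en.wikipedia.org/', 'https://www.phoronix.com/']
```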

But we're left with a few problems.

One is that the same site can be listed both with and without 'www.'. You can remove 'www.' from the results and sort again:

:%s/www\.// | sort u

http and https sites will be grouped separately after being sorted alphabetically. You can group the hosts together by stripping off the protocol.

:%s/https\?:\/\/// | sort u

This produces the final output and addresses my original problem.

Now I can go through my sorted list and see what sites I visited.
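The last two refinements can also be sketched in Python. This version is a slight variation on the Vim commands: it strips the protocol first and only removes 'www.' at the start of the host, whereas my :substitute command removes the first 'www.' anywhere on the line. The function name normalize and the sample list are made up for the example.

```python
import re


def normalize(host_url):
    """Drop the protocol, then a leading 'www.', from an already trimmed URL."""
    without_protocol = re.sub(r"^https?://", "", host_url)
    return re.sub(r"^www\.", "", without_protocol)


# Sample data: the same site reached over http and https, with and without www.
trimmed = [
    "https://www.phoronix.com/",
    "http://phoronix.com/",
    "https://en.wikipedia.org/",
]

# The set comprehension deduplicates after normalizing.
print(sorted({normalize(u) for u in trimmed}))
# ['en.wikipedia.org/', 'phoronix.com/']
```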

References

=> Domain Name Syntax | Wikipedia
=> Hot Dog Menu Icon | Wikipedia
=> The 0 in ':0 put +' | Vim Help
=> :put | Vim Help
=> "+ | The Clipboard Register | Vim Help
=> :substitute | Vim Help
=> \r | Carriage Return Escape Sequence | Vim Help
=> Regular Expressions | Regular-Expressions.info
=> Regular Expression Quick Reference | Microsoft Docs
=> 'hlsearch' | Vim Help
=> Ctrl-R | Paste from Clipboard to Command Line | Vim Help
=> \zs | Start of the Match | Vim Help
=> Path Components | URL | Wikipedia
=> "/ | The Last Search Pattern Register | Vim Help
=> :sort | Vim Help
=> Using a Vertical Bar to Separate Multiple Commands | Vim Help

Created: Thursday, May 19, 2022

Updated: Thursday, May 19, 2022

Proxy Information
Original URL
gemini://pwshnotes.flounder.online/gemlog/2022-05-18-trim-strings-firefox-history.gmi