Remove Duplicate Lines

Remove duplicate lines with advanced matching options and duplicate reporting.

Runs entirely in your browser. No uploads. Your files stay private.

How Remove Duplicates Detects Repeated Lines

Remove Duplicates splits the pasted text on newline characters, then walks the array once, building a JavaScript Set of canonicalized keys. The first time a key is seen, the original line is pushed to the output and the key is recorded. Every later line that produces the same key is treated as a duplicate. This single-pass O(n) approach handles tens of thousands of lines in milliseconds because Set.has and Set.add are amortized O(1).
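The single pass described above can be sketched in a few lines. This is a minimal illustration, not the tool's actual source: the names `dedupeLines` and `makeKey` are hypothetical, and the key function defaults to exact matching.

```javascript
// Minimal sketch of a single-pass, first-occurrence dedupe.
// `dedupeLines` and `makeKey` are illustrative names, not the tool's API.
function dedupeLines(text, makeKey = (line) => line) {
  const seen = new Set(); // comparison keys that have already been emitted
  const output = [];
  for (const line of text.split("\n")) {
    const key = makeKey(line);
    if (!seen.has(key)) {
      seen.add(key);
      output.push(line); // push the original line, never the key
    }
  }
  return output;
}

console.log(dedupeLines("a\nb\na\nc\nb"));                 // [ 'a', 'b', 'c' ]
console.log(dedupeLines("A\na", (l) => l.toLowerCase())); // [ 'A' ]
```

Passing the key function as a parameter keeps the walk itself independent of whichever matching options are enabled.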
Canonicalization is configurable. Case-insensitive matching applies toLowerCase before the key is recorded. Whitespace folding collapses runs of whitespace into a single space and trims the ends so that 'hello world' and ' hello world ' collapse to the same key. Punctuation stripping removes every character matching the regex /[^\w\s]/ before comparison; because \w in JavaScript covers only ASCII letters, digits, and the underscore, this pattern also removes non-ASCII letters. Each option only affects the comparison key, not the line that gets written to the output, so the original formatting is preserved.
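As a rough sketch of how such a comparison key might be built, assuming the options are applied in the order case, punctuation, whitespace (the order, the option names, and the function name `comparisonKey` are assumptions for illustration; the `g` flag is added so that every match is removed):

```javascript
// Hypothetical key builder; option names are illustrative, not the tool's own.
function comparisonKey(
  line,
  { ignoreCase = false, foldWhitespace = false, stripPunctuation = false } = {}
) {
  let key = line;
  if (ignoreCase) key = key.toLowerCase();
  // \w is ASCII-only in JavaScript, so this also drops letters such as "é".
  if (stripPunctuation) key = key.replace(/[^\w\s]/g, "");
  // Fold last, so gaps left behind by removed punctuation collapse too.
  if (foldWhitespace) key = key.replace(/\s+/g, " ").trim();
  return key;
}

const all = { ignoreCase: true, foldWhitespace: true, stripPunctuation: true };
console.log(comparisonKey("  Hello,   World! ", all));                // "hello world"
console.log(comparisonKey("Hello"));                                  // "Hello"
console.log(comparisonKey("caf\u00e9!", { stripPunctuation: true })); // "caf"
```

The last call shows the ASCII-only behaviour of `\w`: the accented letter is removed along with the exclamation mark.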
Unicode normalization uses String.prototype.normalize. Many visually identical strings have multiple binary representations — the letter e-acute can be a single code point (U+00E9) or two code points (e plus combining acute, U+0065 U+0301). The 'normalize Unicode' switch runs both forms through NFC so they hash identically. The 'ignore accents' switch goes further: it decomposes to NFD, strips the combining-mark range U+0300 to U+036F, and matches accent-insensitively.
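The two switches can be illustrated with a short sketch built on String.prototype.normalize. The function name `unicodeKey` and the option name `ignoreAccents` are hypothetical; only the NFC/NFD steps and the U+0300 to U+036F range come from the description above.

```javascript
// Hypothetical helper: NFC by default, NFD plus mark stripping when ignoring accents.
function unicodeKey(line, { ignoreAccents = false } = {}) {
  if (ignoreAccents) {
    // Decompose, then drop combining marks in U+0300..U+036F.
    return line.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
  }
  return line.normalize("NFC");
}

const composed = "\u00e9";    // single code point
const decomposed = "e\u0301"; // "e" plus combining acute

console.log(composed === decomposed);                             // false
console.log(unicodeKey(composed) === unicodeKey(decomposed));     // true
console.log(unicodeKey("caf\u00e9", { ignoreAccents: true }));    // "cafe"
```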
The tool keeps the first occurrence by default. There is no setting for 'keep last' because the trade-off is rarely useful on real data and always confusing when the duplicates differ in case or whitespace. If you specifically need the last occurrence, reverse the input, dedupe, and reverse the output again. The result keeps the last copy of each line, in the order those last copies appear in the input.
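The reverse, dedupe, reverse workaround looks like this when done in code rather than by hand. The helper name `dedupeKeepLast` is hypothetical, and the sketch assumes the input is already split into an array of lines.

```javascript
// Hypothetical helper showing the reverse / dedupe / reverse trick.
function dedupeKeepLast(lines, makeKey = (line) => line) {
  const seen = new Set();
  const kept = [];
  // Walk a reversed copy so the "first" hit is really the last occurrence.
  for (const line of [...lines].reverse()) {
    const key = makeKey(line);
    if (!seen.has(key)) {
      seen.add(key);
      kept.push(line);
    }
  }
  return kept.reverse(); // restore top-to-bottom order
}

console.log(dedupeKeepLast(["Bob", "alice", "bob"], (l) => l.toLowerCase()));
// [ 'alice', 'bob' ]
```

With case-insensitive matching, the surviving line is the lowercase 'bob' because it appears last, which is exactly the kind of difference that makes a 'keep last' result surprising.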
Three output modes exist. 'Unique' returns the deduplicated list (default). 'Duplicates only' returns just the lines that appeared more than once, useful for audit trails. 'With counts' annotates each unique line with the number of occurrences, which is what you usually want when analyzing log files or survey responses.
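One way to picture the three modes is a single counting pass over the lines, followed by a different projection per mode. This is a sketch under assumptions: the function name `runMode` and the mode identifiers are invented for illustration, it uses exact matching only, and the line-tab-count format follows the description in the FAQ.

```javascript
// Hypothetical sketch: count once, then project the counts per output mode.
function runMode(lines, mode = "unique") {
  // A Map preserves insertion order, so keys stay in first-occurrence order.
  const counts = new Map();
  for (const line of lines) {
    counts.set(line, (counts.get(line) ?? 0) + 1);
  }
  switch (mode) {
    case "unique":
      return [...counts.keys()];
    case "duplicates":
      return [...counts].filter(([, n]) => n > 1).map(([line]) => line);
    case "counts":
      return [...counts].map(([line, n]) => `${line}\t${n}`);
    default:
      throw new Error(`Unknown mode: ${mode}`);
  }
}

const sample = ["GET /", "POST /login", "GET /", "GET /"];
console.log(runMode(sample, "unique"));     // [ 'GET /', 'POST /login' ]
console.log(runMode(sample, "duplicates")); // [ 'GET /' ]
console.log(runMode(sample, "counts"));     // [ 'GET /\t3', 'POST /login\t1' ]
```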
Performance scales linearly with input size, but the comparison key calculation can dominate on very large inputs with all options enabled. For a million lines with case-folding, whitespace collapsing, punctuation stripping, and Unicode normalization, expect roughly five to fifteen seconds on a modern laptop. Plain exact-match dedupe of the same input runs in well under a second.
Output is exposed via Copy, Download (as a text/plain Blob), and Replace Input (which feeds the result back into the input box for chaining with another tool such as Sort Lines or Case Converter). Nothing leaves the browser tab: there is no upload endpoint behind this page.
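A client-side download of this kind is commonly wired up with a Blob and an object URL. The sketch below shows that general pattern, not the page's actual code: the function names and the default file name are assumptions, and `downloadBlob` only works where a DOM is available.

```javascript
// Build a text/plain Blob from the output lines (hypothetical helper).
function toTextBlob(lines) {
  return new Blob([lines.join("\n")], { type: "text/plain" });
}

// Browser-only: trigger a download through a temporary link.
// The file name is an illustrative default.
function downloadBlob(blob, filename = "deduplicated.txt") {
  const url = URL.createObjectURL(blob);
  const link = document.createElement("a");
  link.href = url;
  link.download = filename;
  document.body.appendChild(link);
  link.click();
  link.remove();
  // Release the object URL once the click has been dispatched.
  setTimeout(() => URL.revokeObjectURL(url), 0);
}

const blob = toTextBlob(["a", "b"]);
console.log(blob.type, blob.size); // text/plain 3
```

Because the Blob is created and consumed inside the page, no request to a server is involved at any point.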

Common Use Cases

01

Mailing list deduplication

Strip duplicate email addresses from a list, optionally lowercasing first so 'Bob@example.com' and 'bob@example.com' collapse.

02

Log file cleanup

Remove repeated INFO and WARN lines from server logs to surface unique events for review.

03

CSV row dedupe

Paste a column extracted from a spreadsheet and produce a clean unique list ready to paste back into Sheets or Excel.

04

SEO keyword consolidation

Merge keyword exports from multiple tools and dedupe across the union, ignoring case and surrounding whitespace.

Frequently Asked Questions

The first occurrence is kept. The deduplication walks the input top-to-bottom, and the first time a normalized key is seen, that line is preserved verbatim. To keep the last occurrence instead, reverse your input, run the dedupe, then reverse the output; that produces the same result as a hypothetical 'keep last' setting.
Two lines are duplicates if their normalized keys match. The key is the line itself by default, modified by whichever options you enable: case-insensitive, whitespace-folded, punctuation-stripped, NFC-normalized, or accent-stripped. The original line text is never modified — only the comparison key is normalized.
Unicode normalization uses String.prototype.normalize('NFC') to collapse equivalent code-point sequences. Many accented characters can be represented as either a single composed code point or a base letter plus a combining mark. NFC picks the composed form, which makes 'é' (U+00E9) and 'é' (U+0065 plus U+0301) compare equal. The accent-insensitive option goes further by stripping all combining marks.
By default the original order is preserved (the first occurrence keeps its original position). For sorted output, run the result through the Sort Lines tool — there is a 'Replace Input' button that hands the dedupe output to whatever tool you open next.
In 'with counts' mode, instead of returning each unique line once, the tool returns each unique line together with the number of times it appeared. The format is the line followed by a tab and the count. This is useful for log analysis, survey response tabulation, and frequency-sorted lists.
Memory is the only ceiling on input size. Deduplicating a million ASCII lines with no normalization runs in well under a second and uses about 50 to 100 MB of browser memory. Enabling every normalization option on the same input takes five to fifteen seconds because each line builds a normalized key. For multi-million-line inputs, consider doing the work in a script with a streaming approach.
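For the streaming approach mentioned above, one possible shape is an async generator that consumes lines as they arrive and yields only the first occurrence of each key. This is a sketch, not part of the tool: `uniqueLines` is a hypothetical name, and the file name in the usage comment is a placeholder.

```javascript
// Hypothetical streaming dedupe: works on any (async) iterable of lines.
async function* uniqueLines(lineSource, makeKey = (line) => line) {
  const seen = new Set(); // still grows with the number of unique keys
  for await (const line of lineSource) {
    const key = makeKey(line);
    if (seen.has(key)) continue;
    seen.add(key);
    yield line;
  }
}

// Possible Node.js usage, inside an async function ("input.txt" is a placeholder):
//   const readline = require("node:readline");
//   const fs = require("node:fs");
//   const rl = readline.createInterface({
//     input: fs.createReadStream("input.txt"),
//     crlfDelay: Infinity,
//   });
//   for await (const line of uniqueLines(rl)) process.stdout.write(line + "\n");
```

Streaming avoids holding the whole input and output in memory at once, but the set of seen keys is still kept, so memory use scales with the number of unique lines rather than the total line count.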
Blank lines are removed only if you enable the 'remove empty lines' option. By default, a blank line is treated as a value like any other: the first blank line is kept and every later one is removed as a duplicate. The combination 'whitespace-folded plus remove empty' produces the cleanest list when the input came from a copy-paste.
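The interaction between the two options can be sketched as follows. The function and option names are hypothetical, and the sketch assumes the emptiness check is applied to the comparison key, so that whitespace-only lines count as empty once folding is enabled.

```javascript
// Hypothetical sketch of blank-line handling during dedupe.
function dedupeWithBlankHandling(
  lines,
  { removeEmpty = false, foldWhitespace = false } = {}
) {
  const seen = new Set();
  const out = [];
  for (const line of lines) {
    const key = foldWhitespace ? line.replace(/\s+/g, " ").trim() : line;
    // Assumption: emptiness is judged on the key, not the raw line.
    if (removeEmpty && key === "") continue;
    if (seen.has(key)) continue;
    seen.add(key);
    out.push(line);
  }
  return out;
}

const pasted = ["a", "", "b", "", "  "];
console.log(dedupeWithBlankHandling(pasted));
// [ 'a', '', 'b', '  ' ]
console.log(dedupeWithBlankHandling(pasted, { removeEmpty: true }));
// [ 'a', 'b', '  ' ]
console.log(dedupeWithBlankHandling(pasted, { removeEmpty: true, foldWhitespace: true }));
// [ 'a', 'b' ]
```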
No, your text is not sent anywhere. The dedupe walk runs synchronously inside the page using a JavaScript Set. There is no fetch call and no analytics on the input. You can disconnect from the network after the page loads and the tool keeps working.
CSV rows that look identical can still produce different keys because of trailing commas, quoting differences ("foo, bar" vs foo, bar), and stray whitespace. Enable whitespace-folding and punctuation-stripping if you only care about the underlying field values, or pre-process the CSV through a parser that normalizes the row format before pasting.
Yes, the tool can show only the lines that were duplicated. Switch to 'duplicates only' mode and the output becomes just the lines that appeared more than once. Each duplicate appears once in that view; switch to 'with counts' if you also need the multiplicities.
