Text Cleaner & Formatter

Clean up messy text: remove extra spaces, normalize line endings, fix punctuation, change case, strip HTML, and more. 100% client-side, handles large inputs efficiently.

Runs entirely in your browser. No uploads. Your files stay private.

How the Text Cleaner Pipeline Works

Text Cleaner runs a configurable pipeline of small string transforms over the input. Each operation is implemented with native String methods and RegExp — no external NLP library is loaded — and the operations execute in a fixed order so combining them produces predictable output. The full pipeline covers more than twenty independent transformations spanning whitespace, line endings, quotes, HTML, casing, and Unicode normalization.
Whitespace cleanup uses three regex passes. Extra spaces collapse runs of two or more space characters to a single space using /[ ]{2,}/g (tabs are handled separately). Trailing whitespace per line is stripped with /[ \t]+$/gm using the multiline flag so $ matches each line end. Empty lines are detected with /^\s*$/gm and either removed or collapsed to a single blank.
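
The three whitespace passes can be sketched as follows. This is a minimal illustration, not the tool's actual source: the function names and the `BlankLineMode` type are hypothetical, and the blank-line pass assumes the input has already been normalized to LF and tests each line separately instead of using one multiline regex.

```typescript
// Hypothetical helper names; the first two regexes are the ones quoted in the text.
type BlankLineMode = "keep" | "remove" | "collapse";

function collapseSpaces(text: string): string {
  // Runs of two or more U+0020 become one space; tabs are left alone.
  return text.replace(/[ ]{2,}/g, " ");
}

function stripTrailingWhitespace(text: string): string {
  // With the m flag, $ matches at the end of every line, not only at end of input.
  return text.replace(/[ \t]+$/gm, "");
}

function handleBlankLines(text: string, mode: BlankLineMode): string {
  if (mode === "keep") return text;
  const out: string[] = [];
  let previousBlank = false;
  for (const line of text.split("\n")) {
    const blank = /^\s*$/.test(line);
    // "remove" drops every blank line; "collapse" drops all but the first of a run.
    if (blank && (mode === "remove" || previousBlank)) continue;
    out.push(blank ? "" : line);
    previousBlank = blank;
  }
  return out.join("\n");
}
```

Running the trailing-whitespace pass before the blank-line pass means whitespace-only lines are already empty by the time they are counted.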
Line-ending normalization handles the historical mess of CRLF (Windows, U+000D U+000A), LF (Unix and macOS, U+000A), and standalone CR (classic Mac, U+000D). The tool detects which style is present and rewrites the entire input to a single chosen style. This matters when pasted text mixes line endings — git, diff, and most code formatters treat CRLF and LF differently.
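
A minimal sketch of detection and rewriting, assuming hypothetical function names (`countLineEndings`, `normalizeLineEndings`) that are not taken from the tool itself:

```typescript
type Eol = "\n" | "\r\n" | "\r";

// Count each style so a caller can report which ones are present.
function countLineEndings(text: string): { crlf: number; lf: number; cr: number } {
  const crlf = (text.match(/\r\n/g) ?? []).length;
  // Every CRLF also contains one LF and one CR, so subtract the pairs.
  const lf = (text.match(/\n/g) ?? []).length - crlf;
  const cr = (text.match(/\r/g) ?? []).length - crlf;
  return { crlf, lf, cr };
}

function normalizeLineEndings(text: string, target: Eol = "\n"): string {
  // CRLF is listed first in the alternation so the pair is consumed as one unit
  // and never turned into two line breaks.
  return text.replace(/\r\n|\r|\n/g, target);
}
```

Doing the rewrite in a single pass matters: replacing CR and LF in two separate steps would double every CRLF line break.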
Quote normalization swaps between straight quotes (U+0022, U+0027) and the smart-quote family (U+201C, U+201D, U+2018, U+2019, plus the prime characters U+2032, U+2033). The toggle uses a small lookup map rather than a regex. Smart-to-straight is a direct substitution; straight-to-smart is directional, choosing the opening or closing form based on whether the adjacent character is alphanumeric or whitespace.
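
Both directions can be sketched as below. The names are hypothetical, the prime characters are left out for brevity, and the opening-quote rule (start of input, whitespace, or an opening bracket before the quote) is an assumption about a reasonable heuristic, not the tool's confirmed logic.

```typescript
// Lookup map for the mechanical direction.
const SMART_TO_STRAIGHT = new Map<string, string>([
  ["\u201C", '"'], ["\u201D", '"'],
  ["\u2018", "'"], ["\u2019", "'"],
]);

function smartToStraight(text: string): string {
  // All four code points are in the BMP, so per-character iteration is safe.
  return Array.from(text, (ch) => SMART_TO_STRAIGHT.get(ch) ?? ch).join("");
}

function straightToSmart(text: string): string {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (ch !== '"' && ch !== "'") {
      out += ch;
      continue;
    }
    const prev = i > 0 ? text.charAt(i - 1) : "";
    // Assumed rule: a quote opens at the start of input, after whitespace,
    // or after an opening bracket; otherwise it closes.
    const opening = prev === "" || /[\s(\[{]/.test(prev);
    if (ch === '"') out += opening ? "\u201C" : "\u201D";
    else out += opening ? "\u2018" : "\u2019";
  }
  return out;
}
```

Under this rule an apostrophe inside a word, as in "it's", follows a letter and therefore becomes U+2019.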
Non-printable cleanup removes the C0 control characters (U+0000 to U+001F), DEL (U+007F), and the C1 control characters (U+0080 to U+009F), excluding tab, newline, and carriage return. The byte-order mark (U+FEFF) is also stripped if present at the start of the input; copying from a file saved as UTF-8 with a BOM is the most common way that ghost character ends up in pasted prose. The zero-width space, the zero-width joiner characters, and the word joiner (U+200B through U+200D, U+2060) are also removed because they are invisible but break search and diff tools.
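
As a rough sketch, the ranges above translate into three replacements. The function name is hypothetical; the character classes follow the ranges stated in the text.

```typescript
function removeInvisibleCharacters(text: string): string {
  return text
    // Byte-order mark, only when it is the very first character.
    .replace(/^\uFEFF/, "")
    // C0 controls, DEL, and C1 controls, skipping tab (U+0009),
    // newline (U+000A), and carriage return (U+000D).
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/g, "")
    // Zero-width space, non-joiner, joiner, and word joiner.
    .replace(/[\u200B-\u200D\u2060]/g, "");
}
```

Note that U+200D is also the glue in multi-part emoji sequences, so this pass splits such sequences into their component emoji.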
HTML stripping uses the regex /<[^>]+>/g to remove tags, then optionally decodes entities (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot;, &amp;#39;, and the named-entity subset that document.createElement().textContent will resolve). HTML entity encoding does the reverse using a small lookup table. Neither operation parses HTML; both treat the input as a plain string, so malformed markup with unbalanced angle brackets may produce surprising results.
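
A minimal sketch of the three string-level operations, limited to the five basic entities. The function and table names are hypothetical, and the DOM-based decoding of other named entities is not shown.

```typescript
const ENTITY_DECODE = new Map<string, string>([
  ["&amp;", "&"], ["&lt;", "<"], ["&gt;", ">"], ["&quot;", '"'], ["&#39;", "'"],
]);
const ENTITY_ENCODE = new Map<string, string>([
  ["&", "&amp;"], ["<", "&lt;"], [">", "&gt;"], ['"', "&quot;"], ["'", "&#39;"],
]);

function stripTags(text: string): string {
  // Removes everything from a "<" up to the next ">".
  return text.replace(/<[^>]+>/g, "");
}

function decodeBasicEntities(text: string): string {
  // One pass only, so "&amp;lt;" decodes to "&lt;" and not all the way to "<".
  return text.replace(/&(?:amp|lt|gt|quot|#39);/g, (m) => ENTITY_DECODE.get(m) ?? m);
}

function encodeBasicEntities(text: string): string {
  return text.replace(/[&<>"']/g, (ch) => ENTITY_ENCODE.get(ch) ?? ch);
}
```

The string-level approach has a visible cost: `stripTags("if a < b and c > d")` deletes the span between the two comparison operators, because the regex cannot tell them apart from a tag.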
Settings persist via localStorage under the key 'text-cleaner-settings' and the live-preview pipeline is debounced by 200 ms to keep typing responsive. The full clean pass runs in O(n) over the input length, so multi-megabyte documents are interactive on modern hardware. No request leaves the browser tab — confidential drafts and unredacted logs stay local.
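
Persistence and debouncing might look roughly like this. Only the storage key and the 200 ms delay come from the text; the `CleanerSettings` fields, the default values, and the `KeyValueStore` interface (a subset of the browser's Storage API, so `window.localStorage` can be passed in directly) are illustrative assumptions.

```typescript
// Subset of the Storage interface; window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical settings shape, for illustration only.
interface CleanerSettings {
  collapseSpaces: boolean;
  lineEnding: "lf" | "crlf" | "cr";
}

const SETTINGS_KEY = "text-cleaner-settings";
const DEFAULT_SETTINGS: CleanerSettings = { collapseSpaces: true, lineEnding: "lf" };

function loadSettings(store: KeyValueStore): CleanerSettings {
  const raw = store.getItem(SETTINGS_KEY);
  if (raw === null) return { ...DEFAULT_SETTINGS };
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      return { ...DEFAULT_SETTINGS };
    }
    // Stored values override defaults; keys missing from storage keep their default.
    return { ...DEFAULT_SETTINGS, ...(parsed as Partial<CleanerSettings>) };
  } catch {
    // Corrupt JSON falls back to defaults instead of breaking the page.
    return { ...DEFAULT_SETTINGS };
  }
}

function saveSettings(store: KeyValueStore, settings: CleanerSettings): void {
  store.setItem(SETTINGS_KEY, JSON.stringify(settings));
}

// Delays fn until waitMs have passed without another call (200 in the text).
function debounce<A extends unknown[]>(fn: (...args: A) => void, waitMs: number): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}
```

In a browser, a live preview would wrap its update function once, for example `debounce(updatePreview, 200)`, and call the wrapped version from the input handler; `updatePreview` is a placeholder name here.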

Common Use Cases

01

PDF copy-paste cleanup

Remove the stray spaces, smart quotes, and broken line wraps that PDF readers introduce when copying body text into a Word document or email.

02

Word-to-CMS migration

Strip Word's smart quotes, em-dashes, and non-breaking spaces (U+00A0) before pasting into a CMS that expects plain ASCII.

03

Code linting prep

Normalize tabs to spaces, fix CRLF vs LF mismatches, and strip trailing whitespace per line before committing pasted-in source code.

04

Email body sanitation

Remove forwarded-message indentation, soft hyphens, and zero-width spaces that break search and reply formatting in long email threads.

Frequently Asked Questions

Which invisible characters does the cleaner remove?
Control characters in U+0000 to U+001F (excluding tab, newline, carriage return) and U+007F to U+009F, the byte-order mark U+FEFF when at the start of input, and the zero-width family U+200B (zero-width space), U+200C (zero-width non-joiner), U+200D (zero-width joiner), and U+2060 (word joiner). These are common pollution sources from rich-text copy operations.
How does smart-quote conversion decide which quote to use?
A directional lookup runs over the input. Smart-to-straight is mechanical: U+201C and U+201D both become U+0022, U+2018 and U+2019 both become U+0027. Straight-to-smart is contextual: the tool checks the character before each quote to decide whether to emit the opening or closing form, matching the typesetting convention used by most word processors.
What does 'fix line endings' do?
When 'fix line endings' is on, every CRLF, CR, and LF in the input is rewritten to whichever single style you select (LF by default, the Unix and macOS convention). This is essential when pasting text from Windows into a shell script, where stray carriage returns cause errors such as `$'\r': command not found`.
Does HTML stripping use a real HTML parser?
No. It uses the regex /<[^>]+>/g, which removes everything from a < up to the next >. Self-closing tags are caught, and so are comments and CDATA blocks as long as they contain no > of their own. Malformed input with unbalanced angle brackets may leave markup behind or swallow text; for those cases use a real HTML parser. For typical pasted web content, the regex approach is fast and usually sufficient.
Why did my apostrophes change after cleaning?
The smart-quotes toggle works in both directions. If you converted straight to smart and your text had ASCII apostrophes, they were rewritten to U+2019 (right single quotation mark). To reverse, run the converted text through again with the opposite direction selected, or paste a fresh copy and disable the option.
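Does it preserve non-English text and emoji?
Yes, by default. CJK, Cyrillic, Arabic, emoji, and accented Latin all pass through unchanged. The 'remove accents' toggle is the explicit opt-in that decomposes via NFD and strips combining marks, turning 'café' into 'cafe'. Without that option enabled, accented characters are preserved. The one exception is non-printable cleanup, which removes U+200C and U+200D; some scripts and multi-part emoji sequences depend on those characters.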
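
The 'remove accents' step can be sketched as below. The function name is hypothetical, and two details are assumptions: only the Combining Diacritical Marks block (U+0300 to U+036F) is stripped, and the result is recomposed with NFC so that scripts such as Hangul, which NFD also decomposes, come back in their original form.

```typescript
function removeAccents(text: string): string {
  return text
    // Split precomposed letters into base letter + combining mark(s).
    .normalize("NFD")
    // Drop marks in the Combining Diacritical Marks block only (assumption).
    .replace(/[\u0300-\u036F]/g, "")
    // Recompose whatever NFD split apart that was not an accent (assumption).
    .normalize("NFC");
}
```

Letters without a canonical decomposition, such as 'ø', are not affected by this approach.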
Is there a limit on input size?
Only the memory available to the browser tab. The full pipeline is O(n) over input length, so multi-megabyte text stays responsive. The live preview debounces at 200 ms, so very large inputs may show a small lag while you type. If the live preview is sluggish, click 'Clean' to run the pipeline once instead.
Are my settings synced across browsers or devices?
No. Settings are stored in localStorage under the key 'text-cleaner-settings' and live in the current browser only. There is no account system and no cloud sync. To reset to defaults, clear the key via DevTools > Application > Local Storage.
What does 'fix punctuation' do?
It corrects spacing around common punctuation: collapses double spaces after periods to a single space, removes spaces before commas and periods, ensures a space after sentence-ending punctuation, and normalizes ellipsis (three periods become U+2026 if 'smart quotes' is on, or U+2026 becomes three periods if straight is on).
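
A rough sketch of these rules follows. The function names are hypothetical, and the "space after sentence-ending punctuation" rule is narrowed here to cases where an uppercase ASCII letter follows immediately, an assumption made to avoid splitting decimals and domain names; it will still split abbreviations such as "U.S.A".

```typescript
function fixPunctuationSpacing(text: string): string {
  return text
    // Remove spaces or tabs before commas and periods.
    .replace(/[ \t]+([,.])/g, "$1")
    // Collapse two or more spaces after sentence-ending punctuation to one.
    .replace(/([.!?]) {2,}/g, "$1 ")
    // Insert a missing space when an uppercase letter follows directly (assumed rule).
    .replace(/([.!?])(?=[A-Z])/g, "$1 ");
}

// Ellipsis normalization in each direction.
function toEllipsisCharacter(text: string): string {
  return text.replace(/\.{3}/g, "\u2026");
}

function toThreePeriods(text: string): string {
  return text.replace(/\u2026/g, "...");
}
```

A caller would pick `toEllipsisCharacter` or `toThreePeriods` according to the selected quote direction.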
Is my text sent to a server?
No. The cleaner runs synchronously inside the browser tab. There is no fetch call and no analytics on the input. You can disconnect from the network after the page loads and the tool keeps working.
