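To parse UBB code (Ultimate Bulletin Board code, more widely known as BBCode) in Go and convert it to HTML reliably, you need a parser that understands UBB tags and generates the corresponding HTML. UBB is a simple text markup language commonly used on forums, message boards, and similar sites.
Below are several ways to implement UBB-to-HTML conversion in Go, along with a detailed example: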
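Core idea:
1. Find the UBB tags: scan the text and recognize the opening and closing tags (for example [b], [/b], [url=...], [/url]).
2. Extract the tag content: take the text between the opening and closing tag.
3. Generate HTML: convert each UBB tag to the corresponding HTML tag according to its type.
4. Handle nesting: UBB tags can be nested, and the parser has to preserve that nesting correctly.
5. Escape special characters: special characters in the user's text (such as < and >) must be escaped when the HTML is generated, to avoid parsing errors and security problems.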
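Steps 1 and 2 above can be sketched as a small tokenizer that splits the input into tag tokens and text tokens. This is only an illustration: the names `token`, `tagRe`, and `tokenize` are hypothetical, and the pattern assumes tag names made of ASCII letters (plus `*` for list items) with an optional `=value` part.

```go
package main

import (
	"fmt"
	"regexp"
)

// token is either a UBB tag such as [b], [/b] or [url=...], or a run of plain text.
type token struct {
	isTag bool
	text  string
}

// tagRe matches an opening or closing tag with an optional "=value" part.
var tagRe = regexp.MustCompile(`\[/?[a-zA-Z*]+(?:=[^\]\[]*)?\]`)

// tokenize splits the input into text and tag tokens, in document order.
func tokenize(input string) []token {
	var tokens []token
	last := 0
	for _, loc := range tagRe.FindAllStringIndex(input, -1) {
		if loc[0] > last {
			// Plain text between the previous tag and this one.
			tokens = append(tokens, token{isTag: false, text: input[last:loc[0]]})
		}
		tokens = append(tokens, token{isTag: true, text: input[loc[0]:loc[1]]})
		last = loc[1]
	}
	if last < len(input) {
		tokens = append(tokens, token{isTag: false, text: input[last:]})
	}
	return tokens
}

func main() {
	for _, t := range tokenize("Hi [b]bold[/b] [url=https://example.com]link[/url]") {
		fmt.Printf("tag=%-5v %q\n", t.isTag, t.text)
	}
}
```

A later stage can then walk the tokens, escape the text tokens, and decide per tag token what HTML to emit.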
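Method 1: regular expressions (simple, but may not be robust enough)
For simple UBB tags, regular expressions are a quick solution. Once complicated nesting and edge cases are involved, however, they become very complex and hard to maintain.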
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// ubbRule pairs a UBB pattern with the HTML template that replaces it.
type ubbRule struct {
	re   *regexp.Regexp
	repl string
}

// ubbRules is compiled once at package level rather than on every call.
// (?s) lets "." match newlines so that a tag pair may span several lines.
// Attribute values use restrictive character classes so that they cannot run
// past the closing bracket, inject extra CSS, or carry a javascript: URL.
var ubbRules = []ubbRule{
	// 1. Code blocks. The input is already escaped at this point, so "<" and
	// ">" inside the block are displayed literally instead of being parsed.
	{regexp.MustCompile(`(?s)\[code\](.*?)\[/code\]`), `<pre><code>$1</code></pre>`},
	// 2. Bold, italic, underline, strikethrough.
	{regexp.MustCompile(`(?s)\[b\](.*?)\[/b\]`), `<strong>$1</strong>`},
	{regexp.MustCompile(`(?s)\[i\](.*?)\[/i\]`), `<em>$1</em>`},
	{regexp.MustCompile(`(?s)\[u\](.*?)\[/u\]`), `<u>$1</u>`},
	{regexp.MustCompile(`(?s)\[s\](.*?)\[/s\]`), `<del>$1</del>`},
	// 3. Links: [url=http://example.com]Link Text[/url], http and https only.
	// The attribute-less form [url]Link Text[/url] is not handled here.
	{regexp.MustCompile(`(?s)\[url=(https?://[^\]\s]+)\](.*?)\[/url\]`), `<a href="$1">$2</a>`},
	// 4. Images: [img]URL[/img] and, in the dialect assumed here, [img=alt text]URL[/img].
	{regexp.MustCompile(`\[img\](https?://[^\[\]\s]+)\[/img\]`), `<img src="$1" alt="">`},
	{regexp.MustCompile(`\[img=([^\]]*)\](https?://[^\[\]\s]+)\[/img\]`), `<img src="$2" alt="$1">`},
	// 5. Quotes, with an optional author: [quote=Author]text[/quote].
	{regexp.MustCompile(`(?s)\[quote\](.*?)\[/quote\]`), `<blockquote>$1</blockquote>`},
	{regexp.MustCompile(`(?s)\[quote=([^\]]*)\](.*?)\[/quote\]`), `<blockquote><cite>$1</cite><br>$2</blockquote>`},
	// 6. Color: [color=red]Red text[/color] or [color=#ff0000]...[/color].
	{regexp.MustCompile(`(?s)\[color=(#[0-9a-fA-F]{3,6}|[a-zA-Z]+)\](.*?)\[/color\]`), `<span style="color: $1;">$2</span>`},
	// 7. Size: [size=16px]Large text[/size].
	{regexp.MustCompile(`(?s)\[size=(\d{1,3}(?:px|pt|em|%))\](.*?)\[/size\]`), `<span style="font-size: $1;">$2</span>`},
	// 8. List items: [*] Item 1. Grouping the resulting <li> elements into
	// <ul> or <ol> needs state that a single pattern cannot track, so the
	// items are emitted bare here.
	{regexp.MustCompile(`\[\*\][ \t]*([^\n]*)`), `<li>$1</li>`},
}

// UBBToHTMLRegex converts a small UBB subset to HTML using regular expressions.
func UBBToHTMLRegex(ubb string) string {
	// Normalize line endings.
	ubb = strings.ReplaceAll(ubb, "\r\n", "\n")

	// Escape the whole input before any tag is converted. UBB tags use square
	// brackets, which html.EscapeString leaves alone, so the tags survive while
	// every "<", ">", "&" and quote typed by the user becomes harmless. The
	// only raw HTML in the output is then the HTML generated below. Escaping
	// after the conversion would instead destroy the generated tags, and
	// escaping inside each replacement would double-escape nested tags.
	ubb = html.EscapeString(ubb)

	for _, r := range ubbRules {
		ubb = r.re.ReplaceAllString(ubb, r.repl)
	}

	// Known limitations of this approach:
	//   - nested tags of the same kind, such as [b]a[b]b[/b]c[/b], are paired incorrectly;
	//   - UBB tags inside [code] blocks are still converted by the later rules;
	//   - newlines are not converted to <br>, because doing so without touching
	//     <pre><code> blocks requires tracking where those blocks are.
	return ubb
}

func main() {
	fmt.Println(UBBToHTMLRegex(`[b]Hello[/b] <script>alert('XSS')</script>`))
}
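The function above leaves list items as bare `<li>` elements. One way to close that gap, sketched here as a post-processing step, is to wrap each run of adjacent `<li>` lines in a `<ul>`. The helper name `wrapListItems` is hypothetical, and the sketch assumes that each item sits on its own line and that only unordered lists are needed.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// liRunRe matches one or more consecutive lines that each hold <li>...</li>.
var liRunRe = regexp.MustCompile(`(?:<li>[^\n]*</li>\n?)+`)

// wrapListItems wraps every run of adjacent <li> lines in a single <ul> element.
func wrapListItems(htmlText string) string {
	return liRunRe.ReplaceAllStringFunc(htmlText, func(run string) string {
		return "<ul>\n" + strings.TrimRight(run, "\n") + "\n</ul>\n"
	})
}

func main() {
	fmt.Print(wrapListItems("A list:\n<li>Item 1</li>\n<li>Item 2</li>\nDone.\n"))
}
```

Nested lists and `[list=1]`-style ordered lists would still need a parser that tracks which list is currently open.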
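Method 2: string searching and slicing (more control, but more work to implement by hand)
This method locates the start and end of each UBB tag manually, then extracts the content and generates the HTML. It is more precise and gives you finer control over the processing logic, especially for nesting and edge cases.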
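A minimal sketch of this idea for a single tag type, using only `strings.Index` and slicing. The helper name `replacePair` is hypothetical; it assumes the input has already been HTML-escaped and it does not handle nesting of the same tag.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// replacePair rewrites every [tag]...[/tag] pair in s as <htmlTag>...</htmlTag>
// by searching for the delimiters with strings.Index instead of a regexp.
func replacePair(s, tag, htmlTag string) string {
	open, closeTag := "["+tag+"]", "[/"+tag+"]"
	var out strings.Builder
	for {
		start := strings.Index(s, open)
		if start < 0 {
			break
		}
		bodyStart := start + len(open)
		rel := strings.Index(s[bodyStart:], closeTag)
		if rel < 0 {
			break // unmatched opening tag: leave the rest untouched
		}
		out.WriteString(s[:start])
		out.WriteString("<" + htmlTag + ">")
		out.WriteString(s[bodyStart : bodyStart+rel])
		out.WriteString("</" + htmlTag + ">")
		s = s[bodyStart+rel+len(closeTag):]
	}
	out.WriteString(s)
	return out.String()
}

func main() {
	// Escape first so that user-typed HTML cannot reach the output unescaped.
	in := html.EscapeString("a [b]bold[/b] <i>not a tag</i>")
	fmt.Println(replacePair(in, "b", "strong"))
}
```

Because the positions are explicit, it is straightforward to extend the loop with extra checks, for example rejecting a pair whose body contains another opening tag of the same kind.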
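Method 3: a dedicated UBB parsing library (recommended)
In a real project, the most advisable option is a UBB parsing library that already takes care of the details and the security issues for you. The Go standard library has no built-in UBB parser, but third-party libraries may exist in the community. If you cannot find a suitable one, consider implementing your own.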
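Implementing a more robust custom parser (state machine / iteration)
Regular expressions become very difficult to use once complex nesting and ambiguity are involved. A simple state machine or iterative parser is a better approach. The converter below wraps the conversion in a struct as a starting point; it still relies on regular expressions internally.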
package main

import (
	"fmt"
	"html"
	"regexp"
	"strconv"
	"strings"
)

// UBBConverter manages the UBB-to-HTML conversion.
type UBBConverter struct {
	// Further configuration, such as the allowed tags and attributes, could go here.
}

// NewUBBConverter creates a new UBBConverter.
func NewUBBConverter() *UBBConverter {
	return &UBBConverter{}
}

// Patterns for the tags that need more than a plain [tag]...[/tag] rewrite.
// Attribute values use restrictive character classes so that they cannot run
// past the closing bracket, inject extra CSS, or carry a javascript: URL.
var (
	codeRe        = regexp.MustCompile(`(?s)\[code\](.*?)\[/code\]`)
	placeholderRe = regexp.MustCompile(`\x00(\d+)\x00`)
	urlRe         = regexp.MustCompile(`(?s)\[url=(https?://[^\]\s]+)\](.*?)\[/url\]`)
	imgRe         = regexp.MustCompile(`\[img\](https?://[^\[\]\s]+)\[/img\]`)
	// Assumed dialect: [img=alt text]URL[/img]. Other dialects differ.
	imgAltRe    = regexp.MustCompile(`\[img=([^\]]*)\](https?://[^\[\]\s]+)\[/img\]`)
	quoteCiteRe = regexp.MustCompile(`(?s)\[quote=([^\]]*)\](.*?)\[/quote\]`)
	colorRe     = regexp.MustCompile(`(?s)\[color=(#[0-9a-fA-F]{3,6}|[a-zA-Z]+)\](.*?)\[/color\]`)
	sizeRe      = regexp.MustCompile(`(?s)\[size=(\d{1,3}(?:px|pt|em|%))\](.*?)\[/size\]`)
	listItemRe  = regexp.MustCompile(`\[\*\][ \t]*([^\n]*)`)
)

// convertTag rewrites every [tag]content[/tag] pair in s as
// <htmlTag>content</htmlTag>. s must already be HTML-escaped. The pattern is
// compiled on each call to keep the example short; cache it on a hot path.
func (c *UBBConverter) convertTag(s, tag, htmlTag string) string {
	t := regexp.QuoteMeta(tag)
	re := regexp.MustCompile(`(?s)\[` + t + `\](.*?)\[/` + t + `\]`)
	return re.ReplaceAllString(s, "<"+htmlTag+">${1}</"+htmlTag+">")
}

// UBBToHTML converts UBB markup to HTML.
// It is still regex-based and does not pair nested tags of the same kind
// correctly. For a production-ready solution, a dedicated parser is recommended.
func (c *UBBConverter) UBBToHTML(ubb string) string {
	// Normalize line endings and drop NUL bytes, which are reserved for the
	// code-block placeholders below.
	ubb = strings.ReplaceAll(ubb, "\r\n", "\n")
	ubb = strings.ReplaceAll(ubb, "\x00", "")

	// Escape everything up front to prevent XSS. Square brackets are not
	// affected, so the UBB tags remain recognizable.
	ubb = html.EscapeString(ubb)

	// Code blocks: move the (already escaped) content aside and leave a
	// placeholder, so that later rules do not convert tags inside the code.
	var blocks []string
	ubb = codeRe.ReplaceAllStringFunc(ubb, func(match string) string {
		parts := codeRe.FindStringSubmatch(match)
		blocks = append(blocks, "<pre><code>"+parts[1]+"</code></pre>")
		return "\x00" + strconv.Itoa(len(blocks)-1) + "\x00"
	})

	// Simple paired tags: bold, italic, underline, strikethrough, plain quote.
	for _, p := range [][2]string{{"b", "strong"}, {"i", "em"}, {"u", "u"}, {"s", "del"}, {"quote", "blockquote"}} {
		ubb = c.convertTag(ubb, p[0], p[1])
	}

	// Tags with attributes.
	ubb = urlRe.ReplaceAllString(ubb, `<a href="$1">$2</a>`)
	ubb = imgRe.ReplaceAllString(ubb, `<img src="$1" alt="">`)
	ubb = imgAltRe.ReplaceAllString(ubb, `<img src="$2" alt="$1">`)
	ubb = quoteCiteRe.ReplaceAllString(ubb, `<blockquote><cite>$1</cite><br>$2</blockquote>`)
	ubb = colorRe.ReplaceAllString(ubb, `<span style="color: $1;">$2</span>`)
	ubb = sizeRe.ReplaceAllString(ubb, `<span style="font-size: $1;">$2</span>`)

	// List items are emitted as bare <li> elements. Grouping them into <ul>
	// or <ol> requires tracking which list is open, which is left out here.
	ubb = listItemRe.ReplaceAllString(ubb, `<li>$1</li>`)

	// Newlines are not converted to <br> in this example. If you need that,
	// do it at this point, while the code blocks are still placeholders.

	// Restore the code blocks.
	ubb = placeholderRe.ReplaceAllStringFunc(ubb, func(match string) string {
		i, err := strconv.Atoi(strings.Trim(match, "\x00"))
		if err != nil || i >= len(blocks) {
			return ""
		}
		return blocks[i]
	})
	return ubb
}
func main() {
	converter := NewUBBConverter()
	ubbInput := `
Hello, this is a [b]bold[/b] and [i]italic[/i] text.
This text is [u]underlined[/u] and [s]deleted[/s].
Check out this link: [url=https://www.example.com]Example Website[/url]
An image: [img]https://www.example.com/image.png[/img]
Image with alt: [img=My Alt Text]https://www.example.com/image2.jpg[/img]
Here is a quote:
[quote]
This is some quoted text.
[/quote]
Quote from someone:
[quote=John Doe]
This is a quote from John Doe.
[/quote]
A code block:
[code]
func main() {
	fmt.Println("Hello, UBB!")
}
[/code]
Colored text: [color=blue]This is blue text[/color]
Sized text: [size=20px]This is large text[/size]
A list:
[*] Item 1
[*] Item 2
[*] Item 3
[b]Nested UBB:[/b] [i]This is [b][u]bold and underlined[/u][/b] inside italic.[/i]
[color=red]Important: [url=http://malicious.com]Click here[/url] to visit a bad site.[/color]
[code]
// This is code with < and > characters.
// And it should remain as is.
if x > 0 && y < 10 {
	fmt.Println("Code block content")
}
[/code]
`
	htmlOutput := converter.UBBToHTML(ubbInput)
	fmt.Println(htmlOutput)
}
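Notes and best practices:
1. Security (XSS protection):
* Escape content: this is the most important point! All text inside UBB tags, as well as the plain text between them, must be escaped with html.EscapeString() so that users cannot inject malicious JavaScript (XSS).
* Attribute values: URLs, color values, size values, and other attribute values should also be escaped or validated appropriately. For example, URLs should allow only the http and https protocols, which blocks the javascript: pseudo-protocol.
* Blacklist/whitelist: for finer control, you can define a whitelist of allowed HTML tags and attributes, or a blacklist of forbidden ones.
* Code blocks: the content of a [code] tag must be escaped with html.EscapeString() like everything else. <pre><code> only preserves whitespace; it does not stop the browser from parsing markup, so an unescaped <script> inside a code block would still run. Escaping is also what makes characters such as < and > display literally, so it does not damage the code's appearance.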
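2. Nesting:
* UBB tags are often nested (for example [b]This is [i]bold and italic[/i][/b]). Your parser must handle this nesting correctly and produce a correct HTML structure. Simple regular expressions have great difficulty with deep nesting.
* Processing order: when tags are converted in sequential passes, the order of the passes affects the result. For example, [code] should be handled before the other tags, otherwise tags that appear inside a code block are converted as well.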
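3. Newlines:
* UBB converters usually turn newline characters (\n) into HTML <br> tags so that line breaks show up on the web page.
* Mind the code blocks: when converting newlines, make sure not to convert those inside [code] tags, otherwise the formatting of the code is broken.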
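4. Supported tags:
* UBB has many dialects, each with a different tag set. Be explicit about which tags your parser supports (for example [b], [i], [url], [img], [quote], [code], [*], [color], [size]).
5. Performance:
* For large amounts of text, parsing performance matters too. Overly complex regular expressions or inefficient string operations can hurt performance.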
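Returning to the attribute-value point under item 1: a minimal sketch of URL validation with the standard `net/url` package. The helper names `safeURL` and `renderLink` are hypothetical, and the policy (absolute http or https URLs only, plain text as the fallback) is an assumption you may want to adjust.

```go
package main

import (
	"fmt"
	"html"
	"net/url"
	"strings"
)

// safeURL reports whether raw is an absolute http or https URL.
func safeURL(raw string) bool {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return false
	}
	return (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

// renderLink builds an <a> element for a safe URL and falls back to the
// escaped link text alone when the URL is rejected.
func renderLink(rawURL, text string) string {
	if !safeURL(rawURL) {
		return html.EscapeString(text)
	}
	href := html.EscapeString(strings.TrimSpace(rawURL))
	return fmt.Sprintf(`<a href="%s">%s</a>`, href, html.EscapeString(text))
}

func main() {
	fmt.Println(renderLink("https://example.com/?a=1&b=2", "Example"))
	fmt.Println(renderLink("javascript:alert(1)", "Click here"))
}
```

Parsing the URL instead of matching on a string prefix means that mixed-case schemes such as `JavaScript:` are rejected as well, because `url.Parse` normalizes the scheme to lower case.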
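Why prefer a third-party library or a custom parser over pure regular expressions?
* Robustness: regular expressions struggle with complex nesting, edge cases, and ambiguity. As the UBB syntax grows, the expressions become extremely hard to read, maintain, and debug.
* Security: making sure that all user input is escaped correctly to prevent XSS attacks is easy to get wrong with a pure regex approach.
* Maintainability: a structured parser (even a state-machine-based one) is easier to understand and modify than one giant regular expression.
* Features: many UBB dialects support more complex lists, tables, emoticons, and so on, which are next to impossible to implement with regular expressions.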
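The robustness point can be illustrated with a small stack-based converter. This is a sketch under stated assumptions, not a complete parser: it supports only the attribute-less tags b, i, u, and s, and the names `simpleTags`, `tagRe`, and `convertNested` are hypothetical. Unknown or mismatched tags are emitted as literal text, and tags left open at the end are closed automatically.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// simpleTags maps the supported UBB tag names to HTML element names.
var simpleTags = map[string]string{"b": "strong", "i": "em", "u": "u", "s": "del"}

// tagRe matches [name] and [/name]; group 1 is the optional slash, group 2 the name.
var tagRe = regexp.MustCompile(`\[(/?)([a-z]+)\]`)

// convertNested converts nested simple tags while tracking open tags on a stack.
func convertNested(input string) string {
	var out strings.Builder
	var stack []string // names of the currently open tags, innermost last
	last := 0
	for _, m := range tagRe.FindAllStringSubmatchIndex(input, -1) {
		// Text before the tag is user content and gets escaped.
		out.WriteString(html.EscapeString(input[last:m[0]]))
		last = m[1]
		closing := m[2] != m[3] // group 1 is non-empty for a closing tag
		name := input[m[4]:m[5]]
		el, known := simpleTags[name]
		switch {
		case !known:
			out.WriteString(html.EscapeString(input[m[0]:m[1]]))
		case !closing:
			stack = append(stack, name)
			out.WriteString("<" + el + ">")
		case len(stack) > 0 && stack[len(stack)-1] == name:
			stack = stack[:len(stack)-1]
			out.WriteString("</" + el + ">")
		default:
			// Closing tag that does not match the innermost open tag.
			out.WriteString(html.EscapeString(input[m[0]:m[1]]))
		}
	}
	out.WriteString(html.EscapeString(input[last:]))
	// Close whatever is still open so that the output stays well-formed.
	for i := len(stack) - 1; i >= 0; i-- {
		out.WriteString("</" + simpleTags[stack[i]] + ">")
	}
	return out.String()
}

func main() {
	fmt.Println(convertNested("[b]bold [i]and italic[/i][/b] <script>"))
}
```

Because escaping happens exactly once, on the text between tags, nested tags cannot be double-escaped, and every emitted opening tag gets a matching close.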
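Summary:
For simple UBB tags, regular expressions are a quick starting point. For a robust, secure, and full-featured UBB-to-HTML converter, a more structured approach is strongly recommended, such as a custom iterative parser or state machine. If one is available, a mature third-party UBB parsing library is the most efficient and safest choice.
In the Go examples above, I combined regular-expression replacement with html.EscapeString so that tags are converted and content is escaped. That goes a step beyond a bare ReplaceAllString, but it is still limited by what regular expressions can express, especially for nested lists and newline handling.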
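For the newline limitation in particular, one workable approach is to split the generated HTML at `<pre>` elements and convert newlines only in the parts between them. A minimal sketch, assuming code blocks are rendered as `<pre>...</pre>` and are never nested; the helper name `nl2br` is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// preRe matches a whole <pre>...</pre> element, including the newlines inside it.
var preRe = regexp.MustCompile(`(?s)<pre>.*?</pre>`)

// nl2br converts newlines to <br> everywhere except inside <pre> elements.
func nl2br(htmlText string) string {
	var out strings.Builder
	last := 0
	for _, loc := range preRe.FindAllStringIndex(htmlText, -1) {
		out.WriteString(strings.ReplaceAll(htmlText[last:loc[0]], "\n", "<br>\n"))
		out.WriteString(htmlText[loc[0]:loc[1]]) // code block copied unchanged
		last = loc[1]
	}
	out.WriteString(strings.ReplaceAll(htmlText[last:], "\n", "<br>\n"))
	return out.String()
}

func main() {
	fmt.Println(nl2br("line 1\nline 2\n<pre><code>a\nb</code></pre>\nline 3"))
}
```

This step belongs at the very end of the conversion, after all UBB tags have been turned into HTML.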