epub and comic management
I currently use two comic-management applications, komga and kavita; in recent updates both added support for epub with mixed text and images.
A comic is basically divided into two units: volumes (Volume) and chapters (Chapter). A volume corresponds to a collected book (tankobon), while chapters come from serialized magazines; one volume usually collects several chapters, and the chapter is the smallest unit of a comic.
komga and kavita use different strategies for extracting volume/chapter information and for ordering files.
kavita tries to parse the epub metadata, specifically calibre's extended metadata, and sorts files according to it.
komga, on the other hand, sorts by lexicographic order of file names. It also parses the epub metadata, formats the result, and appends it to the end of the file name before sorting lexicographically; as a result, komga easily produces a wrong order when file names are not well formed.
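A quick illustration of why unpadded names mis-sort under a lexicographic rule (a standalone sketch, not code from either project):

```python
# Unpadded chapter numbers sort wrong lexicographically: "Ch.10" comes before "Ch.2"
names = ["Ch.2.cbz", "Ch.10.cbz", "Ch.1.cbz"]
print(sorted(names))   # ['Ch.1.cbz', 'Ch.10.cbz', 'Ch.2.cbz']

# Zero-padding the numbers restores the intended order
padded = ["Ch.0002.cbz", "Ch.0010.cbz", "Ch.0001.cbz"]
print(sorted(padded))  # ['Ch.0001.cbz', 'Ch.0002.cbz', 'Ch.0010.cbz']
```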
epub structure
An epub is essentially a zip archive. A zip archive itself has an internal order: it is a simple sequence of entries, each a local file header (starting with the PK magic number) followed by that file's data, with a central directory index at the end. Whether each individual entry is compressed can be chosen per file, so zip is a compression format and an archive format in one.
Because many file formats are zip underneath, to make identification easy such formats are required to place a file named mimetype as the first entry in the archive, stored uncompressed, whose content is the concrete MIME type; for epub that is application/epub+zip.
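This layout is easy to check with Python's zipfile module. The sketch below builds a minimal epub-like archive in memory (file names and contents are invented for illustration) and verifies that the first entry is an uncompressed mimetype:

```python
import zipfile
from io import BytesIO

buf = BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    # the mimetype entry must come first and must be stored uncompressed
    z.writestr("mimetype", "application/epub+zip", zipfile.ZIP_STORED)
    z.writestr("META-INF/container.xml", "<container/>", zipfile.ZIP_DEFLATED)

with zipfile.ZipFile(BytesIO(buf.getvalue())) as z:
    first = z.infolist()[0]
    print(first.filename, first.compress_type == zipfile.ZIP_STORED)
    print(z.read("mimetype"))
```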
For epub as a whole, see the spec: https://www.w3.org/TR/epub-33/
An href inside an epub behaves like an HTML href, but with a virtual-filesystem twist: the whole archive is treated as a rootfs, whose root corresponds to the directory you would get by unzipping the epub. No number of .. segments can escape this rootfs. It can be implemented with the code below.
```python
import pathlib

def resolve_path_on_any_platform(root_path: str, rel_path: str) -> str:
    """Resolve rel_path against root_path inside the epub's virtual rootfs.

    Paths inside an epub are always POSIX-style; '..' clamps at the
    rootfs root instead of escaping it.
    """
    root = pathlib.PurePosixPath(root_path)
    rel = pathlib.PurePosixPath(rel_path)
    for p in rel.parts:
        if p == "..":
            # PurePosixPath('.').parent is '.', so we can never leave the root
            root = root.parent
        elif p != '.':
            root = root / p
    return root.as_posix()
```
Basic parsing approach
First read META-INF/container.xml and take its rootfiles entries (usually pointing to an .opf file); if there are several, the spec says to use the first one. The opf file (plain XML) records the book's metadata, the location of the navigation document, and the resource manifest, plus the spine section. "Spine" is meant literally, as in the spine of a book: it describes the relative order of all content (page) resources.
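The flow above can be sketched with only the standard library. The sketch builds a tiny two-page epub in memory first; all file names, ids, and contents here are invented for illustration:

```python
import zipfile
import xml.etree.ElementTree as ET
from io import BytesIO

CONTAINER = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

OPF = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="id">demo</dc:identifier>
    <dc:title>Demo</dc:title>
  </metadata>
  <manifest>
    <item id="p1" href="p1.xhtml" media-type="application/xhtml+xml"/>
    <item id="p2" href="p2.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="p2"/>
    <itemref idref="p1"/>
  </spine>
</package>"""

buf = BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("mimetype", "application/epub+zip", zipfile.ZIP_STORED)
    z.writestr("META-INF/container.xml", CONTAINER)
    z.writestr("OEBPS/content.opf", OPF)

CNS = "{urn:oasis:names:tc:opendocument:xmlns:container}"
ONS = "{http://www.idpf.org/2007/opf}"

with zipfile.ZipFile(BytesIO(buf.getvalue())) as z:
    container = ET.fromstring(z.read("META-INF/container.xml"))
    # the spec says to use the first rootfile
    opf_path = container.find(f".//{CNS}rootfile").attrib["full-path"]
    opf = ET.fromstring(z.read(opf_path))
    hrefs = {i.attrib["id"]: i.attrib["href"]
             for i in opf.find(f"{ONS}manifest")}
    reading_order = [hrefs[r.attrib["idref"]]
                     for r in opf.find(f"{ONS}spine")]

print(reading_order)  # ['p2.xhtml', 'p1.xhtml'], spine order, not manifest order
```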
cbz
cbz stands for comic book in a zip, and cbr for comic book in a rar. It is not a formal standard but a convention that grew out of practice. A cbz is just a zip archive full of image files; readers typically display the images in lexicographic order of their file names. Unlike epub, cbz focuses purely on comics (in fact many western comics circulate as cbz), which allows strong assumptions about it: every page is a single image file, and a page references no resource other than that file. Combined with zip's ability to read entries individually, this makes it possible to stream single page images for reading without decompressing the archive.
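A minimal sketch of that streaming idea with Python's zipfile: build a tiny cbz in memory, then read a single page without touching the rest of the archive (the file names and byte contents are invented stand-ins for real images):

```python
import zipfile
from io import BytesIO

# build a tiny fake cbz in memory with three "pages"
buf = BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    for name in ["001.jpg", "002.jpg", "003.jpg"]:
        z.writestr(name, b"fake image bytes for " + name.encode())

# read exactly one page without extracting the whole archive
with zipfile.ZipFile(BytesIO(buf.getvalue())) as z:
    pages = sorted(n for n in z.namelist()
                   if n.lower().endswith((".jpg", ".png", ".webp")))
    with z.open(pages[1]) as fp:   # stream only page 2
        data = fp.read()

print(pages[1], len(data))
```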
The format has one obvious problem: no comic metadata. One workaround is to encode metadata into the file name, but file names are easily changed. Alongside a now-defunct app (ComicRack), cbz also gained a defined metadata file: an entry named ComicInfo.xml inside the archive. See https://github.com/anansi-project/comicinfo
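As an illustration, a minimal ComicInfo.xml can be generated and embedded with the standard library. The field values below are made up, and only a few common fields are shown; see the anansi-project schema for the full list:

```python
import zipfile
import xml.etree.ElementTree as ET
from io import BytesIO

# build a minimal ComicInfo document (values are placeholders)
root = ET.Element("ComicInfo")
for tag, text in [("Series", "Example Series"), ("Volume", "1"),
                  ("Writer", "Some Author"), ("LanguageISO", "zh-CN")]:
    ET.SubElement(root, tag).text = text
xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# embed it alongside the pages of a cbz
buf = BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("ComicInfo.xml", xml_bytes)
    z.writestr("001.jpg", b"fake page bytes")

with zipfile.ZipFile(BytesIO(buf.getvalue())) as z:
    info = ET.fromstring(z.read("ComicInfo.xml"))
print(info.find("Series").text)
```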
webp and SSIM
For comics, lossless compression traditionally means png and lossy compression means jpg. Google later introduced a newer image format, webp, whose advantages are that its lossless mode produces smaller files than png, and its lossy mode is roughly 30% smaller than jpeg at the same quality.
For jpeg-to-webp conversion a quality setting of 75 is recommended; this is webp's sweet spot, cutting size by about 30% while keeping SSIM above 0.99.
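With Pillow the re-encode is a few lines. The sketch below generates a stand-in image in memory instead of reading a real jpeg page, and the exact size savings depend entirely on the source material:

```python
from io import BytesIO
from PIL import Image

# generate a sample RGB image in memory (stand-in for a real jpeg page)
src = Image.new("RGB", (640, 480), (200, 180, 160))
jpg = BytesIO()
src.save(jpg, format="JPEG", quality=90)

# re-encode at webp quality 75, the quality/size sweet spot mentioned above
img = Image.open(BytesIO(jpg.getvalue()))
out = BytesIO()
img.save(out, format="WEBP", quality=75)

print(len(jpg.getvalue()), len(out.getvalue()))
```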
A good web tool for computing SSIM: https://github.com/Kaciras/ICAnalyzer
OPDS, PSE and webpub
This section focuses on streaming, i.e. reading smoothly without downloading the whole book first. (For ordinary e-books nobody minds downloading the whole file, but comics are comparatively large, so streaming is quite attractive.)
First, there is an online-library protocol, OPDS, which defines a book catalog over HTTP. PSE (Page Streaming Extension) is an extension protocol on top of OPDS for streaming cbz-style files; it covers requesting individual pages, syncing reading progress, and so on. See https://github.com/anansi-project/opds-pse
Is there a way to stream epub as well? An epub is just a zip of html files, and those html files live in a virtual rootfs, so an epub can essentially be viewed as a small self-contained website; all that needs to be added is some shared css and js. The webpub project builds on this idea: it describes the entire epub with a generic manifest structure, while actual page rendering relies on the epub's own html.
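A webpub manifest is plain JSON. Below is a minimal sketch; the field names follow the readium/webpub-manifest spec, while the title and hrefs are invented for illustration:

```json
{
  "metadata": {
    "title": "Example Comic",
    "language": "zh"
  },
  "readingOrder": [
    { "href": "OEBPS/page-001.xhtml", "type": "application/xhtml+xml" },
    { "href": "OEBPS/page-002.xhtml", "type": "application/xhtml+xml" }
  ],
  "resources": [
    { "href": "OEBPS/style.css", "type": "text/css" }
  ]
}
```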
See https://github.com/readium/webpub-manifest; it can also be combined with OPDS.
The conversion script
The need: comics downloaded from moe.vol parse poorly in comic-management software, so they should be converted to cbz. On top of that, the images inside those epubs are not arranged in lexicographic order, so the epub must be parsed to recover the true order, and the files renamed accordingly.
1. Parses the epub's spine and names the images in spine order, so the page order is guaranteed correct.
2. Converts recursively: every epub in any subdirectory is converted to cbz.
3. Tries to generate ComicInfo metadata from the epub and its file name, to help kavita and komga index it.
4. Renames output to match komga's default lexicographic rule, preserving komga's index order as far as possible.
5. Converts images to webp, cutting size by roughly 20% at an SSIM of 0.999.
6. Adds a progress bar for friendliness.
Environment: python3, preferably a recent version. Third-party dependencies: pillow (webp conversion), tqdm (progress bar), ebookmeta (epub metadata parsing), ebooklib (epub parsing).
```python
# -*- coding: utf-8 -*-
# Usage: place this file in the same directory as the files to convert.
import sys, time
import zipfile
import os
import ebookmeta
import ebooklib
import tqdm
from ebooklib import epub
from io import BytesIO
from PIL import Image
import xml.etree.ElementTree as ET
import pathlib
import re
from typing import Tuple, Optional, List, Union


class PageInfo:
    def __init__(self, idx: int):
        self.image = idx
        self.Type = ""
        self.double_page = None
        self.image_size = ""
        self.key = ""
        self.book_mark = ""
        self.image_width = ""
        self.image_height = ""

    def to_xml_ele(self):
        ele = ET.Element("Page")
        ele.set("Image", str(self.image))
        if self.Type:
            ele.set("Type", self.Type)
        if self.double_page is True:
            ele.set("DoublePage", "true")
        elif self.double_page is False:
            ele.set("DoublePage", "false")
        if self.image_size:
            ele.set("ImageSize", self.image_size)
        if self.key:
            ele.set("Key", self.key)
        if self.book_mark:
            ele.set("Bookmark", self.book_mark)
        if self.image_width:
            ele.set("ImageWidth", self.image_width)
        if self.image_height:
            ele.set("ImageHeight", self.image_height)
        return ele


class ComicInfo:
    def __init__(self):
        self.series = ""
        self.series_sort = ""
        self.writer = ""
        self.publisher = ""
        self.title = ""
        self.number = ""
        self.volume = ""
        self.language_iso = "zh-CN"
        self.year = ""
        self.month = ""
        self.day = ""
        self.GTIN = ""
        self.tags = ""
        self.notes = ""
        self.summary = ""
        self.locations = ""
        self.pages = []

    def add_page(self, page: PageInfo):
        self.pages.append(page)

    def merge_with_epub_info(self, meta):
        if meta.identifier:
            self.GTIN = meta.identifier
        if len(meta.author_list):
            self.writer = ",".join(meta.author_list)
        if meta.series:
            self.series = meta.series
            self.series_sort = meta.series
        if meta.series_index:
            self.volume = str(int(float(meta.series_index)))
        if len(meta.tag_list):
            self.tags = ",".join(meta.tag_list)
        if meta.description:
            self.summary = meta.description
        if meta.lang:
            self.language_iso = meta.lang
        if meta.title:
            self.title = meta.title
        self.notes = str(meta)
        pub_info = meta.publish_info
        if pub_info.title:
            self.title = pub_info.title
        if pub_info.publisher:
            self.publisher = pub_info.publisher
        if pub_info.year:
            self.year = pub_info.year
        if pub_info.city:
            self.locations = pub_info.city
        if pub_info.series:
            self.series = pub_info.series
        if pub_info.series_index:
            self.volume = str(int(float(pub_info.series_index)))
        if pub_info.isbn:
            self.GTIN = pub_info.isbn

    def merge_with_name_info(self, series, vol, chapter, publisher):
        if series:
            self.series = series
            self.series_sort = series
        if vol:
            self.volume = str(vol)
        if chapter:
            self.number = chapter
        if publisher:
            self.publisher = publisher

    def build_comic_info_xml(self):
        try:
            root = ET.Element("ComicInfo")
            root.attrib["xmlns:xsi"] = "https://www.w3.org/2001/XMLSchema-instance"
            root.attrib["xmlns:xsd"] = "https://www.w3.org/2001/XMLSchema"

            def assign(cix_entry: str, md_entry: Optional[Union[str, int]]) -> None:
                if md_entry is not None and md_entry:
                    et_entry = root.find(cix_entry)
                    if et_entry is None:
                        et_entry = ET.SubElement(root, cix_entry)
                    et_entry.text = str(md_entry)
                else:
                    et_entry = root.find(cix_entry)
                    if et_entry is not None:
                        root.remove(et_entry)

            assign("Title", self.title)
            assign("Series", self.series)
            assign("SeriesSort", self.series_sort)
            assign("Writer", self.writer)
            assign("Publisher", self.publisher)
            assign("Number", self.number)
            assign("Volume", self.volume)
            assign("LanguageISO", self.language_iso)
            assign("Year", self.year)
            assign("Month", self.month)
            assign("Day", self.day)
            assign("GTIN", self.GTIN)
            assign("Tags", self.tags)
            assign("Notes", self.notes)
            assign("Summary", self.summary)
            assign("Locations", self.locations)
            if len(self.pages):
                pages_node = root.find("Pages")
                if pages_node is not None:
                    pages_node.clear()
                else:
                    pages_node = ET.SubElement(root, "Pages")
                for p in self.pages:
                    pages_node.append(p.to_xml_ele())
            ET.indent(root)
            tree = ET.ElementTree(root)
            return True, ET.tostring(tree.getroot(), encoding="utf-8",
                                     xml_declaration=True).decode(), ""
        except Exception as e:
            m = f"convert comic info xml failed with {e}"
            print(m)
            return False, "", m


# group_index tuples are (vol, chapter, series, publisher, subname); -1 means absent.
# name_Vol.01_Ch.001-002_[publisher].epub
VOL_CH_RE_PAIR = (re.compile(r"([^_]+)_Vol\.(\d+)_Ch\.([^_]+)_\[([^\]]+)\]\."),
                  (2, 3, 1, 4, -1))
# name_Vol.01_[publisher].epub
VOL_RE_PAIR = (re.compile(r"([^_]+)_Vol\.(\d+)_\[([^\]]+)\]\."),
               (2, -1, 1, 3, -1))
# [publisher][series]sub_name第01卷.kepub.epub
MOE_SUBNAME_RE = (re.compile(r"\[([^\[]+)\](\[[^\[]+\])(.+)第(\d+)卷"),
                  (4, -1, 2, 1, 3))
# [publisher][series]第01卷.kepub.epub
MOE_VOL_RE_PAIR = (re.compile(r"\[([^\[]+)\]\[([^\[]+)\](.+)第(\d+)卷"),
                   (4, -1, 2, 1, 3))
# [publisher][series]話01-002.kepub.epub
MOE_CH_RE_PAIR = (re.compile(r"\[([^\[]+)\]\[([^\[]+)\]話([\d-]+)"),
                  (-1, 3, 2, 1, -1))

NAME_RULE = [
    VOL_CH_RE_PAIR,
    VOL_RE_PAIR,
    MOE_CH_RE_PAIR,
    MOE_SUBNAME_RE,
    MOE_VOL_RE_PAIR,
]


class Converter:
    def __init__(self):
        self.error_msg = ""

    def produce_metadata_and_name(self, path) -> Tuple[ComicInfo, str]:
        cm = ComicInfo()
        obj_path = pathlib.Path(path)
        name = str(obj_path.name)
        res = False
        for rules in NAME_RULE:
            res, vol, ch, series, publisher = self.extract_base_info_from_name(name, rules)
            if res:
                cm.merge_with_name_info(series, vol, ch, publisher)
                break
        if res is False:
            m = f"filename {path} not support"
            self.error_msg += m + "\n"
            print(m)
        try:
            metadata = ebookmeta.get_metadata(path)
            cm.merge_with_epub_info(metadata)
        except Exception as e:
            m = f"parse metadata from epub failed with {e}"
            self.error_msg += m + "\n"
            print(m)
        if res:
            _, name = self.produce_new_name(series, vol, ch, publisher)
        else:
            name = ""
        return cm, name

    def convert_to_webp(self, img_bytes) -> Tuple[bool, bytes, Tuple[int, int]]:
        try:
            img = Image.open(BytesIO(img_bytes))
            out = BytesIO()
            img.save(out, format="webp", quality=80)
            # lossless alternative:
            # img.save(out, format="webp", lossless=True, quality=100, method=6)
            return True, out.getvalue(), img.size
        except Exception as e:
            m = f"convert to webp failed with {e}"
            self.error_msg += m + "\n"
            print(m)
            return False, img_bytes, (-1, -1)

    def extract_base_info_from_name(self, name, re_pair) -> Tuple[bool, int, str, str, str]:
        # returns (ok, vol, chapter, series, publisher); unknown fields become "" or 1000
        pattern, group_index = re_pair
        if len(group_index) != 5:
            return False, 1, "", "", ""
        res = pattern.search(name)
        if not res:
            return False, 1, "", "", ""
        try:
            vol = 1000
            chapter = ""
            series = ""
            publisher = ""
            vol_idx, chapter_idx, series_idx, publisher_idx, sub_name_idx = group_index
            if vol_idx != -1:
                vol = int(float(res.group(vol_idx)))
            if chapter_idx != -1:
                chapter = res.group(chapter_idx)
            if series_idx != -1:
                series = res.group(series_idx)
            if publisher_idx != -1:
                publisher = res.group(publisher_idx)
            if sub_name_idx != -1:
                sub_name = res.group(sub_name_idx)
                if sub_name:
                    series = f"{series}_{sub_name}"
            return True, vol, chapter, series, publisher
        except Exception as e:
            m = f"extract info from {name} use {pattern.pattern} failed for {e}"
            self.error_msg += m + "\n"
            print(m)
            return False, 1, "", "", ""

    def produce_new_name(self, series, vol: int, chapter: str, publisher) -> Tuple[bool, str]:
        # vol and chapter are zero-padded to 4 digits
        try:
            if not publisher:
                publisher = "ericma"
            if "-" in chapter:
                chapter = "-".join(f"{int(float(i)):04}" for i in chapter.split("-"))
            elif chapter:
                chapter = f"{int(float(chapter)):04}"
            if chapter:
                return True, f"{series}_[{publisher}]_Vol.{vol:04}_Ch.{chapter}.cbz"
            return True, f"{series}_[{publisher}]_Vol.{vol:04}.cbz"
        except Exception as e:
            m = f"build name on ({series}, {vol}, {chapter}, {publisher}) failed for {e}"
            self.error_msg += m + "\n"
            print(m)
            return False, ""

    def resolve_path_on_any_platform(self, root_path, rel_path):
        root = pathlib.PurePosixPath(root_path)
        rel_path = pathlib.PurePosixPath(rel_path)
        for p in rel_path.parts:
            if p == "..":
                root = root.parent
            elif p != '.':
                root = root / p
        return root.as_posix()

    def process(self, path):
        new_name = None
        try:
            print(f"process {path}")
            self.error_msg = ""
            cm, new_name = self.produce_metadata_and_name(path)
            old_name = pathlib.Path(path).name
            if not new_name:
                new_name = path.replace(".epub", ".cbz")
            else:
                new_name = path.replace(old_name, new_name)
            if os.path.exists(new_name):
                print(f"cbz {new_name} already exists")
                return True, ""
            with zipfile.ZipFile(new_name, 'w') as zwrite:
                ebook = ebooklib.epub.read_epub(path, options={"ignore_ncx": True})
                # walk the spine to recover the true page order
                idx = 1
                img_list = []
                for ref_id, is_show in ebook.spine:
                    page = ebook.get_item_with_id(ref_id)
                    if isinstance(page, ebooklib.epub.EpubHtml):
                        xml_content = page.content
                        root_path = str(pathlib.PurePosixPath(page.file_name).parent)
                        ele = ET.fromstring(xml_content)
                        for item in ele.findall(".//"):
                            if "img" in item.tag and "src" in item.attrib:
                                src = item.attrib["src"]
                                abs_path = self.resolve_path_on_any_platform(root_path, src)
                                img_list.append((idx, abs_path, ref_id, item.attrib))
                                idx += 1
                paddinglen = len(str(len(img_list)))
                for idx, abs_path, ref_id, attr_dict in tqdm.tqdm(img_list):
                    try:
                        img_block = ebook.get_item_with_href(abs_path)
                        s = pathlib.Path(abs_path).suffix
                        if s in {".jpg", ".png", ".jpeg"} or \
                                img_block.media_type in {"image/jpeg", "image/png"}:
                            res, img_d, shape = self.convert_to_webp(img_block.content)
                            if res:
                                newname = f"{str(idx).rjust(paddinglen, '0')}-{ref_id}.webp"
                            else:
                                newname = f"{str(idx).rjust(paddinglen, '0')}-{ref_id}{s}"
                            page = PageInfo(idx)
                            if attr_dict.get("class") == "singlePage":
                                page.double_page = False
                            elif attr_dict.get("class") == "twoPage":
                                page.double_page = True
                            page.image_size = str(len(img_d))
                            page.key = ref_id
                            page.image_width = str(shape[0])
                            page.image_height = str(shape[1])
                            cm.add_page(page)
                            zwrite.writestr(newname, img_d)
                    except Exception as e:
                        m = f"process image on {ref_id} name {abs_path} failed with {e}"
                        self.error_msg += m + "\n"
                        if new_name and os.path.exists(new_name):
                            os.remove(new_name)
                        return False, self.error_msg
                res, data, msg = cm.build_comic_info_xml()
                if msg:
                    self.error_msg += msg + "\n"
                if data:
                    zwrite.writestr("ComicInfo.xml", data, zipfile.ZIP_DEFLATED)
            return True, self.error_msg
        except Exception as e:
            m = f"process {path} failed with {e}"
            self.error_msg += m + '\n'
            print(e)
            if new_name and os.path.exists(new_name):
                os.remove(new_name)
            return False, self.error_msg


if __name__ == '__main__':
    c = Converter()
    now = os.getcwd()

    def fn(file_dir):
        # walk the current directory and all subdirectories for epub files
        for root, dirs, files in os.walk(file_dir):
            for f in files:
                if os.path.splitext(f)[1] == '.epub':
                    yield os.path.relpath(os.path.join(root, f), now)

    res_warning_dict = dict()
    res_failed_dict = dict()
    for filename in fn(now):
        res, msg = c.process(filename)
        if res:
            print(f"process {filename} succeed")
            if msg:
                res_warning_dict[filename] = msg
        else:
            print(f"process {filename} failed")
            res_failed_dict[filename] = msg
    print("==============below is convert with some warning ==============")
    for k, v in res_warning_dict.items():
        print(f"> {k}\n {v}\n ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
    print("==============below is convert failed ==============")
    for k, v in res_failed_dict.items():
        print(f"> {k}\n {v}\n ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
```
ui
```python
import sys
import os

# Imports reconstructed: the original snippet omitted them. QtW is assumed
# to be PySide6's QtWidgets (QApplication() with no argv and app.exec()
# match the PySide6 API).
from PySide6 import QtWidgets as QtW


def main_app(now_dir: str):
    c = Converter()

    def get_epub_files(file_dir):
        # walk now_dir and all subdirectories for epub files
        for root, dirs, files in os.walk(file_dir):
            for _file in files:
                if os.path.splitext(_file)[1] == '.epub':
                    yield os.path.relpath(os.path.join(root, _file), now_dir)

    res_warning_dict = dict()
    res_failed_dict = dict()
    for filename in get_epub_files(now_dir):
        res, msg = c.process(filename)
        if res:
            print(f"process {filename} succeed")
            if msg:
                res_warning_dict[filename] = msg
        else:
            print(f"process {filename} failed")
            res_failed_dict[filename] = msg
    return (
        '\n'.join([f"【{_f}】:{_m}" for _f, _m in res_warning_dict.items()]),
        '\n'.join([f"【{_f}】:{_m}" for _f, _m in res_failed_dict.items()])
    )


class AppWidgets(QtW.QWidget):
    def __init__(self, parent=None):
        super().__init__(parent)
        self.setLayout(QtW.QVBoxLayout())
        # button that starts the conversion
        self.button_load = QtW.QPushButton('选择目录')
        self.button_load.clicked.connect(self.trans_epub)
        # result display box
        self.text_result = QtW.QTextBrowser()
        self.text_result.setText('执行结果显示在此处')
        self.setup_layout()

    def layout(self) -> QtW.QVBoxLayout:
        return super().layout()

    def setup_layout(self):
        self.layout().addWidget(self.text_result)
        self.layout().addWidget(QtW.QLabel('选择文件夹进行转换操作'))
        self.layout().addWidget(self.button_load)

    def trans_epub(self):
        choose_dir = QtW.QFileDialog.getExistingDirectory()
        if not choose_dir:
            return
        suc_list, fail_list = main_app(choose_dir)
        self.text_result.setText(f"{suc_list}\n\n{fail_list}")


class MainWindow(QtW.QMainWindow):
    def __init__(self, parent=None):
        super().__init__(parent)
        self.setWindowTitle('EPUB')
        self.setMinimumSize(450, 350)
        self.setCentralWidget(AppWidgets())


if __name__ == '__main__':
    app = QtW.QApplication()
    main_win = MainWindow()
    main_win.show()
    sys.exit(app.exec())
```