Compare commits

26 Commits

Author SHA1 Message Date
Alexander Andreev 43909c2b29 Changelog updated with 0.5.1 changes. 2021-05-04 04:04:22 +04:00
Alexander Andreev acbfaefa9c Version changed to 0.5.1 in the Makefile. 2021-05-04 03:58:46 +04:00
Alexander Andreev 86ef44aa07 Version changed to 0.5.1. 2021-05-04 03:58:02 +04:00
Alexander Andreev 419fb2b673 Removed excessive hash comparison. Added a message for when a file cannot be retrieved. 2021-05-04 03:56:59 +04:00
Alexander Andreev 0287d3a132 Turned a string into an f-string. 2021-05-04 03:55:32 +04:00
Alexander Andreev 245e33f40d README updated. lolifox.cc removed. Option --skip-posts added. 2021-05-03 02:45:41 +04:00
Alexander Andreev e092c905b2 Makefile updated to version 0.5.0. 2021-05-03 02:44:37 +04:00
Alexander Andreev 90338073ed Updated CHANGELOG with version 0.5.0. 2021-05-03 02:44:19 +04:00
Alexander Andreev cdcc184de8 Lolifox removed. Development Status classifier changed to Alpha. Python 3.7 classifier left to represent the oldest supported version. 2021-05-03 02:43:49 +04:00
Alexander Andreev b335891097 Copyright, date, and version updated. 2021-05-03 02:41:32 +04:00
Alexander Andreev 1213cef776 Lolifox removed. Added skip_posts handling. 2021-05-03 02:40:57 +04:00
Alexander Andreev 78d4a62c17 IB parsers rewritten according to the fixed Parser class. 2021-05-03 02:40:21 +04:00
Alexander Andreev f3ef07af68 Rewrite of the Parser class because it was fucked up. Now there are no problems with inheritance, and its subclasses are more pleasant to write. ThreadNotFoundError now has a reason field. 2021-05-03 02:38:46 +04:00
Alexander Andreev 6373518dc3 Added order=True for FileInfo to make sure that the order of fields is preserved. 2021-05-03 02:36:17 +04:00
Alexander Andreev caf18a1bf0 Added option --skip-posts; messages now take just one line. 2021-05-03 02:35:31 +04:00
Alexander Andreev 751549f575 A new generalised class for all imageboards based on Tinyboard or having an identical API. 2021-05-03 02:34:38 +04:00
Alexander Andreev 38b5740d73 Removed the lolifox.cc parser because the board is dead. 2021-05-03 02:33:52 +04:00
Alexander Andreev 2f9d26427c Incrementing _files_downloaded now happens only when _progress_callback is set. Made super() calls argument-free. 2021-05-03 02:33:14 +04:00
Alexander Andreev e7cf2e7c4b Added a missing return True statement in _check_file. 2021-05-03 02:30:31 +04:00
Alexander Andreev 4f6f56ae7b Version in the Makefile changed to 0.4.1. 2021-04-28 02:50:38 +04:00
Alexander Andreev 503eb9959b Version updated to 0.4.1. 2021-04-28 02:49:59 +04:00
Alexander Andreev cb2e0d77f7 Changelog update for 0.4.1. 2021-04-28 02:49:26 +04:00
Alexander Andreev 93e442939a Dvach's stickers handling. 2021-04-28 02:48:36 +04:00
Alexander Andreev 6022c9929a Added HTTP and URL exceptions handling. 2021-04-28 02:47:41 +04:00
Alexander Andreev f79abcc310 Fixed licence in classifiers and added more program-related topics. 2020-11-25 03:37:24 +04:00
Alexander Andreev 9cdb510325 A little fix for README. 2020-11-25 03:36:31 +04:00
16 changed files with 280 additions and 285 deletions

View File

@@ -1,5 +1,38 @@
 # Changelog
+
+## 0.5.1 - 2021-05-04
+### Added
+- Message when a file cannot be retrieved.
+### Fixed
+- Removed excessive hash comparison when files have the same name;
+- A string was not made an f-string, so now it displays the reason why a
+thread wasn't found.
+
+## 0.5.0 - 2021-05-03
+### Added
+- The program now makes use of the skip_posts argument. Use CLI option
+`-S <number>` or `--skip-posts <number>` to set how many posts you want
+to skip.
+### Changed
+- Better, minified messages;
+- Fixed inheritance of `Scraper`'s subclasses; a sane rewrite allows easy
+future extension with far less repetition;
+- Added a general class `TinyboardLikeParser` that implements the post parser
+for all imageboards based on Tinyboard or having an identical JSON API. From
+now on all such generalisation classes will end with `*LikeParser`;
+- Changed `file_base_url` for 8kun.top.
+### Removed
+- Support for Lolifox, since it's gone.
+
+## 0.4.1 - 2020-12-08
+### Fixed
+- HTTPException from http.client and URLError from urllib.request
+are now handled;
+- 2ch.hk's stickers handling.
+
 ## 0.4.0 - 2020-11-18
 ### Added
 - For 2ch.hk check for if a file is a sticker was added;

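The `--skip-posts` option introduced in 0.5.0 is also reachable from the Python API, as the diffs below show. A minimal sketch, assuming the package layout from this changeset (the thread id is just the README's example):

    from scrapthechan.parsers import get_parser_by_site

    # Skip the first 100 posts, then list download URLs of the rest.
    parser = get_parser_by_site("4chan.org", "b", "1100500", 100)
    for f in parser.files:
        print(f.download_url)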
View File

@@ -1,7 +1,7 @@
 build: scrapthechan README.md setup.cfg
 	python setup.py sdist bdist_wheel
 install:
-	python -m pip install --upgrade dist/scrapthechan-0.4.0-py3-none-any.whl --user
+	python -m pip install --upgrade dist/scrapthechan-0.5.1-py3-none-any.whl --user
 uninstall:
 	# We change directory so pip uninstall will run, it'll fail otherwise.
 	@cd ~/

View File

@@ -24,12 +24,15 @@ separately. E.g. `4chan b 1100500`.
 `-o`, `--output-dir` -- output directory where all files will be dumped to.
 
-`--no-op` -- by default OP's post will be saved in a `!op.txt` file. This flag
-disables this behaviour. An exclamation mark `!` in a name is for so this file
-will be on the top of a directory listing.
+`-N`, `--no-op` -- by default OP's post will be saved in a `!op.txt` file. This
+flag disables this behaviour. An exclamation mark `!` in the name is so this
+file will be at the top of a directory listing.
 
-`-v`, `--version` prints the version of the program, and `-h`, `--help` prints
-help for a program.
+`-S <num>`, `--skip-posts <num>` -- skip the given number of posts.
+
+`-v`, `--version` prints the version of the program.
+
+`-h`, `--help` prints help for the program.
 
 # Supported imageboards
 
@@ -37,4 +40,8 @@ help for a program.
 - [lainchan.org](https://lainchan.org) since 0.1.0
 - [2ch.hk](https://2ch.hk) since 0.1.0
 - [8kun.top](https://8kun.top) since 0.2.2
-- [lolifox.cc](https://lolifox.cc) since 0.3.0
+
+# TODO
+
+- Sane rewrite of the program;
+- Thread watcher.

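For instance, a hypothetical invocation combining the README's example thread with the new option, skipping the first 50 posts:

    scrapthechan 4chan b 1100500 --skip-posts 50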
View File

@@ -1,8 +1,8 @@
-__date__ = "18 November 2020"
-__version__ = "0.4.0"
+__date__ = "4 May 2021"
+__version__ = "0.5.1"
 __author__ = "Alexander \"Arav\" Andreev"
 __email__ = "me@arav.top"
-__copyright__ = f"Copyright (c) 2020 {__author__} <{__email__}>"
+__copyright__ = f"Copyright (c) 2020,2021 {__author__} <{__email__}>"
 __license__ = \
 """This program is licensed under the terms of the MIT license.
 For a copy see COPYING file in a directory of the program, or

View File

@@ -3,7 +3,7 @@ from os import makedirs
 from os.path import join, exists
 from re import search
 from sys import argv
-from typing import List
+from typing import List, Optional
 
 from scrapthechan import VERSION
 from scrapthechan.parser import Parser, ThreadNotFoundError
@@ -15,17 +15,18 @@ from scrapthechan.scrapers.threadedscraper import ThreadedScraper
 
 __all__ = ["main"]
 
-USAGE = \
+USAGE: str = \
 f"""Usage: scrapthechan [OPTIONS] (URL | IMAGEBOARD BOARD THREAD)
 
 Options:
 \t-h,--help -- print this help and exit;
 \t-v,--version -- print program's version and exit;
 \t-o,--output-dir -- directory where to place scraped files. By default
 \t    following structure will be created in current directory:
 \t    <imageboard>/<board>/<thread>;
 \t-N,--no-op -- by default OP's post will be written in !op.txt file. This
 \t    option disables this behaviour;
+\t-S,--skip-posts <num> -- skip given number of posts.
 
 Arguments:
 \tURL -- URL of a thread;
@@ -37,15 +38,15 @@ Supported imageboards: {', '.join(SUPPORTED_IMAGEBOARDS)}.
 """
 
-def parse_common_arguments(args: str) -> dict:
+def parse_common_arguments(args: str) -> Optional[dict]:
     r = r"(?P<help>-h|--help)|(?P<version>-v|--version)"
     args = search(r, args)
     if not args is None:
         args = args.groupdict()
         return {
             "help": not args["help"] is None,
             "version": not args["version"] is None }
     return None
 
 def parse_arguments(args: str) -> dict:
     rlink = r"^(https?:\/\/)?(?P<site>[\w.-]+)[ \/](?P<board>\w+)(\S+)?[ \/](?P<thread>\w+)"
@@ -53,10 +54,12 @@ def parse_arguments(args: str) -> dict:
     if not link is None:
         link = link.groupdict()
     out_dir = search(r"(?=(-o|--output-dir) (?P<outdir>\S+))", args)
+    skip_posts = search(r"(?=(-S|--skip-posts) (?P<skip>\d+))", args)
     return {
         "site": None if link is None else link["site"],
         "board": None if link is None else link["board"],
         "thread": None if link is None else link["thread"],
+        "skip-posts": None if skip_posts is None else int(skip_posts.group('skip')),
         "no-op": not search(r"-N|--no-op", args) is None,
         "output-dir": None if out_dir is None \
             else out_dir.groupdict()["outdir"] }
@@ -82,17 +85,21 @@ def main() -> None:
         exit()
 
     try:
-        parser = get_parser_by_site(args["site"], args["board"], args["thread"])
+        if not args["skip-posts"] is None:
+            parser = get_parser_by_site(args["site"], args["board"],
+                args["thread"], args["skip-posts"])
+        else:
+            parser = get_parser_by_site(args["site"], args["board"],
+                args["thread"])
     except NotImplementedError as ex:
         print(f"{str(ex)}.")
         print(f"Supported image boards are {', '.join(SUPPORTED_IMAGEBOARDS)}")
         exit()
-    except ThreadNotFoundError:
-        print(f"Thread {args['site']}/{args['board']}/{args['thread']} " \
-            "is no longer exist.")
+    except ThreadNotFoundError as e:
+        print(f"Thread {args['site']}/{args['board']}/{args['thread']} " \
+            f"not found. Reason: {e.reason}")
         exit()
 
     files_count = len(parser.files)
 
     if not args["output-dir"] is None:
@@ -101,23 +108,22 @@ def main() -> None:
         save_dir = join(parser.imageboard, parser.board,
             parser.thread)
 
-    print(f"There are {files_count} files in " \
-        f"{args['site']}/{args['board']}/{args['thread']}." \
-        f"They will be saved in {save_dir}.")
+    print(f"{files_count} files in " \
+        f"{args['site']}/{args['board']}/{args['thread']}. " \
+        f"They're going to {save_dir}. ", end="")
 
     makedirs(save_dir, exist_ok=True)
 
     if not args["no-op"]:
-        print("Writing OP... ", end='')
         if parser.op is None:
-            print("No text's there.")
+            print("OP's empty.")
         elif not exists(join(save_dir, "!op.txt")):
             with open(join(save_dir, "!op.txt"), 'w', encoding='utf-8') as opf:
                 opf.write(f"{parser.op}\n")
-            print("Done.")
+            print("OP's written.")
         else:
-            print("Exists.")
+            print("OP exists.")
 
     scraper = ThreadedScraper(save_dir, parser.files, \

View File

@@ -5,7 +5,7 @@ from dataclasses import dataclass
 
 __all__ = ["FileInfo"]
 
-@dataclass(frozen=True)
+@dataclass(frozen=True, order=True)
 class FileInfo:
     """Stores information about a file.

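A note on `order=True`: field order in a dataclass is always preserved regardless of this flag; what `order=True` actually adds are comparison methods that compare instances field by field in declaration order. An illustrative standalone example (not the project's `FileInfo`):

    from dataclasses import dataclass

    @dataclass(frozen=True, order=True)
    class Pair:
        first: int
        second: int

    assert Pair(1, 2) < Pair(1, 3)  # compared like the tuple (1, 2) < (1, 3)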
View File

@@ -4,7 +4,7 @@ from itertools import chain
 from json import loads
 from re import findall, match
 from typing import List, Optional
-from urllib.request import urlopen, Request
+from urllib.request import urlopen, Request, HTTPError
 
 from scrapthechan import USER_AGENT
 from scrapthechan.fileinfo import FileInfo
@@ -14,7 +14,12 @@ __all__ = ["Parser", "ThreadNotFoundError"]
 
 class ThreadNotFoundError(Exception):
-    pass
+    def __init__(self, reason: str = ""):
+        self._reason = reason
+
+    @property
+    def reason(self) -> str:
+        return self._reason
 
 
 class Parser:
@@ -25,28 +30,42 @@ class Parser:
     Arguments:
     board -- is a name of a board on an image board;
-    thread -- is a name of a thread inside a board;
-    posts -- is a list of posts in form of dictionaries exported from a JSON;
+    thread -- is an id of a thread inside a board;
     skip_posts -- number of posts to skip.
 
     All the extracted files will be stored as the `FileInfo` objects."""
 
-    __url_thread_json: str = "https://example.org/{board}/{thread}.json"
-    __url_file_link: str = None
-
-    def __init__(self, board: str, thread: str, posts: List[dict],
+    def __init__(self, board: str, thread: str,
         skip_posts: Optional[int] = None) -> None:
-        self._board = board
-        self._thread = thread
-        self._op_post = posts[0]
-        if not skip_posts is None:
-            posts = posts[skip_posts:]
+        self._board: str = board
+        self._thread: str = thread
+        self._posts = self._extract_posts_list(self._get_json())
+        self._op_post: dict = self._posts[0]
+        self._posts = self._posts[skip_posts:] if not skip_posts is None else self._posts
         self._files = list(chain.from_iterable(filter(None, \
-            map(self._parse_post, posts))))
+            map(self._parse_post, self._posts))))
+
+    @property
+    def json_thread_url(self) -> str:
+        raise NotImplementedError
+
+    @property
+    def file_base_url(self) -> str:
+        raise NotImplementedError
+
+    @property
+    def subject_field(self) -> str:
+        return "sub"
+
+    @property
+    def comment_field(self) -> str:
+        return "com"
 
     @property
     def imageboard(self) -> str:
         """Returns image board's name."""
-        return NotImplementedError
+        raise NotImplementedError
 
     @property
     def board(self) -> str:
@@ -62,22 +81,40 @@ class Parser:
     def op(self) -> str:
         """Returns OP's post as combination of subject and comment separated
         by a new line."""
-        raise NotImplementedError
+        op = ""
+        if self.subject_field in self._op_post:
+            op = f"{self._op_post[self.subject_field]}\n"
+        if self.comment_field in self._op_post:
+            op += self._op_post[self.comment_field]
+        return op if not op == "" else None
 
     @property
     def files(self) -> List[FileInfo]:
         """Returns a list of retrieved files as `FileInfo` objects."""
         return self._files
 
-    def _get_json(self, thread_url: str) -> dict:
-        """Gets JSON version of a thread and converts it in a dictionary."""
+    def _extract_posts_list(self, lst: List) -> List[dict]:
+        """This method must be overridden in child classes where you specify
+        a path in a JSON document where posts are stored. E.g., on 4chan this is
+        ['posts'], and on 2ch.hk it's ['threads'][0]['posts']."""
+        return lst
+
+    def _get_json(self) -> dict:
+        """Retrieves a JSON representation of a thread and converts it in
+        a dictionary."""
         try:
+            thread_url = self.json_thread_url.format(board=self._board, \
+                thread=self._thread)
             req = Request(thread_url, headers={'User-Agent': USER_AGENT})
             with urlopen(req) as url:
                 return loads(url.read().decode('utf-8'))
-        except:
-            raise ThreadNotFoundError
+        except HTTPError as e:
+            raise ThreadNotFoundError(str(e))
+        except Exception as e:
+            raise e
 
-    def _parse_post(self, post: dict) -> List[FileInfo]:
-        """Parses a single post and extracts files into `FileInfo` object."""
+    def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
+        """Parses a single post and extracts files into `FileInfo` object.
+        Single object is wrapped in a list for convenient insertion into
+        a list."""
         raise NotImplementedError

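The reworked `Parser` is now a template: subclasses supply the thread and file URLs and say where the posts list lives inside the JSON document. A sketch of the relevant overrides for a hypothetical board that nests posts the way 2ch.hk does (example.org is the same placeholder host the old code used):

    from typing import List
    from scrapthechan.parser import Parser

    class NestedJsonParser(Parser):
        # Hypothetical board; only the overrides relevant here are shown.
        @property
        def json_thread_url(self) -> str:
            return "https://example.org/{board}/res/{thread}.json"

        def _extract_posts_list(self, lst: List) -> List[dict]:
            # The same path 2ch.hk uses in the diff below.
            return lst['threads'][0]['posts']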
View File

@@ -1,6 +1,6 @@
 """Here are defined the JSON parsers for imageboards."""
 
 from re import search
-from typing import List
+from typing import List, Optional
 
 from scrapthechan.parser import Parser
 
@@ -8,33 +8,31 @@ from scrapthechan.parser import Parser
 
 __all__ = ["SUPPORTED_IMAGEBOARDS", "get_parser_by_url", "get_parser_by_site"]
 
+URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
 
 SUPPORTED_IMAGEBOARDS: List[str] = ["4chan.org", "lainchan.org", "2ch.hk", \
-    "8kun.top", "lolifox.cc"]
+    "8kun.top"]
 
-def get_parser_by_url(url: str) -> Parser:
+def get_parser_by_url(url: str, skip_posts: Optional[int] = None) -> Parser:
     """Parses URL and extracts from it site name, board and thread.
     And then returns initialised Parser object for detected imageboard."""
-    URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
     site, board, thread = search(URLRX, url).groups()
-    return get_parser_by_site(site, board, thread)
+    return get_parser_by_site(site, board, thread, skip_posts)
 
-def get_parser_by_site(site: str, board: str, thread: str) -> Parser:
+def get_parser_by_site(site: str, board: str, thread: str,
+    skip_posts: Optional[int] = None) -> Parser:
     """Returns an initialised parser for `site` with `board` and `thread`."""
     if '4chan' in site:
         from .fourchan import FourChanParser
-        return FourChanParser(board, thread)
+        return FourChanParser(board, thread, skip_posts)
     elif 'lainchan' in site:
         from .lainchan import LainchanParser
-        return LainchanParser(board, thread)
+        return LainchanParser(board, thread, skip_posts)
     elif '2ch' in site:
         from .dvach import DvachParser
-        return DvachParser(board, thread)
+        return DvachParser(board, thread, skip_posts)
     elif '8kun' in site:
         from .eightkun import EightKunParser
-        return EightKunParser(board, thread)
-    elif 'lolifox' in site:
-        from .lolifox import LolifoxParser
-        return LolifoxParser(board, thread)
+        return EightKunParser(board, thread, skip_posts)
     else:
         raise NotImplementedError(f"Parser for {site} is not implemented")

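Combined with the new `reason` field on `ThreadNotFoundError`, a caller of this factory might look like the following sketch (the URL is illustrative):

    from scrapthechan.parser import ThreadNotFoundError
    from scrapthechan.parsers import get_parser_by_url

    try:
        parser = get_parser_by_url("https://boards.4chan.org/b/thread/1100500",
            skip_posts=10)
    except ThreadNotFoundError as e:
        print(f"Thread not found. Reason: {e.reason}")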
View File

@@ -10,41 +10,54 @@ __all__ = ["DvachParser"]
 
 class DvachParser(Parser):
     """JSON parser for 2ch.hk image board."""
 
-    __url_thread_json = "https://2ch.hk/{board}/res/{thread}.json"
-    __url_file_link = "https://2ch.hk"
-
     def __init__(self, board: str, thread: str,
         skip_posts: Optional[int] = None) -> None:
-        posts = self._get_json(self.__url_thread_json.format(board=board, \
-            thread=thread))['threads'][0]['posts']
-        super(DvachParser, self).__init__(board, thread, posts, skip_posts)
+        super().__init__(board, thread, skip_posts)
+
+    @property
+    def json_thread_url(self) -> str:
+        return "https://2ch.hk/{board}/res/{thread}.json"
+
+    @property
+    def file_base_url(self) -> str:
+        return "https://2ch.hk"
+
+    @property
+    def subject_field(self) -> str:
+        return "subject"
+
+    @property
+    def comment_field(self) -> str:
+        return "comment"
 
     @property
     def imageboard(self) -> str:
         return "2ch.hk"
 
-    @property
-    def op(self) -> Optional[str]:
-        op = ""
-        if 'subject' in self._op_post:
-            op = f"{self._op_post['subject']}\n"
-        if 'comment' in self._op_post:
-            op += self._op_post['comment']
-        return op if not op == "" else None
+    def _extract_posts_list(self, lst: List) -> List[dict]:
+        return lst['threads'][0]['posts']
 
     def _parse_post(self, post) -> Optional[List[FileInfo]]:
         if not 'files' in post: return None
         files = []
         for f in post['files']:
-            if 'sticker' in f:
-                continue
-            if match(r"^image\.\w+$", f['fullname']) is None:
-                fullname = f['fullname']
-            else:
-                fullname = f['name']
+            if not 'sticker' in f:
+                if match(r"^image\.\w+$", f['fullname']) is None:
+                    fullname = f['fullname']
+                else:
+                    fullname = f['name']
+            else:
+                fullname = f['name']
             # Here's same thing as 4chan. 2ch.hk also has md5 field, so it is
             # completely fine to hardcode `hash_algo`.
-            files.append(FileInfo(fullname, f['size'],
-                f"{self.__url_file_link}{f['path']}",
-                f['md5'], 'md5'))
+            if 'md5' in f:
+                files.append(FileInfo(fullname, f['size'],
+                    f"{self.file_base_url}{f['path']}",
+                    f['md5'], 'md5'))
+            else:
+                files.append(FileInfo(fullname, f['size'],
+                    f"{self.file_base_url}{f['path']}",
+                    None, None))
         return files

View File

@@ -1,63 +1,25 @@
-from re import match
-from typing import List, Optional
+from typing import Optional
 
-from scrapthechan.fileinfo import FileInfo
-from scrapthechan.parser import Parser
+from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
 
 __all__ = ["EightKunParser"]
 
 
-class EightKunParser(Parser):
+class EightKunParser(TinyboardLikeParser):
     """JSON parser for 8kun.top image board."""
 
-    __url_thread_json = "https://8kun.top/{board}/res/{thread}.json"
-    __url_file_link = "https://media.8kun.top/file_store/{filename}"
-
     def __init__(self, board: str, thread: str,
         skip_posts: Optional[int] = None) -> None:
-        posts = self._get_json(self.__url_thread_json.format(board=board, \
-            thread=thread))['posts']
-        super(EightKunParser, self).__init__(board, thread, posts, skip_posts)
+        super().__init__(board, thread, skip_posts)
 
     @property
     def imageboard(self) -> str:
         return "8kun.top"
 
     @property
-    def op(self) -> Optional[str]:
-        op = ""
-        if 'sub' in self._op_post:
-            op = f"{self._op_post['sub']}\n"
-        if 'com' in self._op_post:
-            op += self._op_post['com']
-        return op if not op == "" else None
+    def json_thread_url(self) -> str:
+        return "https://8kun.top/{board}/res/{thread}.json"
 
-    def _parse_post(self, post: dict) -> List[FileInfo]:
-        if not 'tim' in post: return None
-
-        dlfname = f"{post['tim']}{post['ext']}"
-
-        if "filename" in post:
-            if match(r"^image\.\w+$", post['filename']) is None:
-                filename = dlfname
-            else:
-                filename = f"{post['filename']}{post['ext']}"
-
-        files = []
-        files.append(FileInfo(filename, post['fsize'],
-            self.__url_file_link.format(board=self.board, filename=dlfname),
-            post['md5'], 'md5'))
-
-        if "extra_files" in post:
-            for f in post["extra_files"]:
-                dlfname = f"{f['tim']}{f['ext']}"
-                if "filename" in post:
-                    if match(r"^image\.\w+$", post['filename']) is None:
-                        filename = dlfname
-                    else:
-                        filename = f"{post['filename']}{post['ext']}"
-                dlurl = self.__url_file_link.format(board=self.board, \
-                    filename=dlfname)
-                files.append(FileInfo(filename, f['fsize'], \
-                    dlurl, f['md5'], 'md5'))
-        return files
+    @property
+    def file_base_url(self) -> str:
+        return "https://media.8kun.top/file_dl/{filename}"

View File

@@ -1,51 +1,25 @@
-from re import match
-from typing import List, Optional
+from typing import Optional
 
-from scrapthechan.fileinfo import FileInfo
-from scrapthechan.parser import Parser
+from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
 
 __all__ = ["FourChanParser"]
 
 
-class FourChanParser(Parser):
+class FourChanParser(TinyboardLikeParser):
     """JSON parser for 4chan.org image board."""
 
-    __url_thread_json = "https://a.4cdn.org/{board}/thread/{thread}.json"
-    __url_file_link = "https://i.4cdn.org/{board}/{filename}"
-
     def __init__(self, board: str, thread: str,
         skip_posts: Optional[int] = None) -> None:
-        posts = self._get_json(self.__url_thread_json.format(board=board, \
-            thread=thread))['posts']
-        super(FourChanParser, self).__init__(board, thread, posts, skip_posts)
+        super().__init__(board, thread, skip_posts)
 
     @property
     def imageboard(self) -> str:
         return "4chan.org"
 
     @property
-    def op(self) -> Optional[str]:
-        op = ""
-        if 'sub' in self._op_post:
-            op = f"{self._op_post['sub']}\n"
-        if 'com' in self._op_post:
-            op += self._op_post['com']
-        return op if not op == "" else None
+    def json_thread_url(self) -> str:
+        return "https://a.4cdn.org/{board}/thread/{thread}.json"
 
-    def _parse_post(self, post: dict) -> List[FileInfo]:
-        if not 'tim' in post: return None
-
-        dlfname = f"{post['tim']}{post['ext']}"
-
-        if "filename" in post:
-            if match(r"^image\.\w+$", post['filename']) is None:
-                filename = dlfname
-            else:
-                filename = f"{post['filename']}{post['ext']}"
-
-        # Hash algorithm is hardcoded since it is highly unlikely that it will
-        # be changed in foreseeable future. And if it'll change then this line
-        # will be necessarily updated anyway.
-        return [FileInfo(filename, post['fsize'],
-            self.__url_file_link.format(board=self.board, filename=dlfname),
-            post['md5'], 'md5')]
+    @property
+    def file_base_url(self) -> str:
+        return "https://i.4cdn.org/{board}/{filename}"

View File

@@ -1,66 +1,25 @@
-from re import match
-from typing import List, Optional
+from typing import Optional
 
-from scrapthechan.parser import Parser
-from scrapthechan.fileinfo import FileInfo
+from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
 
 __all__ = ["LainchanParser"]
 
 
-class LainchanParser(Parser):
-    """JSON parser for lainchan.org image board.
-    JSON structure is identical to 4chan.org's, so this parser is just
-    inherited from 4chan.org's parser and only needed things are redefined.
-    """
+class LainchanParser(TinyboardLikeParser):
+    """JSON parser for lainchan.org image board."""
 
-    __url_thread_json = "https://lainchan.org/{board}/res/{thread}.json"
-    __url_file_link = "https://lainchan.org/{board}/src/{filename}"
-
     def __init__(self, board: str, thread: str,
         skip_posts: Optional[int] = None) -> None:
-        posts = self._get_json(self.__url_thread_json.format(board=board, \
-            thread=thread))['posts']
-        super(LainchanParser, self).__init__(board, thread, posts, skip_posts)
+        super().__init__(board, thread, skip_posts)
 
     @property
     def imageboard(self) -> str:
         return "lainchan.org"
 
     @property
-    def op(self) -> Optional[str]:
-        op = ""
-        if 'sub' in self._op_post:
-            op = f"{self._op_post['sub']}\n"
-        if 'com' in self._op_post:
-            op += self._op_post['com']
-        return op if not op == "" else None
+    def json_thread_url(self) -> str:
+        return "https://lainchan.org/{board}/res/{thread}.json"
 
-    def _parse_post(self, post) -> List[FileInfo]:
-        if not 'tim' in post: return None
-
-        dlfname = f"{post['tim']}{post['ext']}"
-
-        if "filename" in post:
-            if match(r"^image\.\w+$", post['filename']) is None:
-                filename = dlfname
-            else:
-                filename = f"{post['filename']}{post['ext']}"
-
-        files = []
-        files.append(FileInfo(filename, post['fsize'],
-            self.__url_file_link.format(board=self.board, filename=dlfname),
-            post['md5'], 'md5'))
-
-        if "extra_files" in post:
-            for f in post["extra_files"]:
-                dlfname = f"{f['tim']}{f['ext']}"
-                if "filename" in post:
-                    if match(r"^image\.\w+$", post['filename']) is None:
-                        filename = dlfname
-                    else:
-                        filename = f"{post['filename']}{post['ext']}"
-                dlurl = self.__url_file_link.format(board=self.board, \
-                    filename=dlfname)
-                files.append(FileInfo(filename, f['fsize'], \
-                    dlurl, f['md5'], 'md5'))
-        return files
+    @property
+    def file_base_url(self) -> str:
+        return "https://lainchan.org/{board}/src/{filename}"

View File

@@ -4,37 +4,21 @@ from typing import List, Optional
 from scrapthechan.parser import Parser
 from scrapthechan.fileinfo import FileInfo
 
-__all__ = ["LolifoxParser"]
+__all__ = ["TinyboardLikeParser"]
 
 
-class LolifoxParser(Parser):
-    """JSON parser for lolifox.cc image board.
-    JSON structure is identical to lainchan.org.
-    """
+class TinyboardLikeParser(Parser):
+    """Base parser for imageboards that are based on Tinyboard, or have
+    similar JSON API."""
 
-    __url_thread_json = "https://lolifox.cc/{board}/res/{thread}.json"
-    __url_file_link = "https://lolifox.cc/{board}/src/{filename}"
-
     def __init__(self, board: str, thread: str,
         skip_posts: Optional[int] = None) -> None:
-        posts = self._get_json(self.__url_thread_json.format(board=board, \
-            thread=thread))['posts']
-        super(LolifoxParser, self).__init__(board, thread, posts, skip_posts)
+        super().__init__(board, thread, skip_posts)
 
-    @property
-    def imageboard(self) -> str:
-        return "lolifox.cc"
+    def _extract_posts_list(self, lst: List) -> List[dict]:
+        return lst['posts']
 
-    @property
-    def op(self) -> Optional[str]:
-        op = ""
-        if 'sub' in self._op_post:
-            op = f"{self._op_post['sub']}\n"
-        if 'com' in self._op_post:
-            op += self._op_post['com']
-        return op if not op == "" else None
-
-    def _parse_post(self, post) -> List[FileInfo]:
+    def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
         if not 'tim' in post: return None
 
         dlfname = f"{post['tim']}{post['ext']}"
@@ -46,8 +30,9 @@ class TinyboardLikeParser(Parser):
                 filename = f"{post['filename']}{post['ext']}"
 
         files = []
         files.append(FileInfo(filename, post['fsize'],
-            self.__url_file_link.format(board=self.board, filename=dlfname),
+            self.file_base_url.format(board=self.board, filename=dlfname),
             post['md5'], 'md5'))
 
         if "extra_files" in post:
@@ -58,8 +43,9 @@ class TinyboardLikeParser(Parser):
                     filename = dlfname
                 else:
                     filename = f"{post['filename']}{post['ext']}"
-                dlurl = self.__url_file_link.format(board=self.board, \
+                dlurl = self.file_base_url.format(board=self.board, \
                     filename=dlfname)
                 files.append(FileInfo(filename, f['fsize'], \
                     dlurl, f['md5'], 'md5'))
         return files

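With `TinyboardLikeParser` in place, supporting another Tinyboard-based board shrinks to three properties, as the 4chan/lainchan/8kun diffs above demonstrate. A sketch for a hypothetical board (the name and URLs are made up):

    from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser

    class ExampleChanParser(TinyboardLikeParser):
        @property
        def imageboard(self) -> str:
            return "examplechan.org"

        @property
        def json_thread_url(self) -> str:
            return "https://examplechan.org/{board}/res/{thread}.json"

        @property
        def file_base_url(self) -> str:
            return "https://examplechan.org/{board}/src/{filename}"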
View File

@@ -5,8 +5,9 @@ from os import remove, stat
 from os.path import exists, join, getsize
 import re
 from typing import List, Callable
-from urllib.request import urlretrieve, URLopener, HTTPError
+from urllib.request import urlretrieve, URLopener, HTTPError, URLError
 import hashlib
+from http.client import HTTPException
 
 from scrapthechan import USER_AGENT
 from scrapthechan.fileinfo import FileInfo
@@ -66,6 +67,8 @@ class Scraper:
     def _hash_file(self, filepath: str, hash_algorithm: str = "md5",
         blocksize: int = 1048576) -> (str, str):
         """Compute hash of a file."""
+        if hash_algorithm is None:
+            return None
         hash_func = hashlib.new(hash_algorithm)
         with open(filepath, 'rb') as f:
             buf = f.read(blocksize)
@@ -82,8 +85,10 @@ class Scraper:
         if not (f.size == computed_size \
             or f.size == round(computed_size / 1024)):
             return False
-        hexdig, dig = self._hash_file(filepath, f.hash_algorithm)
-        return f.hash_value == hexdig or f.hash_value == dig
+        if not f.hash_algorithm is None:
+            hexdig, dig = self._hash_file(filepath, f.hash_algorithm)
+            return f.hash_value == hexdig or f.hash_value == dig
+        return True
 
     def _download_file(self, f: FileInfo):
         """Download a single file."""
@@ -101,20 +106,32 @@ class Scraper:
             while retries > 0:
                 self._url_opener.retrieve(f.download_url, filepath)
                 if not self._check_file(f, filepath):
-                    print(filepath, f.size, f.hash_value)
                     remove(filepath)
                     retries -= 1
                 else:
                     break
+            if retries == 0:
+                print(f"Cannot retrieve {f.download_url}, {filepath}.")
+                return
             if is_same_filename:
-                f1_hexdig, f1_dig = self._hash_file(orig_filepath, f.hash_algorithm)
-                f2_hexdig, f2_dig = self._hash_file(filepath, f.hash_algorithm)
-                if f1_hexdig == f2_hexdig or f1_dig == f2_dig:
+                _, f1_dig = self._hash_file(orig_filepath, f.hash_algorithm)
+                _, f2_dig = self._hash_file(filepath, f.hash_algorithm)
+                if f1_dig == f2_dig:
                     remove(filepath)
+        except FileNotFoundError as e:
+            print("File Not Found", filepath)
         except HTTPError as e:
             print("HTTP Error", e.code, e.reason, f.download_url)
             if exists(filepath):
                 remove(filepath)
+        except HTTPException:
+            print("HTTP Exception for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except URLError as e:
+            print("URL Error for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
         except ConnectionResetError:
             print("Connection reset for", f.download_url)
             if exists(filepath):

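A note on `_check_file` above: the size is compared twice because some imageboards report a file's size in bytes and others in kibibytes, and the hash is only verified when a hash algorithm is known. The size test in isolation, as a standalone sketch:

    def size_matches(reported: int, computed: int) -> bool:
        # Accept either an exact byte count or a value rounded to KiB.
        return reported == computed or reported == round(computed / 1024)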
View File

@@ -7,25 +7,26 @@ from multiprocessing.pool import ThreadPool
 
 from scrapthechan.scraper import Scraper
 from scrapthechan.fileinfo import FileInfo
 
 
 __all__ = ["ThreadedScraper"]
 
 
 class ThreadedScraper(Scraper):
     def __init__(self, save_directory: str, files: List[FileInfo],
         download_progress_callback: Callable[[int], None] = None) -> None:
-        super(ThreadedScraper, self).__init__(save_directory, files,
-            download_progress_callback)
+        super().__init__(save_directory, files, download_progress_callback)
         self._files_downloaded = 0
         self._files_downloaded_mutex = Lock()
 
     def run(self):
         pool = ThreadPool(cpu_count() * 2)
         pool.map(self._thread_run, self._files)
         pool.close()
         pool.join()
 
     def _thread_run(self, f: FileInfo):
-        with self._files_downloaded_mutex:
-            self._files_downloaded += 1
-            if not self._progress_callback is None:
-                self._progress_callback(self._files_downloaded)
+        if not self._progress_callback is None:
+            with self._files_downloaded_mutex:
+                self._files_downloaded += 1
+                self._progress_callback(self._files_downloaded)
         self._download_file(f)

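After this fix, `_files_downloaded` is only counted when a progress callback was actually supplied. A hypothetical wiring, along the lines of what the CLI's main() constructs above:

    from scrapthechan.scrapers.threadedscraper import ThreadedScraper

    # `parser` and `save_dir` as produced in main() earlier in this changeset.
    files = parser.files
    scraper = ThreadedScraper(save_dir, files,
        lambda n: print(f"\r{n} of {len(files)}", end=""))
    scraper.run()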
View File

@@ -1,7 +1,7 @@
 [metadata]
 name = scrapthechan
 version = attr: scrapthechan.__version__
-description = Scrap the files posted in a thread on an imageboard.
+description = Scrap the files from the imageboards.
 long_description = file: README.md
 long_description_content_type = text/markdown
 author = Alexander "Arav" Andreev
@@ -14,18 +14,20 @@ keywords =
     2ch.hk
     lainchan.org
     8kun.top
-    lolifox.cc
 license = MIT
 license_file = COPYING
 classifiers =
-    Development Status :: 2 - Pre-Alpha
+    Development Status :: 3 - Alpha
     Environment :: Console
     Intended Audience :: End Users/Desktop
-    License :: Other/Proprietary License
+    License :: OSI Approved :: MIT License
     Natural Language :: English
    Operating System :: OS Independent
     Programming Language :: Python :: 3.7
-    Programming Language :: Python :: 3.8
+    Topic :: Communications :: BBS
+    Topic :: Internet :: WWW/HTTP
+    Topic :: Internet :: WWW/HTTP :: Dynamic Content :: Message Boards
+    Topic :: Text Processing
     Topic :: Utilities
 
 [options]