
Compare commits


52 Commits

Author SHA1 Message Date
Alexander Andreev 43909c2b29 Changelog updated with 0.5.1 changes. 2021-05-04 04:04:22 +04:00
Alexander Andreev acbfaefa9c Version changed to 0.5.1 in a Makefile. 2021-05-04 03:58:46 +04:00
Alexander Andreev 86ef44aa07 Version changed to 0.5.1. 2021-05-04 03:58:02 +04:00
Alexander Andreev 419fb2b673 Removed excessive comparison of hash. Added message when file cannot be retrieved. 2021-05-04 03:56:59 +04:00
Alexander Andreev 0287d3a132 Turned a string into an f-string. 2021-05-04 03:55:32 +04:00
Alexander Andreev 245e33f40d README updated. lolifox.cc removed. Option --skip-posts added. 2021-05-03 02:45:41 +04:00
Alexander Andreev e092c905b2 Makefile updated to version 0.5.0. 2021-05-03 02:44:37 +04:00
Alexander Andreev 90338073ed Updated CHANGELOG with version 0.5.0. 2021-05-03 02:44:19 +04:00
Alexander Andreev cdcc184de8 Lolifox removed. Development Status classifier is changed to Alpha. Python 3.7 classifier left to represent oldest supported version. 2021-05-03 02:43:49 +04:00
Alexander Andreev b335891097 Copyright, date, and version are updated. 2021-05-03 02:41:32 +04:00
Alexander Andreev 1213cef776 Lolifox removed. Added skip_posts handling. 2021-05-03 02:40:57 +04:00
Alexander Andreev 78d4a62c17 IB parsers rewritten according to the fixed Parser class. 2021-05-03 02:40:21 +04:00
Alexander Andreev f3ef07af68 Rewrite of Parser class because it was fucked up. Now there are no problems with inheritance, and its subclasses are now more pleasant to write. ThreadNotFoundError now has a reason field. 2021-05-03 02:38:46 +04:00
Alexander Andreev 6373518dc3 Added order=True for FileInfo to make sure that order of fields is preserved. 2021-05-03 02:36:17 +04:00
Alexander Andreev caf18a1bf0 Added option --skip-posts, and messages now take just one line. 2021-05-03 02:35:31 +04:00
Alexander Andreev 751549f575 A new generalised class for all imageboards based on Tinyboard or having an identical API. 2021-05-03 02:34:38 +04:00
Alexander Andreev 38b5740d73 Removing lolifox.cc parser because this board is dead. 2021-05-03 02:33:52 +04:00
Alexander Andreev 2f9d26427c Now incrementing _files_downloaded happens only when _progress_callback is set. And made super() with no args. 2021-05-03 02:33:14 +04:00
Alexander Andreev e7cf2e7c4b Added a missing return True statement in _check_file. 2021-05-03 02:30:31 +04:00
Alexander Andreev 4f6f56ae7b Version in a Makefile is changed to 0.4.1. 2021-04-28 02:50:38 +04:00
Alexander Andreev 503eb9959b Version updated to 0.4.1. 2021-04-28 02:49:59 +04:00
Alexander Andreev cb2e0d77f7 Changelog update for 0.4.1. 2021-04-28 02:49:26 +04:00
Alexander Andreev 93e442939a Dvach's stickers handling. 2021-04-28 02:48:36 +04:00
Alexander Andreev 6022c9929a Added HTTP and URL exceptions handling. 2021-04-28 02:47:41 +04:00
Alexander Andreev f79abcc310 In classifiers licence was fixed and added more topics related to a program. 2020-11-25 03:37:24 +04:00
Alexander Andreev 9cdb510325 A little fix for README. 2020-11-25 03:36:31 +04:00
Alexander Andreev 986fdbe7a7 Handling of no arguments passed. 2020-11-19 01:30:47 +04:00
Alexander Andreev 2e6352cb13 Updated changelog. 2020-11-19 01:26:35 +04:00
Alexander Andreev 7b2fcf0899 Improved error handling, retries for damaged files. 2020-11-19 01:26:19 +04:00
Alexander Andreev 21837c5335 Updated changelog. 2020-11-19 00:09:56 +04:00
Alexander Andreev b970973018 ConnectionResetError handling. 2020-11-19 00:09:39 +04:00
Alexander Andreev 6dab626084 Version is changed to 0.4.0. 2020-11-18 23:51:18 +04:00
Alexander Andreev 86b6278657 Updated changelog and readme. 2020-11-18 23:50:58 +04:00
Alexander Andreev 7754a90313 FileInfo is now a frozen dataclass for efficiency. 2020-11-18 23:48:38 +04:00
Alexander Andreev bb47b50c5f _is_file_ok is now _check_file and was modified to be more efficient. Also added a check for files that happen to share the same name and size but for which the IB reported a wrong hash. 2020-11-18 23:47:26 +04:00
Alexander Andreev 8403fcf0f2 Now op file is explicitly in utf-8. 2020-11-18 23:45:06 +04:00
Alexander Andreev 647a787974 Fixed arguments for a match function. 2020-11-18 23:44:36 +04:00
Alexander Andreev 6a54b88498 sub and com -> subject and comment. Fixed arguments for match function. 2020-11-18 23:43:43 +04:00
Alexander Andreev 2043fc277f No right to fuck up! Shit... Forgot third part of a version. 2020-09-09 04:39:33 +04:00
Alexander Andreev a106d5b739 Added support for lolifox.cc. Fixed User-Agent usage, so it is applied correctly everywhere now. 2020-09-09 04:34:41 +04:00
Alexander Andreev 7825b53121 Did a minor refactoring. Also combined the first two lines that are printed for a thread into one. 2020-07-20 04:32:30 +04:00
Alexander Andreev b26152f3ca Moved User-Agent off to __init__ in its own variable. 2020-07-20 04:31:27 +04:00
Alexander Andreev 9ad9fcfd6f Added supported IBs to readme. 2020-07-20 04:13:39 +04:00
Alexander Andreev 2fcd4f0aa7 Updated usage, so I don't have to edit it every time I add a new IB. 2020-07-20 04:13:12 +04:00
Alexander Andreev bfaa9d2778 Reduced summary. Changed URL. Edited keywords to actual domains. 2020-07-20 04:11:38 +04:00
Alexander Andreev 371c6623e9 Updated changelog. 2020-07-20 03:51:41 +04:00
Alexander Andreev 520d88c76a Parser for 8kun.top added. And I changed comparisons in __init__. 2020-07-20 03:45:51 +04:00
Alexander Andreev 93d2904a4f Regex limited to up to 4 characters after first dot occurred. 2020-07-20 03:44:48 +04:00
Alexander Andreev 6df9e573aa Updated version to 0.2.2 2020-07-20 03:43:36 +04:00
Alexander Andreev f21ff0aff5 Oh, fuck me. What a typo... xD 2020-07-20 02:55:54 +04:00
Alexander Andreev c0282f3934 Changelog updated. 2020-07-18 05:10:31 +04:00
Alexander Andreev 4db2e1dc75 A little change of output. 2020-07-18 05:04:06 +04:00
17 changed files with 549 additions and 329 deletions

CHANGELOG.md

@@ -1,5 +1,92 @@
 # Changelog
+## 0.5.1 - 2021-05-04
+### Added
+- Message when a file cannot be retrieved.
+### Fixed
+- Removed excessive hash comparison when files have the same name;
+- A string that wasn't an f-string is now one, so the reason why a thread
+wasn't found is displayed.
+## 0.5.0 - 2021-05-03
+### Added
+- The program now makes use of the skip_posts argument. Use CLI option
+`-S <number>` or `--skip-posts <number>` to set how many posts you want
+to skip.
+### Changed
+- Better, minified messages;
+- Fixed inheritance of `Scraper`'s subclasses; a sane rewrite allows easy
+future extension with far less repetition;
+- Added a general class `TinyboardLikeParser` that implements the post parser
+for all imageboards based on Tinyboard or with an identical JSON API. From now
+on all such generalisation classes will end with `*LikeParser`;
+- Changed `file_base_url` for 8kun.top.
+### Removed
+- Support for Lolifox, since it's gone.
+## 0.4.1 - 2020-12-08
+### Fixed
+- HTTPException from http.client and URLError from urllib.request are now
+handled;
+- 2ch.hk's stickers handling.
+## 0.4.0 - 2020-11-18
+### Added
+- For 2ch.hk, a check for whether a file is a sticker;
+- Encoding for the `!op.txt` file is explicitly set to `utf-8`;
+- Handling of connection errors, so the program won't crash if a file doesn't
+exist or isn't accessible for any other reason; any damaged files that were
+created are removed;
+- 3 retries if a file was damaged during downloading;
+- The scraper now compares hashes of two files that happen to share the same
+name and size when the hash reported by an imageboard doesn't match the
+file's. This results in excessive downloading and hash calculations;
+hopefully it's only the case for 2ch.hk.
+### Changed
+- FileInfo class is now a frozen dataclass for memory efficiency.
+### Fixed
+- Arguments for the match function that matches the `image.ext` pattern were
+mixed up all over the parsers;
+- For 2ch.hk, the checks for `sub` and `com` were changed to `subject` and
+`comment`.
+## 0.3.0 - 2020-09-09
+### Added
+- Parser for lolifox.cc.
+### Removed
+- BasicScraper. Not needed anymore; there is a faster threaded version.
+### Fixed
+- User-Agent is now correctly applied everywhere.
+## 0.2.2 - 2020-07-20
+### Added
+- Parser for 8kun.top.
+### Changed
+- Site support is now determined by just looking for a substring;
+- Edited the regex that checks whether a filename is just "image.ext", so it
+only allows 1 to 4 characters after "image.".
+### Notes
+- Beware of the file-size issue on 2ch.hk: usually it really reports the size
+in kB, but sometimes it's just wrong.
+## 0.2.1 - 2020-07-18
+### Changed
+- The program now tells you which thread doesn't exist or is about to be
+scraped. That is useful in batch processing with scripts.
 ## 0.2.0 - 2020-07-18
 ### Added
 - Threaded version of the scraper, so now it is fast as heck!

Makefile

@@ -1,7 +1,7 @@
 build: scrapthechan README.md setup.cfg
 	python setup.py sdist bdist_wheel
 install:
-	python -m pip install --upgrade dist/scrapthechan-0.2.0-py3-none-any.whl --user
+	python -m pip install --upgrade dist/scrapthechan-0.5.1-py3-none-any.whl --user
 uninstall:
 # We change directory so pip uninstall will run, it'll fail otherwise.
 	@cd ~/

README.md

@@ -1,8 +1,8 @@
 This is a tool for scraping files from imageboards' threads.
-It extracts the files from a JSON version of a thread. And then downloads 'em
-in a specified output directory or if it isn't specified then creates following
-directory hierarchy in a working directory:
+It extracts the files from a JSON representation of a thread, and then
+downloads 'em into a specified output directory; if none is specified, it
+creates the following directory hierarchy in the working directory:
 <imageboard name>
 |-<board name>
@@ -24,9 +24,24 @@ separately. E.g. `4chan b 1100500`.
 `-o`, `--output-dir` -- output directory where all files will be dumped to.
-`--no-op` -- by default OP's post will be saved in a `!op.txt` file. This flag
-disables this behaviour. I desided to put an `!` in a name so this file will be
-on the top in a directory listing.
-`-v`, `--version` prints the version of the program, and `-h`, `--help` prints
-help for a program.
+`-N`, `--no-op` -- by default OP's post will be saved in a `!op.txt` file. This
+flag disables this behaviour. The exclamation mark `!` in the name keeps the
+file at the top of a directory listing.
+`-S <num>`, `--skip-posts <num>` -- skip the given number of posts.
+`-v`, `--version` prints the version of the program.
+`-h`, `--help` prints help for the program.
+# Supported imageboards
+- [4chan.org](https://4chan.org) since 0.1.0
+- [lainchan.org](https://lainchan.org) since 0.1.0
+- [2ch.hk](https://2ch.hk) since 0.1.0
+- [8kun.top](https://8kun.top) since 0.2.2
+# TODO
+- Sane rewrite of the program;
+- Thread watcher.
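
For example, a hypothetical invocation (made-up thread ID) that skips the first 10 posts and dumps everything into `wallpapers/`:

    scrapthechan https://boards.4chan.org/wg/thread/100500 -S 10 -o wallpapers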

scrapthechan/__init__.py

@@ -1,13 +1,16 @@
-__date__ = "18 Jule 2020"
-__version__ = "0.2.0"
+__date__ = "4 May 2021"
+__version__ = "0.5.1"
 __author__ = "Alexander \"Arav\" Andreev"
 __email__ = "me@arav.top"
-__copyright__ = f"Copyright (c) 2020 {__author__} <{__email__}>"
+__copyright__ = f"Copyright (c) 2020,2021 {__author__} <{__email__}>"
 __license__ = \
 """This program is licensed under the terms of the MIT license.
 For a copy see COPYING file in a directory of the program, or
 see <https://opensource.org/licenses/MIT>"""
+USER_AGENT = f"ScrapTheChan/{__version__}"
 VERSION = \
-f"ScrapTheChan ver. {__version__} ({__date__})\n\n{__copyright__}\n"\
+f"ScrapTheChan ver. {__version__} ({__date__})\n{__copyright__}\n"\
 f"\n{__license__}"

scrapthechan/__main__.py

@@ -3,30 +3,30 @@ from os import makedirs
 from os.path import join, exists
 from re import search
 from sys import argv
-from typing import List
+from typing import List, Optional
 from scrapthechan import VERSION
-from scrapthechan.parser import Parser, ParserThreadNotFoundError
+from scrapthechan.parser import Parser, ThreadNotFoundError
 from scrapthechan.parsers import get_parser_by_url, get_parser_by_site, \
     SUPPORTED_IMAGEBOARDS
-#from scrapthechan.scrapers.basicscraper import BasicScraper
 from scrapthechan.scrapers.threadedscraper import ThreadedScraper
 __all__ = ["main"]
-USAGE = \
-"""Usage: scrapthechan [OPTIONS] (URL | IMAGEBOARD BOARD THREAD)
+USAGE: str = \
+f"""Usage: scrapthechan [OPTIONS] (URL | IMAGEBOARD BOARD THREAD)
 Options:
 \t-h,--help -- print this help and exit;
 \t-v,--version -- print program's version and exit;
 \t-o,--output-dir -- directory where to place scraped files. By default
 \t    following structure will be created in current directory:
 \t    <imageboard>/<board>/<thread>;
 \t-N,--no-op -- by default OP's post will be written in !op.txt file. This
 \t    option disables this behaviour;
+\t-S,--skip-posts <num> -- skip given number of posts.
 Arguments:
 \tURL -- URL of a thread;
@@ -34,19 +34,19 @@ Arguments:
 \tBOARD -- short name of a board. E.g. b;
 \tTHREAD -- ID of a thread. E.g. 100500.
-Supported imageboards: 4chan.org, 2ch.hk, lainchan.org.
+Supported imageboards: {', '.join(SUPPORTED_IMAGEBOARDS)}.
 """
-def parse_common_arguments(args: str) -> dict:
+def parse_common_arguments(args: str) -> Optional[dict]:
     r = r"(?P<help>-h|--help)|(?P<version>-v|--version)"
-    argd = search(r, args)
-    if not argd is None:
-        argd = argd.groupdict()
+    args = search(r, args)
+    if not args is None:
+        args = args.groupdict()
         return {
-            "help": not argd["help"] is None,
-            "version": not argd["version"] is None }
+            "help": not args["help"] is None,
+            "version": not args["version"] is None }
     return None
 def parse_arguments(args: str) -> dict:
     rlink = r"^(https?:\/\/)?(?P<site>[\w.-]+)[ \/](?P<board>\w+)(\S+)?[ \/](?P<thread>\w+)"
@@ -54,15 +54,21 @@ def parse_arguments(args: str) -> dict:
     if not link is None:
         link = link.groupdict()
     out_dir = search(r"(?=(-o|--output-dir) (?P<outdir>\S+))", args)
+    skip_posts = search(r"(?=(-S|--skip-posts) (?P<skip>\d+))", args)
     return {
         "site": None if link is None else link["site"],
         "board": None if link is None else link["board"],
         "thread": None if link is None else link["thread"],
+        "skip-posts": None if skip_posts is None else int(skip_posts.group('skip')),
         "no-op": not search(r"-N|--no-op", args) is None,
         "output-dir": None if out_dir is None \
             else out_dir.groupdict()["outdir"] }
 def main() -> None:
+    if len(argv) == 1:
+        print(USAGE)
+        exit()
     cargs = parse_common_arguments(' '.join(argv[1:]))
     if not cargs is None:
         if cargs["help"]:
@@ -79,19 +85,22 @@ def main() -> None:
             exit()
     try:
-        parser = get_parser_by_site(args["site"], args["board"], args["thread"])
+        if not args["skip-posts"] is None:
+            parser = get_parser_by_site(args["site"], args["board"],
+                args["thread"], args["skip-posts"])
+        else:
+            parser = get_parser_by_site(args["site"], args["board"],
+                args["thread"])
     except NotImplementedError as ex:
         print(f"{str(ex)}.")
         print(f"Supported image boards are {', '.join(SUPPORTED_IMAGEBOARDS)}")
         exit()
-    except ParserThreadNotFoundError:
-        print(f"Thread is no longer exist.")
+    except ThreadNotFoundError as e:
+        print(f"Thread {args['site']}/{args['board']}/{args['thread']} " \
+            f"not found. Reason: {e.reason}")
         exit()
-    flen = len(parser.files)
-    print(f"There are {flen} files in this thread.")
+    files_count = len(parser.files)
     if not args["output-dir"] is None:
         save_dir = args["output-dir"]
@@ -99,25 +108,26 @@ def main() -> None:
         save_dir = join(parser.imageboard, parser.board,
             parser.thread)
-    print(f"They will be saved in {save_dir}.")
+    print(f"{files_count} files in " \
+        f"{args['site']}/{args['board']}/{args['thread']}. " \
+        f"They're going to {save_dir}. ", end="")
     makedirs(save_dir, exist_ok=True)
     if not args["no-op"]:
-        print("Writing OP... ", end='')
         if parser.op is None:
-            print("No text's there.")
+            print("OP's empty.")
         elif not exists(join(save_dir, "!op.txt")):
-            with open(join(save_dir, "!op.txt"), 'w') as opf:
+            with open(join(save_dir, "!op.txt"), 'w', encoding='utf-8') as opf:
                 opf.write(f"{parser.op}\n")
-            print("Done.")
+            print("OP's written.")
         else:
-            print("Exists.")
+            print("OP exists.")
     scraper = ThreadedScraper(save_dir, parser.files, \
-        lambda i: print(f"{i}/{flen}", end="\r"))
+        lambda i: print(f"{i}/{files_count}", end="\r"))
     scraper.run()
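
For reference, this is what the argument regexes above extract from a typical command line (hypothetical thread ID; a quick sketch, not part of the module):

    >>> parse_arguments("https://boards.4chan.org/b/thread/100500 -S 25 -o dump")
    {'site': 'boards.4chan.org', 'board': 'b', 'thread': '100500',
     'skip-posts': 25, 'no-op': False, 'output-dir': 'dump'}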

scrapthechan/fileinfo.py

@@ -1,23 +1,23 @@
-"""FileInfo object stores all needed information about a file."""
+"""FileInfo object stores information about a file."""
+from dataclasses import dataclass
 __all__ = ["FileInfo"]
+@dataclass(frozen=True, order=True)
 class FileInfo:
-    """Stores all needed information about a file.
-    Arguments:
+    """Stores information about a file.
+    Fields:
     - `name` -- name of a file;
     - `size` -- size of a file;
-    - `dlurl` -- full download URL for a file;
+    - `download_url` -- full download URL for a file;
     - `hash_value` -- hash sum of a file;
-    - `hash_algo` -- hash algorithm used (e.g. md5).
+    - `hash_algorithm` -- hash algorithm used (e.g. md5).
     """
-    def __init__(self, name: str, size: int, dlurl: str,
-            hash_value: str, hash_algo: str) -> None:
-        self.name = name
-        self.size = size
-        self.dlurl = dlurl
-        self.hash_value = hash_value
-        self.hash_algo = hash_algo
+    name: str
+    size: int
+    download_url: str
+    hash_value: str
+    hash_algorithm: str
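
A minimal sketch of how the new frozen dataclass behaves (values are made up): construction stays positional, but mutation now raises.

    from dataclasses import FrozenInstanceError

    f = FileInfo("cat.jpg", 312, "https://2ch.hk/b/src/15888888.jpg",
                 "0cc175b9c0f1b6a831c399e269772661", "md5")
    f.name       # 'cat.jpg'
    f.size = 0   # raises FrozenInstanceError -- instances are immutable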

scrapthechan/parser.py

@@ -4,16 +4,22 @@ from itertools import chain
 from json import loads
 from re import findall, match
 from typing import List, Optional
-from urllib.request import urlopen, urlretrieve
+from urllib.request import urlopen, Request, HTTPError
+from scrapthechan import USER_AGENT
 from scrapthechan.fileinfo import FileInfo
-__all__ = ["Parser", "ParserThreadNotFoundError"]
+__all__ = ["Parser", "ThreadNotFoundError"]
-class ParserThreadNotFoundError(Exception):
-    pass
+class ThreadNotFoundError(Exception):
+    def __init__(self, reason: str = ""):
+        self._reason = reason
+    @property
+    def reason(self) -> str:
+        return self._reason
 class Parser:
@@ -24,28 +30,42 @@ class Parser:
     Arguments:
     board -- is a name of a board on an image board;
-    thread -- is a name of a thread inside a board;
-    posts -- is a list of posts in form of dictionaries exported from a JSON;
+    thread -- is an id of a thread inside a board;
     skip_posts -- number of posts to skip.
     All the extracted files will be stored as the `FileInfo` objects."""
-    __url_thread_json: str = "https://example.org/{board}/{thread}.json"
-    __url_file_link: str = None
-    def __init__(self, board: str, thread: str, posts: List[dict],
+    def __init__(self, board: str, thread: str,
             skip_posts: Optional[int] = None) -> None:
-        self._board = board
-        self._thread = thread
-        self._op_post = posts[0]
-        if not skip_posts is None:
-            posts = posts[skip_posts:]
+        self._board: str = board
+        self._thread: str = thread
+        self._posts = self._extract_posts_list(self._get_json())
+        self._op_post: dict = self._posts[0]
+        self._posts = self._posts[skip_posts:] if not skip_posts is None else self._posts
         self._files = list(chain.from_iterable(filter(None, \
-            map(self._parse_post, posts))))
+            map(self._parse_post, self._posts))))
+    @property
+    def json_thread_url(self) -> str:
+        raise NotImplementedError
+    @property
+    def file_base_url(self) -> str:
+        raise NotImplementedError
+    @property
+    def subject_field(self) -> str:
+        return "sub"
+    @property
+    def comment_field(self) -> str:
+        return "com"
     @property
     def imageboard(self) -> str:
         """Returns image board's name."""
-        return NotImplementedError
+        raise NotImplementedError
     @property
     def board(self) -> str:
@@ -61,21 +81,40 @@ class Parser:
     def op(self) -> str:
         """Returns OP's post as combination of subject and comment separated
         by a new line."""
-        raise NotImplementedError
+        op = ""
+        if self.subject_field in self._op_post:
+            op = f"{self._op_post[self.subject_field]}\n"
+        if self.comment_field in self._op_post:
+            op += self._op_post[self.comment_field]
+        return op if not op == "" else None
     @property
     def files(self) -> List[FileInfo]:
         """Returns a list of retrieved files as `FileInfo` objects."""
         return self._files
-    def _get_json(self, thread_url: str) -> dict:
-        """Gets JSON version of a thread and converts it in a dictionary."""
-        try:
-            with urlopen(thread_url) as url:
-                return loads(url.read().decode('utf-8'))
-        except:
-            raise ParserThreadNotFoundError
+    def _extract_posts_list(self, lst: List) -> List[dict]:
+        """This method must be overridden in child classes where you specify
+        a path in a JSON document where posts are stored. E.g., on 4chan this
+        is ['posts'], and on 2ch.hk it's ['threads'][0]['posts']."""
+        return lst
-    def _parse_post(self, post: dict) -> List[FileInfo]:
-        """Parses a single post and extracts files into `FileInfo` object."""
+    def _get_json(self) -> dict:
+        """Retrieves a JSON representation of a thread and converts it in
+        a dictionary."""
+        try:
+            thread_url = self.json_thread_url.format(board=self._board, \
+                thread=self._thread)
+            req = Request(thread_url, headers={'User-Agent': USER_AGENT})
+            with urlopen(req) as url:
+                return loads(url.read().decode('utf-8'))
+        except HTTPError as e:
+            raise ThreadNotFoundError(str(e))
+        except Exception as e:
+            raise e
+    def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
+        """Parses a single post and extracts files into `FileInfo` object.
+        Single object is wrapped in a list for convenient insertion into
+        a list."""
         raise NotImplementedError
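
To show how the rewritten base class is meant to be extended, here is a hypothetical minimal subclass (board name and URLs are made up; the real subclasses follow below):

    from typing import List
    from scrapthechan.parser import Parser

    class ExampleChanParser(Parser):
        @property
        def imageboard(self) -> str:
            return "examplechan.org"

        @property
        def json_thread_url(self) -> str:
            return "https://examplechan.org/{board}/res/{thread}.json"

        @property
        def file_base_url(self) -> str:
            return "https://examplechan.org/{board}/src/{filename}"

        def _extract_posts_list(self, lst: List) -> List[dict]:
            return lst['posts']  # wherever this board's JSON keeps its posts

        # _parse_post() must still be provided; Tinyboard-style boards get it
        # for free from TinyboardLikeParser (defined further below) instead of
        # subclassing Parser directly.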

scrapthechan/parsers/__init__.py

@@ -1,6 +1,6 @@
 """Here are defined the JSON parsers for imageboards."""
 from re import search
-from typing import List
+from typing import List, Optional
 from scrapthechan.parser import Parser
@@ -8,27 +8,31 @@ from scrapthechan.parser import Parser
 __all__ = ["SUPPORTED_IMAGEBOARDS", "get_parser_by_url", "get_parser_by_site"]
-SUPPORTED_IMAGEBOARDS: List[str] = ["4chan.org", "lainchan.org", "2ch.hk"]
+URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
+SUPPORTED_IMAGEBOARDS: List[str] = ["4chan.org", "lainchan.org", "2ch.hk", \
+    "8kun.top"]
-def get_parser_by_url(url: str) -> Parser:
+def get_parser_by_url(url: str, skip_posts: Optional[int] = None) -> Parser:
     """Parses URL and extracts from it site name, board and thread.
     And then returns initialised Parser object for detected imageboard."""
-    URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
     site, board, thread = search(URLRX, url).groups()
-    return get_parser_by_site(site, board, thread)
+    return get_parser_by_site(site, board, thread, skip_posts)
-def get_parser_by_site(site: str, board: str, thread: str) -> Parser:
+def get_parser_by_site(site: str, board: str, thread: str,
+        skip_posts: Optional[int] = None) -> Parser:
     """Returns an initialised parser for `site` with `board` and `thread`."""
-    if site in ['boards.4chan.org', 'boards.4channel.org',
-            '4chan', '4chan.org']:
+    if '4chan' in site:
         from .fourchan import FourChanParser
-        return FourChanParser(board, thread)
+        return FourChanParser(board, thread, skip_posts)
-    elif site in ['lainchan.org', 'lainchan']:
+    elif 'lainchan' in site:
         from .lainchan import LainchanParser
-        return LainchanParser(board, thread)
+        return LainchanParser(board, thread, skip_posts)
-    elif site in ['2ch.hk', '2ch']:
+    elif '2ch' in site:
         from .dvach import DvachParser
-        return DvachParser(board, thread)
+        return DvachParser(board, thread, skip_posts)
+    elif '8kun' in site:
+        from .eightkun import EightKunParser
+        return EightKunParser(board, thread, skip_posts)
     else:
         raise NotImplementedError(f"Parser for {site} is not implemented")
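
With `skip_posts` now threaded through both helpers, dispatching from a bare URL is a single call. A sketch (hypothetical thread ID; note this fetches the thread's JSON over HTTP):

    from scrapthechan.parsers import get_parser_by_url

    parser = get_parser_by_url("https://boards.4chan.org/wg/thread/100500",
                               skip_posts=10)
    print(parser.imageboard)   # '4chan.org'
    print(len(parser.files))   # files found after skipping the first 10 posts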

scrapthechan/parsers/dvach.py

@@ -10,39 +10,54 @@ __all__ = ["DvachParser"]
 class DvachParser(Parser):
     """JSON parser for 2ch.hk image board."""
-    __url_thread_json = "https://2ch.hk/{board}/res/{thread}.json"
-    __url_file_link = "https://2ch.hk"
     def __init__(self, board: str, thread: str,
             skip_posts: Optional[int] = None) -> None:
-        posts = self._get_json(self.__url_thread_json.format(board=board, \
-            thread=thread))['threads'][0]['posts']
-        super(DvachParser, self).__init__(board, thread, posts, skip_posts)
+        super().__init__(board, thread, skip_posts)
+    @property
+    def json_thread_url(self) -> str:
+        return "https://2ch.hk/{board}/res/{thread}.json"
+    @property
+    def file_base_url(self) -> str:
+        return "https://2ch.hk"
+    @property
+    def subject_field(self) -> str:
+        return "subject"
+    @property
+    def comment_field(self) -> str:
+        return "comment"
     @property
     def imageboard(self) -> str:
         return "2ch.hk"
-    @property
-    def op(self) -> Optional[str]:
-        op = ""
-        if 'sub' in self._op_post:
-            op = f"{self._op_post['subject']}\n"
-        if 'com' in self._op_post:
-            op += self._op_post['comment']
-        return op if not op == "" else None
+    def _extract_posts_list(self, lst: List) -> List[dict]:
+        return lst['threads'][0]['posts']
     def _parse_post(self, post) -> Optional[List[FileInfo]]:
         if not 'files' in post: return None
         files = []
         for f in post['files']:
-            if match(f['fullname'], r"^image\.\w+$") is None:
-                fullname = f['fullname']
-            else:
-                fullname = f['name']
+            if not 'sticker' in f:
+                if match(r"^image\.\w+$", f['fullname']) is None:
+                    fullname = f['fullname']
+                else:
+                    fullname = f['name']
+            else:
+                fullname = f['name']
             # Here's same thing as 4chan. 2ch.hk also has md5 field, so it is
             # completely fine to hardcode `hash_algo`.
-            files.append(FileInfo(fullname, f['size'],
-                f"{self.__url_file_link}{f['path']}",
-                f['md5'], 'md5'))
+            if 'md5' in f:
+                files.append(FileInfo(fullname, f['size'],
+                    f"{self.file_base_url}{f['path']}",
+                    f['md5'], 'md5'))
+            else:
+                files.append(FileInfo(fullname, f['size'],
+                    f"{self.file_base_url}{f['path']}",
+                    None, None))
         return files
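
To make the new branches concrete, here are two made-up 2ch.hk file entries and what `_parse_post` does with them: a regular file keeps its `fullname` and `md5`, while a sticker has no usable `fullname` (and may lack `md5`), so `name` is used and the hash fields become None:

    regular = {"fullname": "cat.jpg", "name": "15888888.jpg", "size": 312,
               "path": "/b/src/15888888.jpg",
               "md5": "0cc175b9c0f1b6a831c399e269772661"}
    # -> FileInfo("cat.jpg", 312, "https://2ch.hk/b/src/15888888.jpg",
    #             "0cc175b9c0f1b6a831c399e269772661", "md5")

    sticker = {"sticker": "...", "name": "sticker.png", "size": 24,
               "path": "/stickers/sticker.png"}
    # -> FileInfo("sticker.png", 24, "https://2ch.hk/stickers/sticker.png",
    #             None, None)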

scrapthechan/parsers/eightkun.py

@@ -0,0 +1,25 @@
+from typing import Optional
+from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
+__all__ = ["EightKunParser"]
+class EightKunParser(TinyboardLikeParser):
+    """JSON parser for 8kun.top image board."""
+    def __init__(self, board: str, thread: str,
+            skip_posts: Optional[int] = None) -> None:
+        super().__init__(board, thread, skip_posts)
+    @property
+    def imageboard(self) -> str:
+        return "8kun.top"
+    @property
+    def json_thread_url(self) -> str:
+        return "https://8kun.top/{board}/res/{thread}.json"
+    @property
+    def file_base_url(self) -> str:
+        return "https://media.8kun.top/file_dl/{filename}"

scrapthechan/parsers/fourchan.py

@@ -1,51 +1,25 @@
-from re import match
-from typing import List, Optional
-from scrapthechan.fileinfo import FileInfo
-from scrapthechan.parser import Parser
+from typing import Optional
+from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
 __all__ = ["FourChanParser"]
-class FourChanParser(Parser):
+class FourChanParser(TinyboardLikeParser):
     """JSON parser for 4chan.org image board."""
-    __url_thread_json = "https://a.4cdn.org/{board}/thread/{thread}.json"
-    __url_file_link = "https://i.4cdn.org/{board}/{filename}"
     def __init__(self, board: str, thread: str,
             skip_posts: Optional[int] = None) -> None:
-        posts = self._get_json(self.__url_thread_json.format(board=board, \
-            thread=thread))['posts']
-        super(FourChanParser, self).__init__(board, thread, posts, skip_posts)
+        super().__init__(board, thread, skip_posts)
     @property
     def imageboard(self) -> str:
         return "4chan.org"
     @property
-    def op(self) -> Optional[str]:
-        op = ""
-        if 'sub' in self._op_post:
-            op = f"{self._op_post['sub']}\n"
-        if 'com' in self._op_post:
-            op += self._op_post['com']
-        return op if not op == "" else None
+    def json_thread_url(self) -> str:
+        return "https://a.4cdn.org/{board}/thread/{thread}.json"
-    def _parse_post(self, post: dict) -> List[FileInfo]:
-        if not 'tim' in post: return None
-        dlfname = f"{post['tim']}{post['ext']}"
-        if "filename" in post:
-            if match(post['filename'], r"^image\.\w+$") is None:
-                filename = dlfname
-            else:
-                filename = f"{post['filename']}{post['ext']}"
-        # Hash algorithm is hardcoded since it is highly unlikely that it will
-        # be changed in foreseeable future. And if it'll change then this line
-        # will be necessarily updated anyway.
-        return [FileInfo(filename, post['fsize'],
-            self.__url_file_link.format(board=self.board, filename=dlfname),
-            post['md5'], 'md5')]
+    @property
+    def file_base_url(self) -> str:
+        return "https://i.4cdn.org/{board}/{filename}"

scrapthechan/parsers/lainchan.py

@@ -1,66 +1,25 @@
-from re import match
-from typing import List, Optional
-from scrapthechan.parser import Parser
-from scrapthechan.fileinfo import FileInfo
+from typing import Optional
+from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
 __all__ = ["LainchanParser"]
-class LainchanParser(Parser):
-    """JSON parser for lainchan.org image board.
-    JSON structure is identical to 4chan.org's, so this parser is just inherited
-    from 4chan.org's parser and only needed things are redefined.
-    """
+class LainchanParser(TinyboardLikeParser):
+    """JSON parser for lainchan.org image board."""
-    __url_thread_json = "https://lainchan.org/{board}/res/{thread}.json"
-    __url_file_link = "https://lainchan.org/{board}/src/{filename}"
     def __init__(self, board: str, thread: str,
             skip_posts: Optional[int] = None) -> None:
-        posts = self._get_json(self.__url_thread_json.format(board=board, \
-            thread=thread))['posts']
-        super(LainchanParser, self).__init__(board, thread, posts, skip_posts)
+        super().__init__(board, thread, skip_posts)
     @property
     def imageboard(self) -> str:
         return "lainchan.org"
     @property
-    def op(self) -> Optional[str]:
-        op = ""
-        if 'sub' in self._op_post:
-            op = f"{self._op_post['sub']}\n"
-        if 'com' in self._op_post:
-            op += self._op_post['com']
-        return op if not op == "" else None
+    def json_thread_url(self) -> str:
+        return "https://lainchan.org/{board}/res/{thread}.json"
-    def _parse_post(self, post) -> List[FileInfo]:
-        if not 'tim' in post: return None
-        dlfname = f"{post['tim']}{post['ext']}"
-        if "filename" in post:
-            if match(post['filename'], r"^image\.\w+$") is None:
-                filename = dlfname
-            else:
-                filename = f"{post['filename']}{post['ext']}"
-        files = []
-        files.append(FileInfo(filename, post['fsize'],
-            self.__url_file_link.format(board=self.board, filename=dlfname),
-            post['md5'], 'md5'))
-        if "extra_files" in post:
-            for f in post["extra_files"]:
-                dlfname = f"{f['tim']}{f['ext']}"
-                if "filename" in post:
-                    if match(post['filename'], r"^image\.\w+$") is None:
-                        filename = dlfname
-                    else:
-                        filename = f"{post['filename']}{post['ext']}"
-                dlurl = self.__url_file_link.format(board=self.board, \
-                    filename=dlfname)
-                files.append(FileInfo(filename, f['fsize'], \
-                    dlurl, f['md5'], 'md5'))
-        return files
+    @property
+    def file_base_url(self) -> str:
+        return "https://lainchan.org/{board}/src/{filename}"

scrapthechan/parsers/tinyboardlike.py

@@ -0,0 +1,51 @@
+from re import match
+from typing import List, Optional
+from scrapthechan.parser import Parser
+from scrapthechan.fileinfo import FileInfo
+__all__ = ["TinyboardLikeParser"]
+class TinyboardLikeParser(Parser):
+    """Base parser for imageboards that are based on Tinyboard, or have
+    similar JSON API."""
+    def __init__(self, board: str, thread: str,
+            skip_posts: Optional[int] = None) -> None:
+        super().__init__(board, thread, skip_posts)
+    def _extract_posts_list(self, lst: List) -> List[dict]:
+        return lst['posts']
+    def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
+        if not 'tim' in post: return None
+        dlfname = f"{post['tim']}{post['ext']}"
+        if "filename" in post:
+            if match(r"^image\.\w+$", post['filename']) is None:
+                filename = dlfname
+            else:
+                filename = f"{post['filename']}{post['ext']}"
+        files = []
+        files.append(FileInfo(filename, post['fsize'],
+            self.file_base_url.format(board=self.board, filename=dlfname),
+            post['md5'], 'md5'))
+        if "extra_files" in post:
+            for f in post["extra_files"]:
+                dlfname = f"{f['tim']}{f['ext']}"
+                if "filename" in post:
+                    if match(r"^image\.\w+$", post['filename']) is None:
+                        filename = dlfname
+                    else:
+                        filename = f"{post['filename']}{post['ext']}"
+                dlurl = self.file_base_url.format(board=self.board, \
+                    filename=dlfname)
+                files.append(FileInfo(filename, f['fsize'], \
+                    dlurl, f['md5'], 'md5'))
+        return files
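
As an illustration (made-up post values), note that any original filename that does not literally match `image.ext` is replaced by the server's `tim`-based name:

    post = {"tim": 1588888888888, "ext": ".png", "filename": "diagram",
            "fsize": 4096, "md5": "3kgpiTsdIyEYRkjFJxhkfg=="}
    # match(r"^image\.\w+$", "diagram") is None, so filename = dlfname:
    # -> [FileInfo("1588888888888.png", 4096,
    #       "https://i.4cdn.org/b/1588888888888.png",  # FourChanParser, board 'b'
    #       "3kgpiTsdIyEYRkjFJxhkfg==", "md5")]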

scrapthechan/scraper.py

@@ -1,96 +1,146 @@
-"""Base Scraper implementation."""
+"""Base class for all scrapers that will actually do the job."""
 from base64 import b64encode
 from os import remove, stat
 from os.path import exists, join, getsize
 import re
 from typing import List, Callable
-from urllib.request import urlretrieve, URLopener
+from urllib.request import urlretrieve, URLopener, HTTPError, URLError
 import hashlib
+from http.client import HTTPException
-from scrapthechan import __version__
+from scrapthechan import USER_AGENT
 from scrapthechan.fileinfo import FileInfo
 __all__ = ["Scraper"]
 class Scraper:
-    """Base scraper implementation.
+    """Base class for all scrapers that will actually do the job.
     Arguments:
         save_directory -- a path to a directory where file will be
             saved;
         files -- a list of FileInfo objects;
         download_progress_callback -- a callback function that will be called
             for each file started downloading.
     """
     def __init__(self, save_directory: str, files: List[FileInfo],
             download_progress_callback: Callable[[int], None] = None) -> None:
         self._save_directory = save_directory
         self._files = files
         self._url_opener = URLopener()
-        self._url_opener.version = f"ScrapTheChan/{__version__}"
+        self._url_opener.addheaders = [('User-Agent', USER_AGENT)]
+        self._url_opener.version = USER_AGENT
         self._progress_callback = download_progress_callback
     def run(self):
         raise NotImplementedError
     def _same_filename(self, filename: str, path: str) -> str:
         """Check if there is a file with same name. If so then add incremental
         number enclosed in brackets to a name of a new one."""
         newname = filename
         while exists(join(path, newname)):
             has_extension = newname.rfind(".") != -1
             if has_extension:
                 l, r = newname.rsplit(".", 1)
                 lbracket = l.rfind("(")
                 if lbracket == -1:
                     newname = f"{l}(1).{r}"
                 else:
                     num = l[lbracket+1:-1]
                     if num.isnumeric():
                         newname = f"{l[:lbracket]}({int(num)+1}).{r}"
                     else:
                         newname = f"{l}(1).{r}"
             else:
                 lbracket = l.rfind("(")
                 if lbracket == -1:
                     newname = f"{newname}(1)"
                 else:
                     num = newname[lbracket+1:-1]
                     if num.isnumeric():
                         newname = f"{newname[:lbracket]}({int(num)+1})"
         return newname
-    def _hash_file(self, filename: str, hash_algo: str = "md5",
+    def _hash_file(self, filepath: str, hash_algorithm: str = "md5",
             blocksize: int = 1048576) -> (str, str):
         """Compute hash of a file."""
-        hash_func = hashlib.new(hash_algo)
-        with open(filename, 'rb') as f:
+        if hash_algorithm is None:
+            return None
+        hash_func = hashlib.new(hash_algorithm)
+        with open(filepath, 'rb') as f:
             buf = f.read(blocksize)
             while len(buf) > 0:
                 hash_func.update(buf)
                 buf = f.read(blocksize)
-        return hash_func.hexdigest(), hash_func.digest()
+        return hash_func.hexdigest(), b64encode(hash_func.digest()).decode()
-    def _is_file_ok(self, f: FileInfo, filepath: str) -> bool:
+    def _check_file(self, f: FileInfo, filepath: str) -> bool:
         """Check if a file exist and isn't broken."""
         if not exists(filepath):
             return False
         computed_size = getsize(filepath)
-        is_size_match = f.size == computed_size \
-            or f.size == round(computed_size / 1024)
-        hexdig, dig = self._hash_file(filepath, f.hash_algo)
-        is_hash_match = f.hash_value == hexdig \
-            or f.hash_value == b64encode(dig).decode()
-        return is_size_match and is_hash_match
+        if not (f.size == computed_size \
+            or f.size == round(computed_size / 1024)):
+            return False
+        if not f.hash_algorithm is None:
+            hexdig, dig = self._hash_file(filepath, f.hash_algorithm)
+            return f.hash_value == hexdig or f.hash_value == dig
+        return True
     def _download_file(self, f: FileInfo):
         """Download a single file."""
-        filepath = join(self._save_directory, f.name)
-        if self._is_file_ok(f, filepath):
-            return True
-        elif exists(filepath):
-            filepath = join(self._save_directory, \
-                self._same_filename(f.name, self._save_directory))
-        self._url_opener.retrieve(f.dlurl, filepath)
+        is_same_filename = False
+        filepath = join(self._save_directory, f.name)
+        orig_filepath = filepath
+        if self._check_file(f, filepath):
+            return
+        elif exists(filepath):
+            is_same_filename = True
+            filepath = join(self._save_directory, \
+                self._same_filename(f.name, self._save_directory))
+        try:
+            retries = 3
+            while retries > 0:
+                self._url_opener.retrieve(f.download_url, filepath)
+                if not self._check_file(f, filepath):
+                    remove(filepath)
+                    retries -= 1
+                else:
+                    break
+            if retries == 0:
+                print(f"Cannot retrieve {f.download_url}, {filepath}.")
+                return
+            if is_same_filename:
+                _, f1_dig = self._hash_file(orig_filepath, f.hash_algorithm)
+                _, f2_dig = self._hash_file(filepath, f.hash_algorithm)
+                if f1_dig == f2_dig:
+                    remove(filepath)
+        except FileNotFoundError as e:
+            print("File Not Found", filepath)
+        except HTTPError as e:
+            print("HTTP Error", e.code, e.reason, f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except HTTPException:
+            print("HTTP Exception for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except URLError as e:
+            print("URL Error for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except ConnectionResetError:
+            print("Connection reset for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except ConnectionRefusedError:
+            print("Connection refused for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except ConnectionAbortedError:
+            print("Connection aborted for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
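
`_hash_file` now returns both the hex digest and the base64-encoded raw digest because imageboards report MD5 in different encodings, and `_check_file` accepts either form. A quick standalone illustration (made-up content):

    import hashlib
    from base64 import b64encode

    h = hashlib.new("md5")
    h.update(b"example file contents")
    hexdig = h.hexdigest()                # hex form, as e.g. 2ch.hk reports it
    dig = b64encode(h.digest()).decode()  # base64 form, as e.g. 4chan reports it
    # _check_file() succeeds if FileInfo.hash_value equals either one.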

scrapthechan/scrapers/basicscraper.py (deleted)

@ -1,15 +0,0 @@
"""Implementation of basic sequential one-threaded scraper that downloads
files one by one."""
from scrapthechan.scraper import Scraper
__all__ = ["BasicScraper"]
class BasicScraper(Scraper):
def run(self):
"""Download files one by one."""
for i, f in enumerate(self._files, start=1):
if not self._progress_callback is None:
self._progress_callback(i)
self._download_file(f)

scrapthechan/scrapers/threadedscraper.py

@@ -7,25 +7,26 @@ from multiprocessing.pool import ThreadPool
 from scrapthechan.scraper import Scraper
 from scrapthechan.fileinfo import FileInfo
 __all__ = ["ThreadedScraper"]
 class ThreadedScraper(Scraper):
     def __init__(self, save_directory: str, files: List[FileInfo],
             download_progress_callback: Callable[[int], None] = None) -> None:
-        super(ThreadedScraper, self).__init__(save_directory, files,
-            download_progress_callback)
+        super().__init__(save_directory, files, download_progress_callback)
         self._files_downloaded = 0
         self._files_downloaded_mutex = Lock()
     def run(self):
         pool = ThreadPool(cpu_count() * 2)
         pool.map(self._thread_run, self._files)
         pool.close()
         pool.join()
     def _thread_run(self, f: FileInfo):
-        with self._files_downloaded_mutex:
-            self._files_downloaded += 1
-            if not self._progress_callback is None:
+        if not self._progress_callback is None:
+            with self._files_downloaded_mutex:
+                self._files_downloaded += 1
                 self._progress_callback(self._files_downloaded)
         self._download_file(f)

setup.cfg

@@ -1,31 +1,33 @@
 [metadata]
 name = scrapthechan
 version = attr: scrapthechan.__version__
-description =
-    Scrap the files posted in a thread on an imageboard. Currently supports
-    4chan.org, lainchan.org and 2ch.hk.
+description = Scrap the files from the imageboards.
 long_description = file: README.md
 long_description_content_type = text/markdown
 author = Alexander "Arav" Andreev
 author_email = me@arav.top
-url = https://arav.top
+url = https://git.arav.top/Arav/ScrapTheChan
 keywords =
     scraper
     imageboard
-    4chan
-    2ch
-    lainchan
+    4chan.org
+    2ch.hk
+    lainchan.org
+    8kun.top
 license = MIT
 license_file = COPYING
 classifiers =
-    Development Status :: 2 - Pre-Alpha
+    Development Status :: 3 - Alpha
     Environment :: Console
     Intended Audience :: End Users/Desktop
-    License :: Other/Proprietary License
+    License :: OSI Approved :: MIT License
     Natural Language :: English
     Operating System :: OS Independent
     Programming Language :: Python :: 3.7
-    Programming Language :: Python :: 3.8
+    Topic :: Communications :: BBS
+    Topic :: Internet :: WWW/HTTP
+    Topic :: Internet :: WWW/HTTP :: Dynamic Content :: Message Boards
+    Topic :: Text Processing
     Topic :: Utilities
 [options]