Compare commits
52 Commits
Author | SHA1 | Date |
---|---|---|
Alexander Andreev | 43909c2b29 | |
Alexander Andreev | acbfaefa9c | |
Alexander Andreev | 86ef44aa07 | |
Alexander Andreev | 419fb2b673 | |
Alexander Andreev | 0287d3a132 | |
Alexander Andreev | 245e33f40d | |
Alexander Andreev | e092c905b2 | |
Alexander Andreev | 90338073ed | |
Alexander Andreev | cdcc184de8 | |
Alexander Andreev | b335891097 | |
Alexander Andreev | 1213cef776 | |
Alexander Andreev | 78d4a62c17 | |
Alexander Andreev | f3ef07af68 | |
Alexander Andreev | 6373518dc3 | |
Alexander Andreev | caf18a1bf0 | |
Alexander Andreev | 751549f575 | |
Alexander Andreev | 38b5740d73 | |
Alexander Andreev | 2f9d26427c | |
Alexander Andreev | e7cf2e7c4b | |
Alexander Andreev | 4f6f56ae7b | |
Alexander Andreev | 503eb9959b | |
Alexander Andreev | cb2e0d77f7 | |
Alexander Andreev | 93e442939a | |
Alexander Andreev | 6022c9929a | |
Alexander Andreev | f79abcc310 | |
Alexander Andreev | 9cdb510325 | |
Alexander Andreev | 986fdbe7a7 | |
Alexander Andreev | 2e6352cb13 | |
Alexander Andreev | 7b2fcf0899 | |
Alexander Andreev | 21837c5335 | |
Alexander Andreev | b970973018 | |
Alexander Andreev | 6dab626084 | |
Alexander Andreev | 86b6278657 | |
Alexander Andreev | 7754a90313 | |
Alexander Andreev | bb47b50c5f | |
Alexander Andreev | 8403fcf0f2 | |
Alexander Andreev | 647a787974 | |
Alexander Andreev | 6a54b88498 | |
Alexander Andreev | 2043fc277f | |
Alexander Andreev | a106d5b739 | |
Alexander Andreev | 7825b53121 | |
Alexander Andreev | b26152f3ca | |
Alexander Andreev | 9ad9fcfd6f | |
Alexander Andreev | 2fcd4f0aa7 | |
Alexander Andreev | bfaa9d2778 | |
Alexander Andreev | 371c6623e9 | |
Alexander Andreev | 520d88c76a | |
Alexander Andreev | 93d2904a4f | |
Alexander Andreev | 6df9e573aa | |
Alexander Andreev | f21ff0aff5 | |
Alexander Andreev | c0282f3934 | |
Alexander Andreev | 4db2e1dc75 | |
87 CHANGELOG.md

@@ -1,5 +1,92 @@
 # Changelog
 
+## 0.5.1 - 2021-05-04
+
+### Added
+
+- Message when a file cannot be retrieved.
+
+### Fixed
+
+- Removed excessive hash comparison when files have the same name;
+- A string was not made an f-string, so now it displays the reason why a
+  thread wasn't found.
+
+## 0.5.0 - 2021-05-03
+
+### Added
+
+- Now the program makes use of the skip_posts argument. Use CLI option
+  `-S <number>` or `--skip-posts <number>` to set how many posts you want
+  to skip.
+
+### Changed
+
+- Better, minified messages;
+- Fixed inheritance of `Scraper`'s subclasses and gave them a sane rewrite
+  that allows easy future extension with far less repetition;
+- Added a general class `TinyboardLikeParser` that implements the post parser
+  for all imageboards based on Tinyboard or ones that have an identical JSON
+  API. From now on all such generalisation classes will end with `*LikeParser`;
+- Changed `file_base_url` for 8kun.top.
+
+### Removed
+
+- Support for Lolifox, since it's gone.
+
+## 0.4.1 - 2020-12-08
+
+### Fixed
+
+- Now HTTPException from http.client and URLError from urllib.request
+  are handled;
+- 2ch.hk's stickers handling.
+
+## 0.4.0 - 2020-11-18
+
+### Added
+
+- For 2ch.hk, a check for whether a file is a sticker was added;
+- Encoding for the `!op.txt` file was explicitly set to `utf-8`;
+- Handling of connection errors was added, so now the program won't crash if
+  a file doesn't exist or isn't accessible for any other reason, and any
+  damaged files that were created will be removed;
+- Added 3 retries if a file was damaged during downloading;
+- The scraper now matches hashes of two files that happen to share the same
+  name and size but whose hash reported by an imageboard differs from the
+  file's. This results in excessive downloading and hash calculations;
+  hopefully that's only the case for 2ch.hk.
+
+### Changed
+
+- The FileInfo class is now a frozen dataclass for memory efficiency.
+
+### Fixed
+
+- Arguments for the match function that matches the `image.ext` pattern were
+  mixed up in places all over the parsers;
+- Also for 2ch.hk, the checks for `sub` and `com` were changed to `subject`
+  and `comment`.
+
+## 0.3.0 - 2020-09-09
+
+### Added
+
+- Parser for lolifox.cc.
+
+### Removed
+
+- BasicScraper. Not needed anymore; there is a faster threaded version.
+
+### Fixed
+
+- Now the User-Agent is correctly applied everywhere.
+
+## 0.2.2 - 2020-07-20
+
+### Added
+
+- Parser for 8kun.top.
+
+### Changed
+
+- The check for whether a site is supported now just looks for a substring;
+- Edited the regex that checks whether a filename is just "image.ext" so it
+  only allows 1 to 4 characters after "image.".
+
+### Notes
+
+- Consider the issue with size on 2ch.hk: usually it really does report the
+  size in kB, but sometimes it's just wrong.
+
+## 0.2.1 - 2020-07-18
+
+### Changed
+
+- Now the program tells you which thread doesn't exist or is about to be
+  scraped. That is useful in batch processing with scripts.
+
 ## 0.2.0 - 2020-07-18
 ### Added
 - Threaded version of the scraper, so now it is fast as heck!
2 Makefile

@@ -1,7 +1,7 @@
 build: scrapthechan README.md setup.cfg
 	python setup.py sdist bdist_wheel
 install:
-	python -m pip install --upgrade dist/scrapthechan-0.2.0-py3-none-any.whl --user
+	python -m pip install --upgrade dist/scrapthechan-0.5.1-py3-none-any.whl --user
 uninstall:
 	# We change directory so pip uninstall will run, it'll fail otherwise.
 	@cd ~/
31 README.md

@@ -1,8 +1,8 @@
 This is a tool for scraping files from imageboards' threads.
 
-It extracts the files from a JSON version of a thread. And then downloads 'em
-in a specified output directory or if it isn't specified then creates following
-directory hierarchy in a working directory:
+It extracts the files from a JSON representation of a thread. And then downloads
+'em in a specified output directory or if it isn't specified then creates
+following directory hierarchy in a working directory:
 
     <imageboard name>
     |-<board name>
@@ -24,9 +24,24 @@ separately. E.g. `4chan b 1100500`.
 
 `-o`, `--output-dir` -- output directory where all files will be dumped to.
 
-`--no-op` -- by default OP's post will be saved in a `!op.txt` file. This flag
-disables this behaviour. I desided to put an `!` in a name so this file will be
-on the top in a directory listing.
+`-N`, `--no-op` -- by default OP's post will be saved in a `!op.txt` file. This
+flag disables this behaviour. The exclamation mark `!` in the name is there so
+this file will be at the top of a directory listing.
 
-`-v`, `--version` prints the version of the program, and `-h`, `--help` prints
-help for a program.
+`-S <num>`, `--skip-posts <num>` -- skip a given number of posts.
+
+`-v`, `--version` prints the version of the program.
+
+`-h`, `--help` prints help for the program.
+
+# Supported imageboards
+
+- [4chan.org](https://4chan.org) since 0.1.0
+- [lainchan.org](https://lainchan.org) since 0.1.0
+- [2ch.hk](https://2ch.hk) since 0.1.0
+- [8kun.top](https://8kun.top) since 0.2.2
+
+# TODO
+
+- Sane rewrite of the program;
+- Thread watcher.
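For readers following the diff, here is a hedged sketch of the library-level equivalent of `scrapthechan -S 10 -o out <url>`, using only the signatures visible in this comparison (`get_parser_by_url`, `ThreadedScraper`). The thread URL is hypothetical, and constructing a parser performs a network fetch:

```python
# Hedged sketch, not documented API usage: what the CLI wires together.
from scrapthechan.parsers import get_parser_by_url
from scrapthechan.scrapers.threadedscraper import ThreadedScraper

# Hypothetical thread URL; the parser fetches the thread's JSON on construction.
parser = get_parser_by_url("https://boards.4chan.org/b/thread/100500",
    skip_posts=10)
scraper = ThreadedScraper("out", parser.files,
    lambda i: print(f"{i}/{len(parser.files)}", end="\r"))
scraper.run()
```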
@@ -1,13 +1,16 @@
-__date__ = "18 Jule 2020"
-__version__ = "0.2.0"
+__date__ = "4 May 2021"
+__version__ = "0.5.1"
 __author__ = "Alexander \"Arav\" Andreev"
 __email__ = "me@arav.top"
-__copyright__ = f"Copyright (c) 2020 {__author__} <{__email__}>"
+__copyright__ = f"Copyright (c) 2020,2021 {__author__} <{__email__}>"
 __license__ = \
 """This program is licensed under the terms of the MIT license.
 For a copy see COPYING file in a directory of the program, or
 see <https://opensource.org/licenses/MIT>"""
 
+
+USER_AGENT = f"ScrapTheChan/{__version__}"
+
 VERSION = \
-    f"ScrapTheChan ver. {__version__} ({__date__})\n\n{__copyright__}\n"\
+    f"ScrapTheChan ver. {__version__} ({__date__})\n{__copyright__}\n"\
     f"\n{__license__}"
@@ -3,21 +3,20 @@ from os import makedirs
 from os.path import join, exists
 from re import search
 from sys import argv
-from typing import List
+from typing import List, Optional
 
 from scrapthechan import VERSION
-from scrapthechan.parser import Parser, ParserThreadNotFoundError
+from scrapthechan.parser import Parser, ThreadNotFoundError
 from scrapthechan.parsers import get_parser_by_url, get_parser_by_site, \
     SUPPORTED_IMAGEBOARDS
-#from scrapthechan.scrapers.basicscraper import BasicScraper
 from scrapthechan.scrapers.threadedscraper import ThreadedScraper
 
 
 __all__ = ["main"]
 
 
-USAGE = \
-"""Usage: scrapthechan [OPTIONS] (URL | IMAGEBOARD BOARD THREAD)
+USAGE: str = \
+f"""Usage: scrapthechan [OPTIONS] (URL | IMAGEBOARD BOARD THREAD)
 
 Options:
 \t-h,--help -- print this help and exit;
@@ -27,6 +26,7 @@ Options:
 \t    <imageboard>/<board>/<thread>;
 \t-N,--no-op -- by default OP's post will be written in !op.txt file. This
 \t    option disables this behaviour;
+\t-S,--skip-posts <num> -- skip given number of posts.
 
 Arguments:
 \tURL -- URL of a thread;
@@ -34,18 +34,18 @@ Arguments:
 \tBOARD -- short name of a board. E.g. b;
 \tTHREAD -- ID of a thread. E.g. 100500.
 
-Supported imageboards: 4chan.org, 2ch.hk, lainchan.org.
+Supported imageboards: {', '.join(SUPPORTED_IMAGEBOARDS)}.
 """
 
 
-def parse_common_arguments(args: str) -> dict:
+def parse_common_arguments(args: str) -> Optional[dict]:
     r = r"(?P<help>-h|--help)|(?P<version>-v|--version)"
-    argd = search(r, args)
-    if not argd is None:
-        argd = argd.groupdict()
+    args = search(r, args)
+    if not args is None:
+        args = args.groupdict()
         return {
-            "help": not argd["help"] is None,
-            "version": not argd["version"] is None }
+            "help": not args["help"] is None,
+            "version": not args["version"] is None }
     return None
 
 def parse_arguments(args: str) -> dict:
@@ -54,15 +54,21 @@ def parse_arguments(args: str) -> dict:
     if not link is None:
         link = link.groupdict()
     out_dir = search(r"(?=(-o|--output-dir) (?P<outdir>\S+))", args)
+    skip_posts = search(r"(?=(-S|--skip-posts) (?P<skip>\d+))", args)
     return {
         "site": None if link is None else link["site"],
         "board": None if link is None else link["board"],
         "thread": None if link is None else link["thread"],
+        "skip-posts": None if skip_posts is None else int(skip_posts.group('skip')),
        "no-op": not search(r"-N|--no-op", args) is None,
        "output-dir": None if out_dir is None \
            else out_dir.groupdict()["outdir"] }
 
 def main() -> None:
+    if len(argv) == 1:
+        print(USAGE)
+        exit()
+
     cargs = parse_common_arguments(' '.join(argv[1:]))
     if not cargs is None:
         if cargs["help"]:
@@ -79,19 +85,22 @@ def main() -> None:
         exit()
 
     try:
-        parser = get_parser_by_site(args["site"], args["board"], args["thread"])
+        if not args["skip-posts"] is None:
+            parser = get_parser_by_site(args["site"], args["board"],
+                args["thread"], args["skip-posts"])
+        else:
+            parser = get_parser_by_site(args["site"], args["board"],
+                args["thread"])
     except NotImplementedError as ex:
         print(f"{str(ex)}.")
         print(f"Supported image boards are {', '.join(SUPPORTED_IMAGEBOARDS)}")
         exit()
-    except ParserThreadNotFoundError:
-        print(f"Thread is no longer exist.")
+    except ThreadNotFoundError as e:
+        print(f"Thread {args['site']}/{args['board']}/{args['thread']} " \
+            f"not found. Reason: {e.reason}")
         exit()
 
-    flen = len(parser.files)
-
-    print(f"There are {flen} files in this thread.")
+    files_count = len(parser.files)
 
     if not args["output-dir"] is None:
         save_dir = args["output-dir"]
@@ -99,25 +108,26 @@ def main() -> None:
         save_dir = join(parser.imageboard, parser.board,
             parser.thread)
 
-    print(f"They will be saved in {save_dir}.")
+    print(f"{files_count} files in " \
+        f"{args['site']}/{args['board']}/{args['thread']}. " \
+        f"They're going to {save_dir}. ", end="")
 
     makedirs(save_dir, exist_ok=True)
 
     if not args["no-op"]:
-        print("Writing OP... ", end='')
         if parser.op is None:
-            print("No text's there.")
+            print("OP's empty.")
         elif not exists(join(save_dir, "!op.txt")):
-            with open(join(save_dir, "!op.txt"), 'w') as opf:
+            with open(join(save_dir, "!op.txt"), 'w', encoding='utf-8') as opf:
                 opf.write(f"{parser.op}\n")
-            print("Done.")
+            print("OP's written.")
         else:
-            print("Exists.")
+            print("OP exists.")
 
     scraper = ThreadedScraper(save_dir, parser.files, \
-        lambda i: print(f"{i}/{flen}", end="\r"))
+        lambda i: print(f"{i}/{files_count}", end="\r"))
     scraper.run()
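A note on the option parsing above: each regex wraps the option/value pair in a `(?=...)` lookahead, so `search` can find it anywhere in the joined argument string without consuming it. A self-contained sketch with a made-up argument string:

```python
# Sketch of the lookahead-based option extraction used in parse_arguments.
from re import search

argline = "4chan b 100500 -S 15 --output-dir out"  # made-up CLI input
out_dir = search(r"(?=(-o|--output-dir) (?P<outdir>\S+))", argline)
skip = search(r"(?=(-S|--skip-posts) (?P<skip>\d+))", argline)
print(out_dir.group("outdir"))  # out
print(int(skip.group("skip")))  # 15
```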
@@ -1,23 +1,23 @@
-"""FileInfo object stores all needed information about a file."""
+"""FileInfo object stores information about a file."""
 
+from dataclasses import dataclass
 
 __all__ = ["FileInfo"]
 
 
+@dataclass(frozen=True, order=True)
 class FileInfo:
-    """Stores all needed information about a file.
+    """Stores information about a file.
 
-    Arguments:
+    Fields:
     - `name` -- name of a file;
     - `size` -- size of a file;
-    - `dlurl` -- full download URL for a file;
+    - `download_url` -- full download URL for a file;
     - `hash_value` -- hash sum of a file;
-    - `hash_algo` -- hash algorithm used (e.g. md5).
+    - `hash_algorithm` -- hash algorithm used (e.g. md5).
     """
-    def __init__(self, name: str, size: int, dlurl: str,
-        hash_value: str, hash_algo: str) -> None:
-        self.name = name
-        self.size = size
-        self.dlurl = dlurl
-        self.hash_value = hash_value
-        self.hash_algo = hash_algo
+    name: str
+    size: int
+    download_url: str
+    hash_value: str
+    hash_algorithm: str
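For illustration, here is what the switch to `@dataclass(frozen=True, order=True)` buys (a sketch; all field values are made up):

```python
# Sketch of the new FileInfo semantics; every value below is made up.
from scrapthechan.fileinfo import FileInfo

a = FileInfo("cat.jpg", 4096, "https://example.invalid/cat.jpg", "abc1", "md5")
b = FileInfo("cat.jpg", 4096, "https://example.invalid/cat.jpg", "abc1", "md5")
print(a == b)                   # True: equality is generated field-by-field
# a.size = 0                    # would raise dataclasses.FrozenInstanceError
print(sorted([b, a])[0].name)   # order=True makes lists of files sortable
```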
@@ -4,16 +4,22 @@ from itertools import chain
 from json import loads
 from re import findall, match
 from typing import List, Optional
-from urllib.request import urlopen, urlretrieve
+from urllib.request import urlopen, Request, HTTPError
 
+from scrapthechan import USER_AGENT
 from scrapthechan.fileinfo import FileInfo
 
 
-__all__ = ["Parser", "ParserThreadNotFoundError"]
+__all__ = ["Parser", "ThreadNotFoundError"]
 
 
-class ParserThreadNotFoundError(Exception):
-    pass
+class ThreadNotFoundError(Exception):
+    def __init__(self, reason: str = ""):
+        self._reason = reason
+
+    @property
+    def reason(self) -> str:
+        return self._reason
 
 
 class Parser:
@@ -24,28 +30,42 @@ class Parser:
 
     Arguments:
     board -- is a name of a board on an image board;
-    thread -- is a name of a thread inside a board;
-    posts -- is a list of posts in form of dictionaries exported from a JSON;
+    thread -- is an id of a thread inside a board;
     skip_posts -- number of posts to skip.
 
     All the extracted files will be stored as the `FileInfo` objects."""
-    __url_thread_json: str = "https://example.org/{board}/{thread}.json"
-    __url_file_link: str = None
-
-    def __init__(self, board: str, thread: str, posts: List[dict],
+    def __init__(self, board: str, thread: str,
         skip_posts: Optional[int] = None) -> None:
-        self._board = board
-        self._thread = thread
-        self._op_post = posts[0]
-        if not skip_posts is None:
-            posts = posts[skip_posts:]
+        self._board: str = board
+        self._thread: str = thread
+        self._posts = self._extract_posts_list(self._get_json())
+        self._op_post: dict = self._posts[0]
+        self._posts = self._posts[skip_posts:] if not skip_posts is None else self._posts
         self._files = list(chain.from_iterable(filter(None, \
-            map(self._parse_post, posts))))
+            map(self._parse_post, self._posts))))
+
+    @property
+    def json_thread_url(self) -> str:
+        raise NotImplementedError
+
+    @property
+    def file_base_url(self) -> str:
+        raise NotImplementedError
+
+    @property
+    def subject_field(self) -> str:
+        return "sub"
+
+    @property
+    def comment_field(self) -> str:
+        return "com"
 
     @property
     def imageboard(self) -> str:
         """Returns image board's name."""
-        return NotImplementedError
+        raise NotImplementedError
 
     @property
     def board(self) -> str:
@@ -61,21 +81,40 @@
     def op(self) -> str:
         """Returns OP's post as combination of subject and comment separated
         by a new line."""
-        raise NotImplementedError
+        op = ""
+        if self.subject_field in self._op_post:
+            op = f"{self._op_post[self.subject_field]}\n"
+        if self.comment_field in self._op_post:
+            op += self._op_post[self.comment_field]
+        return op if not op == "" else None
 
     @property
     def files(self) -> List[FileInfo]:
         """Returns a list of retrieved files as `FileInfo` objects."""
         return self._files
 
-    def _get_json(self, thread_url: str) -> dict:
-        """Gets JSON version of a thread and converts it in a dictionary."""
-        try:
-            with urlopen(thread_url) as url:
-                return loads(url.read().decode('utf-8'))
-        except:
-            raise ParserThreadNotFoundError
+    def _extract_posts_list(self, lst: List) -> List[dict]:
+        """This method must be overridden in child classes where you specify
+        a path in a JSON document where posts are stored. E.g., on 4chan this is
+        ['posts'], and on 2ch.hk it's ['threads'][0]['posts']."""
+        return lst
 
-    def _parse_post(self, post: dict) -> List[FileInfo]:
-        """Parses a single post and extracts files into `FileInfo` object."""
+    def _get_json(self) -> dict:
+        """Retrieves a JSON representation of a thread and converts it in
+        a dictionary."""
+        try:
+            thread_url = self.json_thread_url.format(board=self._board, \
+                thread=self._thread)
+            req = Request(thread_url, headers={'User-Agent': USER_AGENT})
+            with urlopen(req) as url:
+                return loads(url.read().decode('utf-8'))
+        except HTTPError as e:
+            raise ThreadNotFoundError(str(e))
+        except Exception as e:
+            raise e
+
+    def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
+        """Parses a single post and extracts files into `FileInfo` object.
+        Single object is wrapped in a list for convenient insertion into
+        a list."""
         raise NotImplementedError
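The rewritten `Parser` is now a template-method base class: a subclass supplies the URLs, field names, and JSON path, while the base class does the fetching and file extraction. A hypothetical minimal subclass, sketched from the signatures above (a real one would also need `_parse_post`, or would inherit it from `TinyboardLikeParser`, added later in this diff):

```python
# Hypothetical subclass sketch; the site name and URLs are made up.
from typing import List, Optional

from scrapthechan.parser import Parser


class ExampleChanParser(Parser):
    @property
    def imageboard(self) -> str:
        return "examplechan.invalid"

    @property
    def json_thread_url(self) -> str:
        return "https://examplechan.invalid/{board}/res/{thread}.json"

    @property
    def file_base_url(self) -> str:
        return "https://examplechan.invalid/{board}/src/{filename}"

    def _extract_posts_list(self, lst: List) -> List[dict]:
        # Point the base class at where posts live in this board's JSON.
        return lst["posts"]
```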
@@ -1,6 +1,6 @@
 """Here are defined the JSON parsers for imageboards."""
 from re import search
-from typing import List
+from typing import List, Optional
 
 from scrapthechan.parser import Parser
 
@@ -8,27 +8,31 @@ from scrapthechan.parser import Parser
 __all__ = ["SUPPORTED_IMAGEBOARDS", "get_parser_by_url", "get_parser_by_site"]
 
 
-SUPPORTED_IMAGEBOARDS: List[str] = ["4chan.org", "lainchan.org", "2ch.hk"]
+URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
+SUPPORTED_IMAGEBOARDS: List[str] = ["4chan.org", "lainchan.org", "2ch.hk", \
+    "8kun.top"]
 
 
-def get_parser_by_url(url: str) -> Parser:
+def get_parser_by_url(url: str, skip_posts: Optional[int] = None) -> Parser:
     """Parses URL and extracts from it site name, board and thread.
     And then returns initialised Parser object for detected imageboard."""
-    URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
     site, board, thread = search(URLRX, url).groups()
-    return get_parser_by_site(site, board, thread)
+    return get_parser_by_site(site, board, thread, skip_posts)
 
-def get_parser_by_site(site: str, board: str, thread: str) -> Parser:
+def get_parser_by_site(site: str, board: str, thread: str,
+    skip_posts: Optional[int] = None) -> Parser:
     """Returns an initialised parser for `site` with `board` and `thread`."""
-    if site in ['boards.4chan.org', 'boards.4channel.org',
-        '4chan', '4chan.org']:
+    if '4chan' in site:
         from .fourchan import FourChanParser
-        return FourChanParser(board, thread)
+        return FourChanParser(board, thread, skip_posts)
-    elif site in ['lainchan.org', 'lainchan']:
+    elif 'lainchan' in site:
         from .lainchan import LainchanParser
-        return LainchanParser(board, thread)
+        return LainchanParser(board, thread, skip_posts)
-    elif site in ['2ch.hk', '2ch']:
+    elif '2ch' in site:
         from .dvach import DvachParser
-        return DvachParser(board, thread)
+        return DvachParser(board, thread, skip_posts)
+    elif '8kun' in site:
+        from .eightkun import EightKunParser
+        return EightKunParser(board, thread, skip_posts)
     else:
         raise NotImplementedError(f"Parser for {site} is not implemented")
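The now module-level `URLRX` is what `get_parser_by_url` uses to split a thread URL into site, board, and thread. A small sketch (the URL is made up):

```python
# Sketch: URLRX pulls (site, board, thread) out of a thread URL.
from re import search

URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
site, board, thread = search(URLRX,
    "https://boards.4chan.org/g/thread/100500").groups()
print(site, board, thread)  # boards.4chan.org g 100500
```

With the substring dispatch that follows, any host containing `4chan` (e.g. `boards.4chan.org`, `boards.4channel.org`) routes to `FourChanParser` without enumerating every hostname.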
@ -10,39 +10,54 @@ __all__ = ["DvachParser"]
|
||||||
class DvachParser(Parser):
|
class DvachParser(Parser):
|
||||||
"""JSON parser for 2ch.hk image board."""
|
"""JSON parser for 2ch.hk image board."""
|
||||||
|
|
||||||
__url_thread_json = "https://2ch.hk/{board}/res/{thread}.json"
|
|
||||||
__url_file_link = "https://2ch.hk"
|
|
||||||
|
|
||||||
def __init__(self, board: str, thread: str,
|
def __init__(self, board: str, thread: str,
|
||||||
skip_posts: Optional[int] = None) -> None:
|
skip_posts: Optional[int] = None) -> None:
|
||||||
posts = self._get_json(self.__url_thread_json.format(board=board, \
|
super().__init__(board, thread, skip_posts)
|
||||||
thread=thread))['threads'][0]['posts']
|
|
||||||
super(DvachParser, self).__init__(board, thread, posts, skip_posts)
|
@property
|
||||||
|
def json_thread_url(self) -> str:
|
||||||
|
return "https://2ch.hk/{board}/res/{thread}.json"
|
||||||
|
|
||||||
|
@property
|
||||||
|
def file_base_url(self) -> str:
|
||||||
|
return "https://2ch.hk"
|
||||||
|
|
||||||
|
@property
|
||||||
|
def subject_field(self) -> str:
|
||||||
|
return "subject"
|
||||||
|
|
||||||
|
@property
|
||||||
|
def comment_field(self) -> str:
|
||||||
|
return "comment"
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def imageboard(self) -> str:
|
def imageboard(self) -> str:
|
||||||
return "2ch.hk"
|
return "2ch.hk"
|
||||||
|
|
||||||
@property
|
def _extract_posts_list(self, lst: List) -> List[dict]:
|
||||||
def op(self) -> Optional[str]:
|
return lst['threads'][0]['posts']
|
||||||
op = ""
|
|
||||||
if 'sub' in self._op_post:
|
|
||||||
op = f"{self._op_post['subject']}\n"
|
|
||||||
if 'com' in self._op_post:
|
|
||||||
op += self._op_post['comment']
|
|
||||||
return op if not op == "" else None
|
|
||||||
|
|
||||||
def _parse_post(self, post) -> Optional[List[FileInfo]]:
|
def _parse_post(self, post) -> Optional[List[FileInfo]]:
|
||||||
if not 'files' in post: return None
|
if not 'files' in post: return None
|
||||||
|
|
||||||
files = []
|
files = []
|
||||||
|
|
||||||
for f in post['files']:
|
for f in post['files']:
|
||||||
if match(f['fullname'], r"^image\.\w+$") is None:
|
if not 'sticker' in f:
|
||||||
|
if match(r"^image\.\w+$", f['fullname']) is None:
|
||||||
fullname = f['fullname']
|
fullname = f['fullname']
|
||||||
else:
|
else:
|
||||||
fullname = f['name']
|
fullname = f['name']
|
||||||
|
else:
|
||||||
|
fullname = f['name']
|
||||||
# Here's same thing as 4chan. 2ch.hk also has md5 field, so it is
|
# Here's same thing as 4chan. 2ch.hk also has md5 field, so it is
|
||||||
# completely fine to hardcode `hash_algo`.
|
# completely fine to hardcode `hash_algo`.
|
||||||
|
if 'md5' in f:
|
||||||
files.append(FileInfo(fullname, f['size'],
|
files.append(FileInfo(fullname, f['size'],
|
||||||
f"{self.__url_file_link}{f['path']}",
|
f"{self.file_base_url}{f['path']}",
|
||||||
f['md5'], 'md5'))
|
f['md5'], 'md5'))
|
||||||
|
else:
|
||||||
|
files.append(FileInfo(fullname, f['size'],
|
||||||
|
f"{self.file_base_url}{f['path']}",
|
||||||
|
None, None))
|
||||||
return files
|
return files
|
||||||
|
|
|
@@ -0,0 +1,25 @@
+from typing import Optional
+
+from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
+
+__all__ = ["EightKunParser"]
+
+
+class EightKunParser(TinyboardLikeParser):
+    """JSON parser for 8kun.top image board."""
+
+    def __init__(self, board: str, thread: str,
+        skip_posts: Optional[int] = None) -> None:
+        super().__init__(board, thread, skip_posts)
+
+    @property
+    def imageboard(self) -> str:
+        return "8kun.top"
+
+    @property
+    def json_thread_url(self) -> str:
+        return "https://8kun.top/{board}/res/{thread}.json"
+
+    @property
+    def file_base_url(self) -> str:
+        return "https://media.8kun.top/file_dl/{filename}"
@@ -1,51 +1,25 @@
-from re import match
-from typing import List, Optional
+from typing import Optional
 
-from scrapthechan.fileinfo import FileInfo
-from scrapthechan.parser import Parser
+from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
 
 __all__ = ["FourChanParser"]
 
 
-class FourChanParser(Parser):
+class FourChanParser(TinyboardLikeParser):
     """JSON parser for 4chan.org image board."""
 
-    __url_thread_json = "https://a.4cdn.org/{board}/thread/{thread}.json"
-    __url_file_link = "https://i.4cdn.org/{board}/{filename}"
-
     def __init__(self, board: str, thread: str,
         skip_posts: Optional[int] = None) -> None:
-        posts = self._get_json(self.__url_thread_json.format(board=board, \
-            thread=thread))['posts']
-        super(FourChanParser, self).__init__(board, thread, posts, skip_posts)
+        super().__init__(board, thread, skip_posts)
 
     @property
     def imageboard(self) -> str:
         return "4chan.org"
 
     @property
-    def op(self) -> Optional[str]:
-        op = ""
-        if 'sub' in self._op_post:
-            op = f"{self._op_post['sub']}\n"
-        if 'com' in self._op_post:
-            op += self._op_post['com']
-        return op if not op == "" else None
+    def json_thread_url(self) -> str:
+        return "https://a.4cdn.org/{board}/thread/{thread}.json"
 
-    def _parse_post(self, post: dict) -> List[FileInfo]:
-        if not 'tim' in post: return None
-
-        dlfname = f"{post['tim']}{post['ext']}"
-
-        if "filename" in post:
-            if match(post['filename'], r"^image\.\w+$") is None:
-                filename = dlfname
-            else:
-                filename = f"{post['filename']}{post['ext']}"
-
-        # Hash algorithm is hardcoded since it is highly unlikely that it will
-        # be changed in foreseeable future. And if it'll change then this line
-        # will be necessarily updated anyway.
-        return [FileInfo(filename, post['fsize'],
-            self.__url_file_link.format(board=self.board, filename=dlfname),
-            post['md5'], 'md5')]
+    @property
+    def file_base_url(self) -> str:
+        return "https://i.4cdn.org/{board}/{filename}"
@@ -1,66 +1,25 @@
-from re import match
-from typing import List, Optional
+from typing import Optional
 
-from scrapthechan.parser import Parser
-from scrapthechan.fileinfo import FileInfo
+from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
 
 __all__ = ["LainchanParser"]
 
 
-class LainchanParser(Parser):
-    """JSON parser for lainchan.org image board.
-    JSON structure is identical to 4chan.org's, so this parser is just inherited
-    from 4chan.org's parser and only needed things are redefined.
-    """
-
-    __url_thread_json = "https://lainchan.org/{board}/res/{thread}.json"
-    __url_file_link = "https://lainchan.org/{board}/src/{filename}"
+class LainchanParser(TinyboardLikeParser):
+    """JSON parser for lainchan.org image board."""
 
     def __init__(self, board: str, thread: str,
         skip_posts: Optional[int] = None) -> None:
-        posts = self._get_json(self.__url_thread_json.format(board=board, \
-            thread=thread))['posts']
-        super(LainchanParser, self).__init__(board, thread, posts, skip_posts)
+        super().__init__(board, thread, skip_posts)
 
     @property
     def imageboard(self) -> str:
         return "lainchan.org"
 
     @property
-    def op(self) -> Optional[str]:
-        op = ""
-        if 'sub' in self._op_post:
-            op = f"{self._op_post['sub']}\n"
-        if 'com' in self._op_post:
-            op += self._op_post['com']
-        return op if not op == "" else None
-
-    def _parse_post(self, post) -> List[FileInfo]:
-        if not 'tim' in post: return None
-
-        dlfname = f"{post['tim']}{post['ext']}"
-
-        if "filename" in post:
-            if match(post['filename'], r"^image\.\w+$") is None:
-                filename = dlfname
-            else:
-                filename = f"{post['filename']}{post['ext']}"
-
-        files = []
-        files.append(FileInfo(filename, post['fsize'],
-            self.__url_file_link.format(board=self.board, filename=dlfname),
-            post['md5'], 'md5'))
-
-        if "extra_files" in post:
-            for f in post["extra_files"]:
-                dlfname = f"{f['tim']}{f['ext']}"
-                if "filename" in post:
-                    if match(post['filename'], r"^image\.\w+$") is None:
-                        filename = dlfname
-                    else:
-                        filename = f"{post['filename']}{post['ext']}"
-                dlurl = self.__url_file_link.format(board=self.board, \
-                    filename=dlfname)
-                files.append(FileInfo(filename, f['fsize'], \
-                    dlurl, f['md5'], 'md5'))
-        return files
+    def json_thread_url(self) -> str:
+        return "https://lainchan.org/{board}/res/{thread}.json"
+
+    @property
+    def file_base_url(self) -> str:
+        return "https://lainchan.org/{board}/src/{filename}"
@@ -0,0 +1,51 @@
+from re import match
+from typing import List, Optional
+
+from scrapthechan.parser import Parser
+from scrapthechan.fileinfo import FileInfo
+
+
+__all__ = ["TinyboardLikeParser"]
+
+
+class TinyboardLikeParser(Parser):
+    """Base parser for imageboards that are based on Tinyboard, or have similar
+    JSON API."""
+    def __init__(self, board: str, thread: str,
+        skip_posts: Optional[int] = None) -> None:
+        super().__init__(board, thread, skip_posts)
+
+    def _extract_posts_list(self, lst: List) -> List[dict]:
+        return lst['posts']
+
+    def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
+        if not 'tim' in post: return None
+
+        dlfname = f"{post['tim']}{post['ext']}"
+
+        if "filename" in post:
+            if match(r"^image\.\w+$", post['filename']) is None:
+                filename = dlfname
+            else:
+                filename = f"{post['filename']}{post['ext']}"
+
+        files = []
+
+        files.append(FileInfo(filename, post['fsize'],
+            self.file_base_url.format(board=self.board, filename=dlfname),
+            post['md5'], 'md5'))
+
+        if "extra_files" in post:
+            for f in post["extra_files"]:
+                dlfname = f"{f['tim']}{f['ext']}"
+                if "filename" in post:
+                    if match(r"^image\.\w+$", post['filename']) is None:
+                        filename = dlfname
+                    else:
+                        filename = f"{post['filename']}{post['ext']}"
+                dlurl = self.file_base_url.format(board=self.board, \
+                    filename=dlfname)
+                files.append(FileInfo(filename, f['fsize'], \
+                    dlurl, f['md5'], 'md5'))
+
+        return files
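Both the old 4chan and lainchan parsers called `match()` with the arguments swapped; `TinyboardLikeParser` fixes the order. A small demonstration with made-up filenames:

```python
# Sketch: why the swapped match() arguments were a bug.
from re import match

print(match(r"^image\.\w+$", "image.png"))   # Match object: generic name
print(match(r"^image\.\w+$", "sunset.jpg"))  # None: keep the real filename
# The old, swapped call match("sunset.jpg", r"^image\.\w+$") treated the
# filename as the pattern, so it never matched what was intended.
```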
@@ -1,21 +1,22 @@
-"""Base Scraper implementation."""
+"""Base class for all scrapers that will actually do the job."""
 
 from base64 import b64encode
 from os import remove, stat
 from os.path import exists, join, getsize
 import re
 from typing import List, Callable
-from urllib.request import urlretrieve, URLopener
+from urllib.request import urlretrieve, URLopener, HTTPError, URLError
 import hashlib
+from http.client import HTTPException
 
-from scrapthechan import __version__
+from scrapthechan import USER_AGENT
 from scrapthechan.fileinfo import FileInfo
 
 __all__ = ["Scraper"]
 
 
 class Scraper:
-    """Base scraper implementation.
+    """Base class for all scrapers that will actually do the job.
 
     Arguments:
     save_directory -- a path to a directory where file will be
@@ -29,7 +30,8 @@ class Scraper:
         self._save_directory = save_directory
         self._files = files
         self._url_opener = URLopener()
-        self._url_opener.version = f"ScrapTheChan/{__version__}"
+        self._url_opener.addheaders = [('User-Agent', USER_AGENT)]
+        self._url_opener.version = USER_AGENT
         self._progress_callback = download_progress_callback
 
     def run(self):
@@ -62,35 +64,83 @@ class Scraper:
             newname = f"{newname[:lbracket]}({int(num)+1})"
         return newname
 
-    def _hash_file(self, filename: str, hash_algo: str = "md5",
+    def _hash_file(self, filepath: str, hash_algorithm: str = "md5",
         blocksize: int = 1048576) -> (str, str):
         """Compute hash of a file."""
-        hash_func = hashlib.new(hash_algo)
-        with open(filename, 'rb') as f:
+        if hash_algorithm is None:
+            return None
+        hash_func = hashlib.new(hash_algorithm)
+        with open(filepath, 'rb') as f:
             buf = f.read(blocksize)
             while len(buf) > 0:
                 hash_func.update(buf)
                 buf = f.read(blocksize)
-        return hash_func.hexdigest(), hash_func.digest()
+        return hash_func.hexdigest(), b64encode(hash_func.digest()).decode()
 
-    def _is_file_ok(self, f: FileInfo, filepath: str) -> bool:
+    def _check_file(self, f: FileInfo, filepath: str) -> bool:
         """Check if a file exist and isn't broken."""
         if not exists(filepath):
             return False
         computed_size = getsize(filepath)
-        is_size_match = f.size == computed_size \
-            or f.size == round(computed_size / 1024)
-        hexdig, dig = self._hash_file(filepath, f.hash_algo)
-        is_hash_match = f.hash_value == hexdig \
-            or f.hash_value == b64encode(dig).decode()
-        return is_size_match and is_hash_match
+        if not (f.size == computed_size \
+            or f.size == round(computed_size / 1024)):
+            return False
+        if not f.hash_algorithm is None:
+            hexdig, dig = self._hash_file(filepath, f.hash_algorithm)
+            return f.hash_value == hexdig or f.hash_value == dig
+        return True
 
     def _download_file(self, f: FileInfo):
         """Download a single file."""
+        is_same_filename = False
         filepath = join(self._save_directory, f.name)
-        if self._is_file_ok(f, filepath):
-            return True
+        orig_filepath = filepath
+        if self._check_file(f, filepath):
+            return
         elif exists(filepath):
+            is_same_filename = True
             filepath = join(self._save_directory, \
                 self._same_filename(f.name, self._save_directory))
-        self._url_opener.retrieve(f.dlurl, filepath)
+        try:
+            retries = 3
+            while retries > 0:
+                self._url_opener.retrieve(f.download_url, filepath)
+                if not self._check_file(f, filepath):
+                    remove(filepath)
+                    retries -= 1
+                else:
+                    break
+            if retries == 0:
+                print(f"Cannot retrieve {f.download_url}, {filepath}.")
+                return
+            if is_same_filename:
+                _, f1_dig = self._hash_file(orig_filepath, f.hash_algorithm)
+                _, f2_dig = self._hash_file(filepath, f.hash_algorithm)
+                if f1_dig == f2_dig:
+                    remove(filepath)
+        except FileNotFoundError as e:
+            print("File Not Found", filepath)
+        except HTTPError as e:
+            print("HTTP Error", e.code, e.reason, f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except HTTPException:
+            print("HTTP Exception for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except URLError as e:
+            print("URL Error for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except ConnectionResetError:
+            print("Connection reset for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except ConnectionRefusedError:
+            print("Connection refused for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except ConnectionAbortedError:
+            print("Connection aborted for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
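One detail worth highlighting from `_hash_file`: it now returns the digest in both hex and base64 forms, since (as the comparison in `_check_file` suggests) imageboards report MD5 in either encoding. A self-contained sketch with made-up contents:

```python
# Sketch: the two digest forms _check_file compares hash_value against.
import hashlib
from base64 import b64encode

data = b"made-up file contents"
md5 = hashlib.new("md5")
md5.update(data)
print(md5.hexdigest())                   # hex form of the digest
print(b64encode(md5.digest()).decode())  # base64 form of the same digest
```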
@@ -1,15 +0,0 @@
-"""Implementation of basic sequential one-threaded scraper that downloads
-files one by one."""
-
-from scrapthechan.scraper import Scraper
-
-__all__ = ["BasicScraper"]
-
-
-class BasicScraper(Scraper):
-    def run(self):
-        """Download files one by one."""
-        for i, f in enumerate(self._files, start=1):
-            if not self._progress_callback is None:
-                self._progress_callback(i)
-            self._download_file(f)
@@ -7,13 +7,14 @@ from multiprocessing.pool import ThreadPool
 from scrapthechan.scraper import Scraper
 from scrapthechan.fileinfo import FileInfo
 
 
 __all__ = ["ThreadedScraper"]
 
 
 class ThreadedScraper(Scraper):
     def __init__(self, save_directory: str, files: List[FileInfo],
         download_progress_callback: Callable[[int], None] = None) -> None:
-        super(ThreadedScraper, self).__init__(save_directory, files,
-            download_progress_callback)
+        super().__init__(save_directory, files, download_progress_callback)
         self._files_downloaded = 0
         self._files_downloaded_mutex = Lock()
 
@@ -24,8 +25,8 @@ class ThreadedScraper(Scraper):
         pool.join()
 
     def _thread_run(self, f: FileInfo):
-        with self._files_downloaded_mutex:
-            self._files_downloaded += 1
-        if not self._progress_callback is None:
+        if not self._progress_callback is None:
+            with self._files_downloaded_mutex:
+                self._files_downloaded += 1
             self._progress_callback(self._files_downloaded)
         self._download_file(f)
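The concurrency pattern above (a `ThreadPool` fanning work out, with a `Lock`-guarded counter so the progress callback sees a consistent count) is small enough to show standalone. This sketch uses made-up items in place of `FileInfo` objects:

```python
# Self-contained sketch of the ThreadPool + Lock progress pattern.
from multiprocessing.pool import ThreadPool
from threading import Lock

items = list(range(20))   # stand-ins for FileInfo objects
done = 0
done_mutex = Lock()

def work(item):
    global done
    with done_mutex:      # serialize counter updates across threads
        done += 1
        print(f"{done}/{len(items)}", end="\r")
    # ... the actual download would happen here ...

with ThreadPool() as pool:
    pool.map(work, items)
print()
```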
22 setup.cfg

@@ -1,31 +1,33 @@
 [metadata]
 name = scrapthechan
 version = attr: scrapthechan.__version__
-description =
-    Scrap the files posted in a thread on an imageboard. Currently supports
-    4chan.org, lainchan.org and 2ch.hk.
+description = Scrap the files from the imageboards.
 long_description = file: README.md
 long_description_content_type = text/markdown
 author = Alexander "Arav" Andreev
 author_email = me@arav.top
-url = https://arav.top
+url = https://git.arav.top/Arav/ScrapTheChan
 keywords =
     scraper
     imageboard
-    4chan
-    2ch
-    lainchan
+    4chan.org
+    2ch.hk
+    lainchan.org
+    8kun.top
 license = MIT
 license_file = COPYING
 classifiers =
-    Development Status :: 2 - Pre-Alpha
+    Development Status :: 3 - Alpha
     Environment :: Console
     Intended Audience :: End Users/Desktop
-    License :: Other/Proprietary License
+    License :: OSI Approved :: MIT License
     Natural Language :: English
     Operating System :: OS Independent
     Programming Language :: Python :: 3.7
-    Programming Language :: Python :: 3.8
+    Topic :: Communications :: BBS
+    Topic :: Internet :: WWW/HTTP
+    Topic :: Internet :: WWW/HTTP :: Dynamic Content :: Message Boards
+    Topic :: Text Processing
     Topic :: Utilities
 
 [options]