simplehtmldom 1.9 introduced new functions to recursively remove
nodes from the DOM. This allows removing elements without the need
to re-load the document by using $html->load($html->save()), which
is very inefficient.
Find more information about remove() at
https://simplehtmldom.sourceforge.io/docs/1.9/api/simple_html_dom_node/remove/
This commit adds filters to remove embedded videos and view counts from
all posts. Preview images for videos are embedded separately and are
therefore not removed.
Hidden elements are used for error conditions and generally made
visible using JavaScript. Since RSS-Bridge doesn't support JS, these
error messages are shown in the final feed. For example:
"It looks like you may be having problems playing this video. If so,
please try restarting your browser."
This commit removes all hidden elements to prevent error messages being
added to the feed.
- "It looks like you may be having problems playing this video. If so,
please try restarting your browser."
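The removal of hidden elements described above can be sketched as follows. This is a minimal illustration using PHP's built-in DOM extension rather than simplehtmldom's remove(). The function name removeHiddenElements and the class name 'hidden_elem' are assumptions made for the example, not taken from the bridge.

```php
<?php
// Sketch only: uses DOMDocument/DOMXPath instead of simplehtmldom.
// The class name 'hidden_elem' is an assumed marker for hidden elements.
function removeHiddenElements(string $html, string $hiddenClass = 'hidden_elem'): string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so that it has exactly one root element.
    $doc->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Match the class as a whole word inside the class attribute.
    $xpath = new DOMXPath($doc);
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $hiddenClass
    );

    // Remove every matching node together with its children.
    foreach(iterator_to_array($xpath->query($query)) as $node) {
        if($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize the children of the wrapper, dropping the wrapper itself.
    $result = '';
    foreach($doc->documentElement->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }

    return $result;
}
```

Removing the nodes in place avoids serializing and re-parsing the whole document after each change.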
FB includes origin information (e.g. "YOUTUBE.COM") as well as
descriptions with embedded media (images and video).
These details are currently being removed by the bridge.
This commit changes the implementation to only remove origin information
and keep the media description in place. The media description consists
of two elements - title and description. The title provided by FB is
included in an anchor, which gets replaced by a paragraph with the
same contents to improve readability.
References #912
This commit collects the original contents from a different
tag to prevent this issue. The root cause is unknown but closely
related to the regex.
References #877
The function 'defaultLinkTo' applied to the source HTML breaks
regex matches later in the bridge. We need to apply the function
right before adding the contents to the item for the bridge to work
properly.
References #856
This commit adds a new optional parameter 'limit' which can be used
to limit the number of items returned by this bridge (e.g. '&limit=10').
As requested in #669
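A minimal sketch of how such an optional 'limit' could be applied to the collected items. The function name applyLimit and the treatment of missing or non-positive values as "no limit" are assumptions for illustration only.

```php
<?php
// Sketch only: applyLimit is a hypothetical helper, not part of the bridge.
function applyLimit(array $items, ?int $limit): array
{
    // Assumption: a missing or non-positive limit means "return everything".
    if($limit === null || $limit < 1) {
        return $items;
    }

    // Keep the first $limit items in their original order.
    return array_slice($items, 0, $limit);
}
```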
The URI "https://facebook.com/username?_fb_noscript=1" returns two
posts per user. Some profiles, however, are very active, causing the
bridge to miss items if more than two posts are sent within the cache
duration (5 minutes).
The alternative suggested in #669 is to use a different URI:
"https://facebook.com/pg/username/posts?_fb_noscript=1"
While the contents of this URI essentially look the same when viewed
in a browser, it actually returns more than 10 posts depending on the
profile.
References #669
* Debug mode improvements
- Improve debug warning message
- Restore error reporting in debug mode
- Fix 'notice' messages for unset fields
* Add parsing utility functions
html.php
- extractFromDelimiters
- stripWithDelimiters
- stripRecursiveHTMLSection
- markdownToHtml (partial)
bridges
- remove now-duplicate functions
- call functions from html.php instead
* [Anidex] New bridge
Anime torrent tracker
* [Anime-Ultime] Restore thumbnail
* [CNET] Recreate bridge
Full rewrite as the previous one was broken
* [Dilbert] Minor URI fix
Use new self::URI property
* [EstCeQuonMetEnProd] Fix content extraction
Bridge was broken
* [Facebook] Fix "SpSonsSoriSsés" label
... which was taking space in item title
* [Futura-Sciences] Use HTTPS, More cleanup
Use HTTPS as FS now offers HTTPS
Clean additional useless HTML elements
* [GBATemp] Multiple fixes
- Fix categories: missing "break" statements
- Restore thumbnail as enclosure
- Fix date extraction
- Fix user blog post extraction
- Use getSimpleHTMLDOMCached
* [JapanExpo] Fix bridge, HTTPS, thumbnails
- Fix getSimpleHTMLDOMCached call
- Upgrade to HTTPS as JE now offers HTTPS
- Restore thumbnails as enclosures
* [LeMondeInformatique] Fix bridge, HTTPS
- Upgrade to HTTPS as LMI now offers HTTPS
- Restore thumbnails using small images
- Fix content extraction
- Fix text encoding issue
* [Nextgov] Fix content extraction
- Restore thumbnail and use small image
- Field extraction fixes
* [NextInpact] Add categories and filtering by type
- Offer all RSS feeds
- Allow filtering by article type
- Implement extraction for brief articles
- Remove article limit, many brief articles are published all at once
* [NyaaTorrents] New bridge
Anime torrent tracker
* [Releases3DS] Cache content, restore thumbnail
- Use getSimpleHTMLDOMCached
- Restore thumbnail as enclosure
* [TheHackerNews] Fix bridge
- Fix content extraction including article body
- Restore thumbnail as enclosure
* [WeLiveSecurity] HTTPS, Fix content extraction
- Upgrade to HTTPS as WLS now offers HTTPS
- Fix content extraction including article body
* [WordPress] Reduce timeout, more content selectors
- Reduce timeout to use default one (1h)
- Add new content selector (articleBody)
- Find thumbnail and set as enclosure
- Fix <script> cleanup
* [YGGTorrent] Increase limit, use cache
- Increase item limit as uploads are very frequent
- Use getSimpleHTMLDOMCached
* [ZDNet] Rewrite with FeedExpander
- Upgrade to HTTPS as ZD now offers HTTPS
- Use FeedExpander for secondary fields
- Fix content extraction for article body
* [Main] Handle MIME type for enclosures
Many feed readers will ignore enclosures (e.g. thumbnails) with no MIME type. This commit adds automatic MIME type detection based on file extension (which may be inaccurate but is the only way without fetching the content).
One can force enclosure type using #.ext anchor (hacky, needs improving)
* [FeedExpander] Improve field extraction
- Add support for passing enclosures
- Improve author and uri extraction
- Fix 'notice' PHP error messages
* [Pull] Coding style fixes for #802
* [Pull] Implementing changes for #802
- Fix coding style issues with str append
- Remove useless CACHE_TIMEOUT
- Use count() instead of $limit
- Use defaultLinkTo() + handle strings
- Use http_build_query()
- Fix missing </em>
- Remove error_reporting(0)
- warning CSS (@LogMANOriginal)
- Fix typo in FeedExpander comment
* [Main] More documentation for markdownToHtml
See #802 for more details
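The extension-based MIME type detection described in the "[Main] Handle MIME type for enclosures" entry above can be sketched as follows. The function name guessEnclosureType, the small extension table and the 'application/octet-stream' fallback are assumptions for this example; the actual mapping used by the project is not shown in this log.

```php
<?php
// Sketch only: guessEnclosureType and its extension table are hypothetical.
function guessEnclosureType(string $url): string
{
    $types = array(
        'jpg' => 'image/jpeg',
        'jpeg' => 'image/jpeg',
        'png' => 'image/png',
        'gif' => 'image/gif',
        'mp3' => 'audio/mpeg',
        'mp4' => 'video/mp4',
    );

    $fragment = parse_url($url, PHP_URL_FRAGMENT);
    $path = parse_url($url, PHP_URL_PATH);

    if(is_string($fragment) && strpos($fragment, '.') === 0) {
        // A '#.ext' anchor forces the type regardless of the path.
        $extension = substr($fragment, 1);
    } else {
        // Otherwise fall back to the extension of the URL path.
        $extension = pathinfo((string)$path, PATHINFO_EXTENSION);
    }

    $extension = strtolower($extension);

    // Assumed fallback for unknown or missing extensions.
    return $types[$extension] ?? 'application/octet-stream';
}
```

Because nothing is fetched, the result is only as reliable as the URL itself, which is why the '#.ext' override is checked first.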
The previous context is now labeled 'User', while the new context is
labeled 'Group'. The existing code was not changed; instead, new group*
functions were implemented to handle groups.
The general principle of capturing groups is the same as done for users
with adjustments to account for different HTML structures.
Captcha responses are currently not supported for groups! There doesn't
seem to be a way to trigger them consistently, which makes it hard to
handle them properly.
Features of the group context:
- The feed title is based on the group name
- The group URI used for capturing is returned for the feed URI
- Author names and timestamps are reproduced from the source
- Post titles are reproduced from the source if they exist, otherwise
the title is built manually from the author name and the content
- Original contents are included with the feed
- All images are attached as enclosures as well
Closes #
Allows users to paste Facebook links as user name. The link must contain
the correct host (www.facebook.com) and a valid path (/user-name/...).
The first part of the path is used for the user name. Errors are returned
in case something went wrong.
References #706
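The link handling described above (check the host, then take the first path segment as user name) could look roughly like this. The function name extractUserName and the use of InvalidArgumentException for error reporting are assumptions for this sketch.

```php
<?php
// Sketch only: extractUserName is a hypothetical helper.
function extractUserName(string $input): string
{
    // Anything that does not look like a link is treated as a plain user name.
    if(!preg_match('#^https?://#i', $input)) {
        return $input;
    }

    $parts = parse_url($input);

    if($parts === false
    || !isset($parts['host'])
    || strtolower($parts['host']) !== 'www.facebook.com') {
        throw new InvalidArgumentException('Link must point to www.facebook.com');
    }

    // Split the path and drop empty segments caused by leading/trailing slashes.
    $segments = array_values(array_filter(
        explode('/', $parts['path'] ?? ''),
        function($segment) {
            return $segment !== '';
        }
    ));

    if(count($segments) === 0) {
        throw new InvalidArgumentException('Link does not contain a user name');
    }

    // The first part of the path is the user name.
    return $segments[0];
}
```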
Reviews are provided the same way as summary posts and are therefore returned
as a separate feed item for each review. This commit adds a new option
'&skip_reviews=on' to skip reviews entirely.
References #706
Requesting a username with a leading slash would cause error 500
because the requested URI would contain two slashes in a row.
For example username "/test" would result in:
https://facebook.com//test
References #628
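One way to avoid the doubled slash described above is to strip leading slashes before building the URI. This is a sketch under that assumption; the function name buildProfileUri is hypothetical and the log does not state how the fix was actually implemented.

```php
<?php
// Sketch only: buildProfileUri is a hypothetical helper.
function buildProfileUri(string $username): string
{
    // Strip any leading slashes so the host and path are joined by exactly one.
    return 'https://facebook.com/' . ltrim($username, '/');
}
```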
All formats except HTML return &amp; instead of & in URLs causing
all links with parameters (...&id=...) to break.
Facebook does not return valid HTML URIs but instead provides them
with all special characters encoded (like using htmlspecialchars).
This seems to be related to the page being built almost entirely of
script blocks.
This commit adds htmlspecialchars_decode() to URI and content to
reverse the encoding.
References #550
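The effect of htmlspecialchars_decode() on an encoded URI, as described above, in a short example. The URI is made up for illustration.

```php
<?php
// Example URI as it might appear in the source: '&' is encoded as '&amp;'.
$encoded = 'https://example.com/watch?v=1&amp;id=2';

// Reverse the encoding so that query parameters are separated by '&' again.
$decoded = htmlspecialchars_decode($encoded);
```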
- Do not add spaces after opening or before closing parentheses
// Wrong
if( !is_null($var) ) {
...
}
// Right
if(!is_null($var)) {
...
}
- Add space after closing parenthesis
// Wrong
if(true){
...
}
// Right
if(true) {
...
}
- Start the body on a new line
- Close the body on a new line
// Wrong
if(true) { ... }
// Right
if(true) {
...
}
Note: Spaces after keywords are not detected:
// Wrong (not detected)
// -> space after 'if' and missing space after 'else'
if (true) {
...
} else{
...
}
// Right
if(true) {
...
} else {
...
}
This replaces the 'novideo' parameter with 'media_type' in order
to filter for specific content types. Currently supported:
- 'all': Returns all posts (default)
- 'video': Returns only posts including videos
- 'novideo': Returns only posts that don't include videos
References #553
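The three 'media_type' values listed above map naturally onto a small predicate. This sketch assumes each post has already been classified as containing a video or not; the function name matchesMediaType and the 'has_video' key in the usage note are hypothetical.

```php
<?php
// Sketch only: matchesMediaType is a hypothetical helper.
function matchesMediaType(bool $hasVideo, string $mediaType = 'all'): bool
{
    switch($mediaType) {
        case 'video':
            // Only posts including videos
            return $hasVideo;
        case 'novideo':
            // Only posts that don't include videos
            return !$hasVideo;
        case 'all':
        default:
            // Assumption: unknown values behave like the default 'all'
            return true;
    }
}
```

With posts stored as arrays, the predicate can be used with array_filter(), for example array_filter($posts, function($post) { return matchesMediaType($post['has_video'], 'novideo'); }).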
This adds a new option 'novideo' that can be set to 'on' or 'off'
in order to skip posts that include Facebook videos (does not work
for linked videos like YouTube). This option is 'off' by default.
References #533
If no accepted languages are specified, Facebook will guess your
language. This guess can go horribly wrong if your server does not
provide origin information.
This adds a context header with language information when retrieving
page contents. The accepted languages are read from the list of
accepted languages specified by the web browser of the requester.
References #530
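Forwarding the requester's accepted languages, as described above, can be sketched with a stream context. The function name buildLanguageContext is hypothetical, and reading the value from $_SERVER['HTTP_ACCEPT_LANGUAGE'] is an assumption about where the browser's list is available.

```php
<?php
// Sketch only: buildLanguageContext is a hypothetical helper.
function buildLanguageContext(?string $acceptLanguage)
{
    $options = array('http' => array('method' => 'GET'));

    if($acceptLanguage !== null && $acceptLanguage !== '') {
        // Drop line breaks so the value cannot inject additional headers.
        $value = str_replace(array("\r", "\n"), '', $acceptLanguage);
        $options['http']['header'] = 'Accept-Language: ' . $value . "\r\n";
    }

    return stream_context_create($options);
}

// Pass the requester's language list on to the remote server, if present.
$context = buildLanguageContext($_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? null);
```

The resulting context can be passed as the third argument to file_get_contents() when retrieving the page.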
Previously summary posts were ignored which resulted in the last
two posts not showing up in the feed (the latest two are shown in
the summary post).
Now summary posts are treated like regular posts, returning them
as part of the regular feed.
References #502, #505
- returnError, returnServerError, returnClientError, debugMessage are
moved to lib/error.php
- getContents, getSimpleHTMLDOM, getSimpleHTMLDOMCached are moved to
lib/contents.php
Signed-off-by: Pierre Mazière <pierre.maziere@gmx.com>