rss-bridge/bridges/GQMagazineBridge.php

<?php

/**
 * An extension of the previous SexactuBridge to cover the whole GQMagazine.
 * This one takes a configurable page (for example sexe/news or journaliste/maia-mazaurette),
 * reads all the articles visible on that page, and makes a feed out of them.
 * @author nicolas-delsaux
 *
 */
class GQMagazineBridge extends BridgeAbstract
{
    const MAINTAINER = 'Riduidel';

    const NAME = 'GQMagazine';

    // URI is only the default site: the 'domain' parameter can address the whole GQ galaxy
    const URI = 'https://www.gqmagazine.fr';

    const CACHE_TIMEOUT = 7200; // 2h
    const DESCRIPTION = 'GQMagazine section extractor bridge. This bridge allows you to get only a specific section.';

    const DEFAULT_DOMAIN = 'www.gqmagazine.fr';

    const PARAMETERS = [ [
        'domain' => [
            'name' => 'Domain to use',
            'required' => true,
            'defaultValue' => self::DEFAULT_DOMAIN
        ],
        'page' => [
            'name' => 'Initial page to load',
            'required' => true,
            'exampleValue' => 'sexe/news'
        ],
        'limit' => self::LIMIT,
    ]];

    const REPLACED_ATTRIBUTES = [
        'href' => 'href',
        'src' => 'src',
        'data-original' => 'src'
    ];

    const POSSIBLE_TITLES = [
        'h2',
        'h3'
    ];

    private function getDomain()
    {
        $domain = $this->getInput('domain');
        if (empty($domain)) {
            $domain = self::DEFAULT_DOMAIN;
        }
        if (strpos($domain, '://') === false) {
            $domain = 'https://' . $domain;
        }
        return $domain;
    }

    public function getURI()
    {
        return $this->getDomain() . '/' . $this->getInput('page');
    }

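    /**
     * Finds the title of the article a link points to
     * @param $link A simplehtmldom anchor element
     * @return The title text, or null when no title tag is found
     */
    private function findTitleOf($link)
    {
        foreach (self::POSSIBLE_TITLES as $tag) {
            $title = $link->parent()->find($tag, 0);
            if ($title !== null && $title->plaintext !== null) {
                return $title->plaintext;
            }
        }
        return null;
    }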

    public function collectData()
    {
        $html = getSimpleHTMLDOM($this->getURI());

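        // GQ does not expose stable CSS classes to scrape, so discover the content the hard way
        $main = $html->find('main', 0);
        if ($main === null) {
            // Without a <main> element there is nothing to iterate over
            throw new \Exception('Unable to find the <main> element on ' . $this->getURI());
        }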
        $limit = $this->getInput('limit') ?? 10;
        foreach ($main->find('a') as $link) {
            if (count($this->items) >= $limit) {
                break;
            }

            $uri = $link->href;
            $date = $link->parent()->find('time', 0);

            $item = [];
            $author = $link->parent()->find('span[itemprop=name]', 0);
            if ($author !== null) {
                $item['author'] = $author->plaintext;
                $item['title'] = $this->findTitleOf($link);
                switch (substr($uri, 0, 1)) {
                    case 'h': // absolute uri
                        $item['uri'] = $uri;
                        break;
                    case '/': // domain relative uri
                        $item['uri'] = $this->getDomain() . $uri;
                        break;
                    default:
                        $item['uri'] = $this->getDomain() . '/' . $uri;
                }
                $article = $this->loadFullArticle($item['uri']);
                if ($article) {
                    $item['content'] = $this->replaceUriInHtmlElement($article);
                } else {
                    $item['content'] = "<strong>Article body couldn't be loaded</strong>. It must be a bug!";
                }
                // The <time> element may be missing; skip the timestamp rather than fail on null
                if ($date !== null) {
                    $item['timestamp'] = strtotime($date->datetime);
                }
                $this->items[] = $item;
            }
        }
    }

    /**
     * Loads the full article and returns the contents
     * @param $uri The article URI
     * @return The article content
     */
    private function loadFullArticle($uri)
    {
        $html = getSimpleHTMLDOMCached($uri);
        return $html->find('article', 0);
    }

    /**
     * Replaces all relative URIs with absolute ones
     * @param $element A simplehtmldom element
     * @return The $element->innertext with all URIs replaced
     */
    private function replaceUriInHtmlElement($element)
    {
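        $returned = $element->innertext;
        // Use the configured domain rather than self::URI, which is only the default site
        $domain = $this->getDomain();
        foreach (self::REPLACED_ATTRIBUTES as $initial => $final) {
            $returned = str_replace($initial . '="/', $final . '="' . $domain . '/', $returned);
        }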
        return $returned;
    }
}